PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming
Alexander K. Lew, Monica Agrawal, David Sontag, Vikash K. Mansinghka
Massachusetts Institute of Technology, Cambridge, MA 02139 [email protected], [email protected], [email protected], [email protected]
Abstract
Data cleaning is naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered, corrupted, and joined to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a unified generative modeling architecture for cleaning and normalizing dirty data in diverse domains. Given an unclean dataset and a probabilistic program encoding relevant domain knowledge, PClean learns a structured representation of the data as a relational database of interrelated objects, and uses this latent structure to impute missing values, identify duplicates, detect errors, and propose corrections in the original data table. PClean makes three modeling and inference contributions: (i) a domain-general non-parametric generative model of relational data, for inferring latent objects and their network of latent connections; (ii) a domain-specific probabilistic programming language, for encoding domain knowledge specific to each dataset being cleaned; and (iii) a domain-general inference engine that adapts to each PClean program by constructing data-driven proposals used in sequential Monte Carlo and particle Gibbs. We show empirically that short (< 50-line) PClean programs deliver higher accuracy than state-of-the-art data cleaning systems based on machine learning and weighted logic; that PClean's inference algorithm is faster than generic particle Gibbs inference for probabilistic programs; and that PClean scales to large real-world datasets with millions of rows.
Real-world data is often noisy and incomplete, littered with NULL values, typos, duplicates, and other inconsistencies. This can make it difficult to integrate multiple sources of data, or to extract useful information even from a single dataset. Cleaning dirty data—e.g. detecting and correcting errors, imputing missing values, or linking duplicate records—is thus an important first step in most data analysis workflows. Unfortunately, data cleaning has proven remarkably resistant to reliable automation, due to the heterogeneity of error patterns in real-world applications [1].

Researchers have long recognized probabilistic generative modeling as an appealing approach to data cleaning problems [5, 6, 17, 21, 57]. A generative model for data cleaning specifies a prior probability distribution over latent clean data, together with a likelihood model describing how the clean data is noisily observed. Bayesian inference algorithms can then be applied to infer a posterior distribution over latent clean datasets from dirty observations. In theory, generative models can exploit modularly encoded domain knowledge about particular datasets or error types, and weigh heterogeneous factors to detect errors and propose fixes in dirty data. But existing generative modeling techniques have not seen widespread adoption, compared to discriminative and weighted-logic approaches [4, 31, 51, 52, 40], some of which power industry data cleaning solutions. This is the case even though discriminative methods have significant drawbacks: they make it difficult to
Preprint. Under review.

incorporate domain knowledge (and thus achieve acceptable accuracy), quantify uncertainty about proposed corrections, or, in some machine-learning-based approaches, audit the process by which cleaning decisions are made. Why isn't the more flexible generative modeling approach more widely used? Several key challenges for generative data cleaning remain to be solved:

Challenge 1. It is not feasible to create bespoke generative models and inference algorithms for each new dataset.
Over decades, researchers have built special-purpose generative models and inference algorithms for narrow domains or particular types of errors [19, 30, 36, 43, 44, 53, 55, 57], leveraging carefully encoded domain knowledge to deliver high accuracy. Unfortunately, designing such models, and deriving and implementing effective inference algorithms for them, is a time-consuming task that requires significant expertise. Probabilistic programming tools [2, 3, 9, 10, 11, 12, 33, 54] aim to ease this burden by providing languages for concisely specifying probabilistic models and tools for automating aspects of inference. But today's automatic inference technology is not sufficient for automatic data cleaning: it typically relies either on gradient-based sampling and optimization (e.g., using Hamiltonian Monte Carlo or ADVI), which is not directly applicable in models with many discrete latent variables, or on domain-general stochastic search algorithms (e.g., single-site Metropolis-Hastings) that can be prohibitively slow to converge in complex models.
Challenge 2. One-size-fits-all models cannot exploit dataset-specific domain knowledge, and so tend to underfit on real-world data.
An alternative to creating bespoke models per dataset is to design a single, one-size-fits-all model that encodes assumptions about data cleaning in general but not about specific datasets [5, 17, 29]. However, this approach strips generative models of one of their key advantages: the ability to incorporate arbitrary knowledge about a problem, leading to improved accuracy and results that are interpretable in the context of a domain. One-size-fits-all models instead make simplifying assumptions about the data (e.g., independence of data entries or sometimes columns; limited error types; no continuous variables), limiting their applicability to or performance on real-world datasets. This challenge arises also for approaches that attempt to learn a model from data [26, 41, 49]: in this case, the simplifying assumptions are baked into the class of models over which the learning algorithm searches.
Challenge 3. Systems support for efficient Bayesian inference is lacking.
Bayesian inference algorithms, and especially Monte Carlo algorithms for posterior sampling, have a reputation for being slow. There is no fundamental reason why this must be the case: for particular models and in particular data regimes, it is often possible to develop efficient algorithms that yield accurate results quickly in practice (even if existing theory cannot accurately characterize the regimes in which they work well). But little tooling exists for deriving these fast algorithms, or for implementing them using efficient data structures and computation strategies (though see [3, 18, 28, 50] for some work in this direction). Compare this to the state of the art in deep learning, in which specialized hardware, software libraries, and compilers help to ensure that compute-intensive training algorithms can be run in a reasonable amount of time, with little or no performance engineering by the user.
Our approach.
In this work, we present PClean, a domain-specific probabilistic programming language for Bayesian data cleaning. PClean's architecture is based on three modeling and inference contributions, each addressing limitations of prior work in probabilistic data cleaning and probabilistic programming:

1. A domain-general generative model for cleaning data, customizable via dataset-specific priors. This model posits a non-parametric, relational generative process in which latent database tables are generated, joined, and corrupted to yield dirty, denormalized data.

2. A domain-general particle MCMC inference algorithm combining sequential Monte Carlo initialization with particle Gibbs rejuvenation. This algorithm is user-configurable via concise dataset-specific inference hints that inform how the problem is decomposed and what data-driven proposals are used.

3. A domain-specific probabilistic programming language for augmenting the domain-general model with dataset-specific priors about latent relational structure and likely errors, and dataset-specific inference hints to improve inference performance.

This paper also contributes empirical demonstrations that short (< 50-line) probabilistic programs deliver higher accuracy than state-of-the-art machine learning and weighted logic baselines, with accuracy that improves as more domain knowledge is incorporated. Finally, we show that PClean can scale to handle large, real-world datasets, by applying it to detect and correct errors in Medicare's 2.2-million-row database of health care professionals.

To the best of our knowledge, this paper is the first to show that a generative Bayesian approach (jointly modeling identity uncertainty, record linkage, and data errors) can be deployed across a broad range of real-world problems with modest problem-specific effort, including instances with millions of rows, to yield higher accuracy and comparable performance relative to strong weighted logic baselines. Our results show that it is feasible and useful to integrate modeling and inference insights from the Bayesian non-parametrics, relational learning, data cleaning, and Monte Carlo inference literatures into a single domain-specific probabilistic programming language.
Much research on automated data cleaning leverages probabilistic models to weigh noisy signals, detect errors, and propose fixes:
Domain-specific probabilistic models.
Models of specific domains or classes of errors [19, 17, 30, 43, 44, 55, 57] can leverage domain knowledge for more accurate repairs. But this approach requires deriving and implementing complex problem-specific models and inference algorithms, and so is often prohibitively time-consuming to apply. PClean allows users to customize a domain-general model to a particular dataset using short probabilistic programs, and automates much of the inference.
Non-parametric Bayesian approaches.
Non-parametric Bayesian techniques [46] have been used to learn hierarchical mixture models for imputation and outlier detection, as well as to identify duplicate records to be merged for entity resolution [26, 29, 41, 42]. PClean's non-parametric model extends this work in three ways: (i) it is relational (i.e., the non-parametric priors are over networks of interrelated objects); (ii) it can be customized to specific datasets via probabilistic programs; and (iii) it jointly models identity uncertainty, missing values, and data errors, rather than focusing on only one data cleaning task (e.g., models for only entity resolution).
Domain-general generative data modeling and knowledge extraction.
Open-universe generative models have been proposed for deduplication and record linkage tasks [36], due to their ability to express intricate patterns of uncertainty over linked collections of objects. Probabilistic frameworks have also been proposed for modeling clean hidden data and dirty observed data [6], for modeling tabular data [26, 11, 49], and for automatically extracting knowledge bases [53]. Unfortunately, these conceptually appealing approaches can be difficult to apply to clean each new dirty structured dataset. For example, for open-universe models, general-purpose probabilistic programming languages [3, 9, 10, 33, 54] make it possible to specify them concisely, but inference can be complex and hard to scale [34]. Although PClean's non-parametric relational models are arguably less expressive than open-universe modeling languages like BLOG [33], PClean models admit a relatively robust inference algorithm that does not require much tuning for use with new priors and datasets (Section 4).
Weighted logic and undirected modeling approaches.
Multiple large-scale data cleaning systems use undirected models or weighted logic. For example, [40] compiles user-specified constraints into factor graphs with learned weights, and [4] compiles rules into variable-weighted MAX-SAT problems. Several approaches jointly model uncertainty over object identity, references, and attributes, often via discriminatively trained undirected models [31, 51, 52]. We benchmark against state-of-the-art weighted logic approaches in Section 5.
PClean is based on a domain-general generative model for dirty data that can be specialized to particular datasets via domain-specific probabilistic programs. The generative model is relational: it posits a latent network of interrelated objects underlying the observed data (e.g. landlords and neighborhoods in a dataset of apartment listings), organized into a set of latent classes. The model is also non-parametric: the number of latent objects in each class is unbounded. Observed data records are modeled as depending on attributes of one or more of these latent objects, via a noisy channel. Domain knowledge is incorporated via generative probabilistic programs that define the latent relational domain, probable connections, attribute-level priors, and likely errors.

Dataset-Specific PClean Program (Figure 1, left). A program declares one or more classes; the class dependency graph must be acyclic. Each class declares parameters shared among all objects of the class, reference slots that refer to other classes, and attributes sampled from primitive distributions.

    latent class Neighborhood
        name ~ string_prior(ξ=known_names)
        avgrent ~ N(1500, 1000)
    end

    latent class Complex
        loc ~ Neighborhood
        base ~ N(loc.avgrent, 50)
    end

    latent class Landlord
        parameter σ ~ unif(1, 500)
        parameter p ~ beta(1, 1)
        individual ~ flip(p)
        δ ~ N(0, individual ? σ : 150)
    end

    observation class Obs
        parameter σ ~ invgamma(20, 2000)
        subproblem begin
            complex ~ Complex
            desc ~ typos("Apt in $(complex.loc.name)")
        end
        subproblem begin
            landlord ~ Landlord
            rent ~ N(landlord.δ + complex.base, sqrt(σ))
        end
    end

Domain-General Generative Model (Figure 1, right). Given a probabilistic program declaring latent classes 𝒞, we consider the following generative process for tabular data:

    GenerateDataset():
        for latent class C ∈ TopologicalSort(𝒞) do
            θ_C ∼ p_{θ_C}()
            G_C ∼ GenerateCollection(C, θ_C, {G_{C′}}_{C′ ∈ Pa(C)})
        θ_Obs ∼ p_{θ_Obs}()
        for i ∈ {1, ..., n} do
            r_i ∼ GenerateObject(Obs, θ_Obs, {G_C}_{C ∈ Pa(Obs)})

    GenerateCollection(C, θ_C, {G_{C′}}_{C′ ∈ Pa(C)}):
        s_C ∼ Gamma(1, 1)
        d_C ∼ Beta(1, 1)
        G_C ∼ PY(s_C, d_C, GenerateObject(C, θ_C, {G_{C′}}_{C′ ∈ Pa(C)}))

    GenerateObject(C, θ_C, {G_{C′}}_{C′ ∈ Pa(C)}):
        for reference slot Y ∈ R(C) do
            r.Y ∼ G_{T(C.Y)}
        for attribute X ∈ A(C) do
            r.X ∼ φ_{C.X}(θ_C, {r.τ}_{τ ∈ Pa(C.X)})

Figure 1: PClean's generative model for tabular data (right) is parameterized by a probabilistic program encoding dataset-specific domain knowledge (left). The program declares a set of latent classes 𝒞 = (C_1, ..., C_k) representing the kinds of objects reflected in the dataset. Each class C is equipped with parameters θ_C, attributes A(C), and reference slots R(C). The generative process proceeds along the dependency graph induced by the reference slots, generating objects of each class only once it has finished processing the class's parents. For each latent class C, class-wide parameters θ_C are generated first, followed by an infinite weighted collection of objects G_C, sampled from a Pitman-Yor Process. Finally, the n observed data entries are generated from the Obs class.
The precise generative process our model describes depends on the details of a PClean probabilistic program (Figure 1, left), which defines a set of latent classes 𝒞 = (C_1, ..., C_k) representing the types of object (e.g. Neighborhood, Landlord) that populate the latent object network, as well as an observation class Obs modeling the records of the observed dataset (e.g. apartment listings). The declaration of a PClean class C may include three kinds of statement: reference statements (Y ∼ C′), which define a foreign key or reference slot C.Y that connects objects of class C to objects of a target class T(C.Y) = C′; attribute statements (X ∼ φ_{C.X}(...)), which define a new field or attribute C.X that objects of the class possess, and declare an assumption about the probability distribution φ_{C.X} that the attribute typically follows; and parameter statements (parameter θ_C ∼ p_{θ_C}(...)), which introduce global parameters shared among all objects of the class C, to be learned from the noisy dataset. As in probabilistic relational models [8], the distribution φ_{C.X} of an attribute may depend on the values of a parent set Pa(C.X) of attributes, potentially accessed via reference slots. For example, in Figure 1, the Obs class has a complex reference slot with target class Complex, and a desc attribute whose value depends on complex.loc.name. Programs may also invoke arbitrary deterministic computations, e.g. string manipulation or arithmetic. Figure 2 shows how PClean's modeling language can be used to capture diverse data and error patterns.
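To make the class-declaration semantics concrete, the sketch below gives a hypothetical in-memory representation of a PClean-style schema in Python. The names (ModelClass, topological_order) are our own illustrative choices, not PClean's API; the class and slot names mirror Figure 1. It checks the acyclicity requirement on the reference-slot dependency graph via a standard topological sort.

```python
from collections import defaultdict

class ModelClass:
    """Illustrative stand-in for a PClean class declaration (not PClean's API)."""
    def __init__(self, name, attributes=(), reference_slots=()):
        self.name = name
        self.attributes = list(attributes)            # e.g. ["name", "avgrent"]
        self.reference_slots = dict(reference_slots)  # slot name -> target class name

def topological_order(classes):
    """Return class names parents-first; raise if the dependency graph is cyclic."""
    indegree = {c.name: 0 for c in classes}
    children = defaultdict(list)
    for c in classes:
        for target in c.reference_slots.values():
            indegree[c.name] += 1          # c depends on each reference target
            children[target].append(c.name)
    frontier = [n for n, d in indegree.items() if d == 0]
    order = []
    while frontier:
        n = frontier.pop()
        order.append(n)
        for child in children[n]:
            indegree[child] -= 1
            if indegree[child] == 0:
                frontier.append(child)
    if len(order) != len(classes):
        raise ValueError("class dependency graph must be acyclic")
    return order

# Schema mirroring Figure 1's program
schema = [
    ModelClass("Neighborhood", attributes=["name", "avgrent"]),
    ModelClass("Complex", attributes=["base"], reference_slots={"loc": "Neighborhood"}),
    ModelClass("Landlord", attributes=["individual", "delta"]),
    ModelClass("Obs", attributes=["desc", "rent"],
               reference_slots={"complex": "Complex", "landlord": "Landlord"}),
]
order = topological_order(schema)
assert order.index("Neighborhood") < order.index("Complex") < order.index("Obs")
```

This ordering is exactly what GenerateDataset in Figure 1 relies on: a class's parents are fully generated before the class itself.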
Given a program defining latent classes 𝒞 and observation class Obs, the generative process for a dataset is depicted in the right panel of Figure 1. We describe two representations of this process:
Discrete Random Measure Representation.
We process classes one at a time, in topological order. For each latent class, we (1) generate class-wide parameters θ_C from their corresponding priors, and (2) generate an infinite weighted collection of objects of class C. An object r of class C is an assignment of each attribute C.X to a value r.X and of each reference slot C.Y to an object r.Y of class T(C.Y). An infinite collection of latent objects is generated via a Pitman-Yor Process [46]:

    G_C ∼ PY(s_C, d_C, GenerateObject(C, θ_C, {G_{C′}}_{C′ ∈ Pa(C)}))

The Pitman-Yor Process is a discrete random measure that generalizes the Dirichlet Process. It can be understood as first sampling an infinite vector of probabilities ρ ∼ GEM(s_C, d_C) from a two-parameter GEM distribution, then setting G_C = Σ_{i=1}^∞ ρ_i δ_{r_i^C}, where each of the infinitely many objects r_i^C is distributed according to GenerateObject(C, θ_C, {G_{C′}}_{C′ ∈ Pa(C)}). This is itself a distribution over objects, which first samples reference slots and then attributes (see Figure 1). To generate the observed dataset, we sample θ_Obs from its prior distribution, then, for i ∈ {1, ..., n}, generate the i-th observed entry: r_i ∼ GenerateObject(Obs, θ_Obs, {G_C}_{C ∈ Pa(Obs)}).

Data Integration with Learned Reliability Rates Per Source:
    latent class Website
        name ~ string_prior()
        reliability ~ beta(100, 10)
    end
    latent class Flight
        arr_time ~ time_prior()
        …
    end
    observation class Obs
        src ~ Website
        flight ~ Flight
        arr_time ~ maybe_swap(flight.arr_time, arr_times, src.reliability)
        …
    end

Noisy Categorical with Unknown Domain:
    latent class ZoningCode
        name ~ string_prior()
    end
    observation class Obs
        code ~ ZoningCode
        obs_code ~ typos(code.name)
    end

Systematic Unit Errors (with miles(x) = 0.62x and km(x) = x):
    latent class Country
        name ~ string_prior()
        unit_dist ~ dirichlet([1,1])
    end
    observation class Obs
        parameter avg_km ~ normal(100, 10)
        country ~ Country
        unit ~ discrete([km, miles], country.unit_dist)
        distance ~ transformed_normal(unit, avg_km, 1)
    end

Figure 2: PClean programs can model a variety of data cleaning scenarios. All of these patterns can be used individually or combined in a single script, depending on the user's dataset.

Chinese Restaurant Process Representation.
We can also describe a finitely representable Chinese Restaurant version of this process. Consider a collection of restaurants, one for each class C, where each table serves a dish r representing an object of class C. Upon entering a restaurant, customers either sit at an existing table or start a new one, as in the usual generalized CRP construction. But these restaurants require that to start a new table, customers must first send |R(C)| friends to other restaurants (one to the target of each reference slot). Once they are seated at these parent restaurants, they phone the original customer to help decide what to order, i.e., how to sample the attributes r.X of the new table's object, informed by their dishes (the objects r.Y of class T(C.Y)). The process starts with n customers at the observation class Obs's restaurant, who sit at separate tables.
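The seating rule above can be sketched directly. The following is an illustrative Python implementation of the two-parameter (Pitman-Yor) CRP for a single class, ignoring reference slots and attribute sampling; the function name and interface are ours, not PClean's.

```python
import random

def crp_partition(n, strength, discount, rng):
    """Seat n customers via the two-parameter Chinese Restaurant Process:
    customer i joins an existing table with probability proportional to
    (count - discount), or starts a new table (a new latent object) with
    probability proportional to (strength + discount * num_tables)."""
    tables = []      # number of customers at each table
    assignment = []  # table index chosen by each customer
    for _ in range(n):
        weights = [c - discount for c in tables]
        weights.append(strength + discount * len(tables))  # weight of a new table
        u = rng.random() * sum(weights)
        acc = 0.0
        for t, w in enumerate(weights):
            acc += w
            if u <= acc:
                break
        if t == len(tables):
            tables.append(1)   # start a new table
        else:
            tables[t] += 1     # sit at an existing table
        assignment.append(t)
    return tables, assignment

rng = random.Random(0)
tables, assignment = crp_partition(100, strength=1.0, discount=0.5, rng=rng)
assert sum(tables) == 100  # every customer is seated exactly once
```

In PClean's model, each table corresponds to a latent object, so a single run of this process yields the partition of observed records among latent objects for one class.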
Generic sequential Monte Carlo and particle Gibbs algorithms are supported by several probabilistic programming languages [9, 25, 35, 48, 54]. PClean's inference algorithm (Algorithm 1) is based on sequential Monte Carlo initialization and particle Gibbs rejuvenation, but differs from generic PPL implementations in two ways. First, PClean uses per-object PGibbs updates that exploit the exchangeability of the domain-general model [23]. These updates allow for joint sampling of all latent variables associated with a single object. In contrast, generic PPL implementations operate over the complete state space. For our model, this would entail repeated costly iterations, each maintaining multiple copies of the complete latent state, rather than fast sweeps limited to a single latent object. Second, PClean uses data-driven proposals that lead to accurate results much faster than proposing from the prior (Figure 3). These proposals are informed by dataset-specific "inference hints" embedded in PClean programs, enabling end-users to concisely customize PClean's algorithm and empirically optimize performance, without deriving custom proposals themselves. The Venture inference engine can perform PGibbs over subproblems defined by single objects, but Venture's general-purpose implementation is not fast enough for large-scale problems.

Algorithm 1: Sequential Monte Carlo inference with particle Gibbs rejuvenation and data-driven proposals

    R_j^(0) ← ∅ for all j ∈ {1, ..., N_SMC}          ▷ Initialize each particle with empty set of objects
    w_j ← 1 for all j ∈ {1, ..., N_SMC}              ▷ Initialize particle weights
    for i ∈ {1, ..., |D|} do                          ▷ Process each data record in sequence
        for j ∈ {1, ..., N_SMC} do                    ▷ Update each particle
            R̃_j, w̃ ← DataDrivenProposal(Obs, y_i, R_j^(i−1))    ▷ Incorporate observation y_i
            w_j ← w_j · w̃
        a_1, ..., a_{N_SMC} ∼ Categorical(w_1 / Σ_j w_j, ..., w_{N_SMC} / Σ_j w_j)    ▷ Resample using particle weights
        R_j^(i) ← R̃_{a_j} and w_j ← 1 for all j ∈ {1, ..., N_SMC}
    for t ∈ {1, ..., M_PG}, j ∈ {1, ..., N_SMC} do    ▷ Particle Gibbs sweeps
        for object r in R_j do                         ▷ Update each latent object
            R_j^{−r} ← RemoveObject(R_j, r)            ▷ Remove r and anything that only it references
            R̃_1, w̃_1 ← R_j, w(R_j; R_j^{−r})          ▷ Set retained particle
            for k ∈ {2, ..., N_PG} do                  ▷ Propose N_PG − 1 other particles
                R̃_k, w̃_k ← DataDrivenProposal(C, ∅, R_j^{−r})
            a ∼ Categorical(w̃_1 / Σ_k w̃_k, ..., w̃_{N_PG} / Σ_k w̃_k)    ▷ Select a particle and update state
            R_j ← R̃_a

    procedure DataDrivenProposal(class C, direct observations y, partial state R^{−r}):
        r ← {}; w ← 1
        for subproblem i ∈ {1, ..., k_C} do
            r_i, Δ_R ∼ Q_C^i(r_i, Δ_R | r; R^{−r}, y)    ▷ Propose new attributes and reference slots
            w ← w / Q_C^i(r_i, Δ_R | r; R^{−r}, y)
            for r′ ∈ Δ_R do                               ▷ Add newly created objects
                w ← w · P_{Class(r′)}(r′ | R^{−r})
                R^{−r} ← R^{−r} ∪ {r′}
            r ← r ∪ r_i                                   ▷ Add proposed attributes and reference slots to r
        w ← w · P_C(r | R^{−r})
        w ← w · Π_{(r′, τ, X) : r′ ∈ R^{−r}, r′.τ = ⋆} P(r′.X | R)    ▷ Score downstream observations
        R ← R^{−r} ∪ {r}
        return R, w

Notation.
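For intuition, here is a heavily simplified, self-contained Python sketch of the SMC phase of Algorithm 1, applied to a toy problem (one latent categorical label per record). All names and numbers are illustrative, and the proposal samples from the prior rather than from PClean's data-driven proposals.

```python
import random

CATEGORIES = ["NY", "LA", "SF"]
NOISE = 0.1  # probability that a record's observed label is corrupted

def likelihood(latent, observed):
    # Simple noisy-channel likelihood for the toy model
    return 1.0 - NOISE if latent == observed else NOISE / (len(CATEGORIES) - 1)

def smc(observations, n_particles, rng):
    """Process records in sequence; extend, weight, and resample particles."""
    particles = [[] for _ in range(n_particles)]  # each particle: latent labels so far
    for y in observations:
        weights = []
        for p in particles:
            latent = rng.choice(CATEGORIES)  # (PClean would use a data-driven proposal)
            p.append(latent)
            weights.append(likelihood(latent, y))
        # Resample particle indices in proportion to the importance weights
        idx = rng.choices(range(n_particles), weights=weights, k=n_particles)
        particles = [list(particles[i]) for i in idx]
    return particles

rng = random.Random(1)
out = smc(["NY", "NY", "LA"], n_particles=50, rng=rng)
assert len(out) == 50 and all(len(p) == 3 for p in out)
```

The real algorithm differs in the state each particle carries (a growing collection of latent objects with reference slots) and in the proposal, but the incorporate-weight-resample loop has the same shape.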
The algorithm operates over a finite representation of the model's state space based on the Chinese Restaurant Process analogue described in Section 3.2. We maintain for each particle an initially empty collection R of currently instantiated objects of all classes. We write RemoveObject(R, r) to denote the operation of removing an object from R: all reference slots pointing to r are filled with the placeholder value ⋆, and if r was the only object that referred to some other object r′, then this subroutine is invoked recursively to remove r′ as well.

Data-driven proposals.
During SMC and particle Gibbs sweeps (Algorithm 1), object attributes and reference slots are proposed using a data-driven proposal Q_C(r, Δ_R; R^{−r}, y). Q_C is a distribution over a new object r of class C, and a (potentially empty) set of new objects Δ_R that may be created as targets for r's reference slots. It is informed by observed attributes y, as well as by R^{−r}, a collection of existing objects of all classes, with some objects r′ having placeholder reference slots r′.Y = ⋆ targeting the object r to be proposed by Q_C. Q_C uses subproblem decomposition [3, 25, 27] based on user inference hints (subproblem begin ... end in Figure 1), which partition C's attributes and reference slots into an ordered set of subproblems S_C^1, ..., S_C^{k_C}. We write A(S_C^i) for the attributes introduced in subproblem i, and R(S_C^i) for the reference slots. The proposal factors as

    Q_C(r, Δ_R; R^{−r}, y) = Π_{i=1}^{k_C} Q_C^i(r_i, Δ_R^i | r; R^{−r}, y)

Figure 3: Accuracy (recall, precision, and F1) vs. time (seconds) for three independent runs of two algorithms: PClean's data-driven SMC+PG with two particles, and 100-particle SMC+PG that uses generic proposals from general-purpose probabilistic programming languages. The default algorithm samples proposals from the prior, so it sometimes fails to propose the correct clean value even in cases where the cell is already clean. This leads to very low precision and thus low F1. PClean's precision is also poor after a single SMC pass (around 15-20 seconds), but improves greatly with a single particle Gibbs sweep (around 40 seconds later). The dataset is a version of Hospital (19,000 cells) with synthetically introduced typos and 20% of the values deleted.
In the worst case, a fully connected subproblem with m discrete variables of domain size k can take O(k^m) time to solve, though in practice, conditional independence enables much more efficient variable elimination routines. When a reference slot Y is involved, the complexity will depend on N_Y, the number of objects in the current state that might be the target of the reference slot. Letting N̂ = max_Y N_Y, the overall complexity of the algorithm becomes O(N · N̂ · (N_SMC + M_PG · N_PG)), where we treat domain sizes k of discrete random variables as folded into the constant. Note that N̂ is upper-bounded by N, but will often grow much more slowly than N. PClean hashes all objects into "canopies" [32] by their directly observed attributes, to access the set of N_Y possible (i.e., non-zero likelihood) target objects of each reference slot Y in O(1) time.

    Task     Metric  PClean  HoloClean*  HoloClean [40]  NADEEF [4]  PClean Ablation
    Flights  Prec    0.91    0.79        0.39            0.76        0.41
             Rec     0.89    0.55        0.45            0.03        0.36
             F1      0.90    0.65        0.42            0.06        0.38
    Hosp.    Prec    1.0     0.95        1.0             0.99        0.92
             Rec     0.83    0.85        0.71            0.73        0.17
             F1      0.91    0.90        0.83            0.84        0.29
    Rent     Prec    0.68    0.83        0.83            0           0.48
             Rec     0.69    0.34        0.34            0           0.44
             F1      0.68    0.48        0.48            0           0.46

    * Unpublished version of HoloClean, on the dev branch of https://github.com/HoloClean/holoclean

Table 1: Results of PClean and various baseline systems on three diverse cleaning tasks (F1 = 2PR/(P + R) for the precision P and recall R shown).

    Model           Baseline: sites     + Modeling of      + Learned per-site   + Prior that airline's
    description     equally reliable    timestamp format   reliability scores   site is better
    F1              –                   –                  –                    –
    Lines of code   16                  16                 17                   18

Table 2: Four PClean models on Flights. Each model (left to right) uses more domain knowledge.

We empirically evaluate PClean's accuracy, customizability, and scalability, versus state-of-the-art baselines. All experiments were performed on one laptop running macOS Catalina with a 2.6 GHz Core i7 processor and 32 GB of memory, with PClean experiments using two-particle data-driven sequential Monte Carlo with particle Gibbs sweeps.
We first evaluate PClean's accuracy and performance on three tasks with known ground truth: the Flights and Hospital tasks, which are standard benchmarks from the data cleaning literature [40], and a Rent task that we developed. Hospital contains artificial typos in 5% of cells. Flights lists flight departure and arrival times from often conflicting websites. We use the version from [24]. Rent consists of 50,000 apartment listings created using county-level HUD and census statistics [47], with misspellings of counties, missing states and number of bedrooms, and incorrect units on rent prices. Additional details (links to data, PClean source code, and all hyperparameters used) can be found in the Supplementary Materials.
Baselines.
We compare against three baselines. HoloClean is a data cleaning system based on probabilistic machine learning [40], which compiles custom integrity constraints into a factor graph with learned weights. We use both the latest (unpublished) version from the authors' GitHub and the version published in [40]. NADEEF is a data cleaning system that leverages user-specified cleaning rules via a variable-weighted MAX-SAT solver based algorithm [4]. PClean Ablation is an ablation of PClean that removes the relational structure and only allows the single class Obs.

PClean models.
(Figure 4 panels, left to right: the dirty observations, with columns Name, Specialty, Degree, School, Address, City, State, Zip and 2,183,988 more rows; the inferred latent structure, including latent classes BusinessAddr, City (e.g. Baltimore MD, Abingdon MD; 14,966 more rows), School with per-school degree distributions (e.g. PCOM: DO .79, MD .18; UMD: MD .89, PT .03; Other: MD .32, NP .22; 395 more rows), Physician (1,142,213 more rows), and learned parameters P(Specialty | Degree) (e.g. MD: Internal .15, Family .12; DO: Family .33, Internal .13; PT: Physical Therapy .94; 19 more parameters); and the cleaned data.)

Figure 4: PClean applied to Medicare's Physician Compare National database. Displayed is the input, the actual inferred latent entities, and cleaned output. PClean corrects systematic errors (e.g. the misspelled Abington, MD appears 152 times in the dataset) and imputes missing values.

We briefly describe the key features of the models we use for each dataset; see the Supplementary Materials for the probabilistic programs. For Hospital, latent classes are City, Place, Hospital, Metric, Condition, and HospitalType, and each field is modeled as a potentially misspelled version of a latent value. For Flights, latent classes are Flight and TrackingWebsite. Each site has a learned reliability property, and is presumed to be more reliable if it is the site of the airline running the flight. For Rent, the latent class is County, whose attributes are a name, a state, and an average rent for each apartment type (studio, 1BR, etc.).
Results.
Table 1 records recall (R = correct repairs / total errors), precision (P = correct repairs / total repairs), and F1 (F1 = 2PR/(P + R)) for each experiment. PClean achieves the highest F1 score on all tasks. Table 2 shows that small source code additions to incorporate additional domain knowledge can improve cleaning quality. See the Supplementary Materials for full PClean source code for each example.
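These metrics can be computed as follows. In the example call, the repair counts (7,954 correct out of 8,245 changes) come from the Medicare case study later in the paper, while the total-error count is a purely hypothetical placeholder, since the true number of errors in that dataset is unknown.

```python
def repair_metrics(n_correct_repairs, n_total_repairs, n_total_errors):
    """P = correct repairs / total repairs; R = correct repairs / total errors;
    F1 = 2PR / (P + R), as defined in the text."""
    p = n_correct_repairs / n_total_repairs if n_total_repairs else 0.0
    r = n_correct_repairs / n_total_errors if n_total_errors else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# 7,954 of 8,245 changes correct (Medicare case study); error count hypothetical
p, r, f1 = repair_metrics(7954, 8245, n_total_errors=10000)
assert round(p, 3) == 0.965  # matches the reported 96.5% precision
```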
Figure 4 shows results from applying PClean to Medicare's 2.2-million-row Physician Compare database [7]. This dataset contains many missing values and systematic errors.
Results.
PClean took 7h36m, changing 8,245 values and imputing 1,535,415 missing cells. In a random subsample of 100 imputed cells, we found 90% agreed with manually obtained ground-truth values. We also manually checked PClean's changes, finding that 7,954 were correct, for a precision of 96.5%. Of these, some were correct normalization (e.g., choosing a single spelling for cities whose names could be spelled in multiple ways). We cannot quantify recall without ground truth, but to calibrate overall result quality, we ran NADEEF and HoloClean on this data. NADEEF changed only 88 cells, and HoloClean's domain initialization alone takes 28 hours.

Figure 4 illustrates PClean's behavior on four rows from the dataset (showing 8 of 38 columns). Consider the misspelling Abington, MD, which appears in 152 entries. The correct spelling Abingdon, MD occurs in only 42. However, PClean recognizes Abington, MD as an error because all 152 instances share a single address, and errors are modeled as happening systematically at the address level. Now consider PClean's correct inference that K. Ryan's degree is DO. PClean leverages the fact that her school PCOM awards more DOs than MDs, even though more Family Medicine doctors are MDs than DOs. The parameters that enable this reasoning are learned from the dirty dataset itself.
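To see why inference favors DO here, one can combine the two learned conditional distributions shown in Figure 4 in a small Bayes-rule sketch. The probabilities below are approximate values read off the figure; the variable names are ours:

```python
# Unnormalized posterior over K. Ryan's degree, combining the learned
# school-conditional degree distribution with the degree-conditional
# specialty distribution (values approximated from Figure 4).
p_degree_given_school = {"DO": 0.79, "MD": 0.18}     # school = PCOM
p_specialty_given_degree = {"DO": 0.33, "MD": 0.12}  # specialty = Family Medicine

unnorm = {d: p_degree_given_school[d] * p_specialty_given_degree[d]
          for d in ("DO", "MD")}
z = sum(unnorm.values())
posterior = {d: w / z for d, w in unnorm.items()}

# The school term dominates the specialty term, so DO wins (~0.92).
print(max(posterior, key=posterior.get))  # DO
```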
Discussion
Limitations and future work.
PClean has limitations relative to probabilistic relational models [8]: (1) PClean handles only acyclic class dependency graphs; (2) one object's attributes can depend on another's, but not on aggregate properties of sets of related objects; and (3) PClean models cannot encode priors over reference slots that depend on attributes, e.g. cinemas selecting movies to show based on genre (though such dependencies can arise in the posterior). Hierarchical Pitman-Yor processes and more sophisticated probabilistic programming architectures could potentially be used to lift these restrictions, without sacrificing scalability. PClean could also be extended in other ways, e.g. to leverage uncertainty quantification for interactive cleaning [20], or to synthesize error models and functional dependencies automatically [13, 15].
Contributions
This paper has shown that it is possible to clean dirty, denormalized data from diverse domains via domain-specific probabilistic programs. The modeling assumptions and inference hints in these programs require under 50 lines of code, and yield state-of-the-art accuracy compared to strong baselines using weighted logic and machine learning. This language also scales to datasets with millions of rows. The power of domain-specific probabilistic programming languages, which integrate sophisticated styles of modeling and inference developed over years of research into simple languages, has been demonstrated in other challenging domains such as 3D computer vision [22, 26]. PClean offers analogous benefits for Bayesian data cleaning. We hope that PClean is useful to practitioners, helping non-experts clean data that would otherwise be unusable for analysis, and that the success of this approach encourages AI and probabilistic programming researchers to develop domain-specific probabilistic programming languages for other important problem domains.
The widespread need for clean data.
Many organizations struggle with dirty data, including large corporations, schools, non-profits, and governmental agencies [38]. In surveys, data analysts report that cleaning dirty data takes up more than sixty percent of their time, and the economic cost of unclean data has been estimated at trillions of dollars [37, 39]. PClean represents a step toward a reliable, broadly accessible data-cleaning approach that is simultaneously customizable and robust enough to be applicable to a broad range of problems in multiple domains.
Public interest uses for data cleaning.
Data cleaning is a central bottleneck in data journalism, social science, and policy evaluation [14]. We intend to assist people seeking to apply PClean to these kinds of public interest use cases, especially where the data comes from manual entry or scraping by volunteers. We plan to open-source the software as a Julia package, eliminating financial obstacles to PClean's use.
Harmful uses for data cleaning.
PClean could make it easier for companies, governments, and political organizations to link personal information from multiple sources [45]. Research on de-anonymization has shown repeatedly that vulnerable populations are easier to de-anonymize [16]. Better data cleaning technology, applied by oppressive "surveillance states" and/or for-profit companies whose business models depend on surveillance of consumers, could make these vulnerable people even easier to identify and exploit or persecute.
Data biases.
PClean's inference algorithm may infer parameters or impute missing values according to biases encoded in the source data. PClean could thus produce imputations that reflect systemic discrimination. One potentially mitigating factor is that, unlike many machine learning approaches to data cleaning, PClean makes explicit generative assumptions. It is thus conceptually straightforward to prevent PClean from incorporating known biases, e.g. by requiring columns containing information about race, gender, or sexual orientation to be modeled independently of outcomes.
Acknowledgments
The authors are grateful to Zia Abedjan, Divya Gopinath, Marco Cusumano-Towner, Raul Castro Fernandez, Cameron Freer, Christina Ji, Tim Kraska, Feras Saad, Michael Stonebraker, and Josh Tenenbaum for useful conversations and feedback. This work is supported in part by a Facebook Probability and Programming Award, the National Science Foundation, and a philanthropic gift from the Aphorism Foundation.

References

[1] Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, and Nan Tang. Detecting data errors: Where are we and what needs to be done? In Proceedings of the VLDB Endowment, 2016.
[2] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
[3] Marco F. Cusumano-Towner, Alexander K. Lew, Feras A. Saad, and Vikash K. Mansinghka. Gen: A general-purpose probabilistic programming system with programmable inference. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2019.
[4] Michele Dallachiesa, Amr Ebaid, Ahmed Eldawy, Ahmed Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, and Nan Tang. NADEEF: A commodity data cleaning system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2013.
[5] Sushovan De, Yuheng Hu, Venkata Vamsikrishna Meduri, Yi Chen, and Subbarao Kambhampati. BayesWipe: A scalable probabilistic framework for improving data quality. Journal of Data and Information Quality, 8(1), October 2016.
[6] Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas. A formal framework for probabilistic unclean databases. In Leibniz International Proceedings in Informatics (LIPIcs), 2019.
[7] Centers for Medicare and Medicaid Services (CMS). Physician Compare National Downloadable File, 2019.
[8] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In IJCAI International Joint Conference on Artificial Intelligence, 1999.
[9] Hong Ge, Kai Xu, and Zoubin Ghahramani. Turing: A language for flexible probabilistic inference. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), volume 84 of Proceedings of Machine Learning Research, pages 1682–1690. PMLR, 2018.
[10] Noah Goodman, Vikash Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: A language for generative models. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2008), pages 220–229. AUAI Press, 2008.
[11] Andrew D. Gordon, Thore Graepel, Nicolas Rolland, Claudio Russo, Johannes Borgström, and John Guiver. Tabular: A schema-driven probabilistic programming language. Conference Record of the Annual ACM Symposium on Principles of Programming Languages, pages 321–334, 2014.
[12] Maria I. Gorinova, Andrew D. Gordon, and Charles Sutton. Probabilistic programming with densities in SlicStan: Efficient, flexible, and deterministic. Proceedings of the ACM on Programming Languages, 3(POPL):1–30, 2019.
[13] Zhihan Guo and Theodoros Rekatsinas. Unsupervised functional dependency discovery for data preparation. ICLR LLD Workshop, pages 1–5, 2019.
[14] Alon Y. Halevy and Susan McGregor. Data management for journalism. IEEE Data Eng. Bull., 35(3):7–15, 2012.
[15] Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, and Theodoros Rekatsinas. HoloDetect: Few-shot learning for error detection. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2019.
[16] Sameera Horawalavithana, Clayton Gandy, Juan Arroyo Flores, John Skvoretz, and Adriana Iamnitchi. Diversity, homophily and the risk of node re-identification in labeled social graphs. In Studies in Computational Intelligence, 2019.
[17] Yuheng Hu, Sushovan De, Yi Chen, and Subbarao Kambhampati. Bayesian data cleaning for web data. 2012.
[18] Daniel Huang, Jean-Baptiste Tristan, and Greg Morrisett. Compiling Markov chain Monte Carlo algorithms for probabilistic modeling. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017), pages 111–125. Association for Computing Machinery, 2017.
[19] Mohammad Ahangar Kiasari, Gil-Jin Jang, and Minho Lee. Novel iterative approach using generative and discriminative models for classification with missing features. Neurocomputing, 225:23–30, February 2017.
[20] Ari Kobren. Integrating user feedback under identity uncertainty in knowledge base construction. AKBC, 2019.
[21] Jeremy Kubica and Andrew Moore. Probabilistic noise identification and data cleaning. Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 131–138, 2003.
[22] Tejas D. Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[23] James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani, and Daniel M. Roy. Exchangeable databases and their functional representation. pages 1–7, 2014.
[24] Mohammad Mahdavi, Samuel Madden, Ziawasch Abedjan, Mourad Ouzzani, Nan Tang, Raul Castro Fernandez, and Michael Stonebraker. Raha: A configuration-free error detection system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2019.
[25] Vikash Mansinghka, Daniel Selsam, and Yura N. Perov. Venture: A higher-order probabilistic programming platform with programmable inference. pages 1–78, 2014.
[26] Vikash Mansinghka, Richard Tibbetts, Jay Baxter, Pat Shafto, and Baxter Eaves. BayesDB: A probabilistic programming system for querying the probable implications of data. pages 1–45, 2015.
[27] Vikash K. Mansinghka, Ulrich Schaechtle, Shivam Handa, Alexey Radul, Yutian Chen, and Martin Rinard. Probabilistic programming with programmable inference. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018), pages 603–616. ACM, 2018.
[28] Vikash Kumar Mansinghka. Natively Probabilistic Computation. PhD thesis, 2009.
[29] Nicholas Elias Matsakis. Active Duplicate Detection with Bayesian Nonparametric Models. 2010.
[30] Chris Mayfield, Jennifer Neville, and Sunil Prabhakar. A statistical method for integrated data cleaning and imputation. Technical report, 2009.
[31] A. McCallum and B. Wellner. Object consolidation by graph partitioning with a conditionally-trained distance metric. In Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 19–24, 2003.
[32] Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178, 2000.
[33] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. BLOG: Probabilistic models with unknown objects. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI 2005), pages 1352–1359. Morgan Kaufmann Publishers Inc., 2005.
[34] Brian Milch and Stuart Russell. General-purpose MCMC inference over relational structures. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), pages 349–358, 2006.
[35] Lawrence M. Murray. Bayesian state-space modelling on high-performance hardware using LibBi. Journal of Statistical Software, 67(10):1–28, 2015.
[36] Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser. Identity uncertainty and citation matching. Advances in Neural Information Processing Systems, 2003.
[37] Gil Press. Cleaning big data: Most time-consuming, least enjoyable data science task, survey says. Forbes Tech, 2016.
[38] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 2000.
[39] Thomas C. Redman. Bad data costs the U.S. $3 trillion per year, 2016.
[40] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. HoloClean: Holistic data repairs with probabilistic inference. In Proceedings of the VLDB Endowment, 2017.
[41] Feras Saad and Vikash Mansinghka. A probabilistic programming approach to probabilistic data analysis. In Advances in Neural Information Processing Systems, 2016.
[42] Matthew S. Shotwell and Elizabeth H. Slate. Bayesian outlier detection with Dirichlet process mixtures. Bayesian Analysis, 6(4):665–690, 2011.
[43] David Sontag, Rohit Singh, and Bonnie Berger. Probabilistic modeling of systematic errors in two-hybrid experiments. In Pacific Symposium on Biocomputing 2007 (PSB 2007), pages 445–457, 2007.
[44] Rebecca C. Steorts, Rob Hall, and Stephen E. Fienberg. A Bayesian approach to graphical record linkage and deduplication. Journal of the American Statistical Association, 111(516):1660–1672, 2016.
[45] Latanya Sweeney. Computational Disclosure Control: A Primer on Data Privacy Protection. PhD thesis, 2001.
[46] Yee Whye Teh and Michael I. Jordan. Hierarchical Bayesian nonparametric models with applications. Bayesian Nonparametrics, pages 158–207, 2011.
[47] US Census Bureau. County population totals: 2010–2019, 2019.
[48] Jan-Willem van de Meent, Hongseok Yang, Vikash Mansinghka, and Frank Wood. Particle Gibbs with ancestor sampling for probabilistic programs. Journal of Machine Learning Research, 38:986–994, 2015.
[49] Antonio Vergari, Alejandro Molina, Robert Peharz, Zoubin Ghahramani, Kristian Kersting, and Isabel Valera. Automatic Bayesian density analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 33:5207–5215, 2019.
[50] Rajan Walia, Praveen Narayanan, Jacques Carette, Sam Tobin-Hochstadt, and Chung-chieh Shan. From high-level inference algorithms to efficient code. Proc. ACM Program. Lang., 3(ICFP), July 2019.
[51] B. Wellner, A. McCallum, F. Peng, and M. Hay. An integrated, conditional model of information extraction and coreference with application to citation matching. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 593–601, 2004.
[52] Michael Wick, Sameer Singh, Harshal Pandya, and Andrew McCallum. A joint model for discovering and linking entities. AKBC 2013: Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 67–71, 2013.
[53] John Winn, John Guiver, Sam Webster, Yordan Zaykov, Martin Kukla, and Dany Fabian. Alexandria: Unsupervised high-precision knowledge base construction using a probabilistic program. pages 1–20, 2017.
[54] Frank Wood, Jan-Willem van de Meent, and Vikash Mansinghka. A new approach to probabilistic programming inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS 2014), volume 33 of Proceedings of Machine Learning Research, pages 1024–1032. PMLR, 2014.
[55] Liang Xiong, Barnabás Póczos, Jeff Schneider, Andrew Connolly, and Jake VanderPlas. Hierarchical probabilistic models for group anomaly detection. Journal of Machine Learning Research, 15:789–797, 2011.
[56] Boris Yangel, Tom Minka, and John Winn. Belief propagation with strings.
[57] Bo Zhao, Benjamin I. P. Rubinstein, Jim Gemmell, and Jiawei Han. A Bayesian approach to discovering truth from conflicting sources for data integration. Proceedings of the VLDB Endowment, 5(6):550–561, February 2012.
A PClean model programs
In this section, we give the PClean source code for all models used in experiments. Code for PClean's implementation itself, as well as for running these PClean models, is also included in the accompanying .zip file.

A.1 Hospital
The Hospital dataset is modeled with six latent classes. Typos are modeled as independently introduced for each cell of the dataset. Some fields are modeled as draws from broad priors over strings, whereas others are modeled as categorical draws whose domain is the set of unique observed values in the relevant column (some of which are in fact typos).

Inference hints are used to focus proposals for string_prior choices on the set of strings that have actually been observed in a given column, and also to set a custom subproblem decomposition for the Obs class (all other classes use the default decomposition).

  latent class County
    parameter state_proportions ∼ dirichlet(ones(num_states))
    state ∼ discrete(observed_values[:State], state_proportions)
    county ∼ string_prior(3, 30, observed_values[:CountyName])
  end

  latent class Place
    county ∼ County
    city ∼ string_prior(3, 30, observed_values[:City])
  end

  latent class Condition
    desc ∼ string_prior(5, 35, observed_values[:Condition])
  end

  latent class Measure
    code ∼ uniform(observed_values[:MeasureCode])
    name ∼ uniform(observed_values[:MeasureName])
    condition ∼ Condition
  end

  latent class HospitalType
    desc ∼ string_prior(10, 30, observed_values[:HospitalType])
  end

  latent class Hospital
    parameter owner_dist ∼ dirichlet(ones(num_owners))
    parameter service_dist ∼ dirichlet(ones(num_services))
    loc ∼ Place
    type ∼ HospitalType
    id ∼ uniform(observed_values[:ProviderNumber])
    name ∼ string_prior(3, 50, observed_values[:HospitalName])
    addr ∼ string_prior(10, 30, observed_values[:Address1])
    phone ∼ string_prior(10, 10, observed_values[:PhoneNumber])
    owner ∼ discrete(observed_values[:HospitalOwner], owner_dist)
    zip ∼ uniform(observed_values[:ZipCode])
    service ∼ discrete(observed_values[:EmergencyService], service_dist)
  end

  observation class Obs
    subproblem begin
      hosp ∼ Hospital; service ∼ typos(hosp.service)
      id ∼ typos(hosp.id); name ∼ typos(hosp.name)
      addr ∼ typos(hosp.addr); city ∼ typos(hosp.loc.city)
      state ∼ typos(hosp.loc.county.state); zip ∼ typos(hosp.zip)
      county ∼ typos(hosp.loc.county.county); phone ∼ typos(hosp.phone)
      type ∼ typos(hosp.type.desc); owner ∼ typos(hosp.owner)
    end
    subproblem begin
      metric ∼ Measure
      code ∼ typos(metric.code); mname ∼ typos(metric.name)
      condition ∼ typos(metric.condition.desc)
      stateavg = "$(hosp.loc.county.state)_$(metric.code)"
      stateavg_obs ∼ typos(stateavg)
    end
  end

A.2 Flights
We first introduce the model for Flights that we used to obtain the results presented in Section 5.1. We then show variants of the model with less domain knowledge incorporated, which are shown to perform worse.

The primary model is shown below. In the parameter declaration for error_probs, we use the syntax error_probs[_] ∼ beta(10, 50) to introduce a collection of parameters; the declared variable becomes a dictionary, and each time it is used with a new index, a new parameter is instantiated. We use this to learn a different error_prob parameter for each tracking website. We could alternatively declare error_prob as an attribute of the TrackingWebsite class. However, PClean's inference engine uses smarter proposals for declared parameters (taking advantage of conjugacy relationships), so for our experiments, we use the parameter declaration instead. We hope to extend automatic conjugacy detection to all attributes, not just parameters, in the near future.

  latent class TrackingWebsite
    name ∼ string_prior(2, 30, websites)
  end

  latent class Flight
    flight_id ∼ string_prior(10, 20, flight_ids); guaranteed flight_id
    sdt ∼ time_prior(observed_values["$flight_id-sched_dep_time"])
    sat ∼ time_prior(observed_values["$flight_id-sched_arr_time"])
    adt ∼ time_prior(observed_values["$flight_id-act_dep_time"])
    aat ∼ time_prior(observed_values["$flight_id-act_arr_time"])
  end

  observation class Obs
    parameter error_probs[_] ∼ beta(10, 50)
    flight ∼ Flight; src ∼ TrackingWebsite
    error_prob = lowercase(src.name) == lowercase(flight.flight_id[1:2]) ?
                 1e-5 : error_probs[src.name]
    sdt ∼ maybe_swap(flight.sdt, observed_values["$(flight.flight_id)-sched_dep_time"], error_prob)
    sat ∼ maybe_swap(flight.sat, observed_values["$(flight.flight_id)-sched_arr_time"], error_prob)
    adt ∼ maybe_swap(flight.adt, observed_values["$(flight.flight_id)-act_dep_time"], error_prob)
    aat ∼ maybe_swap(flight.aat, observed_values["$(flight.flight_id)-act_arr_time"], error_prob)
  end

As in Hospital, we use observed_values to provide inference hints to the broad time_prior; this expresses a belief that the true timestamp for a certain field is likely one of the timestamps that has actually been observed, in the dirty dataset, with the given flight ID.
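The maybe_swap error model used above is not spelled out in this appendix. The following Python sketch shows one plausible reading of its generative semantics: with some small probability, the reported value is swapped for another value observed for the same field. The function name comes from the model; the body is our guess, not PClean's actual implementation.

```python
import random

def maybe_swap(true_value, observed_candidates, error_prob, rng=random):
    """With probability error_prob, emit a value drawn uniformly from the
    co-observed candidates; otherwise emit the true value faithfully.
    A hedged sketch of maybe_swap's semantics, not PClean's code."""
    if rng.random() < error_prob:
        return rng.choice(list(observed_candidates))
    return true_value

# With error_prob = 0 the observation is always faithful:
print(maybe_swap("7:10", ["7:10", "7:15"], 0.0))  # 7:10
```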
Alternative models.
We now briefly list the alternative models (Models 1-3) from Section 4.2. The unmodeled keyword is used to introduce a value that is guaranteed to be observed, but which is not modeled as a random variable.

Model 1 (uniform time priors, fixed error probability):

  latent class TrackingWebsite
    name ∼ string_prior(2, 30, websites)
  end

  latent class Flight
    flight_id ∼ unmodeled(); guaranteed flight_id
    sdt ∼ uniform(observed_values["$flight_id-sched_dep_time"])
    sat ∼ uniform(observed_values["$flight_id-sched_arr_time"])
    adt ∼ uniform(observed_values["$flight_id-act_dep_time"])
    aat ∼ uniform(observed_values["$flight_id-act_arr_time"])
  end

  observation class Obs begin
    flight ∼ Flight; src ∼ TrackingWebsite
    sdt ∼ maybe_swap(flight.sdt, observed_values["$(flight.flight_id)-sched_dep_time"], 0.1)
    sat ∼ maybe_swap(flight.sat, observed_values["$(flight.flight_id)-sched_arr_time"], 0.1)
    adt ∼ maybe_swap(flight.adt, observed_values["$(flight.flight_id)-act_dep_time"], 0.1)
    aat ∼ maybe_swap(flight.aat, observed_values["$(flight.flight_id)-act_arr_time"], 0.1)
  end

Model 2 (time_prior, fixed error probability):

  latent class TrackingWebsite
    name ∼ string_prior(2, 30, websites)
  end

  latent class Flight
    flight_id ∼ unmodeled(); guaranteed flight_id
    sdt ∼ time_prior(observed_values["$flight_id-sched_dep_time"])
    sat ∼ time_prior(observed_values["$flight_id-sched_arr_time"])
    adt ∼ time_prior(observed_values["$flight_id-act_dep_time"])
    aat ∼ time_prior(observed_values["$flight_id-act_arr_time"])
  end

  observation class Obs begin
    flight ∼ Flight; src ∼ TrackingWebsite
    sdt ∼ maybe_swap(flight.sdt, observed_values["$(flight.flight_id)-sched_dep_time"], 0.1)
    sat ∼ maybe_swap(flight.sat, observed_values["$(flight.flight_id)-sched_arr_time"], 0.1)
    adt ∼ maybe_swap(flight.adt, observed_values["$(flight.flight_id)-act_dep_time"], 0.1)
    aat ∼ maybe_swap(flight.aat, observed_values["$(flight.flight_id)-act_arr_time"], 0.1)
  end

Model 3 (time_prior, learned per-website error probabilities):

  latent class TrackingWebsite
    name ∼ string_prior(2, 30, websites)
  end

  latent class Flight
    flight_id ∼ unmodeled(); guaranteed flight_id
    sdt ∼ time_prior(observed_values["$flight_id-sched_dep_time"])
    sat ∼ time_prior(observed_values["$flight_id-sched_arr_time"])
    adt ∼ time_prior(observed_values["$flight_id-act_dep_time"])
    aat ∼ time_prior(observed_values["$flight_id-act_arr_time"])
  end

  observation class Obs begin
    parameter error_probs[_] ∼ beta(10, 50)
    flight ∼ Flight; src ∼ TrackingWebsite
    sdt ∼ maybe_swap(flight.sdt, observed_values["$(flight.flight_id)-sched_dep_time"], error_probs[src.name])
    sat ∼ maybe_swap(flight.sat, observed_values["$(flight.flight_id)-sched_arr_time"], error_probs[src.name])
    adt ∼ maybe_swap(flight.adt, observed_values["$(flight.flight_id)-act_dep_time"], error_probs[src.name])
    aat ∼ maybe_swap(flight.aat, observed_values["$(flight.flight_id)-act_arr_time"], error_probs[src.name])
  end

A.3 Rents
The program we use for Rents is shown below. We model the fact that the rent may be reported in grand instead of dollars, as well as that the county name may contain typos. We introduce an artificial field, block, consisting of the first and last letters of the observed (possibly erroneous) County field, and use it to inform an inference hint: we hint that posterior mass for a county's name concentrates on those strings observed somewhere in the dataset that share a first and last letter with the observed county name for this row. Without this approximation, inference is much slower (but potentially more accurate).

  data_table.block = map(x -> "$(x[1])$(x[end])", data_table.County)
  units = [Transformation(identity, identity, x -> 1.0),
           Transformation(x -> x/1000.0, x -> x*1000.0, x -> 1/1000.0)]

  latent class County
    parameter state_pops ∼ dirichlet(ones(num_states))
    block ∼ unmodeled(); guaranteed block
    name ∼ string_prior(10, 35, observed_values[block])
    state ∼ discrete(states, state_pops)
  end

  observation class Obs
    parameter avg_rent[_] ∼ normal(1500, 1000)
    subproblem begin
      county ∼ County
      county_name ∼ typos(county.name, 2)
      br ∼ uniform(room_types)
      unit ∼ uniform(units)
      rent_base = avg_rent["$(county.state)_$(county.name)_$(br)"]
      observed_rent ∼ transformed_normal(rent_base, 150.0, unit)
    end
    rent = round(unit.backward(observed_rent))
  end

A.4 Physicians
The model for Physicians is given below. Many columns are not modeled. Similar to Rents, we use a parameter in the Physician class for degree_probs, although it might seem more natural to use an attribute of the School class; the resulting model is the same, but using parameter allows PClean to exploit conjugacy.

  latent class School
    name ∼ unmodeled(); guaranteed name
  end

  latent class Physician
    parameter error_prob ∼ beta(1.0, 1000.0)
    parameter degree_proportions[_] ∼ dirichlet(3 * ones(num_degrees))
    parameter specialty_proportions[_] ∼ dirichlet(3 * ones(num_specialties))
    npi ∼ number_code_prior()
    school ∼ School
    subproblem begin
      degree ∼ discrete(observed_values[:Credential], degree_proportions[school.name])
      specialty ∼ discrete(observed_values[Symbol("Primary specialty")], specialty_proportions[degree])
      degree_obs ∼ maybe_swap(degree, observed_values[:Credential], error_prob)
    end
  end

  latent class City
    c2z3 ∼ unmodeled(); guaranteed c2z3
    name ∼ string_prior(3, 30, cities[c2z3])
  end

  latent class BusinessAddr
    addr ∼ unmodeled(); guaranteed addr
    addr2 ∼ unmodeled(); guaranteed addr2
    zip ∼ string_prior(3, 10); guaranteed zip
    legal_name ∼ unmodeled(); guaranteed legal_name
    subproblem begin
      city ∼ City
      city_name ∼ typos(city.name)
    end
  end

  observation class Obs
    p ∼ Physician
    a ∼ BusinessAddr
  end
B Inference hints
PClean supports three types of inference hints: subproblem declarations, preferred values arguments, and guaranteed observation statements. In this section, we describe each and give an empirical demonstration of how the results in Section 4 depend on them (Table 3).
B.1 Subproblem declarations
Subproblem declarations allow users to explicitly control which attributes and reference slots are included within each subproblem S^i_C (see Section 4). Larger subproblems lead to more expensive subproblem proposals Q^i_C, but can lead to more accurate results. Users declare a subproblem by wrapping adjacent attribute and reference statements into a subproblem begin ... end block.

To see the value of subproblem declarations, consider the Hospital program above. If we remove the subproblem begin ... end inference hint around the second subproblem in that model, then each line is treated as its own subproblem. We ran cleaning using this modified model (but kept other settings equal) and obtained an F1 score of only 0.18 (recall = 0.71, precision = 0.11).

B.2 Preferred values arguments
Preferred values arguments ξ_{C.X} are optional arguments to distributions like string_prior that have infinite or very large support. The model itself does not change as a result of providing a preferred values argument. However, the proposal Q^i_C is adapted to reflect that posterior mass is expected to concentrate on the user-provided list of preferred values. In the models for this paper, we often specified preferred values equal to all observed values in a particular column, or values observed to co-occur with another value in a separate column. For example, we model the name of a city in the Hospital dataset as a string generated from a broad prior, but indicate that we expect posterior mass to concentrate on the set of strings that have actually been observed as city names in the dataset.

To illustrate the value of preferred values arguments, we perform two targeted experiments. First, for the Flights dataset, we consider removing the preferred values argument from the time_prior call for the scheduled departure time of a flight. This yields a diminished overall F1 of 0.62 (recall = 0.70, precision = 0.55), due to mostly incorrect inferences about the scheduled departure time field. (Runtime is also slightly longer, because of the increased number of sampling operations from the time_prior proposal.) Second, we consider the effect of a preferred values argument that is very broad: we replace the more targeted preferred values argument for scheduled departure time with a list of all timestamps appearing anywhere in the Flights data (around 800 options). Inference quality is unaffected (F1 = 0.90, precision = 0.91, recall = 0.89), but running time is 7x longer: completing five iterations of particle Gibbs requires 95 seconds, instead of 13.

Preferred values arguments are a simple way to make inference in large discrete domains (e.g., strings) tractable. Researchers have also explored more sophisticated techniques for inference with strings [53, 56]. It may be possible to incorporate such strategies into a future version of PClean for more complex string-based inference, alleviating the need for preferred values arguments in some cases.

  Dataset + Inference Hints                                   F1    Recall  Precision  Runtime
  Hospital (with both subproblem blocks)                      0.91  0.83    1.0        39.9s
  Hospital (with only first subproblem block)                 0.18  0.71    0.11       37.8s
  Flights (with preferred values, guaranteed flight ID)       0.90  0.89    0.91       13.5s
  Flights (no preferred values for departure time)            0.62  0.70    0.55       17.6s
  Flights (overly broad preferred values for departure time)  0.90  0.89    0.91       95.4s
  Flights (no guaranteed flight ID)                           0.90  0.89    0.91       17.2s

Table 3: Effect of removing each kind of inference hint on accuracy and runtime of inference.

B.3 Guaranteed observation statements
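Constructing column-keyed preferred values, as in observed_values["$flight_id-sched_dep_time"], amounts to indexing the observed values of one column by a co-occurring key. The sketch below is our illustration of that preprocessing step (the function name and toy rows are ours):

```python
from collections import defaultdict

def build_preferred_values(rows, key_col, value_col):
    """Index observed values by a co-occurring key, so that a proposal
    for a latent value can be restricted to values actually seen with
    that key in the dirty data (an illustrative sketch)."""
    table = defaultdict(set)
    for row in rows:
        table[row[key_col]].add(row[value_col])
    return table

# Toy dirty data: two conflicting reports for the same flight.
rows = [{"flight_id": "AA101", "sched_dep": "7:10"},
        {"flight_id": "AA101", "sched_dep": "7:15"},
        {"flight_id": "UA20",  "sched_dep": "9:00"}]
prefs = build_preferred_values(rows, "flight_id", "sched_dep")
print(sorted(prefs["AA101"]))  # ['7:10', '7:15']
```

Restricting the proposal to these few candidates, rather than all ~800 timestamps in the dataset, is what produces the 7x runtime difference reported above.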
Guaranteed observations are declared using the guaranteed keyword, which tells PClean that the value of a particular variable in a PClean model is guaranteed to be observed. This allows PClean to index object collections by these observed variable values, enabling fast lookups of all existing latent objects that are consistent with a particular observation of the variable. In the Flights dataset, removing the guaranteed statement from the flight_id attribute yields no change in inference results, but a modest 31% increase in running time (up to 17s from 13s). On a larger dataset, this runtime difference would be more pronounced.
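The indexing that the guaranteed hint enables can be pictured as a hash map from the guaranteed attribute's value to the latent objects carrying it, so that only consistent objects are enumerated as candidates. The class below is our illustration of this data structure, not PClean's internal implementation:

```python
class ObjectIndex:
    """Index latent objects by a guaranteed-observed attribute value,
    giving O(1) lookup of the candidates consistent with an observation
    (an illustrative sketch; names are ours)."""

    def __init__(self):
        self._by_key = {}

    def add(self, key, obj):
        self._by_key.setdefault(key, []).append(obj)

    def candidates(self, key):
        # Only objects whose guaranteed attribute matches the observed
        # value can possibly explain the row; everything else is skipped.
        return self._by_key.get(key, [])

idx = ObjectIndex()
idx.add("UA1545", {"flight_id": "UA1545"})
idx.add("AA101", {"flight_id": "AA101"})
print(len(idx.candidates("UA1545")))  # 1
```

Without the hint, a proposal must instead consider (and score) every existing latent object, which accounts for the runtime gap reported above.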