A Bayesian Approach to Linking Data Without Unique Identifiers
Edwin Farley
Division of Applied Mathematics, Brown University
Roee Gutman
Department of Biostatistics, Brown University
Abstract
Existing file linkage methods may produce sub-optimal results because they consider neither the interactions between different pairs of matched records nor relationships between variables that are exclusive to one of the files. In addition, many of the current methods fail to address the uncertainty in the linkage, which may result in overly precise estimates of relationships between variables that are exclusive to one of the files. Bayesian methods for record linkage can reduce the bias in the estimation of scientific relationships of interest and provide interval estimates that account for the uncertainty in the linkage; however, implementation of these methods can often be complex and computationally intensive. This article presents the GFS package for the R programming language that utilizes a Bayesian approach for file linkage. The linking procedure implemented in GFS samples from the joint posterior distribution of model parameters and the linking permutations. The algorithm approaches file linkage as a missing data problem and generates multiple linked data sets. For computational efficiency, only the linkage permutations are stored, and multiple analyses are performed using each of the permutations separately. This implementation reduces the computational complexity of the linking process and the expertise required of researchers analyzing linked data sets. We describe the algorithm implemented in the GFS package and its statistical basis, and demonstrate its use on a sample data set.
Keywords: File linking, Bayesian analysis, Multiple imputation, MCMC, R, Python.
1. Introduction
In health care and the social sciences, individual subjects' characteristics and outcomes are often dispersed over multiple files. To investigate relationships between variables that appear in different data sources, researchers seek to link individuals across these data sources while adapting to privacy regulations. In some record linkage applications, such as fraud detection and law enforcement, identifying records that belong to the same individual is essential; however, in many record linkage applications in epidemiology, medicine, and biostatistics, it is the preservation of associations between variables that is crucial, while the identification of individuals across data sets is not (D'Orazio 2006).

The statistical literature describes two broad classes of methods to link different data sources: statistical matching and record linkage. The objective of statistical matching algorithms is to learn about relationships among variables that are not jointly observed in a single data source (Rässler 2012; Rodgers 1984). The data sources may comprise a disjoint set of units, and the linking variables will typically be scientifically relevant variables, as opposed to identifying labels. Associations mediated through the linking variables can be estimated, but associations conditional on the linking variables cannot (Rubin 1974).

In record linkage applications, the linked files represent overlapping units such that records in different files represent similar entities (Fellegi and Sunter 1969; Winkler 2002). Record linkage procedures can be classified into two main types of algorithms: deterministic and probabilistic. Deterministic record linkage methods identify records that belong to the same entity based on a deterministic agreement function applied to data elements that are common to both records. Probabilistic record linkage methods link records across data sets based on probabilities that pairs of records from the two different files represent the same unit.
These probabilities are commonly estimated from the distribution of elements' agreement in the observed data or from a previously identified subset of records.

Deterministic methods are widely used and can be as simple as establishing that records represent the same units when they match exactly on one or more common data elements, such as first and last names. Deterministic linking based on perfect agreements has been shown to have a higher rate of true links than probabilistic linking; however, when the underlying data elements are subject to error in recording, deterministic linking may have a high level of missed true links compared to probabilistic linking (Gomatam, Carter, Ariet, and Mitchell 2002; Campbell, Deck, and Krupski 2008).

We describe a Bayesian probabilistic linkage algorithm that views record linkage as a missing data problem. The algorithm accounts for relationships between pairs of records and relationships between variables exclusive to one of the files. The algorithm is implemented in the
GFS package for the R programming language (R Core Team 2017). The paper proceeds as follows: Section 2 provides a brief review of existing record linkage procedures and their implementations. Section 3 describes the proposed Bayesian record linking procedure. Section 4 describes the Markov chain Monte Carlo (MCMC) sampling algorithm for the proposed Bayesian record linkage procedure. Section 5 provides an empirical example demonstrating the use of GFS. Section 6 provides a summary and a discussion of future work.
2. Probabilistic linkage and related work
Probabilistic record linkage relies on a framework proposed by Fellegi and Sunter (1969). Let $\gamma_{ij} = (\gamma_{ij1}, \ldots, \gamma_{ijP})$ be an agreement vector between record $i$ in file $A$ and record $j$ in file $B$, where $P$ is the number of covariates that appear in both files and $\gamma_{ijk} \in \{1, \ldots, L_k\}$ represents the level of agreement of the values of covariate $k$ in the two files. The Fellegi and Sunter model assumes that $\gamma_{ij}$ follows a mixture distribution, such that if $(i, j)$ is a true link, $\gamma_{ij} \sim f_M(\gamma_{ij} \mid \theta_M)$, and if $(i, j)$ is a non-link, $\gamma_{ij} \sim f_U(\gamma_{ij} \mid \theta_U)$. Formally, the mixture model is:

$$P(\gamma_{ij}) = \pi f_M(\gamma_{ij} \mid \theta_M) + (1 - \pi) f_U(\gamma_{ij} \mid \theta_U), \qquad (1)$$

where $\pi$ is the probability that a pair of records is a true link.

Because in many applications $\pi$, $\theta_M$ and $\theta_U$ are unknown, the Expectation Maximization (EM) algorithm is commonly used to estimate these parameters (Belin and Rubin 1995; Larsen and Rubin 2001). Let $\hat\pi$, $\hat\theta_M$ and $\hat\theta_U$ be the estimates of $\pi$, $\theta_M$ and $\theta_U$, respectively. The Fellegi and Sunter algorithm calculates weights for each pair of records $(i, j)$, $w_{ij} = f_M(\gamma_{ij} \mid \hat\theta_M) / f_U(\gamma_{ij} \mid \hat\theta_U)$. These weights commonly inform a "greedy" algorithm that iteratively links and removes from the matching pool the pair of records with the highest probability of a match until a certain predefined probability threshold cannot be met by any remaining record pairs (Fellegi and Sunter 1969). The remaining record pairs with weights that are below the threshold are either clerically reviewed or declared non-links. Although the computational simplicity of the algorithm is considered a strength, it may produce a globally sub-optimal linkage, because it does not consider the relationships between distinct pairs of records.
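Under the common conditional-independence specification, in which $f_M$ and $f_U$ factorize over the $P$ comparison fields, the weights $w_{ij}$ follow directly from the estimated field-level agreement probabilities. The following Python sketch is purely illustrative (the function name and the binary-agreement simplification are ours, not part of any package):

```python
import numpy as np

def fs_weights(gamma, m_probs, u_probs):
    """Fellegi-Sunter weights under conditional independence.

    gamma   : (n_pairs, P) binary agreement matrix (1 = fields agree)
    m_probs : length-P vector, P(agree on field k | true link)  -- theta_M
    u_probs : length-P vector, P(agree on field k | non-link)   -- theta_U
    Returns w_ij = f_M(gamma_ij) / f_U(gamma_ij) for each pair.
    """
    gamma = np.asarray(gamma, dtype=float)
    m = np.asarray(m_probs)
    u = np.asarray(u_probs)
    # log f_M and log f_U factorize over fields under conditional independence
    log_fm = gamma @ np.log(m) + (1 - gamma) @ np.log(1 - m)
    log_fu = gamma @ np.log(u) + (1 - gamma) @ np.log(1 - u)
    return np.exp(log_fm - log_fu)
```

For instance, with two fields, agreement probabilities of 0.9 among true links and 0.1 among non-links, a pair agreeing on both fields receives weight $(0.9/0.1)^2 = 81$, while a pair agreeing on neither receives $1/81$.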
One possible solution to this issue is to rely on an optimization algorithm that forces one-to-one linkage after calculation of $w_{ij}$ (Jaro 1995).

Estimation of the probabilities that pairs of records represent the same entities can be computationally intensive. "Blocking" is a file linkage technique that reduces the number of possible links. Once the data sets are "blocked," only pairs of records that agree on the blocking variables are considered for linking (Newcombe and Kennedy 1962; Newcombe 1988). This reduces computational complexity and increases the accuracy of the linkage at the same time. Murray (2015) described other possible indexing methods, and possible limitations of linkage error propagation when blocking is used.

Despite the popularity of the Fellegi and Sunter algorithm for linking records, it has a number of weaknesses. First, the mixture model will always identify two clusters, regardless of whether the two files include records that represent the same entities (Winkler 2002). Moreover, the Fellegi and Sunter algorithm assumes that linkage identifications of different record pairs are independent (Sadinle 2017), and thus may not be efficient in identifying accurate links even with the inclusion of the optimization step proposed by Jaro (Jaro 1995). Lastly, in many applications, the linkage process is not the final goal of the analysis, and adjustments for errors in the linkage process should be addressed when estimating relationships between variables that are exclusive to one of the files. To address this issue, estimation procedures that rely on the non-informative linkage assumption have been proposed (Lahiri and Larsen 2005; Chambers,
Chipperfield, Davis, and Kovacevic 2009). These procedures also assume that the probabilities that a pair $(i, j)$ is a true link, $P((i, j) \in M \mid \gamma_{ij})$, or a false link, $P((i, j) \in U \mid \gamma_{ij})$, are known or can be estimated accurately. The non-informative linking assumption asserts that associations between variables that are exclusive to one of the files do not inform the linkage. Based on this assumption, $P((i, j) \in M \mid \gamma_{ij})$ and $P((i, j) \in U \mid \gamma_{ij})$ can be used as weights in a regression model (Lahiri and Larsen 2005; Han and Lahiri 2019), or in generalized estimating equations (Chambers 2009). In many applications, the non-informative linkage assumption is violated (Gutman, Afendulis, and Zaslavsky 2013; Gutman, Sammartino, Green, and Montague 2016), which may lead to biased estimates. This violation is especially pronounced when the true links and the false links cannot be separated well using variables that appear in both files.

Bayesian procedures can address these shortcomings by introducing a latent linking structure that maps records in one file to records in another file (Gutman et al. 2013). Existing R packages for record linkage are based on the Fellegi and Sunter algorithm and have similar limitations. One R package is fastLink, which links records based on the mixture model in Equation (1) (Enamorado, Fifield, and Imai 2019). The fastLink package estimates the parameters of the model using the EM algorithm, relying on efficient data structures to increase the speed of the estimation algorithm. To adjust for linkage error, it provides the probabilities that two records are a match, and these can be used as weights in regression models. An additional limitation of this package is that support for blocking is not well integrated into the package, so its use may increase computation time. A different R package is RecordLinkage (De Bruin 2019).
This package is slower than fastLink in terms of estimating the model parameters, but it can handle blocking in a simpler manner. Another R package is StatMatch. This package performs statistical matching, relying on predictive mean matching to impute variables that are not jointly observed in a single data source (D'Orazio 2015). This method implements the statistical matching procedure proposed by Little (1988), and it does not assume that both files include similar entities, unlike the other two packages.

A principal reason that Bayesian record linkage is underutilized in applied research is the lack of statistical software that implements it. We describe a software package called GFS that implements the Bayesian linkage algorithm described by Gutman et al. (2013). In this package, sampling from the joint posterior distribution of linkage permutations and model parameters is performed using the Python programming language for computational efficiency (Van Rossum and Drake 2009). We provide a set of R wrapper functions for ease of use. In addition, we provide simulated data sets that are based on real data as an example of possible analyses that can be performed with this package.
3. The proposed method and notation
We assume that data on similar units are dispersed across two files, $A$ and $B$, comprised of $n_A$ and $n_B$ records, respectively. Let $X_i = (X_{i,1}, \ldots, X_{i,P})$, $i = 1, \ldots, n_A$, denote a vector of covariates for entity $i$ in complete data set $X$. The $P$ covariates are partitioned into three components of size $P_1$, $P_2$, and $P_3$, such that $P_1 + P_2 + P_3 = P$. The first component includes $P_1$ covariates that are exclusive to file $A$, $XA_i = (X_{i,1}, \ldots, X_{i,P_1})$. The second component includes $P_2$ covariates exclusive to file $B$, $XB_i = (X_{i,P_1+1}, \ldots, X_{i,P_1+P_2})$. The third component includes $P_3$ covariates that are recorded without error in both files. Let $Z_i = (X_{i,P_1+P_2+1}, \ldots, X_{i,P})$ denote the $P_3$ covariates in both files. The variables in $Z$ can be used to create blocks and restrict the number of possible record pairs that are considered as links.

Sampling from the posterior distribution of a linkage structure is computationally complex, and blocking is a practical technique to reduce computational complexity. We assume that records are partitioned into $J$ blocks defined by the values of $Z$. In file $A$, each block $j \in \{1, \ldots, J\}$ comprises $I^A_j$ records. Similarly, in file $B$, block $j$ comprises $I^B_j$ records. We denote records from file $A$ in block $j$ as $XA_{ji}$, and records from file $B$ in block $j$ as $XB_{ji}$. We introduce a latent structure $C = \{C_j : j = 1, \ldots, J\}$, where $C_j$ represents the matching permutation in block $j$. For each record in file $A$, this structure indicates the matching record in file $B$, such that $C_j \in \{C_{jk} : k = 1, \ldots, K_j\}$, where $C_{jk}$ is one possible permutation for block $j$ and

$$K_j = \frac{\max\left(I^A_j, I^B_j\right)!}{\left|I^A_j - I^B_j\right|!}.$$

In the $j$th block, $C_{jk}[i] \in \{1, \ldots, I^B_j\}$ is the index for the record in file $B$ that is matched to record $i \in \{1, \ldots, I^A_j\}$ in file $A$ according to permutation $k$.
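The count $K_j$ grows factorially with block size, which is why blocking is essential for keeping the permutation space manageable. A one-line Python sketch of the formula (the helper name is ours, for illustration only):

```python
from math import factorial

def n_block_permutations(I_A, I_B):
    """K_j = max(I_A, I_B)! / |I_A - I_B|!, the number of distinct ways to
    link the records of the smaller file into the larger file's block."""
    return factorial(max(I_A, I_B)) // factorial(abs(I_A - I_B))
```

With equal block sizes $n$, the count is $n!$; with 3 records in file $A$ and 5 in file $B$, it is $5!/2! = 60$.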
For a given matching permutation $k$ for block $j$, the linked data for record $i$ of file $A$ is $(XA_{ji}, XB_{jC_{jk}[i]}, Z_i)$. For example, $C_{j5}[2] = 4$ implies that in the fifth possible linking permutation for block $j$, the second record in file $A$ is linked to the fourth record in file $B$. The newly linked record is $(XA_{j2}, XB_{j4}, Z_2)$. When $I^A_j = I^B_j$, all records in block $j$ of files $A$ and $B$ are linked. When $I^A_j \neq I^B_j$, some records in either file $A$ or file $B$ are left unmatched. Let $UA_j$ be the set of indices of unmatched records in block $j$ of file $A$, and $UB_j$ be the set of indices of unmatched records in block $j$ of file $B$. Given $C_j = C_{jk}$, the density for one entity is:

$$L_{i,j}(\theta, C_j = C_{jk}) = \begin{cases} f_{AB}\left(XA_{ji}, XB_{jC_{jk}[i]} \mid \theta, Z_i\right) f_Z(Z_i \mid \theta), & i \notin UA_j \\ f_A(XA_{ji} \mid Z_i, \theta)\, f_Z(Z_i \mid \theta), & i \in UA_j \end{cases} \qquad (2)$$

$$L_{l,j}(\theta, C_j = C_{jk}) = f_B(XB_{jl} \mid Z_l, \theta)\, f_Z(Z_l \mid \theta), \qquad l \in UB_j,$$

where $\theta$ is a parameter vector; $f_A$, $f_B$, and $f_{AB}$ are, respectively, the marginal densities of $XA_{ji}$ and $XB_{jl}$, and their joint density, conditional on $Z_i$; and $f_Z$ is the marginal density of $Z_i$. Multiplying over cases in a block, the likelihood for $\theta$ and $C_j$ for block $j$ is

$$L_j(\theta, C_j) = f_Z(Z_j \mid \theta)^{\max\left(I^A_j, I^B_j\right)} \times \prod_{i \in UA_j} f_A(XA_{ji} \mid \theta, Z_i) \times \prod_{l \in UB_j} f_B(XB_{jl} \mid \theta, Z_l) \times \prod_{i \notin UA_j} f_{AB}\left(XA_{ji}, XB_{jC_{jk}[i]} \mid \theta, Z_i\right). \qquad (3)$$

The form of $f_{AB}$ is specific to an application, but it is often convenient to express it as a product of conditional distributions, for example

$$f_{AB}(XA_i, XB_i \mid \theta, Z_i) = f_A(XA_i \mid Z_i, \theta_A) \cdot f_{B \cdot A}(XB_i \mid Z_i, XA_i, \theta_{B \cdot A}), \qquad (4)$$

where $\theta = (\theta_A, \theta_{B \cdot A})$. The densities $f_A$ and $f_{B \cdot A}$ may represent models of scientific interest as well as models for relationships that are useful purely for identifying true links.
An example of a model that is not of scientific interest may describe relationships between zip code digits that appear in both files but are recorded inconsistently across files.

A possible formulation of $f_{B \cdot A}$ is based on a combination of univariate generalized linear models. Formally,

$$f_{B \cdot A}(XB_i \mid Z_i, XA_i, \theta_{B \cdot A}) = f_{B_1}(XB_{i,1} \mid Z_i, XA_i, \theta_1) \times f_{B_2}(XB_{i,2} \mid Z_i, XA_i, XB_{i(-2)}, \theta_2) \times \cdots \times f_{B_{P_2}}(XB_{i,P_2} \mid Z_i, XA_i, XB_{i(-P_2)}, \theta_{P_2}), \qquad (5)$$

where $XB_{i(-p)} = (XB_{i,1}, \ldots, XB_{i,p-1})$ and $\theta_{B \cdot A} = (\theta_1, \ldots, \theta_{P_2})$. In the GFS package, depending on the data type of $XB_{i,p}$, each univariate conditional density in Equation (5) can be modeled as either a Normal linear regression model, a Poisson regression model with log link, or a logistic regression model.

Equation (2) shows that we need the marginal distributions $f_A$ and $f_B$ as well as the joint distribution $f_{AB}$, which is defined in Equation (4). In blocks where $I^A_j \leq I^B_j$, deriving the marginal distribution $f_B$ from Equation (4) requires integration over $XA_i$, which is often analytically intractable. To overcome this issue, we impute $I^B_j - I^A_j$ unobserved $XA_{ji}$ values in such blocks. This results in blocks with an identical number of records in both files, eliminating the need to define $f_B$ analytically. The unobserved $XA_{ji}$ are imputed by sampling from a uniform distribution over the observed records in block $j$ of file $A$. This distribution assumes that the unobserved records in $A$ follow the same distribution as the observed records in that block. In blocks where $I^A_j > I^B_j$, we choose only $I^B_j$ observations from file $A$ to be linked, exploiting the monotone missing data pattern (Little and Rubin 2019).

To complete the Bayesian model, we postulate a uniform independent prior distribution over the possible permutations $C_j$, which are independent of $\theta$. In addition, $\theta_1, \ldots, \theta_{P_2}$ are assumed to independently follow the default prior distributions in the pymc3 package for Python (Salvatier, Wiecki, and Fonnesbeck 2016). This results in a posterior distribution of

$$P(\theta, C \mid XA, XB, Z) \propto \pi(\theta) \prod_{j=1}^{J} L_j(\theta, C_j), \qquad (6)$$

where $\pi(\theta) = \pi(\theta_A) \times \pi(\theta_1) \times \cdots \times \pi(\theta_{P_2})$ is the prior distribution of $\theta$.
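To make the block likelihood concrete, the sketch below evaluates one block's log-likelihood contribution for a candidate permutation under a single Normal $f_{B \cdot A}$ sub-model. It is an illustrative stand-in, not the GFS internals: the function name and the single-model simplification are ours, and terms that do not vary with the permutation (such as $f_Z$, after the imputation step equalizes block sizes) are dropped.

```python
import numpy as np

def block_loglik(XA, XB, perm, beta, sigma):
    """Log-likelihood of one block for a candidate permutation, under a
    single Normal f_{B.A} model: XB[perm[i]] ~ N(XA[i] @ beta, sigma^2).
    Only permutation-dependent terms are included.
    """
    mu = XA @ beta                 # fitted means, one per file-A record
    resid = XB[perm] - mu          # pair record i with record perm[i]
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - 0.5 * (resid / sigma) ** 2))
```

Comparing `block_loglik` across permutations is exactly the quantity the Metropolis-Hastings step in Section 4 trades off: permutations that pair records consistent with the regression model receive higher log-likelihood.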
4. The three-step iterative sampling scheme
To sample from the joint posterior distribution of the parameters $\theta$ and linking structure $C$, the GFS package relies on a Gibbs sampler with a Metropolis-Hastings step within each block (Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin 2013). The starting values for all $C_j$ are determined based on the initial order of records in the given files. Let $C^{(t)}$ be the resultant permutation at the completion of iteration $t$ and $\theta^{(t)}$ the values of the regression parameters at the completion of iteration $t$. Iteration $t \in \{1, \ldots, T\}$ of the Gibbs sampling algorithm is:

1. Given $C^{(t-1)}$, sample the regression parameters $\theta^{(t)}$ from Equation (6). Because $\theta_1, \ldots, \theta_{P_2}$ are independent given $C^{(t-1)}$, each sub-model in Equation (5) is sampled separately using the No-U-Turn Sampler implemented in pymc3 (Hoffman and Gelman 2014).

2. Given $\theta^{(t)}$, sample each $C^{(t)}_j$ independently using the Metropolis-Hastings algorithm described by Wu (1995) and Gutman et al. (2013) for each block $j$ to obtain $C^{(t)}$.

3. For each block $j$, if $I^B_j > I^A_j$, sample $I^B_j - I^A_j$ records from the $I^A_j$ observed records from block $j$ in file $A$ to link with the remaining records from file $B$ in block $j$.

Below we expand on the three steps of the Gibbs sampling procedure for block $j$.

Step 1: Regression sampling
We assume independent prior distributions for each conditional model in Equation (5), so the parameters of each generalized linear regression model, $\theta_1, \ldots, \theta_{P_2}$, can be sampled independently from their respective posterior distributions. For example, let $XB_i = (XB_{i,1}, XB_{i,2})$, such that $XB_{i,1}$ is continuous and $XB_{i,2}$ is binary. The following regression models would be used:

$$f_{B_1}: XB_{C_j[i],1} \mid XA_i, \theta_1 \sim N(XA_i \cdot \beta_1, \sigma_1^2)$$
$$f_{B_2}: XB_{C_j[i],2} \mid XB_{C_j[i],1}, XA_i, \theta_2 \sim \mathrm{Bernoulli}\left(\mathrm{logit}^{-1}\left((XB_{C_j[i],1}, XA_i) \cdot \theta_2\right)\right), \qquad (7)$$

where $N$ is the Normal distribution, $\theta_1 = (\beta_1, \sigma_1^2)$, and $\mathrm{logit}^{-1}(p) = \exp(p)/(1 + \exp(p))$. We assume independent prior distributions for each regression parameter, $\pi(\theta_1, \theta_2) \propto \pi_1(\theta_1)\pi_2(\theta_2)$. Formally,

$$\pi_1: \beta_1 \mid \sigma_1^2 \sim N(0, c_1 \times I_{P_1}), \quad P(\sigma_1^2) \propto \sigma_1^{-2}$$
$$\pi_2: \theta_2 \sim N(0, c_2 \times I_{P_1+1}), \qquad (8)$$

where $I_p$ is a $p \times p$ identity matrix and $c_1$ and $c_2$ are large dispersion constants. Based on these prior distributions and likelihoods, the posterior distributions for $\theta_1$ and $\theta_2$ are independent given $C_j$, and can be sampled independently from their conditional posterior distributions at each iteration of the Gibbs sampler:

$$P(\theta_1 \mid XA, XB, Z, C^{(t)}_j) \propto f_{B_1}(XB \mid XA, Z, C^{(t)}_j, \theta_1) \times \pi_1(\theta_1)$$
$$P(\theta_2 \mid XA, XB, Z, C^{(t)}_j) \propto f_{B_2}(XB \mid XA, Z, C^{(t)}_j, \theta_2) \times \pi_2(\theta_2) \qquad (9)$$
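The Normal sub-model update in Step 1 can be sketched as a closed-form conditional draw. Note that GFS itself samples each sub-model with the No-U-Turn Sampler in pymc3; the conjugate draw below, with assumed hyperparameters `tau2`, `a0`, and `b0`, is a simplified stand-in that exposes the structure of the update:

```python
import numpy as np

def draw_theta1(X, y, sigma2, tau2=100.0, a0=1.0, b0=1.0,
                rng=np.random.default_rng(0)):
    """One draw of (beta_1, sigma_1^2) for y ~ N(X beta, sigma^2 I) under
    beta | sigma^2 ~ N(0, tau2 I) and sigma^2 ~ Inv-Gamma(a0, b0).
    `y` is the file-B outcome ordered by the current permutation C_j.
    """
    n, p = X.shape
    V = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)  # posterior cov
    m = V @ (X.T @ y) / sigma2                              # posterior mean
    beta = rng.multivariate_normal(m, V)
    resid = y - X @ beta
    # Inverse-Gamma(a, b) draw via the reciprocal of a Gamma(a, scale=1/b) draw
    sigma2_new = 1.0 / rng.gamma(a0 + n / 2.0,
                                 1.0 / (b0 + resid @ resid / 2.0))
    return beta, sigma2_new
```

Because the links are held fixed at $C^{(t)}_j$ during this step, the draw is an ordinary Bayesian regression update on the currently linked pairs.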
Step 2: Metropolis-Hastings step
Sampling directly from the distribution of all possible linking permutations is computationally intensive and practically intractable, even within individual blocks, because it requires calculating the likelihood function over all possible permutations. To reduce the computational complexity of this process, we rely on a Metropolis-Hastings algorithm originally proposed by Wu (1995). In one iteration of the Metropolis-Hastings step in block $j$, two indices $i_1, i_2 \in \{1, \ldots, I^A_j\}$ are chosen at random. From permutation $C^{(t)}_j$ we obtain $C^*_j$ by interchanging the entries at indices $i_1$ and $i_2$ to obtain a new permutation. The ratio between the likelihood of $C^*_j$ and the likelihood of $C^{(t)}_j$ is the Metropolis-Hastings acceptance ratio. Formally, accept $C^*_j$ with probability

$$\min\left(1, \frac{L\left(\theta^{(t)}, C^*_j \mid XA, XB, Z\right)}{L\left(\theta^{(t)}, C^{(t)}_j \mid XA, XB, Z\right)}\right). \qquad (10)$$

The number of iterations of the Metropolis-Hastings algorithm is set by the user and should be informed by the average block size.

Calculation of the likelihood ratio in (10) can be simplified to only the terms that include $i_1$ and $i_2$, because the rest of the likelihood does not change under the swap. When $I^A_j > I^B_j$, some records in file $A$ will not be linked to records in file $B$, in which case terms corresponding to non-linked records cancel out in the computation of the acceptance ratio.
Formally, for the Normal and logistic models described by Equation (7), for $i_1$ and $i_2$ in block $j$, assuming they have a link, the likelihood ratio in Equation (10) is

$$\frac{L\left(\theta^{(t)}, C^*_j \mid XA, XB, Z\right)}{L\left(\theta^{(t)}, C^{(t)}_j \mid XA, XB, Z\right)} = \frac{\prod_{i=1}^{\min\left(I^A_j, I^B_j\right)} \phi\left(XB_{C^*_j[i],1}; \mu^{(t)}_i, \sigma^{(t)}\right) g^{(t)}_i(C^*_j)^{XB_{C^*_j[i],2}} \left(1 - g^{(t)}_i(C^*_j)\right)^{1 - XB_{C^*_j[i],2}}}{\prod_{i=1}^{\min\left(I^A_j, I^B_j\right)} \phi\left(XB_{C^{(t)}_j[i],1}; \mu^{(t)}_i, \sigma^{(t)}\right) g^{(t)}_i\left(C^{(t)}_j\right)^{XB_{C^{(t)}_j[i],2}} \left(1 - g^{(t)}_i\left(C^{(t)}_j\right)\right)^{1 - XB_{C^{(t)}_j[i],2}}}$$

$$= \frac{\prod_{i \in \{i_1, i_2\}} \phi\left(XB_{C^*_j[i],1}; \mu^{(t)}_i, \sigma^{(t)}\right) g^{(t)}_i(C^*_j)^{XB_{C^*_j[i],2}} \left(1 - g^{(t)}_i(C^*_j)\right)^{1 - XB_{C^*_j[i],2}}}{\prod_{i \in \{i_1, i_2\}} \phi\left(XB_{C^{(t)}_j[i],1}; \mu^{(t)}_i, \sigma^{(t)}\right) g^{(t)}_i\left(C^{(t)}_j\right)^{XB_{C^{(t)}_j[i],2}} \left(1 - g^{(t)}_i\left(C^{(t)}_j\right)\right)^{1 - XB_{C^{(t)}_j[i],2}}}, \qquad (11)$$

where $\phi(\,\cdot\,; \mu^{(t)}_i, \sigma^{(t)})$ is the density of a Normal distribution with mean $\mu^{(t)}_i$ and standard deviation $\sigma^{(t)}$, $\mu^{(t)}_i = XA_i \cdot \beta^{(t)}_1$, and $g^{(t)}_i(C_j) = \mathrm{logit}^{-1}\left(\left(XB_{C_j[i],1}, XA_i\right) \cdot \theta^{(t)}_2\right)$.

Step 3: Imputing missing records in file A

When file $A$ contains fewer records than file $B$ in block $j$, we add $I^B_j - I^A_j$ records of $XA_i$ by sampling with replacement from the $I^A_j$ observed $XA_i$ values. This imputation step ensures that for all of the blocks $I^A_j \geq I^B_j$.
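The Step 2 swap update can be sketched in a few lines. The per-pair log-likelihood function below is a hypothetical argument standing in for the Normal and logistic terms of Equation (11); the point of the sketch is that only the two affected pairs enter the acceptance ratio:

```python
import numpy as np

def mh_swap_step(perm, loglik_pair, rng):
    """One Metropolis-Hastings swap in the style of Wu (1995): propose
    exchanging the file-B partners of two random file-A records, and accept
    with the likelihood ratio, which involves only the two affected pairs.

    perm        : current permutation; perm[i] = file-B index linked to i
    loglik_pair : function (i, j) -> log-likelihood of linking A-record i
                  to B-record j (illustrative, not the GFS interface)
    """
    i1, i2 = rng.choice(len(perm), size=2, replace=False)
    cur = loglik_pair(i1, perm[i1]) + loglik_pair(i2, perm[i2])
    prop = loglik_pair(i1, perm[i2]) + loglik_pair(i2, perm[i1])
    if np.log(rng.uniform()) < prop - cur:   # accept w.p. min(1, ratio)
        perm[i1], perm[i2] = perm[i2], perm[i1]
    return perm
```

Iterating this swap within a block yields draws from the conditional posterior over $C_j$; the number of iterations per block is the user-set multiple of the block size described above.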
5. Empirical example
We present the functionality of the
GFS package with an example based on real data. The
GFS package can be found at https://github.com/edwinfarley/GFS. The package can be installed in an R session with a call to install_github from the devtools package with the provided URL. Complete instructions for installing the package can be found on GitHub and in the Appendix (Section 7).

The empirical example is based on a subset of 3,200 randomly selected participants from the 2001 Behavioral Risk Factor Surveillance System (BRFSS) data set (Centers for Disease Control (CDC) 2001). The participants were randomly selected only among those that had complete data on all of the variables used for the analysis. The original data set is partitioned into two files, in order to mimic the input data that is commonly available in file linkage applications. From the available variables in BRFSS, we identify a set of variables that exist in both files and are recorded accurately, and variables that are exclusive to each of the files. The variables that appear in both files were used to create blocks. The blocking variables are the individual's sex, state of residence, geographic stratum, and time of year the entry was recorded, with the year split into four 3-month periods. These chosen blocking variables represent generic measures that could plausibly be shared by two distinct data sets similar to the BRFSS data set. This blocking scheme results in 1725 blocks, 1073 of which contain a single one-to-one link. The remaining 652 blocks with more than 1 record in each file have an average of 3.26 records and a maximum of 24 records. The variables that were selected to be included in this analysis include individuals' general health, physical health, mental health, age, alcohol consumption, weight, and whether they suffer from asthma. A summary of the data using R is:

> str(samples)
'data.frame': 3200 obs. of 9 variables:
 $ X.1     : int 1 2 3 4 5 6 7 8 9 10 ...
 $ GENHLTH : int 3 3 3 4 3 1 5 5 5 4 ...
 $ PHYSHLTH: int 10 15 2 20 2 5 30 7 30 30 ...
 $ MENTHLTH: int 20 6 5 20 10 7 6 7 30 20 ...
 $ AGE     : int 36 57 24 63 37 28 54 57 67 59 ...
 $ ALCDAY  : int 215 207 102 204 101 207 215 210 202 105 ...
 $ WEIGHT  : int 202 260 175 275 161 170 190 240 215 210 ...
 $ ASTHMA  : int 1 1 0 0 1 1 0 1 1 1 ...
 $ block   : int 1 2 3 4 5 6 7 7 8 9 ...

The variables in the first file include the individuals' weight, physical health, mental health, and age. A summary of the data in the first file using R is:

> str(samples1)
'data.frame': 3200 obs. of 5 variables:
 $ WEIGHT  : int 202 260 175 275 161 170 190 240 215 210 ...
 $ PHYSHLTH: int 10 15 2 20 2 5 30 7 30 30 ...
 $ MENTHLTH: int 20 6 5 20 10 7 6 7 30 20 ...
 $ AGE     : int 36 57 24 63 37 28 54 57 67 59 ...
 $ block   : int 1 2 3 4 5 6 7 7 8 9 ...

The variables in the second file include the individuals' general health, alcohol consumption, and an indicator for asthma. A summary of the data in the second file using R is:

> str(samples2)
'data.frame': 3200 obs. of 5 variables:
 $ X.1    : int 1 2 3 4 5 6 7 8 9 10 ...
 $ GENHLTH: int 2 4 3 4 2 3 2 3 1 2 ...
 $ ALCDAY : int 101 228 201 201 103 201 201 202 203 205 ...
 $ ASTHMA : int 0 0 1 1 0 1 1 1 1 1 ...
 $ block  : int 624 514 1120 272 145 777 1030 753 1456 105 ...

The block column present in both files indicates the blocks to which each record belongs, with similar indexing in both data sets.
We applied the record linkage algorithm under three separate $f_{B \cdot A}$ models. The first model was a linear regression model with the general health measure, GENHLTH, as the response variable, and physical health (PHYSHLTH) and mental health (MENTHLTH) as explanatory variables. The second model was a logistic regression model with the asthma indicator, ASTHMA, as the response variable, and PHYSHLTH, WEIGHT, and AGE as the explanatory variables. The third model relies on both models jointly to link records from both files.
To sample from the joint posterior distribution of the linkage structure and the parameters in R, we use the permute_inputs() function from the GFS package. The following R command is used to link the two data sets using the linear regression model from the "Normal" family:

> P_samples = permute_inputs("samples1.csv", "samples2.csv",
+   "GENHLTH~PHYSHLTH+MENTHLTH", "Normal", 10, 25, 5, 750, 50,
+   conda_env = "gfs")

The first two input parameters are the names of the files to be linked. The third parameter defines an R string formula that includes the response and explanatory variables. The fourth parameter defines the form of the regression, where "Normal" stands for a linear regression model. The fifth parameter (10) defines the number of linkage permutations that will be produced by the sampling algorithm. The sixth parameter (25) defines the number of iterations in sampling the parameters $\theta$ within each Gibbs sampling iteration. The seventh parameter (5) defines the number of Metropolis-Hastings iterations that will be used to sample new permutations. The actual number of iterations is the product of this value and the number of records in each block. The eighth parameter (750) is the number of Gibbs sampler burn-in iterations. The ninth parameter (50) is the thinning interval, which is the number of Gibbs sampler iterations discarded between each saved sample. The conda_env parameter is used to specify the Python virtual environment in which the Python processes will run. In our case, we used the Anaconda distribution of Python and installed the requisite packages to an environment named gfs using the utility function included with the GFS package, as described in the installation instructions.

The following R command is used to link the two data sets using the logistic regression model with similar parameters:

> P_log_samples = permute_inputs("samples1.csv", "samples2.csv",
+   "ASTHMA~PHYSHLTH+AGE+WEIGHT", "Logistic", 10, 25, 5, 750, 50,
+   conda_env = "gfs")

The return value of these calls is a data frame of size 3200 ×
10, where each column contains row indices defining a full permutation. Each permutation is a sampled matching of records from the second data set to the records of the first data set, defining a combined data set. This is an efficient manner to store the permutations without needing to save multiple copies of the two files, and we provide a function to generate a complete data set from two files and a permutation. Point estimates can be obtained by computing the estimates of interest within each linked data set and averaging these values. Interval estimates for these quantities can be obtained using common multiple imputation combination rules (Rubin 2004).
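The combining rules referenced above (Rubin 2004) pool the per-permutation estimates into a single point estimate and variance. A minimal Python sketch (the helper name is ours, and the degrees-of-freedom correction for interval width is omitted for brevity):

```python
import numpy as np

def combine_mi(estimates, variances):
    """Rubin's multiple-imputation combining rules: pool M point estimates
    and their within-imputation variances into one point estimate and one
    total variance."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = len(q)
    qbar = q.mean()                    # pooled point estimate
    ubar = u.mean()                    # within-imputation variance
    b = q.var(ddof=1)                  # between-imputation variance
    total = ubar + (1 + 1 / M) * b     # total variance
    return qbar, total
```

The between-imputation component $b$ is what carries the linkage uncertainty: estimates that vary strongly across sampled permutations produce wider intervals.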
For bivariate Normal records that are partitioned across two files, it was shown that the expected gain in information using the proposed algorithm increases as the correlation between the variables increases (Gutman et al. 2013). The
GFS package supports the use of multiple linkage models to inform the linkage procedure. We consider combining the linear regression model and the logistic regression model described in Section 5.3. The R command for running the two models is:

> P_joint_samples = permute_inputs("samples1.csv", "samples2.csv",
+   c("GENHLTH~PHYSHLTH+MENTHLTH", "ASTHMA~PHYSHLTH+AGE+WEIGHT+GENHLTH"),
+   c("Normal", "Logistic"), 10, 25, 5, 750, 25,
+   conda_env = "gfs")

To display possible analyses using the linked data sets, we describe regression analyses using variables that are exclusive to one of the two data sets. Our analysis consists of generalized linear models, in which the response variable is from one file and the explanatory variable is from the other file. To obtain a linked data set based on a sampled permutation, we create
a copy of the second data set, reordered according to the indices in the permutation. In R this is implemented by indexing the records of samples2 with an array corresponding to permutation p, which could be a column from the result of permute_inputs. This results in records from file samples2 that can be horizontally concatenated to records in file samples1. In R this is implemented with the following command:

> samples_linked = build_permutation(samples1, samples2, p)

Applying this function to each of the sampled permutations will result in a fully linked data set. The statistic of interest can be calculated using each data set separately. The individual results are then combined using the common multiple imputation combination rules to obtain point and interval estimates.

We present the estimates for the slopes in generalized linear models with one explanatory variable using two linkage procedures. The first procedure estimates the slopes based on permutations sampled uniformly from the distribution of permutations. The second procedure is the proposed linking procedure. For the proposed procedure we examined the three different models described in Sections 5.3 and 5.4. In addition, we compare the slope estimates after applying the different linkage procedures to the slope estimates on the full data set under the true linkage.
We summarize the results of applying our method to estimate the slopes in generalized linearmodels with one explanatory variable when linkage is performed with a single linear regressionmodel, a logistic regression model, and a combination of these models.
Linear regression linkage model
Figure 1 shows the results of multiple generalized linear models with one covariate when linkage is implemented with a linear regression model. The title of each set of plots is the response variable, and the plots within each set show the point and interval estimates of the slopes for the covariate indicated on each y-axis. In each plot, the red line depicts the value of the slope calculated with the complete data set; the black and grey intervals show the 95% confidence intervals under linkage permutations sampled using the proposed method and those sampled under random permutations, respectively.

The proposed method generally reduces bias in point estimates compared to the randomly sampled permutations. This is most apparent when
GENHLTH is the response variable. The proposed method results in significantly smaller sampling variance than is observed with randomly selected permutations. Moreover, the interval estimates of both methods display similar coverage of the true slope. This demonstrates the gain in accuracy and precision from using the proposed method.

Figure 1: GENHLTH, ALCDAY, and ASTHMA under the Normal model.
Logistic regression linkage model
Figure 2 depicts the results for different generalized linear models with a single covariate when the linkage method is based on a logistic regression model. The linkages using the proposed method for the GENHLTH and ALCDAY response variables are similar to the random permutations. For the response variable ASTHMA, we observe slightly lower bias with most explanatory variables and smaller sampling variance. An explanation for the worse performance of the logistic model compared to the linear regression model in the previous section is that there are 984 records that belong to blocks where the value of ASTHMA is identical for all records in the block. The coefficients of the logistic regression model are sampled at each MCMC iteration over the entire data set; however, when sampling a new permutation, if all the binary response values in a block are the same, all of the block's linkage permutations have identical likelihood. Therefore, in any block with identical binary response values, the proposed procedure samples linkage permutations uniformly.
Figure 2: GENHLTH, ALCDAY, and ASTHMA under the Logistic model.

By excluding the blocks containing identical ASTHMA values, we are left with 2,216 records, of which 1,073 records are single-record blocks and the remaining records are distributed among 263 blocks. Figure 3 shows the results for the blocks with dissimilar ASTHMA values. Without the blocks containing identical ASTHMA values, the bias and the sampling variance of coefficient estimates over the linked data set are reduced compared to when they were included and compared to the random permutations. Nonetheless, the improvement over the random permutations with the logistic model in either case is not as pronounced as the results under the linkage generated using the linear regression linkage model.
Joint generalized linear linkage models
Figure 3: GENHLTH, ALCDAY, and ASTHMA under the Logistic model without uniform blocks.

Figure 4 displays the results for different generalized linear models with one covariate when permutations are generated using a combination of two models. The joint use of two linkage models results in similar or improved operating characteristics compared to each of the linkage models separately. Specifically, the bias and variability of the interval estimates were similar to or smaller than those observed for either the linear regression or the logistic regression linkage model. The improvement over the linear regression model is not substantial in this case, because the associations in the linear regression model are more informative than those under the logistic regression model, as evidenced by our analysis of their individual results.
Correct matches
In this simulated example we know the 3,200 true links, so we can compare the number of true links under the proposed method with the number of correct links when rows are permuted at random. Table 1 shows the average number of correctly linked records and the standard deviation of correct links over 10 permutations for each model.

Figure 4: GENHLTH, ALCDAY, and ASTHMA under the combination of models.

There is a significant improvement in the number of correct links under the linear regression model and the joint model compared to random permutation sampling. These results mimic the results for the bias and sampling variance of estimates for the slopes of the generalized linear models with a single covariate. Excluding the 1,073 blocks with only one record from each file, the random permutation selection approach yields on average 657 correct links in blocks with more than one record from each file. Using multiple linkage models resulted in 806 correct links in blocks with more than one record from each file. This is an improvement of approximately 23%.

With the current data set, the number of correctly linked records using random samples is similar to the number of correctly linked records using the logistic regression linkage model. This result is a reminder that the aim of many file linkage applications is not necessarily to correctly link individual records, but rather to allow for downstream analyses of the linked data set, and these goals are not necessarily equivalent. In blocks with non-uniform response values, the proposed method with the logistic regression model results in 1,335 correct matches on average, on par with the random permutations method (Table 2); however, compared to random permutation sampling, the proposed method using the logistic regression linkage model results in smaller sampling variability and some reduction in the bias of the slopes in the generalized linear models with a single covariate (Figures 2 and 3).

Table 1: All Blocks

Model     Correct Links   Correct Links excl. singleton blocks   Std. Deviation
Random    1730            657                                    34
Normal    1872            799                                    22
Logistic  1725            652                                    31
Joint     1879            806                                    27

Table 2: Excluding Blocks with Identical Values for ASTHMA

Model     Correct Links   Correct Links excl. singleton blocks   Std. Deviation
Random    1338            265                                    17
Logistic  1335            262                                    22
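When the true linkage is known, as in this simulated example, counting correct links reduces to comparing a sampled index permutation against the true assignment. A minimal sketch (the helper function and toy data are hypothetical, not part of the package):

```python
def correct_links(perm, truth):
    """Count records linked to their true counterpart, where perm[i]
    gives the row of file 2 linked to row i of file 1 and truth is
    the known true assignment (the identity when files are aligned)."""
    return sum(1 for i, j in enumerate(perm) if truth[i] == j)

truth = list(range(6))          # files happen to be aligned row-by-row
sampled = [0, 2, 1, 3, 4, 5]    # one sampled linkage permutation
n_correct = correct_links(sampled, truth)  # rows 1 and 2 are swapped
```

Averaging this count over the sampled permutations, as done for Tables 1 and 2, summarizes how often the sampler recovers the true pairing.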
6. Conclusions
Probabilistic record linkage methods are becoming increasingly relevant in epidemiology and the social sciences, because they allow researchers to perform complex analyses using granular information that is not available in a single data set. The approach presented by Gutman et al. (2013) and implemented in the GFS package mitigates some of the limitations of other approaches to probabilistic record linkage. The GFS package employs a Bayesian record linkage approach that incorporates data that are exclusive to one of the files as well as data common to both files.

We have presented an example to illustrate the improvements that can be achieved by using the proposed probabilistic linkage method. The analysis shows that the choice of model is important in improving the accuracy of the linkage as well as the accuracy and precision of estimates of the parameters of interest. It is important to note that the aim of the method is not necessarily to correctly link individual records, but rather to estimate associations between variables dispersed across two files. As shown in our example, in some cases there is little information from which to infer which record from one file is related to a record from the other file. If the association between a binary response variable and the explanatory variables is of interest, this is not a concern for inference, because when all the response values within a certain block are identical, any set of linked records provides identical information; however, if other relationships in the data are of interest, they should be included during the linkage process. Moreover, our analysis shows that including scientifically important correlations, as well as those that are not, improves the linkage and the estimates of parameters of interest. This is similar to the concept of congeniality when implementing multiple imputation (Meng 1994).

In many applications of file linkage, adjustments for possible errors in the linkage are commonly neglected, but such errors arise regardless of which linkage procedure is applied (Shlomo 2013). A number of approaches have been proposed to measure and account for these errors in regression settings (Shlomo 2013). The proposed Bayesian approach enables the estimation
of linkage errors by sampling from the posterior distribution of possible linkage permutations. This approach is not restricted to regression models, and it allows for propagation of linkage errors to estimate any statistic after the linkage is performed.

The implemented approach, which is based on multiple imputation, is computationally efficient. First, its underlying implementation is in Python, which enables efficient MCMC sampling. Second, the linkage structure is saved in an efficient data structure: we save only permutations of indices, so the original files are not saved for every sampled linkage. Third, because in many applications record linkage serves as a tool to investigate specific scientific questions, our program produces multiple linkage structures that can be used to estimate any parameters of interest. The more computationally expensive process of linkage can be performed once, after which researchers can perform multiple analyses and combine their results using common multiple imputation rules. Fourth, the method allows for unbalanced blocks in the files to be linked by relying on a monotone missing data pattern and imputing mismatched records in only one of the files as part of the sampling process.

We have used blocking to increase efficiency and scalability; however, in cases when the blocking variables are recorded with errors, blocking may exclude true links and influence subsequent inferences. An area of improvement for the current method is to extend the sampling algorithm to allow for errors in blocking variables (Dalzell and Reiter 2018). While this could mitigate the potential of erroneous blocking to skew inferences, such an addition could increase the computational complexity of the sampling algorithm to the point where it would become computationally prohibitive for large data sets. Another area of improvement for the current implementation is support for additional regression models. Hierarchical regression models and two-part models, such as zero-inflated models, are potential additions; the design of the code allows for modular addition of new distributions. Lastly, the use of parallel computing and targeted improvements to the MCMC sampling procedure, akin to those proposed by Zanella (2020), may improve the performance of the algorithm on large data sets.
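The permutation-only storage scheme can be sketched as follows: only lists of row indices are retained, and each linked data set is reconstructed on demand by indexing the second file, mirroring what build_permutation does in R (the helper function and toy data here are hypothetical):

```python
def link_by_permutation(file1_rows, file2_rows, perm):
    """Reconstruct one linked data set from a stored index permutation:
    row i of file 1 is concatenated with row perm[i] of file 2."""
    return [row1 + file2_rows[j] for row1, j in zip(file1_rows, perm)]

file1 = [["a", 1], ["b", 2], ["c", 3]]   # records with file-1-only variables
file2 = [[10.0], [20.0], [30.0]]         # records with file-2-only variables
perms = [[0, 1, 2], [1, 0, 2]]           # only these index lists are stored

# Each stored permutation yields a full linked data set on demand,
# so the files themselves are never duplicated per sampled linkage.
linked = [link_by_permutation(file1, file2, p) for p in perms]
```

Storing m permutations of n integers costs O(mn) memory regardless of how many variables the files contain, which is what makes saving many sampled linkages feasible.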
Selection of the models that should be used in the linkage process is another area for further inquiry and improvement. The best choices of response and explanatory variables are variables that are highly correlated; however, this correlation cannot be computed across the separate files. External data sources could inform selection, but incorporating this information in the modeling process is an area of further research. In addition, including more models has the potential to improve the linkage performance, but it may also increase computational complexity. Defining a way to measure the contributions of additional variables is another area for future research.

In conclusion, we describe a computationally efficient algorithm to perform file linkage with variables that are exclusive to one file or are recorded with errors in one of the files. The algorithm generates multiple linkage structures, allowing for the propagation of linkage errors through multiple imputation. In addition, we illustrate the use of the GFS record linkage package on a real data set. This example can serve as a starting point for researchers interested in implementing a Bayesian procedure to link records across files in the absence of unique identifiers.
Acknowledgements
This research was partly supported through a Patient-Centered Outcomes Research Institute (PCORI) Award ME-1403-12104. Disclaimer: All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of PCORI, its Board of Governors, or its Methodology Committee.

We would like to thank Preston Schwartz for his early contributions to the development and testing of the GFS package.
Bibliography
Barnard J, Rubin DB (1999). "Miscellanea. Small-sample degrees of freedom with multiple imputation." Biometrika, (4), 948–955.

Belin TR, Rubin DB (1995). "A method for calibrating false-match rates in record linkage." Journal of the American Statistical Association, (430), 694–707.

Campbell KM, Deck D, Krupski A (2008). "Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a 'basic' deterministic algorithm." Health Informatics Journal, (1), 5–15.

Centers for Disease Control (CDC) (2001). "Behavioral Risk Factor Surveillance System Survey Data."

Chambers R (2009). Regression Analysis of Probability-Linked Data. Statistics New Zealand, Wellington.

Chambers R, Chipperfield J, Davis W, Kovacevic M (2009). "Inference Based on Estimating Equations and Probability-Linked Data, Centre for Statistical and Survey Methodology." Technical Report 18-09, University of Wollongong. URL https://ro.uow.edu.au/cssmwp/38.

Dalzell NM, Reiter JP (2018). "Regression Modeling and File Matching Using Possibly Erroneous Matching Variables." Journal of Computational and Graphical Statistics, (4), 728–738.

De Bruin J (2019). "Python Record Linkage Toolkit: A Toolkit for Record Linkage and Duplicate Detection in Python." doi:10.5281/zenodo.3559043. URL https://doi.org/10.5281/zenodo.3559043.

D'Orazio M (2006). Statistical Matching, Theory and Practice. Wiley, Chichester.

D'Orazio M (2015). "Integration and Imputation of Survey Data in R: The StatMatch Package." Romanian Statistical Review, (2), 57–68.

Enamorado T, Fifield B, Imai K (2019). "Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records." American Political Science Review, (2), 353–371. doi:10.1017/S0003055418000783.

Fellegi IP, Sunter AB (1969). "A Theory for Record Linkage." Journal of the American Statistical Association, (328), 1183–1210. doi:10.1080/01621459.1969.10501049.

Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2013). Bayesian Data Analysis. CRC Press.

Gomatam S, Carter R, Ariet M, Mitchell G (2002). "An empirical comparison of record linkage procedures." Statistics in Medicine, (10), 1485–1496.

Gutman R, Afendulis CC, Zaslavsky AM (2013). "A Bayesian Procedure for File Linking to Analyze End-of-Life Medical Costs." Journal of the American Statistical Association, (501), 34–47. doi:10.1080/01621459.2012.726889. PMID: 23645944. URL https://doi.org/10.1080/01621459.2012.726889.

Gutman R, Sammartino C, Green T, Montague B (2016). "Error Adjustments for File Linking Methods Using Encrypted Unique Client Identifier (eUCI) With Application to Recently Released Prisoners Who are HIV+." Statistics in Medicine, (1), 115–129. doi:10.1002/sim.6586. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.6586.

Han Y, Lahiri P (2019). "Statistical Analysis With Linked Data." International Statistical Review, S139–S157.

Hoffman MD, Gelman A (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo." Journal of Machine Learning Research, (1), 1593–1623.

Jaro MA (1995). "Probabilistic Linkage of Large Public Health Data Files." Statistics in Medicine, (57), 491–498. doi:10.1002/sim.4780140510. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780140510.

Lahiri P, Larsen MD (2005). "Regression Analysis With Linked Data." Journal of the American Statistical Association, (469), 222–230. doi:10.1198/016214504000001277. URL https://doi.org/10.1198/016214504000001277.

Larsen MD, Rubin DB (2001). "Iterative automated record linkage using mixture models." Journal of the American Statistical Association, (453), 32–41.

Little RJ (1988). "Missing-data adjustments in large surveys." Journal of Business & Economic Statistics, (3), 287–296.

Little RJ, Rubin DB (2019). Statistical Analysis With Missing Data, volume 793. John Wiley & Sons.

Meng XL (1994). "Multiple-Imputation Inferences With Uncongenial Sources of Input." Statistical Science, pp. 538–558.

Murray JS (2015). "Probabilistic Record Linkage and Deduplication After Indexing, Blocking, and Filtering." Journal of Privacy and Confidentiality, (1).

Newcombe HB (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business. Oxford University Press, Inc.

Newcombe HB, Kennedy JM (1962). "Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information." Communications of the ACM, (11), 563–566. ISSN 0001-0782. doi:10.1145/368996.369026. URL https://doi.org/10.1145/368996.369026.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Rässler S (2012). Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches. Lecture Notes in Statistics. Springer, New York. ISBN 9781461300533.

Rodgers WL (1984). "An Evaluation of Statistical Matching." Journal of Business & Economic Statistics, (1), 91–102. ISSN 0735-0015.

Rubin DB (1974). "Characterizing the Estimation of Parameters in Incomplete-Data Problems." Journal of the American Statistical Association, (346), 467–474. ISSN 0162-1459.

Rubin DB (2004). Multiple Imputation for Nonresponse in Surveys, volume 81. John Wiley & Sons.

Sadinle M (2017). "Bayesian estimation of bipartite matchings for record linkage." Journal of the American Statistical Association, (518), 600–612.

Salvatier J, Wiecki TV, Fonnesbeck C (2016). "Probabilistic programming in Python using PyMC3." PeerJ Computer Science, e55. doi:10.7717/peerj-cs.55. URL https://doi.org/10.7717/peerj-cs.55.

Shlomo N (2013). "Overview of Data Linkage Methods for Policy Design and Evaluation." In Data-Driven Policy Impact Evaluation: How Access to Microdata is Transforming Policy Design, pp. 47–65. Springer International Publishing. ISBN 978-3-319-78461-8. doi:10.1007/978-3-319-78461-8_4. URL https://doi.org/10.1007/978-3-319-78461-8_4.

Steorts RC, Hall R, Fienberg SE (2016). "A Bayesian Approach to Graphical Record Linkage and Deduplication." Journal of the American Statistical Association, (516), 1660–1672.

Van Rossum G, Drake FL (2009). Python. CreateSpace, Scotts Valley, CA. ISBN 1441412697.

Winkler WE (2002). "Methods for record linkage and Bayesian networks." Technical report, Statistical Research Division, US Census Bureau.

Wu Y (1995). "Random Shuffling: A New Approach to Matching Problem." In ASA Proceedings of the Statistical Computing Section, pp. 69–74. American Statistical Association.

Zanella G (2020). "Informed proposals for local MCMC in discrete spaces." Journal of the American Statistical Association, (530), 852–865.
7. Appendix
1. Installing GFS: With devtools, use install_github to fetch and install the package from https://github.com/edwinfarley/GFS:

> library("devtools")
> install_github("edwinfarley/GFS")
> library("GFS")

Alternatively, clone the repository yourself from the same link and then use install_local, passing the path to the local directory that contains the package.

2. Set up the Python environment: We recommend using an Anaconda distribution of Python and the create_python_environment() function to set up a Conda environment with all the required packages. This function takes a single argument, conda_envname, the name of the environment to be created. The new environment will include the following packages: numpy, pandas, pymc3, mkl, theano, and mkl-service.

3. Installing the Python components: The Python components will be installed automatically when permute_inputs is called if no Python directory is found; however, the package does include a py_setup() function in R. This function call clones the repository at https://github.com/edwinfarley/GFSPython to a directory named "Python" in the GFS package directory.

4. Ready to go: These are the required steps for preparing to run the permute_inputs function to sample permutations. Take a look at the documentation with ?permute_inputs in R to see information about the arguments and sampling parameters. Be sure to pass the name of your Conda environment to the conda_env parameter when using permute_inputs. If the Python components have not been installed manually with the provided function, they will be installed automatically when permute_inputs is run for the first time.
Affiliation: