Maximum Entropy classification for record linkage
Danhyang Lee, Li-Chun Zhang, and Jae Kwang Kim

Department of Information Systems, Statistics and Management Science, University of Alabama, Tuscaloosa, AL, U.S.A.
Department of Social Statistics and Demography, University of Southampton, Southampton, U.K.
Statistics Norway, Oslo, Norway
Department of Mathematics, University of Oslo, Oslo, Norway
Department of Statistics, Iowa State University, Ames, IA, U.S.A.

October 1, 2020
Abstract
By record linkage one joins records residing in separate files which are believed to be related to the same entity. In this paper we approach record linkage as a classification problem, and adapt the maximum entropy classification method in text mining to record linkage, both in the supervised and unsupervised settings of machine learning. The set of links will be chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is scalable and fully automatic, unlike the classical approach that generally requires clerical review to resolve the undecided cases.
Keywords:
Probabilistic linkage; Density ratio; False link; Missing match; Survey sampling

1 Introduction
Combining information from multiple sources of data is a frequently encountered problem in many disciplines. To combine information from different sources, one assumes that it is possible to identify the records associated with the same entity, which is not always straightforward in practice. If the data do not contain a unique identification number, identifying records from the same entity becomes a challenging problem.
Record linkage is the term describing the process of joining records that are believed to be related to the same entity. While record linkage may entail the linking of records within a single computer file to identify duplicate records, we focus on the linking of records across separate files.

Record linkage is a particularly important topic in survey sampling and official statistics. As pointed out by Fellegi (1997), developments in computer technology and increased data storage have facilitated the maintenance of data files and the extraction of complex information from them. The demand for detailed statistical information has been increasing, and the use of administrative data can at least partially satisfy this demand. Furthermore, combining administrative files with survey sample data can greatly improve the quality and resolution of official statistics. To satisfy these demands, the same entity across different data sources needs to be identified as accurately as possible.

The classical approach pioneered by Fellegi and Sunter (1969) is the most popular method of record linkage in practice.
It has been successful at producing large-scale industrial-strength applications, such as when post-enumeration survey and census data are linked for census coverage evaluation (Jaro, 1989; Winkler and Thibaudeau, 1991), population census data files are linked over time (Zhang and Campbell, 2012), administrative registers are linked to create a single statistical population dataset (Owen et al., 2015), or when medical records are linked to enhance data on clinical performance and patient health outcomes (Harron et al., 2016; Méray et al., 2007).

The probabilistic decision rule of Fellegi and Sunter (1969) is based on the likelihood ratio test idea, by which we can determine how likely a particular record pair is to be a true match. In applying the likelihood ratio test idea, one needs to estimate the parameters of the underlying model and determine the thresholds of the decision rule. To estimate the model parameters, Winkler (1988) and Jaro (1989) treated the matching status variable as unobservable and proposed an EM algorithm for computation. See Herzog et al. (2007) and Christen (2012) for overviews. However, as explained in Section 2, the theory has some persistent flaws. See also Tancredi and Liseo (2011) for related critiques.

To consider an alternative approach, we first note that the record linkage problem is essentially a classification problem, where each record pair is classified into either the “match” or “non-match” class. Classification is one of the main topics in machine learning and, therefore, we can potentially employ modern classification techniques for the record linkage problem. Specifically, we can view the likelihood ratio of the Fellegi-Sunter method as a special case of the density ratio and apply advanced techniques for density ratio estimation. For example, Nigam et al. (1999) use maximum entropy for text classification, and Nguyen et al. (2010) develop a more unified theory of the maximum entropy method for density ratio estimation.
There is, however, a key difference between record linkage and the standard setting of classification problems, in that the different record pairs are not distinct ‘units’, because the same record is part of many record pairs.

In this paper, we adapt the maximum entropy method for text analysis to record linkage. The classification of the set of links can be formulated either in a supervised setting or an unsupervised setting, where the latter is by far the most common in practice. We present detailed algorithms for both. Our main contributions concern the unsupervised case, where it is impossible to estimate the density ratio based on the true matches and non-matches. Frequentist methods based on joint modelling of the unobserved match status and the observed comparison scores, over all the record pairs, are difficult to materialise. To overcome this problem, we develop estimation methods tailored to the record linkage problem and the associated measures of the uncertainty of record linkage. In our proposed framework, the estimation procedure of Winkler (1988) and Jaro (1989) can be incorporated as a special case, which explains why it can give reasonable results in many situations despite its flaw. The choice of the set of links is guided by the estimated uncertainty measures. Our procedure is scalable and fully automatic, without the need for the resource-demanding clerical review that is required under the classical approach.

The paper is organised as follows. In Section 2, the basic setup and the classical approach are introduced. In Section 3, the proposed method is developed under the setting of supervised record linkage. In Section 4, we extend the proposed method to the more challenging case of unsupervised record linkage. Discussions of some related estimation approaches and technical details are presented in Section 5 and the supplementary material. Results from an extensive simulation study are presented in Section 6.
Some concluding remarks and comments on further work are given in Section 7.
2 Basic setup and the classical approach

Suppose that we have two data files A and B that are believed to have many common entities. Our goal is to find the true matches among all possible pairs of the two data files. Let the bipartite comparison space Ω = A × B = M ∪ U consist of matches M and non-matches U between the records in files A and B. For any pair of records (a, b) ∈ Ω, let γ_ab be the comparison vector between a set of key variables associated with a ∈ A and b ∈ B, respectively, such as name, sex and birthdate. The key variables and the comparison vector γ_ab are fully observed over Ω. In cases where the key variables may be affected by errors, a match (a, b) may not have complete agreement in terms of γ_ab, and a non-match (a, b) can nevertheless agree on some (even all) of the key variables.

In the classical approach of Fellegi and Sunter (1969), one recognises the probabilistic nature of γ_ab due to the perturbations that cause key-variable errors. The related methods are referred to as probabilistic record linkage. To explain the probabilistic record linkage method of Fellegi and Sunter (1969), let m_ab = f(γ_ab | (a, b) ∈ M) be the probability mass function of the discrete values γ_ab can take given (a, b) ∈ M. Similarly, we can define u_ab = f(γ_ab | (a, b) ∈ U). The ratio

    r_ab = m_ab / u_ab

is then the basis of the likelihood ratio test (LRT) for H0: (a, b) ∈ M vs. H1: (a, b) ∈ U. Let M* = {(a, b) : r_ab > c_M} be the pairs classified as matches and U* = {(a, b) : r_ab < c_U} the non-matches; the remaining pairs are classified by clerical review. The thresholds (c_M, c_U) are related to the probabilities of false links (of pairs in U) and false non-links (of pairs in M), respectively, defined as

    μ = Σ_γ u(γ) δ(M*; γ)   and   λ = Σ_γ m(γ) δ(U*; γ),   (1)

where δ(M*; γ) = 1 if γ_ab = γ means (a, b) ∈ M* and 0 otherwise, and similarly for δ(U*; γ). In reality m_ab and u_ab are unknown. Nor is the prevalence π = |M|/|Ω| := n_M/n.
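As a minimal illustration of this decision rule, the following sketch classifies a single record pair from given m- and u-probabilities and thresholds. All function and variable names here are ours, for illustration only, not from the paper.

```python
def classify_pair(m_prob: float, u_prob: float, c_M: float, c_U: float) -> str:
    """Classify one record pair by its likelihood ratio r = m/u,
    following the three-way Fellegi-Sunter rule."""
    r = m_prob / u_prob
    if r > c_M:
        return "match"
    if r < c_U:
        return "non-match"
    return "clerical review"
```

Pairs whose ratio falls between the two thresholds are exactly the undecided cases that the classical approach sends to clerical review.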
Let η contain π and the unknown parameters of m(γ) and u(γ). Let g_ab = 1 if (a, b) ∈ M and 0 if (a, b) ∈ U. Given the complete data {(g_ab, γ_ab) : (a, b) ∈ Ω}, Winkler (1988) and Jaro (1989) assume the log-likelihood to be

    h(η) = Σ_{(a,b)∈Ω} g_ab log(π m_ab) + Σ_{(a,b)∈Ω} (1 − g_ab) log((1 − π) u_ab).   (2)

An EM-algorithm follows by treating g_Ω = {g_ab : (a, b) ∈ Ω} as the missing data.

There are two fundamental problems with this classical approach.

[Problem-I] Record linkage is not a direct application of the LRT, because one needs to evaluate all the pairs in Ω instead of any given pair. The classification of Ω into M* and U* is generally incoherent, since a given record can belong to multiple pairs in M*. Post-classification deduplication of M* would then be necessary, although it is not part of the theoretical formulation above.

[Problem-II] In reality the comparison vectors of any two pairs are not independent, as long as they share a record. For example, given (a, b) ∈ M and γ_ab not subject to errors, g_ab′ must be 0 for b′ ≠ b and b′ ∈ B, as long as there are no duplicated records in either A or B, and γ_ab′ depends only on the key-variable errors of b′. Whereas, marginally, g_ab′ = 1 with probability π and γ_ab′ depends also on the key-variable errors of a. It follows that h(η) in (2) does not correspond to the true joint-data distribution of γ_Ω = {γ_ab : (a, b) ∈ Ω}, even when the marginal m- and u-probabilities are correctly specified. Similarly, although one may define marginally π = Pr[(a, b) ∈ M | (a, b) ∈ Ω] for a randomly selected record pair from Ω, it does not follow that log f(g_Ω) = n_M log π + (n − n_M) log(1 − π) jointly as in (2).
For both reasons, h(η) given by (2) cannot be the complete-data log-likelihood. In the next two sections, we develop maximum entropy classification for record linkage, after which more discussions of the classical approach will be given.

3 Maximum entropy classification: Supervised
As noted in Section 1, the record linkage problem is a classification problem. Maximum entropy classification has been used in image restoration and text analysis (Gull and Daniell, 1984; Berger et al., 1996).
Maximum entropy classification (MEC) has been proposed for supervised learning (SL) in standard classification problems, where the units are known but the true classes of the units are unknown apart from a sample of labelled units. Let Y ∈ {0, 1} be the true class and X the random vector of features. Let the density ratio be

    r(x; η) = f_1(x; η) / f_0(x; η)

where f_1 and f_0 are the conditional density functions of X given Y = 1 or 0, respectively, and η contains the unknown parameters. For MEC based on r(x), one finds η̂ that maximises the Kullback-Leibler (KL) divergence from f_0 to f_1 subject to a constraint, i.e.

    D = ∫_{S_1} f_1(x; η) log r(x; η) dx   subject to   ∫_{S_0} f_0(x; η̂) r(x; η̂) dx = 1,

where S_k is the support of X given Y = k ∈ {0, 1}, and the normalisation constraint arises since r(x; η̂) f_0(x; η̂) is an estimate of f_1(x). Provided common support S_0 = S_1, one can use the empirical distribution function (EDF) of X over {x_i : y_i = 1} in place of f_1 for D, and that over {x_i : y_i = 0} in place of f_0 for the constraint.

To apply SL-based MEC to record linkage, suppose M is observed for the given Ω, and the trained classifier is to be applied to record pairs outside of Ω. To fix the idea, suppose B is a non-probability sample that overlaps with the population P, and A is a probability sample from P with known inclusion probabilities. While γ_M = {γ_ab : (a, b) ∈ M} may be considered as an IID sample, since each (a, b) in M refers to a distinct entity, this is not the case with {γ_ab : (a, b) ∉ M}, whose joint distribution is troublesome to model.

Probability ratio (I)
Let r_f(γ) be the probability ratio given by

    r_f(γ) = m(γ) / f(γ)

where m(γ) is the probability mass function of γ_ab = γ given g_ab = 1, and f(γ) is that over γ_Ω = {γ_ab : (a, b) ∈ Ω}. The KL divergence measure from f(γ) to m(γ) and the normalisation constraint are

    D_f = Σ_{γ∈S(M)} m(γ) log r_f(γ)   and   Σ_{γ∈S(M)} f̂(γ) r̂_f(γ) = 1,

where S(M) is the support of γ_ab given g_ab = 1. This set-up allows S(M) to be a subset of S, where S is the support of all possible γ_ab. It follows that, based on the IID sample γ_M of size n_M = |M|, the objective function to be minimised for r_f can be given by

    Q_f = Σ_{(a,b)∈M} [f(γ_ab)/n_M(γ_ab)] r_f(γ_ab) − (1/n_M) Σ_{(a,b)∈M} log r_f(γ_ab),   (3)

where n_M(γ_ab) = Σ_{(i,j)∈M} I(γ_ij = γ_ab), based on the observed support S(M).

Probability ratio (II)  Provided S(M) ⊆ S(U), where S(U) is the support of γ_ab over U, one can let the probability ratio be given by

    r(γ) = m(γ) / u(γ)

where u(γ) is the probability of γ_ab = γ given g_ab = 0. We have

    r_f(γ) = m(γ)/f(γ) = m(γ)/(π m(γ) + (1 − π) u(γ)) = r(γ)/(π(r(γ) − 1) + 1)

where f(γ) = π m(γ) + (1 − π) u(γ), so that r_f(γ) and r(γ) are one-to-one. Meanwhile, the KL divergence measure from u(γ) to m(γ) is given by

    D = Σ_{γ∈S(M)} m(γ) log r(γ)

and the objective function to be minimised for r can now be given by

    Q = Σ_{(a,b)∈M} [u(γ_ab)/n_M(γ_ab)] r(γ_ab) − (1/n_M) Σ_{(a,b)∈M} log r(γ_ab).   (4)

Models of γ  Under the multinomial model, one can simply use the EDF of γ over γ_Ω as f̂(γ), for each distinct level of γ, as long as |Ω| is large compared to |S|. Similarly for m(γ) over γ_M and u(γ) over U.
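Under the multinomial model just described, estimating r_f(γ) by EDFs amounts to comparing the relative frequency of each distinct γ over the matches M with that over all of Ω. A minimal sketch, with comparison vectors represented as tuples (all names are ours, not the paper's):

```python
from collections import Counter

def edf_ratio(gamma_M, gamma_Omega):
    """EDF-based estimate of r_f(gamma) = m(gamma)/f(gamma):
    relative frequency over the matches M divided by relative
    frequency over all pairs Omega, on the observed support S(M)."""
    m_hat = Counter(gamma_M)        # counts over matched pairs
    f_hat = Counter(gamma_Omega)    # counts over all pairs
    n_M, n = len(gamma_M), len(gamma_Omega)
    return {g: (m_hat[g] / n_M) / (f_hat[g] / n) for g in m_hat}
```

The dictionary is keyed only by the levels observed in M, reflecting that S(M) may be a proper subset of S.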
For linkage outside of Ω, the estimated m(γ) from M(Ω) applies, if the selection of A from P is non-informative.

For γ made up of K binary agreement indicators, γ_k = 0, 1, k = 1, ..., K, there are up to 2^K distinct levels of γ, which can sometimes be relatively large compared to |M|. A more parsimonious model of m(γ; θ) that is commonly used is given by

    m(γ; θ) = Π_{k=1}^K θ_k^{γ_k} (1 − θ_k)^{1−γ_k}   (5)

where θ_k = Pr(γ_ab,k = 1 | g_ab = 1), and γ_ab,k is the k-th component of γ_ab. More complicated models that allow for correlated γ_k can also be considered.

Finally, it is possible to model θ_k based on the distributions of the key variables that give rise to γ, which makes use of the differential frequencies of their values, such as the fact that some names are more common than others.

Provided there are no duplicated records in either A or B, a classification set for record linkage, denoted by M̂, consists of record pairs from Ω, where any record in A or B appears in at most one record pair in M̂. Let the entropy of a classification set M̂ be given by

    D_M̂ = (1/|M̂|) Σ_{(a,b)∈M̂} log r(γ_ab).

A MEC set of given size n* = |M̂| is the first classification set that is of size n*, obtained by deduplication in the descending order of r(γ_ab) over Ω. It is possible to have (a, b′) ∉ M̂ and r(γ_ab′) > r(γ_a′b′) for (a′, b′) ∈ M̂, if there exists (a, b) ∈ M̂ with r(γ_ab) > r(γ_ab′). A MEC set of size n* is not necessarily the largest possible classification set with the maximum entropy, to be referred to as a maximal MEC set, which is the largest classification set such that r(γ_ab) = max_γ r(γ) for every (a, b) in it.
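The greedy formation of a MEC set of a given size, by deduplication in descending order of r(γ_ab), can be sketched as follows. This is an illustrative implementation with hypothetical names; tie-breaking and blocking are ignored.

```python
def mec_set(pairs, r, size):
    """Form a MEC set of the given size: visit candidate pairs in
    descending order of the estimated ratio r[(a, b)], skipping any
    pair that would reuse an already-linked record."""
    used_a, used_b, linked = set(), set(), []
    for a, b in sorted(pairs, key=lambda p: r[p], reverse=True):
        if a not in used_a and b not in used_b:
            linked.append((a, b))
            used_a.add(a)
            used_b.add(b)
            if len(linked) == size:
                break
    return linked
```

Note how a high-ratio pair can be excluded because one of its records was already claimed by an even higher-ratio pair, exactly the situation described in the text.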
In practice, a maximal MEC set is given by the first pass of deterministic linkage, which only consists of the record pairs with perfect and unique agreement of all the key variables. Probabilistic linkage methods for a MEC set are useful if one would like to allow for additional links, even though their key variables do not agree perfectly with each other.

For the uncertainty measure associated with a given MEC set M̂, we consider two types of errors. First, we define the false link rate (FLR) among the links in M̂ to be

    ψ = (1/|M̂|) Σ_{(a,b)∈M̂} (1 − g_ab),   (6)

which is different to μ by (1), where the denominator is |U|. Second, the missing match rate (MMR) of M̂, which is related to the false non-link probability λ in (1), is given by

    τ = 1 − (1/n_M) Σ_{(a,b)∈M̂} g_ab.   (7)

While μ and λ in (1) are theoretical probabilities, the FLR and MMR are actual errors.

It is instructive to consider the situation where one is asked to form MEC sets in Ω given all the necessary estimates related to the probability ratio r(γ), which can be obtained under the SL setting, without being given n_M, g_Ω or M directly.

First, the perfect MEC set should have the size n_M. Let n(γ) = Σ_{(a,b)∈Ω} I(γ_ab = γ). One can obtain n_M as the solution to the following fixed-point equation:

    n_M = Σ_{(a,b)∈Ω} ĝ(γ_ab) = Σ_{γ∈S} n(γ) ĝ(γ)   (8)

where

    ĝ(γ) := Pr(g_ab = 1 | γ_ab = γ) = π r(γ) / (π(r(γ) − 1) + 1) = n_M r(γ) / (n_M(r(γ) − 1) + n)   (9)

and the probability is defined with respect to completely random sampling of a single record pair from Ω. To see that ĝ(γ) by (9) satisfies (8), notice ĝ(γ) = n_M m(γ)/n(γ) satisfies (8) for any well defined m(γ), and n(γ)/n = π m(γ) + (1 − π) u(γ) by definition.

Next, apart from a maximal MEC set, one would need to accept discordant pairs. In
In11he SL setting, one observes the EDF of γ over M , giving rise to ˆ θ k = n M (1; k ) /n M , where n M (1; k ) is the number of agreements on the k -th key variable over M . The perfect MECset ˆ M should have these agreement rates. We have then, for k = 1 , ..., K ,ˆ θ k = 1 | ˆ M | (cid:88) ( a,b ) ∈ ˆ M I ( γ ab,k = 1) for | ˆ M | = n M . (10)Thus, no matter how one models m ( γ ), the perfect MEC set should satisfy jointly the K + 1 equations defined by (8) and (10), given the knowledge of r ( γ ). Let z be the K -dimensional vector of key variables, which may be imperfect for two reasons:it is not rich enough if the true z -values are not unique for each distinct entity underlyingthe two files to be linked, or it may be subjected to errors if the observed z is not equal toits true value. Let A contain only the distinct z -vectors from the first file, after removingany other record that has a duplicated z -vector to some record that is retained in A . Inother words, if the first file initially contains two or more records with exactly the samevalue of the combined key, then only one of them will be retained in A for record linkageto the second file. Similarly let B be the deduplicated version of the second file. Thereason for separate deduplication of keys is that no comparisons between the two files candistinguish among the duplicated z in either file, which is an issue to be resolved otherwise.Given A and B preprocessed as above, the maximal MEC set M by deterministic link-age only consists of the record pairs with the perfect agreement of all the key variables. Forprobabilistic linkage beyond M , one can follow the same scheme of MEC in the supervisedsetting, as long as one is able to obtain an estimate of the probability ratio, given which12ne can form the MEC set of any chosen size. Nevertheless, to estimate the associatedFLR (6) and MMR (7), an estimate of n M is also needed. The idea now is to apply (8) and (10) jointly. 
Setting n̂_M = |M_1| and θ̂_k ≡ 1 would satisfy these equations trivially, whereas generally n_M > |M_1| and θ_k < 1 for k = 1, ..., K. Moreover, unless there is external information that dictates otherwise, one can only assume common support S(M) = S(U) in the unsupervised setting. Let

    r(γ) = m(γ; θ) / u(γ; ξ)   (11)

where the probability of observing γ is m(γ; θ) by (5) given that a randomly selected record pair from Ω belongs to M, and u(γ; ξ) otherwise, similarly given by (5) with parameters ξ_k instead of θ_k. An iterative algorithm of unsupervised MEC is given below.

I. Set θ^(0) = (θ^(0)_1, ..., θ^(0)_K) and n^(0)_M = |M_1|, where M_1 is the maximal MEC set.

II. For the t-th iteration, where t ≥ 1:

i. obtain u(γ; ξ^(t)) by a suitable method given n^(t−1)_M and θ^(t−1), and set

    r^(t)(γ) = m(γ; θ^(t−1)) / u(γ; ξ^(t))
    ĝ^(t)(γ) = min{ n^(t−1)_M r^(t)(γ) / (n^(t−1)_M (r^(t)(γ) − 1) + n), 1 }
    n^(t)_M = Σ_γ n(γ) ĝ^(t)(γ)

ii. form the MEC set M^(t) given |M^(t)| = n^(t)_M and {r^(t)(γ_ab) : (a, b) ∈ Ω}, and update

    θ^(t)_k = (1/n^(t)_M) Σ_{(a,b)∈M^(t)} I(γ_ab,k = 1)   (12)

III. Iterate until n^(t)_M = n^(t+1)_M or ‖θ^(t) − θ^(t+1)‖ < ε, where ε is a small positive value.

Notice that, insofar as Ω = M ∪ U is highly imbalanced, where the prevalence of g_ab = 1 is very close to 0, one could simply ignore the contributions from M and use

    ξ̂_k = (1/n) Σ_{(a,b)∈Ω} I(γ_ab,k = 1)   (13)

under the model (5) of u(γ; ξ), in which case there is no updating of u(γ; ξ^(t)). Other possibilities of estimating u(γ; ξ) will be discussed in Section 5.
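One pass of the fixed-point update of n_M in step II.i, based on (8) and (9), can be sketched as follows (illustrative names; the full iteration would alternate this with the MEC-set formation and the θ update):

```python
def update_n_M(counts, r, n_M, n):
    """One update of the fixed-point equation: given the observed
    counts n(gamma) over Omega, the current ratio r(gamma) and the
    current value of n_M, return the next value of n_M.  Each
    g_hat(gamma) is capped at 1, as in the algorithm."""
    n_M_new = 0.0
    for g, n_g in counts.items():
        g_hat = min(n_M * r[g] / (n_M * (r[g] - 1.0) + n), 1.0)
        n_M_new += n_g * g_hat
    return n_M_new
```

At a solution of (8), the update returns its own input; the test below uses a two-level example with n = 100, π = 0.2, m = (0.9, 0.1) and u = (0.1, 0.9), for which n_M = 20 is the fixed point.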
The MEC for record linkage should generally be guided by the error rates, FLR and MMR, without being restricted by the estimate of n_M. Note that {ĝ_ab : (a, b) ∈ M̂} of any MEC set M̂ are among the largest ones over Ω, because MEC follows the descending order of r̂_ab, except for the necessary deduplication when there are multiple pairs involving a given record. To exercise greater control of the FLR, let ψ be the target FLR, and consider the following bisection procedure.

i. Choose a threshold value c_ψ and form the corresponding MEC set M̂(c_ψ), where r̂_ab ≥ c_ψ for any (a, b) ∈ M̂(c_ψ).

ii. Calculate the estimated FLR of the resulting MEC set M̂ as

    ψ̂ = (1/|M̂|) Σ_{(a,b)∈M̂} (1 − ĝ_ab).   (14)

If ψ̂ > ψ, then increase c_ψ; if ψ̂ < ψ, then reduce c_ψ.

Iteration between the two steps would eventually lead to a value of c_ψ that makes ψ̂ as close as possible to ψ, for the given probability ratio r̂(γ). The final MEC set M̂ can be chosen in light of the corresponding FLR estimate ψ̂. It is also possible to take into consideration the estimated MMR given by

    τ̂ = 1 − Σ_{(a,b)∈M̂} ĝ_ab / n̂_M,   (15)

where n̂_M is given by the unsupervised MEC algorithm. Note that if |M̂| = n̂_M, then we shall have ψ̂ = τ̂; but not if M̂ is guided by a given target value of FLR or MMR.

Table 1 provides an overview of MEC for record linkage in the supervised or unsupervised setting. It can be seen that one follows the same framework, but differs in the way the necessary parameters are estimated. The difference is due to the fact that in the supervised setting, one observes γ for the matched record pairs in M, so that the probability m(γ) can be estimated from them directly. One can then apply the estimated m(γ) to any two files outside the training space Ω, as long as the selection of M(Ω) and the two files to be linked is non-informative for the model of m(γ).
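The bisection steps above can be sketched as follows. This is an illustrative sketch with hypothetical names; for brevity the linked set is formed by simple thresholding, with the greedy deduplication step omitted.

```python
def flr_threshold(pairs, r, g_hat, psi_target, n_iter=50):
    """Bisection over the threshold c: form the linked set
    {(a, b) : r_ab >= c} and tune c until its estimated FLR,
    the average of (1 - g_hat_ab) over the set, is near the target."""
    lo, hi = min(r.values()), max(r.values())
    for _ in range(n_iter):
        c = (lo + hi) / 2.0
        linked = [p for p in pairs if r[p] >= c]
        if not linked:
            hi = c          # empty set: lower the threshold
            continue
        psi_hat = sum(1.0 - g_hat[p] for p in linked) / len(linked)
        if psi_hat > psi_target:
            lo = c          # too many false links: raise the threshold
        else:
            hi = c          # can afford more links: lower the threshold
    return c
```

The returned threshold settles at the boundary where adding the next-ranked pair would push the estimated FLR past the target.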
For MEC in the unsupervised setting, in contrast, one cannot separate the estimation of m(γ) and n_M.

Table 1: MEC for record linkage in the supervised or unsupervised setting

                     Supervised                      Unsupervised
Ω = M ∪ U            Observed                        Unobserved
Probability ratio    r_f(γ) generally applicable;    r(γ) generally,
                     r(γ) given S(M) ⊆ S(U)          assuming S(M) = S(U)
Model of γ           Multinomial if only discrete comparison scores,
                     directly or via key variables and perturbation errors
MEC set              Guided by FLR and MMR           Guided by FLR and MMR;
                                                     requires estimate of n_M in addition
Estimation           m(γ; θ) from γ_M in Ω;          m(γ; θ) and n_M
                     n_M by (8) outside Ω            jointly by (8) and (10)

5 Related estimation approaches

Below we discuss and compare two other approaches in the unsupervised setting, including the ways by which some of their elements can be incorporated into the MEC approach. Other less practical approaches are discussed in the supplementary material.
Recall Problems I and II of the classical approach mentioned in Section 2. From a practical point of view, Problem I can be dealt with by any deduplication method applied to the set M* of classified record pairs, where r(γ_ab) is above a chosen threshold value for all (a, b) ∈ M*. Moreover, a reasonable deduplication method can often be formulated as some kind of optimisation procedure. In forming the MEC set one deals with Problem I directly, based on the concept of maximum entropy that has relevance in many areas of scientific investigation. The implementation is simple, fast and scalable to large datasets. The estimated error rates FLR (14) and MMR (15) are directly defined for a given MEC set. In contrast, the probabilities of false links and non-links defined by (1) do not directly refer to the deduplicated set of links.

Problem II concerns the parameter estimation. As explained earlier, applying the EM algorithm based on the objective function (2) proposed by Winkler (1988) and Jaro (1989) is not a valid approach to maximum likelihood estimation (MLE). One may easily compare this algorithm to that given in Section 4.1, where both adopt the same model (5) and the same estimator of u(γ; ξ) via ξ̂_k given by (13). It is then clear that the same formula is used for updating n^(t)_M at each iteration, but a different formula is used for

    θ^(t)_k = (1/n^(t)_M) Σ_{(a,b)∈Ω} ĝ^(t)_ab γ_ab,k,   (16)

where the numerator is derived from all the pairs in Ω, whereas θ^(t)_k given by (12) uses only the pairs in the MEC set M^(t). Notice that the two differ only in the unsupervised setting, but they would become the same in the supervised setting, where one can use the observed binary g_ab instead of the estimated fractional ĝ_ab. Thus, one may incorporate the estimation procedure of Winkler (1988) and Jaro (1989) as a variation of the unsupervised MEC algorithm, where the formulae (16) and (13) are chosen specifically.
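The two updating formulae, (12) and (16), can be contrasted in a few lines. In this illustrative sketch (names are ours), `gamma` is the n × K array of agreement indicators over Ω:

```python
import numpy as np

def theta_update_mec(gamma, in_mec_set):
    """Update (12): agreement rates computed over the
    current MEC set only (a hard 0/1 membership indicator)."""
    return gamma[in_mec_set].mean(axis=0)

def theta_update_wj(gamma, g_hat):
    """Update (16): fractional weights g_hat over all pairs
    in Omega, as in the Winkler-Jaro estimation procedure."""
    return g_hat @ gamma / g_hat.sum()
```

When `g_hat` degenerates to the 0/1 indicator of the MEC set, the two updates coincide, which is the supervised-setting equivalence noted above.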
This explains why the procedure can give reasonable parameter estimates in many situations, despite its misconception as the MLE. Simulations will be used later to compare empirically the two formulae (12) and (16) for θ^(t)_k.

The second approach requires a model of the key variables, which explicates the assumptions of key-variable errors. Let z_k be the k-th key variable, which takes values 1, ..., D_k. Copas and Hilton (1990) envisage a non-informative hit-miss generation process, where the observed z_k can take the true value despite the perturbation. Copas and Hilton (1990) demonstrate that the hit-miss model is plausible in the SL setting based on labelled datasets.

We adapt the hit-miss model to the unsupervised setting as follows. First, for any (a, b) ∈ M, let α_k = Pr(e_ab,k = 1), where e_ab,k = 1 if the associated pair of key variables is subject to any form of perturbation that could potentially cause disagreement of the k-th key variable, and e_ab,k = 0 otherwise. Let

    θ_k = (1 − α_k) + α_k Σ_{d=1}^{D_k} m²_kd = 1 − α_k (1 − Σ_{d=1}^{D_k} m²_kd)

where we assume that α_k is positive for some k = 1, ..., K, and

    m_kd = Pr(z_ik = d | g_ab = 1, e_ab,k = 1) = Pr(z_ik = d | g_ab = 1, e_ab,k = 0)

for i = a or b. Next, for any record i in either A or B, let δ_i = 1 if it has a match in the other file and δ_i = 0 otherwise. Given δ_i = 0, with or without perturbation, let

    Pr(z_ik = d | δ_i = 0) = u_kd.

We have β_kd := m_kd ≡ u_kd if δ_i is non-informative. A slightly more relaxed assumption is that δ_i is only non-informative in one of the two files. To be more resilient against its potential failure, one can assume m_kd to hold for all the records in the smaller file, and allow u_kd to differ for the records with δ_i = 0 in the larger file. Suppose n_A < n_B. Let p = Pr(δ_b = 1) = E(n_M)/n_B = n_A π be the probability that a record in B has a match in A. One may assume z_A = {z_a : a ∈ A}
to be independent over A, giving

    ℓ_A = Σ_{a∈A} Σ_{k=1}^K log m_ak,

where m_ak = Σ_{d=1}^{D_k} m_kd I(z_ak = d). The complete-data log-likelihood based on (δ_B, z_B) is

    ℓ_B = Σ_{b∈B} δ_b log( p Π_{k=1}^K m_bk ) + Σ_{b∈B} (1 − δ_b) log( (1 − p) Π_{k=1}^K u_bk ),   (17)

where m_bk = Σ_{d=1}^{D_k} m_kd I(z_bk = d) and u_bk = Σ_{d=1}^{D_k} u_kd I(z_bk = d), based on an assumption of independent (δ_b, z_b) across the entities in B.

Under separate modelling of z_A and (z_B, δ_B), let m̂_kd be the MLE based on ℓ_A, given which an EM-algorithm for estimating p and u_kd follows from (17) by treating δ_B as the missing data. However, the estimation is feasible only if {u_kd} and {m_kd} are not exactly the same; whereas the MLE of n_M has a large variance when {m_kd} and {u_kd} are close to each other, even if they are not exactly equal.

Meanwhile, the closeness between {m_kd} and {u_kd} does not affect the MEC approach, where n̂_M is obtained from solving (8) given r̂(γ) = m̂(γ)/û(γ), and û(γ) is indeed most reliably estimated when {m_kd} = {u_kd}. Moreover, one can incorporate a profile EM-algorithm, based on (17) given n^(t)_M, to update u(γ; ξ^(t)) in the unsupervised MEC algorithm of Section 4.1. At the t-th iteration, where t ≥
1, given p^(t) = n^(t)_M / max(n_A, n_B) and m̂_kd estimated from the smaller file A, obtain u^(t)_kd and thereby

    ξ^(t)_k = ( (1 − p^(t)) Σ_{d=1}^{D_k} u^(t)_kd m̂_kd + p^(t) (1 − 1/n_A) Σ_{d=1}^{D_k} m̂²_kd ) / (1 − p^(t)/n_A).   (18)

6 Simulation
To explore the practical feasibility of the unsupervised MEC algorithm for record linkage, we conduct a simulation study based on the data sets listed in Table 2, which are disseminated by ESSnet-DI (McLeod et al., 2011) and freely available online. Each record in a data set has associated synthetic key variables, which may be distorted by missing values and typos when they are created, in ways that imitate real-life errors (McLeod et al., 2011).

Table 2: Data set description (size in parentheses)

Data set   Description
Census     A fictional data set to represent some Census observations
PRD        Patient register data
CIS        Customer information system data

The key variables include Soundex codes of names (e.g. Hilton ≡ H435). The twelve key variables for record linkage are presented in Table 3.

Table 3: Twelve key variables available in the three data sets.

Variable      Description                                 No. of categories
PERNAME1  1   First letter of forename                    26
          2   First digit of Soundex code of forename     7
          3   Second digit of Soundex code of forename    7
          4   Third digit of Soundex code of forename     7
PERNAME2  1   First letter of surname                     26
          2   First digit of Soundex code of surname      7
          3   Second digit of Soundex code of surname     7
          4   Third digit of Soundex code of surname      7
SEX           Male / Female                               2
DOB  DAY      Day of birth                                31
     MON      Month of birth                              12
     YEAR     Year of birth (1910 ∼ …)

We set up two scenarios to generate the linkage files. We use the unique identification variable (PERSON-ID) for sampling, which is available in all three data sets. We sample n_A = 500 and n_B = 1000 individuals from PRD and CIS, respectively. Let p_A be the proportion of records in the smaller file (PRD) that are also selected in the larger file, between A and B. We use p_A = 0.…, 0.….

Scenario-I (Non-informative)

• Sample n = n_B/p_A individuals randomly from Census ∩ PRD ∩ CIS.
• Sample n_A randomly from these n as the individuals of PRD, denoted by A.
• Sample n_B randomly from these n as the individuals of CIS, denoted by B.

Under this scenario both δ_a and δ_b are non-informative for the key-variable distribution. For any given p_A, we have E(n_M) = n_A p_A and π = E(n_M)/|Ω|, where |Ω| = n_A n_B and n_M is the random number of matched individuals between the simulated files A and B.

Scenario-II (Informative)

• Sample n_A individuals randomly from Census ∩ PRD ∩ CIS as the individuals of PRD, denoted by A.

• Sample n_M = n_A p_A randomly from A as the matched individuals, denoted by AB.

• Sample n_B − n_M randomly from the records of CIS \ A having SEX = F and YEAR ≤ …, and let B consist of AB together with these records, as the sampled individuals of CIS.

Under this scenario the key-variable distribution is the same in A whether or not δ_a = 1, but it differs between the records b ∈ B with δ_b = 1 and those with δ_b = 0. Hence, Scenario-II is informative. For any given p_A, we have fixed n_M = n_A p_A and π = p_A / n_B.

For the unsupervised MEC algorithm given in Section 4.1, one can adopt (12) or (16) for updating θ_k^(t). Moreover, one can use (13) for ˆξ_k directly, or (18) for updating ξ_k^(t) iteratively. In particular, choosing (16) and (13) effectively incorporates the procedure of Winkler (1988) and Jaro (1989) for parameter estimation. Note that the MEC approach still differs from that of Jaro (1989) with respect to the formation of the linked set ˆM.

Table 4 compares the performance of the unsupervised MEC algorithm using different formulae for θ_k^(t) and ξ_k^(t), where the size of ˆM is set equal to the corresponding estimate ˆn_M. In addition, we include ˆθ_k = n_M(1; k)/n_M estimated directly from the matched pairs in M, as if we were in the supervised setting, together with (13) for ˆξ_k. The true parameters and error rates are given in addition to their estimates.

As can be expected, the best results are obtained when the parameters are estimated as in the supervised setting.
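The Soundex-based key variables of Table 3 can be derived with the standard American Soundex algorithm. The sketch below is illustrative only: the function name is ours, and the exact coding used for the ESSnet-DI files may differ in detail.

```python
def soundex(name):
    """American Soundex code (first letter + three digits), e.g. Hilton -> H435.

    Illustrative sketch of how the Soundex-based key variables in Table 3
    could be built; the exact coding of the ESSnet-DI files may differ.
    """
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    out = [name[0]]                  # the code always keeps the first letter
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        d = codes.get(ch, "")        # vowels and H, W, Y have no digit
        if d and d != prev:
            out.append(d)
        if ch not in "HW":           # H and W do not separate equal codes
            prev = d
    return ("".join(out) + "000")[:4]
```

For instance, `soundex("Hilton")` returns `"H435"`, matching the example above; the first letter and the three digits then supply the four categorical key variables per name field in Table 3.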
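The alternating parameter updates of the unsupervised algorithm can be sketched in code. The following is a minimal, generic Fellegi-Sunter-type EM over binary comparison vectors, under stated assumptions: the function name, initial values and update rules are illustrative stand-ins rather than the paper's formulae (12)-(18), but the match/non-match mixture structure is analogous.

```python
import numpy as np

def em_mixture(gamma, n_iter=500, tol=1e-10):
    """Two-class EM over binary comparison vectors gamma (n_pairs x K).

    p  : prevalence of matched pairs among all pairs (analogue of pi);
    m_k: P(agreement on key k | match); u_k: P(agreement on key k | non-match);
    g  : posterior match probability of each pair (analogue of g_ab).
    """
    n, K = gamma.shape
    p = 0.01                                      # initial prevalence
    m = np.full(K, 0.9)                           # initial m-probabilities
    u = np.clip(gamma.mean(axis=0), 1e-6, 1 - 1e-6)
    for _ in range(n_iter):
        # E-step: posterior match probability for every pair
        log_m = gamma @ np.log(m) + (1 - gamma) @ np.log(1 - m)
        log_u = gamma @ np.log(u) + (1 - gamma) @ np.log(1 - u)
        num = p * np.exp(log_m)
        g = num / (num + (1 - p) * np.exp(log_u))
        # M-step: weighted agreement rates within each class
        p_new = g.mean()
        m = np.clip((g[:, None] * gamma).sum(0) / g.sum(), 1e-6, 1 - 1e-6)
        u = np.clip(((1 - g)[:, None] * gamma).sum(0) / (1 - g).sum(),
                    1e-6, 1 - 1e-6)
        if abs(p_new - p) < tol:
            p = p_new
            break
        p = p_new
    return p, m, u, g
```

With reasonably separated m- and u-probabilities, the posterior g orders the record pairs for the formation of the linked set; in the paper's setting the updates would instead follow (12)/(16) for θ_k^(t) and (13)/(18) for ξ_k^(t).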
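For the formation of the linked set ˆM itself, one natural construction adds pairs in descending order of the posterior ˆg_ab until an estimated false-link rate exceeds a target ψ. The sketch below is an assumption-laden illustration: the FLR estimate, the average posterior non-match probability of the selected pairs, stands in for the paper's formula (14), and the one-to-one constraint between the two files is ignored.

```python
import numpy as np

def linked_set_by_flr(g, psi):
    """Largest prefix of pairs, sorted by descending posterior match
    probability g, whose estimated false-link rate stays at or below psi.

    Since 1 - g is non-decreasing along the sorted order, its running
    average is non-decreasing too, so the admissible prefixes are nested
    and the largest one is well defined.
    """
    order = np.argsort(-g)                       # pairs by descending g
    est_flr = np.cumsum(1 - g[order]) / np.arange(1, len(g) + 1)
    ok = np.nonzero(est_flr <= psi)[0]
    if ok.size == 0:                             # no admissible linked set
        return np.array([], dtype=int), None
    k = ok[-1] + 1                               # largest admissible prefix
    return order[:k], float(est_flr[k - 1])
```

For example, with posteriors (0.99, 0.98, 0.90, 0.60, 0.20) and target ψ = 0.05, the first three pairs are kept, with estimated FLR (0.01 + 0.02 + 0.10)/3 ≈ 0.043; raising ψ enlarges the set.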
The estimator of π based on ˆθ_k and ˆξ_k by (13) is essentially unbiased under both Scenario-I and II, although ˆξ_k is not exactly unbiased.

Table 4: Parameters and averages of their estimates, averages of error rates and their estimates, over 200 simulations. Median of the estimate of n_M given as ˜n_M.

Scenario I
π      E(n_M)  θ_k^(t)  ξ_k^(t)  ˆπ      ˆn_M   ˜n_M   FLR    MMR    ˆFLR   ˆMMR
.0008  400     ˆθ_k     (13)     .00080  400.0  397    .0264  .0266  .0357  .0357
               (12)     (18)     .00082  407.9  405    .0425  .0257  .0509  .0509
               (12)     (13)     .00083  414.7  407    .0549  .0244  .0620  .0620
               (16)     (13)     .00081  406.0  405    .0399  .0269  .0503  .0503
.0005  250     ˆθ_k     (13)     .00050  251.6  249    .0340  .0301  .0370  .0370
               (12)     (18)     .00052  258.3  255    .0559  .0296  .0533  .0533
               (12)     (13)     .00053  266.9  256.5  .0742  .0277  .0680  .0680
               (16)     (13)     .00052  261.7  259    .0676  .0305  .0636  .0636
.0003  150     ˆθ_k     (13)     .00030  152.3  151    .0439  .0356  .0381  .0381
               (12)     (18)     .00033  165.9  156.5  .0873  .0244  .0620  .0620
               (12)     (13)     .00041  205.4  161    .1632  .0308  .1251  .1251
               (16)     (13)     .00054  271.4  169    .3015  .0785  .1639  .1639

Scenario II
π      n_M    θ_k^(t)  ξ_k^(t)  ˆπ      ˆn_M   ˜n_M   FLR    MMR    ˆFLR   ˆMMR
.0008  400    ˆθ_k     (13)     .00080  398.3  400    .0230  .0273  .0326  .0326
              (12)     (18)     .00080  401.4  401    .0305  .0277  .0403  .0403
              (12)     (13)     .00081  405.2  404    .0379  .0262  .0467  .0467
              (16)     (13)     .00080  401.4  401    .0316  .0286  .0438  .0438
.0005  250    ˆθ_k     (13)     .00050  249.6  250    .0284  .0302  .0334  .0334
              (12)     (18)     .00050  251.8  251    .0383  .0320  .0410  .0410
              (12)     (13)     .00052  257.7  253    .0513  .0295  .0516  .0516
              (16)     (13)     .00051  255.4  253.5  .0510  .0336  .0520  .0520
.0003  150    ˆθ_k     (13)     .00030  150.5  150    .0382  .0355  .0350  .0350
              (12)     (18)     .00031  153.0  153    .0559  .0377  .0452  .0452
              (12)     (13)     .00032  158.5  155    .0708  .0342  .0558  .0558
              (16)     (13)     .00038  189.3  156    .1414  .0524  .0903  .0903

Nevertheless, the approximate estimator ˆξ_k can be improved, since the profile-EM estimator given by (18) is seen to perform better across all the set-ups, where both are combined with (12) for θ_k^(t). In particular, as explained before, non-informative key errors cause problems for the MLE of n_M based on (17), but not for the MEC algorithm. The results under the non-informative Scenario-I provide evidence in this regard, where the profile-EM estimator of ξ_k, given the MEC-estimate ˆn_M, is better than (13). The extra computation required by (18) may thus be worthwhile in many situations.

Regarding the two formulae for θ_k^(t), (12) and (16), and the resulting n_M-estimators and error rates FLR and MMR, we notice the following.

• Scenario-I: When the matched set M is relatively large, at p_A = 0.8, the two formulae yield similar estimates of n_M, and the difference is just a couple of false links in terms of the linkage errors. Figure 1 shows that (12) results in a few larger errors of ˆn_M than (16) over the 200 simulations, when p_A = 0.8 (π = 0.0008). As M decreases, the averages and medians of the estimators of n_M resulting from (12) and (18) are closer to the true values than those of the other estimators. Especially when the matched set M is relatively small, where π = 0.0003, the formulae (12) and (18) give the best estimation of n_M in every respect.
While this is partly due to the use of (13) instead of (18), most of the difference is down to the choice of θ_k^(t), which can be seen from intermediary comparisons to the results based on (12) and (13).

• Scenario-II: The use of (12) and (18) for the unsupervised MEC algorithm performs better than using the other formulae, in terms of both estimation of n_M and error rates, across the three sizes of the matched set (Figure 2). Relatively greater improvement is achieved by using (12) and (18) for the smaller matched sets.

Figure 1: Box plots of ˆn_M − n_M based on 200 Monte Carlo samples under Scenario I, for π = 0.0003, 0.0005, 0.0008 and the formula pairs (12) & (13), (12) & (18) and (16) & (13).

Figure 2: Box plots of ˆn_M − n_M based on 200 Monte Carlo samples under Scenario II, for the same settings.

The results suggest that the unsupervised MEC algorithm tends to be more affected by the size of the matched set under Scenario-I than under Scenario-II. Choosing (12) and (18), however, seems to yield the most robust estimation of n_M and error rates against a small matched set M, regardless of the informativeness of the key-variable errors. Thus, the formulae (12) and (18) may be preferable to the others, unless one is quite certain that the number of matched entities n_M is relatively large compared to min(n_A, n_B). The reason must be that the numerator of θ_k^(t) in (16) is calculated over all the pairs in Ω instead of the MEC set M^(t), which is more sensitive when the imbalance between M and U is aggravated while the sizes of A and B remain fixed.

Aiming the MEC set ˆM at the estimated size ˆn_M is generally not a reasonable approach to record linkage. Record linkage should be guided directly by the associated uncertainty, i.e. the error rates FLR and MMR, based on their estimates (14) and (15), as described in Section 4.2. Note that this does require the estimation of n_M in addition to r(γ).

We have ˆFLR = ˆMMR in Table 4, because |ˆM| = ˆn_M there. It can be seen that these follow the true FLR more closely than the MMR, especially when ˆn_M is estimated using the formulae (12) and (18). This is hardly surprising. Take, for example, the maximal MEC set M that consists of the pairs whose key variables agree completely and uniquely. Provided reasonably rich key variables, as in the setting here, one can expect the FLR of M to be low, such that even a naïve estimate ˆFLR = 0 probably does not matter much. Meanwhile, the true MMR has a much wider range from one application to another, because the difference between n_M and |M| is determined by the extent of key-variable errors, such that the estimate of MMR depends more critically on that of n_M. The situation is similar for any MEC set beyond M, as long as ˆg_ab remains very high for any (a, b) ∈ ˆM.

Table 5: Parameters and averages of their estimates, averages of error rates and their estimates, over 200 simulations, n = |Ω| = n_A n_B.

Scenario I
π      E(n_M)  Target FLR (ψ)  ˆn_M   |ˆM|/n   |ˆM|   FLR    MMR    ˆFLR   ˆMMR
.0008  400     0.05            407.9  .00080   401.9  .0313  .0280  .0393  .0527
               0.03                   .00079   395.0  .0196  .0328  .0271  .0568
.0005  250     0.05            258.3  .00050   251.9  .0396  .0326  .0385  .0576
               0.03                   .00049   246.7  .0246  .0374  .0264  .0650
.0003  150     0.05            165.9  .00031   153.4  .0533  .0403  .0389  .0783
               0.03                   .00030   149.3  .0355  .0483  .0256  .0905

Scenario II
π      n_M    Target FLR (ψ)  ˆn_M   |ˆM|/n   |ˆM|   FLR    MMR    ˆFLR   ˆMMR
.0008  400    0.05            401.4  .00080   397.8  .0239  .0294  .0337  .0418
              0.03                   .00079   393.1  .0164  .0334  .0256  .0451
.0005  250    0.05            251.8  .00050   248.6  .0305  .0361  .0328  .0447
              0.03                   .00049   245.2  .0226  .0416  .0245  .0497
.0003  150    0.05            153.0  .00030   150.1  .0445  .0443  .0333  .0514
              0.03                   .00029   147.4  .0322  .0489  .0238  .0588

Table 5 shows the performance of the MEC set using the bisection procedure described in Section 4.2, across the same set-ups as in Table 4. We use only (12) for θ_k^(t) and (18) for ξ_k^(t) to obtain the corresponding ˆn_M. We let the target FLR be ψ = 0.
05 or 0.03; the latter is clearly lower than the true FLR of the ˆM of size ˆn_M (Table 4), especially when the prevalence is relatively low (at π = 0.0003). The bisection procedure yields an MEC set ˆM whose size |ˆM| is close to the true n_M across all the set-ups. Indeed, under Scenario-I, the mean of |ˆM| is closer to n_M than the mean (or median) of ˆn_M over all the simulations, which results directly from parameter estimation, especially when the matched set is relatively small (at π = 0.0003), where the estimation of n_M is most sensitive. In other words, the fact that |ˆM| differs from the estimate ˆn_M is not necessarily a cause of concern for the MEC algorithm guided by targeting the FLR.

Targeting a smaller FLR, i.e. 0.03 instead of 0.05 here, reduces the size |ˆM|, because the MEC set is formed in the descending order of ˆg_ab. The corresponding true FLR varies accordingly with the target value. The estimator ˆFLR performs reasonably. For instance, for E(n_M) = 400 and target FLR ψ = 0.03 under Scenario-I, the estimated number of false links in ˆM, according to ˆFLR, is on average about 2 fewer than implied by the target FLR, whereas the true number of false links in ˆM is about 2 fewer than the estimate ˆFLR implies.

To estimate the MMR by (15), one can either use |ˆM| as the estimate of n_M, or use ˆn_M from parameter estimation based on (12) and (18). In the former case, one would obtain ˆMMR = ˆFLR. While this ˆMMR is not unreasonable in absolute terms, since |ˆM| is close to n_M here, as can be seen from comparing the mean of ˆFLR with that of the true MMR in Table 5, it has a drawback a priori, in that it decreases as the target FLR decreases, although one is likely to miss out on more true matches when more links are excluded from the MEC set ˆM. Using ˆn_M from parameter estimation directly makes sense in this respect, since the true n_M must remain the same regardless of the target FLR. However, the estimator ˆMMR could then become less reliable given relatively low prevalence π, where ˆn_M can be sensitive.

For example, for E(n_M) = 400 and target FLR ψ = 0.03 under Scenario-I, the number of correct links in ˆM is on average about 386; its estimate using the ˆMMR derived from ˆn_M is on average 385 ≈ 407.9 × (1 − 0.0568), where ˆMMR is given in Table 5, and that using ˆMMR = ˆFLR is on average 384 ≈ 395.0 × (1 − 0.0271). Similarly, for E(n_M) = 150 and target FLR ψ = 0.03 under Scenario-I, where ˆn_M has a noticeable upward bias, the number of correct links in ˆM is on average 143, its estimate using the ˆMMR derived from ˆn_M is 151, and that using ˆMMR = ˆFLR is 146. The latter is closer, while the former seems implausible in light of the actual |ˆM|.

In short, the estimation of FLR tends to be more reliable than that of MMR, especially if the prevalence π is relatively low in its theoretical range 0 < π ≤ min(n_A, n_B)/n. The following recommendations for unsupervised record linkage seem warranted.

• When forming the MEC set ˆM according to the uncertainty of linkage, it is more robust to rely on the FLR, estimated by (14).

• The estimate of MMR given by (15), derived from the parameter estimate ˆn_M based on (12) and (18), provides an additional uncertainty measure. However, one should be aware that this measure can be sensitive when the prevalence π is relatively low.

• Between two target values of the FLR, ψ < ψ′, more attention can be given to the estimate of the additional missing matches in ˆM(ψ) compared to ˆM(ψ′), given by

Σ_{(a,b) ∈ ˆM(ψ′)} ˆg_ab − Σ_{(a,b) ∈ ˆM(ψ)} ˆg_ab = Σ_{(a,b) ∈ ˆM(ψ′) \ ˆM(ψ)} ˆg_ab.

In the above we have developed the maximum entropy classification approach to record linkage. The proposed approach provides a unified probabilistic record linkage framework in both the supervised and unsupervised settings, where a coherent classification set of links is chosen with respect to the associated uncertainty measure. Our theoretical formulation overcomes some persistent flaws of the classical approach. Furthermore, the proposed MEC algorithm is scalable and fully automatic, unlike the classical approach that generally requires clerical review to resolve the undecided cases.
Therefore, the proposed record linkage methods are widely applicable, even in big-data settings.

An important issue worth further research concerns the estimation of the relevant parameters of the model of key-variable errors that cause problems for record linkage. First, as pointed out earlier, treating record linkage as a classification problem allows one to explore many modern machine learning techniques. A key challenge in this respect is the fact that the different record pairs are not distinct 'units', such that any powerful supervised learning technique needs to be adapted to the unsupervised setting, where it is impossible to estimate the relevant parameters based on the true matches and non-matches, including the number of matched entities. Next, the model of the key-variable errors or the comparison scores can be refined. Together, these developments may further improve the parameter estimation, which will benefit both the classification of the set of links and the assessment of the associated uncertainty.

Another issue that is interesting to explore in practice is the various possible forms of informative key-variable errors, insofar as the model pertaining to the matched entities in one way or another differs from that of the unmatched entities. Suitable variations of the MEC approach may need to be configured for different situations.
SUPPLEMENTARY MATERIAL
In the supplementary material, we present some special cases of MEC sets for record linkage and discuss two less practical approaches in the unsupervised setting.

References
Berger, A. L., S. A. Della Pietra, and V. J. Della Pietra (1996). A maximum entropy approach to natural language processing. Computational Linguistics 22, 39–71.

Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering 24(9), 1537–1555.

Copas, J. and F. Hilton (1990). Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society: Series A (Statistics in Society) 153(3), 287–312.

Fellegi, I. P. (1997). Record linkage and public policy - a dynamic evolution. In Record Linkage Techniques. Federal Committee on Statistical Methodology, Office of Management and Budget.

Fellegi, I. P. and A. B. Sunter (1969). A theory for record linkage. Journal of the American Statistical Association 64(328), 1183–1210.

Gull, S. F. and G. J. Daniell (1984). Maximum entropy method in image processing. IEE Proceedings 131F, 646–659.

Harron, K., R. Gilbert, D. Cromwell, and J. van der Meulen (2016). Linking data for mothers and babies in de-identified electronic health data. PLoS ONE 11(10), e0164667.

Herzog, T. N., F. J. Scheuren, and W. E. Winkler (2007). Data Quality and Record Linkage Techniques. Springer Science & Business Media.

Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84(406), 414–420.

McLeod, P., D. Heasman, and I. Forbes (2011). Simulated data for the on the job training.

Méray, N., J. B. Reitsma, A. C. Ravelli, and G. J. Bonsel (2007). Probabilistic record linkage is a valid and transparent tool to combine databases without a patient identification number. Journal of Clinical Epidemiology 60(9), 883.e1.

Nguyen, X., M. J. Wainwright, and M. I. Jordan (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory 56(11), 5847–5861.

Nigam, K., J. Lafferty, and A. McCallum (1999). Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, Volume 1, pp. 61–67. Stockholm, Sweden.

Owen, A., P. Jones, and M. Ralphs (2015). Large-scale linkage for total populations in official statistics. Methodological Developments in Data Linkage, 170–200.

Tancredi, A. and B. Liseo (2011). A hierarchical Bayesian approach to record linkage and population size problems. The Annals of Applied Statistics 5(2B), 1553–1585.

Winkler, W. E. (1988). Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research Methods, pp. 667–671. American Statistical Association.

Winkler, W. E. and Y. Thibaudeau (1991). An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. Decennial Census. Statistical Research Division Technical Report, U.S. Bureau of the Census, 91–99.

Zhang, G. and P. Campbell (2012). Data survey: developing the statistical longitudinal census dataset and identifying its potential uses.