Invited Discussion of "A Unified Framework for De-Duplication and Population Size Estimation"
Jared S. Murray∗
Department of Information, Risk, and Operations Management and Department of Statistical Science, University of Texas at Austin. e-mail: [email protected]
I would like to congratulate the authors on a stimulating contribution to the literature on record linkage/de-duplication and population size estimation. Tancredi and Liseo (2011) was one of the papers that first piqued my interest in record linkage, so I am pleased to see more work along these lines (with an author population size of N + 1!). My discussion below focuses on two main themes: providing a more nuanced picture of the costs and benefits of joint models for record linkage and the “downstream task” (i.e., whatever we might want to do with the linked and de-duplicated files), and how we should measure performance.
1. The promise and peril of joint modeling: A partial defense of disunity
The promise of a joint model for record linkage, de-duplication, and population size estimation is likely obvious to the readership of Bayesian Analysis: we immediately obtain valid posterior inference over the population size that accounts for uncertainty about duplicates and links across files, provided that we specify an adequate joint model. Which leads us predictably to the peril of joint modeling: the fact that specifying a model for any of these three tasks alone is nontrivial. Addressing them simultaneously in a single model requires specifying a joint model sufficiently rich to do well on all three tasks (linkage, de-duplication, and population size estimation) while being tractable enough to understand its properties and perform posterior inference.

The model presented here necessarily makes some compromises in service of joint modeling, and I wonder about their impact. For example, assumptions about the sampling process generating the lists are essential to modeling the unknown population size and therefore must appear in any unified model. This will consequently restrict the prior distribution over the overlap between files in the record linkage/de-duplication portion of the model, despite the fact that the assumption of simple random sampling from the population – or any sort of random sampling at all – is otherwise irrelevant to record linkage and de-duplication.

∗ The author gratefully acknowledges support from the National Science Foundation under grant number SES-1824555. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.
The assumptions made by the authors imply a very particular, informative prior distribution on Z, the partition of records into co-referent sets, and therefore on K, the number of distinct units captured across all lists (as reported in Table 1). This choice is consequential. Indeed, immediately prior to Section 3.1 the authors note that the induced prior distribution on K is probably not well-suited to record linkage tasks in general, which makes me wonder why we should expect it to work well when doing record linkage and population size estimation simultaneously. I have to assume that either 1) we actually don't expect it to work particularly well, but the joint model at hand demands it, or 2) the assumptions about the sampling process are actually warranted here, at least approximately, while they may not be in general applications of record linkage. If the former, this seems to beg the question and ignore options beyond joint modeling. If the latter, things are more interesting.

If the assumptions are in fact correct, we would expect to obtain more accurate and efficient inferences by inducing the “true” prior over Z and K using the joint model. But what happens when the sampling assumptions are violated? It is difficult to say, and it must depend on a host of factors (such as the degree and frequency of errors among co-referent records). However, it is not hard to imagine a case where relatively minor deviations from the sampling assumptions are more or less innocuous in the context of a population size model with known partition Z but become influential when Z is unknown and jointly modeled, due to the influence of the “misspecified” informative prior over Z. It would be interesting to try and draw this out via a simulation exercise, particularly in light of how influential Steorts et al. (2016) found a similar prior to be in a pure record linkage/de-duplication context.

If posterior inference is not robust to deviations from the sampling assumptions, what could we do instead?
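A simulation exercise of the kind suggested above could start very simply, by first quantifying what violated sampling assumptions do when the partition Z is known and error-free. The sketch below is my own toy illustration (not the authors' model): it applies a Lincoln-Petersen-type estimator to two lists drawn from a known population, once under equal capture probabilities and once under heterogeneous capture probabilities shared across lists, which induces dependence between the lists and biases the estimate down.

```python
import random

random.seed(1)

N = 1000   # true population size (known only to the simulation)
reps = 200

def lincoln_petersen(p_low, p_high):
    """Average Lincoln-Petersen estimate n1*n2/m over `reps` simulated
    two-list captures, where each unit's capture probability (shared
    across both lists) is drawn uniformly from [p_low, p_high]."""
    ests = []
    for _ in range(reps):
        p = [random.uniform(p_low, p_high) for _ in range(N)]
        list1 = [random.random() < p[i] for i in range(N)]
        list2 = [random.random() < p[i] for i in range(N)]
        n1, n2 = sum(list1), sum(list2)
        m = sum(a and b for a, b in zip(list1, list2))  # overlap count
        if m > 0:
            ests.append(n1 * n2 / m)
    return sum(ests) / len(ests)

# Equal capture probabilities: the sampling assumption holds.
print(round(lincoln_petersen(0.3, 0.3)))    # near the true N of 1000
# Heterogeneous capture probabilities: the assumption is violated.
print(round(lincoln_petersen(0.05, 0.55)))  # noticeably below 1000
```

Layering an unknown, jointly modeled Z on top of this baseline would then isolate how much additional damage the misspecified informative prior over Z does beyond the known-partition bias.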
The desire to mitigate this undesirable “feedback” from a misspecified sub-model appears in many different settings, from Bayesian causal inference with propensity score models (McCandless et al., 2010; Zigler et al., 2013) to astrophysics (Yu et al., 2018) and beyond (see Jacob et al. (2017) for additional examples). This is a difficult problem and an active area of research. The proposed solutions often take the form of (possibly incoherent) multistage inference, in this case inferring the linkage structure in stage 1 and the population size in stage 2, propagating uncertainty from stage 1 to stage 2 without allowing any information from stage 2 to flow to stage 1. Jacob et al. (2017) give examples of settings where these “posteriors” are better than the posterior under a misspecified joint model in a decision-theoretic sense.

In the context of de-duplication and population size modeling, Sadinle (2018) proposes a related two-stage alternative to joint modeling termed “linkage averaging”. If (in the notation of the current paper) h(λ) is the estimate of population size we would compute given complete data (i.e., a de-duplicated and linked set of files), then under certain conditions the posterior for h(λ) under a record linkage/de-duplication model alone will give the same inferences as a proper Bayesian joint model for linkage, de-duplication, and population size estimation. With a single set of posterior samples one can perform inference over multiple models for the population size, again provided that they all satisfy some relatively mild conditions.

These conditions do necessarily demand a degree of compatibility between the prior on λ and the population size model.
They bear a striking similarity to the conditions under which multiple imputation delivers (asymptotically) valid Bayesian inference (“congeniality”; Meng, 1994; Xie and Meng, 2017; Murray, 2018). This raises the interesting question of whether the compatibility conditions might be relaxed while still yielding conservative inferences, similar to the way one can obtain conservative inferences using imputations under an uncongenial imputation model, provided it is uncongenial in the “right” way (roughly, by making fewer assumptions during imputation than analysis).
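To make the two-stage idea concrete, here is a minimal sketch of linkage averaging in the style of Sadinle (2018). Everything below is illustrative: the toy two-file data, the hand-written "posterior samples" of the partition (which in practice would come from a linkage model's MCMC output), and the choice of a simple Lincoln-Petersen statistic as h(λ).

```python
def h(partition, file_id):
    """Lincoln-Petersen population size estimate given a partition of
    records into co-referent clusters. `partition` maps record index ->
    cluster id; `file_id` maps record index -> source file (0 or 1)."""
    clusters = {}
    for rec, c in partition.items():
        clusters.setdefault(c, set()).add(file_id[rec])
    n1 = sum(1 for files in clusters.values() if 0 in files)  # units seen in file 0
    n2 = sum(1 for files in clusters.values() if 1 in files)  # units seen in file 1
    m = sum(1 for files in clusters.values() if files == {0, 1})  # seen in both
    return n1 * n2 / m if m else float("inf")

# Toy data: records 0-2 come from file 0, records 3-5 from file 1.
file_id = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Stand-ins for posterior samples of the partition from stage 1
# (a record linkage/de-duplication model fit on its own).
posterior_partitions = [
    {0: "a", 1: "b", 2: "c", 3: "a", 4: "d", 5: "e"},  # one cross-file link
    {0: "a", 1: "b", 2: "c", 3: "a", 4: "b", 5: "e"},  # two cross-file links
]

# Stage 2: push each sampled partition through h, with no feedback
# into stage 1, yielding a posterior over the population size estimate.
draws = [h(p, file_id) for p in posterior_partitions]
print(draws)
```

Note that swapping in a different population size model only means replacing h, which is what makes a single set of partition samples reusable across multiple stage-2 analyses.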
2. Measuring and improving performance
Various sub-specialties of statistics have spawned their own de facto benchmark datasets – think of the iris data for clustering or the galaxy dataset for density estimation. Likewise, RLdata500 and RLdata10000 have arguably become something of a benchmark in record linkage problems, due in large part to their accessibility via the popular RecordLinkage R package. I have used them in publications myself (Murray, 2015). Benchmark datasets form a sort of lingua franca that is useful for teaching, exposition, and as a sort of sanity check (when our brilliant new method finds six distinct clusters in the iris data, it's back to the drawing board).

However, we have to be careful extrapolating from these datasets to more complex settings. In the provocatively titled “Leave the Pima Indians Alone”, Chopin and Ridgway (2017) make the case that an excessive focus on relatively simple binary regression problems like the Pima Indians diabetes dataset has had a distortive impact on the Bayesian computation literature. I worry a little that repeatedly going back to the RLdata datasets might lead the record linkage literature up the same path. In particular, the errors in these synthetic datasets are rather minimal, and the duplicate record pairs are quite well-separated from the non-duplicates. In my experience this is not representative of the datasets we see in the wild, at least not those that demand sophisticated statistical modeling. Like Britney and the Pima Indians, I think it may be time to leave RLdata alone.

However, the primary evidence that the authors provide in favor of their model is its performance on RLdata datasets. Even setting aside whether this is a representative testbed, I wonder if this is much evidence at all, since no alternative approaches are presented. Several are available, at least for the record linkage and de-duplication tasks, including some developed by the authors themselves (e.g. Steorts et al. (2015) reports FNR and FDR of 0.02 and 0.04 on RLdata500, versus 0.015 and 0.08 using the model in the current paper). How well do existing Bayesian models perform on the linkage/de-duplication task? What about even simpler methods, like the point estimates generated by Fellegi-Sunter methods (Fellegi and Sunter, 1969) or their generalizations (Sadinle and Fienberg, 2013; Murray, 2015)? This is important context; while the model proposed here offers richer inference, should we trust those inferences if the model does not perform relatively well on the linkage/de-duplication task?

The authors actually seem to go a step further and use results on RLdata to inform parameter selection when modeling the Syrian casualty data. This frankly seems like a bad idea; in my own experience with similar files (Dalmasso et al., 2019), including expert hand-linked datasets, we observed very different patterns of distortion among co-referent records than the simple patterns one would find in RLdata. Given how variable performance is across parameter settings in Section 4, I would suggest that at least some sensitivity analysis might be in order for the Syria application.

Rather than rely on unrepresentative benchmark datasets to measure performance and select parameters, what could we do instead? The longer I work on record linkage problems the more I am convinced of the need to include a hand-labeling exercise alongside every serious application. The synthetic datasets at our disposal are limited in the range of errors they include and are often poor representations of the problem at hand. Model-based estimates of error rates are only as good as the model, and if we're not sure about the model... However, provided that the true error rates are low, precise estimates of false match rates (false discovery rates) can be obtained via random sampling from matched record pairs.
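The sampling-based audit is simple enough to sketch in a few lines. In the sketch below, simulated labels stand in for actual hand labels, and the declared match set, its size, and the true 1% false match rate are all hypothetical; the point is that a modest labeling budget already gives a usably tight interval when the true rate is low.

```python
import math
import random

random.seed(0)

def wilson_interval(false_count, n, z=1.96):
    """Wilson score confidence interval for a proportion
    (here, the false match rate among declared matches)."""
    phat = false_count / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical: 50,000 declared matches with a true 1% false match rate.
# True marks a declared match that is actually a false match.
declared = [random.random() < 0.01 for _ in range(50_000)]

# Hand-label a simple random sample of the declared matches.
sample = random.sample(declared, 400)
false_count = sum(sample)

lo, hi = wilson_interval(false_count, len(sample))
print(f"estimated FMR: {false_count / len(sample):.3f} "
      f"(95% CI: {lo:.3f} to {hi:.3f})")
```

Estimating false non-match rates this way is much harder, since the non-matches to sample from vastly outnumber the matches; that asymmetry is why the hand-labeling recommendation here is mainly about false match rates.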
False match rates aren't everything, but they aren't nothing either. Sadly the authors missed an opportunity to do even a little inspection here; after finding a small number of duplicates in the Syria application, they note only that “visual inspection of these pairs may eventually confirm their matching status”.

Ideally a labeling exercise to evaluate a record linkage/de-duplication method should include matches generated by other methods (to remove potential bias toward declaring estimated matches correct), blinding (to the method(s) that declared the link), multiple review, an “indeterminate” or “unsure” option for the labelers, and should present labelers with neighboring “near-match” record pairs. Stellar examples of hand-labeling study designs include Bailey et al. (2017); Frisoli and Nugent (2018). In McVeigh et al. (2019) we hand-labeled a relatively small number of links to compare two competing methods, including one Bayesian model. For the Bayesian model we also used these labels to obtain the posterior distribution of false match rate adjusted estimators by computing them on each posterior sample of the linkage structure (similar to Sadinle (2018)'s linkage averaging). For our estimands, we only found it necessary to adjust for the false match rate, and we did not grapple with simultaneous de-duplication or multiple files. But we did find that variation due to assumptions about bias from linkage error tended to swamp variation due to uncertainty about the linkage structure.

Reducing or otherwise accounting for linkage error seems important in the context of the current paper as well. Observe that in Figure 3, the estimates of K are worse in the blocks with higher error rates (blocks 7, 1, 10, 3), and in each case the estimate for K is biased down with a rather concentrated posterior distribution. If the model cannot be improved further, perhaps we would be better off looking at the posterior distribution of linkage error adjusted estimates of the population size. Linkage error adjusted estimators for the population size do exist, at least for relatively simple settings (e.g. Ding and Fienberg (1994); Di Consiglio and Tuoto (2018); Heijden (2019)) and perhaps could be cast in Sadinle (2018)'s framework of linkage averaging (although I have not checked the compatibility conditions myself). These estimators depend on false non-match rates, which are more difficult to obtain through hand labeling but often can be reasoned about based on plausible levels of duplication and overlap. This reasoning could form the basis of a computationally efficient sensitivity analysis. This seems like a promising avenue for future research, alongside further improvements in model and prior specification to minimize error rates.

References
Bailey, M., Cole, C., Henderson, M., and Massey, C. (2017). “How well do automated methods perform in historical samples? Evidence from new ground truth.” Technical report, National Bureau of Economic Research, Inc.

Chopin, N. and Ridgway, J. (2017). “Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation.” Statistical Science, 32(1): 64–87.

Dalmasso, N., Mejia, R., Rodu, J., Price, M., and Murray, J. (2019). “Feature Engineering for Entity Resolution with Arabic Names: Improving Estimates of Observed Casualties in the Syrian Civil War.” Artificial Intelligence for Humanitarian Assistance and Disaster Response Workshop, NIPS.

Di Consiglio, L. and Tuoto, T. (2018). “Population size estimation and linkage errors: The multiple lists case.” Journal of Official Statistics, 34(4): 889–908.

Ding, Y. and Fienberg, S. (1994). “Dual system estimation of Census undercount in the presence of matching error.” Survey Methodology, 20(2): 149–158.

Fellegi, I. P. and Sunter, A. B. (1969). “A Theory for Record Linkage.” Journal of the American Statistical Association, 64(328): 1183–1210.

Heijden, P. V. D. (2019). “A linkage error correction model for population size estimation with multiple sources.” URL https://eprints.soton.ac.uk/436665/

Jacob, P. E., Murray, L. M., Holmes, C. C., and Robert, C. P. (2017). “Better together? Statistical learning in models made of modules.” arXiv preprint arXiv:1708.08719.

McCandless, L. C., Douglas, I. J., Evans, S. J., and Smeeth, L. (2010). “Cutting feedback in Bayesian regression adjustment for the propensity score.” The International Journal of Biostatistics, 6(2).

McVeigh, B. S., Spahn, B. T., and Murray, J. S. (2019). “Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An Application to the California Great Registers.” URL https://arxiv.org/abs/1905.05337

Murray, J. S. (2015). “Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering.” Journal of Privacy and Confidentiality, 7(1). URL https://journalprivacyconfidentiality.org/index.php/jpc/article/view/643

Murray, J. S. (2018). “Multiple imputation: A review of practical and theoretical findings.” Statistical Science, 33(2): 142–159.

Sadinle, M. (2018). “Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations.” The Annals of Applied Statistics, 12(2): 1013–1038.

Sadinle, M. and Fienberg, S. E. (2013). “A generalized Fellegi–Sunter framework for multiple record linkage with application to homicide record systems.” Journal of the American Statistical Association, 108(502): 385–397.

Steorts, R. C. (2015). “Entity resolution with empirically motivated priors.” Bayesian Analysis, 10(4): 849–875.

Steorts, R. C., Hall, R., and Fienberg, S. E. (2016). “A Bayesian approach to graphical record linkage and deduplication.” Journal of the American Statistical Association, 111(516): 1660–1672.

Tancredi, A. and Liseo, B. (2011). “A hierarchical Bayesian approach to record linkage and population size problems.” The Annals of Applied Statistics, 5(2B): 1553–1585.

Xie, X. and Meng, X.-L. (2017). “Dissecting multiple imputation from a multi-phase inference perspective: What happens when God's, imputer's and analyst's models are uncongenial?” Statistica Sinica, 1485–1545.

Yu, X., Del Zanna, G., Stenning, D. C., Cisewski-Kehe, J., Kashyap, V. L., Stein, N., van Dyk, D. A., Warren, H. P., and Weber, M. A. (2018). “Incorporating Uncertainties in Atomic Data Into the Analysis of Solar and Stellar Observations: A Case Study in Fe XIII.” The Astrophysical Journal, 866(2): 146.

Zigler, C. M., Watts, K., Yeh, R. W., Wang, Y., Coull, B. A., and Dominici, F. (2013). “Model feedback in Bayesian propensity score estimation.” Biometrics, 69(1): 263–273.