Invited Discussion of "A Unified Framework for De-Duplication and Population Size Estimation"
Jared S. Murray∗
Department of Information, Risk, and Operations Management and Department of Statistical Science, University of Texas at Austin. e-mail: [email protected]
I would like to congratulate the authors on a stimulating contribution to the literature on record linkage/de-duplication and population size estimation. Tancredi and Liseo (2011) was one of the papers that first piqued my interest in record linkage, so I am pleased to see more work along these lines (with an author population size of N + 1!). My discussion below focuses on two main themes: providing a more nuanced picture of the costs and benefits of joint models for record linkage and the “downstream task” (i.e., whatever we might want to do with the linked and de-duplicated files), and how we should measure performance.
1. The promise and peril of joint modeling: A partial defense of disunity
The promise of a joint model for record linkage, de-duplication, and population size estimation is likely obvious to the readership of Bayesian Analysis: we immediately obtain valid posterior inference over the population size that accounts for uncertainty about duplicates and links across files, provided that we specify an adequate joint model. Which leads us predictably to the peril of joint modeling: the fact that specifying a model for any of these three tasks alone is nontrivial. Addressing them simultaneously in a single model requires specifying a joint model sufficiently rich to do well on all three tasks (linkage, de-duplication, and population size estimation) while being tractable enough to understand its properties and perform posterior inference.

The model presented here necessarily makes some compromises in service of joint modeling, and I wonder about their impact. For example, assumptions about the sampling process generating the lists are essential to modeling the unknown population size and therefore must appear in any unified model. This will consequently restrict the prior distribution over the overlap between files in the record linkage/de-duplication portion of the model, despite the fact that the assumption of simple random sampling from the population – or any sort of random sampling at all – is otherwise irrelevant to record linkage and de-duplication.

∗ The author gratefully acknowledges support from the National Science Foundation under grant number SES-1824555. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.
The assumptions made by the authors imply a very particular, informative prior distribution on Z, the partition of records into co-referent sets, and therefore on K, the number of distinct units captured across all lists (as reported in Table 1). This choice is consequential. Indeed, immediately prior to Section 3.1 the authors note that the induced prior distribution on K is probably not well-suited to record linkage tasks in general, which makes me wonder why we should expect it to work well when doing record linkage and population size estimation simultaneously. I have to assume that either 1) we actually don't expect it to work particularly well, but the joint model at hand demands it, or 2) the assumptions about the sampling process are actually warranted here, at least approximately, while they may not be in general applications of record linkage. If the former, this seems to beg the question and ignore options beyond joint modeling. If the latter, things are more interesting.

If the assumptions are in fact correct, we would expect to obtain more accurate and efficient inferences by inducing the “true” prior over Z and K using the joint model. But what happens when the sampling assumptions are violated? It is difficult to say, and it must depend on a host of factors (such as the degree and frequency of errors among co-referent records). However, it is not hard to imagine a case where relatively minor deviations from the sampling assumptions are more or less innocuous in the context of a population size model with known partition Z but become influential when Z is unknown and jointly modeled, due to the influence of the “misspecified” informative prior over Z. It would be interesting to try and draw this out via a simulation exercise, particularly in light of how influential Steorts et al. (2016) found a similar prior to be in a pure record linkage/de-duplication context.

If posterior inference is not robust to deviations from the sampling assumptions, what could we do instead?
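A simulation exercise of the kind suggested above could start very simply, by first quantifying what violated sampling assumptions do when the partition Z is known and error-free. The sketch below is my own toy illustration (not the authors' model): it applies a Lincoln-Petersen-type estimator to two lists drawn from a known population, once under equal capture probabilities and once under heterogeneous capture probabilities shared across lists, which induces dependence between the lists and biases the estimate down.

```python
import random

random.seed(1)

N = 1000   # true population size (known only to the simulation)
reps = 200

def lincoln_petersen(p_low, p_high):
    """Average Lincoln-Petersen estimate n1*n2/m over `reps` simulated
    two-list captures, where each unit's capture probability (shared
    across both lists) is drawn uniformly from [p_low, p_high]."""
    ests = []
    for _ in range(reps):
        p = [random.uniform(p_low, p_high) for _ in range(N)]
        list1 = [random.random() < p[i] for i in range(N)]
        list2 = [random.random() < p[i] for i in range(N)]
        n1, n2 = sum(list1), sum(list2)
        m = sum(a and b for a, b in zip(list1, list2))  # overlap count
        if m > 0:
            ests.append(n1 * n2 / m)
    return sum(ests) / len(ests)

# Equal capture probabilities: the sampling assumption holds.
print(round(lincoln_petersen(0.3, 0.3)))    # near the true N of 1000
# Heterogeneous capture probabilities: the assumption is violated.
print(round(lincoln_petersen(0.05, 0.55)))  # noticeably below 1000
```

Layering an unknown, jointly modeled Z on top of this baseline would then isolate how much additional damage the misspecified informative prior over Z does beyond the known-partition bias.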
The desire to mitigate this undesirable “feedback” from a misspecified sub-model appears in many different settings, from Bayesian causal inference with propensity score models (McCandless et al., 2010; Zigler et al., 2013) to astrophysics (Yu et al., 2018) and beyond (see Jacob et al. (2017) for additional examples). This is a difficult problem and an active area of research. The proposed solutions often take the form of (possibly incoherent) multistage inference, in this case inferring the linkage structure in stage 1 and the population size in stage 2, propagating uncertainty from stage 1 to stage 2 without allowing any information from stage 2 to flow to stage 1. Jacob et al. (2017) give examples of settings where these “posteriors” are better than the posterior under a misspecified joint model in a decision-theoretic sense.

In the context of de-duplication and population size modeling, Sadinle (2018) proposes a related two-stage alternative to joint modeling termed “linkage averaging”. If (in the notation of the current paper) h(λ) is the estimate of population size we would compute given complete data (i.e., a de-duplicated and linked set of files), then under certain conditions the posterior for h(λ) under a record linkage/de-duplication model alone will give the same inferences as a proper Bayesian joint model for linkage, de-duplication, and population size estimation. With a single set of posterior samples one can perform inference over multiple models for the population size, again provided that they all satisfy some relatively mild conditions.

These conditions do necessarily demand a degree of compatibility between the prior on λ and the population size model.
They bear a striking similarity to the conditions under which multiple imputation delivers (asymptotically) valid Bayesian inference (“congeniality”; Meng, 1994; Xie and Meng, 2017; Murray, 2018). This raises the interesting question of whether the compatibility conditions might be relaxed while still yielding conservative inferences, similar to the way one can obtain conservative inferences using imputations under an uncongenial imputation model, provided it is uncongenial in the “right” way (roughly, by making fewer assumptions during imputation than analysis).
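To make the two-stage idea concrete, here is a minimal sketch of linkage averaging in the style of Sadinle (2018). Everything below is illustrative: the toy two-file data, the hand-written "posterior samples" of the partition (which in practice would come from a linkage model's MCMC output), and the choice of a simple Lincoln-Petersen statistic as h(λ).

```python
def h(partition, file_id):
    """Lincoln-Petersen population size estimate given a partition of
    records into co-referent clusters. `partition` maps record index ->
    cluster id; `file_id` maps record index -> source file (0 or 1)."""
    clusters = {}
    for rec, c in partition.items():
        clusters.setdefault(c, set()).add(file_id[rec])
    n1 = sum(1 for files in clusters.values() if 0 in files)  # units seen in file 0
    n2 = sum(1 for files in clusters.values() if 1 in files)  # units seen in file 1
    m = sum(1 for files in clusters.values() if files == {0, 1})  # seen in both
    return n1 * n2 / m if m else float("inf")

# Toy data: records 0-2 come from file 0, records 3-5 from file 1.
file_id = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Stand-ins for posterior samples of the partition from stage 1
# (a record linkage/de-duplication model fit on its own).
posterior_partitions = [
    {0: "a", 1: "b", 2: "c", 3: "a", 4: "d", 5: "e"},  # one cross-file link
    {0: "a", 1: "b", 2: "c", 3: "a", 4: "b", 5: "e"},  # two cross-file links
]

# Stage 2: push each sampled partition through h, with no feedback
# into stage 1, yielding a posterior over the population size estimate.
draws = [h(p, file_id) for p in posterior_partitions]
print(draws)
```

Note that swapping in a different population size model only means replacing h, which is what makes a single set of partition samples reusable across multiple stage-2 analyses.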
2. Measuring and improving performance
Various sub-specialties of statistics have spawned their own de facto benchmark datasets – think of the iris data for clustering or the galaxy dataset for density estimation. Likewise, RLdata500 and RLdata10000 have arguably become something of a benchmark in record linkage problems, due in large part to their accessibility via the popular RecordLinkage R package. I have used them in publications myself (Murray, 2015). Benchmark datasets form a sort of lingua franca that is useful for teaching, exposition, and as a sort of sanity check (when our brilliant new method finds six distinct clusters in the iris data, it's back to the drawing board).

However, we have to be careful extrapolating from these datasets to more complex settings. In the provocatively titled “Leave the Pima Indians Alone”, Chopin and Ridgway (2017) make the case that an excessive focus on relatively simple binary regression problems like the Pima Indians diabetes dataset has had a distortive impact on the Bayesian computation literature. I worry a little that repeatedly going back to the RLdata datasets might lead the record linkage literature up the same path. In particular, the errors in these synthetic datasets are rather minimal, and the duplicate record pairs are quite well-separated from the non-duplicates. In my experience this is not representative of the datasets we see in the wild, at least not those that demand sophisticated statistical modeling. Like Britney and the Pima Indians, I think it may be time to leave RLdata alone.

However, the primary evidence that the authors provide in favor of their model is its performance on RLdata datasets. Even setting aside whether this is a representative testbed, I wonder if this is much evidence at all, since no alternative approaches are presented. Several are available, at least for the record linkage and de-duplication tasks, including some developed by the authors themselves (e.g. Steorts et al. (2015) reports FNR and FDR of 0.02 and 0.04 on RLdata500, versus 0.015 and 0.08 using the model in the current paper). How well do existing Bayesian models perform on the linkage/de-duplication task? What about even simpler methods, like the point estimates generated by Fellegi-Sunter methods (Fellegi and Sunter, 1969) or their generalizations (Sadinle and Fienberg, 2013; Murray, 2015)? This is important context; while the model proposed here offers richer inference, should we trust those inferences if the model does not perform relatively well on the linkage/de-duplication task?

The authors actually seem to go a step further and use results on RLdata to inform parameter selection when modeling the Syrian casualty data. This frankly seems like a bad idea; in my own experience with similar files (Dalmasso et al., 2019), including expert hand-linked datasets, we observed very different patterns of distortion among co-referent records than the simple patterns one would find in RLdata. Given how variable performance is across parameter settings in Section 4, I would suggest that at least some sensitivity analysis might be in order for the Syria application.

Rather than rely on unrepresentative benchmark datasets to measure performance and select parameters, what could we do instead? The longer I work on record linkage problems the more I am convinced of the need to include a hand-labeling exercise alongside every serious application. The synthetic datasets at our disposal are limited in the range of errors they include and are often poor representations of the problem at hand. Model-based estimates of error rates are only as good as the model, and if we're not sure about the model... However, provided that the true error rates are low, precise estimates of false match rates (false discovery rates) can be obtained via random sampling from matched record pairs.
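The sampling-based audit is simple enough to sketch in a few lines. In the sketch below, simulated labels stand in for actual hand labels, and the declared match set, its size, and the true 1% false match rate are all hypothetical; the point is that a modest labeling budget already gives a usably tight interval when the true rate is low.

```python
import math
import random

random.seed(0)

def wilson_interval(false_count, n, z=1.96):
    """Wilson score confidence interval for a proportion
    (here, the false match rate among declared matches)."""
    phat = false_count / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical: 50,000 declared matches with a true 1% false match rate.
# True marks a declared match that is actually a false match.
declared = [random.random() < 0.01 for _ in range(50_000)]

# Hand-label a simple random sample of the declared matches.
sample = random.sample(declared, 400)
false_count = sum(sample)

lo, hi = wilson_interval(false_count, len(sample))
print(f"estimated FMR: {false_count / len(sample):.3f} "
      f"(95% CI: {lo:.3f} to {hi:.3f})")
```

Estimating false non-match rates this way is much harder, since the non-matches to sample from vastly outnumber the matches; that asymmetry is why the hand-labeling recommendation here is mainly about false match rates.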
False match rates aren't everything, but they aren't nothing either. Sadly the authors missed an opportunity to do even a little inspection here; after finding a small number of duplicates in the Syria application, they note only that “visual inspection of these pairs may eventually confirm their matching status”.

Ideally a labeling exercise to evaluate a record linkage/de-duplication method should include matches generated by other methods (to remove potential bias toward declaring estimated matches correct), blinding (to the method(s) that declared the link), multiple review, an “indeterminate” or “unsure” option for the labelers, and should present labelers with neighboring “near-match” record pairs. Stellar examples of hand-labeling study designs include Bailey et al. (2017); Frisoli and Nugent (2018). In McVeigh et al. (2019) we hand-labeled a relatively small number of links to compare two competing methods, including one Bayesian model. For the Bayesian model we also used these labels to obtain the posterior distribution of false match rate adjusted estimators by computing them on each posterior sample of the linkage structure (similar to Sadinle (2018)'s linkage averaging). For our estimands, we only found it necessary to adjust for the false match rate, and we did not grapple with simultaneous de-duplication or multiple files. But we did find that variation due to assumptions about bias from linkage error tended to swamp variation due to uncertainty about the linkage structure.

Reducing or otherwise accounting for linkage error seems important in the context of the current paper as well. Observe that in Figure 3, the estimates of K are worse in the blocks with higher error rates (blocks 7, 1, 10, 3), and in each case the estimate for K is biased down with a rather concentrated posterior distribution. If the model cannot be improved further, perhaps we would be better off looking at the posterior distribution of linkage error adjusted estimates of the population size. Linkage error adjusted estimators for the population size do exist, at least for relatively simple settings (e.g. Ding and Fienberg (1994); Di Consiglio and Tuoto (2018); Heijden (2019)) and perhaps could be cast in Sadinle (2018)'s framework of linkage averaging (although I have not checked the compatibility conditions myself). These estimators depend on false non-match rates, which are more difficult to obtain through hand labeling but often can be reasoned about based on plausible levels of duplication and overlap. This reasoning could form the basis of a computationally efficient sensitivity analysis. This seems like a promising avenue for future research, alongside further improvements in model and prior specification to minimize error rates.

References
Bailey, M., Cole, C., Henderson, M., and Massey, C. (2017). “How well do automated methods perform in historical samples? Evidence from new ground truth.” Technical report, National Bureau of Economic Research, Inc.

Chopin, N. and Ridgway, J. (2017). “Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation.” Statistical Science, 32(1): 64–87.

Dalmasso, N., Mejia, R., Rodu, J., Price, M., and Murray, J. (2019). “Feature Engineering for Entity Resolution with Arabic Names: Improving Estimates of Observed Casualties in the Syrian Civil War.” Artificial Intelligence for Humanitarian Assistance and Disaster Response Workshop, NIPS.

Di Consiglio, L. and Tuoto, T. (2018). “Population size estimation and linkage errors: The multiple lists case.” Journal of Official Statistics, 34(4): 889–908.

Ding, Y. and Fienberg, S. (1994). “Dual system estimation of Census undercount in the presence of matching error.” Survey Methodology, 20(2): 149–158.

Fellegi, I. P. and Sunter, A. B. (1969). “A Theory for Record Linkage.” Journal of the American Statistical Association, 64(328): 1183–1210.

Heijden, P. V. D. (2019). “A linkage error correction model for population size estimation with multiple sources.” URL https://eprints.soton.ac.uk/436665/

Jacob, P. E., Murray, L. M., Holmes, C. C., and Robert, C. P. (2017). “Better together? Statistical learning in models made of modules.” arXiv preprint arXiv:1708.08719.

McCandless, L. C., Douglas, I. J., Evans, S. J., and Smeeth, L. (2010). “Cutting feedback in Bayesian regression adjustment for the propensity score.” The International Journal of Biostatistics, 6(2).

McVeigh, B. S., Spahn, B. T., and Murray, J. S. (2019). “Scaling Bayesian Probabilistic Record Linkage with Post-Hoc Blocking: An Application to the California Great Registers.” URL https://arxiv.org/abs/1905.05337

Murray, J. S. (2015). “Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering.” Journal of Privacy and Confidentiality, 7(1). URL https://journalprivacyconfidentiality.org/index.php/jpc/article/view/643

Murray, J. S. (2018). “Multiple imputation: A review of practical and theoretical findings.” Statistical Science, 33(2): 142–159.

Sadinle, M. (2018). “Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations.” The Annals of Applied Statistics, 12(2): 1013–1038.

Sadinle, M. and Fienberg, S. E. (2013). “A generalized Fellegi–Sunter framework for multiple record linkage with application to homicide record systems.” Journal of the American Statistical Association, 108(502): 385–397.

Steorts, R. C. (2015). “Entity resolution with empirically motivated priors.” Bayesian Analysis, 10(4): 849–875.

Steorts, R. C., Hall, R., and Fienberg, S. E. (2016). “A Bayesian approach to graphical record linkage and deduplication.” Journal of the American Statistical Association, 111(516): 1660–1672.

Tancredi, A. and Liseo, B. (2011). “A hierarchical Bayesian approach to record linkage and population size problems.” The Annals of Applied Statistics, 5(2B): 1553–1585.

Xie, X. and Meng, X.-L. (2017). “Dissecting multiple imputation from a multi-phase inference perspective: What happens when God's, imputer's and analyst's models are uncongenial?” Statistica Sinica, 1485–1545.

Yu, X., Del Zanna, G., Stenning, D. C., Cisewski-Kehe, J., Kashyap, V. L., Stein, N., van Dyk, D. A., Warren, H. P., and Weber, M. A. (2018). “Incorporating Uncertainties in Atomic Data Into the Analysis of Solar and Stellar Observations: A Case Study in Fe XIII.” The Astrophysical Journal, 866(2): 146.

Zigler, C. M., Watts, K., Yeh, R. W., Wang, Y., Coull, B. A., and Dominici, F. (2013). “Model feedback in Bayesian propensity score estimation.” Biometrics, 69(1): 263–273.