VIEW: a framework for organization level interactive record linkage to support reproducible data science
Mohammad Karim, Mahin Ramezani, Tenaya Sunbury, Robert Ohsfeldt, Hye-Chung Kum
Mohammad Karim
Population Informatics Lab, Department of Health Policy and Management, Texas A&M University
1266 TAMU, College Station, TX 77843
[email protected]

Mahin Ramezani
Population Informatics Lab, Department of Computer Science & Engineering, Texas A&M University
1266 TAMU, College Station, TX 77843
[email protected]

Tenaya Sunbury
Washington State Department of Social and Health Services
[email protected]

Robert Ohsfeldt
Department of Health Policy and Management, Texas A&M University
College Station, TX 77843
[email protected]

Hye-Chung Kum
Population Informatics Lab, Department of Health Policy and Management, Department of Computer Science & Engineering, Texas A&M University
1266 TAMU, College Station, TX 77843
[email protected]
February 17, 2021

ABSTRACT
Objective:
To design and evaluate a general framework for interactive record linkage using a convenient algorithm combined with tractable Human Intelligent Tasks (HITs; i.e., micro tasks requiring human judgment) that can support reproducible data science.
Materials and Methods:
Accurate linkage of real data requires both automatic processing of well-defined tasks and human processing of tasks that require human judgment (i.e., HITs) on messy data. We present a reproducible, interactive, and iterative framework for record linkage called VIEW (Visual Interactive Entity-resolution Workbench). We implemented and evaluated VIEW by integrating two commonly used hospital databases, the American Hospital Association (AHA) Annual Survey of Hospitals and the Medicare Cost Reports for Hospitals from CMS.
Results:
Using VIEW to iteratively standardize and clean the data, we linked all Texas hospitals common to both databases with 100% precision by confirming 78 approximate linkages using HITs and manually linking 28 hospitals using HITs.
Discussion:
Similarities in hospital names and addresses and the dynamic nature of hospital attributes over time make it impossible to build a fully automated linkage system for hospitals that can be maintained over time. VIEW is software that supports a reproducible semi-automated process that can generate and track HITs to be reviewed and linked manually for messy data elements, such as hospitals that have been merged.
Conclusion:
Effective software that can support the interactive and iterative process of record linkage, and well-designed HITs, can streamline the linkage process to support high quality replicable research using messy real data.

Keywords: Data linkage · Interactive record linkage · Reproducible data science · Human Intelligent Tasks (HITs) · Hospital level record linkage
Secondary use of large existing databases for research is increasingly common. The important characteristics of such data are that (1) an extensive amount of data exists on the population served, (2) data are continuously generated, (3) data change over time as programs evolve and originate from multiple sources, and (4) data have varied levels of validity, with data directly required for operations being the most valid. These represent the "four Vs" of big data: volume, velocity, variety, and veracity, respectively [1]. Using big data to extract valuable information requires a tractable and reproducible data processing pipeline. A critical step in the data processing pipeline is data integration. Often called record linkage or entity resolution, it presents a challenge when there is no common, error-free, unique identifier with which to identify records across databases pertaining to the same real-world entities.

Much has been published about automatic record linkage of person level data [2–13]. This paper contributes to the literature by documenting the iterative process of developing a linkage algorithm and its application to hospital level data. Documenting the process of linkage is important because researchers often rely on manual, ad hoc tools for data integration due to the lack of standardized approaches or appropriate software. Non-transparent record linkage is a major issue in replicable research.

We present a systematic framework, VIEW (Visual Interactive Entity-resolution Workbench), for incrementally developing a tractable algorithm to link organization-level data. VIEW can be used to develop a well-documented semi-automated process for linking two or more hospital datasets with no common identifiers. We evaluate VIEW in a project that required the development and maintenance of a comprehensive hospital database across five different data sources with timely updates each year.
Constructing useful measures for secondary data analysis to answer broad questions often requires the integration of data from multiple systems. For example, our project had to integrate data from five sources, which used a total of four independent identifiers for the providers: the Texas Provider ID (TPI), the National Provider ID (NPI), the Medicare provider ID, and a facility ID (FID). In this paper, we only discuss the process for building a crosswalk from the MedicareID to the FID, which demonstrates the process best. The other linkages were conducted using VIEW in similar ways.

A major challenge in integrating hospital level data is that hospitals are not static entities but evolve over time (i.e., mergers, closings, name changes, address changes). Thus, maintaining a clean identifier system for all providers over time is challenging. To further complicate this issue, there may be multiple identifiers (i.e., federal, state, and local) used for providers, often requiring a crosswalk between different identifiers when combining data from heterogeneous systems. As a result, there is a pressing need to develop a reproducible process for standardizing and integrating multiple sources of provider level data. There are a variety of applications for such integrated data, including pay-for-performance programs and public reporting, as well as organizational-level quality assurance and performance tracking using big data [14–16].
The most common methods for linking individual-level data are probabilistic and deterministic record linkage [2–13]. The probabilistic method scores a statistical probability of two records being a 'true' link based on a model typically developed using training data. Even though many different probabilistic methods in statistics and machine learning currently investigate how best to develop the model given the data, the researcher must still determine two thresholds to group linkages into match, uncertain, or non-match once the data have been scored [3]. In comparison, deterministic methods are rule-based: the researcher specifies the rules under which two records are considered a match (e.g., pairs that match exactly on name and address), uncertain (e.g., pairs with an approximate match on name or address), or non-match (e.g., all other pairs). Often a stepwise approach is used to build the rules [11, 12].
Probabilistic methods tend to work better on complex data at the cost of less interpretable models. In comparison, simple deterministic methods are easier to implement and communicate when the linkage task is relatively simple, as in the case of linking hospitals. The quality of matching results is comparable for deterministic and probabilistic methods as long as the linkage process is well developed [12, 13]. More importantly, data standardization and cleaning are important in both approaches but also very difficult to do top-down based on theory [17]. VIEW includes methods to quickly standardize only the regularities in a given dataset with a bottom-up approach using the data at hand.

To overcome the limitations of automatic algorithms in addressing real world problems [18], there has been increasing interest in interactive record linkage that better documents the human interaction during the linkage process [19, 20]. In particular, we present how to use well-defined Human Intelligent Tasks (HITs; i.e., micro tasks requiring human judgment) to design effective human-machine systems for record linkage. Using HITs is common for processing big data because most tasks require both automatic processing of well-defined tasks and human processing of tasks that require judgment [21]. The importance of human interaction in linkage is demonstrated well in Bronstein et al. [22], where pregnancies from Medicaid data were linked to birth records via 11 manual steps. Multiple uncertainties needed human decisions to attain an overall match rate of 87.9%; with no human interaction, the match rate would have been much lower. Ultimately, the goals of any approximate linkage method should include: 1) setting the match threshold conservatively to avoid false matched pairs, 2) setting the potential match threshold liberally so all missed true matches fall into the uncertain matched pairs and can be recovered during the manual resolution phase, and 3) keeping the number of uncertain pairs (i.e., HITs) to be reviewed manually at a reasonable level.
The main database is the 2013 provider ID information file that comes with the Hospital Form 2552-10 on the CMS website [23], containing the MedicareID, name, and address of all providers (N=606 for Texas). To this database, we linked the Texas Annual Survey of Hospitals from 2008 to 2013, which uses the FID. It is a mandatory hospital survey administered by the Texas Department of State Health Services, working in collaboration with the American Hospital Association and the Texas Hospital Association [24]. Some hospitals had multiple values for provider names in the survey because names change over time and both the legal business name and DBA (doing business as) name were available. Hence, there were a total of 800 different names representing the 664 unique providers.
We first describe the methods for measuring linkage quality used throughout the paper. We then present the six core steps of the proposed human-machine process (Figure 1) and demonstrate each step using our example linkage study.
The main quality measures in linkage are recall (also known as sensitivity) and precision (equations (1) and (2)). Often, the application will determine the balance between recall and precision. In general, setting stringent criteria will result in high precision and low recall, whereas looser criteria will start to introduce incorrect matches, reducing precision while increasing recall. However, this is not a direct relationship, and carefully building more complex models can increase recall without much reduction in precision. We only report recall in this paper because precision was 100% in our application.
Recall = Sensitivity = (number of correct linkages found) / (number of all true linkages that exist)   (1)

Precision = (number of correct linkages found) / (number of total linkages found)   (2)

The first step is to select the files to build the crosswalk and then to select the common attributes to be used in the matching process. Good attributes are variables that tend to be recorded consistently and have high distinguishing power (i.e., many unique values). For example, with only two possible values, type of hospital (i.e., public or private) is a low power variable. In comparison, with mostly unique values, name is a high power variable. However, names tend to have a lot of variation for the same entity, which decreases their usefulness. The discriminatory power of identifiers can be quantified using the Shannon entropy [25]. In our linkage, the common data attributes were provider name, city, zip code, and street address. We dropped city because it carried similar information to zip code, and the more granular, numerically coded data was better.

Figure 1: VIEW: Framework for iteratively developing a record linkage algorithm.
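The two quality measures and the entropy-based notion of distinguishing power can be sketched in a few lines. This is an illustrative Python sketch (the paper's actual implementation is in SAS); the example attribute values are hypothetical.

```python
import math
from collections import Counter

def recall(correct_found, all_true):
    # Equation (1): correct linkages found / all true linkages that exist
    return correct_found / all_true

def precision(correct_found, total_found):
    # Equation (2): correct linkages found / total linkages found
    return correct_found / total_found

def shannon_entropy(values):
    # Higher entropy means more distinguishing power for a linkage attribute.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# A binary attribute (hospital type) carries far less information
# than a mostly unique attribute (hospital name).
hospital_type = ["public", "private", "public", "private"]
names = ["st marys", "memorial", "etmc", "baylor scott"]
assert shannon_entropy(names) > shannon_entropy(hospital_type)
```

Computing entropy on each candidate attribute before linkage is one way to decide which variables are worth comparing at all.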
Variation in the way that attributes are represented across data used to link hospitals can result from different coding methods (e.g., use of uppercase versus lowercase), the dynamic nature of the underlying attributes (e.g., renaming a hospital after a change in ownership), erroneous data (e.g., typos), or missing data. Standardization of common data elements, both in terms of formats (capitalization) and values (e.g., street to st), reduces unnecessary variation in the data and significantly improves automatic linkage. Numerically coded attributes using the same coding scheme work best. For example, zip code works well for linking organizations because it has high distinguishing power as well as low variation in common values. Nonetheless, developing well coded variables is time intensive, and often linkage is carried out on raw data without common coding by using approximate matches.

The most efficient method for standardizing and cleaning the data is to set up a data processing pipeline that makes it easy to add standardization rules iteratively over time as problems are discovered in the data. Setting up such a framework for processing big data is critical because it is difficult to know up front all the issues with any given dataset. Thus, as researchers encounter different issues in the data, the ability to go back, add rules to clean the data, and then easily repeat the steps is essential to working efficiently with big data in a tractable manner. In record linkage, this means that in the first iteration there are likely to be no data cleaning or standardization rules because the researcher does not yet know the issues in the data. Such rules will be developed and incorporated in subsequent iterations.

Using computer code to automate data cleaning and standardization has several advantages compared to manually editing the data. First, if the process is automated, then work will not be lost if rules need to be revised or deleted in subsequent iterations. In addition, the computer code serves as documentation of what was done. Such documentation is important for making research reproducible. Finally, an automated process makes it simple to retract any steps that are later detected as incorrect during the process of working with the data.

Two ways of effectively standardizing provider names quickly are to drop frequently used words (e.g., hospital) and to replace terms that are frequently abbreviated (e.g., center, ctr, cntr) with a standard set of consistent abbreviations. VIEW provides a module that produces the frequent word list. Detecting the commonly used abbreviations occurs iteratively during manual review of uncertain and non-matched records.

Figure 2: Final Name Standardization Algorithm.

In our linkage, basic standardization (i.e., using only lower case and removing all special characters) improved the recall rate to 51%. Figure 2 and Table 1 show the final standardization we used for name and address after multiple iterations. Note that the order of the standardization steps matters. It was also important to have the last step, where we use the original name when the standardized name becomes null (e.g., memorial hospital). Using these standardizations, exact match on standardized names improved the recall rate to 67%.
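The shape of such an iterative standardization rule can be sketched as below. This is an illustrative Python sketch only (VIEW itself is a set of SAS macros), and the abbreviation and drop-word lists shown are hypothetical stand-ins for the lists built iteratively from the data.

```python
import re

# Illustrative rule lists only; in practice these are grown iteratively
# from the data (frequent-word module plus manual review of HITs).
ABBREVIATIONS = {"ctr": "center", "cntr": "center", "street": "st"}
DROP_WORDS = {"hospital", "memorial", "the"}

def standardize_name(name):
    # Order matters: lowercase and strip special characters first,
    # then replace abbreviations, then drop frequent words.
    s = re.sub(r"[^a-z0-9 ]", "", name.lower())
    words = [ABBREVIATIONS.get(w, w) for w in s.split()]
    words = [w for w in words if w not in DROP_WORDS]
    std = " ".join(words)
    # Last step from the paper: if standardization empties the name
    # (e.g., "Memorial Hospital"), fall back to the cleaned original.
    return std if std else s.strip()
```

Because the rules live in code rather than in hand-edited data, adding a newly discovered abbreviation and rerunning the whole pipeline is cheap and fully documented.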
The full comparison space is the Cartesian product of the two datasets being linked, the majority of which are non-matches (e.g., 606*800 = 484,800 comparisons in our example). To reduce the search space, one or more blocking variables are used so that only records that share the attribute are compared. Blocking can introduce problems when there are data errors or missing values in the blocking variable, because the correct comparisons cannot be made. Thus, it is common to use a multi-pass blocking algorithm to recapture comparisons that would otherwise be permanently lost after the first pass. Clearly, the blocking variable has a direct impact on performance in terms of both time and quality.

In our study, zip code is the best blocking variable because it breaks the data into a small number of hospitals per zip code to be matched, with no missing data (Table 2). Blocking on zip code reduced the number of comparisons.
Table 1: Final Address Standardization

Original word              | Standardized to               | Ignore for approximate match
lane                       | ln                            | X
street                     | st                            | X
boulevard                  | blvd                          | X
road                       | rd                            | X
circle                     | cir                           | X
drive                      | dr                            | X
avenue                     | ave                           | X
loop                       | lp                            | X
ctr, cntr                  | center                        |
3rd, 2nd                   | third, second (respectively)  |
highway, freeway, parkway  | hwy, fwy, pkwy (respectively) |
north, south, east, west   | n, s, e, w (respectively)     |

Table 2: Search Space (Medicare data vs. AHA Survey data; unique record counts)
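The blocking step can be sketched as follows. This is an illustrative Python sketch under the assumption that each record is a dict with a zip code field; the record IDs are hypothetical.

```python
from collections import defaultdict
from itertools import product

def candidate_pairs(left, right, block_key):
    # Compare only records that share the blocking value (here, zip code)
    # instead of forming the full Cartesian product of the two files.
    blocks = defaultdict(lambda: ([], []))
    for rec in left:
        blocks[rec[block_key]][0].append(rec)
    for rec in right:
        blocks[rec[block_key]][1].append(rec)
    for l_recs, r_recs in blocks.values():
        yield from product(l_recs, r_recs)

medicare = [{"id": "M1", "zip": "77843"}, {"id": "M2", "zip": "75001"}]
survey = [{"id": "F1", "zip": "77843"}, {"id": "F2", "zip": "75001"}]
pairs = list(candidate_pairs(medicare, survey, "zip"))
# Two blocked comparisons instead of the 2*2 = 4 in the full product.
```

A second pass with a different (or no) blocking key can then be run on the residual unmatched records, as described in the multi-pass discussion below.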
The next step is the pairwise scoring of all pairs within each block. In the first iteration, only simple standardization and a simple scoring system are used (i.e., if all common attributes match exactly, classify as a match; if at least one attribute matches approximately, classify as uncertain; otherwise, classify as a nonmatch). For most problems, this simple setup will result in a low match rate and a large number of uncertain matches. In the first few iterations, you scan both the uncertain and nonmatch groups for regular patterns in the two datasets that you need to either standardize or use for scoring a pair. As you spot them, you extend the standardization code and the scoring code and rerun, until there are no more improvements you can make automatically. The goal is to iteratively develop both the standardization and scoring algorithms to capture regularities in the data automatically as true matches, and to reduce the uncertain group to only the difficult cases that require human judgment. In addition, you should review the nonmatch group to confirm that these are indeed nonmatches. Typically, you will spot required standardizations (e.g., using the same abbreviations, such as East Texas Medical Center to ETMC) in the nonmatch group in the beginning.

Probabilistic record linkage methods develop statistical models for automatic scoring using training data that have been manually labeled. The researcher then determines the two thresholds separating match, uncertain, and nonmatch in the final score. However, for reasonably sized data, using simple rule-based deterministic scoring methods is more tractable and interpretable and works comparably. Deterministic methods also make it easier to control precisely what is grouped for manual review versus automatic linkage.

In our linkage, we allowed for deterministic approximate matching on both name and address as detailed in Figure 3, which further improved the recall to 88%.
The algorithm builds an undirected graph with each entity in one database connected to all entities in the other database in the same zip code. Each link is then scored on a priority of 1 to 6 based on the similarity of names and addresses. If neither name nor address matches at all, the link is deleted. As a final step, for each pair of entities only the closest link is kept. That is, if there are two names and one matches exactly, only the exactly matched name is considered. The main criterion used for automatic linkage was to only allow linkages that were 1-to-1 in the remaining graph. Any linkage that resulted in more than one mapping was kicked out for the next iteration. The importance of the 1-to-1 mapping criterion for automatic linkage is discussed later.

VIEW makes it easy to add customized SAS code for approximate matching on each variable and provides macros for counting the number of common words and the dice coefficient of common words.

Figure 3: Scoring Algorithm.

The dice coefficient is commonly used to measure the similarity of sets and is defined as two times the number of common words over the total number of words. The threshold of 2/3 for the dice coefficient can account for addresses having fewer details. For example, the dice coefficient for ‘865 deshong’ and ‘865 deshong 5th floor’ is 2*2/6 = 2/3. Of the 521 matches made, 40% were an exact match on std_name and std_addr, 30% were an exact match on only std_name, and 22% were an exact match on std_addr, with the remaining 8% being an approximate match of some kind. These 41 approximate matches were generated as HITs and confirmed manually. Two of the HITs required further investigation outside the databases to confirm as a correct match.
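The word-level dice coefficient and the 1-to-1 filter can be sketched as below. This is an illustrative Python sketch of the two building blocks (VIEW provides them as SAS macros); the link IDs in the example are hypothetical.

```python
from collections import Counter

def dice(a, b):
    # 2 * |common words| / (total number of words in both strings)
    wa, wb = a.split(), b.split()
    common = len(set(wa) & set(wb))
    return 2 * common / (len(wa) + len(wb))

def keep_one_to_one(links):
    # links: (left_id, right_id) pairs that survived scoring.
    # Keep only pairs where each side appears exactly once;
    # N-to-1 cases go back to the uncertain pile as HITs.
    lc = Counter(l for l, _ in links)
    rc = Counter(r for _, r in links)
    return [(l, r) for l, r in links if lc[l] == 1 and rc[r] == 1]

# Example from the paper: a sparse address still clears the 2/3 threshold.
assert abs(dice("865 deshong", "865 deshong 5th floor") - 2 / 3) < 1e-9
```

The 1-to-1 filter is what guards against over-standardization: if two different providers collapse to the same standardized name, they surface as an N-to-1 link and are routed to human review instead of being auto-matched.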
You can add as many block/score/review passes as needed to recapture any matches not compared in a particular blocking pass. Typically, in subsequent passes you only process data that have not been linked in a previous pass. This kind of divide and conquer method is very effective for working with big data.

In our linkage, after the first pass (blocking on zip code and allowing approximate matches on standardized name and address for 1-to-1 matches), there were 85 MedicareIDs and 171 FIDs that were not matched. Given the small numbers, we ran a second pass without blocking, linking any records that were an exact match on standardized name (43 matches) or standardized address (12 matches). All except 3 matches were 1-to-1 matches. 13 matched on both name and address, while 39 matched on only one. We generated the 39 as HITs to be confirmed for accuracy. All except two of these were matches with different zip codes due to an error in one of the datasets. This improved the recall to 97%.
Once automatic linkage is developed, the program should generate three separate outputs: one for confirmed automatic matches, one for the potential matches, and one for any entities that did not link to anything. The potential matches are HITs that are output into an Excel file. These HITs are manually resolved into another Excel sheet so that human judgment can be incorporated back into the process as well as documented. Manual refinement can occur at the end of any block/score pass.
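The three-output step can be sketched as follows. This is an illustrative Python sketch writing CSV files rather than the Excel workbooks the paper describes, and the file-name prefix and labels are hypothetical.

```python
import csv

def write_outputs(scored_pairs, prefix="linkage"):
    # scored_pairs: (left_id, right_id, label) with label in
    # {"match", "uncertain", "nonmatch"}. The "uncertain" file is the
    # HIT worksheet handed to a human for manual resolution; the
    # resolved sheet is later read back into the pipeline.
    buckets = {"match": [], "uncertain": [], "nonmatch": []}
    for left, right, label in scored_pairs:
        buckets[label].append((left, right))
    for label, rows in buckets.items():
        with open(f"{prefix}_{label}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["left_id", "right_id"])
            writer.writerows(rows)

write_outputs([("M1", "F1", "match"),
               ("M2", "F2", "uncertain"),
               ("M3", "", "nonmatch")])
```

Keeping the HITs and their resolutions in files, rather than editing the data in place, is what makes the human judgment both repeatable and auditable.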
We had one manual review step at the end. After the second pass, we had 33 MedicareIDs and 105 FIDs that were still not matched. These were output as HITs to be matched manually. We easily found 28 matches manually. Of the 6 remaining, 3 were duplicate records and 3 did not participate in the survey.
We have developed and released VIEW under the GNU license to facilitate replicable methods in linking data. VIEW is a set of general SAS macros that implements the record linkage process described above for deterministic methods. It can be easily extended to perform probabilistic methods as well. Researchers can specify and control many aspects of the linkage, such as how to standardize and score data, by adding customized SAS code to designated files. As in our survey data, entities often have multiple names. Thus, VIEW provides an easy mechanism for properly managing multiple rows per entity, so that if any of the names match, the correct linkage is made by keeping the primary ID the same. More details can be found on the VIEW website [26].
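The multiple-rows-per-entity mechanism can be sketched as below. This is an illustrative Python sketch only (VIEW implements it as SAS macros), and the entity rows shown are hypothetical.

```python
def match_any_name(entity_rows, target_name):
    # entity_rows: list of (primary_id, name) where one entity may appear
    # on several rows (legal business name, DBA name, historical names).
    # A link is made if ANY of the entity's names matches, and the
    # primary ID stays the same regardless of which row matched.
    for primary_id, name in entity_rows:
        if name == target_name:
            return primary_id
    return None

rows = [("F42", "etmc tyler"),
        ("F42", "east texas medical center tyler")]  # same entity, two names
assert match_any_name(rows, "east texas medical center tyler") == "F42"
```

This is why the 800 survey names could resolve to 664 unique providers: extra name rows widen the net for matching without multiplying the entities being linked.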
Figure 4 depicts the full process for using VIEW to link the MedicareID to the FID. After using VIEW to iteratively standardize and clean the data, we (1) automatically linked 493 providers (both exact and approximate matches), (2) manually linked 28 providers, (3) confirmed valid no-links for 6 providers, and (4) confirmed approximate links for 78 providers and no link for 1 approximate link. Most of the linkages that were manually linked using HITs could never be coded as automatic matches due to the complexity and insufficient information in the database; we had to use additional information found on the internet to confirm the links. To obtain 100% precision, this application used conservative criteria for automatic matching, leading to more manually reviewed HITs.
Of all possible matches, determining the critical conditions that confirm an automatic match is important but difficult. In our study, we found that a clean 1-to-1 link can confirm a match automatically, whereas links that have multiple matches in the database were a signal for potential issues, such as similar standardized names among different entities or multiple providers located on the same street. This is because the chance of a provider in each of the databases being an exact 1-to-1 match purely by chance is negligible given the possible range of values [12]. Thus, using the 1-to-1 match criterion is a good rule for protection against false matches resulting from reducing too much variation through standardization, such that two different providers end up with the same standardized name. These errors show up as an N-to-1 match and need human judgment. This is intuitive in that, in sparse data spaces, linkages are easy, whereas in dense data spaces (i.e., many entities with similar names) linkages require more attention.
There are differences in how an entity is defined in different hospital ID systems. For example, the FID used in the survey data is closely associated with the hospital licensing number. Any change made to the provider license over time is reflected in the FID, and the FID is managed manually by state staff to ensure high quality data. Thus, we can track changes in names, addresses, or closures over time by tracking the same FID. This means that the same health system can have one or more FIDs depending on how it is licensed. On the other hand, the MedicareID is for billing, and an entity is a Medicare provider, which may or may not correspond to its licensed structure. The most common ID system used for hospitals, the NPI, often has multiple IDs for one health system, making entity resolution very difficult. In our linkage, the Medicare data had two entities from one health system with different zip codes that were adjacent (walking distance) but on the same street (different street numbers). In comparison, the survey data only had one entity from the same hospital system. This was one of the linkages we had to investigate beyond the data at hand. Based on the number of licensed beds in the survey and their website, we concluded the correct linkage.
A small number of providers have the same name even when they are different entities. Most of these are hospitals in the same system in different locations with separate licenses. In the survey data, we had 6 providers that had the same name for multiple entities. To differentiate them, we added the city name to the provider name. In addition, the standardization we used made both ‘University Medical Center’ and ‘University Hospital’ become ‘University’. Thus, exact matching on names can lead to 1 erroneous match. However, since these were hospitals in two different zip codes, these providers were properly matched to ‘University Medical Center: Lubbock’ and ‘University Health System’ respectively when we blocked on zip code before scoring.

Figure 4: Medicare ID to FID linkage process.

There were many providers on the same street, but in combination with street number and provider name, these did not cause problems. There was a pair of providers that had both a psychiatric license and an acute care license at the exact same address in the survey. There were also two in the Medicare data, which we could match up manually. ‘SSH South Dallas’ is located on the 4th floor of Methodist Charlton Medical Center, which also caused confusion and required human judgment.
The manual work in record linkage is inherently dependent on the data to be linked. Both errors in the data and gaps in how the same entity is represented in the different databases require iteratively interacting with the data, detecting these patterns, and coding them into the process to clean the data. VIEW is set up to make this process more efficient and tractable, but it cannot replace the hard work required for replicable research.
Similarities in provider names and addresses, the dynamic nature of hospitals over time, and the subtle differences in how entities are defined make it impossible to build a fully automated hospital linkage system. However, manually managing data linkages, even for a small number of entities and particularly over time, is inefficient, prone to human error, and difficult to replicate. Thus, effective software that can support the interactive and iterative process of record linkage, together with well-designed HITs, streamlines data linkage processes, supporting high quality replicable research using big data.
References

[1] Hye-Chung Kum, Ashok Krishnamurthy, Ashwin Machanavajjhala, and Stanley C Ahalt. Social genome: Putting big data to work for population informatics. Computer, 47(1):56–63, 2013.
[2] Cathy J Bradley, Lynne Penberthy, Kelly J Devers, and Debra J Holden. Health services research and data linkages: issues, methods, and directions for the future. Health Services Research, 45(5p2):1468–1488, 2010.
[3] Stacie B Dusetzina, Seth Tyree, Anne-Marie Meyer, Adrian Meyer, Laura Green, and William R Carpenter. An overview of record linkage methods. Linking Data for Health Services Research: A Framework and Instructional Guide [Internet], 2014.
[4] Ahmed K Elmagarmid, Panagiotis G Ipeirotis, and Vassilios S Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2006.
[5] Ivan P Fellegi and Alan B Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
[6] Erel Joffe, Michael J Byrne, Phillip Reeder, Jorge R Herskovic, Craig W Johnson, Allison B McCoy, Dean F Sittig, and Elmer V Bernstam. A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. Journal of the American Medical Informatics Association, 21(1):97–104, 2014.
[7] Hye-Chung Kum, Ashok Krishnamurthy, Ashwin Machanavajjhala, Michael K Reiter, and Stanley Ahalt. Privacy preserving interactive record linkage (PPIRL). Journal of the American Medical Informatics Association, 21(2):212–220, 2014.
[8] William E Winkler. Overview of record linkage and current research directions. In Bureau of the Census. Citeseer, 2006.
[9] Lynn M Etheredge. A rapid-learning health system: what would a rapid-learning health system look like, and how might we get there? Health Affairs, 26(Suppl1):w107–w118, 2007.
[10] Charles P Friedman, Adam K Wong, and David Blumenthal. Achieving a nationwide learning health system. Science Translational Medicine, 2(57):57cm29–57cm29, 2010.
[11] Hye-Chung Kum, Dean F Duncan, and C Joy Stewart. Supporting self-evaluation in local government via knowledge discovery and data mining. Government Information Quarterly, 26(2):295–304, 2009.
[12] Luiza Antonie, Kris Inwood, Daniel J Lizotte, and J Andrew Ross. Tracking people over time in 19th century Canada for longitudinal analysis. Machine Learning, 95(1):129–146, 2014.
[13] Ying Zhu, Yutaka Matsuyama, Yasuo Ohashi, and Soko Setoguchi. When to conduct probabilistic linkage vs. deterministic linkage? A simulation study. Journal of Biomedical Informatics, 56:80–86, 2015.
[14] Howard B Newcombe, James M Kennedy, SJ Axford, and Allison P James. Automatic linkage of vital records. Science, 130(3381):954–959, 1959.
[15] Shanti Gomatam, Randy Carter, Mario Ariet, and Glenn Mitchell. An empirical comparison of record linkage procedures. Statistics in Medicine, 21(10):1485–1496, 2002.
[16] Francis P Boscoe, Deborah Schrag, Kun Chen, Patrick J Roohan, and Maria J Schymura. Building capacity to assess cancer care in the Medicaid population in New York State. Health Services Research, 46(3):805–820, 2011.
[17] Sean M Randall, Anna M Ferrante, James H Boyd, and James B Semmens. The effect of data cleaning on record linkage quality. BMC Medical Informatics and Decision Making, 13(1):1–10, 2013.
[18] Hanna Köpcke, Andreas Thor, and Erhard Rahm. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 3(1-2):484–493, 2010.
[19] Hyunmo Kang, Lise Getoor, Ben Shneiderman, Mustafa Bilgic, and Louis Licamele. Interactive entity resolution in relational data: A visual analytic tool and its evaluation. IEEE Transactions on Visualization and Computer Graphics, 14(5):999–1014, 2008.
[20] Qiaomu Shen, Tongshuang Wu, Haiyan Yang, Yanhong Wu, Huamin Qu, and Weiwei Cui. NameClarifier: A visual analytics system for author name disambiguation. IEEE Transactions on Visualization and Computer Graphics, 23(1):141–150, 2016.
[21] Martha Larson, Mohammad Soleymani, Maria Eskevich, Pavel Serdyukov, Roeland Ordelman, and Gareth Jones. The community and the crowd: Multimedia benchmark dataset development. IEEE MultiMedia, 19(3):15–23, 2012.
[22] Janet M Bronstein, Charles T Lomatsch, David Fletcher, Terri Wooten, Tsai Mei Lin, Richard Nugent, and Curtis L Lowery. Issues and biases in matching Medicaid pregnancy episodes to vital records data: the Arkansas experience. Maternal and Child Health Journal.