Rebecca C. Steorts
Carnegie Mellon University
Publications
Featured research published by Rebecca C. Steorts.
Privacy in Statistical Databases | 2014
Rebecca C. Steorts; Samuel L. Ventura; Stephen E. Fienberg
Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as “private blocking.” We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues.
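As a rough illustration of the quantities compared in this paper, the following sketch (toy records and hypothetical field names, not the authors' code) builds traditional field-based blocks and computes recall and reduction ratio; an LSH-style variant would replace the exact field key with hash-bucket membership.

```python
from collections import defaultdict

def block_by_fields(records, fields):
    """Traditional blocking: records sharing all key fields land in one block."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in fields)
        blocks[key].append(rec["id"])
    return blocks

def candidate_pairs(blocks):
    """All within-block pairs that a record linkage step would compare."""
    pairs = set()
    for ids in blocks.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(frozenset((ids[i], ids[j])))
    return pairs

def recall(pairs, true_matches):
    """Fraction of true matching pairs retained by the blocking scheme."""
    return len(pairs & true_matches) / len(true_matches)

def reduction_ratio(pairs, n_records):
    """1 minus the fraction of all possible pairs that still must be compared."""
    total = n_records * (n_records - 1) / 2
    return 1.0 - len(pairs) / total

# Hypothetical toy records with one known duplicate (ids 1 and 3).
records = [
    {"id": 1, "last": "smith", "zip": "15213"},
    {"id": 2, "last": "jones", "zip": "15213"},
    {"id": 3, "last": "smith", "zip": "15213"},
]
blocks = block_by_fields(records, ["last", "zip"])
pairs = candidate_pairs(blocks)
print(recall(pairs, {frozenset((1, 3))}), reduction_ratio(pairs, len(records)))
```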
Journal of the American Statistical Association | 2016
Rebecca C. Steorts; Rob Hall; Stephen E. Fienberg
We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household Income and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
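A toy sketch of the bipartite representation described above (illustrative only, not the authors' sampler): each record carries a latent-entity label, records that share a label corefer, and posterior samples of the labels yield transitive link probabilities that can be propagated into downstream analyses.

```python
import numpy as np

# Hypothetical posterior samples of the linkage structure: entry j of each row
# is the index of the latent entity that record j (pooled across files) links to.
# Two records refer to the same person exactly when they share a latent index,
# so matching is transitive by construction.
posterior_samples = np.array([
    [0, 0, 1, 2, 2],   # records 0 and 1 match; records 3 and 4 match
    [0, 0, 1, 2, 3],   # records 0 and 1 match; record 4 is its own entity
    [0, 1, 1, 2, 2],   # records 1 and 2 match; records 3 and 4 match
])

def link_probability(samples, i, j):
    """Posterior probability that records i and j point to the same latent entity."""
    return np.mean(samples[:, i] == samples[:, j])

print(link_probability(posterior_samples, 0, 1))  # 2/3 under these toy samples
print(link_probability(posterior_samples, 3, 4))  # 2/3
```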
Bayesian Analysis | 2015
Rebecca C. Steorts
Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier hierarchical Bayesian (HB) approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey, showing our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyperparameters.
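The sketch below conveys the flavor of the two mechanisms described above, under assumptions of mine rather than the paper's exact specification: latent field values are drawn from the empirical distribution of the data, distorted categorical values are redrawn from that same distribution, and distorted string values are drawn with weights that decay in string distance from the latent value.

```python
import random
from math import exp
from difflib import SequenceMatcher

def empirical_draw(values):
    """Prior for a latent entity's field value: the empirical distribution of the data."""
    return random.choice(values)

def distorted_string(latent_value, observed_values, c=5.0):
    """Illustrative distortion mechanism (not the paper's exact model): an observed
    string is drawn with weight decreasing in its distance from the latent value,
    so distorted records lean toward similar-looking strings."""
    weights = [
        exp(-c * (1.0 - SequenceMatcher(None, latent_value, w).ratio()))
        for w in observed_values
    ]
    return random.choices(observed_values, weights=weights, k=1)[0]

# Hypothetical pool of observed German-style surnames.
names = ["schmidt", "schmitt", "meyer", "meier", "mueller"]
latent = empirical_draw(names)              # latent entity value
observed = distorted_string(latent, names)  # possibly distorted record value
print(latent, observed)

# A distorted categorical field would simply be redrawn from the empirical
# distribution, e.g. empirical_draw(categorical_values).
```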
Statistica Sinica | 2013
Rebecca C. Steorts; Malay Ghosh
We consider benchmarked empirical Bayes (EB) estimators under the basic area-level model of Fay and Herriot while requiring the standard benchmarking constraint. In this paper we determine the excess mean squared error (MSE) from constraining the estimates through benchmarking. We show that the increase due to benchmarking is O(m^{-1}), where m is the number of small areas. Furthermore, we find an asymptotically unbiased estimator of this MSE and compare it to the second-order approximation of the MSE of the EB estimator or, equivalently, of the MSE of the empirical best linear unbiased predictor (EBLUP), which was derived by Prasad and Rao (1990). Moreover, using methods similar to those of Butar and Lahiri (2003), we compute a parametric bootstrap estimator of the MSE of the benchmarked EB estimator under the Fay-Herriot model and compare it to the MSE of the benchmarked EB estimator found by a second-order approximation. Finally, we illustrate our methods using SAIPE data from the U.S. Census Bureau and in a simulation study.
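For orientation, here is the standard Fay-Herriot setup together with one common difference-type benchmarking adjustment; the notation is generic, and the exact benchmarked estimator analyzed in the paper may differ.

```latex
% Area-level Fay-Herriot model for small areas i = 1, ..., m:
\hat\theta_i = \theta_i + e_i, \qquad \theta_i = x_i^\top \beta + u_i,
\qquad e_i \sim N(0, D_i), \quad u_i \sim N(0, A).

% Empirical Bayes (EB) estimator with estimated shrinkage factor:
\hat\theta_i^{\mathrm{EB}} = (1 - \hat B_i)\,\hat\theta_i + \hat B_i\, x_i^\top \hat\beta,
\qquad \hat B_i = \frac{D_i}{\hat A + D_i}.

% Benchmarking constraint (weights w_i \ge 0 with \sum_i w_i = 1) and a
% difference-adjusted estimator that satisfies it:
\sum_{i=1}^m w_i \tilde\theta_i = \sum_{i=1}^m w_i \hat\theta_i,
\qquad
\tilde\theta_i = \hat\theta_i^{\mathrm{EB}} + \sum_{j=1}^m w_j \bigl(\hat\theta_j - \hat\theta_j^{\mathrm{EB}}\bigr).

% The paper shows the excess MSE from imposing the constraint is of order
% MSE(\tilde\theta_i) - MSE(\hat\theta_i^{\mathrm{EB}}) = O(m^{-1}).
```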
The Annals of Applied Statistics | 2018
Beidi Chen; Anshumali Shrivastava; Rebecca C. Steorts
Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the unique number of entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random sampling based approaches. In addition, we empirically show its superiority over the state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of 191,874 ± 1,772 documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of the challenges and effort involved in solving a real, noisy, and challenging problem where modeling assumptions may not hold.
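A heavily simplified sketch of the hashing idea (toy strings, generic minhash banding; not the paper's estimator, which also corrects for pairs the hashing step misses and supplies standard errors): locality sensitive hashing produces candidate pairs without all-to-all comparison, and unique entities are then counted as connected components of the resulting match graph.

```python
from collections import defaultdict

def shingles(s, k=2):
    """Character k-grams used as the set representation of a record string."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash_signature(shingle_set, seeds):
    """One minhash value per seeded hash function."""
    return tuple(min(hash((seed, sh)) for sh in shingle_set) for seed in seeds)

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

records = ["john smith", "jon smith", "maria garcia", "maria garcia jr", "wei chen"]
shingle_sets = [shingles(r) for r in records]

# LSH step: band the minhash signature so that similar records collide in some
# bucket, avoiding the intractable all-to-all comparison of records.
buckets = defaultdict(list)
for idx, sig in enumerate(minhash_signature(s, range(8)) for s in shingle_sets):
    for b_idx, band in enumerate((sig[:4], sig[4:])):
        buckets[(b_idx, band)].append(idx)

# Link only candidate pairs that share a bucket and pass a similarity check.
uf = UnionFind(len(records))
for ids in buckets.values():
    for i in ids:
        for j in ids:
            if i < j and jaccard(shingle_sets[i], shingle_sets[j]) > 0.5:
                uf.union(i, j)

# Crude unique-entity count: connected components of the discovered match graph
# (stochastic, since it depends on which similar pairs collide in a bucket).
print(len({uf.find(i) for i in range(len(records))}))
```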
Test | 2011
Gauri Sankar Datta; Malay Ghosh; Rebecca C. Steorts; J. Maples
International Conference on Artificial Intelligence and Statistics | 2014
Rebecca C. Steorts; Rob Hall; Stephen E. Fienberg
The Annals of Applied Statistics | 2015
Leila Wehbe; Aaditya Ramdas; Rebecca C. Steorts; Cosma Shalizi
Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest, noise is substantial, and stimuli are high dimensional. Hence, direct estimates are inherently imprecise and call for regularization. We compare a suite of approaches which regularize via shrinkage: ridge regression, the elastic net (a generalization of ridge regression and the lasso), and a hierarchical Bayesian model based on small area estimation (SAE). We contrast regularization with spatial smoothing and combinations of smoothing and shrinkage. All methods are tested on functional magnetic resonance imaging (fMRI) data from multiple subjects participating in two different experiments related to reading, for both predicting neural response to stimuli and decoding stimuli from responses. Interestingly, when the regularization parameters are chosen by cross-validation independently for every voxel, low/high regularization is chosen in voxels where the classification accuracy is high/low, indicating that the regularization intensity is a good tool for identification of relevant voxels for the cognitive task. Surprisingly, all the regularization methods work about equally well, suggesting that beating basic smoothing and shrinkage will take not only clever methods, but also careful modeling.
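A small numpy sketch of the kind of shrinkage regularization compared above (toy data and hypothetical dimensions, not the paper's fMRI pipeline): ridge regression for a single voxel with the penalty chosen by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n stimuli, p stimulus features, one voxel's response.
n, p = 60, 100
X = rng.standard_normal((n, p))
beta_true = np.concatenate([rng.standard_normal(5), np.zeros(p - 5)])
y = X @ beta_true + rng.standard_normal(n)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """k-fold cross-validated prediction error for one penalty value."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for hold in folds:
        train = np.setdiff1d(np.arange(len(y)), hold)
        b = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[hold] - X[hold] @ b) ** 2))
    return np.mean(errs)

# Choose the penalty for this voxel by cross-validation, as described above.
lambdas = [0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_mse(X, y, lam))
print("selected penalty:", best)
```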
Test | 2013
Malay Ghosh; Rebecca C. Steorts
arXiv: Methodology | 2016
Jeffrey W. Miller; Brenda Betancourt; Abbas Zaidi; Hanna M. Wallach; Rebecca C. Steorts