Rob Hall
Carnegie Mellon University
Publication
Featured research published by Rob Hall.
Privacy in Statistical Databases | 2010
Rob Hall; Stephen E. Fienberg
Record linkage has a long tradition in both the statistical and the computer science literature. We survey current approaches to the record linkage problem in a privacy-aware setting and contrast these with the more traditional literature. We also identify several important open questions that pertain to private record linkage from different perspectives.
International Parallel and Distributed Processing Symposium | 2007
Rob Hall; Arnold L. Rosenberg; Arun Venkataramani
A fundamental challenge in Internet computing (IC) is to efficiently schedule computations having complex inter-job dependencies, given the unpredictability of remote machines in both availability and time of access. The recent IC scheduling theory focuses on these sources of unpredictability by crafting schedules that maximize the number of executable jobs at every point in time. In this paper, we experimentally investigate the key question: does IC scheduling yield significant positive benefits for real IC? To this end, we develop a realistic computation model to match jobs to client machines and conduct extensive simulations to compare IC-optimal schedules against popular, intuitively compelling heuristics. Our results suggest that for a large range of computation-dags, client availability patterns, and two quite different performance metrics, IC-optimal schedules significantly outperform schedules produced by popular heuristics, by as much as 10-20%.
Journal of the American Statistical Association | 2016
Rebecca C. Steorts; Rob Hall; Stephen E. Fienberg
We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
Knowledge Discovery and Data Mining | 2008
Rob Hall; Charles A. Sutton; Andrew McCallum
Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent -- because venues tend to focus on a few research areas -- but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set of latent variables to control two disparate clustering models: a Dirichlet-multinomial model over titles, and a non-exchangeable string-edit model over venues. We show that modeling cross-field dependence yields a substantial improvement in performance -- a 58% reduction in error over a standard Dirichlet process mixture.
Conference on Information and Knowledge Management | 2009
Deepak Agarwal; Evgeniy Gabrilovich; Rob Hall; Vanja Josifovski; Rajiv Khanna
Information retrieval systems conventionally assess document relevance using the bag-of-words model. Consequently, relevance scores of documents retrieved for different queries are often difficult to compare, as they are computed on different (or even disjoint) sets of textual features. Many tasks, such as federation of search results or global thresholding of relevance scores, require that scores be globally comparable. To achieve this, we propose methods for non-monotonic transformation of relevance scores into probabilities for a contextual advertising selection engine that uses a vector space model. The calibration of the raw scores is based on historical click data.
Privacy in Statistical Databases | 2012
Rob Hall; Stephen E. Fienberg
We develop a statistical process for determining a confidence set for an unknown bipartite matching. It requires only modest assumptions on the nature of the distribution of the data. The confidence set involves a set of linear constraints on the bipartite matching, which permits efficient analysis of the matched data, e.g., using linear regression, while maintaining the proper degree of uncertainty about the linkage itself.
Archive | 2007
Aron Culotta; Pallika H. Kanani; Rob Hall; Michael L. Wick; Andrew McCallum
Journal of Official Statistics | 2011
Rob Hall; Stephen E. Fienberg; Yuval Nardi
Journal of Machine Learning Research | 2013
Rob Hall; Alessandro Rinaldo; Larry Wasserman
Knowledge Discovery and Data Mining | 2014
Diane J. Hu; Rob Hall; Josh Attenberg