Michaela Götz
Cornell University
Publication
Featured research published by Michaela Götz.
international conference on management of data | 2010
Arvind Arasu; Michaela Götz; Raghav Kaushik
We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching since manually identifying a suitable set of labeled examples is difficult. Previous algorithms that use active learning for record matching have serious limitations: the packages that they learn lack quality guarantees, and the algorithms do not scale to large input sizes. We present new algorithms for this problem that overcome these limitations. Our algorithms are fundamentally different from traditional active learning approaches and are designed from the ground up to exploit problem characteristics specific to record matching. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.
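The active learning setting described above can be illustrated with a toy sketch. It is not the algorithm from the paper. It assumes each candidate record pair has a single integer similarity score, that labels are monotone in that score (every pair at or above some unknown threshold is a match), and that a hypothetical `oracle` function stands in for the human labeler. Under those assumptions the learner can choose which pairs to label by binary search, so it needs only a logarithmic number of labels.

```python
def learn_threshold(scores, oracle):
    """Find the index of the first matching pair by asking for few labels.

    scores: similarity scores sorted in ascending order (assumed input).
    oracle: hypothetical labeling function, score -> True if the pair matches.
    Assumes monotone labels: once a score matches, all higher scores match.
    Returns (index of first match, number of labels requested).
    """
    lo, hi = 0, len(scores)
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1                  # the learner picks this example itself
        if oracle(scores[mid]):
            hi = mid                  # first match is at mid or earlier
        else:
            lo = mid + 1              # first match is after mid
    return lo, queries


if __name__ == "__main__":
    scores = list(range(100))         # similarity in percent, one per pair
    first_match, asked = learn_threshold(scores, lambda s: s >= 73)
    print(first_match, asked)
```

A passive learner given randomly chosen labeled pairs would typically need far more labels to locate the same threshold, which is the motivation the abstract gives for letting the algorithm pick the examples.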
international conference on management of data | 2012
Michaela Götz; Suman Nath; Johannes Gehrke
The rise of smartphones equipped with various sensors has enabled personalization of various applications based on user contexts extracted from sensor readings. At the same time, it has raised serious concerns about the privacy of user contexts. In this paper, we present MASKIT, a technique to filter a user context stream that provably preserves privacy. The filtered context stream can be released to applications or be used to answer their queries. Privacy is defined with respect to a set of sensitive contexts specified by the user. MASKIT limits what adversaries can learn from the filtered stream about the user being in a sensitive context, even if the adversaries are powerful and have knowledge about the filtering system and temporal correlations in the context stream. At the heart of MASKIT is a privacy check deciding whether to release or suppress the current user context. We present two novel privacy checks and explain how to choose the one with the higher utility for a user. Our experiments on real smartphone context traces of 91 users demonstrate the high utility of MASKIT.
very large data bases | 2009
Ashwin Machanavajjhala; Johannes Gehrke; Michaela Götz
Privacy in data publishing has received much attention recently. The key to defining privacy is to model the attacker's knowledge: if the attacker is assumed to know too little, the published data can be easily attacked; if the attacker is assumed to know too much, the published data has little utility. Previous work considered either quite ignorant adversaries or nearly omniscient adversaries. In this paper, we introduce a new class of adversaries that we call realistic adversaries, who live in the unexplored space in between. Realistic adversaries have knowledge from external sources with an associated stubbornness indicating the strength of their knowledge. We then introduce a novel privacy framework called epsilon-privacy that allows us to guard against realistic adversaries. We also show that prior privacy definitions are instantiations of our framework. In a thorough experimental study with real census data, we show that epsilon-privacy allows us to publish data with high utility while defending against strong adversaries.
international conference on management of data | 2010
Truls Amundsen Bjørklund; Michaela Götz; Johannes Gehrke
More and more important data is accumulated inside social networks. Limiting the flow of private information across a social network is very important, and most social networks provide sophisticated privacy settings to control this flow. Such extensive access control knobs make the search for content a hard problem, since each user sees a unique subset of all the data. In this work, we take a first step toward integrating access control based on a social network into a search system. We describe a set of solutions to the problem, including what indexes to construct and how to filter out inaccessible results. An experimental analysis illustrates the tradeoffs of the various strategies, and we point out a set of interesting future research directions in this area.
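One of the simplest strategies in this design space is to keep a single global keyword index and filter out inaccessible results afterwards. The sketch below is only an illustration of that idea, not a reconstruction of the paper's system. The access rule (a user may see documents written by themselves or by their friends) and all names (`index`, `author_of`, `friends`) are assumptions made for the example.

```python
def search_with_post_filter(index, author_of, friends, user, keyword):
    """Look up a keyword in a global index, then drop inaccessible hits.

    index:     dict keyword -> list of document ids (hypothetical global index)
    author_of: dict document id -> author
    friends:   dict user -> set of friends
    Assumed access rule: a user sees their own and their friends' documents.
    """
    visible_authors = friends.get(user, set()) | {user}
    return [doc for doc in index.get(keyword, [])
            if author_of[doc] in visible_authors]


if __name__ == "__main__":
    index = {"hiking": [1, 2, 3, 4]}
    author_of = {1: "ann", 2: "bob", 3: "eve", 4: "ann"}
    friends = {"carl": {"ann"}}
    print(search_with_post_filter(index, author_of, friends, "carl", "hiking"))
```

Post-filtering keeps the index small but may scan many results the user cannot see. Building separate indexes per author or per group shifts that cost to index maintenance and query-time merging, which is the kind of tradeoff the abstract refers to.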
Information Systems | 2009
Michaela Götz; Christoph Koch; Wim Martens
Tree pattern matching is a fundamental problem that has a wide range of applications in Web data management, XML processing, and selective data dissemination. In this paper we develop efficient algorithms for the tree homeomorphism problem, i.e., the problem of matching a tree pattern with exclusively transitive (descendant) edges. We first prove that deciding whether there is a tree homeomorphism is LOGSPACE-complete, improving on the current LOGCFL upper bound. Furthermore, we develop a practical algorithm for the tree homeomorphism decision problem that is both space- and time-efficient. The algorithm is in LOGDCFL and space consumption is strongly bounded, while the running time is linear in the size of the data tree. This algorithm immediately generalizes to the problem of matching the tree pattern against all subtrees of the data tree, preserving the mentioned efficiency properties.
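To make the problem statement concrete, the following sketch decides whether a pattern with descendant-only edges matches a data tree. It is a plain bottom-up dynamic program, not the space-efficient algorithm from the paper: it runs in time proportional to the product of the two tree sizes and stores a set of pattern nodes per recursion level. The `Node` class and function names are hypothetical, and the sketch assumes that sibling pattern nodes may map to the same data node, as in usual tree-pattern query semantics.

```python
class Node:
    """A labeled, unordered tree node (hypothetical helper type)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def match_roots(pattern, data):
    """Return the data nodes at which the whole pattern can be matched.

    A pattern node p matches at data node d if their labels agree and every
    child of p matches at some proper descendant of d.
    """
    # Number the pattern nodes so that sets of them are cheap to store.
    pnodes, stack = [], [pattern]
    while stack:
        p = stack.pop()
        pnodes.append(p)
        stack.extend(p.children)
    index = {id(p): i for i, p in enumerate(pnodes)}
    roots = []

    def visit(d):
        # below: pattern nodes that match at some proper descendant of d
        below = set()
        for child in d.children:
            below |= visit(child)
        matched = {
            i for i, p in enumerate(pnodes)
            if p.label == d.label
            and all(index[id(c)] in below for c in p.children)
        }
        if index[id(pattern)] in matched:
            roots.append(d)
        # Everything matched at d or below d is "below" for d's parent.
        return matched | below

    visit(data)
    return roots


def matches_at_root(pattern, data):
    """Decision version: does the pattern root match at the data root?"""
    return any(d is data for d in match_roots(pattern, data))
```

`match_roots` corresponds to the generalization mentioned in the abstract, matching the pattern against all subtrees of the data tree, while `matches_at_root` is the plain decision problem.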
international conference on database theory | 2009
Michaela Götz; Christoph Koch
The ability to flexibly compose confidence computation with the operations of relational algebra is an important feature of probabilistic database query languages. Computing confidences is computationally hard, however, and has to be approximated in practice. In a compositional query language, even very small errors caused by approximation can lead to an entirely incorrect result: a selection operation on an approximated probability can incorrectly keep or drop a tuple even if the probability value has been approximated to a very narrow confidence interval. In this paper, we study the query evaluation problem for compositional query languages for probabilistic databases, with particular focus on providing overall result quality guarantees in the face of approximate intermediate results. We present a framework for evaluating compositional queries based on a new representation system that can capture uncertainty about probabilities. More specifically, we consider probability intervals instead of exact probabilities, interpreting tuples obtained by selection on approximate values as unreliable. We study the complexity of query evaluation over our new model. We present efficient confidence computation algorithms which compute bounds that are close to tight for important classes. For deciding a selection predicate, we show that no efficient randomized algorithm exists unless BPP⊃NP. Still, we are able to efficiently guess robust predicates with a good error bound. Putting all these pieces together in our framework, we evaluate queries using a decomposition into a relational algebra plan and an approximation plan. The latter allows accuracy and error bounds to be improved successively, while the relational algebra plan only has to be executed once.
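The role of probability intervals in a selection can be shown with a small sketch. It is an illustration of the idea only, under the assumption that each tuple carries a lower and an upper bound on its confidence and that the selection has the form "confidence at least some threshold". The names `Interval` and `select_at_least` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Lower and upper bound on an approximated probability."""
    lo: float
    hi: float


def select_at_least(conf, threshold):
    """Three-valued outcome of the selection 'confidence >= threshold'.

    Returns "keep" or "drop" when the whole interval lies on one side of the
    threshold, and "unreliable" when the interval straddles it.
    """
    if conf.lo >= threshold:
        return "keep"
    if conf.hi < threshold:
        return "drop"
    return "unreliable"


if __name__ == "__main__":
    # A very narrow interval can still straddle the threshold.
    print(select_at_least(Interval(0.499, 0.501), 0.5))
    print(select_at_least(Interval(0.62, 0.70), 0.5))
    print(select_at_least(Interval(0.10, 0.20), 0.5))
```

Tuples reported as unreliable are the ones for which tightening the interval further could still change the result, which is why refining the approximation and re-checking the predicate is useful without rerunning the relational part of the query.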
database programming languages | 2007
Michaela Götz; Christoph Koch; Wim Martens
Tree pattern matching is a fundamental problem that has a wide range of applications in Web data management, XML processing, and selective data dissemination. In this paper we develop efficient algorithms for the tree homeomorphism problem, i.e., the problem of matching a tree pattern with exclusively transitive (descendant) edges. We first prove that deciding whether there is a tree homeomorphism is LOGSPACE-complete, improving on the current LOGCFL upper bound. As our main result, we develop a practical algorithm for the tree homeomorphism decision problem that is both space- and time-efficient. The algorithm is in LOGDCFL and space consumption is strongly bounded, while the running time is linear in the size of the data tree. This algorithm immediately generalizes to the problem of matching the tree pattern against all subtrees of the data tree, preserving the mentioned efficiency properties.
conference on information and knowledge management | 2011
Truls Amundsen Bjørklund; Michaela Götz; Johannes Gehrke; Nils Grimsmo
More and more data is accumulated inside social networks. Keyword search provides a simple interface for exploring this content. However, a lot of the content is private, and a search system must enforce the privacy settings of the social network. In this paper, we present a workload-aware keyword search system with access control based on a social network. We make two technical contributions: (1) HeapUnion, a novel union operator that improves processing of search queries with access control by up to a factor of two compared to the best previous solution; and (2) highly accurate cost models that vary in sophistication and accuracy; these cost models provide input to an optimization algorithm that selects the most efficient organization of access control meta-data for a given workload. Our experimental results with real and synthetic data show that our approach outperforms previous work by up to a factor of three.
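The abstract does not describe how HeapUnion works internally, so the sketch below shows only the generic building block it relates to: a heap-based union of sorted posting lists, followed by an intersection with a keyword's posting list. It uses Python's standard `heapq.merge`. The per-author organization of posting lists, the access rule, and all function and parameter names are assumptions made for illustration.

```python
import heapq


def union_sorted(posting_lists):
    """Merge ascending lists of document ids, keeping each id once."""
    merged = []
    for doc_id in heapq.merge(*posting_lists):   # heap-based k-way merge
        if not merged or merged[-1] != doc_id:
            merged.append(doc_id)
    return merged


def intersect_sorted(a, b):
    """Intersect two ascending lists with a linear two-pointer scan."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(keyword_postings, author_postings, visible_authors):
    """Return keyword hits restricted to documents by visible authors.

    keyword_postings: ascending document ids containing the keyword
    author_postings:  dict author -> ascending document ids (assumed layout)
    visible_authors:  authors whose documents the searcher may see (assumed)
    """
    accessible = union_sorted(
        author_postings.get(author, []) for author in visible_authors
    )
    return intersect_sorted(keyword_postings, accessible)
```

How the access-control lists are grouped (per author, per group of authors, or globally) determines how many lists must be merged per query, which is the kind of choice a workload-aware cost model would be used to make.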
IEEE Transactions on Knowledge and Data Engineering | 2012
Michaela Götz; Ashwin Machanavajjhala; Guozhang Wang; Xiaokui Xiao; Johannes Gehrke
arXiv: Databases | 2009
Michaela Götz; Ashwin Machanavajjhala; Guozhang Wang; Xiaokui Xiao; Johannes Gehrke