Kristen LeFevre
University of Michigan
Publication
Featured research published by Kristen LeFevre.
ACM Transactions on Database Systems | 2008
Kristen LeFevre; David J. DeWitt; Raghu Ramakrishnan
Protecting individual privacy is an important problem in microdata distribution and publishing. Anonymization algorithms typically aim to satisfy certain privacy definitions with minimal impact on the quality of the resulting data. While much of the previous literature has measured quality through simple one-size-fits-all measures, we argue that quality is best judged with respect to the workload for which the data will ultimately be used. This article provides a suite of anonymization algorithms that incorporate a target class of workloads, consisting of one or more data mining tasks as well as selection predicates. An extensive empirical evaluation indicates that this approach is often more effective than previous techniques. In addition, we consider the problem of scalability. The article describes two extensions that allow us to scale the anonymization algorithms to datasets much larger than main memory. The first extension is based on ideas from scalable decision trees, and the second is based on sampling. A thorough performance evaluation indicates that these techniques are viable in practice.
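To make the workload-aware idea above concrete, the following is a minimal sketch (not the article's algorithm) of top-down partitioning that only accepts splits leaving at least k records on each side and prefers splits that best separate a classification label from the target workload. The record layout, the entropy heuristic, and the toy data are all assumptions chosen purely for illustration.

```python
# Minimal sketch of top-down, workload-aware partitioning for k-anonymity.
# Illustration only, not the algorithm from the article; the record layout,
# the entropy heuristic, and k are assumptions.
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of the class labels in one partition."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def split_gain(records, dim, label_idx):
    """Information gain of splitting a partition at the median of dimension `dim`."""
    values = sorted(r[dim] for r in records)
    median = values[len(values) // 2]
    left = [r for r in records if r[dim] < median]
    right = [r for r in records if r[dim] >= median]
    if not left or not right:
        return None, None, None
    labels = [r[label_idx] for r in records]
    gain = entropy(labels) - (
        len(left) / len(records) * entropy([r[label_idx] for r in left])
        + len(right) / len(records) * entropy([r[label_idx] for r in right])
    )
    return gain, left, right

def anonymize(records, quasi_dims, label_idx, k):
    """Recursively split partitions while each side keeps at least k records,
    preferring splits that best separate the workload's class label."""
    best = None
    for dim in quasi_dims:
        gain, left, right = split_gain(records, dim, label_idx)
        if gain is not None and len(left) >= k and len(right) >= k:
            if best is None or gain > best[0]:
                best = (gain, left, right)
    if best is None:
        return [records]          # no allowable split: emit one equivalence class
    _, left, right = best
    return anonymize(left, quasi_dims, label_idx, k) + anonymize(right, quasi_dims, label_idx, k)

# toy records: (age, zip_prefix, class_label)
data = [(34, 481, "yes"), (36, 481, "no"), (29, 482, "yes"),
        (41, 483, "no"), (45, 483, "no"), (31, 482, "yes")]
for group in anonymize(data, quasi_dims=[0, 1], label_idx=2, k=2):
    print(group)
```

Each returned partition would then be generalized into a single equivalence class, for example by replacing each quasi-identifier value with the range it spans within the partition.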
very large data bases | 2011
Daniel Fabbri; Kristen LeFevre
To comply with emerging privacy laws and regulations, it has become common for applications like electronic health records systems (EHRs) to collect access logs, which record each time a user (e.g., a hospital employee) accesses a piece of sensitive data (e.g., a patient record). Using the access log, it is easy to answer simple queries (e.g., "Who accessed Alice's medical record?"), but this often does not provide enough information. In addition to learning who accessed their medical records, patients will likely want to understand why each access occurred. In this paper, we introduce the problem of generating explanations for individual records in an access log. The problem is motivated by user-centric auditing applications, and it also provides a novel approach to misuse detection. We develop a framework for modeling explanations, which is based on a fundamental observation: For certain classes of databases, including EHRs, the reason for most data accesses can be inferred from data stored elsewhere in the database. For example, if Alice has an appointment with Dr. Dave, this information is stored in the database, and it explains why Dr. Dave looked at Alice's record. Large numbers of data accesses can be explained using general forms called explanation templates. Rather than requiring an administrator to manually specify explanation templates, we propose a set of algorithms for automatically discovering frequent templates from the database (i.e., those that explain a large number of accesses). We also propose techniques for inferring collaborative user groups, which can be used to enhance the quality of the discovered explanations. Finally, we have evaluated our proposed techniques using an access log and data from the University of Michigan Health System. Our results demonstrate that in practice we can provide explanations for over 94% of data accesses in the log.
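The appointment example above can be made concrete with a small sketch of how one candidate explanation template could be evaluated against an access log. The table and column names here are invented for illustration; in the paper, templates are discovered automatically rather than hard-coded.

```python
# Minimal sketch of "explaining" access-log entries from data stored elsewhere
# in the database. Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE access_log (log_id INTEGER, user_name TEXT, patient TEXT, access_date TEXT);
CREATE TABLE appointments (doctor TEXT, patient TEXT, appt_date TEXT);

INSERT INTO access_log VALUES (1, 'Dr. Dave',  'Alice', '2011-03-01'),
                              (2, 'Dr. Dave',  'Bob',   '2011-03-01'),
                              (3, 'Nurse Nan', 'Alice', '2011-03-02');
INSERT INTO appointments VALUES ('Dr. Dave', 'Alice', '2011-03-01'),
                                ('Dr. Dave', 'Bob',   '2011-03-01');
""")

# One candidate explanation template: "the user accessed the record because the
# user had an appointment with the patient on the day of the access."
explained = conn.execute("""
    SELECT l.log_id, l.user_name, l.patient
    FROM access_log l
    JOIN appointments a
      ON a.doctor = l.user_name AND a.patient = l.patient AND a.appt_date = l.access_date
""").fetchall()

total = conn.execute("SELECT COUNT(*) FROM access_log").fetchone()[0]
print(f"template explains {len(explained)} of {total} accesses")   # support of this template
for log_id, user, patient in explained:
    print(f"access {log_id}: {user} had an appointment with {patient}")
```

A template with high support (it explains many accesses) is the kind of candidate the paper's discovery algorithms would keep; the unexplained accesses are the ones left for auditing.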
very large data bases | 2009
Jing Zhang; Adriane Chapman; Kristen LeFevre
Database provenance chronicles the history of updates and modifications to data, and has received much attention due to its central role in scientific data management. However, the use of provenance information still requires a leap of faith. Without additional protections, provenance records are vulnerable to accidental corruption, and even malicious forgery, a problem that is most pronounced in the loosely-coupled multi-user environments often found in scientific research. This paper investigates the problem of providing integrity and tamper-detection for database provenance. We propose a checksum-based approach, which is well-suited to the unique characteristics of database provenance, including non-linear provenance objects and provenance associated with multiple fine granularities of data. We demonstrate that the proposed solution satisfies a set of desirable security properties, and that the additional time and space overhead incurred by the checksum approach is manageable, making the solution feasible in practice.
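As a rough illustration of the checksum idea (not the paper's construction, which also covers non-linear provenance and multiple granularities of data), the sketch below chains a digest across a sequence of provenance records so that any later modification changes the final checksum.

```python
# Minimal sketch of checksum-based tamper detection for a linear sequence of
# provenance records. Illustration only; record fields are assumptions.
import hashlib
import json

def record_digest(record, prev_digest):
    """Chain each provenance record to the digest of the record before it."""
    payload = json.dumps(record, sort_keys=True).encode() + prev_digest
    return hashlib.sha256(payload).hexdigest().encode()

def checksum(provenance):
    digest = b"genesis"
    for record in provenance:
        digest = record_digest(record, digest)
    return digest

provenance = [
    {"op": "insert", "table": "samples", "row": 17, "by": "alice"},
    {"op": "update", "table": "samples", "row": 17, "by": "bob"},
]
stored = checksum(provenance)        # kept in a trusted audit store

provenance[1]["by"] = "mallory"      # later, someone forges the second record
print("tamper detected:", checksum(provenance) != stored)   # True
```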
Journal of the American Medical Informatics Association | 2013
Daniel Fabbri; Kristen LeFevre
OBJECTIVE Ensuring the security and appropriate use of patient health information contained within electronic medical records systems is challenging. Observing these difficulties, we present an addition to the explanation-based auditing system (EBAS) that attempts to determine the clinical or operational reason why accesses to medical records occur, based on patient diagnosis information. Accesses that can be explained with a reason are filtered so that the compliance officer has fewer suspicious accesses to review manually. METHODS Our hypothesis is that specific hospital employees are responsible for treating a given diagnosis. For example, Dr. Carl accessed Alice's medical record because Hem/Onc employees are responsible for chemotherapy patients. We present metrics to determine which employees are responsible for a diagnosis and quantify their confidence. The auditing system attempts to use this responsibility information to determine the reason why an access occurred. We evaluate the auditing system's classification quality using data from the University of Michigan Health System. RESULTS The EBAS correctly determines which departments are responsible for a given diagnosis. Adding this responsibility information to the EBAS increases the number of first accesses explained by a factor of two over previous work and explains over 94% of all accesses with high precision. CONCLUSIONS The EBAS serves as a complementary security tool for personal health information. It filters a majority of accesses such that it is more feasible for a compliance officer to review the remaining suspicious accesses manually.
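A minimal sketch of the responsibility idea described above: score how strongly a department is associated with a diagnosis from past accesses, and use high-confidence pairs to explain new accesses. The 0.5 threshold and the toy access data are assumptions, not values from the paper.

```python
# Minimal sketch of a department-responsibility score for a diagnosis.
# Illustration only; threshold and data are assumptions.
from collections import Counter

# (department of accessing employee, diagnosis of accessed patient)
past_accesses = [
    ("Hem/Onc", "chemotherapy"), ("Hem/Onc", "chemotherapy"),
    ("Hem/Onc", "chemotherapy"), ("Radiology", "chemotherapy"),
    ("Cardiology", "arrhythmia"), ("Cardiology", "arrhythmia"),
]

by_diagnosis = Counter(d for _, d in past_accesses)
by_pair = Counter(past_accesses)

def responsibility(department, diagnosis):
    """Fraction of accesses to patients with `diagnosis` made by `department`."""
    return by_pair[(department, diagnosis)] / by_diagnosis[diagnosis]

THRESHOLD = 0.5
def explain(department, diagnosis):
    conf = responsibility(department, diagnosis)
    if conf >= THRESHOLD:
        return f"{department} is responsible for {diagnosis} patients (confidence {conf:.2f})"
    return None   # left for the compliance officer to review

print(explain("Hem/Onc", "chemotherapy"))    # explained
print(explain("Radiology", "chemotherapy"))  # None -> flagged for manual review
```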
very large data bases | 2010
Daniel Fabbri; Kristen LeFevre; Qiang Zhu
Recent legislation has increased the requirements of organizations to report data breaches, or unauthorized access to data. While access control policies are used to restrict access to a database, these policies are complex and difficult to configure. As a result, misconfigurations sometimes allow users access to unauthorized data. In this paper, we consider the problem of reporting data breaches after such a misconfiguration is detected. To locate past SQL queries that may have revealed unauthorized information, we introduce the novel idea of a misconfiguration response (MR) query. The MR-query cleanly addresses the challenges of information propagation within the database by replaying the log of operations and returning all logged queries for which the result has changed due to the misconfiguration. A strawman implementation of the MR-query would go back in time and replay all the operations that occurred in the interim, with the correct policy. However, re-executing all operations is inefficient. Instead, we develop techniques to improve reporting efficiency by reducing the number of operations that must be re-executed and reducing the cost of replaying the operations. An extensive evaluation shows that our method can reduce the total runtime by up to an order of magnitude.
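The strawman described above can be illustrated with a small sketch: re-run each logged query under both the misconfigured and the corrected policy and report the queries whose answers differ. The row-level policy representation is invented for illustration, and this sketch deliberately replays everything rather than applying the paper's pruning optimizations.

```python
# Minimal sketch of the misconfiguration-response idea: replay logged queries under
# the corrected policy and report those whose results change. Illustration only.

patients = [
    {"name": "Alice", "ward": "oncology",   "diagnosis": "chemotherapy"},
    {"name": "Bob",   "ward": "cardiology", "diagnosis": "arrhythmia"},
]

def run(query_pred, policy_pred):
    """Evaluate a logged selection under a row-level policy."""
    return [r["name"] for r in patients if policy_pred(r) and query_pred(r)]

# Logged queries: (query id, selection predicate, user who issued it)
query_log = [
    (1, lambda r: r["diagnosis"] == "chemotherapy", "nurse_nan"),
    (2, lambda r: r["ward"] == "cardiology", "nurse_nan"),
]

# The misconfigured policy let nurse_nan see every ward; the corrected policy does not.
misconfigured = lambda r: True
corrected = lambda r: r["ward"] == "cardiology"

breaches = [
    qid for qid, pred, user in query_log
    if user == "nurse_nan" and run(pred, misconfigured) != run(pred, corrected)
]
print("queries whose results depended on the misconfiguration:", breaches)   # [1]
```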
extending database technology | 2010
Lujun Fang; Kristen LeFevre
Data mining is increasingly performed by people who are not computer scientists or professional programmers. It is often done as an iterative process involving multiple ad-hoc tasks, as well as data pre- and post-processing, all of which must be executed over large databases. In order to make data mining more accessible, it is critical to provide a simple, easy-to-use language that allows the user to specify ad-hoc data processing, model construction, and model manipulation. Simultaneously, it is necessary for the underlying system to scale up to large datasets. Unfortunately, while each of these requirements can be satisfied, individually, by existing systems, no system fully satisfies all criteria. In this paper, we present a system called Splash to fill this void. Splash supports an extended relational data model and SQL query language, which allows for the natural integration of statistical modeling and ad-hoc data processing. It also supports a novel representatives operator to help explain models using a limited number of examples. We have developed a prototype implementation of Splash. Our experimental study indicates that it scales well to large input datasets. Further, to demonstrate the simplicity of the language, we conducted a case study using Splash to perform a series of exploratory analyses using network log data. Our study indicates that the query-based interface is simpler than a common data mining software package, and it often requires less programming effort to use.
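Splash's SQL syntax and semantics are defined in the paper; as a loose illustration of what a representatives-style operator could return, the sketch below picks, for each group of rows assigned to the same model component, the row nearest the group's mean. The grouping and toy data are assumptions made purely for illustration.

```python
# Minimal sketch of selecting one example row per model component,
# in the spirit of a "representatives" operator. Illustration only.

def mean(rows):
    dims = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

def representative(rows):
    """Row with the smallest squared distance to the group mean."""
    m = mean(rows)
    return min(rows, key=lambda r: sum((x - c) ** 2 for x, c in zip(r, m)))

# toy data: rows already assigned to two model components (e.g., clusters)
groups = {
    "component_0": [(1.0, 2.0), (1.2, 1.9), (0.8, 2.3)],
    "component_1": [(8.0, 9.0), (7.5, 9.4)],
}
for name, rows in groups.items():
    print(name, "->", representative(rows))
```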
knowledge discovery and data mining | 2011
Daniel Fabbri; Kristen LeFevre; David A. Hanauer
Electronic health record systems (EHRs) are increasingly used to store patient medical information. To ensure the responsible use of this data, EHRs collect access logs, which record each access to sensitive data (e.g., a patient's record). Using the access log, it is easy to determine who has accessed a specific medical record. However, in addition to this information, we observe that for various applications such as user-centric auditing, it is also important to understand why each access occurred. In this paper, we study why accesses occur in EHRs. Our goal is to provide an explanation describing why each access occurred (e.g., Dr. Dave accessed Alice's medical record because Dr. Dave has an appointment with Alice). Using data from the University of Michigan Health System, we demonstrate that most accesses to EHRs occur for a valid clinical or operational reason, and often the reason is documented in the EHR database. Specifically, we observe three general types of explanations (direct, group, and consultation), and we show that these explanations can explain over 90% of the accesses in the log. Moreover, we identify collaborative groups that help to explain additional accesses.
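One of the ideas mentioned above, identifying collaborative groups, can be sketched as follows: users who access many of the same patients' records are grouped together. The co-access threshold and the toy log are assumptions used only for illustration.

```python
# Minimal sketch of inferring collaborative user pairs from co-access patterns.
# Illustration only; threshold and log contents are assumptions.
from collections import defaultdict
from itertools import combinations

access_log = [
    ("dr_dave", "Alice"), ("nurse_nan", "Alice"), ("dr_dave", "Bob"),
    ("nurse_nan", "Bob"), ("dr_erin", "Carol"), ("dr_dave", "Carol"),
]

patients_by_user = defaultdict(set)
for user, patient in access_log:
    patients_by_user[user].add(patient)

MIN_SHARED = 2
pairs = []
for u, v in combinations(sorted(patients_by_user), 2):
    shared = patients_by_user[u] & patients_by_user[v]
    if len(shared) >= MIN_SHARED:
        pairs.append((u, v, sorted(shared)))

print(pairs)   # [('dr_dave', 'nurse_nan', ['Alice', 'Bob'])]
```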
data engineering for wireless and mobile access | 2010
Wen Jin; Kristen LeFevre; Jignesh M. Patel
This paper studies the problem of protecting individual privacy when continuously publishing a stream of location trace data collected from a population of users. Fundamentally, this leads to the new challenge of anonymizing data that evolves in predictable ways over time. Our main technical contribution is a novel formal framework for reasoning about privacy in this setting. We articulate a new privacy principle called temporal unlinkability. Then, by incorporating a probabilistic model of data change (in this case, user motion), we are able to quantify the risk of privacy violations. Within this framework, we develop an initial set of algorithms for continuous privacy-preserving publishing. Finally, our experiments demonstrate the shortcomings of previous publishing techniques that do not account for inference based on predictable data change, and they demonstrate the feasibility of the new approach.
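To illustrate why predictable data change matters, the sketch below shows how a simple motion model (a maximum travel speed, which is an assumption made here for illustration, not the paper's model) lets an adversary prune which anonymized reports in the next snapshot are reachable from a given report, shrinking the candidate set and raising the linkage probability.

```python
# Minimal sketch of linking anonymized location reports across two snapshots
# using a maximum-speed motion model. Illustration only.
import math

MAX_SPEED = 5.0          # distance units per time step an individual can travel (assumed)

def reachable(p, q, dt=1.0):
    return math.dist(p, q) <= MAX_SPEED * dt

# Two anonymized snapshots: each report is just a location, identifiers removed.
snapshot_t  = [(0.0, 0.0), (1.0, 1.0), (20.0, 20.0)]
snapshot_t1 = [(2.0, 1.0), (21.0, 19.0), (3.0, 0.5)]

# For a target report at time t, how many reports at t+1 remain plausible?
target = snapshot_t[0]
candidates = [q for q in snapshot_t1 if reachable(target, q)]
linkage_risk = 1.0 / len(candidates) if candidates else 0.0
print(f"{len(candidates)} plausible successors -> linkage probability ~{linkage_risk:.2f}")
```

Publishing techniques that ignore this kind of inference report a larger apparent anonymity set than an adversary with a motion model actually faces.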
very large data bases | 2009
Bee-Chung Chen; Kristen LeFevre; Raghu Ramakrishnan
Privacy is an important issue in data publishing. Many organizations distribute non-aggregate personal data for research, and they must take steps to ensure that an adversary cannot predict sensitive information pertaining to individuals with high confidence. This problem is further complicated by the fact that, in addition to the published data, the adversary may also have access to other resources (e.g., public records and social networks relating individuals), which we call adversarial knowledge. A robust privacy framework should allow publishing organizations to analyze data privacy by means of not only data dimensions (data that a publishing organization has), but also adversarial-knowledge dimensions (information not in the data). In this paper, we first describe a general framework for reasoning about privacy in the presence of adversarial knowledge. Within this framework, we propose a novel multidimensional approach to quantifying adversarial knowledge. This approach allows the publishing organization to investigate privacy threats and enforce privacy requirements in the presence of various types and amounts of adversarial knowledge. Our main technical contributions include a multidimensional privacy criterion that is more intuitive and flexible than previous approaches to modeling background knowledge. In addition, we identify an important congregation property of the adversarial-knowledge dimensions. Based on this property, we provide algorithms for measuring disclosure and sanitizing data that improve computational efficiency several orders of magnitude over the best known techniques.
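As a small, simplified example of the kind of worst-case analysis such a framework performs (not the paper's multidimensional criterion), the sketch below shows how an adversary who learns the sensitive values of up to k other individuals in an equivalence class can sharpen the inferred probability of the target's value. The class contents and values of k are assumptions.

```python
# Minimal sketch of worst-case disclosure within one equivalence class when the
# adversary can learn up to k other individuals' sensitive values. Illustration only.
from collections import Counter

def worst_case_confidence(sensitive_values, target_value, k):
    """Worst-case probability the adversary assigns to `target_value` for the target,
    after eliminating up to k other tuples whose values are not `target_value`."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    others = n - counts[target_value]   # tuples the adversary would eliminate first
    removed = min(k, others)
    return counts[target_value] / (n - removed)

# One equivalence class of 6 tuples and its sensitive values:
values = ["flu", "flu", "flu", "cancer", "cancer", "hepatitis"]
for k in range(4):
    print(k, round(worst_case_confidence(values, "cancer", k), 2))
# 0 -> 0.33, 1 -> 0.4, 2 -> 0.5, 3 -> 0.67
```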
international world wide web conferences | 2010
Lujun Fang; Kristen LeFevre