Publication


Featured research published by Bradley Malin.


Journal of Biomedical Informatics | 2004

How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems

Bradley Malin; Latanya Sweeney

The increasing integration of patient-specific genomic data into clinical practice and research raises serious privacy concerns. Various systems have been proposed that protect privacy by removing or encrypting explicitly identifying information, such as name or social security number, into pseudonyms. Though these systems claim to protect identity from being disclosed, they lack formal proofs. In this paper, we study the erosion of privacy when genomic data, either pseudonymous or data believed to be anonymous, are released into a distributed healthcare environment. Several algorithms are introduced, collectively called RE-Identification of Data In Trails (REIDIT), which link genomic data to named individuals in publicly available records by leveraging unique features in patient-location visit patterns. Algorithmic proofs of re-identification are developed and we demonstrate, with experiments on real-world data, that susceptibility to re-identification is neither trivial nor the result of bizarre isolated occurrences. We propose that such techniques can be applied as system tests of privacy protection capabilities.
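The trail idea in the abstract above can be illustrated with a minimal sketch. It is not the published REIDIT algorithms: it assumes every location releases both a named table and a pseudonymous genomic table covering the same patients, and it links a genomic record to a name only when their sets of visited locations match exactly and uniquely. All location names, patient names, and identifiers are hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each entity to the frozenset of locations whose release contains it."""
    trails = defaultdict(set)
    for location, entities in releases.items():
        for entity in entities:
            trails[entity].add(location)
    return {entity: frozenset(locations) for entity, locations in trails.items()}


def group_by_trail(trails):
    """Invert an entity -> trail mapping into trail -> list of entities."""
    groups = defaultdict(list)
    for entity, trail in trails.items():
        groups[trail].append(entity)
    return groups


def link_unique_trails(named_releases, genomic_releases):
    """Link a genomic record to a name when both have the same, unique trail."""
    names_by_trail = group_by_trail(build_trails(named_releases))
    dna_by_trail = group_by_trail(build_trails(genomic_releases))
    links = {}
    for trail, dna_ids in dna_by_trail.items():
        names = names_by_trail.get(trail, [])
        # Only an unambiguous one-to-one match counts as a re-identification.
        if len(dna_ids) == 1 and len(names) == 1:
            links[dna_ids[0]] = names[0]
    return links


# Hypothetical releases: each location publishes names and pseudonymous records.
named = {
    "hospital_a": {"Alice", "Bob"},
    "hospital_b": {"Alice", "Carol"},
    "hospital_c": {"Bob", "Carol", "Dan", "Eve"},
}
genomic = {
    "hospital_a": {"g1", "g2"},
    "hospital_b": {"g1", "g3"},
    "hospital_c": {"g2", "g3", "g4", "g5"},
}
links = link_unique_trails(named, genomic)
print(links)
```

In this toy data, Dan and Eve both visit only hospital_c, so their records stay unlinked, while the three patients with distinctive visit patterns are re-identified without any explicit identifier being shared.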


Journal of the American Medical Informatics Association | 2004

An Evaluation of the Current State of Genomic Data Privacy Protection Technology and a Roadmap for the Future

Bradley Malin

The incorporation of genomic data into personal medical records poses many challenges to patient privacy. In response, various systems for preserving patient privacy in shared genomic data have been developed and deployed. Although these systems de-identify the data by removing explicit identifiers (e.g., name, address, or Social Security number) and incorporate sound security design principles, they suffer from a lack of formal modeling of inferences learnable from shared data. This report evaluates the extent to which current protection systems are capable of withstanding a range of re-identification methods, including genotype-phenotype inferences, location-visit patterns, family structures, and dictionary attacks. For a comparative re-identification analysis, the systems are mapped to a common formalism. Although there is variation in susceptibility, each system is deficient in its protection capacity. The author discovers patterns of protection failure and discusses several of the reasons why these systems are susceptible. The analyses and discussion within provide guideposts for the development of next-generation protection methods amenable to formal proofs.


Proceedings of the National Academy of Sciences of the United States of America | 2010

Anonymization of electronic medical records for validating genome-wide association studies.

Grigorios Loukides; Aris Gkoulalas-Divanis; Bradley Malin

Genome-wide association studies (GWAS) facilitate the discovery of genotype–phenotype relations from population-based sequence databases, which is an integral facet of personalized medicine. The increasing adoption of electronic medical records allows large amounts of patients’ standardized clinical features to be combined with the genomic sequences of these patients and shared to support validation of GWAS findings and to enable novel discoveries. However, disseminating these data “as is” may lead to patient reidentification when genomic sequences are linked to resources that contain the corresponding patients’ identity information based on standardized clinical features. This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from the Vanderbilt University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks.


International Journal of Medical Informatics | 2010

The MITRE Identification Scrubber Toolkit: Design, training, and assessment

John S. Aberdeen; Samuel Bayer; Reyyan Yeniterzi; Benjamin Wellner; Cheryl Clark; David A. Hanauer; Bradley Malin; Lynette Hirschman

PURPOSE Medical records must often be stripped of patient identifiers, or de-identified, before being shared. De-identification by humans is time-consuming, and existing software is limited in its generality. The open source MITRE Identification Scrubber Toolkit (MIST) provides an environment to support rapid tailoring of automated de-identification to different document types, using automatically learned classifiers to de-identify and protect sensitive information. METHODS MIST was evaluated with four classes of patient records from the Vanderbilt University Medical Center: discharge summaries, laboratory reports, letters, and order summaries. We trained and tested MIST on each class of record separately, as well as on pooled sets of records. We measured precision, recall, F-measure and accuracy at the word level for the detection of patient identifiers as designated by the HIPAA Safe Harbor Rule. RESULTS MIST was applied to medical records that differed in the amounts and types of protected health information (PHI): lab reports contained only two types of PHI (dates, names) compared to discharge summaries, which were much richer. Performance of the de-identification tool depended on record class; F-measure results were 0.996 for order summaries, 0.996 for discharge summaries, 0.943 for letters and 0.934 for laboratory reports. Experiments suggest the tool requires several hundred training exemplars to reach an F-measure of at least 0.9. CONCLUSIONS The MIST toolkit makes possible the rapid tailoring of automated de-identification to particular document types and supports the transition of the de-identification software to medical end users, avoiding the need for developers to have access to original medical records. We are making the MIST toolkit available under an open source license to encourage its application to diverse data sets at multiple institutions.
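The word-level metrics named in the abstract above can be computed as in the following minimal sketch. It assumes the balanced F-measure (the harmonic mean of precision and recall), which the abstract does not spell out, and it uses made-up token labels rather than any MIST output.

```python
def word_level_scores(gold, predicted):
    """Score per-token PHI decisions; True means the token is a patient identifier."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must label the same tokens")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Balanced F-measure: harmonic mean of precision and recall (an assumption).
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical labels for an eight-token snippet: three tokens are PHI.
gold = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
scores = word_level_scores(gold, predicted)
print(scores)
```

Accuracy counts the many non-PHI tokens as successes, so it can stay high even when identifiers are missed; this is one reason recall and F-measure are usually the more informative numbers for de-identification.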


Journal of Investigative Medicine | 2010

Technical and Policy Approaches to Balancing Patient Privacy and Data Sharing in Clinical and Translational Research

Bradley Malin; David R. Karp; Richard H. Scheuermann

Introduction Clinical researchers need to share data to support scientific validation and information reuse and to comply with a host of regulations and directives from funders. Various organizations are constructing informatics resources in the form of centralized databases to ensure reuse of data derived from sponsored research. The widespread use of such open databases is contingent on the protection of patient privacy. Methods We review privacy-related problems associated with data sharing for clinical research from technical and policy perspectives. We investigate existing policies for secondary data sharing and privacy requirements in the context of data derived from research and clinical settings. In particular, we focus on policies specified by the US National Institutes of Health and the Health Insurance Portability and Accountability Act and touch on how these policies are related to current and future use of data stored in public database archives. We address aspects of data privacy and identifiability from a technical, although approachable, perspective and summarize how biomedical databanks can be exploited and seemingly anonymous records can be reidentified using various resources without hacking into secure computer systems. Results We highlight which clinical and translational data features, specified in emerging research models, are potentially vulnerable or exploitable. In the process, we recount a recent privacy-related concern associated with the publication of aggregate statistics from pooled genome-wide association studies that has had a significant impact on the data sharing policies of National Institutes of Health-sponsored databanks. Conclusion Based on our analysis and observations we provide a list of recommendations that cover various technical, legal, and policy mechanisms that open clinical databases can adopt to strengthen data privacy protection as they move toward wider deployment and adoption.


Journal of the American Medical Informatics Association | 2013

Biomedical data privacy: problems, perspectives, and recent advances

Bradley Malin; Khaled El Emam; Christine M. O'Keefe

The notion of privacy in the healthcare domain is at least as old as the ancient Greeks. Several decades ago, as electronic medical record (EMR) systems began to take hold, the necessity of patient privacy was recognized as a core principle, or even a right, that must be upheld [1, 2]. This belief was reinforced as computers and EMRs became more common in clinical environments [3–5]. However, the arrival of ultra-cheap data collection and processing technologies is fundamentally changing the face of healthcare. The traditional boundaries of primary and tertiary care environments are breaking down and health information is increasingly collected through mobile devices [6], in personal domains (eg, in one's home [7]), and from sensors attached on or in the human body (eg, body area networks [8–10]). At the same time, the detail and diversity of information collected in the context of healthcare and biomedical research is increasing at an unprecedented rate, with clinical and administrative health data being complemented with a range of *omics data, where genomics [11] and proteomics [12] are currently leading the charge, with other types of molecular data on the horizon [13]. Healthcare organizations (HCOs) are adopting and adapting information technologies to support an expanding array of activities designed to derive value from these growing data archives, in terms of enhanced health outcomes [14]. The ready availability of such large volumes of detailed data has also been accompanied by privacy invasions. Recent breach notification laws at the US federal and state levels have brought to the public's attention the scope and frequency of these invasions. For example, there are cases of healthcare providers snooping on the medical records of famous people, family, and friends, use of personal information for identity fraud, and millions of records disclosed through lost and …

Correspondence to Dr Bradley Malin, Department of Biomedical Informatics, Vanderbilt University, 2525 West End Avenue, Suite 600, Nashville, TN 37203, USA; b.malin{at}vanderbilt.edu


Knowledge Discovery and Data Mining | 2015

Rubik: Knowledge Guided Tensor Factorization and Completion for Health Data Analytics

Yichen Wang; Robert Chen; Joydeep Ghosh; Joshua C. Denny; Abel N. Kho; You Chen; Bradley Malin; Jimeng Sun

Computational phenotyping is the process of converting heterogeneous electronic health records (EHRs) into meaningful clinical concepts. Unsupervised phenotyping methods have the potential to leverage a vast amount of unlabeled EHR data for phenotype discovery. However, existing unsupervised phenotyping methods do not incorporate current medical knowledge and cannot directly handle missing or noisy data. We propose Rubik, a constrained non-negative tensor factorization and completion method for phenotyping. Rubik incorporates 1) guidance constraints to align with existing medical knowledge, and 2) pairwise constraints for obtaining distinct, non-overlapping phenotypes. Rubik also has built-in tensor completion that can significantly alleviate the impact of noisy and missing data. We utilize the Alternating Direction Method of Multipliers (ADMM) framework for tensor factorization and completion, which can be easily scaled through parallel computing. We evaluate Rubik on two EHR datasets, one of which contains 647,118 records for 7,744 patients from an outpatient clinic, the other of which is a public dataset containing 1,018,614 CMS claims records for 472,645 patients. Our results show that Rubik can discover more meaningful and distinct phenotypes than the baselines. In particular, by using knowledge guidance constraints, Rubik can also discover sub-phenotypes for several major diseases. Rubik also runs around seven times faster than current state-of-the-art tensor methods. Finally, Rubik is scalable to large datasets containing millions of EHR records.


Journal of Biomedical Informatics | 2014

PARAMO: A PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records

Kenney Ng; Amol Ghoting; Steven R. Steinhubl; Walter F. Stewart; Bradley Malin; Jimeng Sun

OBJECTIVE Healthcare analytics research increasingly involves the construction of predictive models for disease targets across varying patient cohorts using electronic health records (EHRs). To facilitate this process, it is critical to support a pipeline of tasks: (1) cohort construction, (2) feature construction, (3) cross-validation, (4) feature selection, and (5) classification. To develop an appropriate model, it is necessary to compare and refine models derived from a diversity of cohorts, patient-specific features, and statistical frameworks. The goal of this work is to develop and evaluate a predictive modeling platform that can be used to simplify and expedite this process for health data. METHODS To support this goal, we developed a PARAllel predictive MOdeling (PARAMO) platform which (1) constructs a dependency graph of tasks from specifications of predictive modeling pipelines, (2) schedules the tasks in a topological ordering of the graph, and (3) executes those tasks in parallel. We implemented this platform using Map-Reduce to enable independent tasks to run in parallel in a cluster computing environment. Different task scheduling preferences are also supported. RESULTS We assess the performance of PARAMO on various workloads using three datasets derived from the EHR systems in place at Geisinger Health System and Vanderbilt University Medical Center and an anonymous longitudinal claims database. We demonstrate significant gains in computational efficiency against a standard approach. In particular, PARAMO can build 800 different models on a 300,000-patient data set in 3 h in parallel compared to 9 days if running sequentially. CONCLUSION This work demonstrates that an efficient parallel predictive modeling platform can be developed for EHR data. This platform can facilitate large-scale modeling endeavors and speed up the research workflow and reuse of health information. This platform is only a first step and provides the foundation for our ultimate goal of building analytic pipelines that are specialized for health data researchers.
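The three steps described in the METHODS portion above (build a dependency graph, order it topologically, run independent tasks in parallel) can be sketched as follows. This is an illustration only: it uses a local thread pool and Python's standard `graphlib` module (Python 3.9+) rather than Map-Reduce on a cluster, and the task names, the two-fold layout, and the placeholder actions are all hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_pipeline(dependencies, actions, max_workers=4):
    """Run tasks in dependency order, executing independent tasks concurrently.

    dependencies maps each task to the set of tasks it must wait for;
    actions maps each task to a zero-argument callable.
    Returns the task names in the order they finished.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose prerequisites are all complete.
            for task in sorter.get_ready():
                running[pool.submit(actions[task])] = task
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                task = running.pop(future)
                future.result()  # re-raise any exception from the task
                finished.append(task)
                sorter.done(task)  # may unlock downstream tasks
    return finished


# Hypothetical pipeline: shared upstream steps, then two independent folds.
dependencies = {
    "cohort_construction": set(),
    "feature_construction": {"cohort_construction"},
    "cross_validation_split": {"feature_construction"},
    "feature_selection_fold1": {"cross_validation_split"},
    "feature_selection_fold2": {"cross_validation_split"},
    "classification_fold1": {"feature_selection_fold1"},
    "classification_fold2": {"feature_selection_fold2"},
}
# Placeholder actions; real tasks would read and write intermediate data.
actions = {task: (lambda task=task: f"ran {task}") for task in dependencies}
order = run_pipeline(dependencies, actions)
print(order)
```

The two fold branches share no dependencies after the split, so they can run at the same time; parallelism of this kind across hundreds of models is what the reported reduction from 9 days to 3 h reflects.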


Human Genetics | 2011

Identifiability in biobanks: models, measures, and mitigation strategies

Bradley Malin; Grigorios Loukides; Kathleen Benitez; Ellen Wright Clayton

The collection and sharing of person-specific biospecimens has raised significant questions regarding privacy. In particular, the question of identifiability, or the degree to which materials stored in biobanks can be linked to the name of the individuals from which they were derived, is under scrutiny. The goal of this paper is to review the extent to which biospecimens and affiliated data can be designated as identifiable. To achieve this goal, we summarize recent research in identifiability assessment for DNA sequence data, as well as associated demographic and clinical data, shared via biobanks. We demonstrate the variability of the degree of risk, the factors that contribute to this variation, and potential ways to mitigate and manage such risk. Finally, we discuss the policy implications of these findings, particularly as they pertain to biobank security and access policies. We situate our review in the context of real data sharing scenarios and biorepositories.


ACM Computing Surveys | 2015

Privacy in the Genomic Era

Muhammad Naveed; Erman Ayday; Ellen Wright Clayton; Jacques Fellay; Carl A. Gunter; Jean-Pierre Hubaux; Bradley Malin; XiaoFeng Wang

Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly-detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has major implications for personal privacy; notably because the genome has certain essential features, which include (but are not limited to) (i) an association with traits and certain diseases, (ii) identification capability (e.g., forensics), and (iii) revelation of family relationships. Moreover, direct-to-consumer DNA testing increases the likelihood that genome data will be made available in less regulated environments, such as the Internet and for-profit companies. The problem of genome data privacy thus resides at the crossroads of computer science, medicine, and public policy. While computer scientists have addressed data privacy for various data types, there has been less attention dedicated to genomic data. Thus, the goal of this paper is to provide a systematization of knowledge for the computer science community. In doing so, we address some of the (sometimes erroneous) beliefs of this field and we report on a survey we conducted about genome data privacy with biomedical specialists. Then, after characterizing the genome privacy problem, we review the state-of-the-art regarding privacy attacks on genomic data and strategies for mitigating such attacks, as well as contextualizing these attacks from the perspective of medicine and public policy. This paper concludes with an enumeration of the challenges for genome data privacy and presents a framework to systematize the analysis of threats and the design of countermeasures as the field moves forward.

Collaboration


Top Co-Authors

Murat Kantarcioglu, University of Texas at Dallas

You Chen, Vanderbilt University

Joshua C. Denny, Vanderbilt University Medical Center

Latanya Sweeney, Carnegie Mellon University

Abel N. Kho, Northwestern University