James J. Gardner | Researchain

Publication

Featured research published by James J. Gardner.

Data and Knowledge Engineering | 2009

An integrated framework for de-identifying unstructured medical data

James J. Gardner; Li Xiong

While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy without disclosing any information that can be used to identify a patient. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability and developing techniques for anonymization, but it is focused exclusively on structured data. On the other hand, efforts on de-identifying medical text documents in the medical informatics community rely on simple identifier removal or grouping techniques without taking advantage of the research developments in the data privacy community. This paper attempts to fill these gaps and presents a framework and prototype system for de-identifying health information including both structured and unstructured data. We empirically study a simple Bayesian classifier, a Bayesian classifier with a sampling-based technique, and a conditional random field-based classifier for extracting identifying attributes from unstructured data. We deploy a k-anonymization-based technique for de-identifying the extracted data to preserve maximum data utility. We present a set of preliminary evaluations showing the effectiveness of our approach.
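The k-anonymization step mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the field names (`age`, `zip`), the bucket width, and the helpers `generalize` and `is_k_anonymous` are hypothetical, chosen only to show how coarsening quasi-identifiers lets each record share its generalized values with at least k - 1 others.

```python
from collections import Counter

def generalize(record, age_bucket=10, zip_digits=3):
    """Coarsen quasi-identifiers: bucket the age and mask trailing ZIP digits."""
    low = (record["age"] // age_bucket) * age_bucket
    age_range = f"{low}-{low + age_bucket - 1}"
    zip_code = record["zip"]
    masked_zip = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (age_range, masked_zip)

def is_k_anonymous(records, k, **kwargs):
    """True if every generalized quasi-identifier combination occurs at least k times."""
    counts = Counter(generalize(r, **kwargs) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical records, as if already extracted from text by a classifier.
records = [
    {"age": 34, "zip": "30322"},
    {"age": 37, "zip": "30329"},
    {"age": 52, "zip": "30307"},
    {"age": 58, "zip": "30303"},
]

print(is_k_anonymous(records, k=2))                # coarse ZIP prefix
print(is_k_anonymous(records, k=2, zip_digits=5))  # full ZIP retained
```

Coarser generalization makes the k-anonymity check easier to satisfy but discards more detail, which is the utility trade-off the abstract refers to.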

Computer-Based Medical Systems | 2008

HIDE: An Integrated System for Health Information DE-identification

James J. Gardner; Li Xiong

While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy without disclosing any identifiable information. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability and developing techniques for anonymization, but it is focused exclusively on structured data. On the other hand, efforts on de-identifying medical text documents in the medical informatics community rely on simple identifier removal or grouping techniques without taking advantage of the research developments in the data privacy community. This paper attempts to fill these gaps and presents a prototype system for de-identifying health information including both structured and unstructured data. It deploys a conditional random fields-based technique for extracting identifying attributes from unstructured data and a k-anonymization-based technique for de-identifying the data while preserving maximum data utility. We present a set of preliminary evaluations showing the effectiveness of our approach.

Journal of the American Medical Informatics Association | 2013

SHARE: system design and case studies for statistical health information release

James J. Gardner; Li Xiong; Yonghui Xiao; Jingjing Gao; Andrew R. Post; Xiaoqian Jiang; Lucila Ohno-Machado

OBJECTIVES: We present SHARE, a new system for statistical health information release with differential privacy. We present two case studies that evaluate the software on real medical datasets and demonstrate the feasibility and utility of applying the differential privacy framework on biomedical data.
MATERIALS AND METHODS: SHARE releases statistical information in electronic health records with differential privacy, a strong privacy framework for statistical data release. It includes a number of state-of-the-art methods for releasing multidimensional histograms and longitudinal patterns. We performed a variety of experiments on two real datasets, the Surveillance, Epidemiology, and End Results (SEER) breast cancer dataset and the Emory electronic medical record (EeMR) dataset, to demonstrate the feasibility and utility of SHARE.
RESULTS: Experimental results indicate that SHARE can deal with heterogeneous data present in medical data, and that the released statistics are useful. The Kullback-Leibler divergence between the released multidimensional histograms and the original data distribution is below 0.5 and 0.01 for seven-dimensional and three-dimensional data cubes generated from the SEER dataset, respectively. The relative error for longitudinal pattern queries on the EeMR dataset varies between 0 and 0.3. While the results are promising, they also suggest that challenges remain in applying statistical data release using the differential privacy framework for higher dimensional data.
CONCLUSIONS: SHARE is one of the first systems to provide a mechanism for custodians to release differentially private aggregate statistics for a variety of use cases in the medical domain. This proof-of-concept system is intended to be applied to large-scale medical data warehouses.
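As a rough illustration of the kind of measurement reported in this abstract, the Python sketch below releases a histogram with the basic Laplace mechanism and scores it with the Kullback-Leibler divergence. It is a baseline under stated assumptions, not SHARE's method: the abstract does not describe its release algorithms in detail, and the cell counts, the `epsilon` value, and the function names are hypothetical. The sketch assumes neighboring datasets differ by one record, so each cell count has sensitivity 1.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_histogram(counts, epsilon, rng):
    """Add Laplace(1/epsilon) noise to every cell and clamp negatives to zero."""
    return [max(0.0, c + laplace_noise(1.0 / epsilon, rng)) for c in counts]

def kl_divergence(p_counts, q_counts, smoothing=1e-9):
    """KL(P || Q) between two normalized histograms; smoothing avoids log(0)."""
    p_total = sum(p_counts) + smoothing * len(p_counts)
    q_total = sum(q_counts) + smoothing * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + smoothing) / p_total
        q = (qc + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl

# Hypothetical cell counts for a small one-dimensional histogram.
rng = random.Random(0)
true_counts = [120, 45, 30, 5]
released = private_histogram(true_counts, epsilon=1.0, rng=rng)
print(released)
print(kl_divergence(true_counts, released))
```

Independent per-cell noise of this kind accumulates as the number of cells grows, which is one way to see why the abstract flags higher dimensional data as a remaining challenge.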

International Conference on Data Engineering | 2012

DPCube: Releasing Differentially Private Data Cubes for Health Information

Yonghui Xiao; James J. Gardner; Li Xiong

We demonstrate DPCube, a component in our Health Information DE-identification (HIDE) framework, for releasing differentially private data cubes (or multi-dimensional histograms) for sensitive data. HIDE is a framework we developed for integrating heterogeneous structured and unstructured health information, and it provides methods for privacy-preserving data publishing. The DPCube component uses differentially private access mechanisms and an innovative 2-phase multidimensional partitioning strategy to publish a multi-dimensional data cube or histogram that achieves good utility while satisfying differential privacy. We demonstrate that the released data cubes can serve as a sanitized synopsis of the raw database and, together with an optional synthesized dataset based on the data cubes, can support various Online Analytical Processing (OLAP) queries and learning tasks.

Extending Database Technology | 2009

HIDE: heterogeneous information DE-identification

James J. Gardner; Li Xiong; Kanwei Li; James J. Lu

While there is an increasing need to share data that may contain personal information, such data sharing must preserve individual privacy without disclosing any identifiable information. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability with many techniques for anonymization, but is focused exclusively on structured data. On the other hand, efforts on de-identifying medical text documents in the medical informatics community are highly specialized for specific document types or a subset of identifiers. In addition, they rely on simple identifier removal or grouping techniques and do not take advantage of the research developments in the data privacy community. We developed an integrated system, HIDE, for Heterogeneous Information DE-identification including structured and unstructured data utilizing existing anonymization techniques. We demonstrate a prototype of our system and show the effectiveness of our approach through a set of real data augmented with synthesized data.

Conference on Information and Knowledge Management | 2009

Automatic link detection: a sequence labeling approach

James J. Gardner; Li Xiong

The popularity of Wikipedia and other online knowledge bases has recently produced an interest in the machine learning community in the problem of automatic linking. Automatic hyperlinking can be viewed as two subproblems: link detection, which determines the source of a link, and link disambiguation, which determines the destination of a link. Wikipedia is a rich corpus with hyperlink data provided by authors. It is possible to use this data to train classifiers that mimic the authors in some capacity. In this paper, we introduce automatic link detection as a sequence labeling problem. Conditional random fields (CRFs) are a probabilistic framework for labeling sequential data. We show that a CRF trained with different types of features from the Wikipedia dataset can automatically detect links with almost perfect precision and high recall.
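To make the sequence labeling framing concrete, here is a minimal Python sketch of how author-provided links might be turned into training data for a sequence labeler such as a CRF. The B/I/O tag names, the simplified `[[anchor]]` markup (piped links like `[[target|anchor]]` are not handled), and the feature choices in `token_features` are assumptions for illustration; the abstract does not specify the paper's label scheme or feature set.

```python
import re

def label_tokens(wikitext):
    """Turn text with [[anchor]] markup into (token, label) pairs using B/I/O tags."""
    pairs = []
    # The capturing group keeps the linked spans in the split result.
    for part in re.split(r"(\[\[.*?\]\])", wikitext):
        if part.startswith("[[") and part.endswith("]]"):
            words = part[2:-2].split()
            pairs += [(w, "B-LINK" if i == 0 else "I-LINK") for i, w in enumerate(words)]
        else:
            pairs += [(w, "O") for w in part.split()]
    return pairs

def token_features(tokens, i):
    """Hypothetical per-token features: the word itself plus its neighbors."""
    return {
        "word": tokens[i].lower(),
        "is_title": tokens[i].istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

pairs = label_tokens("A [[conditional random field]] labels [[sequences]] .")
tokens = [tok for tok, _ in pairs]
for i, (tok, label) in enumerate(pairs):
    print(tok, label, token_features(tokens, i))
```

The resulting feature dictionaries and label sequences are the usual input shape for CRF toolkits; detecting links in new text then amounts to predicting the B/I/O tag for each token.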

Collaborative Computing | 2006

NNexus: Towards an Automatic Linker for a Massively-Distributed Collaborative Corpus

James J. Gardner; Aaron Krowne; Li Xiong

Collaborative online encyclopedias such as Wikipedia and PlanetMath are becoming increasingly popular. In order to understand an article in a corpus, a user must understand the related and underlying concepts through linked articles. In this paper, we introduce NNexus, a generalization of the automatic linking component of PlanetMath.org and the first system that automates the process of linking encyclopedia entries into a semantic network of concepts. We discuss the challenges, present the conceptual models as well as specific mechanisms of the NNexus system, and discuss some of our ongoing and completed work.

International Health Informatics Symposium | 2010

An evaluation of feature sets and sampling techniques for de-identification of medical records

James J. Gardner; Li Xiong; Fusheng Wang; Andrew R. Post; Joel H. Saltz; Tyrone Grandison

De-identification of text medical records is of critical importance in any health informatics system in order to facilitate research and sharing of medical records. While statistical learning-based techniques have shown promising results for de-identification purposes, few such systems are publicly available. It remains a challenge for practitioners to build an accurate and efficient system as it involves a significant amount of feature engineering, i.e. creation and examination of new features used in the system. A comprehensive evaluation is needed to thoroughly understand the effects of different feature sets and potential impacts of sampling and their trade-offs between the often conflicting goals of precision (or positive predictive value), recall (or sensitivity), and efficiency. In this paper, we present the Health Information DE-identification (HIDE) framework and evaluate the open-source software. We present an evaluation of various types of features used in HIDE, and introduce a window sampling technique (only the terms within a specified distance from personal health information are used to train the classifier) and evaluate its effect on both quality and efficiency. Our results show that the context features (previous and next terms) are particularly important and the sampling technique can be used to increase recall with minimal impact on precision. We obtained token-level label precision of 0.967, recall of 0.986 and F-score of 0.977 when not including true negatives. The overall HIDE system achieves token-level precision of 0.998, recall of 0.999, and F-score of 0.999 on the previous i2b2 challenge task.
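The window sampling idea and the token-level metrics in this abstract can be sketched in a few lines of Python. This is an illustration based only on the one-sentence description above, not HIDE's code: the label names, the `window` parameter, and both function names are hypothetical.

```python
def window_sample(labels, window):
    """Indices of tokens within `window` positions of any non-"O" (PHI) token."""
    keep = set()
    for i, label in enumerate(labels):
        if label != "O":
            start = max(0, i - window)
            stop = min(len(labels), i + window + 1)
            keep.update(range(start, stop))
    return sorted(keep)

def precision_recall_f1(tp, fp, fn):
    """Token-level metrics; true negatives do not enter any of the three."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical token labels for one sentence; only "NAME" is PHI here.
labels = ["O", "O", "O", "NAME", "O", "O", "O", "O"]
print(window_sample(labels, window=1))
print(precision_recall_f1(tp=8, fp=2, fn=0))
```

Dropping tokens far from any PHI shrinks the training set and shifts its balance toward positive examples, which is consistent with the abstract's observation that sampling raises recall. Because the reported precision and recall are rounded, recomputing F from them need not reproduce the reported F-score to the last digit.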

Archive | 2010

Automatic Invocation Linking for Collaborative Web-Based Corpora

James J. Gardner; Aaron Krowne; Li Xiong

Collaborative online encyclopedias or knowledge bases such as Wikipedia and PlanetMath are becoming increasingly popular because of their open access, comprehensive and interlinked content, rapid and continual updates, and community interactivity. To understand a particular concept in these knowledge bases, a reader needs to learn about related and underlying concepts. In this chapter, we introduce the problem of invocation linking for collaborative encyclopedias or knowledge bases, review the state of the art for invocation linking including the popular linking system of Wikipedia, discuss the problems and challenges of automatic linking, and present the NNexus approach, an abstraction and generalization of the automatic linking system used by PlanetMath.org. The chapter emphasizes both research problems and practical design issues through discussion of real-world scenarios and hence is suitable for both researchers in web intelligence and practitioners looking to adopt the techniques.

IEEE Transactions on Knowledge and Data Engineering | 2009