Benjamin C. M. Fung | Researchain

Publication

Featured research published by Benjamin C. M. Fung.

ACM Computing Surveys | 2010

Privacy-preserving data publishing: A survey of recent developments

Benjamin C. M. Fung; Ke Wang; Rui Chen; Philip S. Yu

The collection of digital information by governments, corporations, and individuals has created tremendous opportunities for knowledge- and information-based decision making. Driven by mutual benefits, or by regulations that require certain data to be published, there is a demand for the exchange and publication of data among various parties. Data in its original form, however, typically contains sensitive information about individuals, and publishing such data will violate individual privacy. The current practice in data publishing relies mainly on policies and guidelines as to what types of data can be published and on agreements on the use of published data. This approach alone may lead to excessive data distortion or insufficient protection. Privacy-preserving data publishing (PPDP) provides methods and tools for publishing useful information while preserving data privacy. Recently, PPDP has received considerable attention in research communities, and many approaches have been proposed for different data publishing scenarios. In this survey, we will systematically summarize and evaluate different approaches to PPDP, study the challenges in practical data publishing, clarify the differences and requirements that distinguish PPDP from other related problems, and propose future research directions.

International Conference on Data Engineering | 2005

Top-down specialization for information and privacy preservation

Benjamin C. M. Fung; Ke Wang; Philip S. Yu

Releasing person-specific data in its most specific state poses a threat to individual privacy. This paper presents a practical and efficient algorithm for determining a generalized version of data that masks sensitive information and remains useful for modelling classification. The generalization of data is implemented by specializing or detailing the level of information in a top-down manner until a minimum privacy requirement is violated. This top-down specialization is natural and efficient for handling both categorical and continuous attributes. Our approach exploits the fact that data usually contains redundant structures for classification. While generalization may eliminate some structures, other structures emerge to help. Our results show that quality of classification can be preserved even for highly restrictive privacy requirements. This work has great applicability to both public and private sectors that share information for mutual benefits and productivity.
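The top-down idea described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the paper selects specializations with a classification-aware score, whereas this sketch simply accepts the first specialization that keeps every group at size k or more. The taxonomy, the job values, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical taxonomy: child value -> parent (more general) value.
PARENT = {
    "Engineer": "Professional",
    "Lawyer": "Professional",
    "Painter": "Artist",
    "Dancer": "Artist",
    "Professional": "Any",
    "Artist": "Any",
}

def generalize(value, cut):
    """Climb the taxonomy until the value is one of the nodes in the cut."""
    while value not in cut:
        value = PARENT[value]
    return value

def is_k_anonymous(values, cut, k):
    """True if every generalized value is shared by at least k records."""
    counts = Counter(generalize(v, cut) for v in values)
    return all(c >= k for c in counts.values())

def specialize_top_down(values, k):
    """Start from the most general cut and refine it while k-anonymity holds."""
    cut = {"Any"}
    changed = True
    while changed:
        changed = False
        for node in sorted(cut):
            children = {c for c, p in PARENT.items() if p == node}
            if not children:
                continue  # leaf value, cannot be specialized further
            candidate = (cut - {node}) | children
            if is_k_anonymous(values, candidate, k):
                cut = candidate
                changed = True
                break
    return cut

jobs = ["Engineer", "Engineer", "Lawyer", "Painter", "Dancer", "Dancer"]
final_cut = specialize_top_down(jobs, k=2)
```

With these toy records the search stops at the cut of "Professional" and "Artist", because specializing either node would leave a group with a single record.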

Knowledge Discovery and Data Mining | 2006

Anonymizing sequential releases

Ke Wang; Benjamin C. M. Fung

An organization makes a new release as new information becomes available, releases a tailored view for each data request, and releases sensitive information and identifying information separately. The availability of related releases sharpens the identification of individuals by a global quasi-identifier consisting of attributes from related releases. Since it is not an option to anonymize previously released data, the current release must be anonymized to ensure that a global quasi-identifier is not effective for identification. In this paper, we study the sequential anonymization problem under this assumption. A key question is how to anonymize the current release so that it cannot be linked to previous releases yet remains useful for its own release purpose. We introduce the lossy join, a negative property in relational database design, as a way to hide the join relationship among releases, and propose a scalable and practical solution.
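The lossy join mentioned above is a standard relational notion: joining two projections of a table can produce spurious rows that were never in the original. The sketch below only illustrates that effect on hypothetical toy data; it does not reproduce the paper's anonymization procedure, and the attribute names and the `natural_join` helper are assumptions made for illustration.

```python
def natural_join(left, right, key):
    """Join two lists of row dicts on a single shared attribute."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Hypothetical original table held by the data owner.
original = [
    {"job": "Engineer", "age": 30, "disease": "Flu"},
    {"job": "Engineer", "age": 45, "disease": "HIV"},
]

# Two releases that share only the (generalized) join attribute "job".
release_1 = [{"job": r["job"], "age": r["age"]} for r in original]
release_2 = [{"job": r["job"], "disease": r["disease"]} for r in original]

# An adversary who joins the releases cannot tell real rows from spurious ones.
joined = natural_join(release_1, release_2, "job")
spurious = [row for row in joined if row not in original]
```

Here the join returns four rows for two original records, so each age is linked to both diseases and the true pairing is hidden.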

Knowledge Discovery and Data Mining | 2011

Differentially private data release for data mining

Noman Mohammed; Rui Chen; Benjamin C. M. Fung; Philip S. Yu

Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among the existing privacy models, ε-differential privacy provides one of the strongest privacy guarantees and makes no assumptions about an adversary's background knowledge. Most of the existing solutions that ensure ε-differential privacy are based on an interactive model, where the data miner is only allowed to pose aggregate queries to the database. In this paper, we propose the first anonymization algorithm for the non-interactive setting based on the generalization technique. The proposed solution first probabilistically generalizes the raw data and then adds noise to guarantee ε-differential privacy. As a sample application, we show that the anonymized data can be used effectively to build a decision tree induction classifier. Experimental results demonstrate that the proposed non-interactive anonymization algorithm is scalable and performs better than the existing solutions for classification analysis.
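The noise-adding step described above can be sketched with the standard Laplace mechanism. This is a simplified illustration, not the paper's algorithm: the probabilistic generalization step is skipped, the generalized groups are assumed to be given, and the whole privacy budget is spent on the counts. The function names and the toy data are hypothetical. The sketch assumes neighboring datasets differ by one added or removed record, so a count histogram has sensitivity 1.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_counts(records, domain, epsilon, rng):
    """Return a noisy count for every group in a fixed, data-independent domain."""
    scale = 1.0 / epsilon  # sensitivity 1 divided by the privacy budget
    return {
        group: sum(1 for r in records if r == group) + laplace_noise(scale, rng)
        for group in domain
    }

rng = random.Random(7)  # seeded only so the example is repeatable
generalized = ["Professional", "Professional", "Artist"]
domain = ["Professional", "Artist", "Other"]  # includes groups with zero records
released = noisy_counts(generalized, domain, epsilon=1.0, rng=rng)
```

Counting over the full domain, including empty groups, matters: releasing noisy counts only for groups that happen to occur would reveal which groups are present in the raw data.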

Extending Database Technology | 2008

Anonymity for continuous data publishing

Benjamin C. M. Fung; Ke Wang; Ada Wai-Chee Fu; Jian Pei

k-anonymization is an important privacy protection mechanism in data publishing. While there has been a great deal of work in recent years, almost all considered a single static release. Such mechanisms only protect the data up to the first release or first recipient. In practical applications, data is published continuously as new data arrives; the same data may be anonymized differently for a different purpose or a different recipient. In such scenarios, even when all releases are properly k-anonymized, the anonymity of an individual may be unintentionally compromised if a recipient cross-examines all the releases received or colludes with other recipients. Preventing such attacks, called correspondence attacks, faces major challenges. In this paper, we systematically characterize the correspondence attacks and propose an efficient anonymization algorithm to thwart the attacks in the model of continuous data publishing.

Information Sciences | 2013

Privacy-preserving trajectory data publishing by local suppression

Rui Chen; Benjamin C. M. Fung; Noman Mohammed; Bipin C. Desai; Ke Wang

The pervasiveness of location-aware devices has spawned extensive research in trajectory data mining, resulting in many important real-life applications. Yet, the privacy issue in sharing trajectory data among different parties often creates an obstacle for effective data mining. In this paper, we study the challenges of anonymizing trajectory data: high dimensionality, sparseness, and sequentiality. Employing traditional privacy models and anonymization methods often leads to low data utility in the resulting data and ineffective data mining. In addressing these challenges, this is the first paper to introduce local suppression to achieve a tailored privacy model for trajectory data anonymization. The framework allows the adoption of various data utility metrics for different data mining tasks. As an illustration, we aim at preserving both instances of location-time doublets and frequent sequences in a trajectory database, both being the foundation of many trajectory data mining tasks. Our experiments on both synthetic and real-life data sets suggest that the framework is effective and efficient in overcoming the challenges in trajectory data anonymization. In particular, compared with previous work in the literature, our proposed local suppression method can significantly improve the data utility in anonymous trajectory data.

ACM Transactions on Knowledge Discovery From Data | 2010

Centralized and Distributed Anonymization for High-Dimensional Healthcare Data

Noman Mohammed; Benjamin C. M. Fung; Patrick C. K. Hung; Cheuk-kwong Lee

Sharing healthcare data has become a vital requirement in healthcare system management; however, inappropriate sharing and usage of healthcare data could threaten patients’ privacy. In this article, we study the privacy concerns of sharing patient information between the Hong Kong Red Cross Blood Transfusion Service (BTS) and the public hospitals. We generalize their information and privacy requirements to the problems of centralized anonymization and distributed anonymization, and identify the major challenges that make traditional data anonymization methods inapplicable. Furthermore, we propose a new privacy model called LKC-privacy to overcome the challenges and present two anonymization algorithms to achieve LKC-privacy in both the centralized and the distributed scenarios. Experiments on real-life data demonstrate that our anonymization algorithms can effectively retain the essential information in anonymous data for data analysis and are scalable for anonymizing large datasets.

Archive | 2010

Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques

Benjamin C. M. Fung; Ke Wang; Ada Wai-Chee Fu; Philip S. Yu

Gaining access to high-quality data is a vital necessity in knowledge-based decision making. But data in its raw form often contains sensitive information about individuals. Providing solutions to this problem, the methods and tools of privacy-preserving data publishing enable the publication of useful information while protecting data privacy. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques presents state-of-the-art information sharing and data integration methods that take into account privacy and data mining requirements. The first part of the book discusses the fundamentals of the field. In the second part, the authors present anonymization methods for preserving information utility for specific data mining tasks. The third part examines the privacy issues, privacy models, and anonymization methods for realistic and challenging data publishing scenarios. While the first three parts focus on anonymizing relational data, the last part studies the privacy threats, privacy models, and anonymization methods for complex data, including transaction, trajectory, social network, and textual data. This book not only explores privacy and information utility issues but also efficiency and scalability challenges. In many chapters, the authors highlight efficient and scalable methods and provide an analytical discussion to compare the strengths and weaknesses of different solutions.

Information Sciences | 2013

A unified data mining solution for authorship analysis in anonymous textual communications

Farkhund Iqbal; Hamad Binsalleeh; Benjamin C. M. Fung; Mourad Debbabi

The cyber world provides an anonymous environment for criminals to conduct malicious activities such as spamming, sending ransom e-mails, and spreading botnet malware. Often, these activities involve textual communication between a criminal and a victim, or between criminals themselves. The forensic analysis of online textual documents to address the anonymity problem, called authorship analysis, is the focus of most cybercrime investigations. Authorship analysis is the statistical study of linguistic and computational characteristics of the written documents of individuals. This paper is the first work that presents a unified data mining solution to address authorship analysis problems based on the concept of frequent pattern-based writeprint. Extensive experiments on real-life data suggest that our proposed solution can precisely capture the writing styles of individuals. Furthermore, the writeprint is effective in identifying the author of an anonymous text from a group of suspects and in inferring sociolinguistic characteristics of the author.

International Conference on Data Mining | 2012

Direct Discovery of High Utility Itemsets without Candidate Generation

Junqiang Liu; Ke Wang; Benjamin C. M. Fung

Utility mining emerged recently to address the limitation of frequent itemset mining by introducing interestingness measures that reflect both the statistical significance and the user's expectation. Among utility mining problems, utility mining with the itemset share framework is a hard one as no anti-monotone property holds with the interestingness measure. The state-of-the-art works on this problem all employ a two-phase, candidate generation approach, which suffers from the scalability issue due to the huge number of candidates. This paper proposes a high utility itemset growth approach that works in a single phase without generating candidates. Our basic approach is to enumerate itemsets by prefix extensions, to prune search space by utility upper bounding, and to maintain original utility information in the mining process by a novel data structure. Such a data structure enables us to compute a tight bound for powerful pruning and to directly identify high utility itemsets in an efficient and scalable way. We further enhance the efficiency significantly by introducing recursive irrelevant item filtering with sparse data, and a lookahead strategy with dense data. Extensive experiments on sparse and dense, synthetic and real data suggest that our algorithm outperforms the state-of-the-art algorithms by more than an order of magnitude.
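To see why no anti-monotone property holds, a small sketch of itemset utility helps. It uses the common quantity-times-unit-profit definition and the transaction-weighted utilization (TWU) upper bound familiar from two-phase methods; it is not the paper's data structure or its tighter bound. The transactions, profits, and function names are hypothetical.

```python
# Hypothetical toy data: each transaction maps item -> purchased quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
UNIT_PROFIT = {"a": 1, "b": 5, "c": 2}

def utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit of the itemset over transactions containing it."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total

def twu(itemset, transactions, unit_profit):
    """Upper bound: total utility of every transaction that contains the itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(qty * unit_profit[item] for item, qty in t.items())
    return total

u_a = utility({"a"}, TRANSACTIONS, UNIT_PROFIT)              # 4
u_ab = utility({"a", "b"}, TRANSACTIONS, UNIT_PROFIT)        # 17
u_abc = utility({"a", "b", "c"}, TRANSACTIONS, UNIT_PROFIT)  # 12
bound_ab = twu({"a", "b"}, TRANSACTIONS, UNIT_PROFIT)        # 23
```

Utility rises from {a} to {a, b} and then falls for {a, b, c}, so a low-utility itemset cannot be used to discard its supersets. The TWU value, by contrast, never increases as items are added, which is what makes an upper bound usable for pruning.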
