Dinusha Vatsalan | Researchain

Publication

Featured research published by Dinusha Vatsalan.

conference on information and knowledge management | 2013

GeCo: an online personal data generator and corruptor

Khoi-Nguyen Tran; Dinusha Vatsalan; Peter Christen

We demonstrate GeCo, an online personal data GEnerator and COrruptor that facilitates the creation of realistic personal data ranging from names, addresses, and dates, to social security and credit card numbers, as well as numerical values such as salary or blood pressure. Using an intuitive Web interface, a user can create records containing such data according to their needs, and apply various corruption functions to generate duplicates of these records. Synthetic personal data are increasingly required in areas such as record de-duplication, fraud detection, cloud computing, and health informatics, where data quality issues can significantly affect the outcomes of data integration, processing, and mining projects. Privacy concerns, however, often make it difficult for researchers to obtain real data that contain personal details. Compared to other data generators that have to be downloaded, installed, and customized, GeCo allows the creation of personal data with much less effort. In this demonstration we show (1) how different types of attributes, and dependencies between them, can be specified; (2) how the generated data can be modified using various types of corruption functions; and (3) how a user can contribute to GeCo by providing attribute generation functions and look-up files. We believe GeCo will be a valuable tool for researchers who require realistic personal data to evaluate their algorithms with regard to efficiency and effectiveness.

conference on information and knowledge management | 2013

Flexible and extensible generation and corruption of personal data

Peter Christen; Dinusha Vatsalan

With much of today's data being generated by people or referring to people, researchers increasingly require data that contain personal identifying information to evaluate their new algorithms. In areas such as record matching and de-duplication, fraud detection, cloud computing, and health informatics, issues such as data entry errors, typographical mistakes, noise, or recording variations can all significantly affect the outcomes of data integration, processing, and mining projects. However, privacy concerns make it challenging to obtain real data that contain personal details. An alternative to using sensitive real data is to create synthetic data which follow similar characteristics. The advantages of synthetic data are that (1) they can be generated with well-defined characteristics; (2) it is known which records represent an individual created entity (this is often unknown in real data); and (3) the generated data and the generator program itself can be published. We present a sophisticated data generation and corruption tool that allows the creation of various types of data, ranging from names and addresses, dates, social security and credit card numbers, to numerical values such as salary or blood pressure. Our tool can model dependencies between attributes, and it allows the corruption of values in various ways. We describe the overall architecture and main components of our tool, and illustrate how a user can easily extend this tool with novel functionalities.
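
To make the idea of a corruption function concrete, the following is a minimal sketch and not the tool's actual code. The names `corrupt_value` and `make_duplicates`, the set of edit operations, and the record layout are all assumptions chosen for illustration.

```python
import random
import string


def corrupt_value(value: str, rng: random.Random) -> str:
    """Apply one random character-level edit (hypothetical corruption function)."""
    if not value:
        return value
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    pos = rng.randrange(len(value))
    if op == "insert":
        return value[:pos] + rng.choice(string.ascii_lowercase) + value[pos:]
    if op == "delete":
        return value[:pos] + value[pos + 1:]
    if op == "substitute":
        return value[:pos] + rng.choice(string.ascii_lowercase) + value[pos + 1:]
    # Transpose two adjacent characters; a single character cannot be transposed.
    if len(value) < 2:
        return value
    pos = min(pos, len(value) - 2)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def make_duplicates(record: dict, n: int, rng: random.Random) -> list:
    """Create n duplicates of a record, each with one corrupted attribute."""
    duplicates = []
    for _ in range(n):
        dup = dict(record)  # copy so the original record stays untouched
        field = rng.choice(sorted(dup))
        dup[field] = corrupt_value(dup[field], rng)
        duplicates.append(dup)
    return duplicates
```

Because each duplicate is derived from a known original, the true match status of every record pair is known, which is the second advantage of synthetic data listed in the abstract.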

conference on information and knowledge management | 2013

Efficient two-party private blocking based on sorted nearest neighborhood clustering

Dinusha Vatsalan; Peter Christen; Vassilios S. Verykios

Integrating data from diverse sources with the aim of identifying similar records that refer to the same real-world entities without compromising privacy of these entities is an emerging research problem in various domains. This problem is known as privacy-preserving record linkage (PPRL). Scalability of PPRL is a major challenge due to growing data size in real-world applications. Private blocking techniques have been used in PPRL to address this challenge by reducing the number of record pair comparisons that need to be conducted. Many of these private blocking techniques require a trusted third party to perform the blocking. One main threat with three-party solutions is the collusion between parties to identify the private data of another party. We introduce a novel two-party private blocking technique for PPRL based on sorted nearest neighborhood clustering. Privacy is addressed by a combination of the privacy techniques k-anonymous clustering and public reference values. Experiments conducted on two real-world databases validate that our approach is scalable to large databases and effective in generating candidate record pairs that correspond to true matches, while preserving k-anonymous privacy characteristics. Our approach also performs as well as or better than three other state-of-the-art private blocking techniques in terms of scalability, blocking quality, and privacy. It can achieve private blocking up to two orders of magnitude faster than other state-of-the-art private blocking approaches.
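
As background for how blocking reduces the number of comparisons, here is a sketch of plain (non-private) sorted neighbourhood blocking. It is not the two-party protocol described above: it omits the reference values and k-anonymous clustering entirely, and the function name, record layout, and window parameter are assumptions.

```python
from itertools import combinations


def sorted_neighbourhood_pairs(records, key, window):
    """Return candidate pairs from a sliding window over the sorted records.

    records: hashable items, here (record_id, name) tuples
    key:     function producing the sorting (blocking) key
    window:  number of neighbouring records compared with each other
    """
    ordered = sorted(records, key=key)
    pairs = set()
    # Slide the window one position at a time; only records that fall
    # inside the same window are ever compared.
    for start in range(max(1, len(ordered) - window + 1)):
        for a, b in combinations(ordered[start:start + window], 2):
            pairs.add((a, b))
    return pairs


records = [(1, "smith"), (2, "smyth"), (3, "jones"), (4, "johns")]
candidates = sorted_neighbourhood_pairs(records, key=lambda r: r[1], window=2)
```

With four records there are six possible pairs; a window of two keeps only the three pairs that are adjacent in sorted order.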

international conference on ehealth, telemedicine, and social medicine | 2010

Mobile Technologies for Enhancing eHealth Solutions in Developing Countries

Dinusha Vatsalan; Shiromi M. K. D. Arunatileka; Keith Chapman; Gihan Senaviratne; Saatviga Sudahar; Dulindra Wijetileka; Yvonne Wickramasinghe

The high penetration of mobile devices and networks globally implies that mobile technologies can be used very effectively in healthcare to compensate for the scarcity of resources, particularly in developing countries. With the proliferation of mobile technologies, mobile health (mHealth) will play a vital role in the rapidly growing electronic health (eHealth) area. Transparency between healthcare domains and patients is needed to compensate for the lack of medical resources (e.g. consultants, specialists), especially in developing countries. This paper investigates the applicability of available mobile infrastructure and technologies in the health sector and proposes an effective mHealth model to suit the Sri Lankan setting.

Journal of Biomedical Informatics | 2016

Privacy-preserving matching of similar patients

Dinusha Vatsalan; Peter Christen

The identification of similar entities represented by records in different databases has drawn considerable attention in many application areas, including in the health domain. One important type of entity matching application that is vital for quality healthcare analytics is the identification of similar patients, known as similar patient matching. A key component of identifying similar records is the calculation of similarity of the values in attributes (fields) between these records. Due to increasing privacy and confidentiality concerns, using the actual attribute values of patient records to identify similar records across different organizations is becoming non-trivial because the attributes in such records often contain highly sensitive information such as personal and medical details of patients. Therefore, the matching needs to be based on masked (encoded) values while being effective and efficient to allow matching of large databases. Bloom filter encoding has been widely used as an efficient masking technique for privacy-preserving matching of string and categorical values. However, no work on Bloom filter-based masking of numerical data, such as integer (e.g. age), floating point (e.g. body mass index), and modulus (numbers that wrap around upon reaching a certain value, e.g. date and time), which are commonly required in the health domain, has been presented in the literature. We propose a framework with novel methods for masking numerical data using Bloom filters, thereby facilitating the calculation of similarities between records. We conduct an empirical study on publicly available real-world datasets which shows that our framework provides efficient masking and achieves similar matching accuracy compared to the matching of actual unencoded patient records.
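
The following sketch illustrates Bloom filter masking and similarity calculation on masked values. The q-gram encoding of strings is the widely used approach mentioned in the abstract; the `neighbour_tokens` helper for numbers is only an illustrative assumption and is not claimed to be the method proposed in the paper. All function names and parameter values are hypothetical, and unkeyed SHA-256 is used purely for brevity.

```python
import hashlib


def qgrams(text, q=2):
    """Split a string into overlapping character q-grams."""
    text = text.lower()
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def neighbour_tokens(value, step=1, width=3):
    """Assumed numeric tokenisation: the value plus its close neighbours."""
    return [str(value + i * step) for i in range(-width, width + 1)]


def bloom_encode(tokens, num_bits=1000, num_hashes=2):
    """Hash every token num_hashes times; return the set of bit positions set to 1."""
    bits = set()
    for token in tokens:
        for i in range(num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % num_bits)
    return frozenset(bits)


def dice_similarity(a, b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Similar strings share q-grams, so their filters share many 1-bits.
name_sim = dice_similarity(bloom_encode(qgrams("smith")), bloom_encode(qgrams("smyth")))
# Close numbers share neighbour tokens; distant numbers share none.
age_sim = dice_similarity(bloom_encode(neighbour_tokens(30)), bloom_encode(neighbour_tokens(31)))
```

Only the bit positions are compared, so the similarity is computed without either side seeing the other's original attribute values.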

Handbook of Big Data Technologies | 2017

Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges

Dinusha Vatsalan; Ziad Sehili; Peter Christen; Erhard Rahm

The growth of Big Data, especially personal data dispersed in multiple data sources, presents enormous opportunities and insights for businesses to explore and leverage the value of linked and integrated data. However, privacy concerns impede sharing or exchanging data for linkage across different organizations. Privacy-preserving record linkage (PPRL) aims to address this problem by identifying and linking records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these entities. PPRL is increasingly being required in many real-world application areas. Examples range from public health surveillance to crime and fraud detection, and national security. PPRL for Big Data poses several challenges, with the three major ones being (1) scalability to multiple large databases, due to their massive volume and the flow of data within Big Data applications, (2) achieving high quality results of the linkage in the presence of variety and veracity of Big Data, and (3) preserving privacy and confidentiality of the entities represented in Big Data collections. In this chapter, we describe the challenges of PPRL in the context of Big Data, survey existing techniques for PPRL, and provide directions for future research.

pacific-asia conference on knowledge discovery and data mining | 2015

Efficient Interactive Training Selection for Large-Scale Entity Resolution

Qing Wang; Dinusha Vatsalan; Peter Christen

Entity resolution (ER) has widespread applications in many areas, including e-commerce, healthcare, the social sciences, and crime and fraud detection. A crucial step in ER is the accurate classification of pairs of records into matches (assumed to refer to the same entity) and non-matches (assumed to refer to different entities). In most practical ER applications it is difficult and costly to obtain training data of sufficient quality and size, which impedes the learning of an ER classifier. We tackle this problem using an interactive learning algorithm that exploits the cluster structure in similarity vectors calculated from compared record pairs. We select informative training examples to assess the purity of clusters, and recursively split clusters until clusters pure enough for training are found. We consider two aspects of active learning that are significant in practical applications: a limited budget for the number of manual classifications that can be done, and a noisy oracle where manual labeling might be incorrect. Experiments using several real data sets show that manual labeling efforts can be significantly reduced for training an ER classifier without compromising matching quality.

Journal of Data and Information Quality | 2014

Challenges for privacy preservation in data integration

Peter Christen; Dinusha Vatsalan; Vassilios S. Verykios

Techniques for integrating data from diverse sources have attracted significant interest in recent years. Much of today’s data collected by businesses and governments are about people, and integrating such data across organizations can raise privacy concerns. Various techniques that preserve privacy during data integration have been developed, but several challenges persist that need to be solved before such techniques become useful in practical applications. We elaborate on these challenges and discuss research directions.

pacific-asia conference on knowledge discovery and data mining | 2015

Clustering-Based Scalable Indexing for Multi-party Privacy-Preserving Record Linkage

Thilina Ranbaduge; Dinusha Vatsalan; Peter Christen

The identification of common sets of records in multiple databases has become an increasingly important subject in many application areas, including banking, health, and national security. Often privacy concerns and regulations prevent the owners of the databases from sharing any sensitive details of their records with each other, and with any other party. The linkage of records in multiple databases while preserving privacy and confidentiality is an emerging research discipline known as privacy-preserving record linkage (PPRL). We propose a novel two-step indexing (blocking) approach for PPRL between multiple (more than two) parties. First, we generate small mini-blocks using a multi-bit Bloom filter splitting method, and second, we merge these mini-blocks based on their similarity using a novel hierarchical canopy clustering technique. An empirical study conducted with large datasets of up to one million records shows that our approach is scalable with the size of the datasets and the number of parties, while providing better privacy than previous multi-party indexing approaches.
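
The first step, splitting Bloom filters into mini-blocks, can be pictured with the simplified sketch below. It splits on a single bit position at a time and leaves out the merging step, so it should be read as an assumption-laden illustration rather than the multi-bit method or the canopy clustering of the paper. The function name, the choice of splitting bit, and the `max_size` parameter are hypothetical.

```python
def split_into_mini_blocks(filters, max_size):
    """Recursively split records into blocks of at most max_size records.

    filters: dict mapping record id -> set of 1-bit positions of its Bloom filter
    """
    def split(ids):
        if len(ids) <= max_size:
            return [ids]
        candidates = set().union(*(filters[i] for i in ids))
        best_bit, best_gap = None, None
        for bit in sorted(candidates):
            ones = sum(1 for i in ids if bit in filters[i])
            # Prefer the bit that divides the block most evenly; skip bits
            # that are set in every record, since they cannot split it.
            gap = abs(2 * ones - len(ids))
            if 0 < ones < len(ids) and (best_gap is None or gap < best_gap):
                best_bit, best_gap = bit, gap
        if best_bit is None:
            return [ids]  # all filters identical, no further split possible
        with_bit = [i for i in ids if best_bit in filters[i]]
        without_bit = [i for i in ids if best_bit not in filters[i]]
        return split(with_bit) + split(without_bit)

    return split(sorted(filters))


filters = {"a": {1, 2}, "b": {1, 3}, "c": {4, 5}, "d": {4, 6}}
blocks = split_into_mini_blocks(filters, max_size=2)
```

Records whose filters agree on the chosen bits end up in the same mini-block, which is why similar records tend to stay together.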

knowledge discovery and data mining | 2016

Hashing-Based Distributed Multi-party Blocking for Privacy-Preserving Record Linkage

Thilina Ranbaduge; Dinusha Vatsalan; Peter Christen; Vassilios S. Verykios

In many application domains organizations require information from multiple sources to be integrated. Due to privacy and confidentiality concerns, these organizations are often not willing or allowed to reveal their sensitive and personal data to other database owners, and to any external party. This has led to the emerging research discipline of privacy-preserving record linkage (PPRL). We propose a novel blocking approach for multi-party PPRL to efficiently and effectively prune the record sets that are unlikely to match. Our approach allows each database owner to perform blocking independently except for the initial agreement of parameter settings and a final central hashing-based clustering. We provide an analysis of our technique in terms of complexity, quality, and privacy, and conduct an empirical study with large datasets. The results show that our approach is scalable with the size of the datasets and the number of parties, while providing better quality and privacy than previous multi-party private blocking approaches.
