Dimitrios Karapiperis

Publication

Featured research published by Dimitrios Karapiperis.

IEEE Transactions on Knowledge and Data Engineering | 2015

An LSH-Based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage

Dimitrios Karapiperis; Vassilios S. Verykios

We present a Λ-fold Redundant Blocking Framework that relies on the Locality-Sensitive Hashing technique for identifying candidate record pairs that have undergone an anonymization transformation. In this context, we demonstrate the usage and evaluate the performance of a variety of families of hash functions used for blocking. We illustrate that the performance attained is highly correlated with the distance-preserving properties of the anonymization format used. The parameters of the blocking scheme are optimally selected so that we achieve the highest possible accuracy in the least possible running time. We also introduce an SMC-based protocol to compare the formulated record pairs homomorphically, without running the risk of breaching the privacy of the underlying records.

Balkan Conference in Informatics | 2013

A distributed framework for scaling Up LSH-based computations in privacy preserving record linkage

Dimitrios Karapiperis; Vassilios S. Verykios

Privacy-Preserving Record Linkage (PPRL) is the scientific field that explores methods of linking datasets in order to identify common entities efficiently and accurately while simultaneously preserving the privacy of the underlying data. In this paper, we present a distributed Locality-Sensitive Hashing-based framework for linking huge collections of records. The framework groups similar records efficiently and distributes computations uniformly among underutilized commodity hardware resources, without imposing extra overhead on the existing infrastructure, thus promoting scalability. We also propose two methods of assessing computational cost, aiming to distribute the workload evenly among compute nodes.

Knowledge and Information Systems | 2016

A fast and efficient Hamming LSH-based scheme for accurate linkage

Dimitrios Karapiperis; Vassilios S. Verykios

In this paper, we propose an efficient scheme for privacy-preserving record linkage that uses the Hamming locality-sensitive hashing technique as the blocking mechanism and the Bloom filter-based encoding method for anonymizing the data sets at hand. We achieve highly accurate results and simultaneously reduce the computational cost significantly by minimizing the number of distance computations performed. Our scheme provides theoretical guarantees for identifying the similar anonymized record pairs by conducting redundant blocking and by performing a distance computation only if the corresponding anonymized record pair is formulated a specified number of times. A series of experiments illustrates the efficacy of our scheme in identifying the similar record pairs, while simultaneously keeping the running time exceptionally low.
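The idea of comparing a pair only after it has been formulated a specified number of times can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hlsh_link` and the parameters `num_tables`, `bits_per_key`, `min_collisions`, and `threshold` are hypothetical, and each hash table is assumed to key a record by a random sample of its bit positions.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(x != y for x, y in zip(a, b))


def hlsh_link(vectors, num_tables, bits_per_key, min_collisions, threshold, seed=0):
    """Redundant Hamming LSH blocking with a collision counter (hypothetical sketch).

    A pair is compared only if it lands in the same bucket in at least
    `min_collisions` of the `num_tables` hash tables.
    """
    rng = random.Random(seed)
    m = len(vectors[0])
    collisions = defaultdict(int)
    for _ in range(num_tables):
        # Each table keys a vector by the bits at randomly sampled positions.
        positions = rng.sample(range(m), bits_per_key)
        buckets = defaultdict(list)
        for idx, vec in enumerate(vectors):
            buckets[tuple(vec[p] for p in positions)].append(idx)
        for members in buckets.values():
            for i, j in combinations(members, 2):
                collisions[(i, j)] += 1
    matches = []
    for (i, j), count in collisions.items():
        # The distance computation runs only for frequently formulated pairs.
        if count >= min_collisions and hamming(vectors[i], vectors[j]) <= threshold:
            matches.append((i, j))
    return sorted(matches)
```

Raising `min_collisions` trades recall for fewer distance computations; the values that give the guarantees mentioned in the abstract are derived in the paper and are not reproduced here.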

International Conference on Data Engineering | 2017

Distance-Aware Encoding of Numerical Values for Privacy-Preserving Record Linkage

Dimitrios Karapiperis; Aris Gkoulalas-Divanis; Vassilios S. Verykios

In this work, we propose Bit Vectors (BV), an accurate, distance-preserving encoding scheme for representing numerical data values in privacy-preserving tasks. Although many methods have been proposed in the literature for encoding strings, the problem of encoding numerical values has not yet been effectively addressed. In Privacy-Preserving Record Linkage (PPRL), a number of data custodians encode their records and submit them to a trusted third party that is responsible for identifying those records that refer to the same real-world entity. BV is supported by a strong theoretical foundation for embedding numerical values into an anonymization space in a way that preserves the initial distances. Key components of this embedding process are (a) the employed hash functions, which allow for approximate matching by utilizing random intervals, and (b) the threshold required by the distance computations, which we prove can be specified in a way that guarantees accurate results.

International Conference on Data Mining | 2016

LSHDB: a parallel and distributed engine for record linkage and similarity search

Dimitrios Karapiperis; Aris Gkoulalas-Divanis; Vassilios S. Verykios

In this paper, we present LSHDB, the first parallel and distributed engine for record linkage and similarity search. LSHDB materializes an abstraction layer that hides the mechanics of Locality-Sensitive Hashing (a popular method for detecting similar items in high dimensions), which is used as the underlying similarity search engine. LSHDB creates the appropriate data structures from the input data and persists these structures on disk using a NoSQL engine. It inherently supports the parallel processing of distributed queries, is highly extensible, and is easy to use. We will demonstrate LSHDB both as the underlying system for detecting similar records in the context of Record Linkage (and of Privacy-Preserving Record Linkage) tasks, and as a search engine for identifying string values that are similar to submitted queries.

SIGKDD Explorations | 2015

Load-Balancing the Distance Computations in Record Linkage

Dimitrios Karapiperis; Vassilios S. Verykios

In this paper, we propose a novel method for distributing the distance computations of record pairs generated by a blocking mechanism to the reduce tasks of a Map/Reduce system. The solutions proposed in the literature analyze the blocks and then construct a profile, which contains the number of record pairs in each block. However, this deterministic process, including all its variants, might incur considerable overhead given massive data sets. In contrast, our method utilizes two Map/Reduce jobs: the first job formulates the record pairs, while the second job distributes these pairs to the reduce tasks, which perform the distance computations, using repetitive allocation rounds. In each such round, we utilize all the available reduce tasks on a random basis by generating permutations of their indexes. A series of experiments demonstrates an almost-equal distribution of the record pairs, or equivalently of the distance computations, to the reduce tasks, which makes our method a simple, yet efficient, solution for applying a blocking mechanism given massive data sets.
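The allocation rounds described above can be illustrated on a single machine. This is a minimal sketch, assuming that each round draws a fresh random permutation of the reduce task indexes and hands one record pair to each task in that order; the function name `allocate_pairs` is hypothetical, and the real method runs inside a Map/Reduce job rather than in one loop.

```python
import random
from collections import Counter


def allocate_pairs(num_pairs, num_reducers, seed=0):
    """Assign each record pair to a reduce task index using permutation rounds.

    Every round uses each reduce task exactly once, in random order, so the
    loads of any two tasks differ by at most one pair (hypothetical sketch).
    """
    rng = random.Random(seed)
    assignment = []
    order = []
    for k in range(num_pairs):
        if k % num_reducers == 0:
            # Start a new allocation round with a fresh random permutation.
            order = list(range(num_reducers))
            rng.shuffle(order)
        assignment.append(order[k % num_reducers])
    return assignment


# Per-task load for 1003 pairs spread over 10 reduce tasks.
loads = Counter(allocate_pairs(1003, 10))
```

No block profile is needed in this sketch: the balance comes from the rounds themselves, which is the property the abstract contrasts with profile-based approaches.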

Database Systems for Advanced Applications | 2015

Large-Scale Multi-party Counting Set Intersection Using a Space Efficient Global Synopsis

Dimitrios Karapiperis; Dinusha Vatsalan; Vassilios S. Verykios; Peter Christen

Privacy-preserving set intersection (PPSI) of very large data sets is increasingly required in many real application areas, including health care, national security, and law enforcement. Various techniques have been developed to address this problem, the majority of which rely on computationally expensive cryptographic techniques. Moreover, conventional data structures cannot be used efficiently for providing count estimates of the elements of the intersection of very large data sets. We consider the problem of efficient PPSI by integrating sets from multiple (three or more) sources in order to create a global synopsis, which is the result of the intersection of efficient data structures known as Count-Min sketches. This global synopsis furthermore provides count estimates of the intersected elements. We propose two protocols for the creation of this global synopsis, which are based on homomorphic computations, a secure distributed summation scheme, and a symmetric noise addition technique. Experiments conducted on large synthetic and real data sets show the efficiency and accuracy of our protocols, while at the same time privacy under the Honest-but-Curious model is preserved.
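To make the notion of a global synopsis concrete, the sketch below builds Count-Min sketches for several parties and combines them in plaintext. It is an illustration under assumptions, not either of the proposed protocols: the intersection is assumed here to be the element-wise minimum of the parties' counters, the class and function names are hypothetical, and the homomorphic computations, secure summation, and noise addition are omitted entirely.

```python
import hashlib


class CountMinSketch:
    """A small Count-Min sketch with SHA-256-derived row hashes."""

    def __init__(self, depth=4, width=64):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, row, item):
        # Derive a per-row column index that is stable across processes.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the row minimum is the
        # tightest available upper bound on the true count.
        return min(self.table[row][self._col(row, item)] for row in range(self.depth))


def intersect(sketches):
    """Plaintext global synopsis: element-wise minimum (an assumption)."""
    first = sketches[0]
    out = CountMinSketch(first.depth, first.width)
    for row in range(first.depth):
        for col in range(first.width):
            out.table[row][col] = min(s.table[row][col] for s in sketches)
    return out
```

With this choice of combination, the synopsis estimate for an item equals the smallest of the parties' individual estimates, so it never falls below the smallest true count of that item across the parties.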

IEEE International Conference on Cloud Computing Technology and Science | 2015

A Tutorial on Blocking Methods for Privacy-Preserving Record Linkage

Dimitrios Karapiperis; Vassilios S. Verykios; Eleftheria Katsiri; Alex Delis

In this paper, we first present five state-of-the-art private blocking methods, which rely mainly on random strings, clustering, and public reference sets. We emphasize the drawbacks of these methods and then present our L-fold redundant blocking scheme, which relies on the Locality-Sensitive Hashing technique for identifying similar records. These records have undergone an anonymization transformation using a Bloom filter-based encoding technique. Finally, we perform an experimental evaluation of all these methods and present the results.
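A Bloom filter-based encoding of a string field can be sketched as follows. This is a simplified illustration, assuming padded character bigrams and SHA-256-derived hash positions; the names `bigrams` and `bloom_encode` and the parameters `size` and `num_hashes` are hypothetical, and a production encoding would typically use keyed hash functions shared only among the data custodians.

```python
import hashlib


def bigrams(value):
    """Set of character bigrams of a lowercased, padded string."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=64, num_hashes=4):
    """Hash every bigram of `value` into a bit list of length `size`."""
    bits = [0] * size
    for gram in bigrams(value):
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{gram}".encode()).digest()
            bits[int.from_bytes(digest[:8], "big") % size] = 1
    return bits


def hamming(a, b):
    """Number of positions in which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))
```

Strings that share most of their bigrams set mostly the same bits, so the Hamming distance between their encodings stays small. That distance-preserving behavior is what lets a Hamming-based blocking step group similar anonymized records.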

Data Mining and Knowledge Discovery | 2018

Fast schemes for online record linkage

Dimitrios Karapiperis; Aris Gkoulalas-Divanis; Vassilios S. Verykios

The process of integrating large volumes of data coming from disparate data sources, in order to detect records that refer to the same entities, has always been an important problem in both academia and industry. This problem becomes significantly more challenging when the integration involves a huge number of records and needs to be conducted in real time to address the requirements of critical applications. In this paper, we propose two novel schemes for online record linkage, which achieve very fast response times and high levels of recall and precision. Our proposed schemes embed the records into a Bloom filter space and employ the Hamming Locality-Sensitive Hashing technique for blocking. Each Bloom filter is hashed to a number of hash tables in order to amplify the probability of formulating similar Bloom filter pairs. The main theoretical premise behind our first scheme relies on the number of times a Bloom filter pair is formulated in the hash tables of the blocking mechanism. We prove that this number strongly depends on the distance of that Bloom filter pair. This correlation allows us to estimate the Hamming distances of Bloom filter pairs in real time without performing the comparisons. The second scheme is progressive: it achieves high recall early in the linkage process by continuously adjusting the sequence in which the hash tables are scanned, and it also guarantees, with high probability, the identification of each similar Bloom filter pair. Our experimental evaluation, using four real-world data sets, shows that the proposed schemes outperform four state-of-the-art methods by achieving higher recall and precision, while being very efficient.
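The link between collision counts and distance can be illustrated with the standard bit-sampling analysis. This is a sketch under assumptions and not the estimator proved in the paper: each of the `L` hash tables is assumed to sample `K` bit positions uniformly and independently from filters of length `m`, so a pair at Hamming distance `d` collides in one table with probability `(1 - d/m)**K`. The function names are hypothetical.

```python
def expected_collisions(distance, m, K, L):
    """Expected number of tables (out of L) in which a pair at `distance` collides."""
    return L * (1 - distance / m) ** K


def estimate_distance(collisions, m, K, L):
    """Invert the expectation: approximate Hamming distance from a collision count."""
    if not 0 <= collisions <= L:
        raise ValueError("collisions must lie between 0 and L")
    return m * (1 - (collisions / L) ** (1 / K))
```

A pair that collides in every table is estimated at distance 0, and fewer collisions map to larger distances, which is why counting collisions can stand in for an explicit comparison.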

Panhellenic Conference on Informatics | 2017

Using Wavelets for Matching Records Privately

T. C. Theodosiou; Dimitrios Karapiperis; Vassilios S. Verykios

This paper presents a wavelet-based methodology for performing privacy-preserving record linkage. The proposed methodology is introduced bottom-up, starting from simple text matching and extending to actual record linkage. The discrete wavelet transform, along with some privacy-preserving operations, is employed to cast text into a numerical sequence of fixed length. Database records are then treated as collections of such numerical sequences. Practical examples and implementation details are provided during all development phases. The method is applied to simulated data of bibliographic records, and the results demonstrate that its performance is comparable to that of other successful methodologies.
