Souptik Datta

University of Maryland, Baltimore County

Publication

Featured research published by Souptik Datta.

IEEE Internet Computing | 2006

Distributed Data Mining in Peer-to-Peer Networks

Souptik Datta; Kanishka Bhaduri; Chris Giannella; Ran Wolff; Hillol Kargupta

Peer-to-peer (P2P) networks are gaining popularity in many applications such as file sharing, e-commerce, and social networking, many of which deal with rich, distributed data sources that can benefit from data mining. P2P networks are, in fact, well-suited to distributed data mining (DDM), which deals with the problem of data analysis in environments with distributed data, computing nodes, and users. This article offers an overview of DDM applications and algorithms for P2P environments, focusing particularly on local algorithms that perform data analysis by using computing primitives with limited communication overhead. The authors describe both exact and approximate local P2P data mining algorithms that work in a decentralized and communication-efficient manner.

Knowledge and Information Systems | 2005

Random-data perturbation techniques and privacy-preserving data mining

Hillol Kargupta; Souptik Datta; Qi Wang; Krishnamoorthy Sivakumar

Privacy is becoming an increasingly important issue in many data-mining applications. This has triggered the development of many privacy-preserving data-mining techniques. A large fraction of them use randomized data-distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values often using additive noise. This paper questions the utility of the random-value distortion technique in privacy preservation. The paper first notes that random matrices have predictable structures in the spectral domain and then it develops a random matrix-based spectral-filtering technique to retrieve original data from the dataset distorted by adding random values. The proposed method works by comparing the spectrum generated from the observed data with that of random matrices. This paper presents the theoretical foundation and extensive experimental results to demonstrate that, in many cases, random-data distortion preserves very little data privacy. The analytical framework presented in this paper also points out several possible avenues for the development of new privacy-preserving data-mining techniques. Examples include algorithms that explicitly guard against privacy breaches through linear transformations, exploiting multiplicative and colored noise for preserving privacy in data mining applications.
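The spectral-filtering idea in this abstract can be illustrated with a minimal sketch. It assumes i.i.d. additive noise of known variance and uses the Marchenko-Pastur upper edge as the largest eigenvalue that pure noise is expected to produce; the function name `spectral_filter`, the synthetic data, and all parameters are hypothetical, and the paper's actual procedure may differ in its details.

```python
import numpy as np

def spectral_filter(perturbed, noise_var):
    """Estimate original records from data perturbed by i.i.d. additive noise.

    Assumption: the noise variance is known to the attacker.
    """
    m, n = perturbed.shape
    mean = perturbed.mean(axis=0)
    centered = perturbed - mean
    cov = centered.T @ centered / m
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Largest covariance eigenvalue expected from pure noise
    # (upper edge of the Marchenko-Pastur distribution).
    noise_edge = noise_var * (1.0 + np.sqrt(n / m)) ** 2
    # Keep only the eigenvectors whose eigenvalues stand out above the noise.
    signal_vecs = eigvecs[:, eigvals > noise_edge]
    estimate = centered @ signal_vecs @ signal_vecs.T + mean
    return estimate, signal_vecs.shape[1]

# Synthetic example: 20 correlated attributes driven by 2 latent factors.
rng = np.random.default_rng(0)
m, n = 1000, 20
original = (rng.normal(size=(m, 2)) * 5.0) @ rng.normal(size=(2, n))
perturbed = original + rng.normal(scale=1.0, size=(m, n))

estimate, kept = spectral_filter(perturbed, noise_var=1.0)
err_noisy = np.linalg.norm(perturbed - original)
err_filtered = np.linalg.norm(estimate - original)
print(kept, err_noisy, err_filtered)
```

When the original attributes are strongly correlated, the filtered estimate lies much closer to the original data than the perturbed release does, which is the sense in which additive noise alone can preserve little privacy.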

Information Sciences | 2006

Clustering distributed data streams in peer-to-peer environments

Sanghamitra Bandyopadhyay; Chris Giannella; Ujjwal Maulik; Hillol Kargupta; Kun Liu; Souptik Datta

This paper describes a technique for clustering homogeneously distributed data in a peer-to-peer environment like sensor networks. The proposed technique is based on the principles of the K-Means algorithm. It works in a localized asynchronous manner by communicating with the neighboring nodes. The paper offers extensive theoretical analysis of the algorithm that bounds the error in the distributed clustering process compared to the centralized approach that requires downloading all the observed data to a single site. Experimental results show that, in contrast to the case when all the data is transmitted to a central location for application of the conventional clustering algorithm, the communication cost (an important consideration in sensor networks which are typically equipped with limited battery power) of the proposed approach is significantly smaller. At the same time, the accuracy of the obtained centroids is high and the number of samples which are incorrectly labeled is also small.
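A rough sketch of the neighbor-based idea follows. It is a simplified, synchronous illustration (the paper describes an asynchronous technique), and the function `local_kmeans_round`, the ring topology, the shared initial centroids, and the synthetic data are all assumptions made for the example rather than the authors' algorithm.

```python
import numpy as np

def local_kmeans_round(peer_data, peer_centroids, neighbors):
    """One round: each peer assigns its own points to its current centroids,
    then updates the centroids from its own and its neighbors' cluster sums."""
    k, dim = peer_centroids[0].shape
    sums, counts = [], []
    for data, cents in zip(peer_data, peer_centroids):
        dists = ((data[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dists, axis=1)
        s = np.zeros((k, dim))
        c = np.zeros(k)
        np.add.at(s, labels, data)   # per-cluster coordinate sums
        np.add.at(c, labels, 1.0)    # per-cluster point counts
        sums.append(s)
        counts.append(c)
    updated_all = []
    for i, cents in enumerate(peer_centroids):
        group = [i] + list(neighbors[i])
        s = sum(sums[j] for j in group)
        c = sum(counts[j] for j in group)
        updated = cents.copy()
        nonempty = c > 0             # leave empty clusters where they are
        updated[nonempty] = s[nonempty] / c[nonempty, None]
        updated_all.append(updated)
    return updated_all

# Six peers on a ring, each holding points from the same two clusters.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
peer_data = [
    np.vstack([rng.normal(c, 1.0, size=(25, 2)) for c in centers])
    for _ in range(6)
]
neighbors = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
peer_centroids = [np.array([[1.0, 1.0], [9.0, 9.0]]) for _ in range(6)]

for _ in range(5):
    peer_centroids = local_kmeans_round(peer_data, peer_centroids, neighbors)
print(peer_centroids[0])
```

Each peer exchanges only k centroid sums and counts with its neighbors per round, not raw data, which is where the communication saving over centralization comes from.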

IEEE Transactions on Knowledge and Data Engineering | 2009

Approximate Distributed K-Means Clustering over a Peer-to-Peer Network

Souptik Datta; Chris Giannella; Hillol Kargupta

Data-intensive peer-to-peer (P2P) networks are finding an increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments, since they typically require centralizing the distributed data, which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network and can produce clusterings by "local" synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both algorithms perform well compared to their centralized counterparts at a modest communication cost.

International Conference on Distributed Computing Systems | 2007

Uniform Data Sampling from a Peer-to-Peer Network

Souptik Datta; Hillol Kargupta

A uniform random sample is often useful in analyzing data. Taking a uniform sample is usually not a problem if all the data resides in one location. However, if the data is distributed in a peer-to-peer (P2P) network with different amounts of data at different peers, collecting a uniform sample of data becomes a challenging task. Random sampling can be performed using a random walk, but due to varying degrees of connectivity and the different sizes of data owned by each peer, a plain random walk gives a biased sample. In this paper, we propose a random walk-based sampling algorithm that can be used to sample data tuples uniformly from a large, unstructured P2P network. We model the random walk as a Markov chain and derive conditions to bound the length of the random walk necessary to achieve uniformity. A formal communication analysis shows logarithmic communication cost to discover a uniform data sample.
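One standard way to remove the bias described here is a Metropolis-Hastings correction of the walk, sketched below. This is a swapped-in textbook technique, not necessarily the construction used in the paper. The sketch assumes an undirected, connected overlay in which every peer holds at least one tuple and knows its neighbors' degrees and data sizes; the names `accept_prob`, `transition_matrix`, `sample_tuple`, and the parameter `walk_len` are hypothetical.

```python
import random

def accept_prob(i, j, adj, sizes):
    """Metropolis-Hastings acceptance for a move from peer i to neighbor j,
    targeting a stationary distribution proportional to local data size."""
    return min(1.0, (sizes[j] * len(adj[i])) / (sizes[i] * len(adj[j])))

def transition_matrix(adj, sizes):
    """Markov chain of the corrected walk (for analysis, not needed at runtime)."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in adj[i]:
            P[i][j] = accept_prob(i, j, adj, sizes) / len(adj[i])
        P[i][i] = 1.0 - sum(P[i])  # rejected proposals stay at peer i
    return P

def sample_tuple(adj, sizes, start, walk_len, rng):
    """Walk walk_len steps, then pick a local tuple index uniformly."""
    peer = start
    for _ in range(walk_len):
        candidate = rng.choice(adj[peer])
        if rng.random() < accept_prob(peer, candidate, adj, sizes):
            peer = candidate
    return peer, rng.randrange(sizes[peer])

# Small overlay: peer -> neighbors, and the number of tuples per peer.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
sizes = [10, 200, 5, 50]

P = transition_matrix(adj, sizes)
total = sum(sizes)
pi = [s / total for s in sizes]  # target: peer chosen in proportion to its data
peer, index = sample_tuple(adj, sizes, start=0, walk_len=50, rng=random.Random(0))
print(peer, index)
```

Because the walk's stationary distribution selects a peer in proportion to its data size and the final pick within the peer is uniform, every tuple in the network is equally likely once the walk has mixed. How long the walk must be is exactly what the paper's Markov chain analysis bounds; the value 50 above is arbitrary.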

IEEE International Conference on Fuzzy Systems | 2003

Homeland security and privacy sensitive data mining from multi-party distributed resources

Hillol Kargupta; Kun Liu; Souptik Datta; Jessica Ryan; Krishnamoorthy Sivakumar

Defending the safety of an open society from terrorism or other similar threats requires intelligent but careful ways to monitor different types of activities and transactions in the electronic media. Data mining techniques are playing an increasingly important role in sifting through large amounts of data in search of useful patterns that might help us in securing our safety. Although the objective of this class of data mining applications is very well justified, they also open up the possibility of personal information being misused by malicious people with access to the sensitive data. This brings up the following question: Can we design data mining techniques that are sensitive to privacy? Several researchers are currently working on a class of data mining algorithms that work without directly accessing the sensitive data in their original form. This paper considers the problem of mining distributed data in a privacy-sensitive manner. It first points out the problems of some of the existing privacy-sensitive data mining techniques that make use of additive random noise to hide sensitive information. Next, it briefly reviews some new approaches that make use of random projection matrices for computing statistical aggregates from sensitive data.
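As a hedged illustration of how a random projection can support a statistical aggregate, the sketch below estimates an inner product from projected vectors only. The abstract does not specify the aggregates or the protocol, so the Gaussian projection matrix `R`, the dimensions, and the helper `estimate_inner_product` are assumptions chosen for the example.

```python
import numpy as np

def estimate_inner_product(proj_x, proj_y, k):
    """Estimate x . y from k-dimensional projections R @ x and R @ y,
    where R has i.i.d. standard normal entries (so E[R.T @ R] = k * I)."""
    return float(proj_x @ proj_y) / k

rng = np.random.default_rng(2)
m, k = 2000, 400                   # original and projected dimensions (k < m)

x = rng.normal(size=m)             # sensitive vector held by one party
y = x + 0.1 * rng.normal(size=m)   # correlated vector held by another party
R = rng.normal(size=(k, m))        # projection matrix shared by both parties

exact = float(x @ y)
approx = estimate_inner_product(R @ x, R @ y, k)
rel_error = abs(approx - exact) / abs(exact)
print(exact, approx, rel_error)
```

Only the projected vectors are exchanged, and because k is smaller than m the system R @ x = p is underdetermined, so x cannot be uniquely recovered from its projection alone. The estimate's accuracy improves as k grows, which is the trade-off between utility and how much is revealed.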

International Conference on Data Mining | 2003