Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Sumitra Mukherjee is active.

Publication


Featured research published by Sumitra Mukherjee.


IEEE Transactions on Knowledge and Data Engineering | 2005

Minimum spanning tree partitioning algorithm for microaggregation

Michael J Laszlo; Sumitra Mukherjee

This paper presents a clustering algorithm for partitioning a minimum spanning tree with a constraint on minimum group size. The problem is motivated by microaggregation, a disclosure limitation technique in which similar records are aggregated into groups containing a minimum of k records. Heuristic clustering methods are needed since the minimum information loss microaggregation problem is NP-hard. Our MST partitioning algorithm for microaggregation is sufficiently efficient to be practical for large data sets and yields results that are comparable to the best available heuristic methods for microaggregation. For data that contain pronounced clustering effects, our method results in significantly lower information loss. Our algorithm is general enough to accommodate different measures of information loss and can be used for other clustering applications that have a constraint on minimum group size.
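As a rough sketch of the flavor of MST-based microaggregation (not the authors' exact procedure), the Python snippet below cuts a precomputed minimum spanning tree into groups of at least k points by greedily removing heavy edges whose removal leaves both sides large enough. The function names and the toy tree are illustrative; a full implementation would also control how oversized components are further split and which information-loss measure is minimized.

```python
# Simplified sketch: cut a precomputed MST into groups of size >= k by
# repeatedly removing the heaviest edge whose removal leaves both sides
# with at least k vertices. Illustrative only, not the paper's algorithm.
from collections import defaultdict

def cut_mst(mst_edges, k):
    """mst_edges: list of (u, v, weight) tuples forming a spanning tree.
    Returns a list of vertex groups, each of size >= k where possible."""
    adj = defaultdict(list)
    for u, v, _ in mst_edges:
        adj[u].append(v)
        adj[v].append(u)
    removed = set()                      # edges cut so far, stored as frozensets

    def side(start, blocked):
        """Vertices reachable from start without crossing removed or blocked edges."""
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                edge = frozenset((x, y))
                if edge in removed or edge == blocked or y in seen:
                    continue
                seen.add(y)
                stack.append(y)
        return seen

    # Try to cut the heaviest edges first, keeping both sides at size >= k.
    for u, v, w in sorted(mst_edges, key=lambda e: -e[2]):
        edge = frozenset((u, v))
        if len(side(u, edge)) >= k and len(side(v, edge)) >= k:
            removed.add(edge)

    # Collect the resulting components as groups.
    groups, seen = [], set()
    for u in adj:
        if u not in seen:
            comp = side(u, frozenset())
            seen |= comp
            groups.append(sorted(comp))
    return groups

# Toy example: a path 0-1-2-3-4-5 with one long edge in the middle.
tree = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 5.0), (3, 4, 1.0), (4, 5, 1.0)]
print(cut_mst(tree, k=3))   # expected: [[0, 1, 2], [3, 4, 5]]
```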


IEEE Transactions on Knowledge and Data Engineering | 2003

A polynomial algorithm for optimal univariate microaggregation

Stephen Lee Hansen; Sumitra Mukherjee

Microaggregation is a technique used by statistical agencies to limit disclosure of sensitive microdata. Noting that no polynomial algorithms are known to microaggregate optimally, Domingo-Ferrer and Mateo-Sanz have presented heuristic microaggregation methods. This paper is the first to present an efficient polynomial algorithm for optimal univariate microaggregation. Optimal partitions are shown to correspond to shortest paths in a network.
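The shortest-path formulation mentioned in the abstract can be sketched compactly: after sorting, node i stands for "the first i values are already grouped", arcs correspond to candidate groups of size k to 2k-1, and a shortest path from node 0 to node n yields a minimum-information-loss partition. The dynamic program below is an illustrative rendering of that idea using within-group sum of squared errors as the loss; variable names are mine, not the paper's.

```python
# Sketch of the shortest-path view of optimal univariate microaggregation.
# Groups of size k .. 2k-1 suffice because any larger group can be split
# without increasing the within-group sum of squared errors.

def group_sse(vals):
    m = sum(vals) / len(vals)
    return sum((x - m) ** 2 for x in vals)

def optimal_univariate_microaggregation(values, k):
    vals = sorted(values)
    n = len(vals)
    INF = float("inf")
    best = [INF] * (n + 1)      # best[j] = min cost to partition the first j values
    back = [None] * (n + 1)     # back[j] = start index of the last group
    best[0] = 0.0
    for j in range(k, n + 1):               # shortest path on the DAG via DP
        for g in range(k, 2 * k):           # candidate group sizes k .. 2k-1
            i = j - g
            if i < 0 or best[i] == INF:
                continue
            cost = best[i] + group_sse(vals[i:j])
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Recover the groups by walking the back-pointers from n to 0.
    groups, j = [], n
    while j > 0:
        i = back[j]
        groups.append(vals[i:j])
        j = i
    return list(reversed(groups)), best[n]

groups, loss = optimal_univariate_microaggregation([1, 2, 2, 9, 10, 11, 30], k=3)
print(groups, loss)
```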


Pattern Recognition Letters | 2007

A genetic algorithm that exchanges neighboring centers for fuzzy c-means clustering

Sumitra Mukherjee; Firas Safwan Chahine

Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited than hierarchical clustering algorithms to applications with large datasets. K-means is among the most popular partitional clustering algorithms, but it has a major shortcoming: it is extremely sensitive to the choice of initial centers used to seed the algorithm. Unless k-means is carefully initialized, it converges to an inferior local optimum and produces poor-quality partitions. Developing improved methods for selecting initial centers for k-means is an active area of research. Genetic algorithms (GAs) have been used successfully to evolve a good set of initial centers; among the most promising GA-based methods are those whose crossover operations exchange neighboring centers between candidate partitions.

K-means works best when datasets have well-separated, non-overlapping clusters. Fuzzy c-means (FCM) is a popular variant of k-means designed for applications in which clusters are less well defined: rather than assigning each point to a unique cluster, FCM determines the degree to which each point belongs to each cluster. Like k-means, FCM is extremely sensitive to the choice of initial centers. Building on GA-based methods for initial center selection in k-means, this work developed an evolutionary program for center selection in FCM, called FCMGA. The proposed algorithm uses region-based crossover and other mechanisms to improve the GA. To evaluate the effectiveness of FCMGA, three independent experiments were conducted on real and simulated datasets. The results demonstrate that the proposed algorithm consistently identifies better-quality solutions than extant methods, and confirm that region-based crossover enhances both the GA's search process and the convergence speed of FCM. Taken together, these findings show that FCMGA successfully addresses the problem of initial center selection in partitional clustering algorithms.
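For context, a minimal standard fuzzy c-means iteration is sketched below (this is textbook FCM, not the FCMGA procedure itself); it shows the role of the initial centers that a GA such as FCMGA would evolve. The dataset, fuzzifier m = 2, and seeding are illustrative.

```python
# Minimal fuzzy c-means sketch (standard FCM updates), showing how the result
# depends on the initial centers that a GA could supply.
import numpy as np

def fcm(X, centers, m=2.0, iters=100, eps=1e-6):
    """X: (n, d) data; centers: (c, d) initial centers; m: fuzzifier."""
    for _ in range(iters):
        # Membership update: u[i, j] = degree to which point i belongs to cluster j.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        # Center update: fuzzy-weighted means of the data.
        um = u ** m
        new_centers = (um.T @ X) / um.sum(axis=0)[:, None]
        if np.linalg.norm(new_centers - centers) < eps:
            centers = new_centers
            break
        centers = new_centers
    return centers, u

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
init = X[rng.choice(len(X), 2, replace=False)]   # a GA would evolve this seed set
centers, u = fcm(X, init)
print(centers)
```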


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2006

A genetic algorithm using hyper-quadtrees for low-dimensional k-means clustering

Michael J Laszlo; Sumitra Mukherjee

The k-means algorithm is widely used for clustering because of its computational efficiency. Given n points in d-dimensional space and the number of desired clusters k, k-means seeks a set of k-cluster centers so as to minimize the sum of the squared Euclidean distance between each point and its nearest cluster center. However, the algorithm is very sensitive to the initial selection of centers and is likely to converge to partitions that are significantly inferior to the global optimum. We present a genetic algorithm (GA) for evolving centers in the k-means algorithm that simultaneously identifies good partitions for a range of values around a specified k. The set of centers is represented using a hyper-quadtree constructed on the data. This representation is exploited in our GA to generate an initial population of good centers and to support a novel crossover operation that selectively passes good subsets of neighboring centers from parents to offspring by swapping subtrees. Experimental results indicate that our GA finds the global optimum for data sets with known optima and finds good solutions for large simulated data sets.
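A much-simplified illustration of region-based crossover follows: instead of a true hyper-quadtree subtree swap, it picks one quadrant of the bounding box and exchanges the centers that fall inside it between two parents. Because the swapped regions may contain different numbers of centers, the children can have slightly different k, in the spirit of searching a range of values of k. The names and the quadrant choice are illustrative assumptions, not the paper's operator.

```python
# Toy region-based crossover: swap the cluster centers lying inside a randomly
# chosen quadrant of the bounding box between two parent center sets.
import numpy as np

def region_crossover(parent_a, parent_b, rng):
    """parent_a, parent_b: (k, 2) arrays of cluster centers in the plane."""
    lo = np.minimum(parent_a.min(axis=0), parent_b.min(axis=0))
    hi = np.maximum(parent_a.max(axis=0), parent_b.max(axis=0))
    mid = (lo + hi) / 2.0
    quadrant = rng.integers(0, 2, size=2)        # pick one of the four quadrants

    def in_region(centers):
        # A center lies in the chosen quadrant if, in every dimension, it is on
        # the low side of mid exactly when the quadrant asks for the low side.
        low_side = centers <= mid
        return np.all(low_side == (quadrant == 0), axis=1)

    mask_a, mask_b = in_region(parent_a), in_region(parent_b)
    # Each child keeps its own centers outside the region and inherits the other
    # parent's centers inside it; group sizes may differ slightly as a result.
    child_a = np.vstack([parent_a[~mask_a], parent_b[mask_b]])
    child_b = np.vstack([parent_b[~mask_b], parent_a[mask_a]])
    return child_a, child_b

rng = np.random.default_rng(1)
pa = rng.uniform(0, 10, size=(4, 2))
pb = rng.uniform(0, 10, size=(4, 2))
ca, cb = region_crossover(pa, pb, rng)
print(ca.shape, cb.shape)
```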


Journal of the American Statistical Association | 2000

Optimal Disclosure Limitation Strategy in Statistical Databases: Deterring Tracker Attacks through Additive Noise

George T. Duncan; Sumitra Mukherjee

Disclosure limitation methods transform statistical databases to protect confidentiality, a practical concern of statistical agencies. A statistical database responds to queries with aggregate statistics. The database administrator should maximize legitimate data access while keeping the risk of disclosure below an acceptable level. Legitimate users seek statistical information, generally in aggregate form; malicious users, the data snoopers, attempt to infer confidential information about an individual data subject. Tracker attacks are of special concern for databases accessed online. This article derives optimal disclosure limitation strategies under tracker attacks for the important case of data masking through additive noise. Operational measures of the utility of data access and of disclosure risk are developed. The utility of data access is expressed so that trade-offs can be made between the quantity and the quality of data to be released. Application is made to Ohio data from the 1990 census. The article derives conditions under which an attack by a data snooper is better thwarted by a combination of query restriction and data masking than by either disclosure limitation method separately. Data masking by independent noise addition and by data perturbation are considered as extreme cases in the continuum of data masking using positively correlated additive noise. Optimal strategies are established for the data snooper. Circumstances are determined under which adding autocorrelated noise is preferable to the existing methods of either independent noise addition or data perturbation. Both moving average and autoregressive noise addition are considered.
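To make the noise-masking setting concrete, the sketch below adds either independent or AR(1) (positively autocorrelated) noise to a confidential numeric series. The parameters phi and sigma are illustrative placeholders, not the optimal values derived in the article.

```python
# Masking a confidential series with additive noise: independent noise versus
# an AR(1) process standing in for "positively correlated additive noise".
import numpy as np

def ar1_noise(n, phi, sigma, rng):
    """Autoregressive noise: e[t] = phi * e[t-1] + w[t], with w[t] ~ N(0, sigma^2)."""
    w = rng.normal(0.0, sigma, n)
    e = np.empty(n)
    e[0] = w[0]
    for t in range(1, n):
        e[t] = phi * e[t - 1] + w[t]
    return e

rng = np.random.default_rng(0)
true_values = rng.normal(50_000, 10_000, 200)      # e.g. a confidential income column

independent = true_values + rng.normal(0, 2_000, 200)                     # uncorrelated noise
correlated = true_values + ar1_noise(200, phi=0.8, sigma=2_000, rng=rng)  # AR(1) noise

# Aggregate answers stay close to the truth while individual records are masked.
print(round(true_values.mean()), round(independent.mean()), round(correlated.mean()))
```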


IEEE Symposium on Security and Privacy | 1991

Microdata disclosure limitation in statistical databases: query size and random sample query control

George T. Duncan; Sumitra Mukherjee

A probabilistic framework can be used to assess the risk of disclosure of confidential information in statistical databases that use disclosure control mechanisms. The authors show how the method may be used to assess the strengths and weaknesses of two existing disclosure control mechanisms: the query set size restriction control and random sample query control mechanisms. Results indicate that neither scheme provides adequate security. The framework is then further exploited to analyze an alternative scheme combining query set size restriction and random sample query control. It is shown that this combination results in a significant decrease in the risk of disclosure.
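A toy version of the combined control analyzed here might look like the following: refuse a query whose answer set is smaller than a threshold (query set size restriction) and evaluate the statistic on a random subsample of the matching records (random sample query control). The thresholds, sampling rate, and field names are illustrative.

```python
# Combined query controls: size restriction plus random sampling of the query set.
import random

def answer_query(records, predicate, field, min_size=5, sample_rate=0.8, rng=None):
    """Answer a mean query subject to both controls, or return None (refuse)."""
    rng = rng or random.Random(0)
    matched = [r for r in records if predicate(r)]
    if len(matched) < min_size:                      # query set size restriction
        return None
    sample = [r for r in matched if rng.random() < sample_rate]   # random sample query control
    if not sample:
        return None
    return sum(r[field] for r in sample) / len(sample)

records = [{"age": 20 + i, "salary": 40_000 + 500 * i} for i in range(30)]
print(answer_query(records, lambda r: r["age"] > 25, "salary"))   # answered on a subsample
print(answer_query(records, lambda r: r["age"] > 48, "salary"))   # refused: answer set too small
```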


Operations Research Letters | 2005

Another greedy heuristic for the constrained forest problem

Michael J Laszlo; Sumitra Mukherjee

The constrained forest problem seeks a minimum-weight spanning forest in an undirected edge-weighted graph such that each tree spans at least a specified number of vertices. We present a greedy heuristic for this NP-hard problem, whose solutions are at least as good as, and often better than, those produced by the best-known 2-approximate heuristic.
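One simple greedy heuristic in this family, sketched below for illustration (it follows the classic Kruskal-style rule and is not necessarily the heuristic proposed in the paper), scans edges by increasing weight and accepts an edge only when it joins two distinct components, at least one of which still has fewer than the required number of vertices.

```python
# Kruskal-style greedy sketch for the constrained forest problem.

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def constrained_forest(n, edges, k):
    """edges: list of (weight, u, v). Returns the accepted forest edges."""
    ds = DisjointSet(n)
    forest = []
    for w, u, v in sorted(edges):
        ru, rv = ds.find(u), ds.find(v)
        if ru == rv:
            continue                              # would create a cycle
        if ds.size[ru] < k or ds.size[rv] < k:    # only grow undersized trees
            ds.union(u, v)
            forest.append((u, v, w))
    return forest

edges = [(1, 0, 1), (1, 1, 2), (9, 2, 3), (1, 3, 4), (1, 4, 5), (2, 0, 2)]
print(constrained_forest(6, edges, k=3))   # two trees of 3 vertices, heavy edge avoided
```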


Annals of Operations Research | 1992

On the integration of data and mathematical modeling languages

Hemant K. Bhargava; Ramayya Krishnan; Sumitra Mukherjee

This paper examines ways in which the addition of data modeling features can enhance the capabilities of mathematical modeling languages. It demonstrates how such integration is achieved as an application of the embedded languages technique proposed by Bhargava and Kimbrough [4]. Decision-making, and decision support systems, require the representation and manipulation of both data and mathematical models. Several data modeling languages as well as several mathematical modeling languages exist, but they have different sets of these capabilities. We motivate with a detailed example the need for the integration of these capabilities. We describe the benefits that might result, and claim that this could lead to a significant improvement in the functionality of model management systems. Then we present our approach for the integration of these languages, and specify how the claimed benefits can be realized.


IEEE Transactions on Knowledge and Data Engineering | 2009

Approximation Bounds for Minimum Information Loss Microaggregation

Michael J Laszlo; Sumitra Mukherjee

The NP-hard microaggregation problem seeks a partition of data points into groups of minimum specified size k, so as to minimize the sum of the squared Euclidean distances of every point to its group's centroid. One recent heuristic provides an O(k³) guarantee for this objective function and an O(k²) guarantee for a version of the problem that seeks to minimize the sum of the distances of the points to their group's centroid. This paper establishes approximation bounds for another microaggregation heuristic, providing better approximation guarantees of O(k²) for the squared distance measure and O(k) for the distance measure.
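For reference, the two information-loss measures these bounds compare can be computed directly for any partition: the sum of squared Euclidean distances of points to their group centroid, and the analogous sum of unsquared distances. The helper below is an illustrative computation, not part of the paper.

```python
# Compute both microaggregation objectives for a given partition into groups.
import numpy as np

def information_loss(groups):
    """groups: list of (m_i, d) arrays, one per microaggregation group."""
    squared, unsquared = 0.0, 0.0
    for g in groups:
        centroid = g.mean(axis=0)
        dists = np.linalg.norm(g - centroid, axis=1)
        squared += float(np.sum(dists ** 2))    # squared distance measure
        unsquared += float(np.sum(dists))       # distance measure
    return squared, unsquared

groups = [np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
          np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])]
print(information_loss(groups))
```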


Discrete Applied Mathematics | 2006

A class of heuristics for the constrained forest problem

Michael J Laszlo; Sumitra Mukherjee

The constrained forest problem seeks a minimum-weight spanning forest in an undirected edge-weighted graph such that each tree spans at least a specified number of vertices. We present a structured class of greedy heuristics for this NP-hard problem, and identify the best heuristic.

Collaboration


Sumitra Mukherjee's collaborations and top co-authors.

Top Co-Authors

Michael J Laszlo | Nova Southeastern University
George T. Duncan | Carnegie Mellon University
Svitlana Galeshchuk | Ternopil National Economic University
Ramayya Krishnan | Carnegie Mellon University
Stephen Lee Hansen | Association for Computing Machinery
William Hafner | Nova Southeastern University