Ganesh Ramesh | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ganesh Ramesh is active.

Explore More

Publication

Featured researches published by Ganesh Ramesh.

very large data bases | 2004

On testing satisfiability of tree pattern queries

Laks V. S. Lakshmanan; Ganesh Ramesh; Hui Wang; Zheng Zhao

XPath and XQuery (which includes XPath as a sublanguage) are the major query languages for XML. An important issue arising in efficient evaluation of queries expressed in these languages is satisfiability, i.e., whether there exists a database, consistent with the schema if one is available, on which the query has a non-empty answer. Our experience shows satisfiability check can effect substantial savings in query evaluation. We systematically study satisfiability of tree pattern queries (which capture a useful fragment of XPath) together with additional constraints, with or without a schema. We identify cases in which this problem can be solved in polynomial time and develop novel efficient algorithms for this purpose. We also show that in several cases, the problem is NP-complete. We ran a comprehensive set of experiments to verify the utility of satisfiability check as a preprocessing step in query processing. Our results show that this check takes a negligible fraction of the time needed for processing the query while often yielding substantial savings.

symposium on principles of database systems | 2003

Feasible itemset distributions in data mining: theory and application

Ganesh Ramesh; William Maniatty; Mohammed Javeed Zaki

Computing frequent itemsets and maximally frequent item-sets in a database are classic problems in data mining. The resource requirements of all extant algorithms for both problems depend on the distribution of frequent patterns, a topic that has not been formally investigated. In this paper, we study properties of length distributions of frequent and maximal frequent itemset collections and provide novel solutions for computing tight lower bounds for feasible distributions. We show how these bounding distributions can help in generating realistic synthetic datasets, which can be used for algorithm benchmarking.

international conference on data engineering | 2007

Preservation Of Patterns and Input-Output Privacy

Shaofeng Bu; Laks V. S. Lakshmanan; Raymond T. Ng; Ganesh Ramesh

Privacy preserving data mining so far has mainly focused on the data collector scenario where individuals supply their personal data to an untrusted collector in exchange for value. In this scenario, random perturbation has proved to be very successful. An equally compelling, but overlooked scenario, is that of a data custodian, which either owns the data or is explicitly entrusted with ensuring privacy of individual data. In this scenario, we show that it is possible to minimize disclosure while guaranteeing no outcome change. We conduct our investigation in the context of building a decision tree and propose transformations that preserve the exact decision tree. We show with a detailed set of experiments that they provide substantial protection to both input data privacy and mining output privacy.

ACM Transactions on Knowledge Discovery From Data | 2008

On disclosure risk analysis of anonymized itemsets in the presence of prior knowledge

Laks V. S. Lakshmanan; Raymond T. Ng; Ganesh Ramesh

Decision makers of companies often face the dilemma of whether to release data for knowledge discovery, vis-a-vis the risk of disclosing proprietary or sensitive information. Among the various methods employed for “sanitizing” the data prior to disclosure, we focus in this article on anonymization, given its widespread use in practice. We do due diligence to the question “just how safe is the anonymized data?” We consider both those scenarios when the hacker has no information and, more realistically, when the hacker may have partial information about items in the domain. We conduct our analyses in the context of frequent set mining and address the safety question at two different levels: (i) how likely of being cracked (i.e., re-identified by a hacker), are the identities of individual items and (ii) how likely are sets of items cracked? For capturing the prior knowledge of the hacker, we propose a belief function, which amounts to an educated guess of the frequency of each item. For various classes of belief functions which correspond to different degrees of prior knowledge, we derive formulas for computing the expected number of cracks of single items and for itemsets, the probability of cracking the itemsets. While obtaining, exact values for more general situations is computationally hard, we propose a series of heuristics called the O-estimates. They are easy to compute and are shown fairly accurate, justified by empirical results on real benchmark datasets. Based on the O-estimates, we propose a recipe for the decision makers to resolve their dilemma. Our recipe operates at two different levels, depending on whether the data owner wants to reason in terms of single items or sets of items (or both). Finally, we present techniques for ascertaining a hackers knowledge of correlation in terms of co-occurrence of items likely. This information regarding the hackers knowledge can be incorporated into our framework of disclosure risk analysis and we present experimental results demonstrating how this knowledge affects the heuristic estimates we have developed.

international database engineering and applications symposium | 2005

Distribution-based synthetic database generation techniques for itemset mining

Ganesh Ramesh; Mohammed Javeed Zaki; William Maniatty

The resource requirements of frequent pattern mining algorithms depend mainly on the length distribution of the mined patterns in the database. Synthetic databases, which are used to benchmark performance of algorithms, tend to have distributions far different from those observed in real datasets. In this paper we focus on the problem of synthetic database generation and propose algorithms to effectively embed within the database, any given set of maximal pattern collections, and make the following contributions: 1. A database generation technique is presented which takes k maximal itemset collections as input, and constructs a database which produces these maximal collections as output, when mined at k levels of support. To analyze the efficiency of the procedure, upper bounds are provided on the number of transactions output in the generated database; 2. A compression method is used and extended to reduce the size of the output database. An optimization to the generation procedure is provided which could potentially reduce the number of transactions generated; 3. Preliminary experimental results are presented to demonstrate the feasibility of using the generation technique.

international conference on pattern recognition | 2002

Multi-source combined-media video tracking for summarization

Amit Bagga; Jianying Hu; Jialin Zhong; Ganesh Ramesh

Video summarization is receiving increasing attention due to the large amount of video content made available on the Internet. We present an idea to track video from multiple sources for video summarization. An algorithm that takes advantage of both video and closed caption text information for video scene clustering is described. Experimental results are given followed by discussion on future directions.

very large data bases | 2005

Can attackers learn from samples

Ganesh Ramesh

Sampling is often used to achieve disclosure limitation for categorical and microarray datasets. The motivation is that while the public gets a snapshot of what is in the data, the entire data is not revealed and hence complete disclosure is prevented. However, the presence of prior knowledge is often overlooked in risk assessment. A sample plays an important role in risk analysis and can be used by a malicious user to construct prior knowledge of the domain. In this paper, we focus on formalizing the various kinds of prior knowledge an attacker can develop using samples and make the following contributions. We abstract various types of prior knowledge and define measures of quality which enables us to quantify how good the prior knowledge is with respect to the true knowledge given by the database. We propose a lightweight general purpose sampling framework with which a data owner can assess the impact of various sampling methods on the quality of prior knowledge. Finally, through a systematic set of experiments using real benchmark datasets, we study the effect of various sampling parameters on the quality of prior knowledge that is obtained from these samples. Such an analysis can help the data owner in making informed decisions about releasing samples to achieve disclosure limitation.

international conference on management of data | 2005