Vikram Pudi
International Institute of Information Technology, Hyderabad
Publications
Featured research published by Vikram Pudi.
IEEE International Conference on Fuzzy Systems | 2009
Ashish Mangalampalli; Vikram Pudi
Fuzzy association rules use fuzzy logic to convert numerical attributes into fuzzy attributes, like “Income = High”, thus maintaining the integrity of information conveyed by such numerical attributes. On the other hand, crisp association rules use sharp partitioning to transform numerical attributes into binary ones, like “Income = [100K and above]”, and can potentially introduce loss of information due to these sharp ranges. Fuzzy Apriori and its variations are the only popular fuzzy association rule mining (ARM) algorithms available today. Like the crisp version of Apriori, fuzzy Apriori is a slow and inefficient algorithm for very large datasets (on the order of millions of transactions). Hence, we propose a new fuzzy ARM algorithm designed for fast, efficient performance on very large datasets. Compared to fuzzy Apriori, our algorithm is 8-19 times faster on the very large standard real-life dataset we used for testing, under various mining workloads, both typical and extreme. A novel combination of features, namely two-phased multiple-partition tidlist-style processing, byte-vector representation of tidlists, and fast compression of tidlists, contributes substantially to this performance. In addition, unlike most two-phased ARM algorithms, the second phase differs completely from the first in its method of processing (individual itemset processing as opposed to simultaneous itemset processing at each k-level), and is also many times faster. Our algorithm also includes an effective preprocessing technique for converting a crisp dataset into a fuzzy one.
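The two core notions behind any fuzzy ARM algorithm can be sketched briefly: a membership function maps a numeric value to fuzzy items like “Income = High”, and the fuzzy support of an itemset aggregates the minimum membership across its items. This is a minimal illustration with assumed partition bounds and item names, not the paper's algorithm.

```python
# Illustrative sketch (assumed names and bounds) of fuzzy items and
# min-t-norm fuzzy support, as used in fuzzy association rule mining.

def triangular(x, a, b, c):
    """Triangular membership function rising on [a, b], falling on [b, c]."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_support(records, items):
    """Fuzzy support of an itemset: average over records of the minimum
    membership degree among the itemset's fuzzy items."""
    return sum(min(rec[i] for i in items) for rec in records) / len(records)

# Each fuzzy record maps fuzzy items such as "Income=High" to a degree.
incomes = [30, 80, 120, 150]
records = [{"Income=Low": triangular(x, 0, 30, 90),
            "Income=High": triangular(x, 60, 120, 180)} for x in incomes]
support = fuzzy_support(records, ["Income=High"])
```

Unlike a sharp range such as [100K and above], the income 80 here contributes a partial degree (1/3) rather than being cut off entirely at the partition boundary.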
International Conference on Knowledge-Based and Intelligent Information and Engineering Systems | 2010
G. V. R. Kiran; Ravi Shankar; Vikram Pudi
High dimensionality is a major challenge in document clustering. Some recent algorithms address this problem by using frequent itemsets for clustering, but most of them neglect the semantic relationships between words. On the other hand, there are algorithms that account for the semantic relations between words by making use of external knowledge contained in WordNet, MeSH, Wikipedia, etc., but they do not handle the high dimensionality. In this paper we present an efficient solution that addresses both problems. We propose a hierarchical clustering algorithm based on closed frequent itemsets that uses Wikipedia as external knowledge to enhance the document representation. We evaluate our methods by F-Score on standard datasets and show our results to be better than existing approaches.
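The itemset side of this approach can be sketched compactly: frequent termsets found across documents induce overlapping clusters (every document containing a termset joins that termset's cluster). The Wikipedia-based enrichment and the closedness filter are omitted for brevity, and the toy documents below are invented.

```python
# Brute-force sketch: map each frequent termset (up to max_size terms)
# to the documents containing it; these covers form overlapping clusters.

from itertools import combinations

def termset_clusters(docs, min_support, max_size=2):
    vocab = sorted({t for d in docs for t in d})
    clusters = {}
    for size in range(1, max_size + 1):
        for items in combinations(vocab, size):
            cover = [i for i, d in enumerate(docs) if set(items) <= d]
            if len(cover) >= min_support:
                clusters[items] = cover
    return clusters

docs = [{"data", "mining"}, {"data", "mining", "cluster"},
        {"cluster", "graph"}, {"data", "graph"}]
clusters = termset_clusters(docs, min_support=2)
```

Because a document can contain several frequent termsets, the resulting clusters overlap naturally, which is what makes itemset-based document clustering attractive for a hierarchy.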
IEEE International Conference on Fuzzy Systems | 2010
Ashish Mangalampalli; Vikram Pudi
Conventional Association Rule Mining (ARM) algorithms usually deal with datasets with binary values, and expect any numerical values to be converted to binary ones using sharp partitions, like Age = 25 to 60. To mitigate this constraint, fuzzy logic is used to convert quantitative values of attributes to fuzzy ones, eliminating the loss of information arising from sharp partitioning, especially at partition boundaries, before fuzzy association rules are generated. But before any fuzzy ARM algorithm can be used, the original dataset (with crisp attributes) needs to be transformed into a form with fuzzy attributes. This paper describes a methodology, called FPrep, for this pre-processing: it first uses fuzzy clustering to generate fuzzy partitions, and then uses these partitions to produce a fuzzy version (with fuzzy records) of the original dataset. Ultimately, the fuzzy records are represented in a standard manner such that they can serve as input to any fuzzy ARM algorithm, irrespective of how it works and processes fuzzy data. We also show that FPrep is much faster than comparable transformation techniques, which in turn depend on non-fuzzy techniques such as hard clustering (CLARANS and CURE). Moreover, we illustrate the quality of the fuzzy partitions generated using FPrep, and the number of frequent itemsets generated by a fuzzy ARM algorithm when preceded by FPrep.
Data Mining and Optimization | 2011
Aditya Singh; Kumar Shubhankar; Vikram Pudi
In this paper we propose an efficient method to rank research papers from various fields of research published in various conferences over the years. This ranking method is based on the citation network. The importance of a research paper is captured well by peer vote, which in this case is the paper being cited in other research papers. Using a modified version of the PageRank algorithm, we rank the research papers, assigning each an authoritative score. Using these paper scores, we formulate scores for conferences and authors and rank them as well. We introduce a new metric into the algorithm that takes the time factor into account when ranking research papers, to reduce the bias against recent papers, which have had less time to be studied and consequently cited by researchers than older papers. Often a researcher is more interested in finding the top conferences in a particular year than in the overall conference ranking. Considering the year of publication of the papers, in addition to the paper scores we also calculate the year-wise score of each conference through a slight modification of the above algorithm.
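A time-aware PageRank on a citation graph can be sketched as follows. The recency boost used here, 1 / (1 + decay * age), is an assumed form chosen only to illustrate offsetting the bias against newer papers; the paper's actual metric may differ, and the toy graph is invented.

```python
# Sketch of PageRank on a citation graph with an assumed recency boost;
# scores are renormalised each iteration so they remain a distribution.

def time_aware_pagerank(cites, years, now, d=0.85, decay=0.05, iters=100):
    """cites maps a paper to the list of papers it cites."""
    papers = list(years)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iters):
        new = {}
        for p in papers:
            inflow = sum(rank[q] / len(cites[q])
                         for q in papers if p in cites.get(q, ()))
            boost = 1.0 / (1.0 + decay * (now - years[p]))
            new[p] = ((1.0 - d) / n + d * inflow) * boost
        total = sum(new.values())
        rank = {p: s / total for p, s in new.items()}
    return rank

years = {"A": 2000, "B": 2005, "C": 2010}
cites = {"B": ["A"], "C": ["A", "B"]}     # toy citation graph
rank = time_aware_pagerank(cites, years, now=2012)
```

Here the twice-cited older paper A still outranks the others, but the boost keeps the uncited recent paper C from being penalised as heavily as plain PageRank would.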
Algorithmic Learning Theory | 2005
Risivardhan Thonangi; Vikram Pudi
Recent studies in classification have proposed ways of exploiting the association rule mining paradigm. These studies have performed extensive experiments to show their techniques to be both efficient and accurate. However, existing studies in this paradigm either do not provide any theoretical justification behind their approaches or assume independence between some parameters. In this work, we propose a new classifier based on association rule mining. Our classifier rests on the maximum entropy principle for its statistical basis and does not assume any independence not inferred from the given dataset. We use the classical generalized iterative scaling algorithm (GIS) to create our classification model. We show that GIS fails in some cases when itemsets are used as features, and provide modifications to rectify this problem. We show that this modified GIS runs much faster than the original. We also describe techniques to make GIS tractable for large feature spaces: we provide a new technique to divide a feature space into independent clusters, each of which can be handled separately. Our experimental results show that our classifier is generally more accurate than existing classification methods.
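The classical GIS loop the paper builds on can be shown on a toy maximum-entropy classifier with (word, class) indicator features. The data, feature choice, and iteration count below are assumptions for illustration; none of the paper's modifications for itemset features are included.

```python
# Toy maximum-entropy classifier trained with Generalized Iterative
# Scaling: lambda_f += (1/C) * log(empirical[f] / model[f]).

import math

def train_gis(data, classes, iters=300):
    """data: list of (set_of_words, label). Returns (weights, probs)."""
    feats = sorted({(w, y) for x, y in data for w in x})
    lam = {f: 0.0 for f in feats}
    n = len(data)
    # GIS constant: max number of active features on any (x, y) pair
    C = max(sum(1 for w in x if (w, y) in lam)
            for x, _ in data for y in classes)
    emp = {f: sum(1 for x, y in data if f[0] in x and f[1] == y) / n
           for f in feats}

    def probs(x):
        score = {y: math.exp(sum(lam.get((w, y), 0.0) for w in x))
                 for y in classes}
        z = sum(score.values())
        return {y: s / z for y, s in score.items()}

    for _ in range(iters):
        model = {f: 0.0 for f in feats}
        for x, _ in data:
            p = probs(x)
            for y in classes:
                for w in x:
                    if (w, y) in model:
                        model[(w, y)] += p[y] / n
        for f in feats:
            lam[f] += math.log(emp[f] / model[f]) / C
    return lam, probs

data = [({"sunny", "warm"}, "play"), ({"rainy", "cold"}, "stay"),
        ({"sunny", "cold"}, "play"), ({"rainy", "warm"}, "stay")]
weights, probs = train_gis(data, ["play", "stay"])
```

After training, the model strongly prefers "play" for inputs containing "sunny", which only ever co-occurred with that class, without assuming any feature independence.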
Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2013
Harshit Dubey; Vikram Pudi
A k-Nearest Neighbor (kNN) based classifier classifies a query instance based on the class labels of its neighboring instances. Although kNN has proved to be a ubiquitous classification tool with good scalability, it suffers from some drawbacks. The existing kNN algorithm is equivalent to using only local prior probabilities to predict instance labels, and hence does not take into account the class distribution around the neighborhood of the query instance, which results in undesirable performance on imbalanced data. In this paper, a modified version of the kNN algorithm is proposed that takes into account the class distribution in a wider region around the query instance. Our empirical experiments with several real-world datasets show that our algorithm outperforms current state-of-the-art approaches.
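One way to realise the idea is to take the k nearest neighbours as usual but weight each vote by the inverse frequency of its class within a wider neighbourhood of k_wide points. This weighting scheme and the toy data are assumptions for illustration; the paper's exact formulation may differ.

```python
# Sketch: kNN vote damped by class frequency in a wider neighbourhood,
# so a locally dominant majority class does not swamp minority votes.

from collections import Counter

def knn_predict(train, query, k, k_wide):
    """train: list of (feature_tuple, label); requires k <= k_wide."""
    ordered = sorted(train, key=lambda p: sum((a - b) ** 2
                                              for a, b in zip(p[0], query)))
    wide = Counter(label for _, label in ordered[:k_wide])
    votes = Counter()
    for _, label in ordered[:k]:
        votes[label] += 1.0 / wide[label]
    return votes.most_common(1)[0][0]

# Imbalanced toy data: one minority point amid majority-class points.
train = [((0.0, 0.0), "min"),
         ((0.1, 0.0), "maj"), ((0.0, 0.1), "maj"), ((0.2, 0.1), "maj"),
         ((0.1, 0.2), "maj"), ((0.3, 0.3), "maj"), ((0.4, 0.2), "maj")]
label = knn_predict(train, (0.04, 0.04), k=3, k_wide=7)
```

An unweighted 3-NN vote on this query would return the majority class (two of the three nearest points are "maj"); the wider-region weighting recovers the minority label instead.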
Knowledge Discovery and Data Mining | 2011
Aditya Desai; Himanshu Singh; Vikram Pudi
The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques that compute the similarities between instances, and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and is hence a major challenge. This is because the different values taken by a categorical attribute are not inherently ordered, so a direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. In this paper we present a new similarity measure for categorical data, DISC (Data-Intensive Similarity Measure for Categorical Data). DISC captures the semantics of the data without any help from a domain expert in defining the similarity. In addition, it is generic and simple to implement. These desirable features make it a very attractive alternative to existing approaches. Our experimental study compares it with 14 other similarity measures on 24 standard real datasets, of which 12 are used for classification and 12 for regression, and shows that it is more accurate than all its competitors.
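The general idea of a data-driven categorical similarity can be illustrated simply: two values are similar if they co-occur with similar distributions of the other attributes' values. This is not DISC itself (whose definition is in the paper), just a plain histogram-intersection stand-in on invented data.

```python
# Toy data-driven similarity: compare two categorical values by the
# overlap of their co-occurrence profiles over the other attributes.

from collections import defaultdict

def profile(rows, attr, value, others):
    """Normalised co-occurrence counts of `value` with other attributes."""
    counts, n = defaultdict(float), 0
    for row in rows:
        if row[attr] == value:
            n += 1
            for o in others:
                counts[(o, row[o])] += 1.0
    return {k: c / n for k, c in counts.items()}

def value_similarity(rows, attr, v1, v2, others):
    p1 = profile(rows, attr, v1, others)
    p2 = profile(rows, attr, v2, others)
    keys = set(p1) | set(p2)
    return sum(min(p1.get(k, 0.0), p2.get(k, 0.0)) for k in keys) / len(others)

rows = [{"animal": "cat",  "size": "small",  "habitat": "land"},
        {"animal": "dog",  "size": "small",  "habitat": "land"},
        {"animal": "dog",  "size": "medium", "habitat": "land"},
        {"animal": "fish", "size": "small",  "habitat": "water"}]
sim_cd = value_similarity(rows, "animal", "cat", "dog", ["size", "habitat"])
sim_cf = value_similarity(rows, "animal", "cat", "fish", ["size", "habitat"])
```

Even without any domain knowledge or inherent ordering of the values, "cat" comes out more similar to "dog" than to "fish", purely from how the values co-occur in the data.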
Data Mining and Optimization | 2011
Kumar Shubhankar; Aditya Pratap Singh; Vikram Pudi
In this paper we introduce a novel and efficient approach to detecting topics in a large corpus of research papers. With the rapidly growing size of academic literature, the problem of topic detection has become very challenging. We present a unique approach that uses closed frequent keyword-sets to form topics. Our approach also provides a natural method to cluster the research papers into hierarchical, overlapping clusters using topics as the similarity measure. To rank the research papers in each topic cluster, we devise a modified PageRank algorithm that assigns an authoritative score to each research paper by considering the sub-graph in which the paper appears. We test our algorithms on the DBLP dataset and experimentally show that they are fast, effective and scalable.
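The central construct, closed frequent keyword-sets (frequent sets no strict superset of which has the same support), can be demonstrated by brute force on a few invented titles; the DBLP-scale algorithm in the paper is of course far more efficient.

```python
# Brute-force enumeration of closed frequent keyword-sets: enumerate
# frequent sets, then drop any set absorbed by an equal-support superset.

from itertools import combinations

def closed_frequent_keyword_sets(titles, min_support):
    vocab = sorted({w for t in titles for w in t})
    freq = {}
    for size in range(1, len(vocab) + 1):
        for ks in combinations(vocab, size):
            support = sum(1 for t in titles if set(ks) <= t)
            if support >= min_support:
                freq[frozenset(ks)] = support
    return {ks: s for ks, s in freq.items()
            if not any(ks < other and s == freq[other] for other in freq)}

titles = [{"graph", "mining"}, {"graph", "mining", "social"},
          {"graph", "social"}, {"text", "mining"}]
topics = closed_frequent_keyword_sets(titles, min_support=2)
```

Restricting to closed sets removes redundant topics: {social} is dropped because {graph, social} covers exactly the same papers.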
Database Systems for Advanced Applications | 2005
Ravindranath Jampani; Vikram Pudi
Joins on set-valued attributes (set joins) have numerous database applications. In this paper we propose PRETTI (PREfix Tree based seT joIn), a suite of set join algorithms for containment, overlap and equality join predicates. Our algorithms use prefix trees and inverted indices; these structures are constructed on-the-fly if they are not already precomputed. This feature makes our algorithms usable for relations without indices and when joining intermediate results during join queries with more than two relations. Another feature of our algorithms is that results are output continuously during execution, not just at the end. Experiments on real-life datasets show that the total execution time of our algorithms is significantly less than that of previous approaches, even when the indices required by our algorithms are not precomputed.
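The inverted-index half of the idea can be sketched for the containment predicate: intersecting the posting lists of an R-set's elements yields exactly the S-sets that contain it. The prefix tree that PRETTI layers on top (to share work across R-sets with common prefixes) is omitted here for brevity.

```python
# Sketch of a set-containment join via an inverted index on S:
# candidates surviving the posting-list intersection contain R[i].

from collections import defaultdict

def containment_join(R, S):
    """Return pairs (i, j) with R[i] a subset of S[j]; sets are non-empty."""
    inverted = defaultdict(set)
    for j, s in enumerate(S):
        for e in s:
            inverted[e].add(j)
    result = []
    for i, r in enumerate(R):
        candidates = set.intersection(*(inverted[e] for e in r))
        result.extend((i, j) for j in sorted(candidates))
    return result

R = [{1, 2}, {3}]
S = [{1, 2, 3}, {2, 3}, {1, 2}]
pairs = containment_join(R, S)
```

Note that matches for each R-set are emitted as soon as its posting lists are intersected, echoing the paper's point that results stream out during execution rather than only at the end.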
International World Wide Web Conference | 2011
Ashish Mangalampalli; Adwait Ratnaparkhi; Andrew O. Hatch; Abraham Bagherjeiran; Rajesh Parekh; Vikram Pudi
Online advertising offers significantly finer targeting granularity, which has been leveraged in state-of-the-art targeting methods like Behavioral Targeting (BT). Such methods have been further complemented by recent work in Look-alike Modeling (LAM), which helps create models customized to each advertiser's requirements and each campaign's characteristics, and which shows ads to users who are most likely to convert on them, not just click them. In LAM, given data about converters and non-converters obtained from advertisers, we would like to train models automatically for each ad campaign. Such custom models help target more users who are similar to the set of converters the advertiser provides, and advertisers get more freedom to define the preferred sets of users on which custom targeting models should be built. In behavioral data, the number of conversions (positive class) per campaign is very small (conversions per impression for the advertisers in our dataset are much less than 10^-4), giving rise to a highly skewed training dataset in which most records pertain to the negative class. Campaigns with very few conversions are called tail campaigns, and those with many conversions are called head campaigns. Creating Look-alike Models for tail campaigns is very challenging with popular classifiers like linear SVM and GBDT, because of the very small number of positive-class examples such campaigns contain. In this paper, we present an Associative Classification (AC) approach to LAM for tail campaigns. Pairs of features are used to derive rules to build a rule-based associative classifier, with the rules sorted by frequency-weighted log-likelihood ratio (F-LLR). The top k rules, sorted by F-LLR, are then applied to any test record to score it. Individual features can also form rules by themselves, though the number of such rules in the top k rules and in the whole rule-set is very small. Our algorithm is based on Hadoop, and is thus very efficient in terms of speed.
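The rule-scoring idea can be sketched on toy data: single features and feature pairs become rules, each scored by a frequency-weighted log-likelihood ratio, and a record is scored by its top-k matching rules. The exact F-LLR definition and the smoothing used here are assumptions for illustration, not the paper's formulas, and the converter data is invented.

```python
# Sketch of an associative classifier: rules from single features and
# feature pairs, scored by frequency * log-likelihood ratio (assumed
# F-LLR form), applied top-k by absolute score.

import math
from itertools import combinations

def build_rules(positives, negatives, smoothing=1.0):
    def patterns(record):
        feats = sorted(record)
        return ([frozenset([f]) for f in feats] +
                [frozenset(p) for p in combinations(feats, 2)])

    counts = {}
    for label, records in (("pos", positives), ("neg", negatives)):
        for rec in records:
            for pat in patterns(rec):
                counts.setdefault(pat, {"pos": 0, "neg": 0})[label] += 1

    rules = {}
    for pat, c in counts.items():
        p_pos = (c["pos"] + smoothing) / (len(positives) + 2 * smoothing)
        p_neg = (c["neg"] + smoothing) / (len(negatives) + 2 * smoothing)
        rules[pat] = (c["pos"] + c["neg"]) * math.log(p_pos / p_neg)
    return rules

def score(record, rules, k=5):
    """Sum of the top-k matching rules' scores, ranked by magnitude."""
    matched = sorted((s for pat, s in rules.items() if pat <= set(record)),
                     key=abs, reverse=True)
    return sum(matched[:k])

positives = [{"sports", "male"}, {"sports", "young"}]    # converters
negatives = [{"news", "old"}, {"news", "male"}]          # non-converters
rules = build_rules(positives, negatives)
```

A positive score marks a record as converter-like; because each rule's counts can be gathered independently per pattern, this style of scoring parallelises naturally, in the spirit of the Hadoop implementation mentioned above.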