Avinava Dubey | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Avinava Dubey is active.

Explore More

Publication

Featured researches published by Avinava Dubey.

foundations of software engineering | 2012

AUSUM: approach for unsupervised bug report summarization

Senthil Mani; Rose Catherine; Vibha Singhal Sinha; Avinava Dubey

In most software projects, resolved bugs are archived for future reference. These bug reports contain valuable information on the reported problem, investigation and resolution. When bug triaging, developers look for how similar problems were resolved in the past. Search over bug repository gives the developer a set of recommended bugs to look into. However, the developer still needs to manually peruse the contents of the recommended bugs which might vary in size from a couple of lines to thousands. Automatic summarization of bug reports is one way to reduce the amount of data a developer might need to go through. Prior work has presented learning based approaches for bug summarization. These approaches have the disadvantage of requiring large training set and being biased towards the data on which the model was learnt. In fact, maximum efficacy was reported when the model was trained and tested on bug reports from the same project. In this paper, we present the results of applying four unsupervised summarization techniques for bug summarization. Industrial bug reports typically contain a large amount of noise---email dump, chat transcripts, core-dump---useless sentences from the perspective of summarization. These derail the unsupervised approaches, which are optimized to work on more well-formed documents. We present an approach for noise reduction, which helps to improve the precision of summarization over the base technique (4% to 24% across subjects and base techniques). Importantly, by applying noise reduction, two of the unsupervised techniques became scalable for large sized bug reports.

siam international conference on data mining | 2013

A nonparametric mixture model for topic modeling over time

Avinava Dubey; Ahmed Hefny; Sinead A. Williamson; Eric P. Xing

A single, stationary topic model such as latent Dirichlet allocation is inappropriate for modeling corpora that span long time periods, as the popularity of topics is likely to change over time. A number of models that incorporate time have been proposed, but in general they either exhibit limited forms of temporal variation, or require computationally expensive inference methods. In this paper we propose nonparametric Topics over Time (npTOT), a model for time-varying topics that allows an unbounded number of topics and flexible distribution over the temporal variations in those topics’ popularity. We develop a collapsed Gibbs sampler for the proposed model and compare against existing models on synthetic and real document sets.

web search and data mining | 2014

Spatial compactness meets topical consistency: jointly modeling links and content for community detection

Mrinmaya Sachan; Avinava Dubey; Shashank Srivastava; Eric P. Xing; Eduard H. Hovy

In this paper, we address the problem of discovering topically meaningful, yet compact (densely connected) communities in a social network. Assuming the social network to be an integer-weighted graph (where the weights can be intuitively defined as the number of common friends, followers, documents exchanged, etc.), we transform the social network to a more efficient representation. In this new representation, each user is a bag of her one-hop neighbors. We propose a mixed-membership model to identify compact communities using this transformation. Next, we augment the representation and the model to incorporate user-content information imposing topical consistency in the communities. In our model a user can belong to multiple communities and a community can participate in multiple topics. This allows us to discover community memberships as well as community and user interests. Our method outperforms other well known baselines on two real-world social networks. Finally, we also provide a fast, parallel approximation of the same.

european conference on machine learning | 2010

A cluster-level semi-supervision model for interactive clustering

Avinava Dubey; Indrajit Bhattacharya; Shantanu Godbole

Semi-supervised clustering models, that incorporate user provided constraints to yield meaningful clusters, have recently become a popular area of research. In this paper, we propose a cluster-level semi-supervision model for inter-active clustering. Prototype based clustering algorithms typically alternate between updating cluster descriptions and assignment of data items to clusters. In our model, the user provides semi-supervision directly for these two steps. Assignment feedback re-assigns data items among existing clusters, while cluster description feedback helps to position existing cluster centers more meaningfully. We argue that providing such supervision is more natural for exploratory data mining, where the user discovers and interprets clusters as the algorithm progresses, in comparison to the pair-wise instance level supervision model, particularly for high dimensional data such as document collection. We show how such feedback can be interpreted as constraints and incorporated within the kmeans clustering framework. Using experimental results on multiple real-world datasets, we show that this framework improves clustering performance significantly beyond traditional k-means. Interestingly, when given the same number of feedbacks from the user, the proposed framework significantly outperforms the pair-wise supervision model.

international conference on data mining | 2009

Conditional Models for Non-smooth Ranking Loss Functions

Avinava Dubey; Jinesh Machchhar; Chiranjib Bhattacharyya; Soumen Chakrabarti

Learning to rank is an important area at the interface of machine learning, information retrieval and Web search. The central challenge in optimizing various measures of ranking loss is that the objectives tend to be non-convex and discontinuous. To make such functions amenable to gradient based optimization procedures one needs to design clever bounds. In recent years, boosting, neural networks, support vector machines, and many other techniques have been applied. However, there is little work on directly modeling a conditional probability Pr(y|x_q) where y is a permutation of the documents to be ranked and x_q represents their feature vectors with respect to a query q. A major reason is that the space of y is huge: n! if n documents must be ranked. We first propose an intuitive and appealing expected loss minimization objective, and give an efficient shortcut to evaluate it despite the huge space of ys. Unfortunately, the optimization is non-convex, so we propose a convex approximation. We give a new, efficient Monte Carlo sampling method to compute the objective and gradient of this approximation, which can then be used in a quasi-Newton optimizer like LBFGS. Extensive experiments with the widely-used LETOR dataset show large ranking accuracy improvements beyond recent and competitive algorithms.

international conference on data mining | 2011

Learning Dirichlet Processes from Partially Observed Groups

Avinava Dubey; Indrajit Bhattacharya; Mrinal Kanti Das; Tanveer Afzal Faruquie; Chiranjib Bhattacharyya

Motivated by the task of vernacular news analysis using known news topics from national news-papers, we study the task of topic analysis, where given source datasets with observed topics, data items from a target dataset need to be assigned either to observed source topics or to new ones. Using Hierarchical Dirichlet Processes for addressing this task imposes unnecessary and often inappropriate generative assumptions on the observed source topics. In this paper, we explore Dirichlet Processes with partially observed groups (POG-DP). POG-DP avoids modeling the given source topics. Instead, it directly models the conditional distribution of the target data as a mixture of a Dirichlet Process and the posterior distribution of a Hierarchical Dirichlet Process with known groups and topics. This introduces coupling between selection probabilities of all topics within a source, leading to effective identification of source topics. We further improve on this with a Combinatorial Dirichlet Process with partially observed groups (POG-CDP) that captures finer grained coupling between related topics by choosing intersections between sources. We evaluate our models in three different real-world applications. Using extensive experimentation, we compare against several baselines to show that our model performs significantly better in all three applications.

international conference on machine learning | 2013