Publications


Featured research published by Amit Dhurandhar.


Science | 2017

Predicting human olfactory perception from chemical features of odor molecules

Andreas Keller; Richard C. Gerkin; Yuanfang Guan; Amit Dhurandhar; Gábor Turu; Bence Szalai; Yusuke Ihara; Chung Wen Yu; Russ Wolfinger; Celine Vens; Leander Schietgat; Kurt De Grave; Raquel Norel; Gustavo Stolovitzky; Guillermo A. Cecchi; Leslie B. Vosshall; Pablo Meyer

How will this molecule smell? We still do not understand what a given substance will smell like. Keller et al. launched an international crowd-sourced competition in which many teams tried to solve how the smell of a molecule will be perceived by humans. The teams were given access to a database of responses from subjects who had sniffed a large number of molecules and been asked to rate each smell across a range of different qualities. The teams were also given a comprehensive list of the physical and chemical features of the molecules smelled. The teams produced algorithms to predict the correspondence between the quality of each smell and a given molecule. The best models that emerged from this challenge could accurately predict how a new molecule would smell. Science, this issue p. 820

Results of a crowdsourcing competition show that it is possible to accurately predict and reverse-engineer the smell of a molecule. It is still not possible to predict whether a given molecule will have a perceived odor or what olfactory percept it will produce. We therefore organized the crowd-sourced DREAM Olfaction Prediction Challenge. Using a large olfactory psychophysical data set, teams developed machine-learning algorithms to predict sensory attributes of molecules based on their chemoinformatic features. The resulting models accurately predicted odor intensity and pleasantness and also successfully predicted 8 among 19 rated semantic descriptors (“garlic,” “fish,” “sweet,” “fruit,” “burnt,” “spices,” “flower,” and “sour”). Regularized linear models performed nearly as well as random forest–based ones, with a predictive accuracy that closely approaches a key theoretical limit. These models help to predict the perceptual qualities of virtually any molecule with high accuracy and also reverse-engineer the smell of a molecule.
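
The abstract above describes fitting regularized linear and random-forest models to chemoinformatic features. As a rough illustration of that setup (not the challenge-winning pipeline), the following sketch compares a ridge regression with a random forest on a synthetic stand-in for the molecular feature matrix and a single perceptual rating; all names, sizes, and data are placeholders.

```python
# A minimal sketch of the modeling setup described in the abstract: predict a
# perceptual rating of a molecule from chemoinformatic features using a regularized
# linear model and a random forest. The feature matrix and rating are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_molecules, n_features = 400, 50                      # hypothetical sizes
X = rng.normal(size=(n_molecules, n_features))         # chemoinformatic descriptors
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n_molecules)  # synthetic "pleasantness"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# The abstract reports that regularized linear models performed nearly as well as
# random-forest-based ones; correlation with held-out ratings is one way to compare.
for name, model in [("ridge", ridge), ("random forest", forest)]:
    r = np.corrcoef(model.predict(X_test), y_test)[0, 1]
    print(f"{name}: held-out correlation = {r:.2f}")
```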


International Journal of Machine Learning and Cybernetics | 2013

Probabilistic characterization of nearest neighbor classifier

Amit Dhurandhar; Alin Dobra

The k-nearest neighbor (kNN) classification algorithm is one of the simplest yet most effective classification algorithms in use. It finds major applications in text categorization, outlier detection, handwritten character recognition, fraud detection, and other related areas. Though sound theoretical results exist regarding convergence of the generalization error (GE) of this algorithm to the Bayes error, these results are asymptotic in nature. The understanding of the behavior of the kNN algorithm in real-world scenarios is limited. In this paper, assuming categorical attributes, we provide a principled way of studying the non-asymptotic behavior of the kNN algorithm. In particular, we derive exact closed-form expressions for the moments of the GE for this algorithm. The expressions are functions of the sample, and hence can be computed given any joint probability distribution defined over the input–output space. These expressions can be used as a tool that aids in unveiling the statistical behavior of the algorithm in settings of interest, viz., an acceptable value of k for a given sample size and distribution. Moreover, Monte Carlo approximations of such closed-form expressions have been shown in Dhurandhar and Dobra (J Mach Learn Res 9, 2008; ACM Trans Knowl Discov Data 3, 2009) to be a superior alternative in terms of speed and accuracy when compared with computing the moments directly using Monte Carlo. This work employs the semi-analytical methodology that was proposed recently to better understand the non-asymptotic behavior of learning algorithms.
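
The paper derives exact closed-form expressions for the moments of the generalization error; those are not reproduced here. The sketch below only illustrates the direct Monte Carlo alternative that the closed forms are compared against: repeatedly drawing training samples from a known (here, purely illustrative) distribution over categorical attributes and averaging the resulting kNN generalization errors.

```python
# A minimal sketch of the direct Monte Carlo baseline: estimate the first two moments
# of the kNN generalization error over training samples drawn from a known joint
# distribution on categorical attributes. The distribution and sizes are illustrative
# assumptions, not the paper's.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sample_dataset(n):
    # Two binary attributes; the label is a noisy XOR of the attributes.
    X = rng.integers(0, 2, size=(n, 2))
    y = (X[:, 0] ^ X[:, 1]) ^ (rng.random(n) < 0.1)
    return X, y.astype(int)

def generalization_error(k, n_train, n_test=2000):
    X_tr, y_tr = sample_dataset(n_train)
    X_te, y_te = sample_dataset(n_test)       # large test set as a proxy for the GE
    clf = KNeighborsClassifier(n_neighbors=k, metric="hamming").fit(X_tr, y_tr)
    return np.mean(clf.predict(X_te) != y_te)

errors = np.array([generalization_error(k=3, n_train=50) for _ in range(200)])
print("E[GE] ≈", errors.mean(), "  Var[GE] ≈", errors.var())
```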


Knowledge and Information Systems | 2012

Distribution-free bounds for relational classification

Amit Dhurandhar; Alin Dobra

Statistical relational learning (SRL) is a subarea in machine learning which addresses the problem of performing statistical inference on data that is correlated and not independently and identically distributed (i.i.d.)—as is generally assumed. For the traditional i.i.d. setting, distribution-free bounds exist, such as the Hoeffding bound, which are used to provide confidence bounds on the generalization error of a classification algorithm given its hold-out error on a sample size of N. Bounds of this form are currently not present for the type of interactions that are considered in the data by relational classification algorithms. In this paper, we extend the Hoeffding bounds to the relational setting. In particular, we derive distribution-free bounds for certain classes of data generation models that do not produce i.i.d. data and are based on the type of interactions that are considered by relational classification algorithms that have been developed in SRL. We conduct empirical studies on synthetic and real data which show that these data generation models are indeed realistic and the derived bounds are tight enough for practical use.
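
For reference, the classical i.i.d. Hoeffding bound mentioned above can be stated in a few lines: with probability at least 1 − δ, the generalization error exceeds the hold-out error measured on N independent examples by at most sqrt(ln(1/δ) / (2N)). The relational extensions derived in the paper are not reproduced; the snippet below shows only the standard bound.

```python
# The standard i.i.d. Hoeffding confidence bound referenced in the abstract: the
# generalization error exceeds the hold-out error by more than epsilon with
# probability at most exp(-2 * N * epsilon^2).
import math

def hoeffding_upper_bound(holdout_error, n, delta=0.05):
    """One-sided bound: with probability >= 1 - delta, GE <= holdout_error + epsilon."""
    epsilon = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return holdout_error + epsilon

print(hoeffding_upper_bound(holdout_error=0.12, n=1000))    # ~0.159
print(hoeffding_upper_bound(holdout_error=0.12, n=10000))   # ~0.132
```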


ACM Transactions on Knowledge Discovery from Data | 2009

Semi-analytical method for analyzing models and model selection measures based on moment analysis

Amit Dhurandhar; Alin Dobra

In this article, we propose a moment-based method for studying models and model selection measures. By focusing on the probabilistic space of classifiers induced by the classification algorithm, rather than on that of datasets, we obtain efficient characterizations for computing the moments; since the resulting formulae are too complicated for direct interpretation, we visualize them. By assuming the data to be drawn independently and identically distributed from the underlying probability distribution, and by going over the space of all possible datasets, we establish general relationships between the generalization error, hold-out-set error, cross-validation error, and leave-one-out error. We then exemplify the method and the results by studying the behavior of these errors for the naive Bayes classifier.
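
As a hedged illustration of the quantities the article relates analytically, the sketch below estimates the average generalization, hold-out, and cross-validation errors of a naive Bayes classifier by brute-force Monte Carlo over repeated i.i.d. draws from a synthetic distribution; the article's moment expressions replace exactly this kind of simulation.

```python
# A minimal Monte Carlo illustration of the error quantities related analytically in
# the article: generalization error, hold-out error, and cross-validation error of a
# naive Bayes classifier, averaged over repeated i.i.d. draws of the dataset. The
# synthetic distribution below is an assumption made purely for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

def draw(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None], scale=1.5, size=(n, 2))
    return X, y

ge, ho, cv = [], [], []
for _ in range(100):
    X, y = draw(200)
    X_big, y_big = draw(5000)                            # large sample as a proxy for the GE
    X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = GaussianNB().fit(X_tr, y_tr)
    ge.append(np.mean(clf.predict(X_big) != y_big))
    ho.append(np.mean(clf.predict(X_ho) != y_ho))
    cv.append(1.0 - cross_val_score(GaussianNB(), X, y, cv=10).mean())

print("mean GE:", np.mean(ge), " mean hold-out:", np.mean(ho), " mean 10-fold CV:", np.mean(cv))
```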


Journal of Intelligent Manufacturing | 2016

Continuous prediction of manufacturing performance throughout the production lifecycle

Sholom M. Weiss; Amit Dhurandhar; Robert J. Baseman; Brian F. White; Ronald Logan; Jonathan Winslow; Daniel J. Poindexter

We describe methods for continual prediction of manufactured product quality prior to final testing. In our most expansive modeling approach, an estimated final characteristic of a product is updated after each manufacturing operation. Our initial application is the manufacture of microprocessors, where we predict final microprocessor speed. Using these predictions, early corrective manufacturing actions may be taken to increase the speed of expected slow wafers (a collection of microprocessors) or reduce the speed of fast wafers. Such predictions may also be used to initiate corrective supply chain management actions. Developing statistical learning models for this task has many complicating factors: (a) a temporally unstable population, (b) missing data resulting from sparsely sampled measurements, and (c) relatively few available measurements prior to corrective action opportunities. In a real manufacturing pilot application, our automated models selected 125 fast wafers in real time. As predicted, those wafers were significantly faster than average. During manufacture, downstream corrective processing restored 25 nominally unacceptable wafers to normal operation.
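
A minimal sketch, on synthetic data, of the "update after every operation" idea described above: one regression model per manufacturing stage, each trained only on the measurements available up to that stage, with sparsely sampled (missing) measurements imputed. This is an illustration, not the authors' production system; all sizes and values are placeholders.

```python
# One model per stage, refreshed prediction after every stage. Synthetic data only.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_wafers, n_stages = 300, 5
measurements = rng.normal(size=(n_wafers, n_stages))
measurements[rng.random(measurements.shape) < 0.3] = np.nan     # sparsely sampled
final_speed = np.nansum(measurements, axis=1) + rng.normal(scale=0.3, size=n_wafers)

# Train one model per stage on the prefix of measurements seen up to that stage.
stage_models = []
for s in range(1, n_stages + 1):
    model = make_pipeline(SimpleImputer(strategy="mean"), Ridge(alpha=1.0))
    stage_models.append(model.fit(measurements[:, :s], final_speed))

# As a new wafer moves through the line, refresh its predicted final speed.
new_wafer = rng.normal(size=n_stages)
for s, model in enumerate(stage_models, start=1):
    pred = model.predict(new_wafer[:s].reshape(1, -1))[0]
    print(f"after stage {s}: predicted final speed = {pred:.2f}")
```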


International Conference on Data Mining | 2010

Learning Maximum Lag for Grouped Graphical Granger Models

Amit Dhurandhar

Temporal causal modeling has been a highly active research area in the last few decades. Temporal or time series data arises in a wide array of application domains, ranging from medicine to finance. Deciphering the causal relationships between the various time series can be critical to understanding and, consequently, enhancing the efficacy of the underlying processes in these domains. Grouped graphical modeling methods, such as Granger methods, provide an efficient alternative for uncovering such dependencies. A key parameter that affects the performance of these methods is the maximum lag, which specifies how far into the past one has to look to predict the future. A smaller-than-required lag will miss important dependencies, while an excessively large lag will increase the computational complexity and add noisy dependencies. In this paper, we propose a novel approach for estimating this key parameter efficiently. One of the primary advantages of this approach is that it can, in a principled manner, incorporate prior knowledge of dependencies that are known to exist between certain pairs of time series out of the entire set, and use this information to estimate the lag for the entire set. This ability to extrapolate the lag from a known subset to the entire set, in order to obtain better estimates of the overall lag efficiently, makes the approach attractive in practice.
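
The paper's method of extrapolating the lag from known dependencies is not reproduced here. As a generic baseline showing why the maximum lag matters, the sketch below regresses one series on lagged copies of all series with a lasso (the usual grouped graphical Granger building block) and scores a few candidate maximum lags by held-out error; the data and the injected lag-2 dependency are synthetic.

```python
# Score candidate maximum lags for a lasso-based Granger regression on synthetic data.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
T, p = 500, 4
series = rng.normal(size=(T, p))
series[3:, 0] += 0.6 * series[1:-2, 1]          # series 1 drives series 0 at lag 2

def lagged_design(data, max_lag, target):
    """Stack all series at lags 1..max_lag as features for the target series."""
    X = np.hstack([data[max_lag - l : -l or None] for l in range(1, max_lag + 1)])
    y = data[max_lag:, target]
    return X, y

for max_lag in (1, 2, 4, 8):
    X, y = lagged_design(series, max_lag, target=0)
    split = int(0.8 * len(y))
    model = LassoCV(cv=5).fit(X[:split], y[:split])
    err = np.mean((model.predict(X[split:]) - y[split:]) ** 2)
    print(f"max lag {max_lag}: held-out MSE = {err:.3f}")
```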


International Conference on Data Mining | 2010

Multi-step Time Series Prediction in Complex Instrumented Domains

Amit Dhurandhar

Time series prediction algorithms are widely used in applications such as demand forecasting and weather forecasting to make well-informed decisions. In this paper, we compare the most prevalent of these methods, as well as propose our own, for time series generated from highly complex industrial processes. These time series are non-stationary, and the relationships between the various time series vary with time. Given a set of time series from an industrial process, the challenge is to keep predicting a chosen one as far ahead as possible, with knowledge of the other time series at those instants in time. This scenario arises because the chosen time series is usually very expensive to measure or extremely difficult to obtain compared with the rest. Our studies on real data suggest that our method is substantially more robust than existing methods when predicting multiple steps ahead in these complex domains.
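
As a generic baseline for the setting described above (not the authors' method), the sketch below does recursive multi-step prediction of a chosen series while the other, cheaper-to-measure series remain observed at every future step; all data are synthetic placeholders.

```python
# Recursive multi-step forecasting of one series given observed exogenous series.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
T = 600
exog = np.cumsum(rng.normal(size=(T, 3)), axis=0)            # cheap-to-measure series
target = 0.5 * exog[:, 0] - 0.3 * exog[:, 2] + rng.normal(scale=0.5, size=T)

# One-step model: predict target[t] from target[t-1] and the exogenous values at t.
X = np.column_stack([target[:-1], exog[1:]])
y = target[1:]
model = GradientBoostingRegressor(random_state=0).fit(X[:500], y[:500])

# Recursive multi-step forecast: feed each prediction back in as the lagged target.
last_target = target[500]
for step in range(1, 6):
    features = np.concatenate([[last_target], exog[500 + step]]).reshape(1, -1)
    last_target = model.predict(features)[0]
    print(f"{step} steps ahead: predicted {last_target:.2f}, actual {target[500 + step]:.2f}")
```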


International Conference on Data Mining | 2015

Informative Prediction Based on Ordinal Questionnaire Data

Tsuyoshi Idé; Amit Dhurandhar

Supporting human decision making is a major goal of data mining. The more critical the decision making, the more interpretability is required in the predictive model. This paper proposes a new framework to build a fully interpretable predictive model for questionnaire data, while maintaining high prediction accuracy with regard to the final outcome. Such a model has applications in project risk assessment, in health care, in sentiment analysis, and presumably in any real-world application that relies on questionnaire data for informative and accurate prediction. Our framework is inspired by models in item response theory (IRT), which were originally developed in psychometrics with applications to standardized tests such as the SAT. We first extend these models, which are essentially unsupervised, to the supervised setting. We then derive a distance metric from the trained model to define the informativeness of individual question items. On real-world questionnaire data obtained from information technology projects, we demonstrate the power of this approach in terms of both interpretability and predictability. To the best of our knowledge, this is the first work that leverages the IRT framework to provide informative and accurate prediction on ordinal questionnaire data.
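
A minimal sketch of the IRT building block the paper extends: the two-parameter logistic item characteristic curve, whose discrimination parameter is one natural indicator of how informative a question item is. This binary-response simplification stands in for the ordinal models used in the paper; the supervised extension and the derived distance metric are not reproduced.

```python
# The two-parameter logistic (2PL) item characteristic curve from classical IRT.
import numpy as np

def icc(theta, discrimination, difficulty):
    """P(positive response | latent trait theta) under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

theta = np.linspace(-3, 3, 7)
# A highly discriminative item separates respondents sharply around its difficulty;
# a weakly discriminative item barely changes across the latent trait.
print("discriminative item:", np.round(icc(theta, discrimination=2.0, difficulty=0.0), 2))
print("uninformative item: ", np.round(icc(theta, discrimination=0.3, difficulty=0.0), 2))
```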


Computational Intelligence and Security | 2005

Robust pattern recognition scheme for devanagari script

Amit Dhurandhar; Kartik Shankarnarayanan; Rakesh Jawale

In this paper, a Devanagari script recognition scheme based on a novel algorithm is proposed. Devanagari script poses new challenges in the field of pattern recognition, primarily due to the highly cursive nature of the script seen across its diverse character set. In the proposed algorithm, the character is initially subjected to a simple noise removal filter. Based on a reference co-ordinate system, the significant contours of the character are extracted and characterized as a contour set. Recognizing a character involves comparing these contour sets with those in the enrolled database. The matching of contour sets is achieved by characterizing each contour by its length, its relative position in the reference co-ordinate system, and an interpolation scheme that eliminates displacement errors. In the Devanagari script, similar contour sets are observed among a few characters; hence, this method filters out disparate characters and narrows the possibilities to a limited set. The next step focuses on the subtle yet vital differences between the similar contours in this limited set, via a prioritization scheme that concentrates only on those portions of the character that reflect its uniqueness. The major challenge in developing the proposed scheme lay in striking the right balance between definiteness and flexibility to derive optimal solutions for out-of-sample data. Experimental results show the validity and efficiency of the developed scheme for recognizing characters of this script.
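
A rough sketch, using OpenCV, of the kind of contour-set representation described above: extract a character's contours, describe each by a scale-normalized length and its relative position in a reference coordinate frame, and compare characters by matching these descriptors. The greedy matching rule below is a simplified stand-in for the paper's interpolation and prioritization schemes, and every parameter is an assumption.

```python
# Contour-set descriptors and a simple set-matching distance. Assumes OpenCV >= 4 and
# a binarized, single-channel (uint8) character image.
import cv2
import numpy as np

def contour_set(binary_char_image):
    """Return (relative_position, normalized_length) descriptors for each contour."""
    contours, _ = cv2.findContours(binary_char_image, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    h, w = binary_char_image.shape
    descriptors = []
    for c in contours:
        x, y, cw, ch = cv2.boundingRect(c)
        center = ((x + cw / 2) / w, (y + ch / 2) / h)     # position in the reference frame
        length = cv2.arcLength(c, True) / (h + w)         # scale-normalized contour length
        descriptors.append((center, length))
    return descriptors

def contour_set_distance(set_a, set_b):
    """Greedy nearest-descriptor matching; small values indicate similar characters."""
    if not set_a or not set_b:
        return float("inf")
    total = 0.0
    for (ca, la) in set_a:
        total += min(abs(la - lb) + np.hypot(ca[0] - cb[0], ca[1] - cb[1])
                     for (cb, lb) in set_b)
    return total / len(set_a)
```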


Knowledge and Information Systems | 2017

Supervised item response models for informative prediction

Tsuyoshi Idé; Amit Dhurandhar

Supporting human decision-making is a major goal of data mining. The more critical the decision-making, the more interpretability is required in the predictive model. This paper proposes a new framework to build a fully interpretable predictive model for questionnaire data, while maintaining a reasonable prediction accuracy with regard to the final outcome. Such a model has applications in project risk assessment, in healthcare, in social studies, and, presumably, in any real-world application that relies on questionnaire data for informative and accurate prediction. Our framework is inspired by models in item response theory (IRT), which were originally developed in psychometrics with applications to standardized academic tests. We extend these models, which are essentially unsupervised, to the supervised setting. For model estimation, we introduce a new iterative algorithm that combines Gauss–Hermite quadrature with an expectation–maximization algorithm. The learned probabilistic model is linked to the metric learning framework for informative and accurate prediction. The model is validated on three real-world data sets: two are from information technology project failure prediction, and the third is an international social survey about people's happiness. To the best of our knowledge, this is the first work that leverages the IRT framework to provide informative and accurate prediction on ordinal questionnaire data.
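
A minimal sketch of the Gauss–Hermite quadrature step mentioned above: approximating the marginal likelihood of a response pattern by integrating the item response probabilities over a standard-normal latent trait at quadrature nodes. The item parameters are illustrative, the items are treated as binary for simplicity, and the full EM procedure and metric-learning link from the paper are not reproduced.

```python
# Gauss-Hermite quadrature over a standard-normal latent trait, as used inside
# EM-style estimation of item response models. Parameters below are illustrative.
import numpy as np
from numpy.polynomial.hermite import hermgauss

def icc(theta, a, b):
    """2PL item response probability for latent trait theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def marginal_likelihood(responses, a, b, n_nodes=21):
    """P(responses) = integral of prod_j P(y_j | theta) * N(theta; 0, 1) d(theta)."""
    nodes, weights = hermgauss(n_nodes)
    theta = np.sqrt(2.0) * nodes                 # change of variables to N(0, 1)
    p = np.ones_like(theta)
    for y_j, a_j, b_j in zip(responses, a, b):
        pr = icc(theta, a_j, b_j)
        p *= pr if y_j == 1 else (1.0 - pr)
    return np.sum(weights * p) / np.sqrt(np.pi)

a = [1.2, 0.8, 1.5]                              # discrimination parameters (illustrative)
b = [-0.5, 0.0, 0.7]                             # difficulty parameters (illustrative)
print(marginal_likelihood([1, 0, 1], a, b))
```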
