
Publication


Featured research published by Rohit Babbar.


Analytics for Noisy Unstructured Text Data | 2010

Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text

Rohit Babbar; Nidhi Singh

Regular expressions have been used for information extraction tasks in a variety of domains. The alphabet of a regular expression can consist either of the relevant tokens corresponding to the entity of interest or of individual characters, in which case the alphabet size becomes very large. The presence of noise in unstructured text documents, combined with the increased alphabet size, poses a significant challenge both for entity extraction tasks and for algorithmically learning complex regular expressions. In this paper, we present a novel algorithm for regular expression learning which clusters similar matches to obtain candidate regular expressions, identifies and eliminates noisy clusters, and finally uses a weighted disjunction of the most promising candidates to obtain the final expression. The experimental results demonstrate that this final expression achieves both high precision and high recall, which reinforces the applicability of our approach to entity extraction tasks of practical importance.
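A minimal sketch of the cluster-then-disjoin idea described above. The character-class signature used for clustering, the cluster-size threshold, and all function names are illustrative assumptions, not the paper's actual algorithm; ordering kept clusters by support stands in for the weighting step.

```python
import re
from collections import defaultdict

def signature(match: str) -> str:
    """Map each character to a coarse character class (the clustering key)."""
    out = []
    for ch in match:
        if ch.isdigit():
            out.append(r"\d")
        elif ch.isalpha():
            out.append(r"[A-Za-z]")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def learn_regex(matches, min_cluster_size=3):
    """Cluster matches by signature, drop small (presumably noisy) clusters,
    and return a disjunction of the surviving candidate expressions."""
    clusters = defaultdict(list)
    for m in matches:
        clusters[signature(m)].append(m)
    kept = [(sig, len(ms)) for sig, ms in clusters.items()
            if len(ms) >= min_cluster_size]
    kept.sort(key=lambda p: p[1], reverse=True)   # most-supported first
    return "|".join(f"(?:{sig})" for sig, _ in kept)

# Example: learn a pattern for phone-like tokens from noisy matches.
matches = ["555-1234", "555-9876", "123-4567", "ab-cdef", "777-0000"]
pattern = learn_regex(matches)
print(pattern)                                        # (?:\d\d\d\-\d\d\d\d)
print(re.fullmatch(pattern, "888-2468") is not None)  # True
```

The noisy match "ab-cdef" forms a singleton cluster and is eliminated by the size threshold, so only the digit pattern survives.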


Web Search and Data Mining | 2017

DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification

Rohit Babbar; Bernhard Schölkopf

Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. Datasets in extreme classification exhibit a power-law distribution, i.e., a large fraction of labels have very few positive instances. Most state-of-the-art approaches for extreme multi-label classification attempt to capture correlation among labels by embedding the label matrix into a low-dimensional linear subspace. However, in the presence of power-law distributed, extremely large and diverse label spaces, structural assumptions such as low rank can easily be violated. In this work, we present DiSMEC, a large-scale distributed framework for learning one-versus-rest linear classifiers coupled with explicit capacity control of model size. Unlike most state-of-the-art methods, DiSMEC does not make any low-rank assumption on the label matrix. Using a double layer of parallelization, DiSMEC can learn classifiers for datasets consisting of hundreds of thousands of labels within a few hours. The explicit capacity-control mechanism filters out spurious parameters, keeping the model compact without losing prediction accuracy. We conduct extensive empirical evaluation on publicly available real-world datasets with up to 670,000 labels. We compare DiSMEC with recent state-of-the-art approaches, including SLEEC, a leading approach for learning sparse local embeddings, and FastXML, a tree-based approach optimizing a ranking-based loss function. On some of the datasets, DiSMEC significantly boosts prediction accuracy: 10% better than SLEEC and 15% better than FastXML, in absolute terms.
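A minimal sketch of one-versus-rest training with weight pruning as capacity control, assuming a sparse feature matrix X and a dense binary label matrix Y. The pruning threshold, the use of scikit-learn's LinearSVC as base learner, and the function names are illustrative assumptions; the paper's double layer of parallelization is elided (the per-label loop below would run in parallel).

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.svm import LinearSVC

def train_ovr_pruned(X, Y, threshold=0.01):
    """Train one linear classifier per label (one-versus-rest) and zero out
    weights whose magnitude falls below `threshold`, so the stacked model
    stays sparse and compact."""
    rows = []
    for j in range(Y.shape[1]):
        clf = LinearSVC(C=1.0).fit(X, Y[:, j])
        w = clf.coef_.ravel()
        w[np.abs(w) < threshold] = 0.0       # prune spurious parameters
        rows.append(csr_matrix(w))
    return vstack(rows)                       # sparse (n_labels, n_features)

def predict_top_k(W, X, k=5):
    """Score every label for each row of X and return the top-k label ids."""
    scores = (W @ X.T).T.toarray()
    return np.argsort(-scores, axis=1)[:, :k]
```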


SIGKDD Explorations | 2014

On power law distributions in large-scale taxonomies

Rohit Babbar; Cornelia Metzig; Ioannis Partalas; Eric Gaussier; Massih-Reza Amini

Fat-tailed distributions occur in many large-scale physical and social complex systems, and different generating mechanisms have been proposed for them. In this paper, we study models for generating power-law distributions in the evolution of large-scale taxonomies such as the Open Directory Project, which consists of websites assigned to one of tens of thousands of categories. The categories in such taxonomies are arranged in tree- or DAG-structured configurations with parent-child relations among them. We first quantitatively analyse the formation process of such taxonomies, which leads to power-law distributions as the stationary distributions. In the context of designing classifiers for large-scale taxonomies, which automatically assign unseen documents to leaf-level categories, we highlight how the fat-tailed nature of these distributions can be leveraged to analytically study the space complexity of such classifiers. Empirical evaluation of the space complexity on publicly available datasets demonstrates the applicability of our approach.
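The exponent of such a power law can be estimated directly from observed category sizes. Below is a minimal sketch using the standard continuous maximum-likelihood estimator; this is a generic estimator, not the paper's taxonomy-formation model.

```python
import numpy as np

def powerlaw_alpha_mle(sizes, s_min=1.0):
    """Continuous MLE for the exponent alpha of p(s) ~ s^(-alpha), s >= s_min:
        alpha_hat = 1 + n / sum(log(s_i / s_min))."""
    s = np.asarray([x for x in sizes if x >= s_min], dtype=float)
    return 1.0 + len(s) / np.sum(np.log(s / s_min))

# Synthetic category sizes drawn from a Pareto tail with true alpha = 2.5.
rng = np.random.default_rng(0)
sizes = rng.pareto(1.5, 100_000) + 1.0
print(powerlaw_alpha_mle(sizes))   # close to 2.5
```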


International Conference on Neural Information Processing | 2013

Maximum-Margin Framework for Training Data Synchronization in Large-Scale Hierarchical Classification

Rohit Babbar; Ioannis Partalas; Eric Gaussier; Massih-Reza Amini

In the context of supervised learning, the training data for large-scale hierarchical classification consist of (i) a set of input-output pairs, and (ii) a hierarchy structure defining parent-child relations among class labels. It is often the case that the hierarchy structure given a priori is not optimal for achieving high classification accuracy. This is especially true for web taxonomies such as the Yahoo! directory, which consist of tens of thousands of classes. Furthermore, an important goal of hierarchy design is better navigability and browsing. In this work, we propose a maximum-margin framework for automatically adapting the given hierarchy, using the set of input-output pairs to yield a new hierarchy. The proposed method is not only theoretically justified but also provides a more principled alternative to earlier hierarchy flattening techniques, which are ad hoc and empirical in nature. Empirical results on publicly available large-scale datasets demonstrate that classification with the new hierarchy leads to generalization performance better than or comparable to that of hierarchy flattening techniques.
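For context, here is a minimal sketch of the kind of ad-hoc flattening heuristic the paper takes as its point of comparison. The document-count threshold and the dict-based tree encoding are illustrative assumptions.

```python
def flatten_children(children, counts, node, min_docs=100):
    """Return a flattened child list for `node`: any internal child with
    fewer than `min_docs` training documents is dissolved and replaced by
    its own (recursively flattened) children. Applying this top-down to
    every surviving node yields a shallower hierarchy."""
    out = []
    for c in children.get(node, []):
        if children.get(c) and counts.get(c, 0) < min_docs:
            out.extend(flatten_children(children, counts, c, min_docs))
        else:
            out.append(c)
    return out

children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
counts = {"A": 20, "A1": 10, "A2": 10, "B": 500, "B1": 500}
print(flatten_children(children, counts, "root"))  # ['A1', 'A2', 'B']
```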


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2014

Re-ranking approach to classification in large-scale power-law distributed category systems

Rohit Babbar; Ioannis Partalas; Eric Gaussier; Massih-Reza Amini

For large-scale category systems such as Directory Mozilla, which consist of tens of thousands of categories, earlier studies have empirically verified that the distribution of documents among categories can be modeled as a power law. This implies that a significant fraction of categories, referred to as rare categories, have very few documents assigned to them. This characteristic of the data makes it harder for learning algorithms to learn effective decision boundaries that correctly detect such categories in the test set. In this work, we exploit the distribution of documents among categories to (i) derive an upper bound on the accuracy of any classifier, and (ii) propose a ranking-based algorithm which aims to maximize this upper bound. Empirical evaluation on publicly available large-scale datasets demonstrates that the proposed method not only achieves higher accuracy but also much higher coverage of rare categories compared to state-of-the-art methods.
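A minimal sketch of the general re-ranking idea: promote rare categories by discounting raw scores with the category prior. The specific adjustment (dividing by the prior raised to a power beta) is an illustrative heuristic, not the paper's algorithm.

```python
import numpy as np

def rerank(scores, category_counts, beta=0.5):
    """Divide each raw score by the category's empirical prior raised to
    `beta`, boosting categories with few training documents."""
    prior = category_counts / category_counts.sum()
    return scores / (prior ** beta)

scores = np.array([[0.30, 0.28, 0.10]])           # raw classifier scores
counts = np.array([9000, 50, 950])                # documents per category
print(np.argmax(scores, axis=1))                  # [0]: head category wins
print(np.argmax(rerank(scores, counts), axis=1))  # [1]: rare category wins
```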


Intelligent Data Analysis | 2015

Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data

Georgios Balikas; Ioannis Partalas; Eric Gaussier; Rohit Babbar; Massih-Reza Amini

Hyper-parameter tuning is a resource-intensive task when optimizing classification models. The commonly used k-fold cross-validation can become intractable in large-scale settings when a classifier has to learn billions of parameters. At the same time, real-world applications often present multi-class classification scenarios with only a few labeled examples; model selection approaches offer little improvement in such cases, and the learners' default values are used. We propose bounds for classification on accuracy and macro measures (precision, recall, F1) that motivate efficient schemes for model selection and can benefit from the existence of unlabeled data. We demonstrate the advantages of these schemes by comparing them with k-fold cross-validation and hold-out estimation in the setting of large-scale classification.
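A minimal sketch of the general scheme of exploiting unlabeled data for model selection: fit each candidate once on the labeled data and rank candidates by a quantity computed on unlabeled data, avoiding k-fold refits. The mean-prediction-margin criterion below is a heuristic proxy chosen for illustration, not the bounds derived in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_by_unlabeled_margin(X_train, y_train, X_unlabeled,
                               Cs=(0.01, 0.1, 1.0, 10.0)):
    """Pick the regularization strength whose model is most confident on
    unlabeled data, measured by the gap between the two highest predicted
    class probabilities (one fit per candidate, no cross-validation)."""
    best_C, best_margin = None, -np.inf
    for C in Cs:
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        probs = clf.predict_proba(X_unlabeled)
        top2 = np.sort(probs, axis=1)[:, -2:]
        margin = float(np.mean(top2[:, 1] - top2[:, 0]))
        if margin > best_margin:
            best_C, best_margin = C, margin
    return best_C
```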


International Conference on Neural Information Processing | 2012

Adaptive classifier selection in large-scale hierarchical classification

Ioannis Partalas; Rohit Babbar; Eric Gaussier; Cécile Amblard

Going beyond traditional text classification, which involves a few tens of classes, there has been a surge of interest in automatic document categorization in large taxonomies where the number of classes ranges from hundreds of thousands to millions. Due to the complex nature of the learning problem posed in such scenarios, the conventional classification schemes need to be adapted to this domain. This paper presents a novel approach for classifier selection in large hierarchies, based on exploiting the heterogeneity of training data across the hierarchy. We also present a meta-learning framework for further flexibility in classifier selection. The experimental results demonstrate the applicability of our approach, which achieves accuracy comparable to the state of the art while being significantly faster at prediction.


Extended Semantic Web Conference | 2013

Comparative Classifier Evaluation for Web-Scale Taxonomies Using Power Law

Rohit Babbar; Ioannis Partalas; Cornelia Metzig; Eric Gaussier; Massih-Reza Amini

In the context of web-scale taxonomies such as Directory Mozilla (www.dmoz.org), previous work has shown that category sizes follow a power-law distribution at every level of the taxonomy. In this work, we analyse how such high-level semantics can be leveraged to evaluate the accuracy of hierarchical classifiers which automatically assign unseen documents to leaf-level categories. The proposed method offers computational advantages over k-fold cross-validation.


Conference on Information and Knowledge Management | 2012

On empirical tradeoffs in large scale hierarchical classification

Rohit Babbar; Ioannis Partalas; Eric Gaussier; Cécile Amblard

While multi-class categorization of documents has been of research interest for over a decade, relatively few approaches have been proposed for large-scale taxonomies, in which the number of classes ranges from hundreds of thousands, as in Directory Mozilla, to over a million in Wikipedia. Given the ever-increasing number of text documents and images from various sources, there is an immense need for automatic classification of documents in such large hierarchies. In this paper, we analyze the tradeoffs between the important characteristics of different classifiers employed in a top-down fashion. The properties compared include (i) accuracy on test instances, (ii) training time, (iii) model size, and (iv) test time required for prediction. Our analysis is motivated by well-known error bounds from learning theory, further reinforced by empirical observations on the publicly available data from the Large Scale Hierarchical Text Classification Challenge. We show that by exploiting the data heterogeneity across large-scale hierarchies, one can build an overall classification system that is approximately 4 times faster at prediction and 3 times faster to train, while sacrificing only one percentage point in accuracy.


Frontiers in Endocrinology | 2018

Prediction of Glucose Tolerance without an Oral Glucose Tolerance Test

Rohit Babbar; Martin Heni; Andreas Peter; Martin Hrabě de Angelis; Hans-Ulrich Häring; Andreas Fritsche; Hubert Preissl; Bernhard Schölkopf; Robert Wagner

Introduction: Impaired glucose tolerance (IGT) is diagnosed by a standardized oral glucose tolerance test (OGTT). However, the OGTT is laborious, and when it is not performed, glucose tolerance cannot be determined from fasting samples retrospectively. We tested whether glucose tolerance status is reasonably predictable from a combination of demographic, anthropometric, and laboratory data assessed at one time point in a fasting state.

Methods: Given a set of 22 variables selected for clinical feasibility, such as sex, age, height, weight, waist circumference, blood pressure, fasting glucose, HbA1c, hemoglobin, mean corpuscular volume, serum potassium, fasting levels of insulin, C-peptide, triglyceride, non-esterified fatty acids (NEFA), proinsulin, prolactin, cholesterol, low-density lipoprotein, HDL, uric acid, liver transaminases, and ferritin, we used supervised machine learning to estimate glucose tolerance status in 2,337 participants of the TUEF study who were recruited before 2012. We tested the performance of 10 different machine learning classifiers on data from 929 participants in the test set who were recruited after 2012. In addition, the reproducibility of IGT was analyzed in 78 participants who had two repeated OGTTs within one year.

Results: The most accurate prediction of IGT was reached with the recursive partitioning method (accuracy = 0.78). Across all classifiers, mean accuracy was 0.73 ± 0.04. Fasting glucose was the most important variable in all models; by mean variable importance across all models, it was followed by NEFA, triglycerides, HbA1c, and C-peptide. The accuracy of predicting IGT from a previous OGTT was 0.77.

Conclusion: Machine learning methods yield moderate accuracy in predicting glucose tolerance from a wide set of clinical and laboratory variables. A substitution of the OGTT does not currently seem feasible. An important constraint could be the limited reproducibility of glucose tolerance status during a subsequent OGTT.
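A minimal sketch of the recursive-partitioning (decision-tree) approach on synthetic stand-in data; the TUEF study data are not public, so the three features, the label rule, and all values below are invented purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for fasting-state variables (the study used 22).
rng = np.random.default_rng(42)
n = 1000
fasting_glucose = rng.normal(95, 12, n)
nefa = rng.normal(0.5, 0.15, n)
hba1c = rng.normal(5.5, 0.4, n)
X = np.column_stack([fasting_glucose, nefa, hba1c])
# Toy label: higher fasting glucose raises the chance of IGT.
y = (fasting_glucose + rng.normal(0, 10, n) > 105).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)
tree = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)  # recursive partitioning
print("accuracy:", accuracy_score(y_te, tree.predict(X_te)))
print("feature importances:", tree.feature_importances_)
```

Feature importances from such a tree play the role of the variable-importance ranking reported in the study.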

Collaboration


Dive into Rohit Babbar's collaborations.

Top Co-Authors

Eric Gaussier, Centre national de la recherche scientifique
Massih-Reza Amini, Centre national de la recherche scientifique
Cécile Amblard, Centre national de la recherche scientifique
Cornelia Metzig, Joseph Fourier University
Georgios Balikas, Pierre-and-Marie-Curie University