Saurabh Kataria
Xerox
Publication
Featured research published by Saurabh Kataria.
knowledge discovery and data mining | 2011
Saurabh Kataria; Krishnan S. Kumar; Rajeev Rastogi; Prithviraj Sen; Srinivasan H. Sengamedu
Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called Wikipedia-based Pachinko Allocation Model (WPAM) that exploits: (1) all words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy.
International Journal on Document Analysis and Recognition | 2009
Xiaonan Lu; Saurabh Kataria; William J. Brouwer; James Ze Wang; Prasenjit Mitra; C. Lee Giles
Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure’s legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.
international conference on image processing | 2010
Saurabh Kataria; Luca Marchesotti; Florent Perronnin
This paper addresses the problem of font retrieval using a query-by-example paradigm: given a font, retrieve the most visually similar fonts. We describe a font by (a) rendering a set of reference characters, (b) extracting a feature vector for each reference character and (c) concatenating the character-level descriptors. The similarity between two fonts is simply the similarity between the vectorial representations. Our contribution is an experimental comparison of character-level descriptors of step (b) on a large dataset of 9,000 fonts. The descriptors we chose to evaluate were drawn from the literature on typed and handwritten text analysis. An important conclusion is that the SIFT descriptor, which was shown to be state-of-the-art for object recognition in photographs and for handwriting recognition, yields the best results for font retrieval.
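The three-step font representation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The per-character feature vectors are toy values standing in for real descriptors such as SIFT. The use of cosine similarity is an assumption, since the abstract does not name the similarity measure. The names `font_descriptor`, `retrieve_similar`, and `FONTS` are hypothetical.

```python
import math

def font_descriptor(char_features):
    """Concatenate per-character feature vectors in a fixed character order."""
    descriptor = []
    for ch in sorted(char_features):  # fixed order keeps fonts comparable
        descriptor.extend(char_features[ch])
    return descriptor

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_similar(query_name, fonts, top_k=2):
    """Rank all other fonts by similarity to the query font (query by example)."""
    query = font_descriptor(fonts[query_name])
    scored = [
        (cosine_similarity(query, font_descriptor(features)), name)
        for name, features in fonts.items()
        if name != query_name
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Toy data: two reference characters, two feature dimensions each (hypothetical).
FONTS = {
    "serif_a": {"a": [1.0, 0.0], "g": [0.9, 0.1]},
    "serif_b": {"a": [0.9, 0.1], "g": [1.0, 0.0]},
    "script_c": {"a": [0.0, 1.0], "g": [0.1, 0.9]},
}

ranking = retrieve_similar("serif_a", FONTS)
```

With these toy vectors, `ranking` places `serif_b` ahead of `script_c`, because its concatenated descriptor points in nearly the same direction as that of the query font.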
acm/ieee joint conference on digital libraries | 2008
Saurabh Kataria; Sujatha Das; Prasenjit Mitra; C. Lee Giles
Most search engines index the textual content of documents in digital libraries. However, scholarly articles frequently report important findings in figures for visual impact and the contents of these figures are not indexed. These contents are often invaluable to the researcher in various fields, for the purposes of direct comparison with their own work. Therefore, searching for figures and extracting figure data are important problems. To the best of our knowledge, there exists no tool to automatically extract data from figures in digital documents. If we can extract data from these images automatically and store them in a database, an end-user can query and combine data from multiple digital documents simultaneously and efficiently. We propose a framework based on image analysis and machine learning to extract information from 2-D plot images and store them in a database. The proposed algorithm identifies a 2-D plot and extracts the axis labels, legend and the data points from the 2-D plot. We also segregate overlapping shapes that correspond to different data points. We demonstrate performance of individual algorithms, using a combination of generated and real-life images.
conference on information and knowledge management | 2013
Lei Li; Wei Peng; Saurabh Kataria; Tong Sun; Tao Li
In this paper, we propose a framework for recommending users and communities in social media. Given a user's profile, our framework is capable of recommending influential users and topic-cohesive interactive communities that are most relevant to the given user. In our framework, we present a generative topic model to discover user-oriented and community-oriented topics simultaneously, which enables us to capture the exact topic interests of users, as well as the focuses of communities. Extensive evaluation on a data set obtained from Twitter has demonstrated the effectiveness of our proposed framework compared with other probabilistic topic model based recommendation methods.
ACM Transactions on Knowledge Discovery From Data | 2015
Lei Li; Wei Peng; Saurabh Kataria; Tong Sun; Tao Li
Social media has become increasingly prevalent in the last few years, not only enabling people to connect with each other by social links, but also providing platforms for people to share information and interact over diverse topics. Rich user-generated information, for example, users’ relationships and daily posts, is often available on most social media service websites. Given such information, a challenging problem is to provide reasonable user and community recommendations for a target user, and consequently, help the target user engage in the daily discussions and activities with his/her friends or like-minded people. In this article, we propose a unified framework for recommending users and communities that utilizes the information in social media. Given a user’s profile or a set of keywords as input, our framework is capable of recommending influential users and topic-cohesive interactive communities that are most relevant to the given user or keywords. With the proposed framework, users can find other individuals or communities sharing similar interests, and then have more interaction with these users or within the communities. We present a generative topic model to discover user-oriented and community-oriented topics simultaneously, which enables us to capture the exact topical interests of users, as well as the focuses of communities. Extensive experimental evaluation and case studies on a dataset collected from Twitter demonstrate the effectiveness of our proposed framework compared with other probabilistic-topic-model-based recommendation methods.
conference on information and knowledge management | 2012
Wenyi Huang; Saurabh Kataria; Cornelia Caragea; Prasenjit Mitra; C. Lee Giles; Lior Rokach
national conference on artificial intelligence | 2010
Saurabh Kataria; Prasenjit Mitra; Sumit Bhatia
international joint conference on artificial intelligence | 2011
Saurabh Kataria; Prasenjit Mitra; Cornelia Caragea; C. Lee Giles
national conference on artificial intelligence | 2008
Saurabh Kataria; Prasenjit Mitra; C. Lee Giles