Sumit Bhatia
IBM
Publication
Featured research published by Sumit Bhatia.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2011
Sumit Bhatia; Debapriyo Majumdar; Prasenjit Mitra
After an end-user has partially input a query, intelligent search engines can suggest possible completions of the partial query to help end-users quickly express their information needs. All major web-search engines and most proposed methods that suggest queries rely on search engine query logs to determine possible query suggestions. However, for customized search systems in the enterprise domain, intranet search, personalized search such as email or desktop search, or infrequent queries, query logs are either not available or the user base and the number of past user queries are too small to learn appropriate models. We propose a probabilistic mechanism for generating query suggestions from the corpus without using query logs. We utilize the document corpus to extract a set of candidate phrases. As soon as a user starts typing a query, phrases that are highly correlated with the partial user query are selected as completions of the partial query and are offered as query suggestions. Our proposed approach is tested on a variety of datasets and is compared with state-of-the-art approaches. The experimental results demonstrate the effectiveness of our approach in suggesting higher-quality queries.
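The idea of suggesting completions from the corpus alone can be illustrated with a minimal sketch. It assumes a toy setting: candidate phrases are simply all word n-grams of up to three words, and a phrase is ranked by how many documents it shares with the words already typed. The function names (`build_phrase_index`, `suggest`) and this co-occurrence score are illustrative assumptions, not the probabilistic model proposed in the paper.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_phrase_index(docs, max_n=3):
    """Map every phrase of 1..max_n words to the set of documents containing it."""
    phrase_docs = {}
    for doc_id, text in enumerate(docs):
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                phrase = " ".join(tokens[i:i + n])
                phrase_docs.setdefault(phrase, set()).add(doc_id)
    return phrase_docs


def suggest(partial_query, phrase_docs, k=3):
    """Complete the last (partial) word with phrases that co-occur with the
    words already typed. The score is a plain co-occurrence count, which is
    an assumption of this sketch rather than the paper's model."""
    tokens = tokenize(partial_query)
    if not tokens:
        return []
    *context, prefix = tokens

    # Documents that contain every completed word of the query.
    context_docs = None
    for word in context:
        docs = phrase_docs.get(word, set())
        context_docs = docs if context_docs is None else context_docs & docs

    scored = []
    for phrase, docs in phrase_docs.items():
        if not phrase.startswith(prefix):
            continue
        score = len(docs) if context_docs is None else len(docs & context_docs)
        if score > 0:
            scored.append((-score, phrase))

    scored.sort()  # highest score first, ties broken alphabetically
    return [" ".join(context + [phrase]) for _, phrase in scored[:k]]


if __name__ == "__main__":
    corpus = [
        "information retrieval systems",
        "information retrieval evaluation",
        "relational database systems",
    ]
    index = build_phrase_index(corpus)
    print(suggest("information r", index, k=2))
```

No query log is consulted at any point: both the candidate phrases and their scores come from the documents themselves, which is the property the abstract emphasizes.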
International Conference on Document Analysis and Recognition | 2013
Suppawong Tuarob; Sumit Bhatia; Prasenjit Mitra; C. Lee Giles
A significant number of scholarly articles in computer science and other disciplines contain algorithms that provide concise descriptions for solving a wide variety of computational problems. For example, Dijkstra's algorithm describes how to find the shortest path between two nodes in a graph. Automatic identification and extraction of these algorithms from scholarly digital documents would enable automatic algorithm indexing, searching, analysis and discovery. An algorithm search engine, which identifies pseudocode in scholarly documents and makes it searchable, has been implemented as a part of the CiteSeerX suite. Here, we illustrate the limitations of the state-of-the-art rule-based pseudocode detection approach, and present a novel set of machine-learning-based techniques that extend previous methods.
Knowledge Based Systems | 2014
Prakhar Biyani; Sumit Bhatia; Cornelia Caragea; Prasenjit Mitra
Subjectivity analysis essentially deals with separating factual information from opinionated information. It has been actively used in various applications such as opinion mining of customer reviews in online review sites, improving answering of opinion questions in community question–answering (CQA) sites, and multi-document summarization. However, there has not been much focus on subjectivity analysis in the domain of online forums. Online forums contain huge amounts of user-generated data in the form of discussions between forum members on specific topics and are a valuable source of information. In this work, we perform subjectivity analysis of online forum threads. We model the task as a binary classification of threads into one of two classes: subjective (seeking opinions, emotions, other private states) and non-subjective (seeking factual information). Unlike previous works on subjectivity analysis, we use several non-lexical thread-specific features for identifying the subjectivity orientation of threads. We evaluate our methods by comparing them with several state-of-the-art subjectivity analysis techniques. Experimental results on two popular online forums demonstrate that our methods outperform strong baselines in most cases.
Proceedings of the 3rd International Workshop on Search-Driven Development: Users, Infrastructure, Tools, and Evaluation | 2011
Sumit Bhatia; Suppawong Tuarob; Prasenjit Mitra; C. Lee Giles
Efficient algorithms are extremely important and can be crucial for certain software projects. Even though many source code search engines have been proposed in the literature to help software developers find source code related to their needs, to our knowledge there has been no effort to develop systems that keep abreast of the latest algorithmic developments. In this paper, we describe our initial effort towards developing such an algorithm search engine. The proposed system extracts and indexes algorithms discussed in academic literature and their associated metadata. Users can search the index through a free-text query interface. The source code of the proposed system, being developed as a part of a larger open source toolkit, SeerSuite, will be released in due course. We also provide directions for further research and improvements of the current system.
International Conference on Data Mining | 2014
Quanzeng You; Sumit Bhatia; Tong Sun; Jiebo Luo
Identifying user attributes from their social media activities has been an active research topic. The ability to predict user attributes such as age, gender, and interests from their social media activities is essential for personalization and recommender systems. Most of the techniques proposed for this purpose utilize the textual content created by a user, even though multimedia content has gained popularity in social networks. In this paper, we propose a novel algorithm to infer a user's gender by using the images posted by the user on different social networks.
ACM Transactions on Information Systems | 2012
Sumit Bhatia; Prasenjit Mitra
Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of document-elements, some document metadata such as the title of the papers and their authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, generally, the author of a document assumes that the caption will be read in the context of the text in the document. When the caption is presented out of context, as in a document-element search engine result, it may not contain enough information to help the end-user understand what the content of the document-element is. Consequently, end-users examining document-element search results would want a short “synopsis” of this information presented along with the document-element. Having access to the synopsis allows the end-user to quickly understand the content of the document-element without having to download and read the entire document, since examining the synopsis takes less time than downloading, opening, and reading the file. Furthermore, it may allow the end-user to examine more results than they would otherwise. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences from the document text based on the similarity and the proximity of the sentences with the caption and the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of different features used. We also investigate the problem of choosing the optimum synopsis-size that strikes a balance between the information content and the size of the generated synopses. A user study is also performed to measure how the synopses generated by our proposed method compare with other state-of-the-art approaches.
Conference on Information and Knowledge Management | 2009
Sumit Bhatia; Shibamouli Lahiri; Prasenjit Mitra
Scientists often search for document-elements like tables, figures, or algorithm pseudo-codes. Domain scientists and researchers report important data, results and algorithms using these document-elements; readers want to compare the reported results with their findings. Some document-element search engines have been proposed (especially to search for tables and figures) to make this task easier. While searching for document-elements today, the end-user is presented with the caption of the document-element and a sentence in the document text that refers to the document-element. Oftentimes, the caption and the reference text do not contain enough information to interpret the document-element. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We also investigate the problem of choosing the optimum synopsis-size that strikes a balance between information content and size of the generated synopses.
Knowledge Discovery and Data Mining | 2015
Meenakshi Nagarajan; Angela D. Wilkins; Benjamin J. Bachman; Ilya B. Novikov; Shenghua Bao; Peter J. Haas; María E. Terrón-Díaz; Sumit Bhatia; Anbu Karani Adikesavan; Jacques Joseph Labrie; Sam Regenbogen; Christie M. Buchovecky; Curtis R. Pickering; Linda Kato; Andreas Martin Lisewski; Ana Lelescu; Houyin Zhang; Stephen K. Boyer; Griff Weber; Ying Chen; Lawrence A. Donehower; W. Scott Spangler; Olivier Lichtarge
We present KnIT, the Knowledge Integration Toolkit, a system for accelerating scientific discovery and predicting previously unknown protein-protein interactions. Such predictions enrich biological research and are pertinent to drug discovery and the understanding of disease. Unlike a prior study, KnIT is now fully automated and demonstrably scalable. It extracts information from the scientific literature, automatically identifying direct and indirect references to protein interactions, which is knowledge that can be represented in network form. It then reasons over this network with techniques such as matrix factorization and graph diffusion to predict new, previously unknown interactions. The accuracy and scope of KnIT's knowledge extractions are validated using comparisons to structured, manually curated data sources as well as by performing retrospective studies that predict subsequent literature discoveries using literature available prior to a given date. The KnIT methodology is a step towards automated hypothesis generation from text, with potential application to other scientific domains.
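Graph diffusion over an interaction network, one of the techniques the abstract names, can be sketched as follows. This is a generic random walk with restart on a small undirected graph, offered as an illustration under stated assumptions: the protein labels (`P1` to `P5`), the restart probability, and both function names are hypothetical, and nothing here reproduces KnIT's actual pipeline.

```python
def random_walk_with_restart(graph, source, restart=0.3, iterations=50):
    """Diffuse probability mass from `source` over `graph` (dict: node -> set
    of neighbours). At every step each node passes (1 - restart) of its mass
    to its neighbours in equal shares, and `restart` of the total mass jumps
    back to the source."""
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            neighbours = graph[node]
            if not neighbours:
                # Isolated node: return its mass to the source so the
                # total probability stays equal to 1.
                nxt[source] += (1 - restart) * mass
                continue
            share = (1 - restart) * mass / len(neighbours)
            for neighbour in neighbours:
                nxt[neighbour] += share
        nxt[source] += restart
        scores = nxt
    return scores


def predict_links(graph, source, top_k=3):
    """Rank nodes not yet linked to `source` by their diffusion score."""
    scores = random_walk_with_restart(graph, source)
    candidates = [n for n in graph if n != source and n not in graph[source]]
    candidates.sort(key=lambda n: (-scores[n], n))
    return candidates[:top_k]


if __name__ == "__main__":
    # Hypothetical interaction network; edges are symmetric.
    network = {
        "P1": {"P2", "P3"},
        "P2": {"P1", "P3"},
        "P3": {"P1", "P2", "P4"},
        "P4": {"P3", "P5"},
        "P5": {"P4"},
    }
    print(predict_links(network, "P1"))
```

Nodes that are close to the source through many short paths accumulate more mass, so they surface first as candidate interactions that the known network does not yet contain.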
Proceedings of the 2014 ACM Multi Media on Workshop on Computational Personality Recognition | 2014
Chandrima Sarkar; Sumit Bhatia; Arvind Agarwal; Juan Li
Developing an intelligent system that automatically classifies human personality traits is an important yet challenging task. Automatic classification of human traits requires knowledge of the significant attributes and features that contribute to the prediction of a given trait. Motivated by the fact that detection of significant features is an essential part of a personality recognition system, we present in this paper an in-depth analysis of audio-visual, text, demographic and sentiment features for classification of multi-modal personality traits, namely extraversion, agreeableness, conscientiousness, emotional stability and openness to experience. We use the YouTube personality data set and a logistic regression model with a ridge estimator for classification. We experiment with audio-visual features, bag-of-words features, and sentiment-based and demographic features. Our results provide important insights about the significance of different feature types for the personality classification task.
International World Wide Web Conferences | 2010
Sumit Bhatia; Prasenjit Mitra; C. Lee Giles
Algorithms are an integral part of computer science literature. However, none of the current search engines offers a specialized algorithm search facility. We describe a vertical search engine that identifies the algorithms present in documents and extracts and indexes the related metadata and textual description of the identified algorithms. This algorithm-specific information is then utilized for algorithm ranking in response to user queries. Experimental results show the superiority of our system over other popular search engines.