
Publication


Featured research published by Sumit Negi.


Knowledge Discovery and Data Mining | 2003

A bag of paths model for measuring structural similarity in Web documents

Sachindra Joshi; Neeraj Agrawal; Raghu Krishnapuram; Sumit Negi

Structural information (such as layout and look-and-feel) has been extensively used in the literature for extraction of interesting or relevant data, efficient storage, and query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially in the case of HTML and XML documents. However, computing structural similarity between documents based on the tree model is expensive. In this paper, we propose an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children and siblings, it allows us to define a new family of meaningful (and at the same time computationally simple) structural similarity measures. Our experimental results based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com show that the representation is powerful enough to produce good clusters of structurally similar pages.
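The paper defines a whole family of path-based measures; as a rough illustration of the core idea only, the sketch below uses a hypothetical `(tag, children)` tuple encoding of trees (not the paper's representation), enumerates root-to-node tag paths, and compares two documents with a multiset Jaccard measure:

```python
from collections import Counter

def paths(tree, prefix=()):
    """Enumerate root-to-node tag paths of a nested (tag, children) tree."""
    tag, children = tree
    path = prefix + (tag,)
    yield path
    for child in children:
        yield from paths(child, path)

def bag_of_paths_similarity(t1, t2):
    """Multiset Jaccard similarity over the bags of paths of two trees."""
    b1, b2 = Counter(paths(t1)), Counter(paths(t2))
    inter = sum((b1 & b2).values())
    union = sum((b1 | b2).values())
    return inter / union if union else 1.0

# Two toy "DOM trees": same template, one has an extra <div><p> block.
doc_a = ("html", [("body", [("div", [("p", [])]), ("div", [("p", [])])])])
doc_b = ("html", [("body", [("div", [("p", [])])])])
print(bag_of_paths_similarity(doc_a, doc_b))  # 4 shared paths over 6 in the union
```

Because pages generated from the same template share most of their paths, they score high even when their text differs, which is what makes path bags useful for structural clustering.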


Analytics for Noisy Unstructured Text Data | 2009

A survey of types of text noise and techniques to handle noisy text

L. Venkata Subramaniam; Shourya Roy; Tanveer A. Faruquie; Sumit Negi

Noise is ubiquitous in real-world text communications. Text produced by processing signals intended for human use is often noisy for automated computer processing. Automatic speech recognition, optical character recognition and machine translation all introduce processing noise. Digital text produced in informal settings such as online chat, SMS, emails, message boards, newsgroups, blogs, wikis and web pages also contains considerable noise. In this paper, we present a survey of the existing measures for noise in text. We also cover application areas that ingest this noisy text for various tasks like Information Retrieval and Information Extraction.


International Joint Conference on Natural Language Processing | 2009

SMS based Interface for FAQ Retrieval

Govind Kothari; Sumit Negi; Tanveer A. Faruquie; Venkatesan T. Chakaravarthy; L. Venkata Subramaniam

Short Messaging Service (SMS) is popularly used to provide information access to people on the move. This has resulted in the growth of SMS-based Question Answering (QA) services. However, automatically handling SMS questions poses significant challenges due to the inherent noise in SMS questions. In this work we present an automatic FAQ-based question answering system for SMS users. We handle the noise in an SMS query by formulating query similarity over FAQ questions as a combinatorial search problem. The search space consists of combinations of all possible dictionary variations of tokens in the noisy query. We present an efficient search algorithm that does not require any training data or SMS normalization and can handle semantic variations in question formulation. We demonstrate the effectiveness of our approach on two real-life datasets.
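The paper's token-similarity measure and search pruning are more sophisticated than anything shown here; the toy sketch below substitutes plain Levenshtein distance, but shows the overall shape: each noisy token expands to a set of dictionary variations, and FAQ questions are scored by how many of those sets they cover. All names and data are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def candidates(token, dictionary, max_dist=1):
    """Dictionary variations of a noisy token within a small edit distance."""
    cands = {w for w in dictionary if edit_distance(token, w) <= max_dist}
    return cands or {token}  # fall back to the token itself

def best_faq(query, faqs, dictionary):
    """Score each FAQ question by how many query-token variations it covers."""
    token_cands = [candidates(t, dictionary) for t in query.lower().split()]
    def score(faq):
        words = set(faq.lower().split())
        return sum(bool(c & words) for c in token_cands)
    return max(faqs, key=score)

faqs = ["how do I reset my password",
        "how do I change my billing address"]
dictionary = set(" ".join(faqs).lower().split())
print(best_faq("hw reset pasword", faqs, dictionary))
```

Here "hw" maps to "how" and "pasword" to "password", so the noisy query still retrieves the password-reset FAQ without any normalization step or training data.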


Extending Database Technology | 2013

Processing multi-way spatial joins on map-reduce

Himanshu Gupta; Bhupesh Chawda; Sumit Negi; Tanveer A. Faruquie; L. V. Subramaniam; Mukesh K. Mohania

In this paper we investigate the problem of processing multi-way spatial joins on a map-reduce platform. We look at two common spatial predicates - overlap and range. We address these two classes of join queries, discuss the challenges and outline novel approaches for executing these queries on a map-reduce framework. We then discuss how we can process join queries involving both overlap and range predicates. Specifically, we present a Controlled-Replicate framework using which we design the approaches presented in this paper. The Controlled-Replicate framework is carefully engineered to minimize communication among cluster nodes. Through experimental evaluations we discuss the complexity of the problem under investigation, detail the Controlled-Replicate framework, and demonstrate that the proposed approaches comfortably outperform naive approaches.
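Controlled-Replicate is specifically engineered to keep replication and inter-node communication low; the sketch below is only the naive grid-replication baseline that such frameworks improve upon, written as local map and reduce steps over axis-aligned rectangles (the grid size, extent, and data are made up):

```python
from collections import defaultdict

def cells(rect, grid=4, extent=100):
    """Map step: the grid cells a rectangle (x1, y1, x2, y2) is replicated to."""
    x1, y1, x2, y2 = rect
    step = extent / grid
    for i in range(int(x1 // step), int(x2 // step) + 1):
        for j in range(int(y1 // step), int(y2 // step) + 1):
            yield (i, j)

def overlaps(a, b):
    """True if two axis-aligned rectangles overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def overlap_join(R, S, grid=4, extent=100):
    """Reduce step: test overlap within each cell; a set deduplicates pairs
    that were replicated into several cells."""
    buckets = defaultdict(lambda: ([], []))
    for r in R:
        for c in cells(r, grid, extent):
            buckets[c][0].append(r)
    for s in S:
        for c in cells(s, grid, extent):
            buckets[c][1].append(s)
    out = set()
    for rs, ss in buckets.values():
        out.update((r, s) for r in rs for s in ss if overlaps(r, s))
    return out

R = [(0, 0, 30, 30), (60, 60, 90, 90)]
S = [(20, 20, 40, 40), (10, 70, 20, 80)]
print(overlap_join(R, S))
```

The cost problem is visible even in this toy: every rectangle is copied to every cell it touches, so large rectangles are shipped to many reducers; minimizing exactly that replication (and the resulting network traffic) is the point of the paper's Controlled-Replicate design.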


International Journal on Document Analysis and Recognition | 2009

Language independent unsupervised learning of short message service dialect

Sreangsu Acharyya; Sumit Negi; L. Venkata Subramaniam; Shourya Roy

Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant messenger and short message service data, and they adversely affect off-the-shelf text mining methods. Most techniques address this problem with supervised methods that make use of hand-labeled corrections. However, human-generated labels and corrections are expensive and time-consuming to obtain because of multilinguality and the complexity of the corruptions. While we do not champion unsupervised methods over supervised when quality of results is the singular concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to generate a parallel labeled corpus. A generative-model-based unsupervised technique is presented that maps non-standard words to their corresponding conventional frequent form. A hidden Markov model (HMM) over a "subsequencized" representation of words is used, where a word is represented as a bag of weighted subsequences. The approximate maximum likelihood inference algorithm used is such that the training phase involves clustering over vectors and not the customary and expensive dynamic programming (Baum–Welch algorithm) over sequences that is necessary for HMMs. A principled transformation of the maximum-likelihood-based "central clustering" cost function of Baum–Welch into a "pairwise similarity" based clustering is proposed. This transformation makes it possible to apply "subsequence kernel" based methods that model delete and insert corruptions well. The novelty of this approach lies in the fact that the expensive (Baum–Welch) iterations required for HMMs can be avoided through an approximation of the log-likelihood function and by establishing a connection between the log-likelihood and a pairwise distance. Anecdotal evidence of efficacy is provided on public and proprietary data.
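As a rough illustration of the "bag of weighted subsequences" representation only (none of the paper's HMM or clustering machinery), the sketch below weights each length-2 character subsequence by a decay raised to the number of gaps it spans, then maps a noisy token to the most cosine-similar word in a small lexicon; the decay value and lexicon are invented:

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def subseq_bag(word, k=2, decay=0.5):
    """Bag of length-k character subsequences, down-weighted by gaps spanned."""
    bag = Counter()
    for idx in combinations(range(len(word)), k):
        gaps = idx[-1] - idx[0] + 1 - k  # positions skipped inside the window
        bag["".join(word[i] for i in idx)] += decay ** gaps
    return bag

def cosine(b1, b2):
    """Cosine similarity between two weighted bags."""
    dot = sum(w * b2[s] for s, w in b1.items())
    n1 = sqrt(sum(w * w for w in b1.values()))
    n2 = sqrt(sum(w * w for w in b2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def normalize(noisy, lexicon):
    """Map a noisy token to its most similar conventional form."""
    return max(lexicon, key=lambda w: cosine(subseq_bag(noisy), subseq_bag(w)))

print(normalize("tmrw", ["tomorrow", "terminal", "team"]))
```

The representation is deletion-tolerant by construction: "tmrw" keeps the ordered pairs t-m, m-r, r-w that "tomorrow" also contains, so the vowel deletions barely hurt the match, which is the property the subsequence-kernel methods in the paper exploit.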


Conference on Information and Knowledge Management | 2008

Identification of class specific discourse patterns

Anup Chalamalla; Sumit Negi; L. Venkata Subramaniam; Ganesh Ramakrishnan

In this paper we address the problem of extracting important (and unimportant) discourse patterns from call center conversations. Call centers provide dialog-based call-in support for customers to address their queries, requests and complaints. A call center is the direct interface between an organization and its customers, and it is important to capture the voice of the customer by gathering insights into the customer experience. We have observed that the calls received at a call center contain segments that follow specific patterns typical of the issue being addressed in the call. We present methods to extract such patterns from the calls. We show that by aggregating over a few hundred calls, specific discourse patterns begin to emerge for each class of calls. Further, we show that such discourse patterns are useful for classifying calls and for identifying parts of the calls that provide insights into customer behaviour.
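The abstract does not spell out the extraction method, so the sketch below is only a generic stand-in: it scores call trigrams by how frequent they are in one class relative to all other classes, which is one simple way class-specific patterns can "emerge" from aggregating calls (all data and names are invented):

```python
from collections import Counter

def ngrams(text, n=3):
    """Word n-grams of a transcript line."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def class_patterns(calls_by_class, n=3, top=2):
    """N-grams frequent in one class and rare elsewhere, via a contrast score."""
    counts = {c: Counter(g for call in calls for g in ngrams(call, n))
              for c, calls in calls_by_class.items()}
    result = {}
    for c, cnt in counts.items():
        other = sum((counts[o] for o in counts if o != c), Counter())
        scored = {g: f / (1 + other[g]) for g, f in cnt.items()}
        result[c] = [g for g, _ in sorted(scored.items(),
                                          key=lambda kv: -kv[1])[:top]]
    return result

calls = {
    "billing": ["i was charged twice on my bill",
                "why was i charged twice this month"],
    "tech":    ["my internet connection keeps dropping",
                "the internet connection is very slow"],
}
print(class_patterns(calls))
```

With real data, aggregating over hundreds of calls per class makes the contrast scores much sharper, and the surviving patterns can then serve directly as features for call classification.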


International Conference on Data Engineering | 2004

EShopMonitor: a Web content monitoring tool

Neeraj Agrawal; Rema Ananthanarayanan; Rahul Gupta; Sachindra Joshi; Raghu Krishnapuram; Sumit Negi

Data presented on commerce sites runs into thousands of pages, and is typically delivered from multiple back-end sources. This makes it difficult to identify incorrect, anomalous, or interesting data such as $9.99 air fares, missing links, drastic changes in prices and addition of new products or promotions. We describe a system that monitors Web sites automatically and generates various types of reports so that the content of the site can be monitored and the quality maintained. The solution designed and implemented by us consists of a site crawler that crawls dynamic pages, an information miner that learns to extract useful information from the pages based on examples provided by the user, and a reporter that can be configured by the user to answer specific queries. The tool can also be used for identifying price trends and new products or promotions at competitor sites. A pilot run of the tool has been successfully completed at the ibm.com site.


International Conference on Data Mining | 2009

Automatically Extracting Dialog Models from Conversation Transcripts

Sumit Negi; Sachindra Joshi; Anup K. Chalamalla; L. Venkata Subramaniam

There is a growing need for task-oriented natural language dialog systems that can interact with a user to accomplish a given objective. Recent work on building task-oriented dialog systems has emphasized the need for acquiring task-specific knowledge from un-annotated conversational data. In our work we acquire task-specific knowledge by defining the sub-task as the key unit of a task-oriented conversation. We propose an unsupervised, apriori-like algorithm that extracts the sub-tasks and their valid orderings from un-annotated human-human conversations. Modeling dialogues as a combination of sub-tasks and their valid orderings easily captures the variability in conversations. It also allows us to map our dialogue model to AIML constructs and therefore use off-the-shelf AIML interpreters to build task-oriented chat-bots. We conduct experiments on real-world data sets to establish the effectiveness of the sub-task extraction process. We codify the extracted sub-tasks in an AIML knowledge base and build a chatbot using this knowledge base. We also show the usefulness of the chatbot in automatically handling customer requests by performing a user evaluation study.


Conference on Information and Knowledge Management | 2011

Discovering customer intent in real-time for streamlining service desk conversations

Ullas Nambiar; Tanveer A. Faruquie; L. Venkata Subramaniam; Sumit Negi; Ganesh Ramakrishnan

Businesses require contact center agents to meet pre-specified customer satisfaction levels while keeping the cost of operations low or meeting sales targets, objectives that end up being conflicting and difficult to achieve in real time. In this paper, we describe a speech-enabled real-time conversation management system that tracks customer-agent conversations to detect user intent (e.g. gathering information, likely to buy, etc.) that can help agents decide the best sequence of actions for that call. We present an entropy-based decision support system that parses a text stream generated in real time during an audio conversation and identifies the first instance at which the intent becomes distinct enough for the agent to take subsequent actions. We provide evaluation results demonstrating the efficiency and effectiveness of our system.


IBM Journal of Research and Development | 2004

The eShopmonitor: a comprehensive data extraction tool for monitoring web sites

Neeraj Agrawal; Rema Ananthanarayanan; Rahul Gupta; Sachindra Joshi; Raghu Krishnapuram; Sumit Negi
