
Publication


Featured research published by Shourya Roy.


International Conference on Data Mining | 2007

How Much Noise Is Too Much: A Study in Automatic Text Classification

Sumeet Agarwal; Shantanu Godbole; Diwakar Punjani; Shourya Roy

Noise is a stark reality in real-life data. It has a particularly significant impact in the domain of text analytics, where data cleaning forms a very large part of the data processing cycle. Noisy unstructured text is common in informal settings such as online chat, SMS, email, newsgroups and blogs, in automatically transcribed speech, and in automatically recognized text from printed or handwritten material. Gigabytes of such data are generated every day on the Internet, in contact centers, and on mobile phones. Researchers have looked at various text mining issues such as pre-processing and cleaning noisy text, information extraction, rule learning, and classification for noisy text. This paper focuses on the issues faced by automatic text classifiers in analyzing noisy documents coming from various sources. The goal of this paper is to bring out and study the effect of different kinds of noise on automatic text classification. Does the nature of such text warrant moving beyond traditional text classification techniques? We present detailed experimental results with simulated noise on the Reuters-21578 and 20-newsgroups benchmark datasets, along with interesting results on real-life noisy datasets from various CRM domains.
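The kind of simulated-noise experiment the abstract describes can be sketched in miniature (hypothetical code, not the paper's implementation): inject random character substitutions at a chosen rate, then compare a simple bag-of-words classifier's behavior on clean versus corrupted text.

```python
import random
from collections import Counter

def add_noise(text, rate, rng):
    # Substitute random letters at the given rate to simulate OCR/typing noise.
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def train_centroids(docs_by_label):
    # Build one bag-of-words token-count profile per class.
    return {label: sum((Counter(d.split()) for d in docs), Counter())
            for label, docs in docs_by_label.items()}

def classify(doc, centroids):
    # Choose the class whose profile overlaps most with the document's tokens.
    words = Counter(doc.split())
    return max(centroids,
               key=lambda lb: sum(min(n, centroids[lb][w]) for w, n in words.items()))
```

Raising the noise rate degrades token overlap with the class profiles, which is exactly the effect the paper studies at scale on the benchmark datasets.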


Information Sciences | 2009

Getting insights from the voices of customers: Conversation mining at a contact center

Hironori Takeuchi; L. Venkata Subramaniam; Tetsuya Nasukawa; Shourya Roy

Business-oriented conversations between customers and agents need to be analyzed to obtain valuable insights that can be used to improve product and service quality, operational efficiency, and revenue. For such an analysis, it is critical to identify appropriate textual segments and expressions to focus on, especially when the textual data consists of complete transcripts, which are often lengthy and redundant. In this paper, we propose a method to identify important segments from the conversations by looking for changes in the accuracy of a categorizer designed to separate different business outcomes. We then use text mining to extract important associations between key entities (insights). We show the effectiveness of the method for making chance discoveries by using real-life data from a car rental service center.


Knowledge Discovery and Data Mining | 2008

Text classification, business intelligence, and interactivity: automating C-Sat analysis for services industry

Shantanu Godbole; Shourya Roy

Text classification has matured as a research discipline over the last decade. Independently, business intelligence over structured databases has long been a source of insights for enterprises. In this work, we bring the two together for Customer Satisfaction (C-Sat) analysis in the services industry. We present ITACS, a solution combining text classification and business intelligence, integrated with a novel interactive text labeling interface. ITACS has been deployed in multiple client accounts in contact centers, and it can be extended to any services industry setting to analyze unstructured text data and derive operational and business insights. We highlight the importance of interactivity in real-life text classification settings, and bring out some unique research challenges about label-sets, measuring accuracy, and interpretability that need serious attention in both academic and industrial research. We recount invaluable experiences and lessons learned as data mining researchers working toward seeing research technology deployed in the services industry.


International Journal on Document Analysis and Recognition | 2007

Special issue on noisy text analytics

Craig A. Knoblock; Daniel P. Lopresti; Shourya Roy; L. Venkata Subramaniam

Noisy unstructured text data are ubiquitous in real-world communications. Prime examples include text produced by processing signals intended for human interpretation, such as printed and handwritten documents, spontaneous speech, and camera-captured scene images. Automatic Speech Recognition (ASR) systems applied to telephone conversations between call center agents and customers often see 30–40% word error rates. Optical character recognition (OCR) error rates for hardcopy documents can range widely, from 2–3% for clean inputs to 50% or higher depending on the quality of the page image, the complexity of the layout, and aspects of the typography. Unconstrained handwriting recognition is still considered to be largely an open problem. Recognition errors are not the sole source of noise; natural language and its creative usage can also cause problems for computational techniques. Electronic text taken directly from the Internet (emails, message boards, newsgroups, blogs, wikis, chat logs, and web pages), contact centers (customer complaints, emails, call transcriptions, message summaries), and mobile phones (text messages) is often very noisy and challenging to process. Spelling errors, abbreviations,


Congress on Evolutionary Computation | 2007

A Conversation-Mining System for Gathering Insights to Improve Agent Productivity

H. Takeuchi; L.V. Subramaniam; Tetsuya Nasukawa; Shourya Roy; S. Balakrishnan

We describe a method to analyze transcripts of conversations between customers and agents in a contact center. The aim is to obtain actionable insights from the conversations to improve agent performance. Our approach has three steps. First, we segment the call into logical parts. Next, we extract relevant phrases within the different segments. Finally, we perform two-dimensional association analysis to identify actionable trends. We use real data from a contact center to identify specific actions by agents that result in positive outcomes, and we show that implementing these actionable insights results in improved agent productivity.
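The final association-analysis step can be illustrated with a small sketch (hypothetical names and data shapes, not the authors' code): for each agent action, compute the lift of a positive call outcome, i.e. how much more likely a positive outcome is when the action occurs than on average.

```python
from collections import Counter

def action_outcome_lift(calls):
    """calls: list of (actions, positive) pairs, where actions is the set of
    agent actions observed in a call and positive is the call's outcome.
    Returns lift per action: P(positive | action) / P(positive)."""
    base = sum(1 for _, ok in calls if ok) / len(calls)
    pos, tot = Counter(), Counter()
    for actions, ok in calls:
        for a in actions:
            tot[a] += 1
            pos[a] += ok
    return {a: (pos[a] / tot[a]) / base for a in tot}
```

Actions with lift well above 1.0 are the candidates for "specific actions by agents that result in positive outcomes."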


IEEE International Conference on Services Computing | 2008

Text to Intelligence: Building and Deploying a Text Mining Solution in the Services Industry for Customer Satisfaction Analysis

Shantanu Godbole; Shourya Roy

We present our experiences in building and deploying a text mining solution in services industry settings, specifically in contact centers. We describe the voice of customer (VoC) and customer satisfaction (C-Sat) analysis settings and outline several unique research challenges brought about by this confluence of text mining and industrial services research. We describe our system for integrated text classification, business intelligence and interactive text labeling for C-Sat analysis. We recount invaluable lessons learned as computer science researchers in services research engagements. The system has been deployed in multiple accounts in contact centers and can be extended to any industrial CRM service practice to analyze unstructured text data.


Analytics for Noisy Unstructured Text Data | 2008

Unsupervised learning of multilingual short message service (SMS) dialect from noisy examples

Sreangsu Acharyya; Sumit Negi; L. V. Subramaniam; Shourya Roy

Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant messenger (IM) and short message service (SMS) data and adversely affect off-the-shelf text mining methods. Most techniques address this problem with supervised methods, but these require labels that are very expensive and time-consuming to obtain. While we do not champion unsupervised methods over supervised ones when quality of results is the supreme and singular concern, we demonstrate that unsupervised methods can provide cost-effective results without the need for expensive human intervention to generate parallel labelled corpora. A generative-model-based unsupervised technique is presented that maps non-standard words to their corresponding conventional frequent forms. A Hidden Markov Model (HMM) over a subsequencized representation of words is used, subject to a parameterization such that the training phase involves clustering over vectors rather than the customary dynamic programming over sequences. A principled transformation of the maximum-likelihood-based central clustering cost function into a pairwise-similarity-based clustering is proposed. This transformation makes it possible to apply subsequence-kernel-based methods that model delete and insert edit operations well. The novelty of this approach lies in the fact that the expensive (Baum-Welch) iterations required for the HMM can be avoided through a careful factorization of the HMM log-likelihood, and in establishing the connection between the information-theoretic cost function and the kernel approach of machine learning. Anecdotal evidence of efficacy is provided on public and proprietary data.
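A much-simplified sketch of the underlying idea, mapping noisy tokens to a frequent canonical form (the subsequence heuristic and names here are illustrative, not the paper's HMM/kernel machinery):

```python
def is_subsequence(short, long_word):
    # True if the characters of `short` appear in order inside `long_word`;
    # this models deletion-style SMS shortening ("tmrw" -> "tomorrow").
    it = iter(long_word)
    return all(c in it for c in short)

def normalize(token, vocab):
    # Map a noisy token to the shortest frequent word that contains it
    # as a character subsequence; leave it unchanged if none does.
    candidates = [v for v in vocab if is_subsequence(token, v)]
    return min(candidates, key=len) if candidates else token
```

The paper's contribution is doing this mapping without labeled pairs, by clustering subsequence representations; the sketch above only conveys why subsequence structure captures delete-heavy SMS noise.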


ACM Conference on Hypertext | 2004

Automatic categorization of web sites based on source types

Shourya Roy; Sachindra Joshi; Raghu Krishnapuram

An important issue with the Web is verification of the accuracy, currency and authenticity of the information associated with Web sites. One way to address this problem is to identify the source or sponsor of the Web site. However, source identification is non-trivial because the source of a Web site cannot always be determined by the URL or content of the site. In this paper, we propose a method for source identification that uses various types of inbound, outbound and internal interactions that arise due to hyperlinks between and within Web sites.
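The kind of hyperlink-interaction features the paper builds on can be sketched as follows (hypothetical function names; the actual feature set and classifier are richer):

```python
from urllib.parse import urlparse

def link_features(site, pages):
    """pages: {page_url: [outlink urls]} crawled from one Web site.
    Counts internal vs. outbound links, a crude stand-in for the
    interaction features a source-type classifier would consume."""
    internal = outbound = 0
    for outlinks in pages.values():
        for dest in outlinks:
            if urlparse(dest).netloc == site:
                internal += 1
            else:
                outbound += 1
    total = internal + outbound
    return {"internal": internal, "outbound": outbound,
            "internal_ratio": internal / total if total else 0.0}
```

Inbound-link features would be gathered the same way from crawls of other sites that link into the site under study.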


International Conference on Autonomic Computing | 2006

Identity Delegation in Policy Based Systems

Rajeev Gupta; Shourya Roy; Manish A. Bhide

With the increase in the complexity of enterprise systems, autonomic computing systems are gaining more and more importance. Policy-based autonomic systems simplify the lives of system administrators and supporting IT staff, who only have to specify the goals or objectives to be met in a suitable language. In this paper, we deal with the problem of identity delegation for policy execution. We propose a concept of 'implicit' identity delegation, from policy author to policy enforcer, whereby an autonomic system figures out the right policy enforcer for a given policy and implicitly delegates the task of policy enforcement.


Knowledge Discovery and Data Mining | 2008

An integrated system for automatic customer satisfaction analysis in the services industry

Shantanu Godbole; Shourya Roy

Text classification has matured well as a research discipline over the years. At the same time, business intelligence over databases has long been a source of insights for enterprises. With the growing importance of the services industry, customer relationship management and contact center operations have become very important. Specifically, the voice of the customer and customer satisfaction (C-Sat) have emerged as invaluable sources of insights about how an enterprise's products and services are perceived by customers. In this demonstration, we present the IBM Technology to Automate Customer Satisfaction analysis (ITACS) system, which combines text classification technology and a business intelligence solution with an interactive document labeling interface for automating C-Sat analysis. This system has been successfully deployed in client accounts in large contact centers and can be extended to any services industry setting for analyzing unstructured text data. The demonstration highlights the importance of intervention and interactivity in real-world text classification settings. We point out unique research challenges in this domain regarding label-sets, measuring accuracy, and interpretability of results, and we discuss solutions and open questions.
