Publications


Featured research published by L. Venkata Subramaniam.


Analytics for Noisy Unstructured Text Data | 2009

A survey of types of text noise and techniques to handle noisy text

L. Venkata Subramaniam; Shourya Roy; Tanveer A. Faruquie; Sumit Negi

In the real world, noise is ubiquitous in text communications. Text produced by processing signals intended for human use is often noisy for automated computer processing. Automatic speech recognition, optical character recognition and machine translation all introduce processing noise. Digital text produced in informal settings such as online chat, SMS, email, message boards, newsgroups, blogs, wikis and web pages also contains considerable noise. In this paper, we present a survey of the existing measures for noise in text. We also cover application areas that ingest this noisy text for various tasks like Information Retrieval and Information Extraction.
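
To make the notion of "measures for noise in text" concrete, here is a minimal sketch (not taken from the survey) of word error rate, a measure widely used to quantify noise in ASR and other machine-produced text; the example strings and the token-level edit distance are illustrative assumptions.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level edit distance between clean reference and noisy text,
    normalised by the reference length (a WER-style noise measure)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("please reset my password", "plz reset my pasword"))  # 0.5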


International Joint Conference on Natural Language Processing | 2009

SMS based Interface for FAQ Retrieval

Govind Kothari; Sumit Negi; Tanveer A. Faruquie; Venkatesan T. Chakaravarthy; L. Venkata Subramaniam

Short Messaging Service (SMS) is popularly used to provide information access to people on the move. This has resulted in the growth of SMS-based Question Answering (QA) services. However, automatically handling SMS questions poses significant challenges due to the inherent noise in SMS questions. In this work we present an automatic FAQ-based question answering system for SMS users. We handle the noise in an SMS query by formulating query similarity over FAQ questions as a combinatorial search problem, where the search space consists of combinations of all possible dictionary variations of tokens in the noisy query. We present an efficient search algorithm that does not require any training data or SMS normalization and can handle semantic variations in question formulation. We demonstrate the effectiveness of our approach on two real-life datasets.
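
A hedged sketch of the general idea follows: treat each noisy SMS token as a small set of plausible dictionary variants and rank FAQ questions by how well their terms are covered by some combination of those variants. The dictionary, FAQ list, similarity measure, and scoring below are illustrative stand-ins, not the authors' algorithm.

from difflib import SequenceMatcher

DICTIONARY = {"how", "do", "i", "reset", "my", "password", "change", "billing", "address"}
FAQS = ["how do i reset my password", "how do i change my billing address"]

def variants(token, max_candidates=3):
    """Dictionary words most similar to a noisy token, plus the token itself."""
    ranked = sorted(DICTIONARY, key=lambda w: SequenceMatcher(None, token, w).ratio(),
                    reverse=True)
    return set(ranked[:max_candidates]) | {token}

def coverage(faq, query_tokens):
    """Number of FAQ terms covered by some variant of some query token."""
    covered = set()
    for tok in query_tokens:
        covered |= variants(tok)
    return len(set(faq.split()) & covered)

query = "hw do i resett my pswd".split()
print(max(FAQS, key=lambda faq: coverage(faq, query)))   # "how do i reset my password"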


Conference on Information and Knowledge Management | 2003

Information extraction from biomedical literature: methodology, evaluation and an application

L. Venkata Subramaniam; Sougata Mukherjea; Pankaj Kankar; Biplav Srivastava; Vishal S. Batra; Pasumarti V. Kamesam; Ravi Kothari

Journals and conference proceedings represent the dominant mechanisms of reporting new biomedical results. The unstructured nature of such publications makes it difficult to utilize data mining or automated knowledge discovery techniques. Annotation (or markup) of these unstructured documents represents the first step in making them machine-analyzable. In this paper we first present a system called BioAnnotator for identifying and annotating biological terms in documents. BioAnnotator uses domain-based dictionary look-up for recognizing known terms and a rule engine for discovering new terms. The combination of dictionary look-up and rules results in good performance (87% precision and 94% recall on the GENIA 1.1 corpus for extracting general biological terms based on an approximate matching criterion). To demonstrate the subsequent mining and knowledge discovery activities that BioAnnotator makes feasible, we also present a system called MedSummarizer that uses the extracted terms to identify the common concepts in a given group of genes.
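
As a rough illustration of the dictionary-plus-rules idea (this is not BioAnnotator, and the dictionary entries and rule patterns are made up), a term annotator can combine exact look-up of known terms with simple patterns that propose new ones:

import re

KNOWN_TERMS = {"interleukin-2", "nf-kappa b", "t cell"}     # dictionary of known terms
RULES = [
    re.compile(r"\b\w+ase\b", re.IGNORECASE),               # enzyme-like suffix rule
    re.compile(r"\b[A-Z]{2,5}\d+\b"),                       # gene/protein symbols, e.g. STAT3
]

def annotate(text: str) -> set:
    found = set()
    lowered = text.lower()
    for term in KNOWN_TERMS:                                # dictionary look-up
        if term in lowered:
            found.add(term)
    for rule in RULES:                                      # rule-based discovery of new terms
        found.update(match.group(0) for match in rule.finditer(text))
    return found

print(annotate("STAT3 activates a kinase that regulates interleukin-2 in the T cell."))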


Journal of the Acoustical Society of America | 2002

Speech driven lip synthesis using viseme based hidden markov models

Sankar Basu; Tanveer A. Faruquie; Chalapathy Neti; Nitendra Rajput; Andrew W. Senior; L. Venkata Subramaniam; Ashish Verma

A method of speech-driven lip synthesis which applies viseme-based training models to units of visual speech. The audio data is grouped into a smaller number of visually distinct visemes rather than the larger number of phonemes. These visemes then form the basis for a Hidden Markov Model (HMM) state sequence or the output nodes of a neural network. During the training phase, audio and visual features are extracted from input speech, which is then aligned according to the apparent viseme sequence, with the corresponding audio features being used to calculate the HMM state output probabilities or the output of the neural network. During the synthesis phase, the acoustic input is aligned with the most likely viseme HMM sequence (in the case of an HMM-based model) or with the nodes of the network (in the case of a neural-network-based system), which is then used for animation.
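
The core idea can be sketched as follows (toy mapping, probabilities, and observations; not the trained models described in the patent): group phonemes into a smaller set of visually distinct visemes and decode the most likely viseme state sequence with Viterbi over a viseme HMM.

import numpy as np

# Many-to-one grouping of phonemes into visually distinct visemes (illustrative subset).
PHONEME_TO_VISEME = {"p": "bilabial", "b": "bilabial", "m": "bilabial",
                     "f": "labiodental", "v": "labiodental",
                     "aa": "open", "ae": "open"}
VISEMES = ["bilabial", "labiodental", "open"]

def viterbi(obs_loglik, log_trans, log_prior):
    """obs_loglik: (T, S) log-likelihood of each audio frame under each viseme state."""
    T, S = obs_loglik.shape
    delta = log_prior + obs_loglik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # scores[i, j]: leave state i for state j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + obs_loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [VISEMES[s] for s in reversed(path)]

# Toy example: 3 audio frames, "sticky" transitions that favour staying in a viseme.
obs = np.log(np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]))
trans = np.log(np.full((3, 3), 0.2) + np.eye(3) * 0.4)
prior = np.log(np.full(3, 1 / 3))
print(viterbi(obs, trans, prior))                 # ['bilabial', 'bilabial', 'open']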


International Conference on Data Engineering | 2009

Business Intelligence from Voice of Customer

L. Venkata Subramaniam; Tanveer A. Faruquie; Shajith Ikbal; Shantanu Godbole; Mukesh K. Mohania

In this paper, we present a first-of-a-kind system, called Business Intelligence from Voice of Customer (BIVoC), that can: 1) combine unstructured information and structured information in an information-intensive enterprise and 2) derive richer business insights from the combined data. Unstructured information, in this paper, refers to the Voice of Customer (VoC) obtained from interactions of customers with the enterprise, namely conversations with call-center agents, emails, and SMS. Structured databases reflect only those business variables that are static over (a longer window of) time, such as educational qualification, age group, and employment details. In contrast, a combination of unstructured and structured data provides access to business variables that reflect the up-to-date, dynamic requirements of the customers and, more importantly, indicates trends that are difficult to derive from a larger population of customers through any other means. For example, some of the variables reflected in unstructured data are problems/interest in a certain product, expressions of dissatisfaction with the service provided, and some unexplored category of people showing a certain interest/problem. This gives the BIVoC system the ability to derive business insights that are richer, more valuable, and more crucial to enterprises than those of traditional business intelligence systems, which utilize only structured information. We demonstrate the effectiveness of the BIVoC system through one of our real-life engagements where the problem is to determine how to improve agent productivity in a call-center scenario. We also highlight major challenges faced while dealing with unstructured information, such as handling noise and linking with structured data.
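
A toy sketch of the combination step (invented data, categories, and keywords): tag each interaction transcript with an issue category derived from its text, join it to the structured customer record, and aggregate by a structured attribute to surface a trend.

from collections import Counter

customers = {101: {"age_group": "18-25"}, 102: {"age_group": "40-60"}}     # structured data
interactions = [                                                           # unstructured VoC
    (101, "my internet bill is too high this month"),
    (102, "cannot connect to the internet since yesterday"),
    (101, "want to cancel my plan, billing keeps going up"),
]
CATEGORIES = {"billing": ["bill", "billing", "charge"],
              "connectivity": ["connect", "outage", "signal"]}

def categorize(text):
    for label, keywords in CATEGORIES.items():
        if any(word in text.lower() for word in keywords):
            return label
    return "other"

# Join the text-derived category with the structured attribute and aggregate.
insight = Counter((customers[cid]["age_group"], categorize(text))
                  for cid, text in interactions)
print(insight)   # Counter({('18-25', 'billing'): 2, ('40-60', 'connectivity'): 1})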


International Conference on Data Engineering | 2010

Data cleansing as a transient service

Tanveer A. Faruquie; K. Hima Prasad; L. Venkata Subramaniam; Mukesh K. Mohania; Girish Venkatachaliah; Shrinivas Kulkarni; Pramit Basu

There is often a transient need within enterprises for data cleansing, which can be satisfied by offering data cleansing as a transient service. Every time a data cleansing need arises, it should be possible to provision hardware, software and staff to accomplish the task and then to dismantle the setup. In this paper we present such a system that uses virtualized hardware and software for data cleansing, and we share actual experiences gained from building it. We use a cloud infrastructure to offer virtualized data cleansing instances that can be accessed as a service. We build a system that is scalable, elastic and configurable. Each enterprise has unique needs, which makes it necessary to customize both the infrastructure and the cleansing algorithms to address those needs. The resulting system is easily configurable to suit the data cleansing needs of an enterprise.
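
The transient pattern can be caricatured in a few lines (the cleansing rule and the sizing heuristic are placeholders, and a real deployment would provision cloud instances rather than local processes): provision workers sized to the job, run the cleansing step, and dismantle everything when done.

from multiprocessing import Pool

def cleanse(record: str) -> str:
    # Placeholder cleansing rule: normalise whitespace and letter case.
    return " ".join(record.split()).title()

def run_transient_job(records, records_per_worker=1000):
    workers = max(1, min(8, len(records) // records_per_worker + 1))   # "provision" capacity
    with Pool(processes=workers) as pool:                              # transient instances
        cleaned = pool.map(cleanse, records)                           # do the cleansing work
    return cleaned                                                     # pool dismantled on exit

if __name__ == "__main__":
    print(run_transient_job(["  123  main   street ", "45 PARK avenue"]))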


Information Sciences | 2009

Getting insights from the voices of customers: Conversation mining at a contact center

Hironori Takeuchi; L. Venkata Subramaniam; Tetsuya Nasukawa; Shourya Roy

Business-oriented conversations between customers and agents need to be analyzed to obtain valuable insights that can be used to improve product and service quality, operational efficiency, and revenue. For such an analysis, it is critical to identify appropriate textual segments and expressions to focus on, especially when the textual data consists of complete transcripts, which are often lengthy and redundant. In this paper, we propose a method to identify important segments from the conversations by looking for changes in the accuracy of a categorizer designed to separate different business outcomes. We then use text mining to extract important associations between key entities (insights). We show the effectiveness of the method for making chance discoveries by using real-life data from a car rental service center.
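
In the spirit of the method (with toy conversations, labels, and a deliberately crude categorizer), one can train on conversation prefixes of increasing length and watch where the accuracy of separating business outcomes jumps; the segment at which it jumps is a candidate important segment.

def unique_vocab(train, label, k):
    """Words seen only in class `label` within the first k segments."""
    own, other = set(), set()
    for segments, outcome in train:
        words = set(" ".join(segments[:k]).split())
        (own if outcome == label else other).update(words)
    return own - other

def leave_one_out_accuracy(conversations, k):
    labels = ["rental", "no_rental"]
    correct = 0
    for i, (segments, outcome) in enumerate(conversations):
        train = conversations[:i] + conversations[i + 1:]
        words = set(" ".join(segments[:k]).split())
        pred = max(labels, key=lambda c: len(words & unique_vocab(train, c, k)))
        correct += pred == outcome
    return correct / len(conversations)

conversations = [
    (["hello thanks for calling", "i would like to rent a car",
      "we have a compact available", "great please book it"], "rental"),
    (["hello thanks for calling", "i would like to rent a car",
      "we have a compact available", "great please book it"], "rental"),
    (["hello thanks for calling", "i would like to rent a car",
      "sorry nothing is available today", "ok never mind then"], "no_rental"),
    (["hello thanks for calling", "i would like to rent a car",
      "sorry nothing is available today", "ok never mind then"], "no_rental"),
]
for k in range(1, 5):
    print(k, leave_one_out_accuracy(conversations, k))
# Accuracy jumps once the availability segment (k = 3) is included.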


International Journal on Document Analysis and Recognition | 2007

Special issue on noisy text analytics

Craig A. Knoblock; Daniel P. Lopresti; Shourya Roy; L. Venkata Subramaniam

Noisy unstructured text data are ubiquitous in real-world communications. Text produced by processing signals intended for human interpretation, such as printed and handwritten documents, spontaneous speech, and camera-captured scene images, is a prime example. Application of Automatic Speech Recognition (ASR) systems to telephonic conversations between call center agents and customers often sees 30–40% word error rates. Optical character recognition (OCR) error rates for hardcopy documents can range widely, from 2–3% for clean inputs to 50% or higher depending on the quality of the page image, the complexity of the layout, and aspects of the typography. Unconstrained handwriting recognition is still considered to be largely an open problem. Recognition errors are not the sole source of noise; natural language and its creative usage can cause problems for computational techniques. Electronic text taken directly from the Internet (emails, message boards, newsgroups, blogs, wikis, chat logs, and web pages), contact centers (customer complaints, emails, call transcriptions, message summaries), and mobile phones (text messages) is often very noisy and challenging to process. Spelling errors, abbreviations,


IEEE International Conference on Services Computing | 2010

Resource Allocation and SLA Determination for Large Data Processing Services over Cloud

K. Hima Prasad; Tanveer A. Faruquie; L. Venkata Subramaniam; Mukesh K. Mohania; Girish Venkatachaliah

Data processing on the cloud is increasingly used to offer cost-effective services. In this paper, we present a method for resource allocation for data processing services over the cloud that takes into account not just the processing power and memory requirements but also the network speed, reliability and data throughput. We also present algorithms for partitioning data and for parallel block data transfer to achieve better throughput over the allocated cloud resources. We further present methods for optimal pricing and determination of Service Level Agreements for a given data processing job. The usefulness of our approach is shown through experiments performed under different resource allocation conditions.
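
An illustrative sketch (the weights, node specifications, and proportional-split rule are assumptions, not the paper's algorithms): rank candidate nodes by a weighted score over processing power, memory, network speed, and reliability, then size data blocks for parallel transfer in proportion to each chosen node's network throughput.

NODES = {
    "node-a": {"cpu": 8,  "mem_gb": 32, "net_mbps": 1000, "reliability": 0.990},
    "node-b": {"cpu": 16, "mem_gb": 64, "net_mbps": 400,  "reliability": 0.950},
    "node-c": {"cpu": 4,  "mem_gb": 16, "net_mbps": 800,  "reliability": 0.999},
}
WEIGHTS = {"cpu": 0.3, "mem_gb": 0.2, "net_mbps": 0.3, "reliability": 0.2}

def score(spec):
    # Normalise each attribute by the best value across nodes, then apply the weights.
    best = {k: max(node[k] for node in NODES.values()) for k in WEIGHTS}
    return sum(WEIGHTS[k] * spec[k] / best[k] for k in WEIGHTS)

def allocate(data_size_gb, n_nodes=2):
    chosen = sorted(NODES, key=lambda n: score(NODES[n]), reverse=True)[:n_nodes]
    total_net = sum(NODES[n]["net_mbps"] for n in chosen)
    # Block size proportional to each node's share of the combined network speed.
    return {n: round(data_size_gb * NODES[n]["net_mbps"] / total_net, 1) for n in chosen}

print(allocate(100))   # GB of data assigned to each of the two best-scoring nodes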


IEEE International Conference on Services Computing | 2010

A Knowledge Acquisition Method for Improving Data Quality in Services Engagements

Mohan N. Dani; Tanveer A. Faruquie; Rishabh Garg; Govind Kothari; Mukesh K. Mohania; K. Hima Prasad; L. Venkata Subramaniam; Varsha N. Swamy

Poor data quality is a serious problem affecting enterprises. Enterprise databases are large, and manual data cleansing is not feasible; for such large databases it is logical to attempt to cleanse the data in an automated way. This has led to the development of commercial tools for automatic cleansing. However, offering data cleansing as a service has been a challenge because of the need to customize the tool for different datasets: current commercial systems lack the ability to incorporate the unique exceptions of different data sources, which makes migrating the underlying data cleansing algorithms from one dataset to another difficult. In this paper we specifically look at the address standardization task. We use the Ripple Down Rules (RDR) framework to lower the manual effort required in rewriting the rules from one source to another. The RDR framework allows us to incrementally patch the existing rules or add exceptions without breaking other rules. We compare the RDR approach with a conditional random field (CRF) address standardization system and an existing commercially available data cleansing tool. We demonstrate that RDR is an effective knowledge acquisition method and that its adoption for data cleansing can allow data cleansing to be offered as a service.
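
The incremental-patching behaviour that makes RDR attractive here can be sketched compactly (the token conditions and labels are invented for illustration; a real address standardizer would condition on a token's context, not just its surface form): each rule holds a condition and a conclusion, and a correction is attached as an exception under the rule that fired, so existing rules are never rewritten.

class RDRNode:
    def __init__(self, condition, conclusion):
        self.condition = condition        # predicate: token -> bool
        self.conclusion = conclusion      # label to assign when the rule fires
        self.except_child = None          # refines this rule when it fires wrongly
        self.else_child = None            # tried when the condition does not hold

    def classify(self, token):
        if self.condition(token):
            if self.except_child:
                refined = self.except_child.classify(token)
                if refined is not None:
                    return refined
            return self.conclusion
        return self.else_child.classify(token) if self.else_child else None

    def add_exception(self, condition, conclusion):
        """Patch a wrong conclusion without disturbing any other rule."""
        node = self
        while node.except_child:
            node = node.except_child
        node.except_child = RDRNode(condition, conclusion)

root = RDRNode(lambda t: t.lower() in {"road", "rd", "street", "st"}, "STREET_SUFFIX")
root.else_child = RDRNode(lambda t: t.isdigit(), "HOUSE_NUMBER")
print([root.classify(t) for t in "12 church st".split()])   # house number, None, street suffix

# A new source shows "st" also abbreviating "saint"; add an exception instead of editing rules.
root.add_exception(lambda t: t.lower() == "st", "SAINT_PREFIX")
print(root.classify("st"))                                  # SAINT_PREFIX
print(root.classify("road"))                                # STREET_SUFFIX, untouched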
