Publication


Featured research published by Giridharan Iyengar.


EURASIP Journal on Advances in Signal Processing | 2003

Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues

W. H. Adams; Giridharan Iyengar; Ching-Yung Lin; Milind R. Naphade; Chalapathy Neti; Harriet J. Nock; John R. Smith

We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon. Novel concepts are then mapped in terms of the concepts in the lexicon. To achieve robust detection of concepts, we exploit features from multiple modalities, namely, audio, video, and text. Concept representations are modeled using Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM). Models such as Bayesian networks and SVMs are used in a late-fusion approach to model concepts that are not explicitly modeled in terms of features. Our experiments indicate promise in the proposed classification and fusion methodologies: our proposed fusion scheme achieves more than 10% relative improvement over the best unimodal concept detector.
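
As a rough illustration of the late-fusion idea described above (not the paper's actual system), the sketch below trains per-modality GMM concept detectors on synthetic features and fuses their scores with an SVM; all data, dimensions, and the concept itself are made up.

```python
# Minimal late-fusion sketch on synthetic data: unimodal GMM detectors
# produce concept scores, and an SVM is trained on the stacked scores.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic audio and visual features for one concept (e.g. "outdoors").
X_audio = rng.normal(size=(200, 8))
X_visual = rng.normal(size=(200, 16))
y = rng.integers(0, 2, size=200)            # 1 = concept present

def gmm_scores(X, y, n_components=2):
    """Fit positive/negative GMMs and return per-sample log-likelihood ratios."""
    pos = GaussianMixture(n_components).fit(X[y == 1])
    neg = GaussianMixture(n_components).fit(X[y == 0])
    return pos.score_samples(X) - neg.score_samples(X)

# Unimodal detectors -> one score per modality per shot.
scores = np.column_stack([gmm_scores(X_audio, y), gmm_scores(X_visual, y)])

# Late fusion: an SVM over the score vectors.
fusion = SVC(probability=True).fit(scores, y)
print(fusion.predict_proba(scores[:5]))
```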


ACM Multimedia | 2003

Discriminative model fusion for semantic concept detection and annotation in video

Giridharan Iyengar; Harriet J. Nock

In this paper we describe a general information fusion algorithm that can be used to incorporate multimodal cues in building user-defined semantic concept models. We compare this technique with a Bayesian Network-based approach on a semantic concept detection task. Results indicate that this technique yields superior performance. We demonstrate this approach further by building classifiers of arbitrary concepts in a score space defined by a pre-deployed set of multimodal concepts. Results show annotation for user-defined concepts both in and outside the pre-deployed set is competitive with our best video-only models on the TREC Video 2002 corpus.
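
A hypothetical sketch of the score-space idea: each shot is represented by the confidence scores of a set of pre-deployed concept detectors, and a classifier for a new user-defined concept is trained directly in that score space. Logistic regression stands in for the paper's discriminative fusion model, and the detector outputs, dimensions, and labels are simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_shots, n_base_concepts = 500, 12          # e.g. face, speech, outdoors, ...

# Confidence scores of the pre-deployed detectors for each shot (simulated).
score_space = rng.uniform(size=(n_shots, n_base_concepts))
labels = rng.integers(0, 2, size=n_shots)   # user-defined concept present?

# Train the user-defined concept model in the score space.
model = LogisticRegression(max_iter=1000).fit(score_space, labels)
new_shot = rng.uniform(size=(1, n_base_concepts))
print("P(concept present) =", model.predict_proba(new_shot)[0, 1])
```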


Conference on Image and Video Retrieval | 2003

Speaker localisation using audio-visual synchrony: an empirical study

Harriet J. Nock; Giridharan Iyengar; Chalapathy Neti

This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify the application of audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated using a test set of twelve clips of alternating speakers from the multiple-speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to the performance given by two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100×100 pixel square centered on the active speaker's mouth with no prior face detection; the performance upper bound if perfect face detection were available is 69%. This result is significantly better than that of the two purely video image-based schemes.
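
One synchrony definition of this kind is a mutual-information based measure between audio and video features (as used in the related monologue-detection work below). A simplified version, assuming jointly Gaussian one-dimensional signals and using synthetic data, is sketched here.

```python
import numpy as np

def gaussian_mi(a, v):
    """MI of two 1-D signals under a Gaussian assumption: -0.5*log(1 - rho^2)."""
    rho = np.corrcoef(a, v)[0, 1]
    return -0.5 * np.log(1.0 - rho**2 + 1e-12)

rng = np.random.default_rng(2)
T = 300                                      # frames in a clip
audio_energy = rng.normal(size=T)

# Pixel-change energy for two candidate mouth regions: one loosely coupled
# to the audio (the active speaker), one independent (the silent speaker).
active = 0.8 * audio_energy + 0.6 * rng.normal(size=T)
inactive = rng.normal(size=T)

for name, region in [("left speaker", active), ("right speaker", inactive)]:
    print(name, "synchrony =", round(gaussian_mi(audio_energy, region), 3))
```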


Electronic Imaging | 2003

Discovery and fusion of salient multimodal features toward news story segmentation

Winston H. Hsu; Shih-Fu Chang; Chih-Wei Huang; Lyndon Kennedy; Ching-Yung Lin; Giridharan Iyengar

In this paper, we present our new results in news video story segmentation and classification in the context of the TRECVID video retrieval benchmarking event 2003. We applied and extended the Maximum Entropy statistical model to effectively fuse diverse features from multiple levels and modalities, including visual, audio, and text. We have included various features such as motion, face, music/speech types, prosody, and high-level text segmentation information. The statistical fusion model is used to automatically discover relevant features contributing to the detection of story boundaries. One novel aspect of our method is the use of a feature wrapper to address different types of features -- asynchronous, discrete, continuous and delta ones. We also developed several novel features related to prosody. Using the large news video set from the TRECVID 2003 benchmark, we demonstrate satisfactory performance (F1 measures up to 0.76 on ABC news and 0.73 on CNN news), present how these multi-level multi-modal features construct the probabilistic framework, and, more importantly, observe an interesting opportunity for further improvement.
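
The fusion step can be pictured as a maximum-entropy (equivalently, logistic-regression) classifier over binned multimodal features deciding boundary vs. non-boundary at each candidate point. The sketch below uses synthetic stand-ins for the motion, face, audio-class, prosody, and text cues, and a crude discretizer in place of the paper's feature wrapper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(3)
n_points = 1000
raw = rng.normal(size=(n_points, 5))        # 5 continuous multimodal cues
y = rng.integers(0, 2, size=n_points)       # boundary / non-boundary label

# Crude stand-in for the "feature wrapper": discretize continuous cues
# into indicator bins, the form a maximum-entropy model consumes.
binned = KBinsDiscretizer(n_bins=4, encode="onehot-dense",
                          strategy="quantile").fit_transform(raw)

maxent = LogisticRegression(max_iter=1000).fit(binned, y)
print("P(boundary), first 3 points:", maxent.predict_proba(binned[:3])[:, 1])
```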


International Conference on Acoustics, Speech, and Signal Processing | 2003

Audio-visual synchrony for detection of monologues in video archives

Giridharan Iyengar; Harriet J. Nock; Chalapathy Neti

We present our approach to detecting monologues in video shots. A monologue shot is defined as a shot containing a talking person in the video channel with the corresponding speech in the audio channel. Whilst motivated by the TREC 2002 Video Retrieval Track (VT02), the underlying approach of synchrony between audio and video signals is also applicable to voice- and face-based biometrics, assessing lip-synchronization quality in movie editing, and speaker localization in video. Our approach is envisioned as a two-part scheme. We first detect the occurrence of speech and a face in a video shot. In shots containing both speech and a face, we distinguish monologue shots as those shots where the speech and facial movements are synchronized. To measure the synchrony between speech and facial movements we use a mutual-information based measure. Experiments with the VT02 corpus indicate that using synchrony, the average precision improves by more than 50% relative compared to using face and speech information alone. Our synchrony-based monologue detector submission had the best average precision performance (in VT02) amongst 18 different submissions.
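
An illustrative two-stage check following the outline above: a shot qualifies as a monologue only if both a face and speech are detected and the audio and mouth-motion signals are sufficiently synchronized. The detector outputs, signals, and threshold below are all mocked.

```python
import numpy as np

def gaussian_mi(audio, motion):
    """Mutual-information style synchrony score under a Gaussian assumption."""
    rho = np.corrcoef(audio, motion)[0, 1]
    return -0.5 * np.log(1.0 - rho**2 + 1e-12)

def is_monologue(face_found, speech_found, audio, mouth_motion, threshold=0.05):
    # Stage 1: require both a detected face and detected speech.
    if not (face_found and speech_found):
        return False
    # Stage 2: require audio-visual synchrony above a (made-up) threshold.
    return gaussian_mi(audio, mouth_motion) > threshold

rng = np.random.default_rng(4)
audio = rng.normal(size=250)
mouth = 0.7 * audio + 0.7 * rng.normal(size=250)   # synchronized motion
print(is_monologue(True, True, audio, mouth))      # synchrony present
print(is_monologue(True, False, audio, mouth))     # no speech -> False
```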


ACM Multimedia | 2005

Joint visual-text modeling for automatic retrieval of multimedia documents

Giridharan Iyengar; Pinar Duygulu; Shaolei Feng; Pavel Ircing; Sanjeev Khudanpur; Dietrich Klakow; M. R. Krause; R. Manmatha; Harriet J. Nock; D. Petkova; Brock Pytlik; Paola Virga

In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a novel framework where individual components are developed to model different relationships between documents and queries and then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems, one analyzing just the text part of such documents and the other analyzing the visual part without leveraging any knowledge acquired in the text processing. Such systems rarely exceed the performance of any single modality (i.e., text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results demonstrate over 14% improvement in IR performance over the best reported text-only baseline and rank amongst the best results reported on this corpus.
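
For contrast with the joint model the paper proposes, the baseline-style late combination it improves upon can be sketched as a weighted sum of per-modality retrieval scores; the weights and scores below are invented.

```python
import numpy as np

def combined_rank(text_scores, visual_scores, w_text=0.7, w_visual=0.3):
    """Rank documents by a weighted sum of min-max normalized modality scores."""
    def normalize(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (np.ptp(s) + 1e-12)
    fused = w_text * normalize(text_scores) + w_visual * normalize(visual_scores)
    return np.argsort(-fused), fused

# Scores for 5 candidate video documents against one query (made up).
text = [2.1, 0.4, 1.7, 3.0, 0.9]
visual = [0.2, 0.9, 0.8, 0.1, 0.7]
order, fused = combined_rank(text, visual)
print("ranked document ids:", order.tolist())
```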


International Journal of Speech Technology | 2001

A Cascade Visual Front End for Speaker Independent Automatic Speechreading

Gerasimos Potamianos; Chalapathy Neti; Giridharan Iyengar; Andrew W. Senior; Ashish Verma

We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in significantly improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied on a three-dimensional video region-of-interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis-based data projection, which is applied on a concatenation of a small number of consecutive image-transformed video frames. The third stage is a data rotation by means of a maximum likelihood linear transform that optimizes the likelihood of the observed data under the assumption of their class-conditional multivariate normal distribution with diagonal covariance. We applied the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour long, large-vocabulary, continuous-speech audio-visual database. We demonstrated significant classification accuracy gains from each added stage of the proposed algorithm, which, when combined, can achieve up to a 27% improvement. Overall, we achieved a 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.
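
A rough sketch of the first two stages on synthetic data: a 2-D DCT per mouth-region frame with a crude coefficient truncation, then LDA over a concatenation of consecutive transformed frames. The third stage (the maximum likelihood linear transform) is omitted, and all dimensions and labels are toy values.

```python
import numpy as np
from scipy.fft import dctn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
n_frames, H, W, keep = 600, 16, 16, 24       # toy dimensions
frames = rng.normal(size=(n_frames, H, W))   # mouth-region video frames
labels = rng.integers(0, 5, size=n_frames)   # toy viseme classes

# Stage 1: image-compression transform (2-D DCT, keep a few coefficients).
coeffs = np.array([dctn(f, norm="ortho").ravel()[:keep] for f in frames])

# Stage 2: concatenate a window of consecutive frames, then project with LDA.
win = 3
stacked = np.array([coeffs[i - win:i + win + 1].ravel()
                    for i in range(win, n_frames - win)])
lda = LinearDiscriminantAnalysis(n_components=4)
projected = lda.fit_transform(stacked, labels[win:n_frames - win])
print(projected.shape)                       # (n_frames - 2*win, 4)
```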


International Symposium on Biomedical Imaging | 2007

Real-Time Mutual-Information-Based Linear Registration on the Cell Broadband Engine Processor

Moriyoshi Ohara; Hangu Yeo; F. Savino; Giridharan Iyengar; Leiguang Gong; Hiroshi Inoue; Hideaki Komatsu; Vadim Sheinin; S. Daijavaa; Bradley J. Erickson

Emerging multi-core processors are able to accelerate medical imaging applications by exploiting the parallelism available in their algorithms. We have implemented a mutual-information-based 3D linear registration algorithm on the Cell Broadband Engine (CBE) processor, which has nine processor cores on a chip and a 4-way SIMD unit for each core. By exploiting the highly parallel architecture and its high memory bandwidth, our implementation with two CBE processors can compute mutual information for about 33 million pixel pairs per second. This implementation is significantly faster than a conventional one on a traditional microprocessor and even faster than a previously reported custom-hardware implementation. As a result, it can register a pair of 256×256×30 3D images in one second by using a multi-resolution method. This paper describes our implementation with a focus on localized sampling and speculative packing techniques, which reduce the amount of memory traffic by 82%.
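
The core quantity being computed is mutual information estimated from a joint intensity histogram; a minimal NumPy version is sketched below on synthetic images. The paper's actual contribution, the Cell-specific parallelization with localized sampling and speculative packing, is not reproduced here.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Estimate MI between two same-sized images via a joint histogram."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal over rows
    py = pxy.sum(axis=0, keepdims=True)      # marginal over columns
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(6)
fixed = rng.normal(size=(64, 64))
moving = fixed + 0.1 * rng.normal(size=(64, 64))                  # aligned
shifted = np.roll(fixed, 8, axis=0) + 0.1 * rng.normal(size=(64, 64))
print("aligned MI:   ", round(mutual_information(fixed, moving), 3))
print("misaligned MI:", round(mutual_information(fixed, shifted), 3))
```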


Communications of the ACM | 2004

Multimodal processing by finding common cause

Harriet J. Nock; Giridharan Iyengar; Chalapathy Neti

Commonalities help answer many context-aware questions that arise in human-computer interaction.


Computer Vision and Image Understanding | 2004

A multi-modal system for the retrieval of semantic video events

Arnon Amir; Sankar Basu; Giridharan Iyengar; Ching-Yung Lin; Milind R. Naphade; John R. Smith; Savitha Srinivasan; Belle L. Tseng

A framework for event detection is proposed where events, objects, and other semantic concepts are detected from video using trained classifiers. These classifiers are used to automatically annotate video with semantic labels, which in turn are used to search for new, untrained types of events and semantic concepts. The novelty of the approach lies in (1) the semi-automatic construction of models of events from feature descriptors and (2) the integration of content-based and concept-based querying in the search process. Speech retrieval is applied independently and combined results are produced. Results of applying these techniques to the Search benchmark of the NIST TREC Video track 2001 are reported, and the lessons learned and future work are discussed.
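
A hypothetical sketch of the concept-based querying idea: shots carry confidences from trained concept detectors, an untrained event is expressed as weights over those concepts, and the concept score is merged with an independently produced speech-retrieval score. All concept names, weights, and scores below are invented.

```python
import numpy as np

concepts = ["outdoors", "sky", "smoke", "crowd"]
event_weights = np.array([0.3, 0.4, 0.3, 0.0])   # ad-hoc event, e.g. "launch"

# Rows = shots, columns = detector confidences for the concepts above.
annotations = np.array([
    [0.9, 0.8, 0.7, 0.1],
    [0.2, 0.1, 0.0, 0.9],
    [0.8, 0.9, 0.1, 0.2],
])
speech_scores = np.array([0.6, 0.1, 0.3])        # independent speech retrieval

concept_scores = annotations @ event_weights
combined = 0.5 * concept_scores + 0.5 * speech_scores
print("best shot:", int(np.argmax(combined)), combined.round(3))
```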
