Publication


Featured research published by Harriet J. Nock.


EURASIP Journal on Advances in Signal Processing | 2003

Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues

W. H. Adams; Giridharan Iyengar; Ching-Yung Lin; Milind R. Naphade; Chalapathy Neti; Harriet J. Nock; John R. Smith

We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon. Novel concepts are then mapped in terms of the concepts in the lexicon. To achieve robust detection of concepts, we exploit features from multiple modalities, namely, audio, video, and text. Concept representations are modeled using Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM). Models such as Bayesian networks and SVMs are used in a late-fusion approach to model concepts that are not explicitly modeled in terms of features. Our experiments indicate promise in the proposed classification and fusion methodologies: our proposed fusion scheme achieves more than 10% relative improvement over the best unimodal concept detector.
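
As an illustration of the late-fusion idea described above, the minimal sketch below trains a discriminative fusion model over unimodal concept-detector scores. It is a hedged example only: the three-score feature layout, the synthetic data, and the use of scikit-learn's SVC are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: late fusion of unimodal concept-detector scores with an SVM,
# in the spirit of the fusion scheme described above. The score layout and
# toy data are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Each row stacks per-modality confidence scores for one shot:
# [audio_score, visual_score, text_score] from independently trained detectors.
unimodal_scores = rng.uniform(0.0, 1.0, size=(200, 3))
# Synthetic binary ground truth for the target semantic concept.
labels = (unimodal_scores.mean(axis=1) + 0.1 * rng.normal(size=200) > 0.5).astype(int)

# Late fusion: a discriminative model learns how to weight and combine the
# unimodal scores rather than averaging them.
fusion_model = SVC(kernel="rbf", probability=True).fit(unimodal_scores, labels)

new_shot = np.array([[0.8, 0.4, 0.6]])             # scores from the three detectors
print(fusion_model.predict_proba(new_shot)[0, 1])  # fused concept confidence
```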


ACM Multimedia | 2003

Discriminative model fusion for semantic concept detection and annotation in video

Giridharan Iyengar; Harriet J. Nock

In this paper we describe a general information fusion algorithm that can be used to incorporate multimodal cues in building user-defined semantic concept models. We compare this technique with a Bayesian network-based approach on a semantic concept detection task. Results indicate that this technique yields superior performance. We demonstrate this approach further by building classifiers of arbitrary concepts in a score space defined by a pre-deployed set of multimodal concepts. Results show that annotation for user-defined concepts, both in and outside the pre-deployed set, is competitive with our best video-only models on the TREC Video 2002 corpus.
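
The score-space idea above can be illustrated with a small sketch: a new, user-defined concept is modelled purely from the confidence scores of a pre-deployed bank of concept detectors. The detector count, the synthetic labels, and the logistic-regression learner below are assumptions for illustration, not the authors' system.

```python
# Hedged sketch: modelling a user-defined concept in the score space of a
# pre-deployed bank of concept detectors. Bank size and toy labels are assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Each shot is represented only by the confidence scores of N pre-deployed
# concept detectors (e.g. "face", "speech", "outdoors", ...).
n_shots, n_detectors = 400, 12
score_space = rng.uniform(size=(n_shots, n_detectors))

# Synthetic labels for an arbitrary new concept never modelled directly
# from low-level features.
new_concept_labels = (0.6 * score_space[:, 0] + 0.4 * score_space[:, 5]
                      + 0.1 * rng.normal(size=n_shots) > 0.5).astype(int)

# A discriminative model in score space stands in for a dedicated detector.
score_space_model = LogisticRegression(max_iter=1000).fit(score_space,
                                                          new_concept_labels)
print(score_space_model.predict_proba(score_space[:1])[0, 1])
```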


Conference on Image and Video Retrieval | 2003

Speaker localisation using audio-visual synchrony: an empirical study

Harriet J. Nock; Giridharan Iyengar; Chalapathy Neti

This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify the application of audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated using a test set of twelve clips of alternating speakers from the multiple-speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to the performance given by two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100×100 pixel square centered on the active speaker's mouth with no prior face detection; the performance upper bound if perfect face detection were available is 69%. This result is significantly better than the two purely video image-based schemes.
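
A minimal sketch of the underlying synchrony idea follows: score candidate image regions by the mutual information between their motion signal and the audio signal, and pick the most synchronous one. The joint-Gaussian MI estimate and the synthetic signals are assumptions for illustration; the paper compares several synchrony definitions on real broadcast video.

```python
# Hedged sketch: picking the on-screen region whose motion is most synchronous
# with the audio, using a Gaussian mutual-information estimate.
import numpy as np

def gaussian_mutual_information(x, y):
    """MI of two 1-D signals under a joint-Gaussian assumption:
    I(X;Y) = -0.5 * log(1 - rho^2), where rho is the correlation coefficient."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2 + 1e-12)

rng = np.random.default_rng(1)
audio_energy = rng.normal(size=500)   # per-frame audio feature (synthetic)

# Per-frame motion features for candidate mouth regions; region 2 is
# (synthetically) driven by the audio, mimicking the active speaker.
regions = [rng.normal(size=500) for _ in range(4)]
regions[2] = 0.8 * audio_energy + 0.2 * rng.normal(size=500)

scores = [gaussian_mutual_information(audio_energy, r) for r in regions]
print("most synchronous region:", int(np.argmax(scores)))
```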


International Conference on Acoustics, Speech, and Signal Processing | 2003

Audio-visual synchrony for detection of monologues in video archives

Giridharan Iyengar; Harriet J. Nock; Chalapathy Neti

We present our approach to detecting monologues in video shots. A monologue shot is defined as a shot containing a talking person in the video channel with the corresponding speech in the audio channel. Whilst motivated by the TREC 2002 Video Retrieval Track (VT02), the underlying approach of measuring synchrony between audio and video signals is also applicable to voice- and face-based biometrics, to assessing lip-synchronization quality in movie editing, and to speaker localization in video. Our approach is envisioned as a two-part scheme. We first detect the occurrence of speech and a face in a video shot. In shots containing both speech and a face, we distinguish monologue shots as those in which the speech and facial movements are synchronized. To measure the synchrony between speech and facial movements we use a mutual-information-based measure. Experiments with the VT02 corpus indicate that using synchrony, the average precision improves by more than 50% relative to using face and speech information alone. Our synchrony-based monologue detector submission had the best average precision performance amongst the 18 submissions to VT02.
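
The two-stage decision described above can be sketched in a few lines; the detector outputs and the synchrony threshold below are placeholder assumptions, not values from the paper.

```python
# Hedged sketch of the two-stage monologue decision: a shot is flagged as a
# monologue only if speech and a face are both detected and their audio-visual
# synchrony score clears a (placeholder) threshold.
def is_monologue(face_detected: bool,
                 speech_detected: bool,
                 av_synchrony: float,
                 synchrony_threshold: float = 0.2) -> bool:
    """Two-stage test: modality presence first, audio-visual synchrony second."""
    if not (face_detected and speech_detected):
        return False
    return av_synchrony >= synchrony_threshold

# Example: a shot with a detected face, detected speech, and a mutual-information
# synchrony score of 0.35 would be labelled a monologue.
print(is_monologue(True, True, 0.35))
```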


ACM Multimedia | 2005

Joint visual-text modeling for automatic retrieval of multimedia documents

Giridharan Iyengar; Pinar Duygulu; Shaolei Feng; Pavel Ircing; Sanjeev Khudanpur; Dietrich Klakow; M. R. Krause; R. Manmatha; Harriet J. Nock; D. Petkova; Brock Pytlik; Paola Virga

In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a framework in which individual components are developed to model different relationships between documents and queries and are then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems: one analyzes just the text part of such documents, while the other analyzes the visual part without leveraging any knowledge acquired in the text processing. Such systems rarely exceed the performance of any single modality (i.e. text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in a significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results demonstrate over 14% improvement in IR performance over the best reported text-only baseline and rank amongst the best results reported on this corpus.


Communications of the ACM | 2004

Multimodal processing by finding common cause

Harriet J. Nock; Giridharan Iyengar; Chalapathy Neti

Commonalities help answer many context-aware questions that arise in human-computer interaction.


International Conference on Acoustics, Speech, and Signal Processing | 2005

Semantic annotation of multimedia using maximum entropy models

Janne Argillander; Giridharan Iyengar; Harriet J. Nock

In this paper, we propose a maximum-entropy-based approach for the automatic annotation of multimedia content. In our approach, we explicitly model the spatial location of the low-level features by means of specially designed predicates. In addition, the interaction between the low-level features is modeled using joint observation predicates. We evaluate the performance of semantic concept classifiers built using this approach on the TRECVID2003 corpus. Experiments indicate that our model's performance is on par with the best results reported to date on this dataset, despite using only unimodal features and a single approach to model building. This compares favorably with state-of-the-art systems, which use multimodal features and classifier fusion to achieve similar results on this corpus.
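
Since a maximum-entropy classifier over binary predicates is equivalent to regularised logistic regression, the sketch below illustrates the modelling step; the predicate design, toy data, and scikit-learn learner are assumptions for illustration, not the TRECVID2003 setup.

```python
# Hedged sketch: a maximum-entropy (logistic-regression) concept classifier
# over binary predicates, e.g. "colour-bin-7 fires in the top-left region" or
# a joint-observation predicate over two low-level features. Toy data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Each column is a binary predicate evaluated on a keyframe.
predicates = rng.integers(0, 2, size=(300, 20))
labels = (predicates[:, 3] & predicates[:, 7]).astype(int)  # toy target concept

# Maximum entropy with a Gaussian prior corresponds to L2-regularised
# logistic regression.
maxent = LogisticRegression(C=1.0, max_iter=1000).fit(predicates, labels)

new_keyframe = rng.integers(0, 2, size=(1, 20))
print(maxent.predict_proba(new_keyframe)[0, 1])  # concept confidence
```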


Storage and Retrieval for Image and Video Databases | 2003

Context-enhanced video understanding

Alejandro Jaimes; Milind R. Naphade; Harriet J. Nock; John R. Smith; Belle L. Tseng

Many recent efforts have been made to automatically index multimedia content with the aim of bridging the semantic gap between syntax and semantics. In this paper, we propose a novel framework to automatically index video using context for video understanding. First we discuss the notion of context and how it relates to video understanding. Then we present the framework we are constructing, which is modeled as an expert system that uses a rule-based engine, domain knowledge, visual detectors (for objects and scenes), and different data sources available with the video (metadata, text from automatic speech recognition, etc.). We also describe our approach to align text from speech recognition and video segments, and present experiments using a simple implementation of our framework. Our experiments show that context can be used to improve the performance of visual detectors.


International Conference on Acoustics, Speech, and Signal Processing | 2004

Multimodal video search techniques: late fusion of speech-based retrieval and visual content-based retrieval

Arnon Amir; Giridharan Iyengar; Ching-Yung Lin; Milind R. Naphade; Apostol Natsev; Chalapathy Neti; Harriet J. Nock; John R. Smith; Belle L. Tseng

This paper describes multimodal systems for ad-hoc search constructed by IBM for the TRECVID 2003 benchmark of search systems for broadcast video. These systems all use a late fusion of independently developed speech-based and visual content-based retrieval systems and outperform our individual retrieval systems on both manual and interactive search tasks. For the manual task, our best system used a query-dependent linear weighting between the speech-based and image-based retrieval systems; its mean average precision (MAP) is 20% above our best unimodal system for manual search. For the interactive task, where the user has full knowledge of the query topic and of the performance of the individual search systems, our best system used an interlacing approach: the user determines the (subjectively) optimal weights A and B for the speech-based and image-based systems, and the multimodal result set is aggregated by taking the top A documents from the speech-based system, followed by the top B documents from the image-based system, repeating this process until the desired result set size is reached. This multimodal interactive search has MAP 40% above our best unimodal interactive search system.
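
The interlacing combination used for the interactive task can be sketched directly; the document IDs, the block sizes A=3 and B=1, and the duplicate handling below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the interlacing combination described above: take the top A
# results from the speech-based system and the top B from the image-based
# system, then repeat until the desired result-set size is reached, skipping
# duplicates along the way.
def interlace(speech_ranked, image_ranked, a, b, k):
    """Merge two ranked lists by alternating blocks of size a and b."""
    merged, seen = [], set()
    si, ii = iter(speech_ranked), iter(image_ranked)
    while len(merged) < k:
        took_any = False
        for source, block in ((si, a), (ii, b)):
            for _ in range(block):
                doc = next(source, None)
                if doc is None:
                    continue
                took_any = True
                if doc not in seen:
                    seen.add(doc)
                    merged.append(doc)
                    if len(merged) == k:
                        return merged
        if not took_any:   # both lists exhausted
            break
    return merged

speech_hits = ["d1", "d4", "d2", "d9", "d5"]
image_hits = ["d7", "d1", "d3", "d8"]
print(interlace(speech_hits, image_hits, a=3, b=1, k=6))
```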


Archive | 2003

Method, apparatus, and program for cross-linking information sources using multiple modalities

Giridharan Iyengar; Chalapathy Neti; Harriet J. Nock
