
Publication


Featured research published by Julien van Hout.


Machine Vision and Applications | 2014

Evaluating multimedia features and fusion for example-based event detection

Gregory K. Myers; Ramesh Nallapati; Julien van Hout; Stephanie Pancoast; Ramakant Nevatia; Chen Sun; Amirhossein Habibian; Dennis Koelma; Koen E. A. van de Sande; Arnold W. M. Smeulders; Cees G. M. Snoek

Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME’s performance in the 2012 TRECVID MED evaluation was one of the best reported.
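
The fusion finding above (a simple arithmetic mean performing as well as more complex schemes) lends itself to a short illustration. The sketch below is a minimal NumPy example; the score values and classifier columns are hypothetical placeholders, not part of the SESAME system.

    import numpy as np

    # Hypothetical detection scores for one event: one row per video clip,
    # one column per single-data-type classifier (e.g. visual, motion, audio, ASR).
    scores = np.array([
        [0.82, 0.64, 0.31, 0.55],
        [0.12, 0.20, 0.05, 0.40],
        [0.67, 0.71, 0.58, 0.62],
    ])

    # Normalize each classifier's scores so no single modality dominates the mean.
    z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)

    # Arithmetic-mean fusion: the fused event score is the mean across classifiers.
    fused = z.mean(axis=1)
    print(fused)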


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Rich system combination for keyword spotting in noisy and acoustically heterogeneous audio streams

Murat Akbacak; Lukas Burget; Wen Wang; Julien van Hout

We address the problem of retrieving spoken information from noisy and heterogeneous audio archives using system combination with a rich and diverse set of noise-robust modules. Audio search applications have so far focused on constrained domains or genres and on relatively benign, homogeneous acoustic or channel conditions. In this paper, our focus is to improve the accuracy of a keyword spotting system in highly degraded and diverse channel conditions by employing multiple recognition systems in parallel with different robust frontends and modeling choices, as well as different representations during audio indexing and search (words vs. subword units). After aligning keyword hits from different systems, we employ system combination at the score level using a logistic-regression-based classifier. Side information, such as the output of an acoustic condition identification module, is used to guide the system combination classifier, which is trained on a held-out dataset. Lattice-based indexing and search is used in all keyword spotting systems. We present improvements in probability-of-miss at a fixed probability-of-false-alarm by employing our proposed rich system combination approach on DARPA Robust Automatic Transcription of Speech (RATS) Phase-I evaluation data, which contains highly degraded channel recordings (signal-to-noise ratios as low as 0 dB) and varied channel characteristics.
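
As a rough illustration of score-level combination guided by side information, the sketch below uses scikit-learn's LogisticRegression; the score values, the three-system layout, and the one-hot channel indicator are hypothetical stand-ins, not the paper's actual feature set.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical held-out data: each row is one aligned keyword hit, holding the
    # detection scores of three parallel KWS systems followed by a one-hot
    # channel-condition indicator (side information from condition identification).
    X_heldout = np.array([
        [0.9, 0.7, 0.8, 1, 0],
        [0.2, 0.1, 0.3, 1, 0],
        [0.6, 0.8, 0.4, 0, 1],
        [0.1, 0.2, 0.2, 0, 1],
    ])
    y_heldout = np.array([1, 0, 1, 0])  # 1 = true keyword occurrence

    # Score-level combination: logistic regression learns how to weight the systems,
    # conditioned on the channel-condition features.
    combiner = LogisticRegression().fit(X_heldout, y_heldout)

    # Fused confidence for a new aligned hit.
    new_hit = np.array([[0.5, 0.9, 0.3, 0, 1]])
    print(combiner.predict_proba(new_hit)[:, 1])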


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012

A novel approach to soft-mask estimation and Log-Spectral enhancement for robust speech recognition

Julien van Hout; Abeer Alwan

This paper describes a technique for enhancing the Mel-filtered log spectra of noisy speech, with application to noise-robust speech recognition. We first compute an SNR-based soft-decision mask in the Mel-spectral domain as an indicator of speech presence. Then, we exploit the known time-frequency correlation of speech by treating this mask as an image, performing median filtering and blurring to remove outliers and smooth the decision regions. The mask constitutes a set of multiplicative coefficients (with values in [0, 1]) that are used to discard the unreliable parts of the Mel-filtered log-spectrum of noisy speech. Finally, we apply Log-Spectral Flooring [1] on the liftered spectra of both clean and noisy speech so as to match their respective dynamic ranges and to emphasize the information in the spectral peaks. The noisy MFCCs computed on these modified log-spectra show an increased similarity with their corresponding clean MFCCs. Evaluation on the Aurora-2 corpus shows that the proposed approach competes with state-of-the-art front-ends such as ETSI-AFE, MVA, or PNCC.
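
A minimal sketch of the masking idea follows, assuming NumPy and SciPy; the sigmoid slope, filter sizes, and the way the noise estimate is obtained are illustrative choices, not the parameters used in the paper.

    import numpy as np
    from scipy.ndimage import gaussian_filter, median_filter

    def soft_mask(noisy_logmel, noise_logmel, alpha=1.0):
        # noisy_logmel, noise_logmel: (frames, mel_bands) log Mel energies of the
        # noisy speech and of a noise estimate.
        snr = noisy_logmel - noise_logmel           # log-domain SNR estimate
        mask = 1.0 / (1.0 + np.exp(-alpha * snr))   # soft decision in [0, 1]

        # Treat the mask as an image: median filtering removes isolated outliers,
        # blurring smooths the speech-presence decision regions.
        mask = median_filter(mask, size=(3, 3))
        mask = gaussian_filter(mask, sigma=1.0)
        return np.clip(mask, 0.0, 1.0)

    # Toy usage: the mask multiplies the noisy log Mel spectrum before MFCC extraction,
    # de-emphasizing the unreliable time-frequency regions.
    noisy = np.random.randn(200, 23)
    noise = 0.5 * np.random.randn(200, 23)
    enhanced_logmel = soft_mask(noisy, noise) * noisy
    print(enhanced_logmel.shape)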


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Extracting spoken and acoustic concepts for multimedia event detection

Julien van Hout; Murat Akbacak; Diego Castán; Eric Yeh; Michelle Hewlett Sanchez

Because of the popularity of online videos, there has been much interest in recent years in audio processing for improving online video search. In this paper, we explore using acoustic concepts and spoken concepts, extracted via audio segmentation/recognition and speech recognition respectively, for Multimedia Event Detection (MED). To extract spoken concepts, a segmenter trained on annotated data from user videos segments the audio into three classes: speech, music, and other sounds. The speech segments are passed to an Automatic Speech Recognition (ASR) engine, and words from the 1-best ASR output, as well as posterior-weighted word counts collected from ASR lattices, are used as features for an SVM-based classifier. Acoustic concepts are extracted using the 3-gram lattice counts of two Acoustic Concept Recognition (ACR) systems trained on 7 broad classes. MED results are reported on a subset of the NIST 2011 TRECVID data. We find that spoken concepts using lattices yield a 15% relative improvement in Average Pmiss (APM) over 1-best-based features. Further, the proposed spoken concepts give a 30% relative gain in APM over the ACR-based MED system using 7 classes. Lastly, we obtain an 8% relative APM improvement after score-level fusion of both concept types, showing the effective coupling of the two approaches.
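
To make the lattice-based spoken-concept representation concrete, here is a small sketch assuming scikit-learn; the vocabulary, the posterior-weighted counts, and the event labels are hypothetical and only illustrate the bag-of-words features fed to the SVM.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Hypothetical posterior-weighted word counts accumulated from ASR lattices,
    # one dict per video.
    vocab = ["birthday", "cake", "goal", "crowd", "bike"]
    lattice_counts = [
        {"birthday": 2.4, "cake": 1.1},
        {"goal": 3.2, "crowd": 1.8},
        {"bike": 2.0, "crowd": 0.4},
        {"cake": 0.9, "birthday": 1.5},
    ]
    labels = [1, 0, 0, 1]  # 1 = target event (e.g. a birthday party)

    # Turn expected counts into a fixed-length bag-of-words vector per video.
    X = np.array([[counts.get(w, 0.0) for w in vocab] for counts in lattice_counts])

    # Linear SVM over the lattice-based spoken-concept features.
    clf = LinearSVC().fit(X, labels)
    print(clf.decision_function(X))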


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Feature fusion for high-accuracy keyword spotting

Vikramjit Mitra; Julien van Hout; Horacio Franco; Dimitra Vergyri; Yun Lei; Martin Graciarena; Yik-Cheung Tam; Jing Zheng

This paper assesses the role of robust acoustic features in spoken term detection (also known as keyword spotting, KWS) under heavily degraded channel and noise-corrupted conditions. A number of noise-robust acoustic features were used, both in isolation and in combination, to train large-vocabulary continuous speech recognition (LVCSR) systems, with the resulting word lattices used for spoken term detection. Results indicate that the use of robust acoustic features improved KWS performance with respect to a highly optimized state-of-the-art baseline system. Fusion of multiple systems is known to improve KWS performance; however, the number of systems that can be trained is constrained by the number of frontend features. This work shows that, given a set of frontend features, it is possible to train several systems by using the frontend features both on their own and through different feature-fusion techniques, providing a richer set of individual systems. Results show that KWS performance improves over individual feature-based systems when multiple features are fused with one another, and improves further when multiple such systems are combined. Finally, this work shows that fusing the fused-feature and single-feature systems provides a significant improvement in KWS performance compared to fusing single-feature systems alone.
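
The feature-fusion idea (building extra systems by combining frontend features before acoustic-model training) can be sketched as frame-wise concatenation; the feature dimensions and the normalization below are illustrative assumptions, not the paper's exact fusion techniques.

    import numpy as np

    # Hypothetical frame-synchronous frontend features for one utterance:
    # a 13-dim MFCC stream and a 40-dim noise-robust stream.
    mfcc = np.random.randn(500, 13)
    robust = np.random.randn(500, 40)

    def mvn(x):
        # Per-utterance mean/variance normalization of each feature dimension.
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

    # Feature-level fusion by concatenation yields one more "system" to train
    # beyond the single-feature ones.
    fused_frames = np.hstack([mvn(mfcc), mvn(robust)])  # shape (500, 53)
    print(fused_frames.shape)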


Conference of the International Speech Communication Association (INTERSPEECH) | 2018

Analysis of Complementary Information Sources in the Speaker Embeddings Framework.

Mahesh Kumar Nandwana; Mitchell McLaren; Diego Castán; Julien van Hout; Aaron Lawson

Deep neural network (DNN)-based speaker embeddings have resulted in new, state-of-the-art text-independent speaker recognition technology. However, very limited effort has been made to understand DNN speaker embeddings. In this study, our aim is to analyze the behavior of speaker recognition systems based on speaker embeddings with respect to different front-end features, including the standard Mel-frequency cepstral coefficients (MFCC), power-normalized cepstral coefficients (PNCC), and perceptual linear prediction (PLP). Using a speaker recognition system based on DNN speaker embeddings and probabilistic linear discriminant analysis (PLDA), we compared different approaches to leveraging complementary information using score-, embedding-, and feature-level combination. We report results for the Speakers in the Wild (SITW) and NIST SRE 2016 datasets. We find that the first and second embedding layers are complementary in nature. By applying score- and embedding-level fusion, we demonstrate relative improvements in equal error rate of 17% on NIST SRE 2016 and 10% on SITW over the baseline system.
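
A minimal sketch of embedding-level and score-level combination, assuming NumPy; the embedding dimensions, PLDA scores, and fusion weights are hypothetical placeholders rather than values from the study.

    import numpy as np

    # Hypothetical speaker embeddings from two DNN layers for the same utterance.
    emb_layer1 = np.random.randn(512)
    emb_layer2 = np.random.randn(512)

    def length_norm(v):
        return v / (np.linalg.norm(v) + 1e-8)

    # Embedding-level fusion: length-normalize, then concatenate before PLDA scoring.
    fused_embedding = np.concatenate([length_norm(emb_layer1), length_norm(emb_layer2)])

    # Score-level fusion: weight the per-system PLDA scores of a trial, with weights
    # tuned on a development set (the values here are illustrative).
    plda_scores = np.array([1.7, 2.3])
    weights = np.array([0.5, 0.5])
    fused_score = float(weights @ plda_scores)
    print(fused_embedding.shape, fused_score)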


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

Speech recognition in unseen and noisy channel conditions

Vikramjit Mitra; Horacio Franco; Chris Bartels; Julien van Hout; Martin Graciarena; Dimitra Vergyri

Speech recognition in varying background conditions is a challenging problem. Acoustic condition mismatch between training and evaluation data can significantly reduce recognition performance. For mismatched conditions, data-adaptation techniques are typically found to be useful, as they expose the acoustic model to the new data condition(s). Supervised adaptation techniques usually provide substantial performance improvement, but such gain is contingent on having labeled or transcribed data, which is often unavailable. The alternative is unsupervised adaptation, where feature-transform methods and model-adaptation techniques are typically explored. This work investigates robust features, feature-space maximum likelihood linear regression (fMLLR) transform, and deep convolutional nets to address the problem of unseen channel and noise conditions. In addition, the work investigates bottleneck (BN) features extracted from deep autoencoder (DAE) networks trained by using acoustic features extracted from the speech signal. We demonstrate that such representations not only produce robust systems but also that they can be used to perform data selection for unsupervised model adaptation. Our results indicate that the techniques presented in this paper significantly improve performance of speech recognition systems in unseen channel and noise conditions.
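
The bottleneck (BN) idea can be sketched with a small PyTorch autoencoder; the layer sizes and the 40-dimensional input are illustrative, not the DAE architecture used in the paper.

    import torch
    import torch.nn as nn

    class BottleneckDAE(nn.Module):
        # Minimal deep autoencoder with a bottleneck layer whose activations
        # serve as features for the recognizer.
        def __init__(self, feat_dim=40, bn_dim=42):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(feat_dim, 1024), nn.ReLU(),
                nn.Linear(1024, bn_dim),            # bottleneck layer
            )
            self.decoder = nn.Sequential(
                nn.ReLU(),
                nn.Linear(bn_dim, 1024), nn.ReLU(),
                nn.Linear(1024, feat_dim),          # reconstruct the input features
            )

        def forward(self, x):
            bn = self.encoder(x)
            return self.decoder(bn), bn

    # Train with a reconstruction loss; at test time keep only the BN activations.
    model = BottleneckDAE()
    frames = torch.randn(8, 40)                     # a batch of acoustic feature frames
    recon, bn_feats = model(frames)
    loss = nn.functional.mse_loss(recon, frames)
    print(bn_feats.shape, float(loss))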


New Era for Robust Speech Recognition: Exploiting Deep Learning | 2017

Robust Features in Deep-Learning-Based Speech Recognition

Vikramjit Mitra; Horacio Franco; Richard M. Stern; Julien van Hout; Luciana Ferrer; Martin Graciarena; Wen Wang; Dimitra Vergyri; Abeer Alwan; John H. L. Hansen

Recent progress in deep learning has revolutionized speech recognition research, with Deep Neural Networks (DNNs) becoming the new state of the art for acoustic modeling. DNNs offer significantly lower speech recognition error rates compared to those provided by the previously used Gaussian Mixture Models (GMMs). Unfortunately, DNNs are data sensitive, and unseen data conditions can deteriorate their performance. Acoustic distortions such as noise, reverberation, channel differences, etc. add variation to the speech signal, which in turn impact DNN acoustic model performance. A straightforward solution to this issue is training the DNN models with these types of variation, which typically provides quite impressive performance. However, anticipating such variation is not always possible; in these cases, DNN recognition performance can deteriorate quite sharply. To avoid subjecting acoustic models to such variation, robust features have traditionally been used to create an invariant representation of the acoustic space. Most commonly, robust feature-extraction strategies have explored three principal areas: (a) enhancing the speech signal, with a goal of improving the perceptual quality of speech; (b) reducing the distortion footprint, with signal-theoretic techniques used to learn the distortion characteristics and subsequently filter them out of the speech signal; and finally (c) leveraging knowledge from auditory neuroscience and psychoacoustics, by using robust features inspired by auditory perception.
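
As one concrete instance of the "reduce the distortion footprint" family described above, here is a toy magnitude-domain spectral subtraction sketch in NumPy; the subtraction, spectral floor, and synthetic data are illustrative and not taken from the chapter.

    import numpy as np

    def spectral_subtraction(noisy_mag, noise_mag, floor=0.05):
        # noisy_mag: (frames, bins) magnitude spectrogram of noisy speech.
        # noise_mag: (bins,) average magnitude spectrum of a noise-only segment.
        clean_est = noisy_mag - noise_mag                 # subtract the noise estimate
        return np.maximum(clean_est, floor * noisy_mag)   # floor avoids negative magnitudes

    # Toy usage with synthetic data: 100 frames, 257 FFT bins.
    noisy = np.abs(np.random.randn(100, 257))
    noise = 0.1 * np.abs(np.random.randn(257))
    enhanced = spectral_subtraction(noisy, noise)
    print(enhanced.shape)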


Conference of the International Speech Communication Association (INTERSPEECH) | 2013

Strategies for High Accuracy Keyword Detection in Noisy Channels

Arindam Mandal; Julien van Hout; Yik-Cheung Tam; Vikramjit Mitra; Yun Lei; Jing Zheng; Dimitra Vergyri; Luciana Ferrer; Martin Graciarena; Andreas Kathol; Horacio Franco


Archive | 2013

The 2013 SESAME Multimedia Event Detection and Recounting System

Robert C. Bolles; J. Brian Burns; James A. Herson; Gregory K. Myers; Stephanie Pancoast; Julien van Hout; Wen Wang; Julie Wong; Eric Yeh; Amirhossein Habibian; Dennis Koelma; Zhenyang Li; Masoud Mazloom; Silvia-Laura Pintea; Sung Chun Lee; Ram Nevatia; Pramod Sharma; Chen Sun; Remi Trichet

Collaboration


Dive into Julien van Hout's collaborations. Top co-authors include Chen Sun (University of Southern California) and Mitchell McLaren (Queensland University of Technology).