Michael S. Brandstein
Harvard University
Publications
Featured research published by Michael S. Brandstein.
Archive | 2001
Joseph H. DiBiase; Harvey F. Silverman; Michael S. Brandstein
Talker localization with microphone arrays has received significant attention lately as a means for the automated tracking of individuals in an enclosure and as a necessary component of any general purpose speech capture system. Several algorithmic approaches are available for speech source localization with multi-channel data. This chapter summarizes the current field and comments on the general merits and shortcomings of each genre. A new localization method is then presented in detail. By utilizing key features of existing methods, this new algorithm is shown to be significantly more robust to acoustical conditions, particularly reverberation effects, than the traditional localization techniques in use today.
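The traditional two-step methods this chapter surveys first estimate a time delay between a microphone pair and then convert it to source geometry. A minimal sketch of that geometric step, assuming a far-field source and a single pair (the function name and parameters are illustrative, not from the chapter):

```python
import numpy as np

def doa_from_tdoa(tau, mic_spacing, c=343.0):
    """Far-field direction of arrival (radians from broadside)
    implied by a single time delay between two microphones.

    tau:         time delay of arrival (seconds)
    mic_spacing: distance between the microphones (meters)
    c:           speed of sound (m/s)
    """
    # The path-length difference c*tau, divided by the spacing,
    # gives the sine of the bearing angle; clip guards against
    # physically impossible delays from estimation error.
    s = np.clip(c * tau / mic_spacing, -1.0, 1.0)
    return np.arcsin(s)
```

With several such bearings from different pairs, the two-step methods triangulate a location; the chapter's new algorithm instead folds the localization and delay-estimation stages together for robustness to reverberation.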
International Conference on Acoustics, Speech, and Signal Processing | 1997
Michael S. Brandstein; Harvey F. Silverman
Conventional time-delay estimators exhibit dramatic performance degradations in the presence of multipath signals. This limits their application in reverberant enclosures, particularly when the signal of interest is speech and it may not be possible to estimate and compensate for channel effects prior to time-delay estimation. This paper details an alternative approach which reformulates the problem as a linear regression of phase data and then estimates the time-delay through minimization of a robust statistical error measure. The technique is shown to be less susceptible to room reverberation effects. Simulations are performed across a range of source placements and room conditions to illustrate the utility of the proposed time-delay estimation method relative to conventional methods.
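The underlying model is that, for a pure delay, the cross-spectrum phase is a line through the origin whose slope is the delay. A sketch of that regression using plain least squares (the paper minimizes a robust statistical error measure instead, which this illustration does not implement):

```python
import numpy as np

def tde_phase_regression(x1, x2, fs):
    """Estimate the delay of x2 relative to x1 by regressing
    cross-spectrum phase on frequency (ordinary least squares
    through the origin; a simplified stand-in for the paper's
    robust error measure)."""
    n = len(x1)
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    cross = X1 * np.conj(X2)
    omega = 2 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
    # Ideal anechoic model: angle(cross) = omega * tau.
    phase = np.unwrap(np.angle(cross))
    # Slope of the zero-intercept fit is the delay; skip DC,
    # where the model is degenerate.
    w, p = omega[1:], phase[1:]
    return np.sum(w * p) / np.sum(w * w)
```

The advantage of the regression view is that outlier-resistant fits can simply down-weight frequency bins whose phase is corrupted by multipath, which is the paper's central idea.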
Computer Speech & Language | 1997
Michael S. Brandstein; Harvey F. Silverman
Electronically steerable arrays of microphones have a variety of uses in speech data acquisition systems. Applications include teleconferencing, speech recognition and speaker identification, sound capture in adverse environments, and biomedical devices for the hearing impaired. An array of microphones has a number of advantages over a single-microphone system. It may be electronically aimed to provide a high-quality signal from a desired source location while simultaneously attenuating interfering talkers and ambient noise, does not necessitate local placement of transducers or encumber the talker with a hand-held or head-mounted microphone, and does not require physical movement to alter its direction of reception. Additionally, it has capabilities that a single microphone does not; namely automatic detection, localization and tracking of active talkers in its receptive area. This paper addresses the specific application of source localization algorithms for estimating the position of speech sources in a real-room environment given limited computational resources. The theoretical foundations of a speech source localization system are presented. This includes the development of a source–sensor geometry for talkers and sensors in the near-field environment as well as the evaluation of several error criteria available to the problem. Several practical algorithms necessary for real-time implementation are developed, specifically the derivation and evaluation of an appropriate time-delay estimator and a novel closed-form locator. Finally, results obtained from a real system are presented to illustrate the effectiveness of the proposed source localization techniques as well as to confirm the practicality of the theoretical models.
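The electronic aiming described above can be sketched as a frequency-domain delay-and-sum beamformer: each channel is advanced by its steering delay so that sound from the desired location adds coherently. This is a minimal illustration of the principle, not the paper's real-time system:

```python
import numpy as np

def delay_and_sum(signals, delays, fs):
    """Steer an array toward a source by delaying and averaging.

    signals: (num_mics, num_samples) array of microphone signals
    delays:  per-channel steering delays in seconds (a positive
             delay advances that channel); applied in the
             frequency domain so they need not be whole samples.
    """
    m, n = signals.shape
    spectra = np.fft.rfft(signals, axis=1)
    omega = 2 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)
    # Advance each channel by its steering delay, then average.
    steered = spectra * np.exp(1j * omega * np.asarray(delays)[:, None])
    return np.fft.irfft(steered.sum(axis=0) / m, n=n)
```

Interfering sources at other locations arrive with mismatched delays and average down rather than up, which is the attenuation property the abstract cites.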
International Conference on Acoustics, Speech, and Signal Processing | 1998
Ce Wang; Michael S. Brandstein
A hybrid real-time face tracker based on both sound and visual cues is presented. Initial talker locations are estimated acoustically from microphone array data while precise localization and tracking are derived from image information. A computationally efficient algorithm for face detection via motion analysis is employed to track individual faces at rates up to 30 frames per second. The system is robust to nonlinear source motions, complex backgrounds, varying lighting conditions, and a variety of source-camera depths. While the direct focus of this work is automated video conferencing, the face tracking capability has utility to many multimedia and virtual reality applications.
Journal of the Acoustical Society of America | 1999
Michael S. Brandstein
The relative time delay associated with a speech signal received at a pair of spatially separated microphones is a key component in talker localization and microphone array beamforming procedures. The traditional method for estimating this parameter utilizes the generalized cross correlation (GCC), the performance of which is compromised by the presence of room reverberations and background noise. Typically, the GCC filtering criteria used are either focused on the signal degradations due to additive noise or those due exclusively to multipath channel effects. There has been relatively little success at applying GCC weighting schemes which are robust to both of these conditions. This paper details an alternative approach which attempts to employ a signal-dependent criterion, namely, the estimated periodicity of the speech signal, to design a GCC filter appropriate for the combination of noise and multipath distortions. Simulations are performed across a range of room conditions to illustrate the utility of the proposed time-delay estimation method relative to conventional GCC filtering approaches.
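For reference, the baseline GCC estimator with the common phase transform (PHAT) weighting can be sketched as follows. Note this shows only the standard whitening weight, not the paper's signal-dependent, periodicity-based filter:

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Generalized cross correlation with PHAT weighting.

    Whitens the cross-spectrum so every frequency contributes
    only its phase, then returns the delay (seconds) of x1
    relative to x2 at the correlation peak."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    # PHAT weighting: divide out the magnitude (small epsilon
    # avoids division by zero in empty bins).
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Reorder so index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

The paper's contribution amounts to replacing the flat whitening weight with one that emphasizes frequency regions where the speech signal's estimated periodicity indicates a strong, reliable component.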
Multimedia Signal Processing | 1999
Ce Wang; Michael S. Brandstein
A real-time face tracker based on both sound and visual cues is presented. Initial talker locations are estimated acoustically from microphone array data while precise localization and tracking are derived from visual data. The image processing employs a hierarchical structure which utilizes source motion, contour geometry, color data, and facial features. The resulting system is capable of tracking multiple persons in complex backgrounds and robustly discriminating faces from similar objects. While the direct focus of this work is automated videoconferencing, the face tracking capability has utility to many multimedia and virtual reality applications.
Workshop on Applications of Signal Processing to Audio and Acoustics | 2001
Scott M. Griebel; Michael S. Brandstein
This paper presents an alternative approach to acoustic source localization which modifies the traditional two-step localization procedure to not require explicit time-delay estimates. Instead, the cross-correlation functions derived from various microphone pairs are simultaneously maximized over a set of potential delay combinations consistent with candidate source locations. The result is a procedure that combines the advantages offered by the phase transform (PHAT) weighting (or any reasonable cross-correlation-type function) and a more robust localization procedure without dramatically increasing computational load. Simulations are performed across a range of reverberation conditions to illustrate the utility of the proposed method relative to conventional generalized cross-correlation (GCC) filtering approaches and a more modern eigenvalue-based technique.
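The idea of scoring candidate locations directly, rather than committing to per-pair delay estimates, can be sketched as a brute-force steered-response-power search (an illustrative simplification, with hypothetical names, rather than the paper's exact procedure):

```python
import numpy as np

def srp_localize(signals, mic_pos, candidates, fs, c=343.0):
    """Pick the candidate source location whose implied
    inter-microphone lags maximize the sum of PHAT-weighted
    cross-correlations over all microphone pairs.

    signals:    (num_mics, num_samples) microphone data
    mic_pos:    (num_mics, 3) microphone coordinates (meters)
    candidates: (num_candidates, 3) points to score
    """
    m, n = signals.shape
    spectra = np.fft.rfft(signals, axis=1)
    best_score, best = -np.inf, None
    for cand in candidates:
        dists = np.linalg.norm(mic_pos - cand, axis=1)
        score = 0.0
        for i in range(m):
            for j in range(i + 1, m):
                cross = spectra[i] * np.conj(spectra[j])
                cross = cross / (np.abs(cross) + 1e-12)  # PHAT weight
                cc = np.fft.irfft(cross, n=n)
                # Lag (in samples) this candidate implies for pair (i, j).
                lag = int(round((dists[i] - dists[j]) / c * fs))
                score += cc[lag % n]
        if score > best_score:
            best_score, best = score, cand
    return best
```

A practical implementation would cache the per-pair correlations outside the candidate loop; the point of the sketch is that no per-pair delay decision is ever made, so a pair whose correlation peak was pulled away by reverberation cannot single-handedly corrupt the location estimate.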
Workshop on Applications of Signal Processing to Audio and Acoustics | 1997
Michael S. Brandstein
Generalized cross-correlation (GCC) has been the traditional method for estimating the relative time-delay associated with speech signals received by a pair of microphones in a reverberant, noisy environment. The filtering criterion employed is either focused on the signal degradations due to additive noise or those due exclusively to multipath channel effects. There has been relatively little success at applying GCC weighting schemes which are robust to both of these conditions. This paper details an alternative approach which attempts to employ a signal-dependent criterion, namely the estimated periodicity of harmonic spectral intervals, to design a GCC filter appropriate for the combination of noise and multipath signal distortions. Simulations are performed across a range of room conditions to illustrate the utility of the proposed time-delay estimation method relative to conventional GCC filtering approaches.
International Conference on Acoustics, Speech, and Signal Processing | 1998
Michael S. Brandstein
This paper addresses the limitations of current approaches to distant-talker speech acquisition and advocates the development of techniques which explicitly incorporate the nature of the speech signal (e.g. statistical non-stationarity, method of production, pitch, voicing, formant structure, and source radiator model) into a multi-channel context. The goal is to combine the advantages of spatial filtering achieved through beamforming with knowledge of the desired time-series attributes. The potential utility of such an approach is demonstrated through the application of a multi-channel version of the dual excitation speech model.
International Conference on Multimedia and Expo | 2000
Ce Wang; Scott M. Griebel; Michael S. Brandstein
An automatic video-conferencing system is proposed which employs acoustic source localization, video face tracking and pose estimation, and multi-channel speech enhancement. The video portion of the system tracks talkers by utilizing source motion, contour geometry, color data and simple facial features. Decisions involving which camera to use are based on an estimate of the head's gaze angle. This head pose estimation is achieved using a very general head model which employs hairline features and a learned network classification procedure. Finally, a wavelet microphone array technique is used to create an enhanced speech waveform to accompany the recorded video signal. The system presented in this paper is robust to both visual clutter (e.g. ovals in the scene of interest which are not faces) and audible noise (e.g. reverberations and background noise).