Publication


Featured research published by Jintao Jiang.


EURASIP Journal on Advances in Signal Processing | 2002

On the Relationship between Face Movements, Tongue Movements, and Speech Acoustics

Jintao Jiang; Abeer Alwan; Patricia A. Keating; Lynne E. Bernstein

This study examines relationships between external face movements, tongue movements, and speech acoustics for consonant-vowel (CV) syllables and sentences spoken by two male and two female talkers with different visual intelligibility ratings. The questions addressed are how relationships among measures vary by syllable, whether talkers who are more intelligible produce greater optical evidence of tongue movements, and how the results for CVs compare with those for sentences. Results show that the prediction of one data stream from another is better for C/a/ syllables than for C/i/ and C/u/ syllables. Across the different places of articulation, lingual places result in better predictions of one data stream from another than do bilabial and glottal places. Results vary from talker to talker; interestingly, highly rated intelligibility does not result in high predictions. In general, predictions for CV syllables are better than those for sentences.
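
The cross-stream prediction analysis summarized above can be illustrated with a minimal sketch: a least-squares multilinear mapping from one set of motion channels to another, scored by per-channel correlation. The array shapes, channel counts, and random placeholder data are assumptions for illustration and do not reproduce the paper's procedure.

```python
import numpy as np

# Hypothetical per-frame feature vectors for two data streams
# (e.g., face-marker coordinates and tongue-pellet coordinates).
rng = np.random.default_rng(0)
n_frames, n_face, n_tongue = 500, 30, 12
face = rng.standard_normal((n_frames, n_face))
tongue = rng.standard_normal((n_frames, n_tongue))

def predict_stream(X, Y):
    """Least-squares multilinear mapping from X to Y; returns predictions."""
    X1 = np.column_stack([X, np.ones(len(X))])   # add an intercept column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # fit the weight matrix
    return X1 @ W

pred = predict_stream(face, tongue)

# Score each predicted channel by its correlation with the measured channel.
r = [np.corrcoef(pred[:, j], tongue[:, j])[0, 1] for j in range(n_tongue)]
print(f"mean channel correlation: {np.mean(r):.2f}")
```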


NeuroImage | 2010

Comparison of landmark-based and automatic methods for cortical surface registration

Dimitrios Pantazis; Anand A. Joshi; Jintao Jiang; David W. Shattuck; Lynne E. Bernstein; Hanna Damasio; Richard M. Leahy

Group analysis of structure or function in cerebral cortex typically involves, as a first step, the alignment of cortices. A surface-based approach to this problem treats the cortex as a convoluted surface and coregisters across subjects so that cortical landmarks or features are aligned. This registration can be performed using curves representing sulcal fundi and gyral crowns to constrain the mapping. Alternatively, registration can be based on the alignment of curvature metrics computed over the entire cortical surface. The former approach typically involves some degree of user interaction in defining the sulcal and gyral landmarks, while the latter methods can be completely automated. Here we introduce a cortical delineation protocol consisting of 26 consistent landmarks spanning the entire cortical surface. We then compare the performance of a landmark-based registration method that uses this protocol with that of two automatic methods implemented in the software packages FreeSurfer and BrainVoyager. We compare performance in terms of discrepancy maps between the different methods, the accuracy with which regions of interest are aligned, and the ability of the automated methods to correctly align standard cortical landmarks. Our results show similar performance for ROIs in the perisylvian region for the landmark-based method and FreeSurfer. However, the discrepancy maps showed larger variability between methods in occipital and frontal cortex, and the automated methods often produced misalignment of standard cortical landmarks. Consequently, selection of the registration approach should consider the importance of accurate sulcal alignment for the specific task for which coregistration is being performed. When automatic methods are used, users should ensure that sulci in regions of interest in their studies are adequately aligned before proceeding with subsequent analysis.
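
As a rough illustration of the discrepancy maps used for the comparison above, the sketch below computes a per-vertex Euclidean distance between vertex coordinates returned by two registration methods, assuming the surfaces are in one-to-one vertex correspondence in a common space; the coordinates are random placeholders.

```python
import numpy as np

# Placeholder vertex coordinates (n_vertices x 3) produced by two
# registration methods, assumed to be in vertex-wise correspondence
# in a common atlas space.
rng = np.random.default_rng(1)
n_vertices = 10_000
coords_method_a = rng.standard_normal((n_vertices, 3))
coords_method_b = coords_method_a + 0.5 * rng.standard_normal((n_vertices, 3))

# Discrepancy map: Euclidean distance between corresponding vertices.
discrepancy = np.linalg.norm(coords_method_a - coords_method_b, axis=1)

print(f"mean discrepancy: {discrepancy.mean():.2f} (placeholder units)")
print(f"95th percentile:  {np.percentile(discrepancy, 95):.2f}")
```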


Journal of the Acoustical Society of America | 2006

On the perception of voicing in syllable-initial plosives in noise

Jintao Jiang; Marcia Chen; Abeer Alwan

Previous studies [Lisker, J. Acoust. Soc. Am. 57, 1547-1551 (1975); Summerfield and Haggard, J. Acoust. Soc. Am. 62, 435-448 (1977)] have shown that voice onset time (VOT) and the onset frequency of the first formant are important perceptual cues of voicing in syllable-initial plosives. Most prior work, however, has focused on speech perception in quiet environments. The present study seeks to determine which cues are important for the perception of voicing in syllable-initial plosives in the presence of noise. Perceptual experiments were conducted using stimuli consisting of naturally spoken consonant-vowel syllables by four talkers in various levels of additive white Gaussian noise. Plosives sharing the same place of articulation and vowel context (e.g., /pa,ba/) were presented to subjects in two alternate forced choice identification tasks, and a threshold signal-to-noise-ratio (SNR) value (corresponding to the 79% correct classification score) was estimated for each voiced/voiceless pair. The threshold SNR values were then correlated with several acoustic measurements of the speech tokens. Results indicate that the onset frequency of the first formant is critical in perceiving voicing in syllable-initial plosives in additive white Gaussian noise, while the VOT duration is not.
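
A minimal sketch of how a threshold SNR at the 79%-correct point might be estimated from two-alternative forced-choice scores is shown below, assuming a logistic psychometric function with a 50% guessing floor; the SNR levels and scores are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up 2AFC proportion-correct scores at several SNRs for one
# voiced/voiceless pair; the logistic form and the data are assumptions.
snr = np.array([-15, -10, -5, 0, 5, 10], dtype=float)   # dB
p_correct = np.array([0.52, 0.58, 0.70, 0.86, 0.95, 0.99])

def psychometric(x, a, b):
    """2AFC logistic: 50% guessing floor rising toward 100%."""
    return 0.5 + 0.5 / (1.0 + np.exp(-(x - a) / b))

(a, b), _ = curve_fit(psychometric, snr, p_correct, p0=[0.0, 2.0])

# Invert the fitted function at the 79%-correct point.
p_target = 0.79
threshold_snr = a - b * np.log(0.5 / (p_target - 0.5) - 1.0)
print(f"threshold SNR at {p_target:.0%} correct: {threshold_snr:.1f} dB")
```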


Brain Research | 2008

Quantified acoustic–optical speech signal incongruity identifies cortical sites of audiovisual speech processing

Lynne E. Bernstein; Zhong-Lin Lu; Jintao Jiang

A fundamental question about human perception is how the speech perceiving brain combines auditory and visual phonetic stimulus information. We assumed that perceivers learn the normal relationship between acoustic and optical signals. We hypothesized that when the normal relationship is perturbed by mismatching the acoustic and optical signals, cortical areas responsible for audiovisual stimulus integration respond as a function of the magnitude of the mismatch. To test this hypothesis, in a previous study, we developed quantitative measures of acoustic-optical speech stimulus incongruity that correlate with perceptual measures. In the current study, we presented low incongruity (LI, matched), medium incongruity (MI, moderately mismatched), and high incongruity (HI, highly mismatched) audiovisual nonsense syllable stimuli during fMRI scanning. Perceptual responses differed as a function of the incongruity level, and BOLD measures were found to vary regionally and quantitatively with perceptual and quantitative incongruity levels. Each increase in the level of incongruity resulted in an increase in overall levels of cortical activity and in additional activations. However, the only cortical region that demonstrated differential sensitivity to the three stimulus incongruity levels (HI>MI>LI) was a subarea of the left supramarginal gyrus (SMG). The left SMG might support a fine-grained analysis of the relationship between audiovisual phonetic input in comparison with stored knowledge, as hypothesized here. The methods here show that quantitative manipulation of stimulus incongruity is a new and powerful tool for disclosing the system that processes audiovisual speech stimuli.


Frontiers in Neuroscience | 2013

Auditory Perceptual Learning for Speech Perception Can be Enhanced by Audiovisual Training

Lynne E. Bernstein; Silvio P. Eberhardt; Jintao Jiang

Speech perception under audiovisual (AV) conditions is well known to confer benefits to perception such as increased speed and accuracy. Here, we investigated how AV training might benefit or impede auditory perceptual learning of speech degraded by vocoding. In Experiments 1 and 3, participants learned paired associations between vocoded spoken nonsense words and nonsense pictures. In Experiment 1, paired-associates (PA) AV training of one group of participants was compared with audio-only (AO) training of another group. When tested under AO conditions, the AV-trained group was significantly more accurate than the AO-trained group. In addition, pre- and post-training AO forced-choice consonant identification with untrained nonsense words showed that AV-trained participants had learned significantly more than AO participants. The pattern of results pointed to their having learned at the level of the auditory phonetic features of the vocoded stimuli. Experiment 2, a no-training control with testing and re-testing on the AO consonant identification, showed that the controls were as accurate as the AO-trained participants in Experiment 1 but less accurate than the AV-trained participants. In Experiment 3, PA training alternated AV and AO conditions on a list-by-list basis within participants, and training was to criterion (92% correct). PA training with AO stimuli was reliably more effective than training with AV stimuli. We explain these discrepant results in terms of the so-called “reverse hierarchy theory” of perceptual learning and in terms of the diverse multisensory and unisensory processing resources available to speech perception. We propose that early AV speech integration can potentially impede auditory perceptual learning; but visual top-down access to relevant auditory features can promote auditory perceptual learning.
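
The vocoded speech referred to above can be approximated with a generic noise vocoder, sketched below; the band count, filter design, and envelope smoothing are assumptions rather than the study's exact parameters.

```python
import numpy as np
from scipy.signal import butter, filtfilt, sosfiltfilt

def noise_vocode(x, fs, n_bands=6, lo=100.0, hi=7000.0, env_cut=30.0):
    """Generic noise vocoder: split the signal into log-spaced bands,
    extract each band's amplitude envelope, and use it to modulate
    band-limited noise. All parameters here are illustrative assumptions."""
    edges = np.geomspace(lo, hi, n_bands + 1)
    rng = np.random.default_rng(0)
    out = np.zeros_like(x)
    b_env, a_env = butter(2, env_cut / (fs / 2), btype="low")
    for k in range(n_bands):
        sos = butter(4, [edges[k] / (fs / 2), edges[k + 1] / (fs / 2)],
                     btype="band", output="sos")
        band = sosfiltfilt(sos, x)                          # analysis band
        env = filtfilt(b_env, a_env, np.abs(band))          # smoothed envelope
        noise = sosfiltfilt(sos, rng.standard_normal(len(x)))  # noise carrier
        out += np.clip(env, 0.0, None) * noise
    return out / (np.max(np.abs(out)) + 1e-12)

# Example: vocode one second of a synthetic, speech-like test signal at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
speechlike = np.sin(2 * np.pi * 150 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
vocoded = noise_vocode(speechlike, fs)
```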


Journal of Experimental Psychology: Human Perception and Performance | 2011

Psychophysics of the McGurk and other audiovisual speech integration effects.

Jintao Jiang; Lynne E. Bernstein

When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four different types of perceptual responses, auditory correct, visual correct, fusion (the so-called McGurk effect), and combination (i.e., two consonants are reported). Here, quantitative measures were developed to account for the distribution of the four types of perceptual responses to 384 different stimuli from four talkers. The measures included mutual information, correlations, and acoustic measures, all representing audiovisual stimulus relationships. In Experiment 1, open-set perceptual responses were obtained for acoustic /bɑ/ or /lɑ/ dubbed to video /bɑ, dɑ, gɑ, vɑ, zɑ, lɑ, wɑ, ðɑ/. The talker, the video syllable, and the acoustic syllable significantly influenced the type of response. In Experiment 2, the best predictors of response category proportions were a subset of the physical stimulus measures, with the variance accounted for in the perceptual response category proportions between 17% and 52%. That audiovisual stimulus relationships can account for perceptual response distributions supports the possibility that internal representations are based on modality-specific stimulus relationships.
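
Mutual information is one of the stimulus-relationship measures named above; a minimal sketch of estimating mutual information between two discretized feature streams follows, with the binning scheme and the synthetic signals being assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Mutual information (in bits) between two continuous signals,
    estimated from a joint histogram. The bin count is an assumption."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] *
                        np.log2(pxy[nonzero] / (px @ py)[nonzero])))

# Synthetic example: an "acoustic" envelope and a correlated "optical" track.
rng = np.random.default_rng(0)
acoustic = rng.standard_normal(2000)
optical = 0.7 * acoustic + 0.3 * rng.standard_normal(2000)
print(f"MI = {mutual_information(acoustic, optical):.2f} bits")
```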


Attention, Perception, & Psychophysics | 2007

Similarity structure in visual speech perception and optical phonetic signals.

Jintao Jiang; Abeer Alwan; Patricia A. Keating; Lynne E. Bernstein

A complete understanding of visual phonetic perception (lipreading) requires linking perceptual effects to physical stimulus properties. However, the talking face is a highly complex stimulus, affording innumerable possible physical measurements. In the search for isomorphism between stimulus properties and phonetic effects, second-order isomorphism was examined between the perceptual similarities of video-recorded perceptually identified speech syllables and the physical similarities among the stimuli. Four talkers produced the stimulus syllables comprising 23 initial consonants followed by one of three vowels. Six normal-hearing participants identified the syllables in a visual-only condition. Perceptual stimulus dissimilarity was quantified using the Euclidean distances between stimuli in perceptual spaces obtained via multidimensional scaling. Physical stimulus dissimilarity was quantified using face points recorded in three dimensions by an optical motion capture system. The variance accounted for in the relationship between the perceptual and the physical dissimilarities was evaluated using both the raw dissimilarities and the weighted dissimilarities. With weighting and the full set of 3-D optical data, the variance accounted for ranged between 46% and 66% across talkers and between 49% and 64% across vowels. The robust second-order relationship between the sparse 3-D point representation of visible speech and the perceptual effects suggests that the 3-D point representation is a viable basis for controlled studies of first-order relationships between visual phonetic perception and physical stimulus attributes.
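
A minimal sketch of this analysis pipeline, under simplifying assumptions: classical (Torgerson) multidimensional scaling of a perceptual dissimilarity matrix, followed by the squared correlation (variance accounted for) between perceptual and physical dissimilarities. The dissimilarity matrices below are random placeholders, and the paper's weighting scheme is not reproduced.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS of a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred squared D
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]           # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))

# Placeholder symmetric dissimilarity matrices for 23 consonants:
# one perceptual (e.g., confusion-derived), one physical (3-D motion).
rng = np.random.default_rng(0)
n = 23
raw = np.abs(rng.standard_normal((n, n)))
perceptual = (raw + raw.T) / 2
np.fill_diagonal(perceptual, 0.0)
physical = perceptual + 0.3 * np.abs(rng.standard_normal((n, n)))
physical = (physical + physical.T) / 2
np.fill_diagonal(physical, 0.0)

coords = classical_mds(perceptual, k=2)          # 2-D perceptual space
print("first three stimuli in perceptual space:\n", np.round(coords[:3], 2))

# Variance accounted for: squared correlation of the off-diagonal entries.
iu = np.triu_indices(n, k=1)
r = np.corrcoef(perceptual[iu], physical[iu])[0, 1]
print(f"variance accounted for: {100 * r ** 2:.0f}%")
```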


Human Brain Mapping | 2011

Visual phonetic processing localized using speech and nonspeech face gestures in video and point‐light displays

Lynne E. Bernstein; Jintao Jiang; Dimitrios Pantazis; Zhong-Lin Lu; Anand A. Joshi

The talking face affords multiple types of information. To isolate cortical sites with responsibility for integrating linguistically relevant visual speech cues, speech and nonspeech face gestures were presented in natural video and point‐light displays during fMRI scanning at 3.0T. Participants with normal hearing viewed the stimuli and also viewed localizers for the fusiform face area (FFA), the lateral occipital complex (LOC), and the visual motion (V5/MT) regions of interest (ROIs). The FFA, the LOC, and V5/MT were significantly less activated for speech relative to nonspeech and control stimuli. Distinct activation of the posterior superior temporal sulcus and the adjacent middle temporal gyrus to speech, independent of media, was obtained in group analyses. Individual analyses showed that speech and nonspeech stimuli were associated with adjacent but different activations, with the speech activations more anterior. We suggest that the speech activation area is the temporal visual speech area (TVSA), and that it can be localized with the combination of stimuli used in this study.


International Conference on Multimedia and Expo | 2006

Acoustically-Driven Talking Face Synthesis using Dynamic Bayesian Networks

Jianxia Xue; Jonas Borgstrom; Jintao Jiang; Lynne E. Bernstein; Abeer Alwan

Dynamic Bayesian networks (DBNs) have been widely studied in multi-modal speech recognition applications. Here, we introduce DBNs into an acoustically-driven talking face synthesis system. Three DBN prototypes, namely independent, coupled, and product HMMs, were studied. Results showed that the DBN methods were more effective in this study than a multilinear regression baseline. Coupled and product HMMs performed similarly to each other, and both outperformed independent HMMs in terms of motion-trajectory accuracy. Audio and visual speech asynchronies were represented differently in coupled HMMs than in product HMMs.
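
Motion-trajectory accuracy, the evaluation criterion mentioned above, can be illustrated with a short sketch comparing synthesized and recorded trajectories by per-channel RMSE and correlation; the trajectories and the specific metric definitions are placeholders, not the paper's data or evaluation protocol.

```python
import numpy as np

def trajectory_accuracy(predicted, recorded):
    """Per-channel RMSE and Pearson correlation between synthesized and
    recorded motion trajectories of shape (n_frames, n_channels)."""
    err = predicted - recorded
    rmse = np.sqrt(np.mean(err ** 2, axis=0))
    corr = np.array([np.corrcoef(predicted[:, j], recorded[:, j])[0, 1]
                     for j in range(recorded.shape[1])])
    return rmse, corr

# Placeholder trajectories for 10 face-motion channels over 300 frames.
rng = np.random.default_rng(0)
recorded = rng.standard_normal((300, 10)).cumsum(axis=0)
predicted = recorded + rng.standard_normal((300, 10))

rmse, corr = trajectory_accuracy(predicted, recorded)
print(f"mean RMSE: {rmse.mean():.2f}, mean correlation: {corr.mean():.2f}")
```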


International Conference on Acoustics, Speech, and Signal Processing | 2002

Similarity structure in perceptual and physical measures for visual consonants across talkers

Jintao Jiang; Abeer Alwan; Lynne E. Bernstein; Patricia A. Keating

This paper investigates the relationship between visual confusion matrices and physical (facial) measures. The similarity structure in perceptual and physical measures for visual consonants was examined across four talkers. Four talkers, spanning a wide range of rated visual intelligibility, were recorded producing 69 consonant-vowel (CV) syllables. Audio, video, and 3-D face motion were recorded. Each talker's CV productions were presented for identification in a visual-only condition to six viewers with average or better lipreading ability. The obtained visual confusion matrices demonstrated that phonemic equivalence classes were related to visual intelligibility and were talker and vowel-context dependent. Physical measures accounted for about 63% of the variance of visual consonant perception, with C/u/ syllables yielding higher correlations than C/a/ and C/i/ syllables.
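
A minimal sketch of one common way to turn identification responses into a confusion matrix and then a symmetric dissimilarity matrix, of the kind used in perceptual-to-physical comparisons like the one above; the trial data and the symmetrization rule are assumptions.

```python
import numpy as np

def confusions_to_dissimilarity(stimuli, responses, n_classes):
    """Build a row-normalized confusion matrix from identification trials,
    then convert it to a symmetric dissimilarity matrix."""
    conf = np.zeros((n_classes, n_classes))
    for s, r in zip(stimuli, responses):
        conf[s, r] += 1
    conf = conf / conf.sum(axis=1, keepdims=True)   # P(response | stimulus)
    similarity = (conf + conf.T) / 2                # symmetrize
    return 1.0 - similarity                         # dissimilarity

# Hypothetical trials: stimulus and response indices for 5 consonants,
# with roughly 70% correct identification.
rng = np.random.default_rng(0)
stimuli = rng.integers(0, 5, size=400)
guesses = rng.integers(0, 5, size=400)
responses = np.where(rng.random(400) < 0.7, stimuli, guesses)

D = confusions_to_dissimilarity(stimuli, responses, n_classes=5)
print(np.round(D, 2))
```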

Collaboration


Dive into Jintao Jiang's collaborations.

Top Co-Authors

Lynne E. Bernstein (George Washington University)
Abeer Alwan (University of California)
Jianxia Xue (University of California)
Anand A. Joshi (University of Southern California)
Dimitrios Pantazis (McGovern Institute for Brain Research)
Bosco S. Tjan (University of Southern California)
Hanna Damasio (University of Southern California)