Joseph Picone
Temple University
Publications
Featured research published by Joseph Picone.
Proceedings of the IEEE | 1993
Joseph Picone
A tutorial on signal processing in state-of-the-art speech recognition systems is presented, reviewing those techniques most commonly used. The four basic operations of signal modeling, i.e., spectral shaping, spectral analysis, parametric transformation, and statistical modeling, are discussed. Three important trends that have developed in the last five years in speech recognition are examined. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information have become common. Second, similarity transform techniques, often used to normalize and decorrelate parameters in some computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process so that more sophisticated statistical models of the signal's spectrum can be estimated in a closed-loop manner. The signal processing components of these algorithms are reviewed.
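As a rough illustration of the four-stage signal model described above, here is a minimal Python sketch; the pre-emphasis coefficient, frame sizes, and cepstral order are illustrative assumptions, not values taken from the tutorial.

```python
import numpy as np

def signal_model(speech, fs=8000, frame_ms=25, shift_ms=10, ncep=12):
    """Toy version of the four signal-modeling operations. All
    constants are illustrative assumptions, not the paper's values."""
    # 1. Spectral shaping: pre-emphasis to flatten the spectral tilt.
    shaped = np.append(speech[0], speech[1:] - 0.95 * speech[:-1])

    # 2. Spectral analysis: Hamming-windowed short-time power spectrum.
    flen = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    win = np.hamming(flen)
    frames = [shaped[i:i + flen] * win
              for i in range(0, len(shaped) - flen, shift)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 3. Parametric transformation: log-spectrum to real cepstrum.
    cep = np.fft.irfft(np.log(power + 1e-10), axis=1)[:, :ncep]

    # 4. Heterogeneous parameters: append time-derivative (delta)
    #    terms, the "dynamic" information the tutorial highlights.
    delta = np.gradient(cep, axis=0)
    return np.hstack([cep, delta])
```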
IEEE ASSP Magazine | 1990
Joseph Picone
The use of hidden Markov models (HMMs) in continuous speech recognition is reviewed. Markov models are presented as a generalization of their predecessor technology, dynamic programming. A unified view is offered in which both linguistic decoding and acoustic matching are integrated into a single, optimal network search framework. Advances in recognition architectures are discussed. The fundamentals of Viterbi beam search, the dominant search algorithm used today in speech recognition, are presented. Approaches to estimating the probabilities associated with an HMM are examined. The HMM-supervised training paradigm is examined. Several examples of successful HMM-based speech recognition systems are reviewed.
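The Viterbi beam search reviewed above can be sketched compactly. The single start state and fixed log-domain beam width below are assumptions for illustration; the article presents the algorithm at a higher level.

```python
import numpy as np

def viterbi_beam(log_A, log_B, beam=10.0):
    """Minimal Viterbi beam search sketch in the log domain.
    log_A: (S, S) log transition matrix; log_B: (S, T) per-frame
    state log likelihoods; beam: prune paths more than `beam`
    below the best score at each frame. Illustrative only."""
    S, T = log_B.shape
    score = np.full(S, -np.inf)
    score[0] = log_B[0, 0]              # assume a single start state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_A   # cand[i, j]: come from i to j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_B[:, t]
        score[score < score.max() - beam] = -np.inf  # beam pruning
    # Trace back the best state sequence.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```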
IEEE Transactions on Signal Processing | 2004
Aravind Ganapathiraju; Jonathan Hamaker; Joseph Picone
Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
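As a hedged sketch of such a static classification task, the following uses scikit-learn's SVC on synthetic stand-in data (the Deterding set has 11 vowel classes in 10 LPC-derived dimensions; the data, C, and gamma below are placeholders, not the configuration tuned in the paper).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a static vowel task: 11 classes, 10 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(90, 10)) for c in range(11)])
y = np.repeat(np.arange(11), 90)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF-kernel SVM; C and gamma govern the generalization trade-off the
# paper notes is controlled as part of the optimization itself.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```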
Journal of the Acoustical Society of America | 1995
Joseph Picone; Barbara Wheatley
A voice log-in system is based on a person's spoken name input only, using speaker-dependent acoustic name recognition models in performing speaker-independent name recognition. In an enrollment phase, a dual-pass endpointing procedure defines both the person's full name (broad endpoints) and the component names separated by pauses (precise endpoints). An HMM (hidden Markov model) recognition model generator generates a corresponding HMM name recognition model, modified by the insertion of additional skip transitions for the pauses between component names. In a recognition/update phase, a spoken-name speech signal is input to an HMM name recognition engine which performs speaker-independent name recognition; the modified HMM name recognition model permits the name recognition operation to accommodate pauses of variable duration between component names.
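The skip-transition idea can be sketched as a direct edit to a left-to-right transition matrix. The state layout and the amount of probability rerouted are assumptions; the patent's exact topology is not reproduced here.

```python
import numpy as np

def add_pause_skips(A, pause_states):
    """Given a left-to-right HMM transition matrix A, add a skip
    transition around each optional inter-name pause state so the
    model accepts pauses of any duration, including none. Assumes
    each pause state p sits between states p-1 and p+1."""
    A = A.copy()
    for p in pause_states:
        # Reroute part of the probability from the pause's predecessor
        # directly to its successor, bypassing the pause state.
        A[p - 1, p + 1] += 0.5 * A[p - 1, p]
        A[p - 1, p] *= 0.5
    # Renormalize rows so each remains a probability distribution.
    return A / A.sum(axis=1, keepdims=True)
```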
IEEE Transactions on Speech and Audio Processing | 2001
Aravind Ganapathiraju; Jonathan Hamaker; Joseph Picone; Mark Ordowski; George R. Doddington
Most large vocabulary continuous speech recognition (LVCSR) systems in the past decade have used a context-dependent (CD) phone as the fundamental acoustic unit. We present one of the first robust LVCSR systems that uses a syllable-level acoustic unit for LVCSR on telephone-bandwidth speech. This effort is motivated by the inherent limitations in phone-based approaches, namely the lack of an easy and efficient way to model long-term temporal dependencies. A syllable unit spans a longer time frame, typically three phones, thereby offering a more parsimonious framework for modeling pronunciation variation in spontaneous speech. We present encouraging results which show that a syllable-based system exceeds the performance of a comparable triphone system both in terms of word error rate (WER) and complexity. The WER of the best syllabic system reported here is 49.1% on a standard Switchboard evaluation, a small improvement over the triphone system. We also report results on a much smaller recognition task, OGI Alphadigits, which was used to validate some of the benefits syllables offer over triphones. The syllable-based system exceeds the performance of the triphone system by nearly 20%, an impressive accomplishment since the alphadigits application consists mostly of phone-level minimal pair distinctions.
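To illustrate what a syllable unit spanning roughly three phones looks like, here is a toy maximal-onset syllabifier; real systems derive syllabifications from a dictionary, and the vowel inventory below is a placeholder.

```python
VOWELS = frozenset({"aa", "ae", "ah", "eh", "ih", "iy", "uw"})

def syllabify(phones, vowels=VOWELS):
    """Toy syllabification: each vowel nucleates one syllable and
    preceding consonants attach to it (maximal onset). Illustrative
    only; not the lexicon-building procedure used in the paper."""
    syllables, onset = [], []
    for ph in phones:
        if ph in vowels:
            syllables.append(onset + [ph])
            onset = []
        else:
            onset.append(ph)
    if onset and syllables:
        syllables[-1].extend(onset)     # trailing consonants -> coda
    return ["_".join(s) for s in syllables]

# e.g. "seven" /s eh v ah n/ -> ['s_eh', 'v_ah_n']
print(syllabify(["s", "eh", "v", "ah", "n"]))
```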
International Conference on Acoustics, Speech, and Signal Processing | 2000
William Byrne; Peter Beyerlein; Juan M. Huerta; Sanjeev Khudanpur; B. Marthi; John Morgan; Nino Peterek; Joseph Picone; Dimitra Vergyri; T. Wang
We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge-based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and discriminative model combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.
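The knowledge-based mapping idea can be sketched as a hand-written table from source-language phones to their closest target-language counterparts; every entry below is an invented example, not the mapping developed in the paper.

```python
# Knowledge-based phone mapping sketch: each source-language phone is
# assigned by hand to its closest Czech phone so source acoustic
# models can stand in where target training data is scarce. These
# entries are invented illustrations only.
EN_TO_CS = {"iy": "i:", "uw": "u:", "th": "t", "ae": "e"}

def map_pronunciation(source_phones, table, fallback="sil"):
    """Rewrite a source-language pronunciation in target phones,
    substituting a fallback symbol when no mapping exists."""
    return [table.get(ph, fallback) for ph in source_phones]

print(map_pronunciation(["th", "iy", "z"], EN_TO_CS))  # ['t', 'i:', 'sil']
```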
Journal of the Acoustical Society of America | 1994
Barbara Wheatley; Joseph Picone
A name recognition system (FIG. 1) is used to provide access to a database based on voice recognition of a proper name spoken by a person who may not know the correct pronunciation of the name. During an enrollment phase (10), for each name-text entered (11) into a text database (12), text-derived recognition models (22) are created for each of a selected number of pronunciations of the name-text, with each recognition model being constructed from a respective sequence of phonetic features (15) generated by a Boltzmann machine (13). During a name recognition phase (20), the spoken input (24, 25) of a name (by a person who may not know the correct pronunciation) is compared (26) with the recognition models (22), looking for a pattern match; selection of a corresponding name-text is made based on a decision rule (28).
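The decision rule in the recognition phase can be sketched as scoring the spoken input against each name's alternative pronunciation models and selecting the best; the scores and the absence of a rejection threshold below are simplifications, not the patented rule.

```python
def recognize_name(log_likelihoods):
    """Decision-rule sketch: log_likelihoods maps each name-text to
    the scores of its alternative pronunciation models against the
    spoken input. Pick the name whose best pronunciation wins; a
    real system would also apply a rejection threshold."""
    best = {name: max(scores) for name, scores in log_likelihoods.items()}
    return max(best, key=best.get)

scores = {"Picone": [-310.2, -295.8], "Pickering": [-402.5, -398.1]}
print(recognize_name(scores))  # 'Picone'
```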
International Conference on Acoustics, Speech, and Signal Processing | 1999
Joseph Picone; S. Pike; R. Regan; T. Kamm; J. Bridle; Li Deng; Z. Ma; H. Richards; M. Schuster
Conversational speech recognition is a challenging problem primarily because speakers rarely fully articulate sounds. A successful speech recognition approach must infer intended spectral targets from the speech data, or develop a method of dealing with large variances in the data. Hidden dynamic models (HDMs) attempt to automatically learn such targets in a hidden feature space using models that integrate linguistic information with constrained temporal trajectory models. HDMs are a radical departure from conventional hidden Markov models (HMMs), which simply account for variation in the observed data. We present an initial evaluation of such models on a conversational speech recognition task involving a subset of the SWITCHBOARD corpus. We show that in an N-best rescoring paradigm, HDMs are capable of delivering performance competitive with HMMs.
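The N-best rescoring paradigm mentioned above can be sketched as interpolating a second model's score with the first-pass score and re-ranking; the interpolation weight and toy scorer below are assumptions, not the paper's setup.

```python
def rescore_nbest(nbest, new_scorer, weight=0.5):
    """N-best rescoring sketch. `nbest` is a list of (hypothesis,
    baseline_score) pairs from a first-pass HMM decoder; `new_scorer`
    returns a second model's score (e.g. an HDM) for a hypothesis.
    The interpolation weight is an assumption."""
    rescored = [(hyp, (1 - weight) * s + weight * new_scorer(hyp))
                for hyp, s in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy usage: a length penalty stands in for the second model.
nbest = [("oh yeah i know", -120.0), ("oh yeah i no", -118.5)]
print(rescore_nbest(nbest, new_scorer=lambda h: -2.0 * len(h.split())))
```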
Journal of the Acoustical Society of America | 1991
Joseph Picone; George R. Doddington
A speech encoder is disclosed that quantizes speech information with respect to energy, voicing, and pitch parameters to provide a fixed number of bits per block of frames. Coding of the parameters takes place for each N frames, which comprise a block, irrespective of phonemic boundaries. Certain frames of speech information are discarded during transmission if such information is substantially duplicated in an adjacent frame. A very low data rate transmission system is thus provided which exhibits a high degree of fidelity and throughput.
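The frame-discarding step can be sketched as keeping a frame only when its parameters differ sufficiently from the last transmitted frame; the Euclidean distance and threshold below are assumptions, not the patented criterion.

```python
import numpy as np

def select_frames(params, threshold=0.1):
    """Frame-discarding sketch: keep a frame only if its parameter
    vector (energy, voicing, pitch, ...) differs enough from the
    last kept frame; the decoder would repeat the previous frame
    for discarded ones. Distance and threshold are assumptions."""
    kept, last = [], None
    for i, frame in enumerate(params):
        if last is None or np.linalg.norm(frame - last) > threshold:
            kept.append(i)
            last = frame
    return kept  # indices of frames actually transmitted
```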
International Conference on Acoustics, Speech, and Signal Processing | 1989
Joseph Picone; George R. Doddington
The authors study the problem of coding spectral information in speech at bit rates in the range of 100-400 b/s using speaker-independent phone-based recognition. Spectral information is coded as a sequence of phonetic events and a sequence of transitions through the corresponding hidden Markov model (HMM)-based phone models. This simple phonetic speech-coding system has been shown to be a promising approach. A simple inventory of phonemes is sufficient for capturing the bulk of the acoustic information.
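A back-of-the-envelope calculation suggests why a phonetic coder lands in this rate range; all the numbers below are illustrative assumptions, not figures from the paper.

```python
import math

# Rough rate estimate for a phonetic coder. All numbers are
# illustrative assumptions.
phones_per_sec = 12          # typical speaking rate
inventory = 48               # size of the phone set
index_bits = math.ceil(math.log2(inventory))   # 6 bits per phone label
duration_bits = 5            # quantized phone duration
path_bits = 8                # coarse HMM state-path information

rate = phones_per_sec * (index_bits + duration_bits + path_bits)
print(f"approx. {rate} b/s")  # ~228 b/s, inside the 100-400 b/s range
```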