Publications
Featured research published by Lloyd A. Smith.
Journal of the Acoustical Society of America | 1979
Lloyd A. Smith; Brian L. Scott
The intelligibility of the front vowels (/i/, /ɪ/, /e/, and /æ/) was investigated as sung in four different ways: (1) operatic, (2) in CVC context, (3) with a raised larynx, and (4) with both raised larynx and in CVC context. All syllables were sung by a trained soprano at F4, A4, C♯5, F5, A5, and C♯6. Ten subjects listened to and identified randomized sets of ten tokens of each vowel per condition (method of articulation) at each pitch. Results showed that, from C♯5 (554.4 Hz) to F5 (698.5 Hz), the intelligibility of operatic vowels (condition 1) fell from 56% to 16%. The mean intelligibility of the vowels at the three highest pitches (F5, A5, C♯6) was 10% for condition 1, 64% for condition 2, 62% for condition 3, and 83% for condition 4. Probable reasons for the increased intelligibility across conditions will be discussed. The results indicate that the generally accepted notion that vowel sounds are largely unintelligible at higher pitches pertains only to a restricted manner of production.
Journal of the Acoustical Society of America | 1996
Mark A. Hall; Lloyd A. Smith
This paper describes a new algorithm that composes blues melodies to fit a given chord sequence. It comprises an analysis stage followed by a synthesis stage. The analysis stage produces a Markov model composed of zero‐, first‐, and second‐order transition tables covering both pitches and rhythms. In order to capture the relationship between harmony and melody, a set of transition tables is produced for each chord in the analyzed songs. The synthesis stage uses the output tables from analysis to generate new melodies; second‐order tables are used as much as possible, with fall‐back procedures to lower‐order tables. Some constraints are encoded in the form of rules to control the placement of rhythmic patterns within measures, pitch values for long‐duration notes, and pitch values for the start of new phrases. The model was evaluated by a listening experiment; results showed that listeners were unable to reliably distinguish human‐composed from computer‐composed melodies.
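As a rough illustration of the analysis stage, the sketch below builds per‐chord pitch transition tables of order zero, one, and two from a corpus of songs. The representation (songs as sequences of (chord, pitch) events) and all names are assumptions for illustration; the paper's rhythm tables and rule constraints are omitted.

```python
from collections import defaultdict

def analyze(songs):
    """Build per-chord pitch transition tables of order 0, 1, and 2.
    `songs` is a list of (chord, pitch) event sequences -- an assumed
    representation, not one taken from the paper."""
    order0 = defaultdict(lambda: defaultdict(int))  # chord -> pitch -> count
    order1 = defaultdict(lambda: defaultdict(int))  # (chord, prev) -> pitch -> count
    order2 = defaultdict(lambda: defaultdict(int))  # (chord, prev2, prev1) -> pitch -> count
    for song in songs:
        for i, (chord, pitch) in enumerate(song):
            order0[chord][pitch] += 1
            if i >= 1:
                order1[(chord, song[i - 1][1])][pitch] += 1
            if i >= 2:
                order2[(chord, song[i - 2][1], song[i - 1][1])][pitch] += 1
    return order0, order1, order2
```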
Journal of the Acoustical Society of America | 1989
Brian L. Scott; Lisan Lin; Mark Newell; Lloyd A. Smith
In May 1988, a speaker‐independent recognition algorithm was described at the 115th Meeting of the Acoustical Society of America [Scott et al., J. Acoust. Soc. Am. Suppl. 1 83, S55 (1988)]. The algorithm yielded 95.4% recognition accuracy on the 20‐word TI database obtained from the National Bureau of Standards. The system has since been adapted for use over the telephone. In so doing, a new database was developed consisting of 16 words (zero through nine, oh, yes, no, cancel, terminate, and help) as spoken by 11 males and 11 females from various locations across the country. Although small (1001 utterances), the database represented a significant challenge as compared to the one obtained from the NBS. There were fewer training passes per word, more speakers, and considerably more noise in the database. Initial tests on this database yielded accuracies of approximately 60%. Four major enhancements to the algorithm improved the accuracy on this database to 95.3%. Two of the enhancements compensate for tim...
Journal of the Acoustical Society of America | 1991
Dale Carnegie; Geoff Holmes; Lloyd A. Smith
An auditory model has been implemented that produces as output an in‐synchrony‐bands spectrum (SBS) [O. Ghitza, IEEE Trans. Acoust., Speech, Signal Process. ASSP‐35, 736–740 (1987)]. To reconstruct speech from this output, it is necessary to use the amplitudes and phases of the input speech as determined by an FFT earlier in the process. The model dramatically reduces (by up to 70%) the number of frequency components in the signal. The data rate of the resulting speech is determined by the number of frequency components and the number of bits used to quantize the frequencies, amplitudes, and phases. Experiments are currently being run to determine the number of bits necessary to maintain intelligibility. This paper will report the results of these experiments.
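The data‐rate claim lends itself to a back‐of‐the‐envelope calculation: each retained component costs a quantized frequency, amplitude, and phase per analysis frame. The sketch below computes the resulting bit rate; all numbers in the example are illustrative assumptions, not figures from the paper.

```python
def bit_rate(n_components, bits_freq, bits_amp, bits_phase, frame_rate):
    """Bit rate of a reduced sinusoidal representation: per frame, each
    retained frequency component carries a quantized frequency, amplitude,
    and phase. Parameter values are assumptions for illustration."""
    bits_per_frame = n_components * (bits_freq + bits_amp + bits_phase)
    return bits_per_frame * frame_rate  # bits per second

# e.g., 20 components at 8 bits each for frequency, amplitude, and phase,
# updated 100 times per second -> 48,000 bits/s
print(bit_rate(20, 8, 8, 8, 100))
```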
Journal of the Acoustical Society of America | 1991
Mark A. Hall; Lloyd A. Smith
A computer program has been written that generates blues melodies to fit input chord progressions. The program uses a combination of stochastic methods and high‐level rules. Second‐order Markov models govern the generation of pitches and rhythms, with fall‐back procedures (to first‐ and zero‐order models) used to deal with zero‐frequency problems. Pitches and rhythms are generated by independent processes; an experimental goal is to determine the degree to which these processes may be linked. The rhythm model uses rhythmic patterns of varying lengths, thus incorporating a moderately variable time scale. Long‐term time scale factors are controlled by rules operating at the phrase level. Output from the program will be judged by listeners to determine the degree to which the program captures the structure of blues melodies. Results of these listening experiments will be presented at the conference.
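A minimal sketch of the fall‐back step, assuming the per‐chord table layout from the analysis sketch above: the second‐order table is consulted first, and lower orders are tried only when the current context was never observed in training (the zero‐frequency problem).

```python
import random

def next_pitch(chord, prev2, prev1, order2, order1, order0):
    """Sample the next pitch, preferring the second-order table and falling
    back to first- and then zero-order tables when a context has no
    observations. Table layout follows the hypothetical analyze() sketch."""
    for table, key in ((order2, (chord, prev2, prev1)),
                       (order1, (chord, prev1)),
                       (order0, chord)):
        counts = table.get(key)
        if counts:
            pitches, weights = zip(*counts.items())
            return random.choices(pitches, weights=weights)[0]
    raise ValueError("chord never seen in training data")
```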
Journal of the Acoustical Society of America | 1989
Lloyd A. Smith; Brian L. Scott; Lisan S. Lin; J. M. Newell
Speech sounds generated by a simple waveform synthesizer were used to create a 16‐vector codebook for vector quantization (VQ). This codebook was used as a seed for training over the TI‐20 isolated word database, and the resulting codebooks were then tested in speech recognition. The database was split into two sets of speakers, and two codebooks were created for each set, one trained in one pass over the speech and the other trained iteratively. All codebooks were trained adaptively, resulting in codebooks of 31 vectors: 16 high‐energy and 15 low‐energy vectors. Recognition was performed using a conventional DTW algorithm. The baseline recognition accuracy (using no VQ) was 97%; the accuracy for the 16‐vector synthesized codebook was 91.8%. Accuracy using trained codebooks ranged from 94% (two‐sided VQ) to 95.1% (one‐sided VQ with a normalized match distance). Crossing speaker sets (resulting in speaker‐independent codebooks) yielded 94.9% recognition accuracy. The results indicate that speech sy...
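For readers unfamiliar with VQ, the sketch below shows the generic quantization step: each feature frame is replaced by the index of its nearest codebook vector. The Euclidean distance and NumPy representation are assumptions; the proprietary front‐end features and the normalized match distance used in the paper are not specified here.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector
    under Euclidean distance -- a generic VQ step, not the paper's exact
    distance measure.

    frames: array of shape (n_frames, dim); codebook: (n_codes, dim)."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)
```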
Journal of the Acoustical Society of America | 1989
Brian L. Scott; Lloyd A. Smith; Lisan S. Lin; J. Mark Newell
Perhaps the most difficult features for speech recognition systems to properly capture and align are the transitional cues that exist between phones. There are advantages and disadvantages associated with both linear time normalization methods and dynamic time warping (DTW). Linear methods preserve the relative durations of transitional and steady‐state portions of the signal but tend to smear features and lose resolution. DTW retains resolution but loses durational relations. The present algorithm attempts to compensate for the smearing associated with linear normalization by copying transitional regions of the utterance to additional locations within the representation. This process involves locating the onset and offset of the syllable nucleus based on the normalized representation and copying two frames immediately surrounding the onset and offset to another location in the representation. The process enhances performance of the recognizer in two ways. First of all, the redundancy serves to weight mor...
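A minimal sketch of the frame‐copying idea, under the assumption that the copied frames are appended to the end of the normalized representation (the abstract says only "another location"):

```python
import numpy as np

def emphasize_transitions(rep, onset, offset):
    """Copy the two frames immediately surrounding the syllable-nucleus
    onset and offset to the end of a linearly normalized representation,
    so transitional cues carry extra weight in matching. Appending is an
    assumed placement; rep has shape (n_frames, dim)."""
    onset_frames = rep[max(onset - 1, 0):onset + 1]
    offset_frames = rep[offset:min(offset + 2, len(rep))]
    return np.vstack([rep, onset_frames, offset_frames])
```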
Journal of the Acoustical Society of America | 1989
Lloyd A. Smith; Brian L. Scott; Lisan S. Lin; J. M. Newell
A method is described for automated training of a speaker‐independent isolated word recognizer. The training process generates vocabulary templates from a database of collected training utterances. These templates are then modified through adaptive training, an iterative process of testing and modifying templates in order to optimize recognition. Robustness of the templates is enhanced by varying the presentation of the collected utterances during adaptation; varying the utterance sampling rate, for example, has the effect of presenting the same utterance at differing pitches and time scales. Adaptive training continues until the error rate falls to an acceptable level. Results will be presented for similar vocabularies developed with and without adaptation and under varying adaptation conditions.
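The adaptive training loop might be sketched as below. Here `recognize` and `update` are hypothetical stand‐ins for the system's recognizer and template‐modification step, and the 5% stopping threshold is an arbitrary illustration of "an acceptable level".

```python
def adaptive_training(templates, training_set, recognize, update,
                      target_error=0.05, max_iters=20):
    """Iterative adaptive training: test the templates, then modify the
    template of each misrecognized word, until the error rate is low
    enough or an iteration cap is hit.

    training_set is a list of (word, utterance) pairs; recognize and
    update are caller-supplied placeholders for the actual system."""
    for _ in range(max_iters):
        errors = [(word, utt) for word, utt in training_set
                  if recognize(templates, utt) != word]
        if len(errors) <= target_error * len(training_set):
            break
        for word, utt in errors:
            templates[word] = update(templates[word], utt)
    return templates
```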
Journal of the Acoustical Society of America | 1988
Brian L. Scott; Lisan Lin; Mark Newell; Lloyd A. Smith
The algorithms described have yielded speaker‐independent scores of 95.1% on the 20‐word TI database obtained from the National Bureau of Standards. Results were obtained by training the system on half of the speakers in the database, testing on the other half, and then reversing the order. Training was done with the 10 training tokens per speaker per word only. Testing was on the 16 test tokens per speaker per word. The total number of test trials was 5120. The recognizer uses conventional methods for time normalization and matching. Time normalization is linear and scoring is accomplished with a simple differencing algorithm weighted by variances. Storage requirement is 3072 bits per word. Most of the speaker normalization is accomplished by the proprietary signal processing method developed by Scott Instruments. Aside from the amplitude normalization routines, no floating point arithmetic is used. All signal processing is temporally based. The front end process can be adapted for use with dynamic time ...
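The scoring method described (a simple differencing algorithm weighted by variances) admits a short sketch. Dividing squared differences by per‐feature variances, so that low‐variance features count more, is one plausible reading of the weighting, not necessarily the one actually used.

```python
import numpy as np

def weighted_difference_score(utterance, template, variances):
    """Score a linearly time-normalized utterance against a stored template
    by summing squared differences divided by per-feature variances
    (an assumed form of the variance weighting). Lower score = better match."""
    return np.sum((utterance - template) ** 2 / variances)
```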
Journal of the Acoustical Society of America | 1988
Lloyd A. Smith
Speech sounds generated by a simple waveform synthesizer were used to create a vector quantization codebook for use in speech recognition. Recognition was tested over the TI‐20 isolated word database using a conventional DTW matching algorithm. Input speech was band‐limited to 300–3300 Hz, then passed through the Scott Instruments Corp. Coretechs process, implemented on a VET3 speech terminal, to create the speech representation for matching. Synthesized sounds were processed in software by a VET3 signal processing emulation program. Emulation and recognition were performed on a DEC VAX 11/750. The experiments were organized in two series. A preliminary experiment, using no vector quantization, provided a baseline for comparison. The original codebook contained 109 vectors, all derived from two‐formant synthesized sounds. This codebook was decimated through the course of the first series of experiments, based on the number of times each vector was used in quantizing the training data for the previous experiment, in order to determine the smallest subset of vectors suitable for coding the speech database. The second series of experiments altered several test conditions in order to evaluate the applicability of the minimal synthesized codebook to conventional codebook training. The baseline recognition rate was 97%. The recognition rate for synthesized codebooks was approximately 92% for sizes ranging from 109 to 16 vectors. Accuracy for smaller codebooks was slightly less than 90%. Error analysis showed that the primary loss in dropping below 16 vectors was in the coding of voiced sounds with high‐frequency second formants. The 16‐vector synthesized codebook was chosen as the seed for the second series of experiments. After one training iteration, and using a normalized distortion score, trained codebooks performed with an accuracy of 95.1%. When codebooks were trained and tested on different sets of speakers, accuracy was 94.9%, indicating that very little speaker dependence was introduced by the training.
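The decimation procedure might look like the sketch below: after each experiment, only the vectors most used in quantizing the training data are retained. The exact selection rule is an assumption; the abstract does not spell it out.

```python
import numpy as np

def decimate_codebook(codebook, usage_counts, keep):
    """Drop the least-used vectors from a codebook, keeping the `keep`
    vectors most frequently used in quantizing the previous experiment's
    training data (an assumed reading of the decimation step)."""
    order = np.argsort(usage_counts)[::-1]   # indices, most-used first
    return codebook[np.sort(order[:keep])]   # keep, preserving original order
```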