Kjell Elenius
Royal Institute of Technology
Publications
Featured research published by Kjell Elenius.
international conference on acoustics, speech, and signal processing | 1982
Kjell Elenius; Mats Blomberg
A dynamic programming pattern matching isolated word recognition system has been modified in order to emphasize the transient parts of speech in the similarity measure. The technique is to weight the word distances with a normalized spectral change function. A small positive effect is measured. Emphasizing the stationary parts is shown to substantially decrease the performance. Adding the time derivative of the speech parameters to the word patterns improves performance significantly. This is probably a consequence of an improvement in the description of the transient segments.
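The distance weighting described in this abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the exact form of the spectral change function, and the way the weight enters the frame distance are all assumptions.

```python
import numpy as np

def spectral_change(frames):
    """Normalized spectral change per frame: the magnitude of the
    frame-to-frame spectral difference, scaled to [0, 1]."""
    d = np.r_[0.0, np.linalg.norm(np.diff(frames, axis=0), axis=1)]
    return d / max(d.max(), 1e-12)

def dtw_distance(ref, test, weights):
    """Dynamic programming (DTW) word distance in which each frame
    distance is weighted, e.g. by the spectral-change function of the
    test frame, so transient parts dominate the similarity measure."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = weights[j - 1] * np.linalg.norm(ref[i - 1] - test[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Appending frame-to-frame parameter differences (time derivatives) to `ref` and `test` before calling `dtw_distance` corresponds to the delta-parameter extension the abstract reports as the larger improvement.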
International Journal of Speech Technology | 2000
Kjell Elenius
The EU-funded SpeechDat project was initiated in order to create large-scale speech databases for the development of voice-operated telecommunication services. This paper deals with the design of two such Swedish resources: 5000 speakers recorded over the fixed telephone network and 1000 speakers over the mobile network. Speakers were balanced according to gender, age and dialect. We also report on experiences from speaker recruitment. A “snowball” method, in which people gave addresses of friends according to a chain-letter principle, was shown to be effective. Females were, in general, more cooperative than males. However, using the Internet for recruiting favored young males. Statistics on speaker distribution are presented. Results regarding orthographic labeling of pronunciation, pronunciation errors and non-speech events are also included. The length of the longest word in a read sentence is shown to be directly correlated with mispronunciations and word repetitions.
Recent Research Towards Advanced Man-Machine Interface Through Spoken Language | 1996
Mats Blomberg; Rolf Carlson; Kjell Elenius; Björn Granström; Sheri Hunnicutt
This chapter reports on some experiments that are part of a long-term project toward a knowledge-based speech recognition system, NEBULA. These experiments take the extreme position of comparing human speech to predicted pronunciations at the acoustic level with the help of a straightforward pattern-matching technique. The significantly better results when human references are used were not a surprise: it is well known that text-to-speech systems still need more work before they reach human quality. However, the results can be regarded as encouraging. The low levels of NEBULA explore the descriptive power of cues, and use multiple cues to analyze, classify, and segment the speech wave. The mid-portion of NEBULA is currently represented by a syntactic component of the text-to-speech system, morphological decomposition in the text-to-speech system, and a concept-to-speech system.
Journal of the Acoustical Society of America | 1988
Mats Blomberg; Rolf Carlson; Kjell Elenius; Björn Granström; Sheri Hunnicutt
A major problem in large‐vocabulary speech recognition is the collection of reference data and speaker normalization. In this paper, the use of synthetic speech is proposed as a means of handling this problem. An experimental scheme for such a speech recognition system will be described. A rule‐based speech synthesis procedure is used for generating the reference data. Ten male subjects participated in an experiment using a 26‐word test vocabulary recorded in a normal office room. The subjects were asked to read the words from a list with little instruction except to pronounce each word separately. The synthesis was used to build the references. No adjustments were made to the synthesis in this first stage. All the human speakers served better as references than the synthesis did. Differences between natural and synthetic speech have been analyzed in detail at the segmental level. Methods for updating the synthetic speech parameters from natural speech templates will be described. [This work has been supported by the Swedish Board of Technical Development.]
international conference on spoken language processing | 1996
Mats Blomberg; Kjell Elenius
With limited training data, infrequent triphone models for speech recognition are not observed in sufficient numbers. In this paper, a speech production approach is used to predict the characteristics of unseen triphones by concatenating diphones and/or monophones in the parametric representation of a formant speech synthesiser. The parameter trajectories are estimated by interpolation between the endpoints of the original units. The spectral states of the created triphone are generated by the speech synthesiser. Evaluation of the proposed technique has been performed using spectral error measurements and recognition candidate rescoring of N-best lists. In both cases, the created triphones are shown to perform better than the shorter units from which they were constructed.
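The interpolation step described above can be sketched as follows, assuming each unit is a sequence of synthesiser parameter vectors. The function names, the linear form of the interpolation, and the three-frame joint are illustrative assumptions, not details from the paper.

```python
import numpy as np

def interpolate_transition(left_end, right_start, n_frames):
    """Linearly interpolate synthesiser parameter vectors between the
    final frame of one unit and the first frame of the next, giving a
    predicted trajectory for the unseen transition."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - t) * left_end + t * right_start

def predict_triphone(diphone_a, diphone_b, n_join=3):
    """Concatenate two diphones, filling the joint with interpolated
    frames; the synthesiser would then render spectral states from
    the resulting parameter trajectory."""
    joint = interpolate_transition(diphone_a[-1], diphone_b[0], n_join)
    return np.vstack([diphone_a, joint, diphone_b])
```

In the paper's setting, the concatenated parameter trajectories are passed through the formant synthesiser to generate spectral states for the unseen triphone model.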
international conference on acoustics, speech, and signal processing | 1990
Kjell Elenius
The orthographic structure of Swedish words was used for predicting word class using a connectionist approach. This technique can be used to aid syntactic processing within a text-to-speech system. The error backpropagation technique was used for the connectionist learning procedure. A corpus of the 10000 most frequent Swedish words was used for training and testing the system. The results indicate that around 80% of the words can be correctly classified by using the last part of each word. The system is compared to a rule-based system that makes the same sort of predictions from word endings. The two systems give comparable results for the corpus used.
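The idea of classifying from word endings can be sketched as follows. The paper used a multilayer back-propagation network; here a single-layer softmax classifier trained by gradient descent stands in as a minimal example, and the suffix length, alphabet handling, and all names are assumptions.

```python
import numpy as np

def suffix_features(word, n=3, alphabet="abcdefghijklmnopqrstuvwxyzåäö"):
    """One-hot encode the last n characters of a word (left-padded),
    mirroring the use of word endings as the predictive cue."""
    word = ("_" * n + word.lower())[-n:]
    x = np.zeros(n * (len(alphabet) + 1))
    for i, ch in enumerate(word):
        idx = alphabet.find(ch) + 1  # slot 0 = padding/unknown
        x[i * (len(alphabet) + 1) + idx] = 1.0
    return x

def train_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Single-layer softmax classifier trained by full-batch gradient
    descent; a minimal stand-in for the back-propagation network."""
    rng = np.random.default_rng(0)
    W = rng.normal(0.0, 0.01, (X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        z = X @ W
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - onehot) / len(X)
    return W
```

Predicted word classes are then `np.argmax(X @ W, axis=1)` for a matrix `X` of suffix features.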
Journal of the Acoustical Society of America | 1978
Mats Blomberg; Kjell Elenius
The principal objective of using a phonetic approach is to reduce the influence of intra‐ and inter‐speaker speech variation on the recognition rate. The system is implemented on a 16‐K minicomputer and uses a filter bank delivering spectral sections, 0–5 kHz, every 10 ms. Estimates of the first three formants are calculated, and energies in different spectral bands are used to segment the speech signal into broad classes. The following measures are calculated depending on the segmental class and the speech parameter: mean values, steady‐state values, durations, transition rates, and some distances between formants. In a learning phase, the statistics of the measures for the vocabulary used are automatically calculated by a program given the quasiphonetic spelling of the input words. The statistics are based on phoneme pairs, i.e., diphones. In the recognition phase the program uses the statistics and the quasiphonetic spelling to recognize the input words. Six male speakers were used for calculating the statistics of a 41‐word vocabulary. Their mean recognition rate was 98%, using a new recording. The rate decreased to 96.3% using four male talkers unknown to the system. [Work supported by STU, Sweden.]
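The recognition phase described above can be sketched as a statistical scoring of measured values against per-diphone statistics. Assuming independent Gaussian statistics (mean, variance) per diphone and one measure per diphone, a word score might look as follows; the abstract does not specify the scoring rule, so this is purely illustrative.

```python
import math

def word_log_likelihood(measures, spelling, stats):
    """Score a word: sum Gaussian log-likelihoods of the measured
    values against per-diphone (mean, variance) statistics collected
    in the learning phase. All names and the Gaussian assumption
    are illustrative."""
    score = 0.0
    for diphone, x in zip(zip(spelling, spelling[1:]), measures):
        mu, var = stats[diphone]
        score += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return score
```

Recognition then amounts to picking the vocabulary word whose quasiphonetic spelling yields the highest score for the observed measures.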
international conference on artificial neural networks | 1991
Kjell Elenius; György Takács
An artificial neural network has been trained to recognize phonemes using the error back-propagation technique. First, a coarse feature network was trained to extract seven quasi-phonetic features from the spectral frames of a Bark-scaled filter bank. The outputs of this net and the spectral outputs of the filter bank were input to a phoneme recognition net. A seven-frame-wide window of the feature net output was used to include the context of the frame being classified. Both Swedish and Hungarian speech material was used; the following results are for Hungarian. The coarse features were recognized with 80%–93% accuracy, and the performance was shown to be relatively insensitive to changing speaker or language. The frame-level phone recognition rate was 55%. Using manual segmentation, the phone recognition rate was 64%, and in 82% of the cases the correct phone was among the best three phoneme candidates.
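The seven-frame context window can be sketched as a simple stacking operation over the feature-net output. The edge-padding strategy (repeating the first and last frames) is an assumption; the abstract does not say how boundaries were handled.

```python
import numpy as np

def context_window(features, width=7):
    """Stack each frame with its neighbours (a width-frame window,
    edge frames padded by repetition) so the phoneme net sees the
    context of the frame being classified."""
    half = width // 2
    padded = np.vstack([features[:1]] * half + [features] + [features[-1:]] * half)
    return np.hstack([padded[i:i + len(features)] for i in range(width)])
```

Row `j` of the result holds frames `j - 3 … j + 3` side by side, which is then fed to the phoneme recognition net together with the raw filter-bank outputs.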
international conference on acoustics, speech, and signal processing | 1986
Mats Blomberg; Kjell Elenius
A technique of nonlinear frequency warping has been investigated for recognition of Swedish vowels. A frequency warp between two spectra is computed using a standard dynamic programming algorithm. The frequency distance, defined as the area between the obtained warping function and the diagonal, contributes to the spectral distance. The distance between two spectra is a weighted sum of the warped amplitude distance and the frequency distance. By changing two weights, we get a gradual shift between non-warped amplitude distance, warped amplitude distance, and frequency distance. In recognition experiments on natural and synthetic vowel spectra, a metric combining the frequency and amplitude distances gave better results than using only amplitude or frequency deviation. Analysis of the results for the synthetic vowels shows a reduced sensitivity to voice source and pitch variation. For the natural vowels, the recognition improvement is larger for the male and female speakers separately than for the combined group.
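The combined distance described above can be sketched as follows: a DP alignment along the frequency axis yields a warped amplitude distance, the warp path's deviation from the diagonal yields a frequency distance, and the two are combined with weights. The local DP constraints, path normalization, and names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def frequency_warp_distance(a, b, w_amp=1.0, w_freq=1.0):
    """Dynamic programming alignment along the frequency axis of two
    spectra a and b; returns a weighted sum of the warped amplitude
    distance and a frequency distance (mean deviation of the warp
    path from the diagonal)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack to recover the warping function
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i, j))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    freq_dist = sum(abs(i - j) for i, j in path) / len(path)
    amp_dist = D[n, m] / len(path)
    return w_amp * amp_dist + w_freq * freq_dist
```

Setting `w_freq=0` recovers a pure warped amplitude distance, while shifting weight toward `w_freq` emphasizes the shape of the warping function itself, mirroring the gradual shift described in the abstract.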
international conference on spoken language processing | 2006
Daniel Neiberg; Kjell Elenius; Kornel Laskowski