Torbjørn Svendsen
Norwegian University of Science and Technology
Publications
Featured research published by Torbjørn Svendsen.
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Sabato Marco Siniscalchi; Dau-Cheng Lyu; Torbjørn Svendsen; Chin-Hui Lee
A state-of-the-art automatic speech recognition (ASR) system can often achieve high accuracy for most spoken languages of interest if a large amount of speech material can be collected and used to train a set of language-specific acoustic phone models. However, designing good ASR systems with little or no language-specific speech data for resource-limited languages is still a challenging research topic. As a consequence, there has been an increasing interest in exploring knowledge sharing among a large number of languages so that a universal set of acoustic phone units can be defined to work for multiple or even for all languages. This work aims at demonstrating that a recently proposed automatic speech attribute transcription framework can play a key role in designing language-universal acoustic models by sharing speech units among all target languages at the acoustic phonetic attribute level. The language-universal acoustic models are evaluated through phone recognition. It will be shown that good cross-language attribute detection and continuous phone recognition performance can be accomplished for “unseen” languages using minimal training data from the target languages to be recognized. Furthermore, a phone-based background model (PBM) approach will be presented to improve attribute detection accuracies.
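The attribute-level sharing idea can be illustrated with a minimal sketch (the attribute inventory, phone symbols, and scores below are hypothetical and do not come from the paper): each phone, whatever its language, is described by a vector of articulatory attributes, so phones of an unseen language can be scored directly from the outputs of language-universal attribute detectors.

```python
import numpy as np

# Hypothetical universal attribute inventory shared across all languages.
ATTRIBUTES = ["voiced", "nasal", "fricative", "stop", "vowel", "front", "back"]

# Hypothetical attribute templates for a few phones of an "unseen" target
# language: 1.0 = attribute expected active, 0.0 = expected inactive.
PHONE_TO_ATTR = {
    "m": np.array([1, 1, 0, 0, 0, 0, 0], dtype=float),
    "s": np.array([0, 0, 1, 0, 0, 0, 0], dtype=float),
    "i": np.array([1, 0, 0, 0, 1, 1, 0], dtype=float),
}

def score_phones(attr_posteriors):
    """Score every phone of the target language from the per-frame attribute
    posteriors produced by language-universal attribute detectors."""
    scores = {}
    for phone, template in PHONE_TO_ATTR.items():
        # Log-probability of the frame under the phone's attribute template,
        # treating the attributes as independent Bernoulli events.
        p = template * attr_posteriors + (1 - template) * (1 - attr_posteriors)
        scores[phone] = float(np.sum(np.log(p + 1e-10)))
    return scores

# One frame of (made-up) detector outputs, one posterior per attribute.
frame = np.array([0.9, 0.8, 0.1, 0.05, 0.2, 0.1, 0.1])
scores = score_phones(frame)
print(max(scores, key=scores.get))   # most plausible phone for this frame
```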
International Conference on Acoustics, Speech, and Signal Processing | 2008
Sabato Marco Siniscalchi; Torbjørn Svendsen; Chin-Hui Lee
In recent research, we have proposed a high-accuracy, bottom-up, detection-based paradigm for continuous phone recognition. The key component of our system was a bank of articulatory detectors, each of which computes a score describing the activation level of a specified phonetic feature in the current frame. In this work, we present our first attempt at designing a universal phone recognizer using the detection-based approach. We show that our technique is intrinsically language independent, since reliable articulatory detectors can be designed for diverse languages and robust detection can be performed across languages. Moreover, a universal set of detectors is designed by sharing the training material available for several diverse languages. We further demonstrate that our approach makes it possible to decode new target languages without either retraining or applying acoustic adaptation techniques. We report phone recognition performance that compares favorably with the best results known to the authors on the OGI Multi-language Telephone Speech corpus.
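A shared-detector design of this kind might be sketched as follows (the corpus names, feature dimensions, and a generic MLP stand in for the actual detectors; everything shown is illustrative, not the authors' setup): frames from several languages are pooled into one training set so that a single attribute detector can later score a language it never saw.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def load_frames(language):
    """Placeholder for loading feature frames and frame-level binary labels
    (attribute present / absent) for one language; random data stands in here."""
    X = rng.normal(size=(500, 39))       # e.g. 39-dimensional MFCC-like features
    y = rng.integers(0, 2, size=500)     # e.g. "voiced" attribute labels
    return X, y

# Pool the training material of several languages into one training set, so a
# single detector learns a language-universal notion of the attribute.
pooled = [load_frames(lang) for lang in ["english", "german", "mandarin", "spanish"]]
X_pool = np.vstack([X for X, _ in pooled])
y_pool = np.concatenate([y for _, y in pooled])

detector = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200)
detector.fit(X_pool, y_pool)

# The same detector can then score frames of a language never seen in training.
X_unseen, _ = load_frames("norwegian")
activation = detector.predict_proba(X_unseen)[:, 1]   # per-frame attribute score
```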
IEEE Automatic Speech Recognition and Understanding Workshop | 2007
Sabato Marco Siniscalchi; Torbjørn Svendsen; Chin-Hui Lee
We present a novel approach to designing bottom-up automatic speech recognition (ASR) systems. The key component of the proposed approach is a bank of articulatory attribute detectors implemented with feed-forward artificial neural networks (ANNs). Each detector computes a score describing the activation level of the specified speech attributes in the current frame. These cues are first combined by an event merger that provides evidence about the presence of a higher-level feature, which is then verified by an evidence verifier to produce hypotheses at the phone or word level. We evaluate several configurations of the proposed system on a continuous phone recognition task. Experimental results on the TIMIT database show that the system achieves a phone error rate of 25%, which is superior to results obtained with either hidden Markov model (HMM) or conditional random field (CRF) based recognizers. We believe the system's inherent flexibility and the ease of adding new detectors may provide further improvements.
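The detector bank, event merger, and evidence verifier chain can be caricatured with a small functional sketch (all weights, dimensions, and the threshold are invented for illustration; the real system uses trained ANNs at each stage):

```python
import numpy as np

def detector_bank(frame, detectors):
    """Each detector returns an activation score in [0, 1] for one articulatory
    attribute (e.g. voicing, nasality, frication) of the current frame."""
    return np.array([d(frame) for d in detectors])

def event_merger(attribute_scores, merge_weights):
    """Combine low-level attribute scores into evidence for higher-level events
    (here: phone classes), e.g. with a linear layer followed by a softmax."""
    logits = merge_weights @ attribute_scores
    e = np.exp(logits - logits.max())
    return e / e.sum()

def evidence_verifier(phone_evidence, prior, threshold=0.5):
    """Accept or reject the merged hypothesis, e.g. by checking the posterior
    of the best phone class against a threshold."""
    posterior = phone_evidence * prior
    posterior /= posterior.sum()
    best = int(np.argmax(posterior))
    return (best, posterior[best]) if posterior[best] >= threshold else (None, posterior[best])

# Toy usage: three dummy sigmoid detectors, four phone classes, one frame.
rng = np.random.default_rng(1)
detectors = [lambda x, w=w: 1.0 / (1.0 + np.exp(-x @ w)) for w in rng.normal(size=(3, 13))]
frame = rng.normal(size=13)
scores = detector_bank(frame, detectors)
evidence = event_merger(scores, rng.normal(size=(4, 3)))
print(evidence_verifier(evidence, prior=np.ones(4) / 4))
```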
Neurocomputing | 2014
Sabato Marco Siniscalchi; Torbjørn Svendsen; Chin-Hui Lee
An artificial neural network (ANN) is a powerful mathematical framework used either to model complex relationships between inputs and outputs or to find patterns in data. It is based on an interconnected group of artificial neurons, and it employs a connectionist approach to computation when processing information. ANNs have been used successfully for a great variety of applications, such as decision making, quantum chemistry, radar systems, face identification, gesture recognition, handwritten text recognition, medical diagnosis, financial applications, robotics, data mining, and spam filtering. In the speech community, neural architectures have been used since the beginning of the 1980s, and ANNs have proven useful for several speech processing tasks, e.g., extracting linguistically motivated features, performing speech detection, and generating local scores to be used for different goals. In recent years, there has been renewed interest in the use of ANNs for speech applications due to a major advance made in pre-training the weights of deep neural networks (DNNs). It seems that a new trend to move speech technology forward through the use of ANNs has begun, and it can therefore be instructive to review key ANN applications in automatic speech processing. In this paper, several ANN-based applications for speech processing will be presented, ranging from speech attribute extraction to phoneme estimation and/or classification. Furthermore, it will be shown that ANNs play a key role in several important speech applications, such as large vocabulary continuous speech recognition (LVCSR) and automatic language recognition. The goal of the paper is to summarize chief ANN approaches to speech processing using the experience gathered over the last seven years in our laboratories.
IEEE Transactions on Audio, Speech, and Language Processing | 2013
Sabato Marco Siniscalchi; Torbjørn Svendsen; Chin-Hui Lee
A novel bottom-up decoding framework for large vocabulary continuous speech recognition (LVCSR) with a modular search strategy is presented. Weighted finite state machines (WFSMs) are utilized to accomplish stage-by-stage acoustic-to-linguistic mappings from low-level speech attributes to high-level linguistic units in a bottom-up manner. Probabilistic attribute and phone lattices are used as intermediate vehicles to facilitate knowledge integration at different levels of the speech knowledge hierarchy. The final decoded sentence is obtained by performing lexical access and applying syntactic constraints. Two key factors are critical to warrant high recognition accuracy: (i) generation of high-precision sets of competing hypotheses at every intermediate stage; and (ii) low-error pruning of unlikely theories to reduce input lattice sizes while maintaining high-quality hypotheses for the next layers of knowledge integration. The decoupled nature of the proposed techniques allows us to obtain recognition results at all stages, including the attribute, phone, and word levels, and enables an integration of knowledge sources not easily achieved in state-of-the-art hidden Markov model (HMM) systems based on top-down knowledge integration. Evaluation on the Nov92 test set of the 5000-word Wall Street Journal task demonstrates that high-accuracy attribute and phone classification can be attained. As for word recognition, the proposed WFSM-based framework achieves encouraging word error rates. Finally, by combining attribute scores with conventional HMM likelihood scores and re-ordering the N-best lists obtained from the word lattices generated with the proposed WFSM system, the word error rate (WER) can be further reduced.
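The stage-by-stage idea, pruning an intermediate lattice and then performing lexical access on what survives, can be caricatured with a toy phone-lattice example (the lattice format, lexicon, and scores are invented for illustration and do not reflect the actual WFSM machinery of the paper):

```python
# A phone lattice as a list of (start_frame, end_frame, phone, log_score) arcs.
phone_lattice = [
    (0, 10, "k", -1.2), (0, 10, "g", -4.5),
    (10, 22, "ae", -0.8), (10, 22, "eh", -3.9),
    (22, 30, "t", -1.0), (22, 30, "d", -5.2),
]

lexicon = {"cat": ["k", "ae", "t"], "cad": ["k", "ae", "d"], "get": ["g", "eh", "t"]}

def prune(lattice, beam=3.0):
    """Stage-wise pruning: within each time span, drop arcs whose score falls
    more than `beam` below the best competing arc."""
    best = {}
    for s, e, _, score in lattice:
        best[(s, e)] = max(best.get((s, e), float("-inf")), score)
    return [a for a in lattice if a[3] >= best[(a[0], a[1])] - beam]

def lexical_access(lattice, lexicon):
    """Map the pruned phone lattice to word hypotheses with accumulated scores."""
    spans = sorted({(s, e) for s, e, _, _ in lattice})
    arcs = {(s, e): {p: sc for s2, e2, p, sc in lattice if (s2, e2) == (s, e)}
            for s, e in spans}
    hyps = []
    for word, prons in lexicon.items():
        if len(prons) == len(spans) and all(p in arcs[span] for p, span in zip(prons, spans)):
            hyps.append((word, sum(arcs[span][p] for p, span in zip(prons, spans))))
    return sorted(hyps, key=lambda h: -h[1])

print(lexical_access(prune(phone_lattice), lexicon))
```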
IEEE Automatic Speech Recognition and Understanding Workshop | 2011
Luis Javier Rodriguez-Fuentes; Mikel Penagarikano; Amparo Varona; Mireia Diez; Germán Bordel; David Martinez; Jesús Villalba; Antonio Miguel; Alfonso Ortega; Eduardo Lleida; Alberto Abad; Oscar Koller; Isabel Trancoso; Paula Lopez-Otero; Laura Docio-Fernandez; Carmen García-Mateo; Rahim Saeidi; Mehdi Soufifar; Tomi Kinnunen; Torbjørn Svendsen; Pasi Fränti
The best language recognition performance is commonly obtained by fusing the scores of several heterogeneous systems. Regardless of the fusion approach, it is assumed that different systems may contribute complementary information, either because they are developed on different datasets, or because they use different features or different modeling approaches. Most authors apply fusion as a final resource for improving performance based on an existing set of systems. Although relative performance gains decrease as larger sets of systems are considered, the best performance is usually attained by fusing all available systems, which may lead to high computational costs. In this paper, we aim to discover which technologies combine best through fusion and to analyse the factors (data, features, modeling methodologies, etc.) that may explain such good performance. Results are presented and discussed for a number of systems provided by the participating sites and the organizing team of the Albayzin 2010 Language Recognition Evaluation. We hope the conclusions of this work will help research groups make better decisions in developing language recognition technology.
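A common fusion recipe (not necessarily the exact backend used in this evaluation) is to learn a linear combination of per-system scores on held-out development trials, for example with logistic regression; a minimal sketch with made-up scores follows, where the number of trials and noise levels are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical scores from three heterogeneous subsystems (e.g. acoustic,
# phonotactic, prosodic) for 200 development trials, plus the true
# target / non-target label of each trial.
n_trials = 200
labels = rng.integers(0, 2, size=n_trials)
system_scores = np.column_stack([
    labels + rng.normal(scale=1.0, size=n_trials),   # system 1 (most reliable)
    labels + rng.normal(scale=1.5, size=n_trials),   # system 2
    labels + rng.normal(scale=2.0, size=n_trials),   # system 3 (noisiest)
])

# Train the fusion on development data ...
fusion = LogisticRegression()
fusion.fit(system_scores, labels)

# ... then apply it to new trials: the fused score is a weighted sum of the
# individual system scores plus a bias, the weights reflecting how much
# complementary information each system contributes.
print("fusion weights:", fusion.coef_[0], "bias:", fusion.intercept_[0])
```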
International Conference on Signal Processing | 2004
Torbjørn Svendsen
Written text is based on an orthographic representation of words, i.e., linear sequences of letters. Modern speech technology (automatic speech recognition and text-to-speech synthesis) is based on phonetic units representing the realization of sounds. A mapping between the orthographic form and the phonetic forms representing the pronunciation is thus required. This may be obtained by creating pronunciation lexica and/or rule-based systems for grapheme-to-phoneme conversion. Traditionally, this mapping has been obtained manually, based on phonetic and linguistic knowledge. This approach has a number of drawbacks: i) the pronunciations represent typical pronunciations and have a limited capacity for describing pronunciation variation due to speaking style and dialectal/accent variations; ii) if multiple pronunciation variants are included, it does not indicate which variants are more significant for the specific application; iii) the description is based on phonetic knowledge and does not take into account that the units used in speech technology may deviate from the phonetic interpretation; and iv) the description is limited to units with a linguistic interpretation. The paper will present and discuss methods for modeling pronunciation and pronunciation variation specifically for applications in speech technology.
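What such a pronunciation model has to provide can be sketched as a weighted lexicon with a rule-based fallback (the entries, phone symbols, probabilities, and letter-to-sound rules below are invented for illustration, not taken from the paper):

```python
# A pronunciation lexicon with multiple variants per word, each weighted by a
# probability estimated from data (e.g. from aligned speech), rather than a
# single canonical, knowledge-based form.
lexicon = {
    "and": [(["ae", "n", "d"], 0.55), (["ax", "n"], 0.30), (["en"], 0.15)],
    "data": [(["d", "ey", "t", "ax"], 0.7), (["d", "ae", "t", "ax"], 0.3)],
}

# Very small fallback letter-to-sound rules for out-of-vocabulary words.
letter_to_sound = {"a": "ax", "b": "b", "d": "d", "n": "n", "t": "t"}

def pronunciations(word):
    """Return weighted pronunciation variants; fall back to rules for OOVs."""
    if word in lexicon:
        return lexicon[word]
    phones = [letter_to_sound.get(ch, "sil") for ch in word]
    return [(phones, 1.0)]

print(pronunciations("and"))
print(pronunciations("bat"))   # OOV -> rule-based fallback
```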
International Conference on Acoustics, Speech, and Signal Processing | 2009
Sabato Marco Siniscalchi; Torbjørn Svendsen; Chin-Hui Lee
Large vocabulary continuous speech recognition (LVCSR) systems decode the input speech using diverse information sources, such as acoustic, lexical, and linguistic knowledge. Although most unreliable hypotheses are pruned during the recognition process, current state-of-the-art systems often make errors that are “unreasonable” to human listeners. Several studies have shown that a proper integration of acoustic-phonetic information can help reduce such errors. We have previously shown that high-accuracy phone recognition can be achieved if a bank of speech attribute detectors is used to compute confidence scores describing the attribute activation levels exhibited by the current frame. In those experiments, the phone recognition system did not rely on a language model to impose word-sequence constraints, and the vocabulary was small. In this work, we extend our approach to LVCSR by introducing a second recognition step that incorporates additional information not directly used in conventional log-likelihood based decoding. Experimental results show promising performance.
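The second-pass idea can be sketched as a simple N-best rescoring step (the hypotheses, scores, and interpolation weight below are invented, and the attribute score is a dummy lookup standing in for detector-based evidence):

```python
# Hypothetical first-pass N-best list: (hypothesis, HMM log-likelihood).
nbest = [
    ("the cat sat on the mat", -1520.4),
    ("the cat sad on the mat", -1521.1),
    ("a cat sat on the mat",   -1525.7),
]

def attribute_score(hypothesis):
    """Placeholder for an acoustic-phonetic score derived from attribute
    detectors aligned against the hypothesis; here just a dummy lookup."""
    dummy = {"the cat sat on the mat": -310.0,
             "the cat sad on the mat": -340.0,
             "a cat sat on the mat":   -335.0}
    return dummy[hypothesis]

def rescore(nbest, weight=0.3):
    """Combine the conventional decoding score with the attribute-based score
    and re-rank the list."""
    rescored = [(hyp, (1 - weight) * llk + weight * attribute_score(hyp))
                for hyp, llk in nbest]
    return sorted(rescored, key=lambda x: -x[1])

print(rescore(nbest)[0][0])   # best hypothesis after the second pass
```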
International Conference on Acoustics, Speech, and Signal Processing | 2010
Sabato Marco Siniscalchi; Torbjørn Svendsen; Filippo Sorbello; Chin-Hui Lee
The choice of hidden non-linearity in a feed-forward multi-layer perceptron (MLP) architecture is crucial for obtaining good generalization capability and performance. Nonetheless, little attention has been paid to this aspect in the ASR field. In this work, we present some initial, yet promising, studies toward improving ASR performance by adopting hidden activation functions that can be learned automatically from the data and change shape during training. This adaptive capability is achieved through the use of orthonormal Hermite polynomials. The “adaptive” MLP is used in two neural architectures that generate phone posterior estimates, namely a standalone configuration and a hierarchical structure. The posteriors are input to a hybrid phone recognition system with good results on the TIMIT corpus. A scheme for optimizing the contributions of high-accuracy neural architectures is also investigated, resulting in a relative improvement of about 9.0% over a non-optimized combination. Finally, initial experiments on the WSJ Nov92 task show that the proposed technique scales well up to large vocabulary continuous speech recognition (LVCSR) tasks.
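Such an adaptive non-linearity can be written as a weighted sum of orthonormal Hermite basis functions with trainable coefficients, h(x) = Σ_k c_k φ_k(x). A forward-pass sketch follows; the orthonormalization convention and the coefficient values are illustrative only, and in training the c_k would be updated by backpropagation along with the ordinary MLP weights:

```python
import numpy as np
from math import factorial, pi, sqrt

def hermite_basis(x, order):
    """Orthonormal Hermite functions phi_k(x) = H_k(x) * exp(-x^2/2) / norm_k,
    built with the physicists' recurrence H_{k+1} = 2x H_k - 2k H_{k-1}."""
    H = [np.ones_like(x), 2.0 * x]
    for k in range(1, order):
        H.append(2.0 * x * H[k] - 2.0 * k * H[k - 1])
    basis = []
    for k in range(order + 1):
        norm = sqrt(2.0 ** k * factorial(k) * sqrt(pi))
        basis.append(H[k] * np.exp(-0.5 * x ** 2) / norm)
    return np.stack(basis)                    # shape: (order + 1, len(x))

def adaptive_activation(x, coeffs):
    """Hidden non-linearity h(x) = sum_k c_k * phi_k(x); the coefficients c_k
    are the trainable part, so the activation changes shape during training."""
    return coeffs @ hermite_basis(x, order=len(coeffs) - 1)

pre_activations = np.linspace(-3, 3, 7)       # hidden-unit inputs for one frame
coeffs = np.array([0.2, 1.0, 0.1, -0.05])     # hypothetical learned c_k
print(adaptive_activation(pre_activations, coeffs))
```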
International Conference on Acoustics, Speech, and Signal Processing | 1984
Torbjørn Svendsen
In speech coding, quantization has traditionally been done on a sample-by-sample basis. According to rate-distortion theory, there is much to be gained by applying multidimensional quantization schemes at low bit rates. This paper presents the use of a tree encoder for quantization of the LPC residual. The perceptual speech quality of the straightforward encoding is, however, not satisfactory, and a frequency-weighted error criterion that greatly improves the perceptual speech quality is proposed.
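One common way to realize such a frequency weighting (not necessarily the exact criterion of this paper) is to filter the coding error through a weighting filter derived from the LPC polynomial, W(z) = A(z)/A(z/γ), so that more error is tolerated under spectral peaks (formants) than in spectral valleys. A sketch under that assumption, with made-up frame data and LPC coefficients:

```python
import numpy as np
from scipy.signal import lfilter

def weighted_error(original, coded, lpc, gamma=0.85):
    """Frequency-weighted squared error between original and coded speech.
    `lpc` holds the coefficients of the LPC polynomial A(z); gamma controls the
    bandwidth expansion of the weighting filter W(z) = A(z) / A(z/gamma)."""
    a = np.asarray(lpc, dtype=float)
    a_bw = a * (gamma ** np.arange(len(a)))   # coefficients of A(z/gamma)
    error = original - coded
    weighted = lfilter(a, a_bw, error)        # error shaped by W(z)
    return float(np.sum(weighted ** 2))

# Toy usage: a short frame, a 2nd-order LPC polynomial, and two candidate
# codings; the criterion would steer the tree search toward the candidate with
# the lower *weighted* error rather than the lower plain squared error.
frame = np.sin(0.3 * np.arange(80))
lpc = np.array([1.0, -1.2, 0.5])
cand_a = frame + 0.05 * np.random.default_rng(0).normal(size=80)
cand_b = frame + 0.05 * np.sin(0.3 * np.arange(80) + 0.1)
print(weighted_error(frame, cand_a, lpc), weighted_error(frame, cand_b, lpc))
```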