Publication


Featured research published by Toomas Altosaar.


international conference on acoustics, speech, and signal processing | 1994

Warped linear prediction (WLP) in speech and audio processing

Unto K. Laine; Matti Karjalainen; Toomas Altosaar

A linear prediction process is applied to frequency-warped signals. The warping is realized by using orthonormal FAM (frequency and amplitude modulated) complex exponential functions. The general formulation of WLP is given and effective realizations with allpass filters are studied. The application of auditory WLP to speech coding and speech recognition has given good results.
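As a rough illustration of the technique (a sketch, not code from the paper), the following replaces the unit delay of ordinary linear prediction with a first-order allpass section D(z) = (z^-1 - λ)/(1 - λz^-1), computes a warped autocorrelation by repeated allpass filtering, and solves the normal equations with Levinson-Durbin; the frame, model order, and λ value are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def warped_autocorr(x, order, lam):
    """Warped autocorrelation: 'lag' k means k passes of x through the
    allpass D(z) = (z^-1 - lam) / (1 - lam*z^-1) instead of k unit delays."""
    r = np.empty(order + 1)
    y = x.copy()
    r[0] = x @ y
    for k in range(1, order + 1):
        y = lfilter([-lam, 1.0], [1.0, -lam], y)  # one more allpass section
        r[k] = x @ y
    return r

def levinson(r):
    """Standard Levinson-Durbin recursion on an autocorrelation vector."""
    order = len(r) - 1
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i-1:0:-1]) / err
        a[1:i] += k * a[i-1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

frame = np.random.randn(400) * np.hanning(400)    # stand-in speech frame
a_wlp, e = levinson(warped_autocorr(frame, order=12, lam=0.57))
# lam ~ 0.57 roughly approximates a Bark-scale warp at 16 kHz (an assumption);
# lam = 0 reduces the whole procedure to conventional linear prediction.
```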


international conference on acoustics, speech, and signal processing | 1990

An orthogonal set of frequency and amplitude modulated (FAM) functions for variable resolution signal analysis

Unto K. Laine; Toomas Altosaar

A general formula for defining a wide class of orthogonal functions is given. The class is based on circular sine and cosine functions which are simultaneously frequency and amplitude modulated in such a way that they remain orthogonal. This is achieved with any choice of FM or AM function. The class, called FAM functions, offers a practical and flexible tool for signal processing. FAM functions have been used to produce nonuniform-resolution auditory spectrograms of very high time-frequency quality, and preliminary results show that they approach the theoretical limit for the Δf·Δt product. The orthogonality of the FAM functions is proved, it is described how a complex orthogonal auditory transform (OAT) can be realized with FAMs, and a method for constructing a complex orthogonal one-Bark filter bank for signal analysis and psychoacoustic experimentation is given.
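The orthogonality is easy to check numerically: for any smooth, monotone phase map θ(t), the functions √θ'(t)·e^{jkθ(t)} remain orthogonal, because substituting u = θ(t) turns each inner product into an ordinary Fourier integral. A minimal check with a hypothetical quadratic FM law (not one of the paper's auditory warpings):

```python
import numpy as np

N, K = 4096, 8
t = np.linspace(0.0, 1.0, N, endpoint=False)
theta = 2 * np.pi * t**2          # any smooth monotone FM law (assumed here)
dtheta = 4 * np.pi * t            # its derivative, used as the AM envelope

# phi_k(t) = sqrt(theta'(t)) * exp(j*k*theta(t)): the AM factor is exactly
# what keeps the family orthogonal regardless of the chosen FM law.
phi = np.sqrt(dtheta)[None, :] * np.exp(1j * np.outer(np.arange(K), theta))
gram = phi @ phi.conj().T / N     # Riemann-sum inner products
print(np.round(np.abs(gram), 3))  # ~2*pi on the diagonal, ~0 off-diagonal
```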


international conference on acoustics, speech, and signal processing | 1988

QuickSig: an object-oriented signal processing environment

Matti Karjalainen; Toomas Altosaar; Paavo Alku

An object-oriented DSP (digital signal processing) environment called QuickSig is described, based on recent developments in object-oriented programming (New Flavors on Symbolics Lisp machines). The design philosophy of QuickSig has been to extend the Lisp language by a layer of general DSP constructs, abstractions, and structures such as signals, filters, windows, graphical presentations, and related signal-processing operations. QuickSig is a fast prototyping system for algorithmic development. It is extendable to include new ways of modeling signals and signal processing, both numerical and symbolic. The main features of the present system and some features under development are reported.
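QuickSig itself lives in Lisp (New Flavors), but the layering idea, signals and filters as first-class objects whose operations compose directly in the host language, translates readily; the class names below are illustrative, not QuickSig's:

```python
import numpy as np

class Signal:
    """A sampled signal as a first-class object, QuickSig-style."""
    def __init__(self, data, fs):
        self.data, self.fs = np.asarray(data, float), fs
    def __add__(self, other):             # signal algebra in the host language
        return Signal(self.data + other.data, self.fs)

class FIRFilter:
    """A filter object that is applied to a Signal like a function."""
    def __init__(self, taps):
        self.taps = np.asarray(taps, float)
    def __call__(self, sig):
        return Signal(np.convolve(sig.data, self.taps, mode="same"), sig.fs)

fs = 8000
noise = Signal(np.random.randn(fs), fs)
smooth = FIRFilter(np.ones(5) / 5)        # 5-tap moving average
mixed = noise + smooth(noise)             # compose objects, not raw buffers
```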


Archive | 2011

Blind Segmentation of Speech Using Non-Linear Filtering Methods

Okko Räsänen; Unto K. Laine; Toomas Altosaar

Automated segmentation of speech into phone-sized units has been a subject of study for over 30 years, as it plays a central role in many speech processing and ASR applications. While segmentation by hand is relatively precise, it is also extremely laborious and tedious, which is one reason why automated methods are widely utilized. For example, phonetic analysis of speech (Mermelstein, 1975), audio content classification (Zhang & Kuo, 1999), and word recognition (Antal, 2004) utilize segmentation to divide continuous audio signals into discrete, non-overlapping units that provide structural descriptions for the different parts of a processed signal.

The best results in automatic segmentation of speech have so far been achieved with semi-automatic HMMs that require prior training (see, e.g., Makhoul & Schwartz, 1994), and algorithms that use additional linguistic information such as phonetic annotation during the segmentation process are often also effective (e.g., Hemert, 1991). The use of these types of algorithms is well justified for several purposes, but extensive training may not always be possible, nor may adequately rich descriptions of the speech material be available, for instance in real-time applications. Training also imposes limitations on the material that can be segmented effectively, with the results being highly dependent on, e.g., the language and vocabulary of the training and target material.

Therefore, several researchers have concurrently worked on blind speech segmentation methods that do not require any external or prior knowledge regarding the speech to be segmented (Almpanidis & Kotropoulos, 2008; Aversano et al., 2001; Cherniz et al., 2007; Esposito & Aversano, 2005; Estevan et al., 2007; Sharma & Mammone, 1996). These so-called blind segmentation algorithms have many potential applications in speech processing that are complementary to supervised segmentation, since they do not need to be trained extensively on carefully prepared speech material. Importantly, blind algorithms need not make assumptions about the underlying signal conditions, whereas trained algorithms suffer from mismatches between training data and processed input, e.g., due to changes in background noise conditions or microphone properties. Blind methods also provide a valuable tool for basic-level investigation of speech, such as phonetic research; they are language independent; and they can be used as a processing step in self-learning agents attempting to make sense of sensory input where externally supplied linguistic knowledge cannot be used (e.g., Rasanen & Driesen, 2009; Rasanen et al., 2008).
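To give the flavor of a blind, training-free segmenter, here is a generic sketch that peak-picks frame-to-frame spectral change after median filtering, the median filter being one simple non-linear smoothing step; this is an illustrative stand-in, not the chapter's specific algorithm, and the frame length and threshold are assumptions.

```python
import numpy as np
from scipy.signal import medfilt

def blind_boundaries(x, fs, frame=0.01):
    """Hypothesize phone boundaries at peaks of frame-to-frame spectral change."""
    n = int(frame * fs)
    frames = x[: len(x) // n * n].reshape(-1, n)
    spec = np.abs(np.fft.rfft(frames * np.hanning(n), axis=1))
    spec /= spec.sum(axis=1, keepdims=True) + 1e-12    # per-frame normalization
    flux = np.r_[0.0, np.abs(np.diff(spec, axis=0)).sum(axis=1)]
    flux = medfilt(flux, 5)                            # non-linear smoothing
    thr = flux.mean() + flux.std()                     # assumed threshold
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i] > thr and flux[i] >= flux[i-1] and flux[i] >= flux[i+1]]
    return np.array(peaks) * frame                     # boundary times in seconds
```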


multimedia signal processing | 1999

Towards a high quality Finnish talking head

Jean-Luc Olives; Mikko Sams; Janne Kulju; Otto Seppälä; Matti Karjalainen; Toomas Altosaar; Sami Lemmetty; Kristian Töyrä; Martti Vainio

We describe how our Finnish talking head was improved by using a new auditory speech synthesis method based on neural networks and by optimally synchronizing the facial speech animation with the audio signal. In the first version of the talking head, the user typed in text, and synthesized auditory speech and synchronized facial animation were created automatically. We combine a 3D facial model with a commercial auditory text-to-speech synthesizer (TTS). The auditory speech is produced by concatenating pre-recorded samples of natural speech according to a set of rules. The quality of the current speech synthesis is not yet adequate. A new strategy has been developed to improve the TTS and to integrate synchronization with the auditory synthesizer, especially when hardware capabilities are limited. We are developing a new method to achieve optimal synchronization independent of the platform used. This method is based on predictive visual synthesis. The new synchronization method gives us better control over audio-visual speech synthesis in the time domain: using the diphone duration, we can apply a more realistic interpolation function between the visemes and thereby also take coarticulation effects into account.
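As a sketch of what duration-driven viseme interpolation can look like (the raised-cosine blend and the parameter layout are assumptions, not the paper's method):

```python
import numpy as np

def blend_visemes(v_from, v_to, duration, fps=25.0):
    """Interpolate face-model parameters over one diphone's duration with a
    raised-cosine curve, which eases in and out rather than moving linearly."""
    n = max(2, int(round(duration * fps)))
    s = 0.5 - 0.5 * np.cos(np.linspace(0.0, np.pi, n))   # smooth 0 -> 1
    return (1.0 - s)[:, None] * v_from + s[:, None] * v_to

# e.g. a 120 ms diphone between two 3-parameter mouth shapes
frames = blend_visemes(np.array([0.1, 0.0, 0.4]),
                       np.array([0.8, 0.3, 0.1]),
                       duration=0.120)
```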


international conference on acoustics, speech, and signal processing | 1998

Speech synthesis using warped linear prediction and neural networks

Matti Karjalainen; Toomas Altosaar; Martti Vainio

A text-to-speech synthesis technique based on warped linear prediction (WLP) and neural networks is presented for high-quality, individual-sounding synthetic speech. Warped linear prediction is used as a speech production model with a wide audio bandwidth yet highly compressed control parameter data. An excitation codebook, inverse filtered from a target speaker's voice, is applied to obtain individual tone quality. A set of neural networks, each specialized to yield synthesis control parameters from phonemic input in a specific context, generates the detailed parametric controls of WLP. Neural nets are also used successfully to compute the prosodic parameters. We have applied this approach in prototyping highly improved text-to-speech synthesis for the Finnish language.
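A toy version of the control-parameter mapping, assuming one-hot phoneme codes over a small context window as input and a frame of WLP parameters as output; scikit-learn's MLPRegressor stands in for the paper's specialized networks, and all sizes are invented:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

N_PHONES, CTX, N_PARAMS = 30, 5, 13        # assumed inventory/window/order

def encode(context):
    """One-hot encode a window of phoneme indices (center = target phoneme)."""
    v = np.zeros(CTX * N_PHONES)
    for i, p in enumerate(context):
        v[i * N_PHONES + p] = 1.0
    return v

# stand-in training data: random contexts mapped to random parameter frames
X = np.array([encode(np.random.randint(N_PHONES, size=CTX)) for _ in range(500)])
Y = np.random.randn(500, N_PARAMS)

net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500).fit(X, Y)
wlp_controls = net.predict(X[:1])          # WLP control frame for one context
```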


international conference on acoustics, speech, and signal processing | 1988

Event-based multiple-resolution analysis of speech signals

Toomas Altosaar; Matti Karjalainen

A methodology for multiple-resolution analysis and event-based representation of speech signals is presented. The computation of multiple-resolution filtering, event detection, and parsing of event structures is described with examples and discussions of auditory modeling aspects. The approach and its implementation are entirely based on object-oriented programming, which provides a systematic framework for the hierarchical nature of the method. The analysis system is implemented on an object-oriented signal processing system called QuickSig running on the Symbolics Lisp machine. QuickSig has object classes such as signals, windows, and filter banks, each with its own set of method functions for signal processing operations.
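A rough Python analogue of the multiple-resolution idea: the same signal envelope smoothed at several time scales, with onset events detected independently at each scale (the scales and threshold are assumptions):

```python
import numpy as np

def events_at_resolutions(x, fs, scales_ms=(5, 20, 80)):
    """Detect energy-onset events at several temporal resolutions."""
    env = np.abs(x)
    events = {}
    for ms in scales_ms:
        n = max(1, int(fs * ms / 1000))
        smooth = np.convolve(env, np.ones(n) / n, mode="same")  # one resolution
        d = np.diff(smooth, prepend=smooth[0])
        thr = 2.0 * d.std()                                     # assumed threshold
        onsets = np.flatnonzero((d[1:] > thr) & (d[:-1] <= thr)) / fs
        events[ms] = onsets        # coarser scales yield fewer, broader events
    return events
```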


international conference on spoken language processing | 1996

A multilingual phonetic representation and analysis system for different speech databases

Toomas Altosaar; Matti Karjalainen; Martti Vainio

A multilingual phonetic representation and analysis system for different speech databases is presented. The need for such a system is first justified, and a system based on the Worldbet phonetic alphabet is then proposed. A phonetic class hierarchy is developed, followed by a description of the hierarchical structural representation. Database access builds on this representation and is accomplished by defining predicate search functions and applying them to a database. Immediate signal analysis of the results is possible, since the multilingual phonetic representation system is seamlessly integrated into a digital signal processing environment.
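The predicate-search pattern is straightforward to illustrate; the class layout and the Worldbet-style labels below are placeholders, not the system's actual hierarchy:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    label: str      # Worldbet-style symbol, e.g. "i:" or "t" (placeholder set)
    lang: str       # e.g. "fi"
    start: float    # seconds
    end: float

VOWELS = {"i", "i:", "e", "a", "o", "u"}           # placeholder phonetic class

def is_long_vowel(seg):                            # a predicate search function
    return seg.label in VOWELS and seg.label.endswith(":")

db = [Segment("i:", "fi", 0.10, 0.25), Segment("t", "fi", 0.25, 0.31)]
hits = [s for s in db if is_long_vowel(s) and s.lang == "fi"]
durations = [s.end - s.start for s in hits]        # immediate signal analysis
```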


workshop on applications of signal processing to audio and acoustics | 1991

Time-frequency and multiple-resolution representations in auditory modeling

Unto K. Laine; Matti Karjalainen; Toomas Altosaar

The human auditory system is known to utilize different temporal and frequency resolutions in different contexts and analysis phases. In this paper we discuss some aspects of using time-frequency representations and multiple resolutions in auditory modeling from an information- and signal-theoretic point of view. The first question is how to allocate resolution optimally between frequency and time; for this purpose a new method called the FAM transform is described. The second question is how to utilize multiple parallel and redundant resolutions to avoid some of the problems faced when using single-resolution approaches.
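The allocation question is bounded by the uncertainty relation Δt·Δf ≥ 1/(4π), which a Gaussian window attains with equality; a quick numerical check (the window width is an arbitrary choice):

```python
import numpy as np

N = 4096
t = np.arange(N) - N / 2
w = np.exp(-0.5 * (t / 30.0) ** 2)       # Gaussian window, width is arbitrary
W = np.abs(np.fft.fft(w)) ** 2           # energy spectrum
f = np.fft.fftfreq(N)                    # frequency in cycles/sample

def rms_spread(axis, weight):
    """RMS width of the normalized distribution `weight` along `axis`."""
    p = weight / weight.sum()
    mu = (axis * p).sum()
    return np.sqrt((((axis - mu) ** 2) * p).sum())

dt, df = rms_spread(t, w ** 2), rms_spread(f, W)
print(dt * df, 1 / (4 * np.pi))          # both ~0.0796: the Gaussian bound
```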


Journal of the Acoustical Society of America | 1996

Modeling of pitch, loudness, and segmental durations in Finnish using neural networks

Toomas Altosaar; Martti Vainio; Matti Karjalainen

Several facets of the man–machine interface, such as speech synthesis and recognition in the spoken-language realm, can be modeled using neural networks. Here, neural networks have been applied to model the lexical prosodic parameters for the Finnish language: segmental duration, loudness, and pitch. The prosodic models that were generated can be used in currently viable applications such as speech synthesis to further improve their naturalness. The text input stream was first converted into a phoneme sequence, from which the input representation for the nets was generated. Inputs included the phoneme's position in the word, the number of phonemes in the word, and context in terms of previous and future phonemes. Optimal input representations for each type of prosodic net were searched for by varying the size of the input vector, and the number of hidden nodes was varied to determine the complexity of the problem. Estimating duration required class-specific nets for the error to drop below 20%, the difference limen; for loudness the error was 2.2 phon (1 phon is just noticeable), while pitch networks performed well with an error of 3.5% (0.6 semitones at 100 Hz, less than the 1.5-semitone perceptual intonation threshold).
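A sketch of how such an input vector might be assembled for the duration net; the phoneme inventory, context width, and scaling are invented for illustration:

```python
import numpy as np

PHONES = list("aeiouptkshlmnrvj")              # toy inventory, not the paper's
IDX = {p: i for i, p in enumerate(PHONES)}

def duration_net_input(word, pos, ctx=2):
    """Input vector for one phoneme: position in word, word length, and
    one-hot codes for `ctx` previous and `ctx` future phonemes."""
    feats = [pos / len(word), len(word) / 20.0]        # scaled scalar inputs
    for off in range(-ctx, ctx + 1):
        onehot = np.zeros(len(PHONES))
        j = pos + off
        if 0 <= j < len(word) and word[j] in IDX:      # outside word: all zeros
            onehot[IDX[word[j]]] = 1.0
        feats.extend(onehot)
    return np.array(feats)

x = duration_net_input(list("talo"), pos=1)    # features for /a/ in "talo"
```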

Collaboration


Dive into Toomas Altosaar's collaboration.

Top Co-Authors

Matti Karjalainen

Helsinki University of Technology

Unto K. Laine

Helsinki University of Technology

Einar Meister

Tallinn University of Technology


L.F.M. ten Bosch

Radboud University Nijmegen


Joris Driesen

Katholieke Universiteit Leuven
