Fernando Perdigão
University of Coimbra
Publications
Featured research published by Fernando Perdigão.
Archive | 2011
Carla Lopes; Fernando Perdigão
In the information age, computer applications have become part of modern life, and this has in turn raised expectations of friendly interaction with them. Speech, as "the" communication mode, has seen the successful development of quite a number of applications using automatic speech recognition (ASR), including command and control, dictation, dialog systems for people with impairments, translation, etc. But the real challenge goes beyond the use of speech in control applications or to access information. The goal is to use speech as an information source, competing, for example, with online text. Since the technology supporting computer applications is highly dependent on the performance of the ASR system, research into ASR is still an active topic, as is shown by the range of research directions suggested in (Baker et al., 2009a, 2009b). Automatic speech recognition – the recognition of the information embedded in a speech signal and its transcription in terms of a set of characters (Junqua & Haton, 1996) – has been the object of intensive research for more than four decades, achieving notable results. Advances in speech recognition can be expected to make spoken language as convenient and accessible as online text once recognizers reach error rates near zero. But while digit recognition has already reached a rate of 99.6% (Li, 2008), the same cannot be said of phone recognition, for which the best rates are still under 80% (Mohamed et al., 2011; Siniscalchi et al., 2007). Speech recognition based on phones is very attractive since it is inherently free from vocabulary limitations. The performance of Large Vocabulary ASR (LVASR) systems depends on the quality of the phone recognizer, which is why research teams continue developing phone recognizers in order to enhance their performance as much as possible. Phone recognition is, in fact, a recurrent problem for the speech recognition community and can be found in a wide range of applications. In addition to typical LVASR systems (Morris & Fosler-Lussier, 2008; Scanlon et al., 2007; Schwarz, 2008), it can be found in applications related to keyword detection (Schwarz, 2008), language recognition (Matejka, 2009; Schwarz, 2008), speaker identification (Furui, 2005) and music identification and translation (Fujihara & Goto, 2008; Gruhne et al., 2007). The challenge of building robust acoustic models involves applying good training algorithms to a suitable set of data. The database defines the units that can be trained and
NDT & E International | 2001
Jaime B. Santos; Fernando Perdigão
The objective of this work is to provide a contribution to defect classification. More precisely, we aim to show that it is possible to identify and classify defects of different types using the pulse-echo technique. The classification process makes use of the time- and frequency-domain responses of ultrasonic echo signals acquired from different specimens simulating defects with three different shapes (cylindrical, spherical and planar with rectangular cross-section) and sizes. Although the final goal is the characterisation of practical defects (for instance, voids, cracks, delaminations, and so on) appearing in composite materials during manufacturing and in service, we first use the above-mentioned reflectors for simplicity. In these experiments 66 reflectors are used, with water as the matrix material. The inclusion (reflector) materials are brass, copper, steel and polystyrene. From the time-domain signals we extract three features, namely the pulse duration, the pulse decay rate and the peak-to-peak relative amplitude of the third cycle. From the spectra of the echoes we extract the frequency of maximum amplitude and the standard error estimate from the deconvolved spectrum responses. All experimental signals were obtained using a single normal-incidence ultrasonic transducer aligned to maximise the directly reflected signal. Although this kind of configuration does not provide complete information about the characteristics of the geometries being studied, all the extracted features proved to be important discriminating factors for the geometrical classes considered, as demonstrated by means of a pattern recognition technique for classification.
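As an illustration only (not the authors' implementation), the sketch below shows how time- and frequency-domain echo features of this kind might be computed from a digitized pulse-echo signal. The function name, the 10% amplitude threshold and the log-envelope decay estimate are assumptions introduced here for clarity.

```python
import numpy as np

def echo_features(echo, fs, level=0.1):
    """Illustrative time/frequency features of a gated ultrasonic echo.

    echo : 1-D array with the sampled echo signal
    fs   : sampling rate in Hz
    level: fraction of the peak used to delimit the pulse (assumption)
    """
    env = np.abs(echo)                       # crude envelope (no Hilbert transform)
    peak = env.max()
    above = np.nonzero(env >= level * peak)[0]
    duration = (above[-1] - above[0]) / fs   # pulse duration between threshold crossings

    # Decay rate: slope of the log-envelope after the peak (assumed definition)
    i_peak = env.argmax()
    tail = env[i_peak:above[-1] + 1]
    t = np.arange(tail.size) / fs
    decay = np.polyfit(t, np.log(tail + 1e-12), 1)[0]

    # Frequency of maximum spectral amplitude
    spec = np.abs(np.fft.rfft(echo))
    freqs = np.fft.rfftfreq(echo.size, d=1.0 / fs)
    f_max = freqs[spec.argmax()]

    return {"duration": duration, "decay_rate": decay, "f_max": f_max}
```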
IberSPEECH | 2012
Arlindo Veiga; Dirce Celorico; Jorge Proença; Sara Candeias; Fernando Perdigão
This study presents an approach to the task of automatically classifying and detecting speaking styles. The detection of speaking styles is useful for segmenting multimedia data into consistent parts and has important applications, such as identifying speech segments to train acoustic models for speech recognition. In this work the database consists of daily news broadcasts from Portuguese television, in which two main speaking styles are evident: read speech from voice-overs and anchors, and spontaneous speech from interviews and commentaries. Using a combination of phonetic and prosodic features we can separate these two speaking styles with good accuracy (93.7% read, 69.5% spontaneous). This is performed in two steps: the first step separates the speech segments from the non-speech audio segments, and the second step classifies read versus spontaneous speaking style. The use of phonetic and prosodic features provides complementary information that improves the classification and detection task.
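A minimal sketch of such a two-stage pipeline is given below, assuming per-segment feature vectors and two already-trained classifiers; the feature set, the classifiers and the label coding are placeholders, not the paper's actual setup.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stage 1: speech vs. non-speech; stage 2: read vs. spontaneous.
# Both pipelines are assumed to have been fitted beforehand on labelled
# segments described by phonetic/prosodic features (hypothetical here).
speech_detector = make_pipeline(StandardScaler(), SVC())
style_classifier = make_pipeline(StandardScaler(), SVC())

def classify_segments(X, detector, classifier):
    """Return 'non-speech', 'read' or 'spontaneous' for each segment in X."""
    labels = []
    for x in X:
        if detector.predict([x])[0] == 0:        # 0 = non-speech (assumed coding)
            labels.append("non-speech")
        else:
            style = classifier.predict([x])[0]   # 1 = read, 2 = spontaneous (assumed)
            labels.append("read" if style == 1 else "spontaneous")
    return labels
```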
Processing of the Portuguese Language | 2008
José Lopes; Cláudio Neves; Arlindo Veiga; Alexandre M. A. Maciel; Carla Lopes; Fernando Perdigão; Luis A. S. V. de Sa
This paper describes the development of a robust speech recognition system using a database collected within the scope of the Tecnovoz project. The speech recognition system is speaker independent, robust to noise and runs on a small-footprint embedded hardware platform. Issues concerning the database, the training of the acoustic models, the noise suppression front-end and the recognizer's confidence measure are addressed in the paper. Although the database was especially designed for specific small-vocabulary tasks, the best system performance was obtained using triphone models rather than whole-word models.
European Signal Processing Conference | 2015
Jorge Proença; Arlindo Veiga; Fernando Perdigão
This paper presents an approach to the Query-by-Example task of finding spoken queries in speech databases when the intended match may be non-exact or slightly complex. The system is low-resource, since it addresses the problem where the language of the queries and of the searched audio is unspecified. Our method is based on a modified Dynamic Time Warping (DTW) algorithm that uses posteriorgrams and extracts intricate paths to account for special cases of query match such as word re-ordering, lexical variations and filler content. The system was evaluated on the MediaEval 2014 task of Query by Example Search on Speech (QUESST), where the spoken data comes from several languages unknown to the participants. We combined the results of five DTW modifications computed on the output of three phoneme recognizers of different languages. The combination of all systems provided the best overall performance and improved the detection of complex-case queries.
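For orientation only, the following sketch shows a plain subsequence DTW over posteriorgrams, without the paper's modified path extraction; the local distance (negative log inner product of posterior vectors) and the open-begin/open-end handling are common choices assumed here, not necessarily those of the authors.

```python
import numpy as np

def posteriorgram_dtw(query, utterance):
    """Subsequence DTW of a query posteriorgram against an utterance posteriorgram.

    query, utterance: arrays of shape (n_frames, n_phones), rows summing to 1.
    Returns the best (lowest) length-normalized path cost over all end frames.
    """
    eps = 1e-10
    dist = -np.log(np.clip(query @ utterance.T, eps, None))  # (Nq, Nu) local cost
    nq, nu = dist.shape

    acc = np.full((nq, nu), np.inf)
    acc[0, :] = dist[0, :]            # open beginning: the match may start anywhere
    for i in range(1, nq):
        for j in range(nu):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    return acc[-1, :].min() / nq      # open end: best cost at the last query frame
```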
Processing of the Portuguese Language | 2014
Arlindo Veiga; Carla Lopes; Luis A. S. V. de Sa; Fernando Perdigão
This paper presents a study on keyword spotting systems based on the acoustic similarity between a filler model and a keyword model. The ratio between the keyword model likelihood and the generic (filler) model likelihood is used by the classifier to detect relevant peak values that indicate keyword occurrences. We have changed the standard keyword spotting scheme to allow keyword detection in a single forward step. We propose a new log-likelihood ratio normalization to minimize the effect of word length on classifier performance. Tests show the effectiveness of our normalization method against two other methods. Experiments were performed on continuous speech utterances of the Portuguese TECNOVOZ database (read sentences) with keywords of several lengths.
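As a rough illustration, the sketch below scores a candidate segment by a log-likelihood ratio between keyword and filler models and flags local peaks above a threshold. Dividing by the number of frames is one simple length normalization used here for illustration; it is not the normalization proposed in the paper.

```python
import numpy as np

def keyword_score(kw_loglik, filler_loglik, n_frames):
    """Length-normalized log-likelihood ratio for a candidate keyword segment.

    kw_loglik, filler_loglik: accumulated log-likelihoods of the keyword and
    filler (background) models over the same segment.
    """
    return (kw_loglik - filler_loglik) / n_frames   # naive per-frame normalization

def detect_keywords(scores, threshold):
    """Return indices of local score peaks above a threshold (possible keyword hits)."""
    scores = np.asarray(scores)
    peaks = (scores[1:-1] > scores[:-2]) & (scores[1:-1] > scores[2:]) \
            & (scores[1:-1] > threshold)
    return np.nonzero(peaks)[0] + 1
```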
Journal of the Brazilian Computer Society | 2013
Arlindo Veiga; Sara Candeias; Fernando Perdigão
This paper addresses the problem of grapheme-to-phoneme conversion to create a pronunciation dictionary from a vocabulary of the most frequent words in European Portuguese. A system based on a mixed approach, founded on a stochastic model with embedded rules for stressed vowel assignment, is described. The implemented model can generate pronunciations for unrestricted words; however, a dictionary with the 40k most frequent words was constructed and corrected interactively. The dictionary includes homographs with multiple pronunciations. The vocabulary was defined using the CETEMPúblico corpus. The model and dictionary are publicly available.
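Purely to illustrate how such a lexicon might be consumed, the snippet below keeps several pronunciations per homograph and falls back to a G2P model for out-of-vocabulary words. The entries and SAMPA-like symbols are invented placeholders and do not reproduce the published dictionary.

```python
# Hypothetical lexicon: each orthographic form maps to one or more
# pronunciations (homographs keep all variants).
LEXICON = {
    "sede": ["s E d @", "s e d @"],   # homograph: two pronunciations (illustrative)
    "casa": ["k a z 6"],
}

def pronunciations(word, g2p_fallback=None):
    """Return dictionary pronunciations; use a G2P model for OOV words."""
    prons = LEXICON.get(word.lower())
    if prons:
        return prons
    if g2p_fallback is not None:
        # A stochastic model with stress-assignment rules would be called here.
        return [g2p_fallback(word)]
    raise KeyError(f"'{word}' not in lexicon and no G2P fallback given")
```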
Processing of the Portuguese Language | 2012
Carla Lopes; Arlindo Veiga; Fernando Perdigão
This paper introduces a European Portuguese speech database containing spoken material recorded from children. The need for such a database arose from the need to train phone models for the development of a computer-aided speech therapy system. Articulatory disorders affect a significant number of children of pre-school age. We propose a system intended to assist and reinforce conventional speech therapy programs. Through the systematic use of games, the system identifies the phones that the child has most difficulty pronouncing; the child then trains the production of those phones by playing games. A further motivation for a children's speech database is that accurate phone recognition for children is only possible using training data that reflects the population of users, which is a difficult task due to the high pitch of children's speech.
Microprocessing and Microprogramming | 1990
Luis A. S. V. de Sa; Vitor Silva; Fernando Perdigão; Sérgio M. M. de Faria; Pedro A. Amado Assunção
A computing architecture capable of coding video signals in real time is described. The codec uses several digital signal processors (DSPs) which can be easily programmed to implement the recent H.261 algorithm approved by the CCITT. The DSPs are organized as a single instruction multiple data (SIMD) computing architecture. Every image in a sequence is divided into regions of horizontal strips, and each region is handled by its own processor. The same principle is used in both the encoder and the decoder. These local processors code (decode) one horizontal strip of data which, in the terminology of the H.261 standard, corresponds to two groups of blocks (GOBs). They also communicate with a central processor, which multiplexes (demultiplexes) the coded data from (for) the processors in the encoder (decoder). In the case of the encoder, this central processor also controls a data buffer for bit-rate adaptation. Lateral communication between adjacent processors is also permitted, which allows comparisons between blocks situated in neighbouring regions, as required by most motion estimation algorithms.
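To make the strip partitioning concrete, the Python sketch below (the original codec ran on DSPs, not Python) splits a CIF luminance frame into full-width strips of 48 lines; under H.261 a GOB covers 176x48 pixels, so each such strip holds two GOBs. The frame size, worker dispatch and multiplexing step are illustrative assumptions.

```python
import numpy as np

# Illustrative strip partitioning for a CIF luminance frame (352 x 288).
FRAME_W, FRAME_H, STRIP_H = 352, 288, 48

def split_into_strips(frame):
    """Split a (288, 352) frame into six horizontal strips of 48 lines each."""
    assert frame.shape == (FRAME_H, FRAME_W)
    return [frame[y:y + STRIP_H, :] for y in range(0, FRAME_H, STRIP_H)]

def encode_frame(frame, encode_strip):
    """Give each strip to its own (simulated) local processor and multiplex the result."""
    coded = [encode_strip(strip, index)
             for index, strip in enumerate(split_into_strips(frame))]
    return b"".join(coded)   # central processor multiplexes the per-strip bitstreams
```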
Conference of the International Speech Communication Association | 2016
Jorge Proença; Fernando Perdigão
This paper describes a low-resource approach to a Query-by-Example task, where spoken queries must be matched in a large dataset of spoken documents, sometimes in complex or non-exact ways. Our approach tackles these complex match cases by using Dynamic Time Warping to obtain alternative paths that account for reordering of words, small amounts of extra content and small lexical variations. We also report advances in the calibration and fusion of sub-systems that improve overall results, such as manipulating the score distribution per query and using an average posteriorgram distance matrix as an extra sub-system. Results are evaluated on the MediaEval task of Query-by-Example Search on Speech (QUESST). For this task the language of the audio being searched is almost irrelevant, which brings the use case close to that of a language with very few resources. Accordingly, we use as features the posterior probabilities obtained from five phonetic recognizers trained on five different languages.
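A minimal sketch of per-query score normalization and sub-system fusion follows, assuming a raw score matrix per sub-system of shape (queries, documents); the z-norm per query and the equal fusion weights are simplifying assumptions, not the exact calibration used in the paper.

```python
import numpy as np

def znorm_per_query(scores):
    """Normalize each query's score distribution to zero mean and unit variance.

    scores: array of shape (n_queries, n_documents) with raw detection scores.
    """
    mean = scores.mean(axis=1, keepdims=True)
    std = scores.std(axis=1, keepdims=True) + 1e-12
    return (scores - mean) / std

def fuse(subsystem_scores, weights=None):
    """Weighted average of per-query-normalized sub-system score matrices."""
    normed = [znorm_per_query(s) for s in subsystem_scores]
    if weights is None:
        weights = np.ones(len(normed)) / len(normed)   # equal weights (assumption)
    return sum(w * s for w, s in zip(weights, normed))
```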