
Publications


Featured research published by Matteo Gerosa.


Speech Communication | 2007

Acoustic variability and automatic recognition of children's speech

Matteo Gerosa; Diego Giuliani; Fabio Brugnara

This paper presents several acoustic analyses carried out on read speech collected from Italian children aged 7 to 13 years and North American children aged 5 to 17 years. These analyses aimed at achieving a better understanding of spectral and temporal changes in speech produced by children of various ages, in view of the development of automatic speech recognition applications. The results confirm and complement those reported in the literature, showing that the characteristics of children's speech change with age and that spectral and temporal variability decrease as age increases. In fact, younger children show substantially higher intra- and inter-speaker variability than older children and adults. We investigated several methods for speaker adaptive acoustic modeling to cope with inter-speaker spectral variability and to improve recognition performance for children. These methods proved effective in recognition of read speech with a vocabulary of about 11k words.
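The abstract refers to speaker adaptive acoustic modeling without detailing the specific techniques used in the paper. As a minimal, generic illustration of reducing per-speaker variability in acoustic features (not the paper's actual methods), the sketch below applies cepstral mean and variance normalization to one speaker's feature frames; all names are placeholders.

```python
def cmvn(frames):
    """Per-speaker cepstral mean and variance normalization.

    frames: list of equal-length feature vectors (e.g. MFCCs) from one
    speaker. Subtracting the per-dimension mean and dividing by the
    standard deviation removes gross per-speaker spectral offsets and
    scale differences, a common first step before adaptation.
    """
    dims = len(frames[0])
    n = len(frames)
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dims)]
    std = [max(v ** 0.5, 1e-8) for v in var]  # floor to avoid division by zero
    return [[(f[d] - mean[d]) / std[d] for d in range(dims)] for f in frames]
```

After normalization, each feature dimension has zero mean and unit variance over the speaker's frames.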


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2003

Investigating recognition of children's speech

Diego Giuliani; Matteo Gerosa

Recognition of children's speech was investigated by considering a phone recognition task. Two baseline systems were trained, one for children and one for adults, by exploiting two Italian speech databases. Under matched conditions, with training and recognition performed on data from the same population group, phone recognition accuracy was 77.30% for children and 79.43% for adults. It was found that, for many children, recognition results were as good as for adults; however, higher variability in phone recognition accuracy across speakers was observed for children than for adults. Vocal tract length normalization, under matched and mismatched training and testing conditions, was also investigated. For both adults and children, a performance improvement with respect to the baseline systems was observed.
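Vocal tract length normalization is commonly implemented as a warping of the frequency axis before filterbank analysis. The abstract gives no implementation details, so the following is only an illustrative sketch of one standard variant, a piecewise-linear warp controlled by a single factor alpha; the function name and default frequencies are assumptions.

```python
def vtln_warp(f, alpha, f_nyq=8000.0, f_break=7000.0):
    """Piecewise-linear VTLN warp of a frequency f in Hz.

    Below f_break the axis is scaled by alpha (alpha > 1 compresses the
    spectrum, as for the shorter vocal tracts of children); above it a
    linear segment maps f_nyq onto itself, keeping the warp continuous.
    Monotonicity requires alpha * f_break < f_nyq, which holds for the
    usual range of warp factors (roughly 0.88 to 1.12).
    """
    if f <= f_break:
        return alpha * f
    # connect (f_break, alpha * f_break) to (f_nyq, f_nyq)
    slope = (f_nyq - alpha * f_break) / (f_nyq - f_break)
    return alpha * f_break + slope * (f - f_break)
```

With alpha = 1 the warp is the identity; in practice alpha is chosen per speaker, e.g. by maximizing the likelihood of the warped features under the acoustic model.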


Proceedings of the 2nd Workshop on Child, Computer and Interaction | 2009

A review of ASR technologies for children's speech

Matteo Gerosa; Diego Giuliani; Shrikanth Narayanan; Alexandros Potamianos

In this paper, we review: (1) the acoustic and linguistic properties of children's speech, for both read and spontaneous speech, and (2) developments in automatic speech recognition for children, with application to spoken dialogue and multimodal dialogue system design. First, the effect of developmental changes on the absolute values and variability of acoustic correlates is presented for read speech from children aged 6 and up. Then, verbal child-machine spontaneous interaction is reviewed and results from recent studies are presented. Age trends of acoustic, linguistic and interaction parameters are discussed, such as sentence duration, filled pauses, politeness and frustration markers, and modality usage. Some differences between child-machine and human-human interaction are pointed out. The implications for acoustic modeling, linguistic modeling and spoken dialogue system design for children are presented. We conclude with a review of relevant applications of spoken dialogue technologies for children.


IEEE Workshop on Multimedia Signal Processing (MMSP) | 2007

A System for Technology Based Assessment of Language and Literacy in Young Children: the Role of Multiple Information Sources

Abeer Alwan; Yijian Bai; Matthew P. Black; Larry Casey; Matteo Gerosa; Markus Iseli; Barbara Jones; Abe Kazemzadeh; Sungbok Lee; Shrikanth Narayanan; Patti Price; Joseph Tepperman; Shizhen Wang

This paper describes the design and realization of an automatic system for assessing and evaluating the language and literacy skills of young children. The system was developed in the context of the TBALL (Technology Based Assessment of Language and Literacy) project and aims at automatically assessing the English literacy skills of both native speakers of American English and Mexican-American children in grades K-2. The automatic assessments were carried out employing appropriate speech recognition and understanding techniques. In this paper, we describe the system, focusing on the role of the multiple sources of information at our disposal. We present the content of the assessment system, discuss issues in creating a child-friendly interface and in providing suitable feedback to teachers, and describe the different assessment modules and the different algorithms used for speech analysis.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2006

Analyzing Children's Speech: An Acoustic Study of Consonants and Consonant-Vowel Transition

Matteo Gerosa; Sungbok Lee; Diego Giuliani; Shrikanth Narayanan

This paper presents several acoustic analyses of read speech, collected from 5 adults and 35 children aged 5 to 17 years, focusing on consonants and the consonant-vowel transition. Characteristics of consonants such as duration, intra-speaker variability and, for stop consonants, voice onset time are analyzed and compared with results obtained on vowels. A strong and significant correlation with age is observed for both duration and intra-speaker variability. In fact, younger children show longer phone durations and larger spectral and temporal variability than older children and adults. Voice onset time, on the other hand, is less correlated with age. Analysis of the consonant-vowel transition shows that the duration of the transition and the amount of spectral difference between consonant and vowel are clearly age-dependent: younger children show shorter transition durations and a larger spectral difference between consonant and vowel in the consonant-vowel pair.


Speech Communication | 2009

Towards age-independent acoustic modeling

Matteo Gerosa; Diego Giuliani; Fabio Brugnara

In automatic speech recognition applications, due to significant differences in voice characteristics, adults and children are usually treated as two population groups for which different acoustic models are trained. In this paper, age-independent acoustic modeling is investigated in the context of large vocabulary speech recognition. Exploiting a small amount (9 h) of children's speech and a larger amount (57 h) of adult speech, age-independent acoustic models are trained using several methods for speaker adaptive acoustic modeling. Recognition results achieved with these models are compared with those achieved using age-dependent acoustic models for children and adults, respectively. Recognition experiments are performed on four Italian speech corpora, two consisting of children's speech and two of adult speech, using 64k-word and 11k-word trigram language models. Methods for speaker adaptive acoustic modeling prove effective for training age-independent acoustic models, ensuring recognition results at least as good as those achieved with age-dependent acoustic models for adults and children.
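The abstract mentions 64k-word and 11k-word trigram language models. As a minimal sketch of what a trigram language model estimates (maximum likelihood only, with none of the smoothing a real large-vocabulary system would use), consider counting trigrams and bigrams over a token stream; the function names are placeholders.

```python
from collections import Counter

def train_trigram(tokens):
    """Collect trigram and bigram counts from a list of word tokens."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2).

    Returns 0.0 for unseen histories; a production LM would instead
    back off to bigram and unigram estimates with discounting.
    """
    history = bi[(w1, w2)]
    if history == 0:
        return 0.0
    return tri[(w1, w2, w3)] / history
```

For example, over the token stream "a b c a b d", the history (a, b) occurs twice and is followed once by "c", giving P(c | a, b) = 0.5.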


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009

Coping with out-of-vocabulary words: Open versus huge vocabulary ASR

Matteo Gerosa; Marcello Federico

This paper investigates methods for coping with out-of-vocabulary words in a large vocabulary speech recognition task, namely the automatic transcription of Italian broadcast news. Two alternative ways of augmenting a 64k-word recognition vocabulary and language model are compared: introducing extra words with their phonetic transcriptions, up to 1.2M words, or extending the language model with so-called graphones, i.e. subword units made of phone-character sequences. Graphones and phonetic transcriptions of words are automatically generated by adapting an off-the-shelf statistical machine translation toolkit. We found that both the word-based and the graphone-based extensions improve recognition performance, with the former performing significantly better than the latter. In addition, the word-based extension shows interesting potential even under conditions of little supervision: by training the grapheme-to-phoneme translation system with only 2k manually verified transcriptions, the final word error rate increases by just 3% relative with respect to starting from a lexicon of 64k words.


International Conference on Smart Cities and Green ICT Systems | 2015

An Open Platform for Children’s Independent Mobility

Matteo Gerosa; Annapaola Marconi; Marco Pistore; Paolo Traverso

Children’s independent mobility is a perfect example of a smart community, where proactive citizen participation and new forms of collaboration between citizens and city managers are fundamental to solving daily problems in the city. This application domain, intersecting several areas of a smart city, from sustainable mobility to health and education, is at the same time very relevant from a societal perspective and very challenging from an ICT perspective, since it requires a combination of socio-psychological theories and practices with advanced ICT techniques and tools. In this paper we illustrate the problem, analyzing ongoing initiatives, lessons learnt and the potential role of ICT solutions, and propose a solution, the CLIMB Platform, which will be trialled in the city of Trento.


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2009

Phone-to-word decoding through statistical machine translation and complementary system combination

Daniele Falavigna; Matteo Gerosa; Roberto Gretter; Diego Giuliani

In this paper, phone-to-word transduction is first investigated by coupling a speech recognizer, which generates a phone sequence or a phone confusion network for each speech segment, with the efficient confusion network decoder adopted by MOSES, a popular statistical machine translation toolkit. Then, system combination is investigated by combining the outputs of several conventional ASR systems with the output of a system embedding phone-to-word decoding through statistical machine translation. Experiments are carried out in the context of a large vocabulary speech recognition task consisting of the transcription of speeches delivered in English during the European Parliament Plenary Sessions (EPPS). While only a marginal performance improvement is achieved in the system combination experiments when the output of the phone-to-word transducer is included, partial results show great potential for improvement.
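A confusion network, as mentioned in the abstract, can be viewed as a sequence of bins, each holding competing candidates with posterior probabilities. The sketch below shows only the trivial 1-best extraction from such a structure (not the MOSES decoding used in the paper): because bins are independent, the best path simply takes the top candidate in each bin. The representation and names are assumptions for illustration.

```python
def cn_best_path(confusion_network):
    """Extract the highest-posterior hypothesis from a confusion network.

    confusion_network: list of bins; each bin is a dict mapping a
    candidate token (word or phone; '' denotes the epsilon/deletion
    arc) to its posterior probability. Bins are independent, so the
    1-best path is the per-bin argmax, with epsilon tokens dropped.
    """
    best = []
    for bin_ in confusion_network:
        token = max(bin_, key=bin_.get)  # candidate with highest posterior
        if token:  # skip epsilon arcs, which stand for "no token here"
            best.append(token)
    return best
```

For instance, a three-bin network whose middle bin favors the epsilon arc yields a two-word hypothesis.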


Proceedings of the 2nd ACM workshop on Multimedia in forensics, security and intelligence | 2010

An automatic transcription system of hearings in Italian courtrooms

Daniele Falavigna; Matteo Gerosa; Diego Giuliani; Roberto Gretter

This paper describes and discusses the recognition results obtained using the automatic transcription system developed in our labs, after adapting it to the judicial domain. Performance has been evaluated on field audio data, consisting of about 7 hours of multi-track recordings acquired on two different dates in the Court of Naples. Different sets of acoustic and language models have been used and compared in the system, providing results (a word error rate of around 40%) that are in line with those obtained on other comparable Automatic Speech Recognition (ASR) tasks (e.g. meeting transcription) and that leave room for future investigation.
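The word error rate quoted above is the standard ASR metric: the minimum number of word substitutions, insertions and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch via Levenshtein dynamic programming (the function name is a placeholder):

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / reference length.

    Computed with the standard Levenshtein edit-distance dynamic
    program over word tokens; d[i][j] is the minimum number of edits
    to turn the first i reference words into the first j hypothesis
    words.
    """
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, "the hat" against the reference "the cat sat" costs one substitution and one deletion, giving a WER of 2/3.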

Collaboration


Matteo Gerosa's top co-authors:

Diego Giuliani (Fondazione Bruno Kessler)

Shrikanth Narayanan (University of Southern California)

Fabio Brugnara (Fondazione Bruno Kessler)

Joseph Tepperman (University of Southern California)

Sungbok Lee (University of Southern California)

Abe Kazemzadeh (University of Southern California)

Abeer Alwan (University of California)