Publication


Featured research published by Sandesh Aryal.


Computer Speech & Language | 2016

Data driven articulatory synthesis with deep neural networks

Sandesh Aryal; Ricardo Gutierrez-Osuna

Highlights: We present an articulatory-to-acoustic mapping for real-time articulatory synthesis. The method uses a deep neural network with a tapped-delay input line, which efficiently captures dynamics in articulatory trajectories. The model achieved higher accuracy than competing models based on Gaussian mixtures, and the improvement was also found to be perceivable in a subjective listening test.

The conventional approach for data-driven articulatory synthesis consists of modeling the joint acoustic-articulatory distribution with a Gaussian mixture model (GMM), followed by a post-processing step that optimizes the resulting acoustic trajectories. This final step can significantly improve the accuracy of the GMM frame-by-frame mapping, but it is computationally intensive and requires that the entire utterance be synthesized beforehand, making it unsuited for real-time synthesis. To address this issue, we present a deep neural network (DNN) articulatory synthesizer that uses a tapped-delay input line, allowing the model to capture context information in the articulatory trajectory without the need for post-processing. We characterize the DNN as a function of the context size and the number of hidden layers, and compare it against two GMM articulatory synthesizers: a baseline model that performs a simple frame-by-frame mapping, and a second model that also performs trajectory optimization. Our results show that a DNN with a 60-ms context window and two 512-neuron hidden layers can synthesize speech at four times the frame rate (comparable to frame-by-frame mappings) while improving on the accuracy of trajectory optimization (a 9.8% reduction in Mel cepstral distortion). Subjective evaluation through pairwise listening tests also shows a strong preference for the DNN articulatory synthesizer over GMM trajectory optimization.
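As a rough illustration of the tapped-delay idea, the sketch below stacks each articulatory frame with a fixed number of past and future neighbors so that a feed-forward network can see local trajectory dynamics. The frame rate, channel count, and names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def tapped_delay(articulatory, context=3):
    """Stack each frame with `context` past and future neighbors.

    articulatory: (T, D) matrix of articulator positions per frame.
    Returns a (T, (2*context + 1) * D) matrix that a feed-forward DNN
    can consume frame by frame, so local trajectory dynamics are
    available without utterance-level post-processing.
    """
    T, _ = articulatory.shape
    padded = np.pad(articulatory, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

# Hypothetical data: 500 frames of 12 EMA channels; with a 10-ms frame
# shift, context=3 approximates the paper's 60-ms context window.
ema = np.random.randn(500, 12)
dnn_input = tapped_delay(ema, context=3)
print(dnn_input.shape)  # (500, 84)
```

Because each input is assembled from a short local window, the mapping can run incrementally as frames arrive, which is what makes real-time synthesis without trajectory post-processing plausible.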


International Conference on Acoustics, Speech, and Signal Processing | 2014

Normalization of articulatory data through Procrustes transformations and analysis-by-synthesis

Daniel Felps; Sandesh Aryal; Ricardo Gutierrez-Osuna

We describe and compare three methods that can be used to normalize articulatory data across speakers. The methods seek to explain systematic anatomical differences between a source and a target speaker without modifying the articulatory velocities of the source speaker. The first method is the classical Procrustes transform, which allows a global translation, rotation, and scaling of articulator positions. We present an extension to the Procrustes transform that allows independent translations of each articulator. The additional parameters provide a 35% increase in articulatory similarity between pairs of speakers compared to the classical Procrustes transform. The proposed extension is finally coupled with a data-driven articulatory synthesizer in an analysis-by-synthesis loop to select the model parameters that best explain the predicted acoustic (rather than articulatory) differences. This normalization method increases acoustic similarity between the source and target speakers by 34%. However, it also reduces articulatory similarity by 22%, which suggests that improvements in acoustic similarity do not necessarily require an increase in articulatory similarity.
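A minimal numpy sketch of the two transforms discussed above, assuming midsagittal (x, y) sensor positions with one articulator label per row. The classical fit is the standard orthogonal Procrustes solution; the extension simply adds a mean residual offset per articulator.

```python
import numpy as np

def fit_procrustes(source, target):
    """Classical Procrustes fit: the translation, rotation, and uniform
    scaling that best map `source` points onto `target` (both (N, 2)
    arrays of sensor positions in the midsagittal plane)."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    S, T = source - mu_s, target - mu_t
    U, sigma, Vt = np.linalg.svd(S.T @ T)
    Q = U @ Vt                                 # optimal rotation
    scale = sigma.sum() / (S ** 2).sum()       # optimal uniform scaling
    return lambda X: scale * (X - mu_s) @ Q + mu_t

def per_articulator_offsets(mapped, target, labels):
    """The extension described above: one independent translation per
    articulator, estimated as the mean residual for each sensor label."""
    return {lab: (target[labels == lab] - mapped[labels == lab]).mean(axis=0)
            for lab in np.unique(labels)}

# Hypothetical usage: 300 frames each of two sensors (labels assumed).
src = np.random.randn(600, 2)
tgt = 1.2 * src @ np.array([[0.98, -0.17], [0.17, 0.98]]) + [1.0, -0.5]
labels = np.repeat(np.array(["tongue_tip", "lower_lip"]), 300)
transform = fit_procrustes(src, tgt)
offsets = per_articulator_offsets(transform(src), tgt, labels)
```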


International Conference on Acoustics, Speech, and Signal Processing | 2014

Can voice conversion be used to reduce non-native accents?

Sandesh Aryal; Ricardo Gutierrez-Osuna

Voice-conversion (VC) techniques aim to transform utterances from a source speaker to sound as if a target speaker had produced them. For this reason, VC is generally ill-suited for accent-conversion (AC) purposes, where the goal is to capture the regional accent of the source while preserving the voice quality of the target. In this paper, we propose a modification of the conventional training process for VC that allows it to perform as an AC transform. Namely, we pair source and target vectors based not on their ordering within a parallel corpus, as is commonly done in VC, but based on their linguistic similarity. We validate the approach on a corpus containing native-accented and Spanish-accented utterances, and compare it against conventional VC through a series of listening tests. We also analyze whether phonological differences between the two languages (Spanish and American English) help predict the performance of the two methods.
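One possible reading of the modified pairing step is sketched below: rather than pairing the t-th source frame with the t-th target frame of a time-aligned corpus, each source frame is matched to the target frame nearest to it in a linguistic feature space. The feature choice and the nearest-neighbor rule here are assumptions, not the paper's exact procedure.

```python
import numpy as np

def pair_by_linguistic_similarity(src_feats, tgt_feats):
    """For each source frame, pick the nearest target frame in a
    linguistic feature space (src_feats: (Ns, K), tgt_feats: (Nt, K)).
    Returns an index array of length Ns."""
    d2 = ((src_feats[:, None, :] - tgt_feats[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Hypothetical frame-level features, e.g. phonetic posteriors.
src = np.random.rand(200, 40)
tgt = np.random.rand(300, 40)
pairs = pair_by_linguistic_similarity(src, tgt)
# (src[i], tgt[pairs[i]]) pairs would then feed the usual VC training.
```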


International Conference on Acoustics, Speech, and Signal Processing | 2014

Accent conversion through cross-speaker articulatory synthesis

Sandesh Aryal; Ricardo Gutierrez-Osuna

Accent conversion (AC) seeks to transform second-language (L2) utterances to appear as if produced with a native (L1) accent. In the acoustic domain, AC is difficult due to the complex interaction between linguistic content and voice quality. Alternatively, AC can be performed in the articulatory domain by building a mapping from L2 articulators to L2 acoustics, and then driving the model with L1 articulators. However, collecting articulatory data for each L2 learner is impractical. Here we propose an approach that avoids this expensive step. Our method builds a cross-speaker forward mapping (CSFM) to generate L2 acoustic observations directly from L1 articulatory trajectories. We evaluated the CSFM against a baseline articulatory synthesizer trained with L2 articulators. Subjective listening tests show that both methods perform comparably in terms of accent reduction and ability to preserve the voice quality of the L2 speaker, with only a small impact on acoustic quality.
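The sketch below shows the overall shape of a cross-speaker forward mapping under stated assumptions: a generic multi-output regressor stands in for the paper's statistical mapping, and the training pairs are presumed to come from some cross-speaker alignment.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training pairs: L1 articulatory frames matched (via some
# cross-speaker alignment) with L2 acoustic frames. Shapes and the choice
# of regressor are illustrative, not the paper's actual model.
l1_articulators = np.random.randn(1000, 12)   # e.g. EMA channels
l2_acoustics = np.random.randn(1000, 25)      # e.g. mel-cepstral features

csfm = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=300)
csfm.fit(l1_articulators, l2_acoustics)

# At conversion time, L1 articulatory trajectories drive the mapping to
# produce acoustic features in the L2 speaker's voice.
converted = csfm.predict(np.random.randn(300, 12))
print(converted.shape)  # (300, 25)
```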


International Conference on Acoustics, Speech, and Signal Processing | 2013

Articulatory inversion and synthesis: Towards articulatory-based modification of speech

Sandesh Aryal; Ricardo Gutierrez-Osuna

Certain speech modifications, such as changes in foreign/regional accents or articulatory styles, are performed more effectively in the articulatory domain than in the acoustic domain. Though measuring articulators is cumbersome, articulatory parameters may be estimated from acoustics through inversion. In this paper, we study the impact on synthesis quality when articulators predicted from acoustics are used in articulatory synthesis. For this purpose, we trained a GMM articulatory synthesizer and drove it with articulators predicted with an RBF-based inversion model. Using inverted instead of measured articulators degraded synthesis quality, as measured through Mel cepstral distortion and subjective tests. However, retraining the synthesizer with predicted articulators not only reversed the effect of errors introduced during inversion but also improved synthesis quality relative to using measured articulators. These results suggest that inverted articulators do not compromise synthesis quality, and open up the possibility of performing speech modification in the articulatory domain through inversion.
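A minimal sketch of the inversion-then-retrain idea, with a kernel ridge regressor standing in for the paper's RBF-based inversion model; the data, hyperparameters, and the commented-out synthesizer call are placeholders.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Hypothetical parallel data: acoustic frames with measured articulators.
acoustics = np.random.randn(800, 25)
articulators = np.random.randn(800, 12)

# Inversion: predict articulators from acoustics with an RBF kernel
# (a kernel ridge regressor stands in for the paper's RBF model).
inversion = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1)
inversion.fit(acoustics, articulators)
predicted = inversion.predict(acoustics)

# Key step from the abstract: retrain the articulatory synthesizer on the
# *predicted* articulators, so systematic inversion errors are absorbed
# into the synthesis model rather than degrading its output.
# synthesizer.fit(predicted, acoustics)  # placeholder for a GMM mapping
```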


Annual Symposium on Computer-Human Interaction in Play | 2014

Flappy voice: an interactive game for childhood apraxia of speech therapy

Tian Lan; Sandesh Aryal; Beena Ahmed; Kirrie J. Ballard; Ricardo Gutierrez-Osuna

We present Flappy Voice, a mobile game to facilitate the acquisition of speech timing and prosody skills for children with apraxia of speech. The game is adapted from the popular game Flappy Bird and replaces touch interaction with voice control. Namely, we map the child's vocal loudness to the bird's position by means of a smoothing filter. In this way, children control the game via the duration and amplitude of their voice. Flappy Voice allows the therapist to create new exercises with different difficulty levels, including an assisted mode for children with limited skills and a free mode for advanced players. Results from a pilot user study with children support the feasibility of the game as a speech training tool.
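As an illustration of the loudness-to-position mapping, the sketch below computes per-frame RMS loudness and smooths it with a one-pole filter. This is one plausible reading of the "smoothing filter" in the abstract; the frame size and filter coefficient are assumptions.

```python
import numpy as np

def bird_position(samples, frame_len=512, alpha=0.9):
    """Map vocal loudness to a normalized vertical game position using
    per-frame RMS energy and a one-pole smoothing filter."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    loudness = np.sqrt((frames ** 2).mean(axis=1))   # per-frame RMS
    pos = np.zeros(n_frames)
    for t in range(1, n_frames):
        pos[t] = alpha * pos[t - 1] + (1 - alpha) * loudness[t]
    return pos / (pos.max() + 1e-9)                  # normalize to [0, 1]

mic = np.random.randn(16000)   # one second of stand-in audio at 16 kHz
print(bird_position(mic)[:5])
```

The filter keeps the bird from jittering with every loudness fluctuation, so sustained phonation, rather than isolated bursts, moves it upward.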


Journal of the Acoustical Society of America | 2015

Reduction of non-native accents through statistical parametric articulatory synthesis

Sandesh Aryal; Ricardo Gutierrez-Osuna

This paper presents an articulatory synthesis method to transform utterances from a second language (L2) learner to appear as if they had been produced by the same speaker but with a native (L1) accent. The approach consists of building a probabilistic articulatory synthesizer (a mapping from articulators to acoustics) for the L2 speaker, then driving the model with articulatory gestures from a reference L1 speaker. To account for differences in the vocal tracts of the two speakers, a Procrustes transform is used to bring their articulatory spaces into registration. In a series of listening tests, accent conversions were rated as more intelligible and less accented than L2 utterances while preserving the voice identity of the L2 speaker. No significant relationship was found between the intelligibility of accent-converted utterances and the proportion of phones outside the L2 inventory. Because the latter is a strong predictor of pronunciation variability in L2 speech, these results suggest that articulatory resynthesis can decouple those aspects of an utterance that are due to the speaker's physiology from those that are due to their linguistic gestures.
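Several of the abstracts above rely on a probabilistic articulatory synthesizer, i.e., a statistical mapping from articulators to acoustics. Below is a minimal sketch of one standard construction, a joint Gaussian mixture with conditional-mean regression; the dimensions, component count, and data are illustrative, and the papers' exact models may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

ART, AC, K = 12, 25, 8                       # illustrative dimensions
art = np.random.randn(2000, ART)             # articulatory frames
ac = np.random.randn(2000, AC)               # acoustic frames

# Fit a joint GMM over concatenated [articulatory, acoustic] frames.
gmm = GaussianMixture(n_components=K, covariance_type="full").fit(
    np.hstack([art, ac]))

def synthesize(x):
    """Conditional mean E[acoustics | articulators = x] under the GMM."""
    mu_x, mu_y = gmm.means_[:, :ART], gmm.means_[:, ART:]
    Sxx = gmm.covariances_[:, :ART, :ART]
    Syx = gmm.covariances_[:, ART:, :ART]
    # Component responsibilities from each component's x-marginal.
    logp = np.array([
        np.log(gmm.weights_[k])
        - 0.5 * np.linalg.slogdet(Sxx[k])[1]
        - 0.5 * (x - mu_x[k]) @ np.linalg.solve(Sxx[k], x - mu_x[k])
        for k in range(K)])
    w = np.exp(logp - logp.max())
    w /= w.sum()
    cond = np.array([mu_y[k] + Syx[k] @ np.linalg.solve(Sxx[k], x - mu_x[k])
                     for k in range(K)])
    return w @ cond                          # mixture of conditional means

print(synthesize(np.random.randn(ART)).shape)  # (25,)
```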


Conference of the International Speech Communication Association | 2016

Comparing Articulatory and Acoustic Strategies for Reducing Non-Native Accents

Sandesh Aryal; Ricardo Gutierrez-Osuna

This article presents an experimental comparison of two types of techniques, articulatory and acoustic, for transforming non-native speech to sound more native-like. Articulatory techniques use articulators from a native speaker to drive an articulatory synthesizer of the non-native speaker. These methods have a good theoretical justification, but articulatory measurements (e.g., via electromagnetic articulography) are difficult to obtain. In contrast, acoustic methods use techniques from the voice-conversion literature to build a mapping between the two acoustic spaces, making them more attractive for practical applications (e.g., language learning). We compare two representative implementations of these approaches, both based on statistical parametric speech synthesis. Through a series of perceptual listening tests, we evaluate the two approaches in terms of accent reduction, speech intelligibility, and speaker quality. Our results show that the acoustic method is more effective than the articulatory method at reducing perceptual ratings of non-native accent, and it also produces speech of higher intelligibility while preserving voice quality.


Conference of the International Speech Communication Association | 2013

Foreign accent conversion through voice morphing

Sandesh Aryal; Daniel Felps; Ricardo Gutierrez-Osuna


Conference of the International Speech Communication Association | 2015

Articulatory-based conversion of foreign accents with deep neural networks

Sandesh Aryal; Ricardo Gutierrez-Osuna
