
Publication


Featured research published by Jinfu Ni.


Speech Communication | 2006

Quantitative and structural modeling of voice fundamental frequency contours of speech in Mandarin

Jinfu Ni; Keikichi Hirose

This paper presents an approach to structural modeling of voice fundamental frequency contours (F0 contours) of Mandarin utterances as a sequence of modulated tones. A proposed functional model mathematically implements the tone modulation with both local and global controls. The local control places a series of normalized F0 targets along the time axis, specified by transition times and amplitudes, which are always reached; the transitions between targets are approximated by connecting truncated second-order transition functions. The global control, in terms of sentence modality, simply compresses or expands the heights and ranges of the prototypical patterns of syllabic tones generated by the local control. Both controls are integrated in a unified framework, and this paper explains the underlying scientific and linguistic principles. Analysis of 1044 utterances of various sentences read by eight native speakers revealed that the model could closely approximate the observed F0 contours with a small number of parameters. These parameters are localized and suited to a data-driven fitting process. As will be demonstrated, the model is also promising for measuring intonation variations from observed F0 contours.
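The local control described above can be sketched in a few lines. The transition shape (a critically damped second-order step response) and the time constant `tau` below are assumptions for illustration, not the paper's exact formulation:

```python
import math

def second_order_transition(x):
    """Critically damped second-order step response: rises
    monotonically from 0 toward 1 as x grows."""
    return 1.0 - (1.0 + x) * math.exp(-x)

def contour_from_targets(targets, fs=100.0, tau=0.05):
    """targets: list of (time_sec, normalized amplitude) pairs, sorted
    by time. Between consecutive targets the contour follows a
    second-order transition from the previous amplitude toward the
    next; each transition is truncated simply because the next one
    takes over at the next target time."""
    times, values = [], []
    for (ta, aa), (tb, ab) in zip(targets, targets[1:]):
        t = ta
        while t < tb:
            g = second_order_transition((t - ta) / tau)
            times.append(t)
            values.append(aa + (ab - aa) * g)
            t += 1.0 / fs
    times.append(targets[-1][0])
    values.append(targets[-1][1])  # the final target is reached exactly
    return times, values
```

Truncation falls out naturally here: nothing forces a transition to finish before the next target's transition begins.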


Journal of the Acoustical Society of America | 2006

Constrained tone transformation technique for separation and combination of Mandarin tone and intonation

Jinfu Ni; Hisashi Kawai; Keikichi Hirose

This paper addresses a classical but important problem: the coupling of lexical tones and sentence intonation in tonal languages such as Chinese, focusing particularly on voice fundamental frequency (F0) contours of speech. It is important because it forms the basis of speech synthesis technology and prosody analysis. We provide a solution to the problem with a constrained tone transformation technique based on structural modeling of the F0 contours. This consists of transforming target values in pairs from norms to variants. These targets sparsely specify the prosodic contributions to the F0 contours, while the alignment of target pairs between norms and variants is based on underlying lexical tone structures. When the norms take the citation forms of lexical tones, the technique makes it possible to separate sentence intonation from observed F0 contours. When the norms take normative F0 contours, it is possible to measure intonation variations from the norms to the variants, both having identical lexical tone structures. This paper explains the underlying scientific and linguistic principles and presents an algorithm that was implemented on computers. The method's capability of separating and combining tone and intonation is evaluated through analysis and re-synthesis of several hundred observed F0 contours.
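Once observed and normative targets have been paired by the shared lexical-tone structure, the separation step reduces to a per-target residual. This toy sketch assumes the alignment is already given and works directly on log-F0 target values:

```python
def separate_intonation(observed, norms):
    """observed, norms: lists of log-F0 target values aligned in
    pairs by the shared lexical tone structure (the alignment itself
    is assumed given here). The residual is the intonation component."""
    return [o - n for o, n in zip(observed, norms)]

def combine(norms, intonation):
    """Inverse operation: superimpose an intonation pattern back onto
    citation-form (or normative) tone targets."""
    return [n + i for n, i in zip(norms, intonation)]
```

Separation followed by combination reproduces the observed targets, which is the round-trip property the analysis/re-synthesis evaluation relies on.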


international conference on acoustics, speech, and signal processing | 2006

Constructing a Phonetic-Rich Speech Corpus While Controlling Time-Dependent Voice Quality Variability for English Speech Synthesis

Jinfu Ni; Toshio Hirai; Hisashi Kawai

This paper presents a practical approach to constructing a large-scale speech corpus for corpus-based speech synthesis. It consists of (1) selecting a source text corpus that fits limited target domains; (2) analyzing the source text corpus to obtain unit statistics; (3) automatically extracting prompt sentences from the source text corpus to maximize the intended unit coverage for a given amount of text; and (4) recording the prompt sentences while controlling critical factors that cause undesirable voice variability. The paper describes related computational methods, such as a greedy algorithm for prompt selection, the proximity effects found in a real recording system, and a technique for detecting time-dependent voice variations. While the approach is demonstrated on English, it is also promising for other languages.
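Step (3) is a coverage-maximization problem. A standard greedy set-cover heuristic of the kind the paper mentions can be sketched as follows; the scoring here (new units per sentence) is an assumption, and the paper's exact objective may weight units differently:

```python
def greedy_select(sentences, budget):
    """sentences: list of (sentence_id, set_of_units) pairs.
    Repeatedly pick the sentence covering the most not-yet-covered
    units, up to `budget` sentences."""
    covered, chosen = set(), []
    remaining = list(sentences)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda s: len(s[1] - covered))
        if not (best[1] - covered):
            break  # no remaining sentence adds new coverage
        chosen.append(best[0])
        covered |= best[1]
        remaining.remove(best)
    return chosen, covered
```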


international conference on acoustics, speech, and signal processing | 2007

Use of Poisson Processes to Generate Fundamental Frequency Contours

Jinfu Ni; Satoshi Nakamura

The prosodic contributions to voice fundamental frequency (F0) contours can be analyzed into a series of sparse tonal targets (F0 peaks and valleys). The transitions through these targets are interpolated by spline or filtering functions to predict the shape of F0 contours. A functional model was proposed in previous work for this purpose. This paper presents an enhanced version of this model, obtained by replacing its decay filter with a Poisson-process-induced filter; the enhancement is genuine because the former is a special case of the latter. The new filter can delay the decay process while a transition is in progress, so a target point can also act as a target level when necessary. The algorithms for estimating parameters, which were implemented on computers, are also presented. Experiments conducted on thousands of observed F0 contours of Mandarin, Japanese, and English indicate that the enhanced version significantly facilitates their automatic parameterization.
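The abstract does not spell the filter out, but one natural reading of "Poisson-process-induced" is the Poisson survival function P(N(t) <= n), which for order n = 0 collapses to a plain exponential decay, consistent with the special-case claim. The following sketch assumes that form:

```python
import math

def poisson_filter(t, lam, n):
    """Assumed form of a Poisson-process-induced decay filter:
    h_n(t) = P(N(t) <= n) = exp(-lam*t) * sum_{k=0}^{n} (lam*t)^k / k!
    For n = 0 this is the plain exponential decay filter. Larger n
    holds the response near 1 for longer before it decays, i.e. the
    decay is delayed."""
    if t < 0:
        return 0.0
    s = sum((lam * t) ** k / math.factorial(k) for k in range(n + 1))
    return math.exp(-lam * t) * s
```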


international conference on acoustics, speech, and signal processing | 2004

Minimum segmentation error based discriminative training for speech synthesis application

Yi-Jian Wu; Hisashi Kawai; Jinfu Ni; Ren-Hua Wang

In the conventional HMM-based segmentation method, HMM training is based on the MLE criterion, which links the segmentation task to the problem of distribution estimation. The HMMs are built to identify phonetic segments, not to detect boundaries. This inconsistency between training and application limits segmentation performance. In this paper, we adopt a discriminative training method and introduce a new criterion, named minimum segmentation error (MSGE), for HMM training. In this method, a loss function directly related to the segmentation error is defined. By minimizing the overall empirical loss with the generalized probabilistic descent (GPD) algorithm, the segmentation error is also minimized. Results on both Chinese and Japanese data show that segmentation accuracy is improved. Moreover, the method is robust even without extensive knowledge of HMM modeling, e.g., when the number of states is not optimized.
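Stripped to one scalar, the GPD idea looks like this. The sigmoid loss and the single boundary parameter are illustrative stand-ins: the paper optimizes HMM parameters, not boundary positions directly:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gpd_step(boundary, true_boundary, lr=0.5, gamma=1.0):
    """One GPD-style gradient update. The loss is a smoothed
    (sigmoidal) function of the absolute segmentation error, so its
    gradient pushes the predicted boundary toward the reference."""
    err = boundary - true_boundary
    loss = sigmoid(gamma * abs(err))  # smoothed segmentation error
    # d(loss)/d(boundary) = gamma * sign(err) * loss * (1 - loss)
    grad = gamma * math.copysign(1.0, err) * loss * (1.0 - loss)
    return boundary - lr * grad, loss
```

Because the loss is differentiable, minimizing it with gradient steps also drives the underlying (non-differentiable) segmentation error down, which is the point of the smoothing.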


international symposium on universal communication | 2008

Prosody Modeling from Tone to Intonation in Chinese using a Functional F0 Model

Jinfu Ni; Shinsuke Sakai; Tohru Shimizu; Satoshi Nakamura

Chinese is a tonal language: it has both lexical tones and intonation, so fundamental frequency (F0) contours consist of tone and intonation components. This paper presents an approach to modeling the two components separately and combining them to form the final F0 contours based on a functional F0 model. We analyze tonal patterns as sparse target points (tonal F0 peaks and valleys) and model them using classification and regression trees (CART) with contextual linguistic features. As a first step, we stylize expressive intonation using a few piecewise linear patterns specified by a few markup tags. Both tonal and intonational patterns are represented in parametric form within the framework of this F0 model. Our experimental results indicate that very low F0 prediction errors were achieved by the CART-based modeling of the tonal patterns uttered by female and male speakers. In a listening test, native speakers could identify 90% of synthesized stimuli with enhanced word-level emphasis. Also, the linguistic features related to lexical tone context and the distinction between voiced and unvoiced initials played the most important role in characterizing the tonal patterns.
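The core of CART regression is choosing the split that minimizes within-leaf squared error. A one-feature, one-split sketch of that criterion (real systems use many contextual features and grow deep trees):

```python
def sse(vals):
    """Sum of squared errors around the mean of a leaf."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(xs, ys):
    """CART-style split on a single scalar feature: choose the
    threshold minimizing the total squared error of the two leaf
    means when ys (e.g. target F0 values) are partitioned by xs."""
    best_err, best_thr = float("inf"), None
    for thr in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < thr]
        right = [y for x, y in zip(xs, ys) if x >= thr]
        err = sse(left) + sse(right)
        if err < best_err:
            best_err, best_thr = err, thr
    return best_thr
```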


Speech Communication | 2005

Discriminative training and explicit duration modeling for HMM-based automatic segmentation

Yi-Jian Wu; Hisashi Kawai; Jinfu Ni; Ren-Hua Wang

HMM-based automatic segmentation has been widely used in corpus construction for concatenative speech synthesis. Since the most important causes of inaccuracy in HMM-based automatic segmentation are the HMM training criterion and duration control, we study these particular issues. For HMM training, we apply a discriminative training method and introduce a new criterion, named Minimum SeGmentation Error (MSGE). In this method, a loss function directly related to the segmentation error is defined, and parameter optimization is performed by the Generalized Probabilistic Descent (GPD) algorithm. For the duration control problem, we apply explicit duration models and propose a two-step segmentation method to contain the computational cost, where the duration model is incorporated in a postprocessing procedure. Experimental results show that the two techniques significantly improve segmentation accuracy with different focuses: MSGE-based discriminative training improves the accuracy of sensitive boundaries, i.e., boundaries where a segmentation error is likely to cause a noticeable degradation in speech synthesis quality, while explicit duration modeling eliminates large errors. After combining the two techniques, the average error was reduced from 6.86 ms to 5.79 ms on Japanese data, and from 8.67 ms to 6.61 ms on Chinese data. Simultaneously, the number of errors larger than 30 ms was reduced by 25% and 51% on Chinese and Japanese data, respectively.
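The postprocessing step can be pictured as rescoring candidate boundaries with a duration log-likelihood. The Gaussian duration model and the interpolation weight `w` below are assumptions for illustration, not the paper's exact formulation:

```python
import math

def rescore(candidates, mean_dur, std_dur, w=1.0):
    """Toy duration-model postprocessor: each candidate boundary
    carries an acoustic score and implies a phone duration; add a
    Gaussian log-likelihood duration term and keep the best
    candidate. `w` weights the duration model against the acoustic
    score."""
    def dur_logp(d):
        z = (d - mean_dur) / std_dur
        return -0.5 * z * z - math.log(std_dur * math.sqrt(2 * math.pi))
    return max(candidates,
               key=lambda c: c["acoustic"] + w * dur_logp(c["duration"]))
```

Running the duration model as a second pass over a short list of candidates is what keeps the cost low, rather than folding explicit durations into the first-pass search.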


International Conference of the Pacific Association for Computational Linguistics | 2015

The Application of Phrase Based Statistical Machine Translation Techniques to Myanmar Grapheme to Phoneme Conversion

Ye Kyaw Thu; Win Pa Pa; Andrew M. Finch; Jinfu Ni; Eiichiro Sumita; Chiori Hori

Grapheme-to-Phoneme (G2P) conversion is a necessary step for speech synthesis and speech recognition. In this paper, we apply a Statistical Machine Translation (SMT) approach to Myanmar G2P conversion. The performance of G2P conversion with SMT is measured in terms of BLEU score, syllable-level phoneme accuracy, and processing time. The experimental results show that G2P conversion with SMT outperforms a Conditional Random Field (CRF) approach. Moreover, the training time was considerably shorter than that of the CRF approach.
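A phrase-based view of G2P treats grapheme chunks as translation phrases. This toy greedy, monotone decoder with an invented phrase table illustrates the idea; a real SMT system such as Moses instead searches over all segmentations with translation- and language-model scores:

```python
def g2p_phrase_decode(graphemes, phrase_table):
    """Greedy longest-match, left-to-right phrase translation from a
    grapheme string to a phoneme list. The phrase table maps grapheme
    chunks to phoneme sequences (entries here are invented for
    illustration)."""
    out, i = [], 0
    while i < len(graphemes):
        for j in range(len(graphemes), i, -1):
            chunk = graphemes[i:j]
            if chunk in phrase_table:
                out.extend(phrase_table[chunk])
                i = j
                break
        else:
            out.append(graphemes[i])  # pass through an unknown grapheme
            i += 1
    return out
```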


international symposium on chinese spoken language processing | 2014

Superpositional HMM-based intonation synthesis using a functional F0 model

Jinfu Ni; Yoshinori Shiga; Chiori Hori

This paper addresses intonation synthesis that combines statistical and generative models to manipulate fundamental frequency (F0) contours in the framework of HMM-based speech synthesis. An F0 contour is represented as a superposition of micro, accent, and register components on a logarithmic scale, in light of the Fujisaki model. The three component sets are extracted from a speech corpus by a pitch-decomposition algorithm based on a functional F0 model, and a separate context-dependent (CD) HMM set is trained for each component. At synthesis time, the CDHMM-generated micro, accent, and register components are superimposed to form F0 contours for the input text. Objective and subjective evaluations are carried out on a Japanese speech corpus. Compared with the conventional approach, this method improves naturalness by achieving better local and global F0 behavior, and it exhibits a link between phonology and phonetics, making it possible to flexibly control intonation on the fly by manipulating the parameters of the functional F0 model with given marking information.
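The superpositional representation itself is simple. A minimal sketch, assuming each component is already available per frame in log Hz:

```python
import math

def superpose(micro, accent, register):
    """Superpositional F0 generation (after the Fujisaki model):
    micro, accent, and register components are summed pointwise in
    the log-F0 domain, then exponentiated back to Hz."""
    return [math.exp(m + a + r)
            for m, a, r in zip(micro, accent, register)]
```

Working in the log domain is what makes the components additive: a register shift scales every F0 value multiplicatively, while accent and micro components modulate on top of it.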


international conference on acoustics, speech, and signal processing | 2003

Tone feature extraction through parametric modeling and analysis-by-synthesis-based pattern matching

Jinfu Ni; Hisashi Kawai

A functional fundamental frequency (F0) model is applied to extract tone peak and gliding features from Mandarin F0 contours, aiming at automatic prosodic labeling of a large-scale speech corpus. Modeling the four lexical tones and representing them in a parametric form based on the F0 model, we first cluster baseline tone patterns using the LBG (Linde-Buzo-Gray) algorithm, then perform analysis-by-synthesis-based pattern matching to estimate underlying tone peaks and tone pattern types from observed F0 contours and phonetic labels with lexical tones. Tone gliding features are re-estimated after the determination of tone peaks. 94% of the automatically estimated labels were consistent with manual labels in an open test of 968 utterances from eight native speakers. Experimental results also indicate that the proposed method is applicable to F0 contour smoothing and tone verification.
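The LBG clustering step can be sketched on scalar data: start from the global mean, split each codeword, and refine with nearest-neighbor reassignment. The perturbation factor and iteration count below are arbitrary choices, not values from the paper:

```python
def lbg(vectors, n_codewords, eps=0.01, iters=20):
    """Linde-Buzo-Gray codebook training on scalar data: binary
    splitting of each codeword followed by k-means-style refinement
    until the requested codebook size is reached."""
    codebook = [sum(vectors) / len(vectors)]  # global mean
    while len(codebook) < n_codewords:
        # split every codeword by a small +/- perturbation
        codebook = [c * (1 + s) for c in codebook for s in (eps, -eps)]
        for _ in range(iters):
            # assign each vector to its nearest codeword
            cells = [[] for _ in codebook]
            for v in vectors:
                i = min(range(len(codebook)),
                        key=lambda k: (v - codebook[k]) ** 2)
                cells[i].append(v)
            # update each codeword to its cell mean
            codebook = [sum(cell) / len(cell) if cell else codebook[i]
                        for i, cell in enumerate(cells)]
    return sorted(codebook)
```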

Collaboration


Top co-authors of Jinfu Ni:

- Hisashi Kawai (National Institute of Information and Communications Technology)
- Satoshi Nakamura (Nara Institute of Science and Technology)
- Shinsuke Sakai (National Institute of Information and Communications Technology)
- Keiichi Tokuda (Nagoya Institute of Technology)
- Minoru Tsuzaki (Nagoya Institute of Technology)
- Chiori Hori (National Institute of Information and Communications Technology)
- Tomoki Toda (National Institute of Information and Communications Technology)
- Yoshinori Shiga (National Institute of Information and Communications Technology)