Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Masashi Unoki is active.

Publication


Featured research published by Masashi Unoki.


Speech Communication | 2005

Development of an F0 control model based on F0 dynamic characteristics for singing-voice synthesis

Takeshi Saitou; Masashi Unoki; Masato Akagi

A fundamental frequency (F0) control model that can cope with the F0 dynamic characteristics related to singing-voice perception is required to construct natural singing-voice synthesis systems. This paper discusses the importance of F0 dynamic characteristics in singing voices and demonstrates, through psychoacoustic experiments, how strongly they influence singing-voice perception. Based on these considerations, it then proposes an F0 control model that can generate the F0 contours of singing voices, together with a singing-voice synthesis system. The results show that several types of F0 fluctuation (overshoot, vibrato, preparation, and fine fluctuation) affect the perception and quality of a singing voice, with overshoot having the greatest effect. Moreover, the results show that the proposed F0 control model can control these F0 fluctuations, generate the F0 contours of singing voices, and be applied to natural singing-voice synthesis.
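The F0 dynamics described above lend themselves to a compact simulation. The sketch below is a rough approximation rather than the paper's model: it drives a second-order system with a stepwise note command so that an underdamped response produces overshoot at note transitions, then adds sinusoidal vibrato and low-passed noise as fine fluctuation. All parameter values (frame rate, natural frequency, damping, depths) are invented for illustration.

```python
import numpy as np
from scipy.signal import butter, cont2discrete, lfilter

fs = 200.0                          # F0 frame rate in Hz (illustrative)
t = np.arange(0, 2.0, 1.0 / fs)

# Stepwise note command in log-F0: C4 for 1 s, then E4 (a made-up score)
command = np.where(t < 1.0, np.log(261.6), np.log(329.6))

# Second-order system H(s) = w0^2 / (s^2 + 2 z w0 s + w0^2); an underdamped
# response (z < 1) overshoots at each note transition. Parameter values
# here are illustrative, not the paper's.
w0, z = 30.0, 0.6
bd, ad, _ = cont2discrete(([w0**2], [1.0, 2 * z * w0, w0**2]),
                          dt=1.0 / fs, method="bilinear")
smooth = lfilter(bd.flatten(), ad, command - command[0]) + command[0]

# Vibrato (~6 Hz, +/-50 cents) and fine fluctuation (low-passed noise)
vibrato = (50.0 / 1200.0) * np.log(2.0) * np.sin(2 * np.pi * 6.0 * t)
bf, af = butter(2, 20.0 / (fs / 2))
fine = lfilter(bf, af, 0.005 * np.random.randn(len(t)))

f0 = np.exp(smooth + vibrato + fine)    # synthesized F0 contour in Hz
```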


Journal of the Acoustical Society of America | 2006

Comparison of the roex and gammachirp filters as representations of the auditory filter

Masashi Unoki; Toshio Irino; Brian R. Glasberg; Brian C. J. Moore; Roy D. Patterson

Although the rounded-exponential (roex) filter has been successfully used to represent the magnitude response of the auditory filter, recent studies with the roex(p, w, t) filter reveal two serious problems: the fits to notched-noise masking data are somewhat unstable unless the filter is reduced to a physically unrealizable form, and there is no time-domain version of the roex(p, w, t) filter to support modeling of the perception of complex sounds. This paper describes a compressive gammachirp (cGC) filter with the same architecture as the roex(p, w, t), which can be implemented in the time domain. The gain and asymmetry of this parallel cGC filter are shown to be comparable to those of the roex(p, w, t) filter, but the fits to masking data are still somewhat unstable. The roex(p, w, t) and parallel cGC filters were also compared with the cascade cGC filter [Patterson et al., J. Acoust. Soc. Am. 114, 1529-1542 (2003)], which was found to provide an equivalent fit with 25% fewer coefficients. Moreover, the fits were stable. The advantage of the cascade cGC filter appears to derive from its parsimonious representation of the high-frequency side of the filter. It is concluded that cGC filters offer better prospects than roex filters for the representation of the auditory filter.
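For orientation, the basic single-parameter roex weighting function and the gammachirp impulse response have compact closed forms; the paper itself works with the more elaborate roex(p, w, t) and compressive-gammachirp variants built from these:

```latex
% roex(p): filter power weighting, with g = |f - f_c| / f_c
W(g) = (1 + p\,g)\, e^{-p\,g}

% gammachirp impulse response (Irino and Patterson)
g_c(t) = a\, t^{N-1} e^{-2\pi b\,\mathrm{ERB}(f_r)\, t}
          \cos\!\bigl(2\pi f_r t + c \ln t + \phi\bigr), \qquad t > 0
```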


Workshop on Applications of Signal Processing to Audio and Acoustics | 2007

Speech-to-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices

Takeshi Saitou; Masataka Goto; Masashi Unoki; Masato Akagi

This paper describes a speech-to-singing synthesis system that can synthesize a singing voice, given a speaking voice reading the lyrics of a song and its musical score. The system is based on the speech manipulation system STRAIGHT and comprises three models controlling three acoustic features unique to singing voices: the fundamental frequency (F0), phoneme duration, and spectrum. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four types of F0 fluctuations: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by considering the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of the singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results show that the proposed system can convert speaking voices into singing voices whose naturalness is almost the same as actual singing voices.
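STRAIGHT itself is not openly packaged, so the sketch below uses the WORLD vocoder (via the pyworld package) as a stand-in and shows only the F0-replacement step of the pipeline; the duration and spectral control models are omitted, and the file names and note values are placeholders.

```python
import numpy as np
import pyworld as pw     # WORLD vocoder, used here as a stand-in for STRAIGHT
import soundfile as sf   # assumed audio I/O helper

x, fs = sf.read("speech.wav")                  # speaking voice (placeholder)
x = np.ascontiguousarray(x, dtype=np.float64)

# Analysis into F0, spectral envelope, and aperiodicity
f0, t = pw.harvest(x, fs)
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)

# Replace the spoken F0 with a singing contour derived from the score; here
# just a flat note plus 6-Hz vibrato (the real system also adds overshoot,
# preparation, and fine fluctuation)
note_hz = 261.6
vibrato = 2.0 ** ((50.0 / 1200.0) * np.sin(2 * np.pi * 6.0 * t))
f0_sing = np.where(f0 > 0, note_hz * vibrato, 0.0)

y = pw.synthesize(f0_sing, sp, ap, fs)
sf.write("singing.wav", y, fs)
```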


International Conference on Acoustics, Speech, and Signal Processing | 2003

A method based on the MTF concept for dereverberating the power envelope from the reverberant signal

Masashi Unoki; Masakazu Furukawa; Keigo Sakata; Masato Akagi

This paper proposes a method for dereverberating the power envelope from the reverberant signal. The method is based on the modulation transfer function (MTF) and does not require the impulse response of the environment to be measured. It improves upon the basic model proposed by Hirobayashi et al. (1998) with regard to three problems: (i) how to precisely extract the power envelope from the observed signal; (ii) how to determine the parameters of the room impulse response; and (iii) whether the MTF concept can be applied to more realistic signals. We show that the proposed method can accurately dereverberate the power envelope from the reverberant signal.
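A minimal sketch of the two ingredients, assuming the exponential-decay room model commonly used with the MTF concept (envelope of the impulse response proportional to exp(-13.8 t / T_R)); the paper additionally estimates the reverberation time T_R blindly, which is omitted here.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def power_envelope(x, fs, fc=20.0):
    """Extract the power envelope: squared Hilbert envelope, low-passed."""
    env = np.abs(hilbert(x)) ** 2
    b, a = butter(2, fc / (fs / 2))
    return np.maximum(filtfilt(b, a, env), 1e-12)

def dereverb_envelope(ey2, fs, TR):
    """Deconvolve the room's envelope e_h^2(t) = exp(-13.8 t / TR)
    from the reverberant power envelope (frequency-domain division)."""
    n = len(ey2)
    t = np.arange(n) / fs
    eh2 = np.exp(-13.8 * t / TR)
    Ey = np.fft.rfft(ey2, 2 * n)
    Eh = np.fft.rfft(eh2, 2 * n)
    ex2 = np.fft.irfft(Ey / (Eh + 1e-8))[:n]
    return np.maximum(ex2, 0.0)
```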


Intelligent Information Hiding and Multimedia Signal Processing | 2011

Reversible Watermarking for Digital Audio Based on Cochlear Delay Characteristics

Masashi Unoki; Ryota Miyauchi

Multimedia signal processing has recently raised serious social issues, such as digital rights management, secure authentication, malicious attacks, and tampering with digital audio/speech signals. Reversible watermarking is a technique that enables these signals to be authenticated and then restored to their originals by removing the watermarks from them. We previously proposed an inaudible digital-audio watermarking approach based on cochlear delay (CD). Here, we investigated how this approach could be developed into reversible watermarking by considering blind detection of inaudible watermarks and the reversibility of audio watermarking. We evaluated the inaudibility and reversibility of the proposed approach by carrying out three objective tests (PEAQ, LSD, and bit-detection or SNR). The results revealed that reversible watermarking based on CD can be accomplished.
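In the authors' related work, CD-based embedding filters each frame with one of two first-order all-pass filters whose group delays approximate cochlear delay. The sketch below shows that embedding step only; the coefficient values, the frame handling, and the omitted detector are simplifications, not the paper's exact scheme.

```python
import numpy as np
from scipy.signal import lfilter

B0, B1 = 0.795, 0.865    # illustrative all-pass coefficients for bits 0 / 1

def embed_bits(x, bits, frame_len):
    """Embed one bit per frame with the first-order all-pass filter
    H(z) = (-b + z^-1) / (1 - b z^-1); frame-boundary continuity is
    ignored in this sketch."""
    y = x.copy()
    for i, bit in enumerate(bits):
        b = B1 if bit else B0
        seg = x[i * frame_len:(i + 1) * frame_len]
        y[i * frame_len:(i + 1) * frame_len] = lfilter([-b, 1.0], [1.0, -b], seg)
    return y
```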


Computer Speech & Language | 2011

Sub-band temporal modulation envelopes and their normalization for automatic speech recognition in reverberant environments

Xugang Lu; Masashi Unoki; Satoshi Nakamura

Automatic speech recognition (ASR) in reverberant environments is still a challenging task. In this study, we propose a robust feature-extraction method based on the normalization of sub-band temporal modulation envelopes (TMEs). The sub-band TMEs are extracted using a series of constant-bandwidth band-pass filters followed by Hilbert transforms and low-pass filtering. Based on these TMEs, the modulation spectra in both the clean and reverberant spaces are transformed to a reference space using modulation transfer functions (MTFs), where the MTFs are estimated as measures of the modulation transfer effect on the sub-band TMEs between the clean, reverberant, and reference spaces. By applying the MTFs to the modulation spectrum, the differences in the modulation spectrum caused by differing recording environments are expected to be removed. Based on the normalized modulation spectrum, an inverse Fourier transform is applied to restore the sub-band TMEs while retaining their original phase information. We tested the proposed method in speech recognition experiments in a reverberant room with differing speaker-to-microphone distances (SMDs). For comparison, the recognition performance of traditional Mel-frequency cepstral coefficients with mean and variance normalization was used as the baseline. The experimental results showed that, averaging the results for SMDs from 50 cm to 400 cm, we obtained a 44.96% relative improvement by using sub-band TME processing alone, and a further 15.68% relative improvement by performing the normalization on the modulation spectrum of the sub-band TMEs. In all, we obtained a 53.59% relative improvement, which was better than other temporal filtering and normalization methods.
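The TME extraction step can be sketched directly from the description above; the filter orders, band edges, and 20-Hz envelope cutoff below are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def subband_tme(x, fs, f_lo, f_hi, env_fc=20.0):
    """Temporal modulation envelope of one sub-band: band-pass filtering,
    Hilbert envelope, then low-pass filtering of the envelope."""
    b, a = butter(4, [f_lo / (fs / 2), f_hi / (fs / 2)], btype="band")
    band = filtfilt(b, a, x)
    env = np.abs(hilbert(band))
    b2, a2 = butter(2, env_fc / (fs / 2))
    return filtfilt(b2, a2, env)

# A constant-bandwidth bank, e.g. eight 500-Hz-wide bands (illustrative):
# edges = [(100 + 500 * k, 600 + 500 * k) for k in range(8)]
# tmes = [subband_tme(x, fs, lo, hi) for lo, hi in edges]
```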


Speech Communication | 2010

Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition

Xugang Lu; Shigeki Matsuda; Masashi Unoki; Satoshi Nakamura

Traditionally, noise reduction methods for additive noise have been quite different from those for reverberation. In this study, we investigated the effect of additive noise and reverberation on speech on the basis of the concept of temporal modulation transfer. We first analyzed the noise effect on the temporal modulation of speech. On the basis of this analysis, we then proposed a two-stage processing algorithm that adaptively normalizes the temporal modulation of speech to extract robust speech features for automatic speech recognition. In the first stage of the proposed algorithm, the temporal modulation contrast of the cepstral time series for both clean and noisy speech is normalized. In the second stage, the contrast-normalized temporal modulation spectrum is smoothed in order to reduce the artifacts due to noise while preserving the information in the speech modulation events (edges). We tested our algorithm in speech recognition experiments under an additive-noise condition, a reverberant condition, and a noisy condition (both additive noise and reverberation) using the AURORA-2J corpus. Our results showed that, as part of a uniform processing framework, the algorithm helped achieve the following: (1) for the additive-noise condition, a 55.85% relative word error reduction (RWER) rate with clean-condition training and a 41.64% RWER rate with multi-condition training; (2) for the reverberant condition, a 51.28% RWER rate; and (3) for the noisy condition (both additive noise and reverberation), a 95.03% RWER rate. In addition, we evaluated the performance of each stage of the proposed algorithm in AURORA-2J and AURORA4 experiments, and compared our algorithm with two similar processing algorithms for the second stage. The evaluation results further confirmed the effectiveness of the proposed algorithm.
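The two stages can be caricatured in a few lines; the sketch below normalizes each cepstral trajectory's temporal modulation contrast and then applies a temporal median filter as a simple stand-in for the paper's edge-preserved smoother (the window size and the choice of smoother are assumptions).

```python
import numpy as np
from scipy.ndimage import median_filter

def two_stage_normalize(cep, win=11):
    """cep: (frames, dims) cepstral time series.
    Stage 1: contrast-normalize each trajectory along time.
    Stage 2: smooth along time while preserving abrupt speech events;
    a median filter stands in for the paper's edge-preserved smoother."""
    c = (cep - cep.mean(axis=0)) / (cep.std(axis=0) + 1e-8)
    return median_filter(c, size=(win, 1))
```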


International Conference on Communications | 2008

An LP-based blind model for restoring bone-conducted speech

Thang tat Vu; Masashi Unoki; Masato Akagi

Because of its robustness against external noise, bone-conducted (BC) speech can be used in place of noisy air-conducted speech in extremely noisy environments. However, the quality of BC speech is very low, and restoring it is a challenging topic in speech signal processing. To improve BC speech, many studies have tried to model and compensate for the degradation the signal undergoes through bone transduction. In a previous study, we proposed a linear prediction (LP) based blind restoration model. In this paper, we evaluate the proposed model against other models to determine whether it can adequately improve the voice quality and intelligibility of BC speech, using objective measures (LSD, MCD, and LCD) and by carrying out Japanese word-intelligibility tests (JWITs), Vietnamese word-intelligibility tests (VWITs), and Modified Rhyme Tests (MRTs) for English. The experimental results across these three languages demonstrate the practicability of blind BC restoration.
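As background for the LP-based model, the sketch below computes autocorrelation-method LP coefficients and comments the inverse-filter/re-filter idea behind restoration; the restoration filter itself (how the BC envelope is mapped to an air-conducted one) is the paper's contribution and is not reproduced here.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coefficients(frame, order=16):
    """Autocorrelation-method LP coefficients a[1..p], solving the
    Toeplitz normal equations R a = r."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

# Restoration idea (sketch): inverse-filter the BC frame with its own LP
# polynomial A_bc(z) = 1 - sum(a_k z^-k) to get a source-like residual,
# then re-filter with a restored envelope 1 / A_rest(z):
#   from scipy.signal import lfilter
#   residual = lfilter(np.r_[1.0, -a_bc], [1.0], frame)
#   restored = lfilter([1.0], np.r_[1.0, -a_rest], residual)
```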


IEICE Technical Report | 2007

Estimates of Tuning of Auditory Filter Using Simultaneous and Forward Notched-noise Masking

Masashi Unoki; Ryota Miyauchi; Chin-Tuan Tan

The frequency selectivity of the auditory system is often conceptualized as a bank of bandpass auditory filters. Over the past 30 years, many simultaneous masking experiments using notched-noise maskers have been done to define the shape of the auditory filters (e.g., Glasberg and Moore, 1990; Patterson and Nimmo-Smith, 1980; Rosen and Baker, 1994). The studies of Glasberg and Moore (2000) and Baker and Rosen (2006) are notable inasmuch as they measured the human auditory filter shape over most of the range of frequencies and levels encountered in everyday hearing. The advantage of notched-noise masking is that one can avoid off-frequency listening and investigate filter asymmetry. However, the derived filter shapes are also affected by suppression. The tunings of auditory filters derived from forward masking data are apparently sharper than those derived from simultaneous masking data, especially at low signal levels, and the tuning of a filter is commonly believed to be affected by cochlear nonlinearity such as suppression. In past studies, the tunings of auditory filters derived from simultaneous masking data were wider than those derived from nonsimultaneous (forward) masking data (Moore and Glasberg, 1978; Glasberg and Moore, 1982; Oxenham and Shera, 2003). Heinz et al. (2002) showed that tuning is generally sharpest at low stimulus levels and that suppression may affect tuning estimates more at high characteristic frequencies (CFs) than at low CFs. If this suggestion holds, i.e., if the effect of suppression varies with frequency, comparing the filter bandwidths derived from simultaneous and forward masking experiments should reveal it. In this study, we attempt to estimate filter tunings using both simultaneous and forward masking experiments with a notched-noise masker, to investigate how suppression affects estimates of frequency selectivity across signal frequencies, signal levels, notch conditions (symmetric and asymmetric), and signal delays. This study extends that of Unoki and Tan (2005).
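The notched-noise method rests on the power-spectrum model of masking: at threshold, the signal power P_s is assumed proportional to the masker power passed by the filter, so thresholds measured across notch widths constrain the filter weighting W(f):

```latex
P_s = K \int_0^{\infty} W(f)\, N(f)\, df
```

Here N(f) is the masker power spectral density and K reflects detector efficiency; fitting this relation to thresholds at many notch widths yields the filter-shape parameters.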


Journal of the Acoustical Society of America | 2013

Objective evaluation of sound quality for attacks on robust audio watermarking

Akira Nishimura; Masashi Unoki; Kazuhiko Kondo; Akio Ogihara

Various attacks on robust audio watermarking have been proposed. Excessive intentional modification and/or perceptual coding of the distributed stego audio degrades sound quality and can prevent the extraction of hidden data, so that piracy-detection systems based on automated watermarking and crawling are disrupted. Reversible signal-processing attacks, such as linear speed changes, also degrade the sound quality of distributed stego audio. However, inverse processing of a reversible attack can recover the original sound quality of the audio received after illegal distribution. Therefore, the degradation of sound quality induced by perceptual codecs and by reversible processing attacks followed by inverse processing should be considered when determining whether the intensity of these attacks is realistic. In this study, objective audio quality measurement was applied to audio signals, including typical perceptual coding (MP3, tandem MP3, and MPEG-4 AAC) and reversible signal processing techniques, i...
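PEAQ is standardized (ITU-R BS.1387) and not reproduced here; as a simpler objective measure of the kind used across these studies, the log-spectral distance between time-aligned, equal-length reference and processed signals can be computed as below (the frame and FFT sizes are illustrative).

```python
import numpy as np

def log_spectral_distance(ref, proc, n_fft=1024, hop=512):
    """Mean log-spectral distance in dB between two aligned signals."""
    def spectra(x):
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) + 1e-10
    r, p = spectra(ref), spectra(proc)
    d = 20.0 * (np.log10(r) - np.log10(p))
    return float(np.mean(np.sqrt(np.mean(d ** 2, axis=1))))
```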

Collaboration


Dive into Masashi Unoki's collaboration.

Top Co-Authors

Masato Akagi (Japan Advanced Institute of Science and Technology)
Ryota Miyauchi (Japan Advanced Institute of Science and Technology)
Xugang Lu (National Institute of Information and Communications Technology)
Shengbei Wang (Japan Advanced Institute of Science and Technology)
Nhut Minh Ngo (Japan Advanced Institute of Science and Technology)
Satoshi Nakamura (Nara Institute of Science and Technology)
Zhi Zhu (Japan Advanced Institute of Science and Technology)
Shota Morita (Japan Advanced Institute of Science and Technology)
Jessada Karnjana (Japan Advanced Institute of Science and Technology)
Yang Liu (Japan Advanced Institute of Science and Technology)