Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Yasunori Ohishi is active.

Publication


Featured research published by Yasunori Ohishi.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Bayesian semi-supervised audio event transcription based on Markov Indian buffet process

Yasunori Ohishi; Daichi Mochihashi; Tomoko Matsui; Masahiro Nakano; Hirokazu Kameoka; Tomonori Izumitani; Kunio Kashino

We present a novel generative model for audio event transcription that recognizes “events” in audio signals containing multiple kinds of overlapping sounds. In the proposed model, the overlapping audio events are first modeled with nonnegative matrix factorization into which two Bayesian nonparametric approaches, the Markov Indian buffet process and the Chinese restaurant process, are incorporated. This allows us to transcribe the events automatically while avoiding the model selection problem, since a countably infinite number of possible audio events is assumed in the input signal. Bayesian logistic regression then annotates the audio frames with multiple event labels in a semi-supervised learning setup. Experimental results show that our model annotates an audio signal better than a baseline method. We also verify that our infinite generative model can detect unknown audio events that are not included in the training data.
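
As a rough illustration of the pipeline described above, the sketch below factorizes a magnitude spectrogram with plain (finite-rank, non-Bayesian) NMF in place of the Markov Indian buffet process model, then labels frames with ordinary logistic regression trained on a small set of annotated frames. All data, shapes, and hyperparameters are synthetic placeholders, not the paper's setup.

```python
# Simplified stand-in for the paper's pipeline: finite NMF + plain logistic regression.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
spectrogram = rng.random((513, 1000))          # |STFT|, freq bins x frames (synthetic)
frame_labels = rng.integers(0, 3, size=1000)   # toy event class per frame
labeled = rng.random(1000) < 0.1               # only ~10% of frames are annotated

# 1) Factorize the spectrogram into spectral bases W and per-frame activations H.
#    The paper infers the number of bases nonparametrically; here it is fixed to 20.
nmf = NMF(n_components=20, init="nndsvda", max_iter=400, random_state=0)
W = nmf.fit_transform(spectrogram)             # (513, 20)
H = nmf.components_                            # (20, 1000)

# 2) Use activation vectors of the labeled frames as features, then annotate every frame.
features = H.T                                  # (1000, 20)
clf = LogisticRegression(max_iter=1000).fit(features[labeled], frame_labels[labeled])
predicted_events = clf.predict(features)        # event label for all frames
print(predicted_events[:10])
```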


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Generative modeling of voice fundamental frequency contours

Hirokazu Kameoka; Kota Yoshizato; Tatsuma Ishihara; Kento Kadowaki; Yasunori Ohishi; Kunio Kashino

This paper introduces a generative model of voice fundamental frequency (F0) contours that allows us to extract prosodic features from raw speech data. The present F0 contour model is formulated by translating the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration, into a probabilistic model described as a discrete-time stochastic process. There are two motivations behind this formulation. One is to derive a general parameter estimation framework for the Fujisaki model that allows the introduction of powerful statistical methods. The other is to construct an automatically trainable version of the Fujisaki model that we can incorporate into statistical-model-based text-to-speech synthesizers in such a way that the Fujisaki-model parameters can be learned from a speech corpus in a unified manner. It could also be useful for other speech applications such as emotion recognition, speaker identification, speech conversion and dialogue systems, in which prosodic information plays a significant role. We quantitatively evaluated the performance of the proposed Fujisaki model parameter extractor using real speech data. Experimental results revealed that our method was superior to a state-of-the-art Fujisaki model parameter extractor.
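
For readers unfamiliar with the Fujisaki model, the following minimal sketch evaluates its deterministic form: log F0 is a baseline plus phrase components (second-order impulse responses) and accent components (step responses). The command times, amplitudes, and filter constants below are invented for illustration and are not taken from the paper.

```python
# Deterministic Fujisaki model: baseline + phrase commands + accent commands.
import numpy as np

def phrase_component(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * t), 0.0)

def accent_component(t, beta=20.0, ceiling=0.9):
    """Step response of the accent control mechanism, Ga(t)."""
    resp = 1.0 - (1.0 + beta * t) * np.exp(-beta * t)
    return np.where(t >= 0, np.minimum(resp, ceiling), 0.0)

fs = 100                                   # frames per second
t = np.arange(0.0, 3.0, 1.0 / fs)          # 3-second utterance
log_f0 = np.log(120.0) * np.ones_like(t)   # baseline Fb = 120 Hz

# Phrase commands: (onset time, amplitude); accent commands: (on, off, amplitude).
for t0, ap in [(0.0, 0.5), (1.6, 0.4)]:
    log_f0 += ap * phrase_component(t - t0)
for t1, t2, aa in [(0.3, 0.7, 0.3), (1.0, 1.4, 0.25), (2.0, 2.5, 0.35)]:
    log_f0 += aa * (accent_component(t - t1) - accent_component(t - t2))

f0_hz = np.exp(log_f0)                     # synthetic F0 contour in Hz
print(f0_hz.min(), f0_hz.max())
```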


International Conference on Acoustics, Speech, and Signal Processing | 2012

Bayesian nonparametric music parser

Masahiro Nakano; Yasunori Ohishi; Hirokazu Kameoka; Ryo Mukai; Kunio Kashino

This paper proposes a novel representation of music that can be used for similarity-based music information retrieval, and also presents a method that converts an input polyphonic audio signal to the proposed representation. The representation is a 2-dimensional tree structure in which each node encodes a musical note and the two dimensions correspond to time and to simultaneously sounding notes, respectively. Since the temporal structure and the synchrony of simultaneous events are both essential in music, our representation reflects them explicitly. In conventional approaches to music representation from audio, note extraction is usually performed prior to structure analysis, but accurate note extraction has been a difficult task. In the proposed method, note extraction and structure estimation are performed simultaneously, and thus the optimal solution is obtained with a unified inference procedure. Specifically, we propose an extended 2-dimensional infinite probabilistic context-free grammar and a sparse factor model for spectrogram analysis. An efficient inference algorithm based on Markov chain Monte Carlo sampling and dynamic programming is presented. The experimental results show the effectiveness of the proposed approach.
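
A toy sketch of the 2-dimensional tree representation may help: internal nodes split either along time or across simultaneously sounding notes, and leaves hold individual notes. The class names and the example phrase are made up, and the sketch does not touch the paper's infinite PCFG or its spectrogram model.

```python
# Toy 2-D tree: "seq" nodes split in time, "sim" nodes split across simultaneous notes.
from dataclasses import dataclass
from typing import Union

@dataclass
class Note:
    pitch: str       # e.g. "C4"
    duration: float  # in beats

@dataclass
class Split:
    kind: str                      # "seq" = temporal split, "sim" = simultaneity
    left: Union["Split", Note]
    right: Union["Split", Note]

# A C-major chord (three simultaneous notes) followed by a single G note.
chord = Split("sim", Note("C4", 1.0), Split("sim", Note("E4", 1.0), Note("G4", 1.0)))
phrase = Split("seq", chord, Note("G4", 1.0))

def leaves(node):
    """Enumerate the notes encoded at the leaves of the 2-D tree."""
    if isinstance(node, Note):
        return [node]
    return leaves(node.left) + leaves(node.right)

print([n.pitch for n in leaves(phrase)])   # ['C4', 'E4', 'G4', 'G4']
```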


International Conference on Acoustics, Speech, and Signal Processing | 2014

Mixture of Gaussian process experts for predicting sung melodic contour with expressive dynamic fluctuations

Yasunori Ohishi; Daichi Mochihashi; Hirokazu Kameoka; Kunio Kashino

We present a generative model for predicting the sung melodic contour, i.e., F0 contour, with expressive dynamic fluctuations, such as vibrato and portamento, for a given musical score. Although several studies have attempted to characterize such fluctuations, no systematic method has been developed for generating the F0 contour with them in connection with musical notes. In our model, the relationship between a musical note sequence and F0 contour is directly learned by a mixture of Gaussian process experts. This approach allows us to automatically characterize the fluctuations by utilizing the kernel function for each Gaussian process expert and predict the F0 contour for an arbitrary musical note sequence. Experimental results show that our model can better predict the F0 contour than a baseline method can. Additionally, we discuss the effective musical contexts and the amount of training data for the prediction.
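
The following deliberately small sketch conveys the mixture-of-Gaussian-process-experts idea with two experts on synthetic data: a smooth RBF expert for the pitch glide near the note onset and a periodic expert for vibrato, combined by a hand-coded gate over time within the note. In the actual model the gating and kernels are learned jointly from a corpus; nothing here reproduces the paper's setup.

```python
# Two GP "experts" with different kernels, blended by a hand-coded soft gate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)[:, None]          # time within a sung note (s)
# Synthetic F0: glide up to the target pitch, then 6 Hz vibrato around it.
f0 = 220.0 + 20.0 * (1 - np.exp(-8 * t[:, 0])) + 3.0 * np.sin(2 * np.pi * 6 * t[:, 0])
f0 += rng.normal(0, 0.3, size=t.shape[0])

gp_glide = GaussianProcessRegressor(RBF(0.2) + WhiteKernel(0.1)).fit(t, f0)
gp_vibrato = GaussianProcessRegressor(
    ExpSineSquared(length_scale=0.5, periodicity=1 / 6) + WhiteKernel(0.1)
).fit(t, f0)

gate = 1.0 / (1.0 + np.exp(-30 * (t[:, 0] - 0.3)))   # ~0 near onset, ~1 when sustained
prediction = (1 - gate) * gp_glide.predict(t) + gate * gp_vibrato.predict(t)
print(np.abs(prediction - f0).mean())                 # mean absolute error (Hz)
```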


International Conference on Acoustics, Speech, and Signal Processing | 2014

Mondrian hidden Markov model for music signal processing

Masahiro Nakano; Yasunori Ohishi; Hirokazu Kameoka; Ryo Mukai; Kunio Kashino

This paper discusses a new extension of hidden Markov models that can capture clusters embedded in transitions between the hidden states. In our model, the state-transition matrices are viewed as representations of relational data reflecting a network structure between the hidden states. We specifically present a nonparametric Bayesian approach to the proposed state-space model whose network structure is represented by a Mondrian Process-based relational model. We show an application of the proposed model to music signal analysis through some experimental results.
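
To make the structural idea concrete, the toy sketch below builds an HMM whose state-transition matrix has a hand-specified block (cluster) structure and samples a sequence from it. The paper's contribution is to infer such a partition nonparametrically with a Mondrian-process relational prior, which this sketch does not attempt.

```python
# HMM with a block-structured transition matrix: dense within-cluster transitions,
# sparse across clusters. The cluster assignment is fixed by hand here.
import numpy as np

rng = np.random.default_rng(2)
n_states = 6
blocks = [0, 0, 0, 1, 1, 1]                      # two clusters of hidden states

A = np.full((n_states, n_states), 0.02)          # sparse cross-cluster transitions
for i in range(n_states):
    for j in range(n_states):
        if blocks[i] == blocks[j]:
            A[i, j] = 0.3                        # dense within-cluster transitions
A /= A.sum(axis=1, keepdims=True)                # row-normalize

means = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])   # Gaussian emission means

state = 0
states, obs = [], []
for _ in range(50):
    state = rng.choice(n_states, p=A[state])
    states.append(int(state))
    obs.append(rng.normal(means[state], 0.5))
print(states[:10])
```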


International Conference on Acoustics, Speech, and Signal Processing | 2011

Automatic audio tag classification via semi-supervised canonical density estimation

Jun Takagi; Yasunori Ohishi; Akisato Kimura; Masashi Sugiyama; Makoto Yamada; Hirokazu Kameoka

We propose a novel semi-supervised method for building a statistical model that represents the relationship between sounds and text labels (“tags”). The proposed method, named semi-supervised canonical density estimation, makes use of unlabeled sound data in two ways: 1) a low-dimensional latent space representing topics of sounds is extracted by a semi-supervised variant of canonical correlation analysis, and 2) topic models are learned by a multi-class extension of semi-supervised kernel density estimation in the topic space. Real-world audio tagging experiments indicate that our proposed method improves accuracy even when only a small number of labeled sounds are available.
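
A rough sketch of the two ingredients named above, with the semi-supervised variants replaced by their plain scikit-learn counterparts (ordinary CCA and ordinary kernel density estimation) and synthetic data: CCA maps audio features and tag indicators into a shared topic space, and a per-tag density in that space scores new sounds.

```python
# CCA into a shared "topic" space, then one kernel density estimate per tag.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
n, d_audio, n_tags = 300, 40, 3
tags = rng.integers(0, n_tags, size=n)                 # one tag per labeled sound (toy)
X = rng.normal(size=(n, d_audio)) + tags[:, None]      # audio features (synthetic)
Y = np.eye(n_tags)[tags]                               # tag indicator vectors

cca = CCA(n_components=2).fit(X, Y)                    # shared latent space
Z = cca.transform(X)                                   # latent representation of sounds

kdes = [KernelDensity(bandwidth=0.5).fit(Z[tags == k]) for k in range(n_tags)]

def predict_tag(x_new):
    """Score a new sound under each tag's density and pick the best tag."""
    z = cca.transform(x_new.reshape(1, -1))
    scores = [kde.score_samples(z)[0] for kde in kdes]
    return int(np.argmax(scores))

print(predict_tag(X[0]), tags[0])
```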


Journal of the Acoustical Society of America | 2006

Automatic discrimination between singing and speaking voices for a flexible music retrieval system

Yasunori Ohishi; Masataka Goto; Katsunobu Itou; Kazuya Takeda

This paper describes a music retrieval system that enables a user to retrieve a song by two different methods: by singing its melody or by saying its title. To let the user switch between those methods seamlessly without changing a voice input mode, a method of automatically discriminating between singing and speaking voices is indispensable. We therefore first investigated measures that characterize differences between singing and speaking voices. From subjective experiments, we found that human listeners discriminated between these two voices with 70% accuracy for 200-ms signals. These results showed that even short-term characteristics such as the spectral envelope represented by MFCCs can be used as a discrimination cue, while the temporal structure is the most important cue when longer signals are given. Based on these results, we then developed an automatic method for discriminating between singing and speaking voices by combining two measures: MFCCs and an F0 (voice pitch) contour. Experimental results...
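
A minimal sketch of a classifier in the spirit of the combined MFCC and F0 cues: per-class Gaussian mixtures over frame-level MFCC-plus-log-F0 features, with a clip labeled by comparing log-likelihoods. The file names are hypothetical, and the use of librosa and scikit-learn is an assumption of this sketch, not the original system.

```python
# Per-class GMMs over MFCC + log-F0 frame features; classify by total log-likelihood.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def clip_features(path, sr=16000):
    """MFCC frames augmented with a log-F0 value per frame."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, frames)
    f0, _, _ = librosa.pyin(y, fmin=80, fmax=800, sr=sr)        # (frames,)
    f0 = np.log(np.where(np.isnan(f0), 1.0, f0))                # unvoiced frames -> 0
    n = min(mfcc.shape[1], f0.shape[0])
    return np.vstack([mfcc[:, :n], f0[None, :n]]).T             # (frames, 14)

def train(paths):
    X = np.vstack([clip_features(p) for p in paths])
    return GaussianMixture(n_components=8, covariance_type="diag").fit(X)

# Hypothetical training/evaluation files:
singing_gmm = train(["sing_01.wav", "sing_02.wav"])
speaking_gmm = train(["speak_01.wav", "speak_02.wav"])

test = clip_features("unknown.wav")
label = "singing" if singing_gmm.score(test) > speaking_gmm.score(test) else "speaking"
print(label)
```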


Conference of the International Speech Communication Association | 2010

A statistical model of speech F0 contours

Hirokazu Kameoka; Jonathan Le Roux; Yasunori Ohishi


Conference of the International Speech Communication Association | 2005

Discrimination between Singing and Speaking Voices

Yasunori Ohishi; Masataka Goto; Katsunobu Itou; Kazuya Takeda


International Symposium/Conference on Music Information Retrieval | 2007

A Stochastic Representation of the Dynamics of Sung Melody.

Yasunori Ohishi; Masataka Goto; Katsunobu Itou; Kazuya Takeda

Collaboration


An overview of Yasunori Ohishi's collaborations.

Top Co-Authors

Kunio Kashino
Nippon Telegraph and Telephone

Daichi Mochihashi
Nippon Telegraph and Telephone

Masataka Goto
National Institute of Advanced Industrial Science and Technology

Hidehisa Nagano
Nippon Telegraph and Telephone

Ryo Mukai
Nippon Telegraph and Telephone