Hiromasa Fujihara
National Institute of Advanced Industrial Science and Technology
Publications
Featured research published by Hiromasa Fujihara.
International Symposium on Multimedia | 2006
Hiromasa Fujihara; Masataka Goto; Jun Ogata; Kazunori Komatani; Tetsuya Ogata; Hiroshi G. Okuno
This paper describes a system that can automatically synchronize polyphonic musical audio signals with their corresponding lyrics. Although methods exist that can synchronize monophonic speech signals with corresponding text transcriptions by using Viterbi alignment techniques, they cannot be applied to vocals in CD recordings because accompaniment sounds often overlap the vocals. To align lyrics with such vocals, we therefore developed three methods: a method for segregating vocals from polyphonic sound mixtures, a method for detecting vocal sections, and a method for adapting a speech-recognizer phone model to segregated vocal signals. Experimental results for 10 Japanese popular-music songs showed that our system can synchronize music and lyrics with satisfactory accuracy for 8 of the songs.
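As a rough illustration of the vocal-segregation idea described above, the sketch below (Python, not the authors' implementation) extracts harmonic components at multiples of a given melody F0 from a single analysis frame and resynthesizes them with a crude sinusoidal model. The function name, the fixed F0, the frame length, and the toy mixture are all illustrative assumptions; a real system would obtain the F0 track from a predominant-melody estimator.

```python
# Minimal sketch (not the authors' implementation) of harmonic-structure
# extraction and sinusoidal resynthesis for accompaniment reduction.
# Assumes the melody F0 is already known for the frame (here fixed at 220 Hz).
import numpy as np

def resynthesize_vocal(frame, sr, f0, n_harmonics=20, n_fft=2048):
    """Pick amplitudes and phases near k*f0 in one frame's spectrum and
    resynthesize them as a sum of sinusoids (a crude sinusoidal model)."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    t = np.arange(len(frame)) / sr
    out = np.zeros(len(frame))
    for k in range(1, n_harmonics + 1):
        target = k * f0
        if target >= sr / 2:
            break
        bin_idx = int(np.argmin(np.abs(freqs - target)))  # nearest-bin peak
        amp = np.abs(spec[bin_idx]) / (n_fft / 2)
        phase = np.angle(spec[bin_idx])
        out += amp * np.cos(2 * np.pi * target * t + phase)
    return out

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    # toy mixture: a 220 Hz "vocal" harmonic tone plus broadband "accompaniment"
    mix = sum(0.3 / k * np.sin(2 * np.pi * 220 * k * t) for k in range(1, 6))
    mix += 0.2 * np.random.randn(len(t))
    frame = mix[:2048]
    vocal = resynthesize_vocal(frame, sr, f0=220.0)
    print("residual energy ratio:", np.mean((frame - vocal) ** 2) / np.mean(frame ** 2))
```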
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Hiromasa Fujihara; Masataka Goto; Tetsuro Kitahara; Hiroshi G. Okuno
This paper describes a method of modeling the characteristics of a singing voice from polyphonic musical audio signals that include the sounds of various musical instruments. Because singing voices play an important role in musical pieces with vocals, such a representation is useful for music information retrieval (MIR) systems. The main problem in modeling the characteristics of a singing voice is the negative influence of accompaniment sounds. To solve this problem, we developed two methods, accompaniment sound reduction and reliable frame selection. The former makes it possible to calculate feature vectors that represent the spectral envelope of a singing voice after reducing accompaniment sounds. It first extracts the harmonic components of the predominant melody from sound mixtures and then resynthesizes the melody by using a sinusoidal model driven by these components. The latter method then estimates the reliability of each frame of the obtained melody (i.e., the influence of accompaniment sounds) by using two Gaussian mixture models (GMMs) for vocal and nonvocal frames to select the reliable vocal portions of musical pieces. Finally, each song is represented by its own GMM trained on the reliable frames. This new representation of the singing voice is demonstrated to improve the performance of an automatic singer identification system and to enable an MIR system based on vocal timbre similarity.
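A minimal sketch of the reliable-frame-selection idea is given below: frames are scored with vocal and nonvocal GMMs, frames whose log-likelihood ratio exceeds a threshold are kept, and a per-song GMM is fitted to the kept frames. The feature vectors, GMM sizes, and threshold are placeholders, not the features or settings used in the paper.

```python
# Hedged sketch of reliable frame selection with vocal/nonvocal GMMs.
# Feature vectors here are random placeholders standing in for
# spectral-envelope features of resynthesized melody frames.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dim = 13

# Placeholder training data (assumption: in practice these come from
# labelled vocal and nonvocal frames of training songs).
vocal_train = rng.normal(1.0, 1.0, size=(500, dim))
nonvocal_train = rng.normal(-1.0, 1.0, size=(500, dim))

gmm_vocal = GaussianMixture(n_components=4, random_state=0).fit(vocal_train)
gmm_nonvocal = GaussianMixture(n_components=4, random_state=0).fit(nonvocal_train)

def reliable_frames(features, threshold=0.0):
    """Keep frames where log p(x|vocal) - log p(x|nonvocal) > threshold."""
    llr = gmm_vocal.score_samples(features) - gmm_nonvocal.score_samples(features)
    return features[llr > threshold]

song_features = rng.normal(0.5, 1.2, size=(300, dim))   # one song's frames
kept = reliable_frames(song_features)
song_model = GaussianMixture(n_components=8, random_state=0).fit(kept)
print(f"kept {len(kept)} of {len(song_features)} frames")
```

Song-level GMMs built this way could then be compared (e.g., by likelihood or a symmetrized distance) for singer identification or vocal-timbre similarity retrieval.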
International Conference on Acoustics, Speech, and Signal Processing | 2006
Hiromasa Fujihara; Tetsuro Kitahara; Masataka Goto; Kazunori Komatani; Tetsuya Ogata; Hiroshi G. Okuno
This paper describes a method for estimating the F0s of the vocal part in polyphonic audio signals. Because the melody in many musical pieces is sung by a singer, estimating the F0s of the vocal part is useful for many applications. Building on an existing multiple-F0 estimation method, we evaluate the vocal probability of the harmonic structure of each F0 candidate. To calculate this probability, we extract and resynthesize the harmonic structure by using a sinusoidal model and then extract feature vectors. We then evaluate the vocal probability by using vocal and non-vocal Gaussian mixture models (GMMs). Finally, we track F0 trajectories from these probabilities by using a Viterbi search. Experimental results show that our method improves estimation accuracy from 78.1% to 84.3%, a 28.3% reduction in misestimation.
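The sketch below illustrates the final tracking step under simplifying assumptions: each frame carries a few candidate F0s with vocal probabilities, and a Viterbi search picks one candidate per frame while penalizing large F0 jumps. The transition cost and all input values are illustrative, not those of the paper.

```python
# Minimal sketch of Viterbi tracking over F0 candidates with vocal
# probabilities as the observation scores. All numbers are placeholders.
import numpy as np

def track_f0(candidates, vocal_prob, jump_penalty=0.02):
    """candidates: (T, K) candidate F0s in Hz; vocal_prob: (T, K) in (0, 1].
    Returns the index of the chosen candidate per frame."""
    T, K = candidates.shape
    log_obs = np.log(vocal_prob)
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = log_obs[0]
    for t in range(1, T):
        for k in range(K):
            # transition cost grows with the (semitone-scaled) jump between F0s
            jump = np.abs(np.log2(candidates[t, k] / candidates[t - 1]))
            trans = score[t - 1] - jump_penalty * jump * 12
            back[t, k] = int(np.argmax(trans))
            score[t, k] = trans[back[t, k]] + log_obs[t, k]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(1)
cands = rng.uniform(100, 400, size=(50, 5))   # 50 frames, 5 F0 candidates each
probs = rng.uniform(0.05, 1.0, size=(50, 5))  # placeholder vocal probabilities
print(track_f0(cands, probs)[:10])
```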
IEEE Journal of Selected Topics in Signal Processing | 2011
Hiromasa Fujihara; Masataka Goto; Jun Ogata; Hiroshi G. Okuno
This paper describes a system that can automatically synchronize polyphonic musical audio signals with their corresponding lyrics. Although methods for synchronizing monophonic speech signals and corresponding text transcriptions by using Viterbi alignment techniques have been proposed, these methods cannot be applied to vocals in CD recordings because vocals are often overlapped by accompaniment sounds. In addition to a conventional method for reducing the influence of the accompaniment sounds, we therefore developed four methods to overcome this problem: a method for detecting vocal sections, a method for constructing robust phoneme networks, a method for detecting fricative sounds, and a method for adapting a speech-recognizer phone model to segregated vocal signals. We then report experimental results for each of these methods and also describe our music playback interface that utilizes our system for synchronizing music and lyrics.
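To make the alignment idea concrete, the following toy sketch performs a left-to-right Viterbi forced alignment between a phoneme sequence and per-frame phoneme scores, allowing only "stay" or "advance" transitions. Real systems use trained phone HMMs over acoustic features; the scores and the phoneme list here are random placeholders.

```python
# Toy forced-alignment sketch: each frame is assigned to one phoneme of the
# lyrics, and phonemes must be visited strictly left to right.
import numpy as np

def forced_align(frame_scores):
    """frame_scores: (T, P) log-likelihood of each frame under each phoneme.
    Returns the phoneme index assigned to each frame."""
    T, P = frame_scores.shape
    score = np.full((T, P), -np.inf)
    back = np.zeros((T, P), dtype=int)
    score[0, 0] = frame_scores[0, 0]          # alignment starts on the first phoneme
    for t in range(1, T):
        for p in range(P):
            stay = score[t - 1, p]
            advance = score[t - 1, p - 1] if p > 0 else -np.inf
            back[t, p] = p if stay >= advance else p - 1
            score[t, p] = max(stay, advance) + frame_scores[t, p]
    path = np.zeros(T, dtype=int)
    path[-1] = P - 1                          # alignment must end on the last phoneme
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(2)
phonemes = ["k", "o", "n", "i", "ch", "i", "w", "a"]   # illustrative labels
scores = rng.normal(size=(40, len(phonemes)))          # placeholder frame scores
alignment = forced_align(scores)
print([phonemes[p] for p in alignment])
```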
International Conference on Acoustics, Speech, and Signal Processing | 2008
Hiromasa Fujihara; Masataka Goto
Three techniques are described that improve a previously developed system for automatically synchronizing lyrics with musical audio signals. Although this system achieves state-of-the-art accuracy by extracting vocal vowels from polyphonic sound mixtures and using forced alignment between those vowels and a phoneme network of the lyrics, there was still room for improvement. The first technique exploits fricative consonant sounds, which were not utilized in the previous system: it detects regions in which fricatives do not exist and prohibits the alignment of fricative phonemes to those regions. The second technique inserts a filler model between phrases of the phoneme network. This model improves the accuracy of the forced alignment by ignoring inter-phrase vowel utterances not included in the lyrics. The third technique introduces novel feature vectors for vocal activity detection that enable a distance calculation between two sets of harmonic structures without estimating their spectral envelopes. Experimental results showed that all three techniques contribute to improved synchronization.
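The sketch below suggests one way the first technique could be realized: frames with little high-frequency energy are treated as regions where fricatives cannot occur, and the corresponding fricative phoneme scores are set to minus infinity before alignment. The high-frequency energy detector, the threshold, and the phoneme set are assumptions for illustration only.

```python
# Hedged sketch of a fricative "non-existence region" constraint: block
# fricative phonemes in frames whose high-frequency energy ratio is low.
import numpy as np

FRICATIVES = {"s", "sh", "f", "h", "z"}   # illustrative phoneme set

def apply_fricative_mask(frame_scores, phonemes, spectra, sr, cutoff=4000.0, ratio=0.2):
    """Set fricative phoneme scores to -inf in frames whose energy above
    `cutoff` Hz is less than `ratio` of the total frame energy."""
    freqs = np.fft.rfftfreq((spectra.shape[1] - 1) * 2, 1.0 / sr)
    hi = freqs >= cutoff
    energy = np.abs(spectra) ** 2
    hf_ratio = energy[:, hi].sum(axis=1) / (energy.sum(axis=1) + 1e-12)
    masked = frame_scores.copy()
    no_fricative = hf_ratio < ratio
    for col, p in enumerate(phonemes):
        if p in FRICATIVES:
            masked[no_fricative, col] = -np.inf
    return masked

rng = np.random.default_rng(3)
phonemes = ["s", "a", "k", "u", "r", "a"]
# placeholder magnitude spectra with energy concentrated at low frequencies
spectra = np.abs(rng.normal(size=(30, 513))) / (1 + np.arange(513))
scores = rng.normal(size=(30, len(phonemes)))
masked = apply_fricative_mask(scores, phonemes, spectra, sr=16000)
print("frames blocked for /s/:", int(np.isinf(masked[:, 0]).sum()))
```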
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Matthias Mauch; Hiromasa Fujihara; Masataka Goto
Aligning lyrics to audio has a wide range of applications such as the automatic generation of karaoke scores, song-browsing by lyrics, and the generation of audio thumbnails. Existing methods are restricted to using only lyrics and match them to phoneme features extracted from the audio (usually mel-frequency cepstral coefficients). Our novel idea is to integrate the textual chord information provided in the paired chords-lyrics format known from song books and Internet sites into the inference procedure. We propose two novel methods that implement this idea: First, assuming that all chords of a song are known, we extend a hidden Markov model (HMM) framework by including chord changes in the Markov chain and an additional audio feature (chroma) in the emission vector; second, for the more realistic case in which some chord information is missing, we present a method that recovers the missing chord information by exploiting repetition in the song. We conducted experiments with five changing parameters and show that with accuracies of 87.5% and 76.7%, respectively, both methods perform better than the baseline with statistical significance. We introduce the new accompaniment interface Song Prompter, which uses the automatically aligned lyrics to guide musicians through a song. It demonstrates that the automatic alignment is accurate enough to be used in a musical performance.
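As a hedged sketch of adding chroma to the emission model, the code below combines a phoneme log-likelihood with a chroma-template match for the chord expected at a state. The binary triad template, the cosine similarity, and the weighting are illustrative choices, not the paper's model.

```python
# Sketch of a combined emission score: phoneme log-likelihood plus a
# chroma-template term for the state's expected chord. All values illustrative.
import numpy as np

def chord_template(root, chroma_dim=12):
    """Binary major-triad template rooted at `root` (0 = C), normalized."""
    t = np.zeros(chroma_dim)
    t[[root % 12, (root + 4) % 12, (root + 7) % 12]] = 1.0
    return t / t.sum()

def combined_emission(phoneme_loglik, chroma_frame, chord_root, chroma_weight=0.5):
    """Add a log cosine-similarity chroma term to the phoneme log-likelihood."""
    template = chord_template(chord_root)
    cos = chroma_frame @ template / (
        np.linalg.norm(chroma_frame) * np.linalg.norm(template) + 1e-12)
    return phoneme_loglik + chroma_weight * np.log(cos + 1e-6)

rng = np.random.default_rng(4)
chroma = np.abs(rng.normal(size=12))        # placeholder chroma vector
print(combined_emission(phoneme_loglik=-4.2, chroma_frame=chroma, chord_root=0))
```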
Workshop on Applications of Signal Processing to Audio and Acoustics | 2009
Hiromasa Fujihara; Masataka Goto; Hiroshi G. Okuno
A novel method is described that can be used to recognize the phonemes of a singing voice (vocal) in polyphonic music. Although we focus on voiced phonemes in this paper, the method is designed to concurrently recognize other elements of a singing voice, such as the fundamental frequency and the singer. It can thus be considered a new framework for recognizing a singing voice in polyphonic music. Our method stochastically models a mixture of a singing voice and other instrumental sounds without segregating the singing voice. It can also obtain a reliable spectral envelope by estimating it from many harmonic structures with various fundamental frequencies (F0s). The results of phoneme recognition experiments with 10 popular-music songs by 6 singers showed that our method improves recognition accuracy by 8.7 points and achieves a 20.0% decrease in error rate.
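The following sketch illustrates, under strong simplifications, how an envelope can be estimated from many harmonic structures with different F0s: harmonic (frequency, log-amplitude) points pooled across frames are averaged in frequency bins to form a smooth envelope. The binning scheme and the synthetic data are assumptions and do not reflect the paper's stochastic model.

```python
# Rough sketch: pool harmonic points from frames with varied F0s and
# average them in frequency bins to estimate a spectral envelope.
import numpy as np

def envelope_from_harmonics(freqs_hz, log_amps, n_bins=40, fmax=8000.0):
    """Average pooled harmonic log-amplitudes in frequency bins."""
    edges = np.linspace(0, fmax, n_bins + 1)
    env = np.full(n_bins, np.nan)
    idx = np.digitize(freqs_hz, edges) - 1
    for b in range(n_bins):
        sel = idx == b
        if sel.any():
            env[b] = log_amps[sel].mean()
    return edges[:-1] + np.diff(edges) / 2, env

rng = np.random.default_rng(5)
true_env = lambda f: -f / 2000.0                      # toy log-amplitude envelope
points_f, points_a = [], []
for f0 in rng.uniform(150, 350, size=100):            # frames with varied F0s
    harmonics = f0 * np.arange(1, 20)
    harmonics = harmonics[harmonics < 8000]
    points_f.append(harmonics)
    points_a.append(true_env(harmonics) + 0.1 * rng.normal(size=len(harmonics)))
centers, env = envelope_from_harmonics(np.concatenate(points_f), np.concatenate(points_a))
print(np.round(env[:5], 2))
```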
International Conference on Acoustics, Speech, and Signal Processing | 2010
Masataka Goto; Takeshi Saitou; Tomoyasu Nakano; Hiromasa Fujihara
In this paper, we propose a novel area of research referred to as singing information processing. To shape the concept of this area, we first introduce singing understanding systems for synchronizing the vocal melody with the corresponding lyrics, identifying the singer, evaluating singing skills, creating hyperlinks between phrases in the lyrics of songs, and detecting breath sounds. We then introduce music information retrieval systems based on the similarity of vocal melody timbre and vocal percussion, as well as singing synthesis systems. Common signal processing techniques for modeling singing voices used in these systems, such as techniques for extracting the vocal melody from polyphonic music recordings and for modeling the lyrics with phoneme HMMs for singing voices, are also discussed.
Multimodal Music Processing | 2012
Hiromasa Fujihara; Masataka Goto
Automatic lyrics-to-audio alignment techniques have been drawing attention in recent years, and various studies have been conducted in this field. The objective of lyrics-to-audio alignment is to estimate the temporal relationship between lyrics and musical audio signals; such techniques can be applied to various applications, such as karaoke-style lyrics display. In this contribution, we provide an overview of recent developments in this research topic, with a particular focus on the categorization of methods and on their applications.
International Conference on Acoustics, Speech, and Signal Processing | 2011
Hiromasa Fujihara; Masataka Goto
The scarcity of available multi-track recordings constitutes a severe constraint on the training of probabilistic models for voice extraction from polyphonic music. We propose a novel training method for estimating the spectral envelope of a singing voice that makes it possible to train the models from polyphonic music without segregating the singing voice. We implement this method as an extension to the existing W-PST method, which concurrently estimates the fundamental frequency (F0) and phoneme of the singing voice from polyphonic music. The novel training method is based on random sampling from probabilistic distributions. We conducted experiments on concurrent F0 and phoneme estimation and confirmed the effectiveness of our method.