Anssi Klapuri
Queen Mary University of London
Publications
Featured research published by Anssi Klapuri.
Archive | 2006
Anssi Klapuri; Manuel Davy
This book serves as an ideal starting point for newcomers and an excellent reference source for people already working in the field. Researchers and graduate students in signal processing, computer science, acoustics and music will primarily benefit from this text. It could be used as a textbook for advanced courses in music signal processing. Since it only requires a basic knowledge of signal processing, it is accessible to undergraduate students.
international conference on acoustics, speech, and signal processing | 1999
Anssi Klapuri
A system was designed that detects the perceptual onsets of sounds in acoustic signals. The system is general with regard to the sounds involved and was found to be robust for different kinds of signals. This was achieved without assuming regularities in the positions of the onsets. In this paper, a method is first proposed for determining the beginnings of sounds that exhibit onset imperfections, i.e., whose amplitude envelope does not rise monotonically. The system itself is then described; it uses band-wise processing and a psychoacoustic model of intensity coding to combine the results from the separate frequency bands. The performance of the system was validated by applying it to onset detection in musical signals ranging from rock to classical and big-band recordings.
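As a rough illustration of band-wise onset detection of this kind, the sketch below computes per-band log-energy envelopes from an STFT, half-wave rectifies their first difference, sums the bands, and peak-picks the result. It is a minimal sketch, not the paper's method: the uniform band split, the detection threshold, and the 50-ms minimum onset spacing are illustrative assumptions.

```python
# Illustrative band-wise onset detection sketch (not the paper's implementation).
import numpy as np
from scipy.signal import stft, find_peaks

def detect_onsets(x, sr, n_fft=2048, hop=512, n_bands=21, threshold=1.0):
    """Return onset times in seconds for a mono signal x sampled at sr."""
    f, t, X = stft(x, sr, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(X) ** 2
    # Group STFT bins into coarse bands (uniform here; the paper uses a
    # psychoacoustically motivated filter bank).
    edges = np.linspace(0, power.shape[0], n_bands + 1, dtype=int)
    band_energy = np.stack([power[a:b].sum(axis=0)
                            for a, b in zip(edges[:-1], edges[1:])])
    # First difference of the log envelope, half-wave rectified, summed over bands.
    log_env = np.log(band_energy + 1e-10)
    detection = np.maximum(np.diff(log_env, axis=1), 0.0).sum(axis=0)
    # Peak-pick with an (assumed) 50-ms minimum spacing between onsets.
    min_dist = max(1, int(0.05 * sr / hop))
    peaks, _ = find_peaks(detection, height=threshold, distance=min_dist)
    return t[peaks + 1]
```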
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Anssi Klapuri; Antti Eronen; Jaakko Astola
A method is described which analyzes the basic pattern of beats in a piece of music, the musical meter. The analysis is performed jointly at three different time scales: at the temporally atomic tatum pulse level, at the tactus pulse level which corresponds to the tempo of a piece, and at the musical measure level. Acoustic signals from arbitrary musical genres are considered. For the initial time-frequency analysis, a new technique is proposed which measures the degree of musical accent as a function of time at four different frequency ranges. This is followed by a bank of comb filter resonators which extracts features for estimating the periods and phases of the three pulses. The features are processed by a probabilistic model which represents primitive musical knowledge and uses the low-level observations to perform joint estimation of the tatum, tactus, and measure pulses. The model takes into account the temporal dependencies between successive estimates and enables both causal and noncausal analysis. The method is validated using a manually annotated database of 474 music signals from various genres. The method works robustly for different types of music and improves over two state-of-the-art reference methods in simulations.
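The following is a minimal sketch of period estimation with a bank of feedback comb-filter resonators, the kind of periodicity analysis mentioned above. It assumes a precomputed musical-accent signal sampled at a known frame rate; the lag range, decay constant, and normalization are illustrative choices rather than the paper's parameters.

```python
# Comb-filter resonator bank for period (tempo) salience -- illustrative sketch.
import numpy as np

def comb_filter_salience(accent, frame_rate, min_bpm=40, max_bpm=240, half_time=3.0):
    """Return (lags_in_frames, salience) for a bank of feedback comb filters."""
    min_lag = int(round(frame_rate * 60.0 / max_bpm))
    max_lag = int(round(frame_rate * 60.0 / min_bpm))
    lags = np.arange(min_lag, max_lag + 1)
    salience = np.zeros(len(lags))
    for i, lag in enumerate(lags):
        # Feedback gain chosen so the resonator output halves in `half_time` seconds.
        alpha = 0.5 ** (lag / (half_time * frame_rate))
        y = np.zeros(len(accent))
        for n in range(len(accent)):  # unoptimized direct-form recursion
            y[n] = (1.0 - alpha) * accent[n] + (alpha * y[n - lag] if n >= lag else 0.0)
        salience[i] = np.mean(y ** 2) / (np.mean(accent ** 2) + 1e-12)
    return lags, salience
```

The strongest period could then be read off as `lags[np.argmax(salience)]` frames, corresponding to a tempo of `60 * frame_rate / lag` BPM; the paper additionally resolves tatum, tactus, and measure jointly with a probabilistic model, which this sketch omits.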
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Antti Eronen; Vesa T. Peltonen; Juha T. Tuomi; Anssi Klapuri; Seppo Fagerlund; Timo Sorsa; Gaëtan Lorho; Jyri Huopaniemi
The aim of this paper is to investigate the feasibility of an audio-based context recognition system. Here, context recognition refers to the automatic classification of the context, or environment, around a device. A system is developed and compared to the accuracy of human listeners on the same task. Particular emphasis is placed on the computational complexity of the methods, since the application is of particular interest for resource-constrained portable devices. Simplistic low-dimensional feature vectors are evaluated against more standard spectral features. Using discriminative training, competitive recognition accuracies are achieved with very low-order hidden Markov models (1-3 Gaussian components). A slight improvement in recognition accuracy is observed when linear data-driven feature transformations are applied to mel-cepstral features. The recognition rate of the system as a function of the test-sequence length appears to converge only after about 30 to 60 s, although some degree of accuracy can be achieved even with test sequences shorter than 1 s. The average reaction time of the human listeners was 14 s, i.e., somewhat shorter than, but of the same order as, that of the system. In distinguishing between 24 everyday contexts, the average recognition accuracy of the system was 58%, against 69% obtained in the listening tests. The accuracies in recognizing six high-level classes were 82% for the system and 88% for the subjects.
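A simplified stand-in for the recognizer described above is sketched below: one small Gaussian mixture model per context class over MFCC features, with classification by highest average log-likelihood. The paper uses discriminatively trained low-order HMMs and additional feature transforms, so this is only an approximation; the file-based interface and parameter values are assumptions.

```python
# Per-class GMMs over MFCCs -- a simplified stand-in for the paper's recognizer.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, coeffs)

def train_context_models(files_by_context, n_components=3):
    """Fit one small GMM per context class (e.g. 'street', 'office', 'car')."""
    models = {}
    for context, paths in files_by_context.items():
        feats = np.vstack([extract_mfcc(p) for p in paths])
        models[context] = GaussianMixture(n_components=n_components).fit(feats)
    return models

def classify(path, models):
    """Pick the context whose model gives the highest average log-likelihood."""
    feats = extract_mfcc(path)
    return max(models, key=lambda c: models[c].score(feats))
```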
IEEE Transactions on Speech and Audio Processing | 2003
Anssi Klapuri
A new method for estimating the fundamental frequencies of concurrent musical sounds is described. The method is based on an iterative approach, where the fundamental frequency of the most prominent sound is estimated, the sound is subtracted from the mixture, and the process is repeated for the residual signal. For the estimation stage, an algorithm is proposed which utilizes the frequency relationships of simultaneous spectral components, without assuming ideal harmonicity. For the subtraction stage, the spectral smoothness principle is proposed as an efficient new mechanism for estimating the spectral envelopes of detected sounds. With these techniques, multiple fundamental frequency estimation can be performed quite accurately in a single time frame, without the use of long-term temporal features. The experimental data comprised recorded samples of 30 musical instruments from four different sources. Multiple fundamental frequency estimation was performed for random sound source and pitch combinations. Error rates for mixtures ranging from one to six simultaneous sounds were 1.8%, 3.9%, 6.3%, 9.9%, 14%, and 18%, respectively. In musical interval and chord identification tasks, the algorithm outperformed the average of ten trained musicians. The method works robustly in noise and is able to handle sounds that exhibit inharmonicities. The inharmonicity factor and spectral envelope of each sound are estimated along with the fundamental frequency.
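The sketch below illustrates the iterative estimate-and-cancel idea on a single magnitude-spectrum frame: score F0 candidates by summing spectral magnitude at harmonic positions, pick the strongest, smooth its harmonic amplitudes across partials (the spectral-smoothness idea), subtract, and repeat on the residual. The candidate grid, number of harmonics, and smoothing window are illustrative assumptions, and the scoring is far cruder than the paper's algorithm.

```python
# Iterative multiple-F0 estimation sketch on one magnitude-spectrum frame
# (illustrative only; not the paper's exact algorithm).
import numpy as np

def estimate_f0s(mag, sr, n_fft, n_sounds=3, f0_range=(60.0, 2100.0), n_harm=10):
    residual = mag.astype(float).copy()
    bin_width = sr / n_fft
    candidates = np.arange(f0_range[0], f0_range[1], 1.0)
    f0s = []
    for _ in range(n_sounds):
        # 1. Harmonic summation: score each F0 candidate on the residual spectrum.
        scores = np.zeros(len(candidates))
        for i, f0 in enumerate(candidates):
            bins = np.round(np.arange(1, n_harm + 1) * f0 / bin_width).astype(int)
            bins = bins[bins < len(residual)]
            scores[i] = residual[bins].sum()
        f0 = candidates[np.argmax(scores)]
        f0s.append(float(f0))
        # 2. Cancellation: smooth the detected harmonic amplitudes across partials
        #    and subtract no more than that smoothed envelope from the residual.
        bins = np.round(np.arange(1, n_harm + 1) * f0 / bin_width).astype(int)
        bins = bins[bins < len(residual)]
        amps = residual[bins]
        smooth = np.convolve(amps, np.ones(3) / 3.0, mode='same')
        residual[bins] = np.maximum(amps - np.minimum(amps, smooth), 0.0)
    return f0s
```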
international conference on acoustics, speech, and signal processing | 2002
Vesa T. Peltonen; Juha T. Tuomi; Anssi Klapuri; Jyri Huopaniemi; Timo Sorsa
In this paper, we address the problem of computational auditory scene recognition and describe methods to classify auditory scenes into predefined classes. By auditory scene recognition we mean recognition of an environment using audio information only. The auditory scenes comprised tens of everyday outdoor and indoor environments, such as streets, restaurants, offices, family homes, and cars. Two completely different but almost equally effective classification systems were used: band-energy ratio features with a 1-NN classifier, and mel-frequency cepstral coefficients with Gaussian mixture models. The best recognition rate, obtained for 17 of the 26 different scenes with an analysis duration of 30 seconds, was 68.4%. For comparison, the recognition accuracy of humans was 70% for 25 different scenes, and their average response time was around 20 seconds. The efficiency of different acoustic features and the effect of test-sequence length were also studied.
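For illustration, the band-energy-ratio variant could be approximated as below: average per-band energies over a clip, normalized by the total energy, fed to a 1-NN classifier. The band layout and parameters are assumptions; the paper's feature definitions differ in detail.

```python
# Band-energy-ratio features with a 1-NN classifier -- illustrative sketch.
import numpy as np
from scipy.signal import stft
from sklearn.neighbors import KNeighborsClassifier

def band_energy_ratios(x, sr, n_bands=8, n_fft=1024):
    """Average per-band energies over the clip, normalized by total energy."""
    _, _, X = stft(x, sr, nperseg=n_fft)
    power = (np.abs(X) ** 2).mean(axis=1)
    edges = np.linspace(0, len(power), n_bands + 1, dtype=int)
    band = np.array([power[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
    return band / (band.sum() + 1e-12)

def train_scene_classifier(clips, labels, sr=16000):
    """Fit a 1-NN classifier on (clip, label) training data."""
    feats = [band_energy_ratios(c, sr) for c in clips]
    return KNeighborsClassifier(n_neighbors=1).fit(feats, labels)

# Usage: scene = train_scene_classifier(clips, labels).predict(
#            [band_energy_ratios(test_clip, 16000)])[0]
```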
international conference on acoustics, speech, and signal processing | 2000
Antti Eronen; Anssi Klapuri
In this paper, a system for pitch-independent musical instrument recognition is presented. A wide set of features covering both spectral and temporal properties of sounds was investigated, and algorithms for their extraction were designed. The usefulness of the features was validated using test data consisting of 1498 samples covering the full pitch ranges of 30 orchestral instruments from the string, brass, and woodwind families, played with different techniques. The correct instrument family was recognized with 94% accuracy, and individual instruments in 80% of cases. These results are compared to those reported in other work. The utilization of a hierarchical classification framework is also considered.
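Two examples of the kinds of spectral and temporal features mentioned above, sketched for illustration only (the paper's feature set is considerably larger and more refined):

```python
# One spectral and one temporal feature, sketched for illustration.
import numpy as np

def spectral_centroid(mag, sr, n_fft):
    """Center of mass of a magnitude-spectrum frame (a spectral feature)."""
    freqs = np.arange(len(mag)) * sr / n_fft
    return float((freqs * mag).sum() / (mag.sum() + 1e-12))

def attack_time(x, sr, lo=0.1, hi=0.9):
    """Time for the envelope to rise from 10% to 90% of its peak (a temporal
    feature). The envelope is approximated by the rectified waveform; a
    smoothed envelope would be used in practice."""
    env = np.abs(x)
    peak = env.max()
    t_lo = np.argmax(env >= lo * peak)
    t_hi = np.argmax(env >= hi * peak)
    return (t_hi - t_lo) / sr
```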
IEEE Journal of Selected Topics in Signal Processing | 2011
Meinard Müller; Daniel P. W. Ellis; Anssi Klapuri; Gaël Richard
Music signal processing may appear to be the junior relation of the large and mature field of speech signal processing, not least because many techniques and representations originally developed for speech have been applied to music, often with good results. However, music signals possess specific acoustic and structural characteristics that distinguish them from spoken language or other nonmusical signals. This paper provides an overview of some signal analysis techniques that specifically address musical dimensions such as melody, harmony, rhythm, and timbre. We will examine how particular characteristics of music signals impact and determine these techniques, and we highlight a number of novel music analysis and retrieval tasks that such processing makes possible. Our goal is to demonstrate that, to be successful, music audio signal processing techniques must be informed by a deep and thorough insight into the nature of music itself.
Computer Music Journal | 2008
Matti Ryynänen; Anssi Klapuri
This article proposes a method for the automatic transcription of the melody, bass line, and chords in polyphonic pop music. The method uses a frame-wise pitch-salience estimator as a feature extraction front-end. For the melody and bass-line transcription, this is followed by acoustic modeling of note events and musicological modeling of note transitions. The acoustic models include a model for the target notes (i.e., melody or bass notes) and a background model. The musicological model involves key estimation and note bigrams that determine probabilities for transitions between target notes. A transcription of the melody or the bass line is obtained using Viterbi search through the target and background note models. The performance of the melody and bass-line transcription is evaluated using approximately 8.5 hours of realistic polyphonic music. The chord transcription maps the pitch-salience estimates to a pitch-class representation and uses trained chord models and chord-transition probabilities to produce a transcription consisting of major and minor triads. For chords, the evaluation material consists of the first eight Beatles albums. The method is computationally efficient and allows causal implementation, so it can process streaming audio. Transcription of music refers to the analysis of an acoustic music signal for producing a parametric representation of the signal. The representation may be a music score with a meticulous arrangement for each instrument or, for example, an approximate description of the melody and chords in the piece. The latter type of transcription is commonly used in commercial songbooks of pop music and is usually sufficient for musicians or music hobbyists to play the piece. More detailed transcriptions, on the other hand, are often employed in classical music to preserve the exact arrangement of the composer. Conventionally, these transcription tasks have been carried out by trained musicians who listen to a piece of music and write down the notes or chords by hand, which is time-consuming and requires musical training. A machine transcriber enables several applications. First, it provides an easy way of obtaining a description of a music recording, allowing musicians to play it. Second, the produced transcriptions may be used, for example, in music analysis, music information retrieval (MIR) from large music databases, content-based audio processing, and interactive music systems.
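To illustrate the kind of search the note and transition models feed into, here is a bare-bones Viterbi decode over note states. The interface is assumed for illustration: `salience` holds frame-wise log observation scores per note and `trans` holds bigram log transition probabilities; the article's actual note-event HMMs and background model are considerably richer than this.

```python
# Bare-bones Viterbi decode over note states -- illustrative sketch.
import numpy as np

def viterbi_notes(salience, trans):
    """salience: (frames, notes) log observation scores;
    trans: (notes, notes) log transition probabilities."""
    n_frames, n_notes = salience.shape
    delta = np.full((n_frames, n_notes), -np.inf)
    back = np.zeros((n_frames, n_notes), dtype=int)
    delta[0] = salience[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + trans        # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + salience[t]
    # Trace back the best note sequence.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n_frames - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```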
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Anssi Klapuri
A method is described for estimating the fundamental frequencies of several concurrent sounds in polyphonic music and multiple-speaker speech signals. The method consists of a computational model of the human auditory periphery, followed by a periodicity analysis mechanism where fundamental frequencies are iteratively detected and canceled from the mixture signal. The auditory model needs to be computed only once, and a computationally efficient strategy is proposed for implementing it. Simulation experiments were made using mixtures of musical sounds and mixed speech utterances. The proposed method outperformed two reference methods in the evaluations and showed a high level of robustness in processing signals where important parts of the audible spectrum were deleted to simulate bandlimited interference. Different system configurations were studied to identify the conditions where pitch analysis using an auditory model is advantageous over conventional time or frequency domain approaches.
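A very rough stand-in for the front-end described above is sketched below: band-pass channels, half-wave rectification, per-channel autocorrelation, and a summary across channels whose peaks indicate pitch periods. The Butterworth filter bank is only a crude substitute for the auditory periphery model, and the iterative detection-and-cancellation stage is omitted; channel count and frequency range are assumptions.

```python
# Summary autocorrelation over a crude auditory-style front-end -- illustrative sketch.
import numpy as np
from scipy.signal import butter, lfilter

def summary_autocorrelation(x, sr, n_channels=20, f_lo=80.0, f_hi=6000.0):
    """Requires sr > 2 * f_hi. Returns the summary autocorrelation function;
    peaks at lag ~ sr / F0 indicate candidate pitch periods."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    sacf = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype='band')
        ch = np.maximum(lfilter(b, a, x), 0.0)        # half-wave rectification
        acf = np.correlate(ch, ch, mode='full')[len(ch) - 1:]
        sacf += acf
    return sacf
```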