Slim Essid
Université Paris-Saclay
Publications
Featured research published by Slim Essid.
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Slim Essid; Gaël Richard; Bertrand David
We propose a new approach to instrument recognition in the context of real music orchestrations ranging from solos to quartets. The strength of our approach is that it does not require prior musical source separation. Thanks to a hierarchical clustering algorithm exploiting robust probabilistic distances, we obtain a taxonomy of musical ensembles which is used to efficiently classify possible combinations of instruments played simultaneously. Moreover, a wide set of acoustic features is studied, including some new proposals. In particular, signal-to-mask ratios are found to be useful features for audio classification. This study focuses on a single music genre (i.e., jazz) but combines a variety of instruments, among which are percussion and singing voice. Using a varied database of sound excerpts from commercial recordings, we show that the segmentation of music with respect to the instruments played can be achieved with an average accuracy of 53%.
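To illustrate the clustering step, here is a minimal sketch (not the authors' code): each ensemble class is summarized by a Gaussian fitted to its feature vectors, classes are compared with a symmetrised KL divergence standing in for the paper's robust probabilistic distances, and a taxonomy is obtained by hierarchical clustering. All data and parameters are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL divergence between two multivariate Gaussians."""
    k = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def class_distance_matrix(class_features):
    """class_features: list of (n_frames, n_dims) arrays, one per ensemble class."""
    stats = [(f.mean(axis=0), np.cov(f, rowvar=False)) for f in class_features]
    n = len(stats)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            m0, c0 = stats[i]
            m1, c1 = stats[j]
            # symmetrised KL divergence between the two class models
            D[i, j] = D[j, i] = 0.5 * (gaussian_kl(m0, c0, m1, c1)
                                       + gaussian_kl(m1, c1, m0, c0))
    return D

# Toy usage with random "features" for four hypothetical ensemble classes.
rng = np.random.default_rng(0)
feats = [rng.normal(loc=i, size=(200, 5)) for i in range(4)]
Z = linkage(squareform(class_distance_matrix(feats)), method="average")
print(Z)  # the taxonomy: merge order of the ensemble classes
```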
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Cyril Joder; Slim Essid; Gaël Richard
Nowadays, it appears essential to design automatic indexing tools which provide meaningful and efficient means to describe the musical audio content. There is in fact a growing interest in music information retrieval (MIR) applications, amongst which the most popular are related to music similarity retrieval, artist identification, and musical genre or instrument recognition. Current MIR-related classification systems usually do not take into account the mid-term temporal properties of the signal (over several frames) and rely on the assumption that the observations of the features in different frames are statistically independent. The aim of this paper is to demonstrate the usefulness of the information carried by the evolution of these characteristics over time. To that purpose, we propose a number of methods for early and late temporal integration and provide an in-depth experimental study of their usefulness for the task of musical instrument recognition on solo musical phrases. In particular, the impact of the time horizon over which the temporal integration is performed is assessed, both for fixed and variable frame-length analysis. Also, a number of proposed alignment kernels are used for late temporal integration. For all experiments, the results are compared to a state-of-the-art musical instrument recognition system.
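As a minimal sketch of "early" temporal integration, frame-level features can be summarized over mid-term texture windows by their mean and standard deviation before classification. The window length and hop below are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def early_integration(frames, win=40, hop=20):
    """frames: (n_frames, n_dims) array of short-term features.
    Returns (n_windows, 2 * n_dims) mid-term feature vectors."""
    out = []
    for start in range(0, len(frames) - win + 1, hop):
        seg = frames[start:start + win]
        # concatenate per-window statistics of each feature dimension
        out.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
    return np.asarray(out)

# e.g. 500 frames of 13 MFCCs -> 26-dimensional texture-window vectors
mfcc = np.random.randn(500, 13)
print(early_integration(mfcc).shape)  # (24, 26) with win=40, hop=20
```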
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Slim Essid; Gaël Richard; Bertrand David
Musical instrument recognition is an important aspect of music information retrieval. In this paper, statistical pattern recognition techniques are utilized to tackle the problem in the context of solo musical phrases. Ten instrument classes from different instrument families are considered. A large sound database is collected from excerpts of musical phrases acquired from commercial recordings, reflecting different instrument instances, performers, and recording conditions. More than 150 signal processing features are studied, including new descriptors. Two feature selection techniques, inertia ratio maximization with feature space projection and genetic algorithms, are considered in a class-pairwise manner, whereby the most relevant features are selected for each instrument pair. For the classification task, experimental results are provided using Gaussian mixture models (GMMs) and support vector machines (SVMs). It is shown that higher recognition rates can be reached with pairwise-optimized subsets of features in association with SVM classification using a radial basis function kernel.
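The following sketch shows the general shape of such a class-pairwise pipeline (not the paper's exact system): a univariate SelectKBest stands in for the IRMFSP and genetic-algorithm selection, each instrument pair gets its own RBF-kernel SVM, and a majority vote combines the pairwise decisions. Labels are assumed to be small nonnegative integers.

```python
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_pairwise(X, y, k=20):
    """One feature-selected RBF-SVM per pair of instrument classes."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])
        clf = make_pipeline(StandardScaler(),
                            SelectKBest(f_classif, k=k),  # pair-specific features
                            SVC(kernel="rbf"))
        models[(a, b)] = clf.fit(X[mask], y[mask])
    return models

def predict_pairwise(models, X):
    """Majority vote over all pairwise classifiers (integer labels assumed)."""
    votes = np.stack([m.predict(X) for m in models.values()])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```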
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Cyril Joder; Slim Essid; Gaël Richard
In this paper, we introduce the use of conditional random fields (CRFs) for the audio-to-score alignment task. This framework encompasses the statistical models which are used in the literature and allows for more flexible dependency structures. In particular, it allows observation functions to be computed from several analysis frames. Three different CRF models are proposed for our task, for different choices of tradeoff between accuracy and complexity. Three types of features are used, characterizing the local harmony, note attacks and tempo. We also propose a novel hierarchical approach, which takes advantage of the score structure for an approximate decoding of the statistical model. This strategy reduces the complexity, yielding a better overall efficiency than the classic beam search method used in HMM-based models. Experiments run on a large database of classical piano and popular music exhibit very accurate alignments. Indeed, with the best performing system, more than 95% of the note onsets are detected with a precision finer than 100 ms. We additionally show how the proposed framework can be modified in order to be robust to possible structural differences between the score and the musical performance.
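For intuition about what alignment decoding computes, here is a deliberately simplified sketch. It is NOT the paper's CRF: it is a plain dynamic program over score events with monotonic stay-or-advance transitions, and the observation scores (e.g. chroma similarity between an audio frame and an event's expected harmony) are assumed to be given.

```python
import numpy as np

def align(obs_score):
    """obs_score: (n_frames, n_events) log-scores of frame t matching event j.
    Each frame either stays on the current event or advances to the next.
    Returns the best event index for every frame."""
    T, J = obs_score.shape
    dp = np.full((T, J), -np.inf)
    back = np.zeros((T, J), dtype=int)
    dp[0, 0] = obs_score[0, 0]            # alignment starts at the first event
    for t in range(1, T):
        for j in range(J):
            stay = dp[t - 1, j]
            move = dp[t - 1, j - 1] if j > 0 else -np.inf
            back[t, j] = j if stay >= move else j - 1
            dp[t, j] = max(stay, move) + obs_score[t, j]
    path = np.empty(T, dtype=int)
    path[-1] = int(dp[-1].argmax())
    for t in range(T - 1, 0, -1):          # backtrack the best path
        path[t - 1] = back[t, path[t]]
    return path
```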
IEEE Transactions on Circuits and Systems for Video Technology | 2007
Olivier Gillet; Slim Essid; Gaël Richard
The study of the associations between audio and video content has numerous important applications in the fields of information retrieval and multimedia content authoring. In this work, we focus on music videos, which exhibit a broad range of structural and semantic relationships between the music and the video content. To identify such relationships, a two-level automatic structuring of the music and the video is achieved separately. Note onsets are detected from the music signal, along with section changes; the latter is achieved by a novel algorithm which makes use of feature selection and statistical novelty detection approaches based on kernel methods. The video stream is independently segmented to detect changes in motion activity, as well as shot boundaries. Based on this two-level segmentation of both streams, four audio-visual correlation measures are computed. The usefulness of these correlation measures is illustrated by a query-by-video experiment on a database of 100 music videos, which also reveals interesting genre dependencies.
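One plausible form such a correlation measure could take (an illustration, not the paper's definition): turn the event times from both streams into smoothed activation curves and take their normalized correlation, so that co-occurring audio onsets and video change-points score highly. The sampling grid and kernel width below are assumptions.

```python
import numpy as np

def activation(times, duration, fps=25, sigma=0.1):
    """Smoothed event-activation curve on a regular time grid."""
    t = np.arange(0, duration, 1.0 / fps)
    return sum(np.exp(-0.5 * ((t - e) / sigma) ** 2) for e in times)

def av_correlation(audio_onsets, video_changes, duration):
    a = activation(audio_onsets, duration)
    v = activation(video_changes, duration)
    a = (a - a.mean()) / (a.std() + 1e-12)   # zero-mean, unit-variance
    v = (v - v.mean()) / (v.std() + 1e-12)
    return float(np.mean(a * v))             # Pearson-style score in [-1, 1]

print(av_correlation([1.0, 2.0, 3.1], [1.02, 2.05, 2.9], duration=4.0))
```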
IEEE Transactions on Multimedia | 2013
Slim Essid; Cédric Févotte
This paper introduces a new paradigm for unsupervised audiovisual document structuring. In this paradigm, a novel Nonnegative Matrix Factorization (NMF) algorithm is applied to histograms of counts (relating to a bag-of-features representation of the content) to jointly discover latent structuring patterns and their activations in time. Our NMF variant employs the Kullback-Leibler divergence as a cost function and imposes a temporal smoothness constraint on the activations. It is solved by a majorization-minimization technique. The proposed approach is meant to be generic and is particularly well suited to applications where the structuring patterns may overlap in time. As such, it is evaluated on two person-oriented video structuring tasks (one using the visual modality and the other the audio), using a challenging database of political debate videos. Our results outperform reference results obtained by a method using Hidden Markov Models. Further, we show the potential that our general approach has for audio speaker diarization.
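For reference, here is a minimal sketch of plain KL-divergence NMF with multiplicative (MM-derived) updates, the baseline the paper builds on; the paper's variant additionally penalizes non-smooth activation rows, which is omitted here for brevity.

```python
import numpy as np

def nmf_kl(V, k, n_iter=200, eps=1e-12, seed=0):
    """KL-NMF: V (F, N) nonnegative -> W (F, k), H (k, N)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, N)) + eps
    for _ in range(n_iter):
        # standard multiplicative updates for the KL cost
        W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1) + eps)
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    return W, H

# e.g. V holds one bag-of-features count histogram per frame, as columns
V = np.random.rand(100, 300)
W, H = nmf_kl(V, k=5)
print(W.shape, H.shape)  # (100, 5) (5, 300): patterns and their activations
```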
International Conference on Acoustics, Speech, and Signal Processing | 2012
Slim Essid; Dimitrios S. Alexiadis; Robin Tournemenne; Marc Gowing; Philip Kelly; David S. Monaghan; Petros Daras; Angélique Drémeau; Noel E. O'Connor
The ever-increasing availability of high-speed Internet access has led to a leap in technologies that support real-time, realistic interaction between humans in online virtual environments. In the context of this work, we wish to realise the vision of an online dance studio where a dance class is provided by an expert dance teacher and delivered to online students via the web. In this paper we study some of the technical issues that need to be addressed in this challenging scenario. In particular, we describe an automatic dance analysis tool that would be used to evaluate a student's performance and provide him/her with meaningful feedback to aid improvement.
International Conference on Acoustics, Speech, and Signal Processing | 2010
Cyril Joder; Slim Essid; Gaël Richard
In this paper we review the acoustic features used for music-to-score alignment and study their influence on performance in a challenging alignment task, where the audio data is polyphonic and may contain percussion. Furthermore, as we aim at using “real world” scores, we follow an approach which does not exploit the rhythm information (considered unreliable) and test its robustness to score errors. We use a unified framework to handle different state-of-the-art features, and propose a simple way to exploit either a model of the feature values or an audio synthesis of a musical score in an audio-to-score alignment system. We confirm that chroma vectors drawn from representations using a logarithmic frequency scale are the most efficient features, and lead to good precision even with a simple alignment strategy. Robustness tests also show that the relative performance of the features does not depend on possible musical score degradations.
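A minimal sketch of this kind of pipeline, under stated assumptions: the score is assumed to have been rendered to audio (the file names are placeholders), constant-Q chroma provides the logarithmic-frequency representation the paper favours, and DTW stands in for the paper's alignment framework as the "simple alignment strategy".

```python
import librosa
import numpy as np

perf, sr = librosa.load("performance.wav")   # placeholder file names
synth, _ = librosa.load("score.wav", sr=sr)  # audio synthesis of the score

# constant-Q (log-frequency) chroma for both signals
C_perf = librosa.feature.chroma_cqt(y=perf, sr=sr)
C_synth = librosa.feature.chroma_cqt(y=synth, sr=sr)

# DTW over cosine distances between chroma frames
D, wp = librosa.sequence.dtw(X=C_synth, Y=C_perf, metric="cosine")
print(wp[::-1][:10])  # (score frame, performance frame) pairs, start first
```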
International Conference on Acoustics, Speech, and Signal Processing | 2016
Victor Bisot; Romain Serizel; Slim Essid; Gaël Richard
In this paper we study the use of unsupervised feature learning for acoustic scene classification (ASC). The acoustic environment recordings are represented by time-frequency images from which we learn features in an unsupervised manner. After a set of preprocessing and pooling steps, the images are decomposed using matrix factorization methods. By decomposing the data on a learned dictionary, we use the projection coefficients as features for classification. An experimental evaluation is done on a large ASC dataset to study popular matrix factorization methods such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF), as well as some of their extensions, including sparse, kernel-based and convolutive variants. The results show that the compared variants lead to significant improvements over state-of-the-art results in ASC.
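A minimal sketch of the feature-learning pipeline with scikit-learn: a dictionary is learned with NMF on pooled, flattened time-frequency images, and the projection coefficients feed a classifier. The shapes, the random stand-in data, and the logistic-regression classifier are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression

# one pooled, nonnegative time-frequency image per recording, flattened
X_train = np.abs(np.random.randn(200, 1024))
y_train = np.random.randint(0, 10, size=200)   # 10 hypothetical scene classes

nmf = NMF(n_components=64, max_iter=400)
H_train = nmf.fit_transform(X_train)           # projection coefficients
clf = LogisticRegression(max_iter=1000).fit(H_train, y_train)

# project unseen recordings onto the learned dictionary, then classify
X_test = np.abs(np.random.randn(50, 1024))
pred = clf.predict(nmf.transform(X_test))
```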
IEEE Transactions on Signal Processing | 2014
Nicolas Seichepine; Slim Essid; Cédric Févotte; Olivier Cappé
This work introduces a new framework for nonnegative matrix factorization (NMF) in multisensor or multimodal data configurations, where taking into account the mutual dependence that exists between the related parallel streams of data is expected to improve performance. In contrast with previous works that focused on co-factorization methods, where some factors are shared by the different modalities, we propose a soft co-factorization scheme which accounts for possible local discrepancies across modalities or channels. This objective is formalized as an optimization problem where concurrent factorizations are jointly performed while being tied by a coupling term that penalizes differences between the related factor matrices associated with different modalities. We provide majorization-minimization (MM) algorithms for three common measures of fit (the squared Euclidean norm, the Kullback-Leibler divergence and the Itakura-Saito divergence) and two possible coupling variants, using either the l1-norm or the squared Euclidean norm of the differences. The approach is shown to achieve promising performance in two audio-related tasks: multimodal speaker diarization using audiovisual data and audio source separation using stereo data.
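Here is a minimal sketch of one such variant, assuming the squared Euclidean norm for both the fit and the coupling and using heuristic multiplicative updates (a simplification, not the paper's exact MM derivation). The two modalities must share a common time axis, and the coupling pulls their activation matrices together without forcing them to be identical.

```python
import numpy as np

def soft_cofact(V1, V2, k, lam=1.0, n_iter=300, eps=1e-12, seed=0):
    """Coupled NMF of V1 (F1, N) and V2 (F2, N); same number of columns N."""
    rng = np.random.default_rng(seed)
    W1 = rng.random((V1.shape[0], k)); H1 = rng.random((k, V1.shape[1]))
    W2 = rng.random((V2.shape[0], k)); H2 = rng.random((k, V2.shape[1]))
    for _ in range(n_iter):
        # per-modality Euclidean NMF updates for the dictionaries
        W1 *= (V1 @ H1.T) / (W1 @ H1 @ H1.T + eps)
        W2 *= (V2 @ H2.T) / (W2 @ H2 @ H2.T + eps)
        # the coupling term lam * ||H1 - H2||_F^2 enters each activation update
        H1 *= (W1.T @ V1 + lam * H2) / (W1.T @ W1 @ H1 + lam * H1 + eps)
        H2 *= (W2.T @ V2 + lam * H1) / (W2.T @ W2 @ H2 + lam * H2 + eps)
    return W1, H1, W2, H2
```

Setting lam=0 recovers two independent factorizations, while a large lam approaches hard co-factorization with shared activations.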