Aleksandr Diment
Tampere University of Technology
Publications
Featured research published by Aleksandr Diment.
International Workshop on Machine Learning for Signal Processing | 2013
Forrest Briggs; Yonghong Huang; Raviv Raich; Konstantinos Eftaxias; Zhong Lei; William Cukierski; Sarah Frey Hadley; Adam S. Hadley; Matthew G. Betts; Xiaoli Z. Fern; Jed Irvine; Lawrence Neal; Anil Thomas; Gabor Fodor; Grigorios Tsoumakas; Hong Wei Ng; Thi Ngoc Tho Nguyen; Heikki Huttunen; Pekka Ruusuvuori; Tapio Manninen; Aleksandr Diment; Tuomas Virtanen; Julien Marzat; Joseph Defretin; Dave Callender; Chris Hurlburt; Ken Larrey; Maxim Milakov
Birds have been widely used as biological indicators for ecological research. They respond quickly to environmental changes and can be used to make inferences about other organisms (e.g., the insects they feed on). Traditional methods for collecting data about birds involve costly human effort. A promising alternative is acoustic monitoring. Recording audio of birds has many advantages over human surveys, including increased temporal and spatial resolution and extent, applicability in remote sites, reduced observer bias, and potentially lower cost. However, reliably identifying bird sounds in real-world audio collected in an acoustic monitoring scenario remains an open problem for signal processing and machine learning. Major challenges include multiple simultaneously vocalizing birds, other sources of non-bird sound (e.g., buzzing insects), and background noise such as wind, rain, and motor vehicles.
European Signal Processing Conference | 2015
Aleksandr Diment; Emre Cakir; Toni Heittola; Tuomas Virtanen
A feature based on the group delay function from all-pole models (APGD) is proposed for environmental sound event recognition. The commonly used spectral features take into account only the magnitude information, whereas the phase is overlooked due to the complications related to its interpretation. The additional information concealed in the phase is hypothesised to be beneficial for sound event recognition. The APGD is an approach to inferring phase information which has shown applicability for speech and music analysis and is now studied in environmental audio. The evaluation is performed within a multi-label deep neural network (DNN) framework on a diverse real-life dataset of environmental sounds. The feature shows a performance improvement over the baseline log mel-band energy case. Combined with the magnitude-based features, APGD demonstrates further improvement.
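As an illustration of the feature itself (not the paper's exact implementation), the sketch below fits an all-pole model to each audio frame and takes the group delay of the resulting filter. The LPC order, frame length, and FFT size are illustrative assumptions, and librosa/scipy are assumed available.

```python
# Hedged sketch of an all-pole group delay (APGD) feature extractor.
# LPC order, frame length and FFT size are illustrative choices,
# not the settings used in the paper.
import numpy as np
import librosa
import scipy.signal

def apgd_frame(frame, lpc_order=20, n_fft=1024):
    """Group delay of the all-pole model fitted to one audio frame."""
    a = librosa.lpc(frame, order=lpc_order)      # all-pole polynomial A(z)
    # Group delay of H(z) = 1 / A(z), evaluated on n_fft // 2 + 1 bins.
    _, gd = scipy.signal.group_delay(([1.0], a), w=n_fft // 2 + 1)
    return gd

# Toy usage on a synthetic signal.
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(sr)
frames = librosa.util.frame(y, frame_length=1024, hop_length=512).T
apgd = np.stack([apgd_frame(np.ascontiguousarray(f)) for f in frames])
print(apgd.shape)   # (n_frames, n_fft // 2 + 1)
```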
IEEE Transactions on Audio, Speech, and Language Processing | 2018
Joonas Nikunen; Aleksandr Diment; Tuomas Virtanen
In this paper, we propose a method for the separation of moving sound sources. The method first tracks the sources, then estimates the source spectrograms using multichannel nonnegative matrix factorization (NMF), and finally extracts the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model in which the time-varying mixing of the sources is represented by spatial covariance matrices (SCMs), and provide update equations that optimize the model parameters by minimizing the squared Frobenius norm. The SCMs of the model are obtained from the estimated directions of arrival of the tracked sources at each time frame. The evaluation is based on established objective separation criteria and uses real recordings of two and three simultaneous moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of the other evaluated blind approaches on all measured quantities. Additionally, we evaluate the methods' susceptibility to tracking errors by comparing against the separation quality achieved using annotated ground-truth source trajectories.
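The final extraction step described above, single-channel Wiener filtering given model spectrograms for each source, can be sketched as follows; in the paper the spectrograms would come from the multichannel NMF/SCM model, which is not reproduced here.

```python
# Hedged sketch of single-channel Wiener-filter extraction: given the
# complex STFT of the mixture and a nonnegative power-spectrogram model
# for each source, mask out one source. The NMF/SCM estimation that
# would produce the models is not shown.
import numpy as np

def wiener_extract(X_mix, V_sources, s, eps=1e-12):
    """X_mix: complex STFT of the mixture (freq x time).
    V_sources: list of nonnegative power spectrograms, one per source.
    Returns the complex STFT estimate of source s."""
    V_total = sum(V_sources) + eps     # summed model, guarded against zeros
    mask = V_sources[s] / V_total      # Wiener (ratio) mask in [0, 1]
    return mask * X_mix
```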
Speech Communication | 2016
Joonas Nikunen; Aleksandr Diment; Tuomas Virtanen; Miikka Vilermo
Highlights: (1) A method for binaural rendering of sound scene recordings is proposed. (2) Source signals and their directions of arrival are estimated using a microphone array. (3) A low-rank NMF model is used for the separation of the sound sources. (4) A speech intelligibility test with overlapping speech is conducted. (5) Binaural processing is shown to improve speech intelligibility over stereo playback.

This paper proposes a method for binaural reconstruction of a sound scene captured with a portable-sized array consisting of several microphones. The proposed processing separates the scene into a sum of a small number of sources, and the spectrogram of each of them is in turn represented as a small number of latent components. The direction of arrival (DOA) of each source is estimated, followed by binaural rendering of each source at its estimated direction. For representing the sources, the proposed method uses low-rank complex-valued non-negative matrix factorization combined with a DOA-based spatial covariance matrix model. The binaural reconstruction is achieved by applying the binaural cues (head-related transfer functions) associated with the estimated source DOAs to the separated source signals. The binaural rendering quality of the proposed method was evaluated using a speech intelligibility test. The results indicate that the proposed binaural rendering improves the intelligibility of speech over stereo recordings and over separation by a minimum variance distortionless response beamformer with the same binaural synthesis in a three-speaker scenario. An additional listening test evaluating the subjective quality of the rendered output indicates no added processing artifacts by the proposed method in comparison to the unprocessed stereo recording.
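A minimal sketch of the final rendering step, assuming the sources have already been separated and that a bank of head-related impulse responses (HRIRs) indexed by direction is available; hrir_bank is a hypothetical lookup, not something defined in the paper.

```python
# Hedged sketch of binaural rendering: convolve each separated source
# with the HRIR pair for its estimated DOA and sum the results.
# hrir_bank is a hypothetical mapping from a DOA label to a
# (left, right) pair of HRIRs; all sources and HRIRs are assumed to
# have equal lengths so the sums align.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(sources, doas, hrir_bank):
    left, right = None, None
    for sig, doa in zip(sources, doas):
        h_l, h_r = hrir_bank[doa]               # HRIRs for this direction
        l, r = fftconvolve(sig, h_l), fftconvolve(sig, h_r)
        left = l if left is None else left + l
        right = r if right is None else right + r
    return np.stack([left, right])              # (2, n_samples) output
```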
Workshop on Applications of Signal Processing to Audio and Acoustics | 2015
Aleksandr Diment; Tuomas Virtanen
This paper proposes dictionary learning with archetypes for audio processing. Archetypes are so-called pure types: each archetype is a combination of a few data points, and each data point can in turn be approximated as a combination of archetypes. The concept has been found useful in various problems, but it has not yet been applied to audio analysis. The algorithm performs archetypal analysis by minimising the generalised Kullback-Leibler divergence between an observation and the model, a divergence shown to be suitable for audio. The methodology is evaluated in a source separation scenario (mixtures of speech) and shows results comparable to the state of the art, with perceptual measures indicating its superiority over all of the competing methods in the case of medium-size dictionaries.
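For reference, the generalised Kullback-Leibler objective minimised here is the same one used in standard KL-NMF dictionary learning; the sketch below shows the classic multiplicative updates for that objective, not the paper's archetypal constraints.

```python
# Standard multiplicative-update NMF minimising the generalised KL
# divergence D(V || WH) -- shown only to illustrate the objective.
# This is NOT the paper's archetypal analysis, which additionally
# constrains the dictionary atoms to be convex combinations of
# data points.
import numpy as np

def kl_nmf(V, rank, n_iter=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps     # dictionary (freq x atoms)
    H = rng.random((rank, T)) + eps     # activations (atoms x time)
    for _ in range(n_iter):
        W *= ((V / (W @ H + eps)) @ H.T) / H.sum(axis=1)
        H *= (W.T @ (V / (W @ H + eps))) / W.sum(axis=0)[:, None]
    return W, H
```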
International Symposium on Neural Networks | 2017
Michele Valenti; Stefano Squartini; Aleksandr Diment; Giambattista Parascandolo; Tuomas Virtanen
This paper presents a novel application of convolutional neural networks (CNNs) to the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio represented by their log-mel spectrograms. We also introduce a training method that can be used under particular circumstances to make full use of small datasets. The proposed system is tested on three different ASC datasets and compared to other state-of-the-art systems that competed in the “Detection and Classification of Acoustic Scenes and Events” (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), which constitute improvements of 6.4% and 9% over the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches 77.0% accuracy, improving on the challenge winner's score by 1%.
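As a rough structural illustration (not the paper's exact architecture or hyperparameters), a CNN of this kind maps a batch of log-mel spectrogram patches to scene-class scores:

```python
# Illustrative sketch in PyTorch -- layer sizes, depth and the class
# count are placeholders, not the architecture evaluated in the paper.
import torch
import torch.nn as nn

class SceneCNN(nn.Module):
    def __init__(self, n_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),    # pool over frequency and time
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):               # x: (batch, 1, mel_bands, frames)
        return self.classifier(self.features(x).flatten(1))

logits = SceneCNN()(torch.randn(4, 1, 60, 500))
print(logits.shape)                     # torch.Size([4, 15])
```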
European Signal Processing Conference | 2016
Aleksandr Diment; Mikko Parviainen; Tuomas Virtanen; Roman Zelov; Alex Glasman
Detection of whispered speech in the presence of high levels of background noise has applications in fraudulent behaviour recognition. For instance, it can serve as an indicator of possible insider trading. We propose a deep neural network (DNN)-based whispering detection system that operates on both magnitude and phase features, including the group delay feature from all-pole models (APGD). We show that the APGD feature outperforms the conventional ones. Trained and evaluated on a collected, diverse dataset of whispered and normal speech with emulated phone-line distortions and significant amounts of added background noise, the proposed system achieves accuracies as high as 91.8%.
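The combination of magnitude and phase features mentioned above can be sketched as simple frame-wise concatenation; this reuses the apgd_frame helper from the earlier APGD sketch, and all settings are illustrative, not the paper's.

```python
# Hedged sketch of combining magnitude (log mel) and phase (APGD)
# features by frame-wise concatenation before a DNN classifier.
# Assumes the apgd_frame helper sketched earlier on this page;
# n_mels, n_fft and hop are illustrative values.
import numpy as np
import librosa

def combined_features(y, sr, n_mels=40, n_fft=1024, hop=512):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    logmel = librosa.power_to_db(mel).T                # (frames, n_mels)
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop).T
    gd = np.stack([apgd_frame(np.ascontiguousarray(f), n_fft=n_fft)
                   for f in frames])                   # (frames, bins)
    n = min(len(logmel), len(gd))                      # align frame counts
    return np.hstack([logmel[:n], gd[:n]])
```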
Computer Music Modeling and Retrieval | 2013
Aleksandr Diment; Padmanabhan Rajan; Toni Heittola; Tuomas Virtanen
In this work, a feature based on the group delay function from all-pole models (APGD) is proposed for pitched musical instrument recognition. Conventionally, spectrum-related features take into account only the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase which could be beneficial for recognition. The APGD is an elegant approach to inferring phase information that avoids the issues related to interpreting the phase and does not require extensive parameter adjustment. Having shown applicability to speech-related problems, it is now explored for instrument recognition. The evaluation is performed with various instrument sets and shows noteworthy absolute accuracy gains of up to 7% compared to the baseline mel-frequency cepstral coefficients (MFCCs) case. Combined with the MFCCs and with feature selection, APGD demonstrates superiority over the baseline on all of the evaluated sets.
European Signal Processing Conference | 2013
Aleksandr Diment; Toni Heittola; Tuomas Virtanen
DCASE 2017 - Workshop on Detection and Classification of Acoustic Scenes and Events | 2017
Annamaria Mesaros; Toni Heittola; Aleksandr Diment; Benjamin Elizalde; Ankit Shah; Emmanuel Vincent; Bhiksha Raj; Tuomas Virtanen