An Ensemble of Convolutional Neural Networks for Audio Classification

Loris Nanni a, Gianluca Maguolo a, Sheryl Brahnam b and Michelangelo Paci c

a DEI, University of Padua, Viale Gradenigo 6, Padua, Italy
b Information Technology and Cybersecurity, Glass Hall Room 378, Missouri State University, Springfield, MO 65804, USA
c BioMediTech, Faculty of Medicine and Health Technology, Tampere University, Arvo Ylpön katu 34, D 219, FI-33520, Tampere, Finland

Corresponding author e-mail: [email protected]

ABSTRACT

In this paper, ensembles of classifiers that exploit several data augmentation techniques and four signal representations for training Convolutional Neural Networks (CNNs) for audio classification are presented and tested on three freely available audio classification datasets: i) bird calls, ii) cat sounds, and iii) the Environmental Sound Classification (ESC-50) dataset. The best-performing ensembles, combining data augmentation techniques with different signal representations, are compared and shown to outperform the best methods reported in the literature on these datasets, obtaining state-of-the-art results on the widely used ESC-50 dataset. To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification. Results demonstrate not only that CNNs can be trained for audio classification but also that their fusion using different techniques works better than the stand-alone classifiers.

Keywords: Audio classification, Data Augmentation, Ensemble of Classifiers, Pattern Recognition.

Introduction

Sound classification and recognition have long been included in the field of pattern recognition. Some of the more popular application domains include speech recognition [1], music classification [2], biometric identification [3], and environmental sound recognition [4]. Following the three classical pattern recognition steps of i) preprocessing, ii) feature extraction, and iii) classification, most early work in sound classification began by extracting features, such as the Statistical Spectrum Descriptor or Rhythm Histogram [5], from the actual audio traces. Once it was recognized, however, that visual representations of audio traces contain valuable information, powerful feature extraction techniques used in image classification began to be investigated. Mel-frequency Cepstral Coefficient spectrograms [6], spectrograms [7], and other similar representations visually display the spectrum of frequencies of the original audio traces as they vary over time. A spectrogram, for instance, is a bi-dimensional graph representing the geometric dimensions of frequency and time, and information regarding the signal's amplitude at a specific frequency and time step can also be encoded as pixel intensity [8]. An important research stream in sound classification has focused on investigating the value of extracting features from visual representations of audio. Costa et al. [9], for instance, computed gray level co-occurrence matrices (GLCMs) [10] from spectrograms as features to train Support Vector Machines (SVMs) on the Latin Music Database (LMD) [11], and in [12] they demonstrated improved accuracy by extracting Local Binary Patterns (LBPs) [13] to train SVMs not only on the LMD dataset but also on ISMIR04 [14]. Costa et al. [15] later investigated extracting Local Phase Quantization (LPQ) and Gabor filters [16] as features. Ensembles of classifiers designed to fuse many state-of-the-art texture descriptors with acoustic features extracted from the audio traces were exhaustively investigated on multiple datasets by Nanni et al. [2], who demonstrated that the accuracy of systems based solely on acoustic or visual features could be enhanced by combining the two.

Recently, deep learning classifiers have proven even more robust in pattern recognition and classification than texture analysis techniques. With the broad availability of relatively inexpensive Graphics Processing Units (GPUs), many researchers have begun applying deep learning techniques to visual representations of acoustic traces. Preselected or handcrafted descriptors, such as LBP, are not necessary for deep learners since they learn salient features during the training phase. Deep learners, moreover, are uniquely suited to handling visual representations of audio because many of the most famous deep classifiers, such as Convolutional Neural Networks (CNNs), require matrices as their input. Humphrey and Bello [17, 18] were among the first to apply CNNs to audio images for music classification and, as a result, succeeded in redefining the state of the art in automatic chord detection and recognition. In the same year, Nakashika et al. [19] reported converting spectrograms to GLCM maps to train CNNs to perform music genre classification on the GTZAN dataset [20]. Later, Costa et al. [21] fused a CNN with the traditional pattern recognition framework of training SVMs on LBP features to classify the LMD dataset. These works exceeded traditional classification results on these genre datasets. Up to this point, most work in audio classification has applied the latest advances in machine learning to the problem of sound classification and recognition without modifying the classification process to make it singularly suitable for sound recognition.

An early exception to the generic approach is found in the work of Sigtia and Dixon [22], who adjusted CNN parameters and structures in such a way as to reduce the time it took to train a set of audio images. Time reduction was accomplished by replacing sigmoid units with Rectified Linear Units (ReLU) and stochastic gradient descent with Hessian-Free optimization. Wang et al. [23] created what they called a sparse coding CNN.

Audio image representation
Since the input to a CNN is in the form of a matrix, the following four methods were used to map the audio signals into spectrograms.
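The overall signal-to-image pipeline shared by the four methods below can be sketched as follows; this is a minimal illustration (window length, hop, and σ are illustrative choices, not the paper's values) of an STFT with a Gaussian window mapped to an 8-bit gray-scale image, as described in the following sections:

```python
import numpy as np

def gaussian_stft_image(x, win_len=256, hop=64, sigma=0.02):
    """Sketch: DGT-like spectrogram via an STFT with a Gaussian window,
    linearly mapped to gray scale (min -> 0, max -> 255, floor-rounded)."""
    t = np.arange(win_len) - win_len // 2
    window = np.exp(-np.pi * (sigma * t) ** 2)           # Gaussian window
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T         # rows = frequencies
    lo, hi = spec.min(), spec.max()
    if hi == lo:                                         # degenerate signal
        return np.zeros_like(spec, dtype=np.uint8)
    return np.floor(255 * (spec - lo) / (hi - lo)).astype(np.uint8)

# Example: one second of a 440 Hz tone at 8 kHz
sr = 8000
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
img = gaussian_stft_image(signal)
print(img.shape, img.min(), img.max())
```

The resulting matrix is what the pretrained CNNs consume, after resizing to the input resolution each network expects.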
Discrete Gabor Transform (DGT)
DGT is a Short-Time Fourier Transform (STFT) with a Gaussian kernel as the window function, defined as the convolution between the product of the signal with a complex exponential and a Gaussian:
G(τ, ω) = ∫_(−∞)^(+∞) x(t) e^(−iωt) e^(−πσ²(t−τ)²) dt, (1)

where x(t) is the signal, ω is a frequency, and i is the imaginary unit. The width of the Gaussian window is defined by σ. The discrete version of DGT applies the discrete convolution rather than the continuous convolution. The output G(τ, ω) is a matrix whose columns represent the frequencies of the signal at a fixed time. The DGT implementation is that defined in [52].

Mel spectrograms (MEL)
MEL [53] spectrograms are computed by extracting the coefficients relative to the compositional frequencies with the STFT. Extraction is accomplished by passing each frame of the frequency-domain representation through a Mel filter bank (the idea is to mimic the non-linear human ear perception of sound, which discriminates lower frequencies better than higher frequencies). Conversion between Hertz (f) and Mel (m) is defined as

m = 2595 log10(1 + f/700). (2)

The filters in the filter bank are all triangular: each has a response of 1 at its center frequency, which decreases linearly towards 0 until it reaches the center frequencies of the two adjacent filters, where the response is 0.

Gammatone (GA) band-pass filters
This is a bank of GA filters whose bandwidth increases with increasing central frequency. The functional form of the Gammatone is inspired by the response of the cochlea membrane in the inner ear of the human auditory system [54]. The impulse response of a Gammatone filter is the product of a statistical distribution (Gamma) and a sinusoidal carrier tone:

g(t) = a t^(n−1) e^(−2πB_i t) cos(2πω_i t + φ), (3)

where ω_i is the central frequency of the filter and φ is its phase. Gain is controlled by the constant a, and n is the order of the filter. The parameter B_i is a decay factor that determines the bandwidth of the band-pass filter.

Cochleagram (CO)
CO is a mapping method that models the frequency selectivity of the human cochlea [55]. To extract a cochleagram, the original signal is first filtered with a Gammatone filter bank (see the previous section). The filtered signal is then divided into overlapping windows, and for each window and every frequency the energy of the signal is calculated.

Each of the four spectrograms is then mapped to a gray-scale image using a linear transformation that maps the minimum value to 0 and the maximum value to 255, with the value of each pixel rounded to the closest smaller integer.

Convolutional neural networks
CNNs are used for two different purposes in this study: 1) as feature extractors, where the features are used to train simpler SVMs, and 2) as classifiers. One of the first examples of a CNN was introduced in 1998 by LeCun et al. [54] with the LeNet5 architecture, a deep feed-forward neural network whose neurons in one layer are locally connected to the previous layer and whose weights, biases, and activation functions are iteratively adjusted during training. Aside from the input and output layers, CNNs are composed of one or more of the following hidden layers: convolutional (CONV), activation (ACT), pooling (POOL), and fully-connected (FC), or classification, layers. The CONV layers extract features from the input volume by convolving a local region of the input volume (the receptive field) with filters of the same size. Once the convolution is computed, each filter slides to the next receptive field, where the convolution between the new receptive field and the same filter is computed. This process is iterated over the entire input image, producing the input for the next layer, a non-linear ACT layer, which improves the learning capabilities and classification performance of the network. Typical activation functions include i) the non-saturating ReLU function, f(x) = max(0, x), ii) the saturating hyperbolic tangent f(x) = tanh(x), and iii) the sigmoid function f(x) = (1 + e^(−x))^(−1). POOL layers are often interspersed between CONV layers and perform non-linear downsampling operations (max or average pooling) that reduce the spatial size of the representation, which in turn reduces the number of parameters, the possibility of overfitting, and the computational complexity of the CNN. FC layers typically make up the last hidden layers and have neurons fully connected to all the activations in the previous layer.
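The CONV → ACT → POOL chain described above can be made concrete with a tiny numpy sketch (the 6×6 "image" and the edge filter are illustrative, not any network's actual weights):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide the kernel over every receptive field (valid positions only)."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Non-saturating activation f(x) = max(0, x)."""
    return np.maximum(0, x)

def maxpool2x2(x):
    """Non-linear 2x2 downsampling."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)    # toy 6x6 "image"
edge = np.array([[-1., 0., 1.]] * 3)              # simple vertical-edge filter
feat = maxpool2x2(relu(conv2d_valid(img, edge)))  # CONV -> ACT -> POOL
print(feat.shape)
```

Stacking many such blocks, each with learned filters, is what produces the deep architectures listed next.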
SoftMax is generally used as the activation function for the output CLASS layer, which performs the final classification. In this study, five CNN architectures pretrained on ImageNet [56] or Places365 [57] are adapted to the problem of sound classification as defined in the datasets used in this work. The architecture of each of the following pretrained CNNs remains unaltered except for the last three layers, which are replaced by an FC layer, an ACT layer using SoftMax, and a CLASS layer:

1. AlexNet [58] was the first neural network to win (and by a large margin) the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012). AlexNet is composed of five CONV blocks followed by three FC layers. The dimension of the hidden layers is gradually reduced with max-pooling layers. The architecture of AlexNet is simple since every hidden layer has only one input layer and one output layer.

2. GoogleNet [59] won the ILSVRC 2014 challenge. The architecture of GoogleNet involves twenty-two layers and five POOL layers. GoogleNet was unique in its introduction of the novel Inception module, a subnetwork made up of parallel convolutional filters. Because the output of these filters is concatenated, the number of learnable parameters is significantly reduced. This study uses two pretrained GoogleNets: the first trained on the ImageNet database [56] and the second on the Places365 dataset [57].

3. VGGNet [60] is a CNN that took second place in ILSVRC 2014. With 16 (VGG-16) or 19 (VGG-19) CONV/FC layers, it is extremely deep, and all its CONV layers are homogeneous. Unlike AlexNet [58], which applies a POOL layer after every CONV layer, VGGNet stacks relatively tiny 3×3 convolutional filters with a POOL layer applied every two to three CONV layers. Both VGG-16 and VGG-19 are used in this study, and both are pretrained on the ImageNet database [56].

4. ResNet [61] won ILSVRC 2015 and is much deeper than VGGNet. ResNet is distinguished by a novel "network-in-network" architecture composed of residual (RES) layers. ResNet is also unique in applying a global average pooling layer at the end of the network rather than the more typical set of FC layers. These architectural advances produce a model that is eight times deeper than VGGNet yet significantly smaller in size. Both ResNet50 (a 50-layer residual network) and ResNet101 (the deeper variant of ResNet50) are investigated in this study. Both take input images of size 224×224 pixels.

5. InceptionV3 [62] advances GoogleNet (also known as InceptionV1) by making the auxiliary classifiers act as regularizers rather than classifiers, by factorizing 7×7 convolutions into two or three consecutive layers of 3×3 convolutions, and by applying the RMSProp optimizer. InceptionV3 accepts images of size 299×299 pixels.

Data augmentation approaches
Below is a description of the augmentation protocols that were combined and tested in this study. For each data augmentation method used to train a CNN, both the original and the artificially generated patterns were included in the training set.
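Concretely, every protocol below expands the training set in the same way; a schematic sketch (`augment` stands in for any of the protocols, and `n_aug` is the protocol-specific number of generated patterns):

```python
import random

def expand_training_set(samples, augment, n_aug):
    """Return the original patterns plus n_aug augmented copies of each."""
    out = []
    for s in samples:
        out.append(s)                                # keep the original
        out.extend(augment(s) for _ in range(n_aug))  # add generated patterns
    return out

# toy illustration: the 'augmentation' just perturbs a number
random.seed(0)
train = expand_training_set([1.0, 2.0],
                            lambda s: s + random.uniform(-0.1, 0.1),
                            10)
print(len(train))  # 2 originals + 2*10 augmented = 22
```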
Standard signal augmentation (SGN)
SGN applies the built-in data augmentation methods for audio signals available in MATLAB. For each training signal, ten new ones were generated by applying the following transformations, each with 50% probability:

1. SpeedupFactorRange scales the speed of the signal by a random factor in the range [0.8, 1.2];
2. SemitoneShiftRange shifts the pitch of the signal by a random number in the range [−2, 2] semitones;
3. VolumeGainRange increases or decreases the gain of the signal by a random number in the range [−3, 3] dB;
4. SNR injects random noise into the signal in the range [0, 10] dB;
5. TimeShiftRange shifts the time of the signal in the range [−0.005, 0.005] s.

Short Signal Augmentation (SSA)
SSA works directly on the raw audio signals. For every original signal, the following ten augmentations are applied to produce ten new samples:

1. applyWowResampling implements wow resampling, a variant of pitch shift whose intensity changes over time; the wow transformation can be written as x_wow(t) = x(t + (a_m/(2π f_m)) sin(2π f_m t)), where x is the input signal, a_m is the modulation amplitude, and f_m is the modulation frequency. In this study, a_m = f_m.
2. applyNoise inserts white noise so that the ratio between the signal and the noise is X dB; in this study, X = 10.
3. applyClipping normalizes the audio signal so that 10% of the samples fall outside [−1, 1], with each out-of-range sample x clipped to sign(x).
4. applySpeedUp increases or decreases the speed of the audio signal; in this study, the speed was increased by 15%.
5. applyHarmonicDistortion is the repeated application of quadratic distortion to the signal; in this study, the following distortion was applied five consecutive times: s_out = sin(2π s_in).
6. applyGain increases the gain by a specific number of dB, which in this study was set to 10 dB.
7. applyRandTimeShift randomly divides each audio signal in two and swaps the parts, mounting them back into a randomly shifted signal. If s_in(t) is the value of the input audio signal at time t, T is the length of the signal, and t* is a random time between 0 and T, then s_out(t) = s_in(t + t*) for t ≤ T − t* and s_out(t) = s_in(t + t* − T) otherwise.
8. applyDynamicRangeCompressor applies Dynamic Range Compression (DRC) [63] to the audio signal. DRC boosts the lower intensities of an audio signal and attenuates the higher intensities by applying an increasing piecewise linear function; in other words, it compresses the signal's dynamic range.
9. applyPitchShift shifts the pitch of an audio signal by a specific number of semitones; here it increases the pitch by two semitones.
10. applyPitchShift is used again to decrease the pitch of the audio signal by two semitones.
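The swap in applyRandTimeShift (item 7) is simply a circular shift of the signal, s_out(t) = s_in((t + t*) mod T); a minimal numpy sketch (seed and signal are illustrative):

```python
import numpy as np

def rand_time_shift(s, rng):
    """Split the signal at a random time t* and swap the two parts,
    i.e. a circular shift: s_out(t) = s_in((t + t*) mod T)."""
    t_star = rng.integers(1, len(s))   # random split point in (0, T)
    return np.roll(s, -t_star)         # move the first part to the end

rng = np.random.default_rng(42)
s = np.arange(8.0)
shifted = rand_time_shift(s, rng)
print(shifted)
```

The output contains exactly the same samples as the input, just rotated, so the signal's energy and spectrum magnitude are preserved.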
Super signal augmentation (SSiA)
In this protocol, twenty-nine new images are generated from every original image. The following five augmentations are applied to every sample, with the parameters of the augmentations randomized to generate new images:

1. applyWowResampling, as in SSA;
2. applySpeedUp, as in SSA, but the speed is either increased or decreased by a random number of percentage points in the range [−5, 5];
3. applyGain, as in SSA, but the gain factor is sampled randomly in the range [−0.5, 0.5];
4. applyRandTimeShift, as in SSA;
5. applyPitchShift, as in SSA, but the pitch is shifted in the range [−0.5, 0.5].

Small parameters are selected because applying multiple transformations to the same input would otherwise introduce changes that are too large. The difference between SSiA and SSA is that SSiA creates many images through multiple small transformations, whereas the images created by SSA are generated with a single large transformation.
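The "many small random transformations" idea can be sketched as follows; the gain and speed ranges follow the list above, while the composition order, helper names, and naive linear-interpolation resampling are illustrative assumptions:

```python
import numpy as np

def small_augment(s, rng):
    """Chain several mild random transformations, in the spirit of SSiA."""
    gain_db = rng.uniform(-0.5, 0.5)            # applyGain, small range
    s = s * 10.0 ** (gain_db / 20.0)
    s = np.roll(s, -rng.integers(1, len(s)))    # applyRandTimeShift
    speed = 1.0 + rng.uniform(-0.05, 0.05)      # applySpeedUp, +/-5 %
    idx = np.arange(0, len(s) - 1, speed)
    return np.interp(idx, np.arange(len(s)), s)  # naive resampling

rng = np.random.default_rng(0)
s = np.sin(np.linspace(0, 20, 1000))
variants = [small_augment(s, rng) for _ in range(29)]
print(len(variants))
```

Each of the 29 variants differs only slightly from the original, matching the protocol's rationale that stacked transformations must individually stay small.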
Time scale modification (TSM)
This protocol applies the five algorithms contained in the TSM toolbox [64]. TSM methods are commonly used in music production software to change the speed of signals without changing their pitch. Since two different constant stretching factors were used for each TSM method, this augmentation approach produced ten new images. For a detailed description of the TSM toolbox, see [65]. A brief description of the five TSM algorithms follows:

1. OverLap Add (OLA): the simplest TSM method. It covers the input signal with overlapping windows of size H_a and maps them into overlapping windows of size H_s. The number H_a depends on the implementation of the algorithm, while the ratio α = H_s/H_a is the speed-up factor, which the user can optionally set; in this study it was set to 0.8 and 1.5, and these same values were used for each TSM method.
2. Waveform Similarity OverLap Add (WSOLA): a modification of OLA where the overlap of the windows is not fixed but has some tolerance, so as to better represent the output signal in cases where there is a difference of phase.
3. Phase Vocoder: addresses the same phase problem as WSOLA but exploits the dual approach by matching the windows in the frequency domain: first, the Fourier transform of the signal is calculated; second, the frequencies are matched; finally, the signal is pulled back into the time domain.
4. Phase Vocoder with identity phase locking: a modification of the Phase Vocoder where the frequencies are matched as if they were not independent of each other. This modification was introduced by Laroche and Dolson [66].
5. Harmonic-Percussive Source Separation (HPSS): decomposes an audio signal into its harmonic components, which form structures in the time direction, and its percussive components, which yield structures in the frequency direction. After decomposing the signal in this way, the Phase Vocoder with identity phase locking is applied to the harmonic component, and OLA is applied to the percussive component. Finally, the two components are merged to form a new signal.
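OLA (item 1 above) can be sketched in a few lines; this is a bare-bones illustration with a Hann window and amplitude normalization (window length and hop are illustrative choices, not the toolbox implementation):

```python
import numpy as np

def ola_stretch(x, alpha, win_len=512):
    """Naive OverLap-Add time-scale modification.
    Analysis hop H_a = win_len // 2; synthesis hop H_s = alpha * H_a."""
    h_a = win_len // 2
    h_s = int(round(alpha * h_a))
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // h_a
    out = np.zeros(h_s * (n_frames - 1) + win_len)
    norm = np.zeros_like(out)
    for k in range(n_frames):
        frame = x[k * h_a:k * h_a + win_len] * win   # analysis window
        out[k * h_s:k * h_s + win_len] += frame      # relocated synthesis window
        norm[k * h_s:k * h_s + win_len] += win       # window-sum for rescaling
    return out / np.maximum(norm, 1e-8)

x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000.0)
y = ola_stretch(x, 1.5)       # alpha > 1: the output is ~1.5x longer
print(len(y) / len(x))
```

WSOLA and the Phase Vocoder refine exactly this scheme, correcting the phase discontinuities that plain OLA introduces between relocated windows.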
Spectrogram augmentation (SSpA)
SSpA works directly on spectrograms and generates five transformed versions of each original:

1. applySpectrogramRandomShifts randomly applies pitch shift and time shift.
2. applyVTLN (Vocal Tract Length Normalization) creates a new image by applying VTLP [47], which divides a given spectrogram into ten unique temporal slices. Each slice is then warped along the frequency axis by a piecewise linear function that scales each frequency f by a factor α, chosen randomly in [a, b], while leaving the maximum frequency f_max fixed. In this study, a and b are set to 0.9 and 1.1, respectively.
3. applyRandTimeShift does as its name indicates by randomly picking a shift value T in [1, M], where M is the horizontal size of the input spectrogram. A given spectrogram is cut into two images, S₁ and S₂, taken respectively before and after time T; the new image is generated by inverting the order of S₁ and S₂.
4. applyRandomImageWarp creates a new image by applying Thin-Plate Spline Image Warping (TPS-Warp) [67] to a given spectrogram. TPS-Warp perturbs the original image by randomly changing the position of a subset S of the input pixels, adapting the pixels that do not belong to S via linear interpolation. In this study, the spectrogram is warped on the horizontal axis only. In addition, a frequency-time mask is applied by setting to zero the values of two rows and one column of the spectrogram; the width of the rows is set to 5 pixels and the width of the column to 15 pixels.
5. applyNoiseS applies pointwise random noise to the spectrogram: with probability 0.3, the value of a pixel is multiplied by a uniform random variable with mean one and variance one.

Super Spectro Augmentation (SuSA)
In this protocol, 29 new images are generated from each original sample. The following five augmentation methods are applied to each signal, with parameters randomized to produce different samples:

1. applySpectrogramRandomShifts, as in SSpA, but with time shift equal to zero and random pitch shift in [−1, 1];
2. applyVTLN, as in SSpA;
3. applyRandTimeShift, as in SSpA;
4. applyFrequencyMasking sets to zero at most two random columns (which represent times) and at most two random rows (which represent frequencies);
5. applyNoiseS applies pointwise random noise to spectrograms: with probability 0.1, the value of a pixel is multiplied by a uniform random variable in [0.3, 1.7].

Experimental results
The approach presented here was tested on three sound datasets:

• BIRDZ: a dataset of bird vocalizations (the bird calls dataset; performance on it is reported in Table 2).

• CAT [28, 68]: a balanced dataset of 300 samples of 10 classes of cat vocalizations collected from Kaggle, YouTube, and Flickr. The class labels are 1) Resting, 2) Warning, 3) Angry, 4) Defence, 5) Fighting, 6) Happy, 7) Hunting mind, 8) Mating, 9) Mother call, and 10) Paining. The average duration of each sound is ~4 s.

• ESC-50 [4]: an environmental sound classification dataset that contains 2000 samples evenly divided into 50 classes and five folds; hence, every fold contains eight samples belonging to each class (the 50 classes fall under the general categories of Animals; Natural soundscapes & water sounds; Human, non-speech sounds; Interior domestic sounds; and Exterior urban noises). It should be noted that several papers report classification results on these datasets that are superior to human performance [35-37, 39, 46, 64, 69].

The data augmentation techniques explored in this study are assessed on each dataset using the same testing protocols described in the original papers. The recognition rate (the average accuracy across all folds) is used as the performance indicator. In Tables 1-4, the mean accuracy over cross-validation obtained by some of the data augmentation protocols is reported and compared with a baseline that skips the augmentation step (NoAUG). The CNNs were trained for 30 epochs with a learning rate of 0.0001 (except for the last fully connected layer, which has a learning rate of 0.001) and a batch size of 60. The one exception is the CNN labeled 'VGG16-batchSize', the standard VGG16 with a fixed batch size of 30. For NoAUG, the batch size was set to 30. Seven fusions are also reported in Tables 1-4. We combined the results of the CNNs in an ensemble using the sum rule, which averages the output probability vectors of the stand-alone CNNs in the ensemble to create a new probability vector that is used for classification.
The rationale behind fusion is, as Hansen [49] describes, that "the collective decision produced by the ensemble is less likely to be in error than the decision made by any of the individual networks." The labels used in the tables and a brief description of the seven ensembles follow:

1. GoogleGoogle365: sum rule of GoogleNet and GoogleNet-places365 trained with each of the data augmentation protocols;
2. Fusion Local: sum rule of CNNs where each one is trained with a different data augmentation method;
3. Fusion Short: sum rule of all CNNs trained with SGN, SSA, and SSpA;
4. Fusion ShortSuper: sum rule of all CNNs trained with SGN, SSA, SSpA, SSiA, and SuSA;
5. FusionSuper: sum rule of all CNNs trained with SGN, SSiA, SuSA, and TSM;
6. FusionSuper_VGG16: sum rule of VGG16 trained with SGN, SSiA, SuSA, and TSM;
7. Fusion ALL: sum rule of all CNNs trained with SGN, SSA, SSpA, SSiA, SuSA, and TSM.

VGG16 can fail to converge; when this happens, VGG16 undergoes a second training. VGG16 can also produce a numeric problem by assigning the same score to all patterns (random performance on the training set); in this case, all its scores are set to zero. Other numeric problems can occur in the fusions by sum rule; to avoid such issues, any score that evaluates to not-a-number is treated as zero. In Tables 1-3, the DGT spectrogram is used to represent the signal as an image. Any cell with '---' means that the given CNN was not trained successfully (mainly due to memory problems with the GPUs). We also tested the three additional methods GA, MEL, and CO for representing a signal as an image, coupled with SGN only, as reported in Table 4.
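The sum rule with the not-a-number safeguard described above can be sketched as follows (class counts and scores are illustrative):

```python
import numpy as np

def sum_rule(score_matrices):
    """Fuse classifiers by averaging their probability vectors.
    Any NaN score (e.g. from a failed network) is zeroed first."""
    stacked = np.stack([np.nan_to_num(s, nan=0.0) for s in score_matrices])
    return stacked.mean(axis=0)

# three 'CNNs', two patterns, three classes; one network produced NaNs
cnn_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
cnn_b = np.array([[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]])
cnn_c = np.array([[np.nan, np.nan, np.nan], [0.3, 0.4, 0.3]])
fused = sum_rule([cnn_a, cnn_b, cnn_c])
print(fused.argmax(axis=1))   # predicted class per pattern -> [0 1]
```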
Table 1. Performance on the CAT dataset
CAT NoAUG SGN SSA SSpA SSiA SuSA TSM AlexNet
GoogleNet
VGG16
VGG19
ResNet101
Inception
GoogleNet- places365
VGG16-batchSize --- 86.10 88.14 86.78 89.49 86.10
Fusion Short
Fusion ShortSuper
FusionSuper
FusionALL
FusionSuper_VGG16
Table 2. Performance on the BIRDZ dataset
The * in the row VGG19 and column SSA indicates that a fold failed to converge, thus producing a random performance in that fold.
BIRDZ NoAUG SGN SSA SSpA SSiA SuSA TSM AlexNet
GoogleNet
VGG16
VGG19
ResNet50
ResNet101
Inception
GoogleNet-places365
VGG16-batchSize --- 95.84 --- 94.91 95.81
Fusion - Local
Fusion Short
Fusion ShortSuper
FusionSuper 96.90
Fusion ALL
FusionSuper_VGG16
Table 3. Performance on the ESC-50 dataset
ESC-50 NoAUG SGN SSA SSpA SSiA SuSA TSM
AlexNet 60.80 72.75 73.85 65.75 73.30 64.65 70.95
GoogleNet 60.00 72.30 73.70 67.85 73.20 71.70 73.55
VGG16 71.60 79.40
FusionALL 87.30
FusionSuper_VGG16 85.75
Table 4. Performance using different methods for representing the signal as an image
                      CAT           BIRDZ         ESC-50
                      GA  MEL  CO   GA  MEL  CO   GA  MEL  CO
AlexNet
GoogleNet
VGG16
VGG19
ResNet50
ResNet101
Inception
GoogleNet-places365
VGG16-batchSize
Fusion-Local
In Table 5, our best ensembles FusionGlobal and FusionGlobal-CO are compared with the state of the art. FusionGlobal is built with the CNNs belonging to FusionSuper and those reported in Table 4. FusionGlobal-CO is built like FusionGlobal but without the CNNs trained using CO as a signal representation. The performance reported for [30] in Table 5 differs from that reported in the original paper since, for a fair comparison with this work, we ran the method without the supervised data augmentation approaches.

Table 5. Comparison of our best approach and the state of the art
Descriptor           BIRDZ   CAT     ESC-50
[30]                 96.45   89.15   85.85
FusionGlobal         96.82   90.51
FusionGlobal-CO      ---
[28] - CNN           ---     90.8    ---
[71]                 96.7    ---     ---
[39]                 ---     ---     88.50
[46]                 ---     ---     84.90
[35]                 ---     ---     86.50
[36]                 ---     ---     83.50
[38]                 ---     ---     83.50
[37]                 ---     ---     81.95
Human Accuracy [4]   ---     ---     81.30
The following conclusions can be drawn from the reported results:

1. No single data augmentation protocol outperforms all the others across all the tests. TSM performs best on CAT and ESC-50 but works poorly on BIRDZ. Data augmentation at the spectrogram level works poorly on ESC-50, as it does on the other two datasets. SGN and data augmentation at the signal level work well across all the datasets. On average, the best data augmentation approach is SSA; although its performance is close to that of SSiA, its training time is shorter.
2. The best stand-alone CNNs are VGG16 and VGG19.
3. DGT works better than the other signal representations.
4. Combining different CNNs enhances performance across all the tested datasets.
5. For the ensemble Fusion Local, data augmentation is marginally beneficial on CAT and BIRDZ but produces excellent results on ESC-50. Compared to the stand-alone CNNs, data augmentation improves results on all three datasets. Of note, an ensemble of VGG16 networks (FusionSuper_VGG16) outperforms the stand-alone VGG16.
6. The performance of the ensemble of CNNs trained with different augmentation policies (FusionALL) can be further improved by adding to the ensemble networks trained using different signal representations (FusionGlobal). However, this improvement requires considerable computation time, mainly during the training step.

The methods reported here are based solely on deep learning approaches. As mentioned in the introduction, several papers have proposed sound classification methods based on handcrafted features, and it is also possible to construct ensembles that combine deep learning with handcrafted methods. To examine this potential, the following fusions were evaluated:

• The sum rule between FusionGlobal and the ensemble of handcrafted features proposed in [72] (extracted from DGT images) obtains an accuracy of 98.51%, higher than that obtained by FusionGlobal alone. Before the sum rule, the scores of the two ensembles are normalized to mean 0 and standard deviation 1. On BIRDZ, the handcrafted ensemble obtains an accuracy of 96.87%, close to that of our deep learning approach. The handcrafted ensemble works poorly on ESC-50, however, producing an accuracy of only 70.6%; on that dataset it is therefore not advantageous to combine the handcrafted approach with FusionGlobal.

• The sum rule between FusionGlobal and [51] obtains an excellent accuracy of 98.96%, compared to the 93.6% achieved by [51] alone. This means that the features extracted in [51] and those learned by the deep learning approach access different information.

In terms of computation time, the most expensive activity is the conversion of audio signals into spectrograms, since the conversion runs on a CPU rather than a GPU. In Table 6, computation times are reported for the different CNNs and signal representations on a machine equipped with an i7-7700HQ 2.80 GHz processor, 16 GB RAM, and a GTX 1080 GPU.
This test was run on an audio file of length 1.27 s with a sample rate of 32 kHz. It is interesting to note that FusionGlobal takes less than three seconds on a laptop. A further speed-up is possible, since the DGT can be parallelized and multiple audio files can thus be classified simultaneously.
Table 6. Computation time (in seconds) comparison between CNNs and signal representations
Signal Representation  Computation Time    CNN         Computation Time
DGT                    1.29                AlexNet     0.01
GA                     0.02                GoogleNet   0.03
MEL                    0.01                VGG16       0.01
CO                     0.08                VGG19       0.01
                                           ResNet50    0.02
                                           ResNet101   0.03
                                           Inception   0.03

Conclusion
In this paper, we presented an extensive study that investigated ensembles of CNNs using different data augmentation techniques for audio classification. Several data augmentation approaches designed for audio signals were tested and compared with each other and with a baseline approach that did not include data augmentation. Data augmentation methods were applied both to the raw audio signals and to their visual representations using different spectrograms. CNNs were trained with different sets of data augmentation approaches, and their outputs were combined by sum rule. Experimental results clearly demonstrate that ensembles composed of fine-tuned CNNs with different architectures maximized performance on the three tested audio classification problems, with some of the ensembles obtaining state-of-the-art results. To the best of our knowledge, this is the largest, most exhaustive study of CNN ensembles applied to the task of audio classification. This work can be expanded further by investigating which augmentation methods (spectrogram augmentation vs. signal augmentation) work best for classifying different kinds of sounds. A systematic selection of augmentation approaches, e.g. by iteratively evaluating an increasing subset of augmentation techniques (as is typical when evaluating different features), would require an enormous amount of time and computation power; an expert-based approach that utilizes the knowledge of environmental scientists would be the best way of handling this challenge. This study could also be expanded by including more datasets (e.g. whale and frog datasets), which would provide a more comprehensive validation of the proposed fusion methods. Furthermore, there is a need to investigate the impact on performance when different CNN topologies and parameter settings in the re-tuning step are coupled with different types of data augmentation.
Acknowledgments
The authors thank NVIDIA Corporation for supporting this work through the donation of a Titan Xp GPU, and the Tampere Center for Scientific Computing for generous computational resources.
References [1] J. Padmanabhan and M. J. J. Premkumar, "Machine learning in automatic speech recognition: A survey,"
IETE Technical Review, vol. 32, pp. 240-251, 2015. [2] L. Nanni, Y. M. G. Costa, D. R. Lucio, C. N. Silla Jr., and S. Brahnam, "Combining visual and acoustic features for audio classification tasks,"
Pattern Recognition Letters, vol. 88, pp. 49-56, 2017. [3] S. K. Sahoo, T. Choubisa, and S. R. M. Prasanna, "Multimodal Biometric Person Authentication: A Review,"
IETE Technical Review, vol. 29, no. 1, pp. 54-75, 2012, doi: 10.4103/0256-4602.93139. [4] K. J. Piczak, "ESC: Dataset for Environmental Sound Classification," presented at the Proceedings of the 23rd ACM international conference on Multimedia, Brisbane, Australia, 2015. [Online]. Available: https://doi.org/10.1145/2733373.2806390. [5] T. Lidy and A. Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification,"
ISMIR, pp. 34-41, 2005. [6] J. Rubin, R. Abreu, A. Ganguli, S. Nelaturi, I. Matei, and K. Sricharan, "Classifying heart sound recordings using deep convolutional neural networks and mel-frequency cepstral coefficients," presented at the Computing in Cardiology (CinC), Vancouver, Canada, 2016. [7] L. Wyse, "Audio spectrogram representations for processing with convolutional neural networks,"
arXiv preprint arXiv:1706.09559. [8] L. Nanni, Y. M. G. Costa, and S. Brahnam, "Set of texture descriptors for music genre classification," in , Plzen, Czech Republic, 2014. [9] Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon, "Music genre recognition using spectrograms," presented at the 18th International Conference on Systems, Signals and Image Processing, 2011. [10] R. M. Haralick, "Statistical and structural approaches to texture,"
Proceedings of the IEEE, vol. 67, no. 5, pp. 786-804, 1979. [11] C. N. Silla Jr., A. L. Koerich, and C. A. A. Kaestner, "The Latin Music Database," presented at the 9th International Conference on Music Information Retrieval, Philadelphia, USA, 2008. [12] Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, F. Gouyon, and J. Martins, "Music genre classification using LBP textural features,"
Signal Processing, vol. 92, pp. 2723-2737, 2012. [13] T. Ojala, M. Pietikainen, and T. Maeenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,"
IEEE Trans Pattern Analysis Mach Intell, vol. 24, no. 7, pp. 971-987, 2002. [14] E. Gomez et al. , "ISMIR 2004 audio description contest," Music Technology Group-Universitat Pompeu Fabra, Barcelona, Spain, 2006. [15] Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon, "Music genre recognition using gabor filters and LPQ texture descriptors," presented at the 18th Iberoamerican Congress on Pattern Recognition, 2013. [16] V. Ojansivu and J. Heikkila, "Blur insensitive texture classification using local phase quantization," presented at the ICISP, 2008. [17] E. Humphrey and J. P. Bello, "Rethinking automatic chord recognition with convolutional neural networks," presented at the International Conference on Machine Learning and Applications, 2012. [18] E. Humphrey, J. P. Bello, and Y. LeCun, "Moving beyond feature design: deep architectures and automatic feature learning in music informatics,"
International Conference on Music Information Retrieval, pp. 403-408, 2012. [19] T. Nakashika, C. Garcia, and T. Takiguchi, "Local-feature-map integration using convolutional neural networks for music genre classification,"
Interspeech, pp. 1752-1755, 2012. [20] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals,"
IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, 2002. [21] Y. M. G. Costa, L. E. S. Oliveira, and C. N. Silla Jr., "An evaluation of Convolutional Neural Networks for music classification using spectrograms,"
Applied Soft Computing, vol. 52, pp. 28-38, 2017. [22] S. Sigtia and S. Dixon, "Improved music feature learning with deep neural networks," presented at the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014. [23] C.-Y. Wang, A. Santoso, S. Mathulaprangsan, C.-C. Chiang, C.-H. Wu, and J.-C. Wang, "Recognition and retrieval of sound events using sparse coding convolutional neural network," presented at the IEEE International Conference on Multimedia & Expo (ICME), 2017. [24] S. Oramas, O. Nieto, F. Barbieri, and X. Serra, "Multi-label music genre classification from audio, text and images using deep features," presented at the International Society for Music Information Retrieval (ISMIR) Conference, 2017. [25] M. A. Acevedo, C. J. Corrada-Bravo, H. Corrada-Bravo, L. J. Villanueva-Rivera, and T. M. Aide, "Automated classification of bird and amphibian calls using machine learning: a comparison of methods,"
Ecological Informatics, vol. 4, pp. 206-214, 2009. [26] V. I. Cullinan, S. Matzner, and C. A. Duberstein, "Classification of birds and bats using flight tracks,"
Ecological Informatics, vol. 27, pp. 55-63, 2015. [27] K. M. Fristrup and W. A. Watkins, "Marine animal sound classification," WHOI Technical Reports, 1993. [Online]. Available: https://hdl.handle.net/1912/546 [28] Y. R. Pandeya, D. Kim, and J. Lee, "Domestic cat sound classification using learned features from deep neural nets,"
Applied Sciences, vol. 8, no. 10, p. 1949, 2018. [29] M. Malfante, J. I. Mars, M. D. Mura, and C. Gervaise, "Automatic fish sounds classification,"
The Journal of the Acoustical Society of America, vol. 143, no. 5, pp. 2834-2846, 2018, doi: 10.1121/1.5036628. [30] L. Nanni, G. Maguolo, and M. Paci, "Data augmentation approaches for improving animal audio classification,"
ArXiv, vol. abs/1912.07756, 2019. [31] Z. Cao, J. C. Príncipe, B. Ouyang, F. R. Dalgleish, and A. K. Vuorenkoski, "Marine animal classification using combined CNN and hand-designed image features,"
OCEANS 2015 - MTS/IEEE Washington, pp. 1-6, 2015. [32] L. Nanni, S. Brahnam, A. Lumini, and T. Barrier, "Ensemble of local phase quantization variants with ternary encoding," in
Local Binary Patterns: New Variants and Applications , S. Brahnam, L. C. Jain, A. Lumini, and L. Nanni Eds. Berlin: Springer-Verlag, 2014, pp. 177-188. [33] D. R. Edgington, D. Cline, D. Davis, I. Kerkez, and J. Mariette, "Detecting, Tracking and Classifying Animals in Underwater Video,"
OCEANS 2006, pp. 1-5, 2006. [34] J. Salamon, J. P. Bello, A. Farnsworth, and S. Kelling, "Fusing shallow and deep learning for bioacoustic bird species classification," presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. [35] H. B. Sailor, D. M. Agrawal, and H. A. Patil, "Unsupervised filterbank learning using convolutional restricted Boltzmann machine for environmental sound classification," in
INTERSPEECH, 2017. [36] X. Li, V. Chebiyyam, and K. Kirchhoff, "Multi-stream network with temporal attention for environmental sound classification,"
arXiv preprint arXiv:1901.08608. [37] D. M. Agrawal, H. B. Sailor, M. H. Soni, and H. A. Patil, "Novel TEO-based Gammatone features for environmental sound classification," presented at the 25th European Signal Processing Conference (EUSIPCO 2017), Kos Island, Greece. [38] A. Kumar, M. Khadkevich, and C. Fügen, "Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes," presented at the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP), Alberta, Canada, 2018. [39] J. Sharma, O.-C. Granmo, and M. G. Olsen, "Environment Sound Classification using Multiple Feature Channels and Deep Convolutional Neural Networks,"
ArXiv, vol. abs/1908.11219, 2019. [40] J. Chaki, "Pattern analysis based acoustic signal processing: a survey of the state-of-art,"
International Journal of Speech Technology,
[41] ArXiv, vol. abs/1801.00631, 2018. [42] M. Lasseck, "Audio-based Bird Species Identification with Deep Convolutional Neural Networks," in
CLEF, 2018. [43] E. Sprengel, M. Jaggi, Y. Kilcher, and T. Hofmann, "Audio Based Bird Species Identification using Deep Learning Techniques," in
CLEF, 2016. [44] S. Wei, K. Xu, D. Wang, F. Liao, H. Wang, and Q. Kong, "Sample Mixed-Based Data Augmentation for Domestic Audio Tagging,"
ArXiv, vol. abs/1808.03883, 2018. [45] T. Inoue, P. Vinayavekhin, S. Wang, D. P. Wood, N. M. Greco, and R. Tachibana, "Domestic activities classification based on CNN using shuffling and mixing data augmentation technical report," DCASE 2018 Challenge, 2018. [Online]. Available: https://pdfs.semanticscholar.org/90f8/75233e3efebe02feeb10cb551cc69f20ebc7.pdf [46] Y. Tokozume, Y. Ushiku, and T. Harada, "Learning from Between-class Examples for Deep Sound Recognition,"
ArXiv, vol. abs/1711.10282, 2018. [47] N. Jaitly and G. E. Hinton, "Vocal Tract Length Perturbation (VTLP) improves speech recognition," presented at the International Conference on Machine Learning (ICML Workshop), Atlanta, Georgia, 2013. [48] N. Takahashi, M. Gygli, B. Pfister, and L. V. Gool, "Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition," presented at the Interspeech 2016, San Francisco, 2016. [49] L. K. Hansen and P. Salamon, "Neural network ensembles,"
IEEE Trans Pattern Analysis Mach Intell, vol. 12, pp. 993–1001, 1990. [50] L. Nanni, S. Brahnam, and G. Maguolo, "Data augmentation for building an ensemble of convolutional neural networks," in
Smart Innovation, Systems and Technologies , vol. 145, Y.-W. Chen, A. Zimmermann, R. J. Howlett, and L. C. Jain Eds. Singapore: Springer Nature, 2019, pp. 61-70. [51] Z. Zhao et al. , "Automated bird acoustic event detection and robust species classification,"
Ecological Informatics, vol. 39, pp. 99-108, 2017. [52] Z. Prusa, P. L. Søndergaard, and P. Balázs, "The Large Time Frequency Analysis Toolbox: Wavelets," 2013. [Online]. Available: http://ltfat.github.io/doc/gabor/sgram.html [53] L. R. Rabiner and R. W. Schafer, "Theory and Applications of Digital Speech Processing," 2010. [54] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition,"
Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998, doi: 10.1109/5.726791. [55] R. F. Lyon and L. M. Dyer, "Experiments with a computational model of the cochlea,"
ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 11, pp. 1975-1978, 1986. [56] O. Russakovsky, J. Deng, and H. Su, "ImageNet large scale visual recognition challenge,"
International Journal of Computer Vision (IJCV),
[57] ArXiv, vol. abs/1610.02055, 2017. [58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger Eds. Red Hook, NY: Curran Associates, Inc., 2012, pp. 1097-1105. [59] C. Szegedy et al., "Going deeper with convolutions," presented at the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015. [60] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Cornell University, arXiv:1409.1556v6, 2014. [61] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," pp. 770-778, 2015. [62] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," presented at the IEEE Conference on Computer Vision and Pattern Recognition, 2016. [63] J. Salamon and J. P. Bello, "Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification,"
IEEE Signal Processing Letters, vol. 24, pp. 279-283, 2017. [64] J. Driedger and M. Müller, "TSM Toolbox: MATLAB Implementations of Time-Scale Modification Algorithms," in
DAFx, 2014. [65] J. Driedger, M. Müller, and S. Ewert, "Improving Time-Scale Modification of Music Signals Using Harmonic-Percussive Separation,"
IEEE Signal Processing Letters, vol. 21, pp. 105-109, 2014. [66] J. Laroche and M. Dolson, "Improved phase vocoder time-scale modification of audio,"
IEEE Trans. Speech and Audio Processing, vol. 7, pp. 323-332, 1999. [67] F. L. Bookstein, "Thin-plate splines and decomposition of deformation," 1989. [68] Y. R. Pandeya and J. Lee, "Domestic Cat Sound Classification Using Transfer Learning,"
Int. J. Fuzzy Logic and Intelligent Systems, vol. 18, pp. 154-160, 2018. [69] C. Yu, K. S. Barsim, Q. Kong, and B. Yang, "Multi-level Attention Model for Weakly Supervised Audio Classification,"
ArXiv, vol. abs/1803.02353, 2018. [70] L. Nanni, Y. Costa, A. Lumini, M. Y. Kim, and S. R. Baek, "Combining visual and acoustic features for music genre classification,"
Expert Systems with Applications, vol. 45, no. 45 C, pp. 108-117, 2016. [71] S.-h. Zhang, Z. Zhao, Z. Xu, K. Bellisario, and B. C. Pijanowski, "Automatic Bird Vocalization Identification Based on Fusion of Spectral Pattern and Texture Features," pp. 271-275, 2018. [72] L. Nanni, Y. M. G. Costa, A. C. N. Silla Jr., and S. Brahnam, "Ensemble of deep learning, visual and acoustic features for music genre classification,"
Journal of New Music Research, vol. 47, no. 4, pp. 383-397, 2018, doi: 10.1080/09298215.2018.1438476.