An Approach for Self-Training Audio Event Detectors Using Web Data
Benjamin Elizalde†§, Ankit Shah∗§, Siddharth Dalmia†§, Min Hun Lee†§, Rohan Badlani‡§, Anurag Kumar†§, Bhiksha Raj† and Ian Lane†
† Department of Electrical and Computer Engineering & Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA
∗ Department of Electronics and Communication, NITK Surathkal, India
‡ Department of Computer Science, BITS Pilani, India
Email: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
§ The first six authors contributed equally.

Abstract—Audio Event Detection (AED) aims to recognize sounds within audio and video recordings. AED employs machine learning algorithms commonly trained and tested on annotated datasets. However, available datasets are limited in number of samples and hence it is difficult to model acoustic diversity. Therefore, we propose combining labeled audio from a dataset and unlabeled audio from the web to improve the sound models. The audio event detectors are trained on the labeled audio and run on the unlabeled audio downloaded from YouTube. Whenever the detectors recognized any of the known sounds with high confidence, the unlabeled audio was used to re-train the detectors. The performance of the re-trained detectors is compared to that of the original detectors using the annotated test set. Results showed an improvement of the AED and uncovered challenges of using web audio from videos.
I. INTRODUCTION AND RELATED WORK
Sounds are essential to how humans perceive and interact with the world. Audio content is captured in recordings and shared on the web on a minute-by-minute basis. Academia and industry exploit this acoustic information in multiple applications. The dominant application is multimedia video content analysis, where audio is combined with images and text [1], [2] to index, search and retrieve videos. Another task is human-robot interaction [3], [4], where sounds complement speech as non-verbal communication. Recently, a growing application is in smart cities [5], where sounds are used to detect sources of noise pollution. All of these applications rely on Audio Event Detection (AED) to recognize the occurrence of sounds within audio and video recordings.

The related work on AED has mainly focused on using available datasets to train machine learning models in a supervised manner [6], [5], [7], [8], [9]. However, the largest dataset, ESC-50 [8], contains only 40 samples per class. These numbers strongly contrast with ImageNet, the computer vision counterpart, which has hundreds of samples per class. Hence, training a model which reflects the acoustic diversity of an audio event class is limited. The common solution is to have humans annotate more data. However, the process is costly and slow, and thus other solutions should be explored.
Another solution is to combine the small amount of labeled data with a large amount of unlabeled data. A particular method is semi-supervised self-training, which is an algorithm that iteratively re-trains a model. First, the model is trained using the labeled dataset. Then, at each iteration and under a certain criterion, a portion of the unlabeled set can be labeled as any of the known classes. Lastly, using the newly labeled data, the model is re-trained. This approach has been explored for audio events in two papers [10], [11]. Particularly in [10], the authors collected 17,000 labeled audio-only recordings from
FindSounds.com. Two thirds were used to train and test a classifier and the rest was treated as unlabeled audio for re-training. The result was an improvement of 1.4% precision over the baseline, suggesting a valid alternative to improve models. Moreover, the authors pointed out the challenges of utilizing audio-only web recordings.

In our paper, we followed a similar framework of semi-supervised self-training, but with the following differences:
• We employed UrbanSound8K as the labeled set for training and testing, but collected YouTube videos as the unlabeled set for re-training, creating mismatched conditions.
• For re-training, we used 30 times more audio files.
• The unlabeled audio is extracted from videos as opposed to audio-only recordings, which poses challenges during the collection process. For instance, it is not possible to guarantee that the YouTube audio will actually contain any of the sounds.

The paper is structured as follows. In Section II we describe the flow of our self-training approach for sound detectors. Within this section we describe the sound event dataset, the YouTube video collection process, and how we pre-processed the audio recordings. Then, we explain how we trained our two machine learning based detectors to compare performance. In Section III we compare the baseline performance and the self-training performance obtained with different techniques.

II. SEMI-SUPERVISED SELF-TRAINING OF AUDIO EVENT DETECTORS
Semi-supervised self-training is an algorithm that iteratively re-trains a model, and our particular framework is illustrated in Figure 1. First, using the labeled dataset, the ten class detectors are trained and tested to compute a baseline performance. Second, the unlabeled data is run through the detectors to obtain a class label with its corresponding confidence score. Third, we applied a threshold based on the confidence score to determine candidates for self-training the detectors. Fourth, the detectors are re-trained and again tested on the labeled data to compute the new performance and compare it with the previous one. Lastly, steps two, three and four are performed iteratively until the performance converges. A minimal sketch of this loop is shown below.
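The following is a minimal sketch of the loop for one binary detector, assuming a scikit-learn-style classifier with fit and predict_proba. The function name, the 0.9 threshold and the iteration cap are illustrative choices, not values prescribed by the framework; adding confident negatives alongside confident positives follows the discussion in Section IV.

```python
# Sketch of the iterative self-training loop (illustrative names and values).
import numpy as np
from sklearn.base import clone

def self_train(detector, X_lab, y_lab, X_unlab, threshold=0.9, max_iters=3):
    model = clone(detector).fit(X_lab, y_lab)         # step 1: baseline detector
    X_train, y_train = X_lab, y_lab
    for _ in range(max_iters):
        scores = model.predict_proba(X_unlab)[:, 1]   # step 2: score unlabeled audio
        pos = scores >= threshold                     # step 3: confident positives
        neg = scores <= 1.0 - threshold               #         confident negatives
        keep = pos | neg
        if not keep.any():
            break
        X_train = np.vstack([X_train, X_unlab[keep]]) # step 4: re-train with candidates
        y_train = np.concatenate([y_train, pos[keep].astype(int)])
        X_unlab = X_unlab[~keep]
        model = clone(detector).fit(X_train, y_train)
    return model
```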
Fig. 1. Flow of the semi-supervised self-training of Audio Event Detection. Most of the tuning that improved the performance of the re-trained detectors happened in the selection of candidates.
A. Datasets: Labeled and Unlabeled

Labeled Data (Training and Testing): UrbanSound8K
The UrbanSound8K (US8K) dataset [5] has 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. The content of the audio may have other overlapping sounds, and the target sound may occur in the background or in the foreground. The dataset has 8,732 audio segments of 3.5 seconds average duration. These files are distributed into 10 stratified cross-validation folds.
Unlabeled Data (Self-Training): YouTube videos
The unlabeled audio comes from YouTube videos; videos are the largest source of audio on the web. The website was chosen because it offers a wide diversity of class samples. The soundtracks of the videos were crawled and downloaded using the Pafy API (https://pypi.python.org/pypi/pafy). The audio roughly corresponds to the 10 classes from US8K. The acoustic content is unstructured, and commonly the target sound is occluded by multiple factors such as noise, overlap with other sounds, and channel effects. The web set has 200,000 segments of 3.5 seconds. We converted all of the audio files into raw 16-bit encoding, mono-channel, and 16 kHz sampling rate.

Challenges of Unlabeled Data Collection
Audio from videos poses collection challenges. YouTube contains years of videos, and in order to process and evaluate audio containing the target sound, the query should serve as a filter. Hence, the query formulation aims to filter in videos roughly matching the ten classes in US8K. Typing a query composed of a noun such as air conditioner will not necessarily fetch a video containing such a sound event. This happens because the associated tags and metadata are mainly inspired by the video's visual content, contrary to what happens in audio-only websites such as freesounds.org. Therefore, we modified the query to be a combination of keywords: "<audio event label> sound", for example, "air conditioner sound". Although the results empirically improved, another issue was that the audio event was not guaranteed to occur, and if present it most likely occurred with a short duration within the whole recording. Therefore, we restricted the video length to be longer than five seconds and shorter than ten minutes to reduce the amount of irrelevant audio. A sketch of this collection step follows.
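The sketch below illustrates this collection step with Pafy, which the paper names for downloading. The query construction and duration filter follow the description above; the list of candidate URLs is assumed to come from a separate YouTube search step not shown here, and the function name is ours.

```python
# Build "<label> sound" queries and download audio tracks of videos
# between 5 seconds and 10 minutes (illustrative sketch).
import pafy

labels = ["air conditioner", "car horn", "children playing", "dog bark",
          "drilling", "engine idling", "gun shot", "jackhammer",
          "siren", "street music"]
queries = [f"{label} sound" for label in labels]   # e.g. "air conditioner sound"

def download_audio(candidate_urls, out_dir="."):
    for url in candidate_urls:                     # URLs returned by a YouTube search
        video = pafy.new(url)
        if not (5 < video.length < 600):           # keep videos of 5 s to 10 min
            continue
        stream = video.getbestaudio()
        stream.download(filepath=f"{out_dir}/{video.videoid}.{stream.extension}")
        # downloaded files are then converted to 16-bit, mono, 16 kHz audio
        # (e.g., with ffmpeg) before feature extraction
```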
B. Data Preparation

Extracting Low-level Features: MFCCs

The Mel Frequency Cepstral Coefficients (MFCCs) have been widely used in audio event detection [12], [5], [8]. The parameters are standard: 10 ms shifts, a window of 25 ms, and 20 cepstral coefficients plus delta and double-delta (time dynamics) coefficients, for a total of 60 coefficients per time window or vector.
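A sketch of this 60-dimensional front end is shown below using librosa, which is our choice of library; the paper does not name a toolkit. The window, shift and coefficient counts match the values stated above.

```python
# 20 MFCCs with 25 ms windows and 10 ms shifts, plus deltas and double-deltas.
import librosa
import numpy as np

def extract_mfcc(path, sr=16000):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr))  # 10 ms shift
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T                # (frames, 60)
```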
Extracting Intermediate Features: BoAWs
An effective approach for characterizing audio events is the Bag-of-Audio-Words (BoAW) feature representation, which is usually built over low-level features such as MFCCs. The method we followed to compute BoAW features is broadly illustrated in Figure 2 and detailed in these papers [13], [14]. In the first step, we pool all the MFCCs of the training set together. In the second step, we learn an "audio vocabulary" by grouping the features into "audio words". In contrast to conventional approaches which use clustering to group words, our method fits a Gaussian Mixture Model (GMM) to the MFCCs using Expectation Maximization, where each mixture component represents a word. The third step is quantization, which uses the created vocabulary to turn the MFCC matrix of a given recording into a BoAW histogram-vector of the size of the vocabulary. The conventional quantization process computes the distance of each MFCC frame to all the audio words and adds a value of one only to the histogram bin that corresponds to the closest word. However, our approach uses soft-quantization, which sums probabilities for all the words.

Fig. 2. Process of computing our BoAW features. The idea behind the GMM-based representation is to capture the distribution of MFCC vectors of a recording over the GMM components.
Formally, let the MFCC vectors of a recording be represented as $\vec{x}_t$, where $\vec{x}_t$ is the $t$-th $D$-dimensional MFCC frame, $t = 1$ to $T$. A GMM $G = \{ w_k, \mathcal{N}(\vec{\mu}_k, \Sigma_k), k = 1 \text{ to } M \}$ is learned over the MFCCs of the training data, where $w_k$, $\vec{\mu}_k$ and $\Sigma_k$ are the mixture weight, mean and covariance parameters, respectively, of the $k$-th Gaussian in $G$. We train the GMM with diagonal covariance matrices. To obtain the bag-of-audio-words feature representation for any given recording, we first compute the probabilistic assignment of each MFCC frame of that recording to the $k$-th Gaussian as in Equation 1. This soft assignment is then summed and normalized over all MFCC frames for the $k$-th Gaussian as in Equation 2.

$$Pr(k \mid \vec{x}_t) = \frac{w_k \, \mathcal{N}(\vec{x}_t; \vec{\mu}_k, \Sigma_k)}{\sum_{j=1}^{M} w_j \, \mathcal{N}(\vec{x}_t; \vec{\mu}_j, \Sigma_j)} \quad (1)$$

$$P(k) = \frac{1}{T} \sum_{t=1}^{T} Pr(k \mid \vec{x}_t) \quad (2)$$

The final soft-count histogram feature representation is $\vec{\alpha}_M = [P(1), \ldots, P(k), \ldots, P(M)]^T$. The $\vec{\alpha}_M$ features are an $M$-dimensional ($M = 128$) representation of any given recording. During testing, the BoAW features are computed in the same manner, but using the vocabulary created from the training data.
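The soft quantization of Equations 1 and 2 can be sketched with scikit-learn, which is our choice of library; a diagonal-covariance GMM posterior per frame, averaged over frames, gives the soft-count histogram.

```python
# GMM-based bag-of-audio-words features (Eqs. 1-2), M = 128 components.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_vocabulary(training_mfccs, M=128):
    """training_mfccs: list of (frames, 60) MFCC matrices from the training set."""
    gmm = GaussianMixture(n_components=M, covariance_type="diag", random_state=0)
    gmm.fit(np.vstack(training_mfccs))            # pool all training frames
    return gmm

def boaw(gmm, mfcc):
    """Soft-count histogram: average posterior Pr(k | x_t) over all frames."""
    return gmm.predict_proba(mfcc).mean(axis=0)   # (M,) vector, Eq. (2)
```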
1) Training Detectors: Positive and Negative Classes:
We chose detectors, i.e., binary classifiers, because they are able to recognize the presence or absence of a particular audio event in a recording. The binary setup also aims to simulate the imbalanced ratio of a small amount of target sound vs. a large amount of non-target sounds, which is common in web retrieval tasks. The audio samples belonging to the target class are referred to as positives, and those samples not belonging to the target are referred to as negatives. Each of the ten detectors is trained with both a positive and a negative class. The positive class contains the target class samples and the negative class contains samples from the rest of the classes. For instance, the detector for jackhammer has all the samples corresponding to jackhammer as positives and all the samples corresponding to the other 9 sounds as negatives.
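A one-vs-rest label construction for a single detector can be sketched as follows; the helper name and variables are ours, not from the paper.

```python
# Target-class samples become positives (1), samples of the other nine classes negatives (0).
import numpy as np

def binary_labels(class_names, target="jackhammer"):
    return np.array([1 if name == target else 0 for name in class_names])

# y_jackhammer = binary_labels(us8k_labels, target="jackhammer")
```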
2) Training Detectors: SVM:
One round of experiments was performed with Support Vector Machines (SVMs), because SVMs have been widely explored for sound events [6]. The ten SVM-based detectors used linear decision boundaries to fit the data and were trained with the intermediate features. To relax the constraints defining the margin of the decision boundary, the parameter "C" was tuned and set to 0.01. Then, the trained detectors were evaluated using the test set. Although conventional SVMs could employ other techniques to allow non-linear decision boundaries, a problem arises when new audio segments are added for re-training: at each iteration, the SVM has to be re-trained from scratch using all the training data. The consequence is a bottleneck, which worsens as more segments and iterations are added.
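A single SVM detector with the reported linear boundary and C=0.01 might look like the sketch below; scikit-learn is our choice of library, and probability=True is used here only to expose confidence scores for candidate selection.

```python
# One linear SVM detector (C = 0.01) over BoAW features.
from sklearn.svm import SVC

svm_detector = SVC(kernel="linear", C=0.01, probability=True)
# svm_detector.fit(X_train_boaw, y_train_binary)
# scores = svm_detector.predict_proba(X_unlabeled_boaw)[:, 1]
```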
3) Training Detectors: NN:
Considering the previous issue, the second round of experiments was performed with Neural Networks (NNs). NNs are more suitable for the iterative nature of self-training. For example, in order to add a new audio segment for training, the NN does not need to be re-trained from scratch; a quick update of the weights suffices. The NN-based detectors are also binary classifiers. More precisely, for the NN we utilized a Multi-Layer Perceptron (MLP), with tuned hyper-parameters such as number of layers, neurons, activation function, regularization and loss function. The final architecture consisted of an input of size 128 (the BoAW feature dimensionality), one hidden layer of 100 neurons, and two output units (class or not class). The activation function was "tanh", the regularization method was dropout (p=0.5), the loss function was cross-entropy and the number of epochs was 10. Then, the trained detectors were evaluated using the labeled test set.
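A sketch of this MLP with the reported hyper-parameters is shown below. Keras is our choice of framework and Adam our choice of optimizer; the paper specifies neither.

```python
# MLP detector: 128-dim input, 100 tanh units, dropout 0.5, 2 softmax outputs.
from tensorflow import keras

def build_mlp(input_dim=128):
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(100, activation="tanh"),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(2, activation="softmax"),   # class / not class
    ])
    model.compile(optimizer="adam",                    # optimizer not stated in the paper
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_mlp(); model.fit(X_train, y_train, epochs=10)
```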
III. EXPERIMENTS AND EVALUATION OF METHODS
A. Computing the Baseline Performance and Running Detectors on Unlabeled Data
The initial performance computed by our two machine learning algorithms defined the baseline to improve upon after self-training. The detectors used the labeled data from US8K, which comes divided into stratified folds. We used nine folds as training data and tested on the left-out fold. This is done in ten different ways, one per left-out fold, resulting in one run per fold for each of the 10 event classes. We evaluated our detectors using average precision, as we wanted to detect reliable positive or negative samples. For every class, the average precision (AP) over each fold is computed, as well as the mean AP across all folds, referred to as Mean AP. Afterwards, the detectors were run on the unlabeled dataset to obtain confidence scores and labels for each of the 200,000 segments. Note that the unlabeled data was carefully handled to be consistent with the 10-fold cross-validation setup. For example, the detectors trained using the first 9 folds may not yield the same performance as the detectors trained with any other fold combination.
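The evaluation can be sketched with scikit-learn's average precision; the helper name and data layout (per-fold lists of scores and binary labels) are ours.

```python
# Per-fold average precision (AP) for one detector and Mean AP across folds.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_ap(fold_labels, fold_scores):
    """fold_labels/fold_scores: lists of per-fold binary labels and detector scores."""
    aps = [average_precision_score(y, s) for y, s in zip(fold_labels, fold_scores)]
    return np.mean(aps), aps
```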
B. Selecting Candidates and Self-Training Detectors

We employed a high confidence threshold to select audio segments as candidates for self-training. The candidates were used for self-training the detectors in combination with the supervised audio segments. Once the detectors were re-trained, they were run on the supervised test set and their performance was computed. The Mean AP value was compared with the baseline and the whole process was repeated iteratively until the Mean AP converged.

A key step in the self-training process is the selection of candidates. We tried three main approaches: the detector's output scores, precision, and clarity index.
Score-based
Under this approach, the output of the detector is a probability score that can be interpreted as a confidence value, and it has been used for self-training in [10]. A score threshold of greater than or equal to 0.95 was selected to filter in any segment, where 0 means the lowest confidence and 1 means the strongest confidence.
Precision
High precision means that the detector returned more relevant results than irrelevant ones. A precision threshold of greater than or equal to 0.95 was set. The value range is the same as in the score-based approach.
Clarity Index
Clarity Index (CI), based on [15], aims to determine those segments that are the most confusing for the detector. CI is based on two losses called relevance loss and irrelevance loss. To understand these losses, let us assume that the training data is $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and the detector mapping function is denoted by $f$. Let $x_u$ be an unlabeled data point. The Relevance Loss (RL) and the Irrelevance Loss (IL) are defined as

$$RL(x_u, f) = \frac{1}{|\mathcal{D}|} \sum_{x_i \in \mathcal{D}} I(f(x_i) - f(x_u)) \quad (3)$$

$$IL(x_u, f) = \frac{1}{|\mathcal{D}|} \sum_{x_i \in \mathcal{D}} I(f(x_u) - f(x_i)) \quad (4)$$

The relevance loss is expected to be low if $x_u$ is relevant (positive), and the irrelevance loss is expected to be low if $x_u$ is irrelevant (negative). The difference of the two losses, $CI = IL - RL$, is expected to be high (close to 1) for positive instances and low (close to -1) for negative instances. Overall, the CI helps us rank unlabeled segments to choose better segments for self-training. A higher CI implies that $x_u$ is more likely to be positive. An unlabeled point with a very high CI would have outscored a large number of training points and hence is expected to be positive. Similarly, a lower CI implies the instance is most likely negative.
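A small sketch of the clarity index follows, assuming $I(\cdot)$ in Equations 3 and 4 is the indicator of a positive argument; the function name is ours.

```python
# Clarity index (Eqs. 3-4): fraction of training points the unlabeled point
# outscores, minus the fraction of training points that outscore it.
import numpy as np

def clarity_index(f_train, f_unlab):
    """f_train: detector scores f(x_i) on training data; f_unlab: score f(x_u)."""
    relevance_loss = np.mean(f_train - f_unlab > 0)    # Eq. (3)
    irrelevance_loss = np.mean(f_unlab - f_train > 0)  # Eq. (4)
    return irrelevance_loss - relevance_loss           # CI in [-1, 1]
```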
IV. RESULTS AND DISCUSSION

A. Baseline
The NN outperformed the SVM by an absolute 8.5% in the baseline performance. The Mean AP score was 57.8% for the SVM and 66.3% for the NN, as shown in Table I. One reason for the better performance of the NN is that it employed nonlinear decision boundaries to fit the data, unlike the SVM, which used linear boundaries. As mentioned before, the SVM can also support nonlinear decision boundaries by using kernels, but the computation time was an issue for processing 200,000 segments for the 10 fold combinations and the re-training.
B. Self-Training
The main results of this paper are the improvements gained by self-training, shown in Table I and labeled as SVM Best and NN Best. The overall Mean AP improvement was 1.2% for both classifiers. Except for the SVM's dog bark, all the audio events improved their performance. Particularly, air conditioner and jackhammer benefited the most, with about 3%. More importantly, the performance did not degrade, which is expected to happen when audio not belonging to the target class is added by re-training. The SVM Best and NN Best results correspond to different threshold types, Clarity Index (CI) and Precision respectively. The performance of the three threshold types was similar (0.5%-1.4%) and we cannot say that one should be preferred.

For the three threshold types, tuning affected the overall selection of candidates and the detection performance differently. The number of candidates varied between classes from 0 to 2,000 on each iteration. In general, stricter values (greater than 0.9) reduced the number of candidates to two digits. Regarding the detection performance, stricter values (greater than 0.95) and loose values (approximately 0.5) degraded Mean AP, but values close to 0.9 yielded the reported gain. The threshold also defined the number of iterations the algorithm took to converge or stop improving. For our value of 0.9, our algorithm iterated three times. Afterwards, the Mean AP performance converged and then slowly decreased. In general, most of our experiments degraded in performance after several iterations. Two possible explanations are the mismatched conditions and the lack of useful files for self-training.

Mismatched conditions are unavoidable if web audio is intended to be exploited through semi-supervised approaches. There is no control over the recording methods for unlabeled web audio, and thus it will most likely be different from the controlled methods of labeled datasets. In our case, the dataset US8K has different collection methods and acoustic characteristics, which do not match the user-generated YouTube audio. In our experiments, the detectors were self-trained using only the newly labeled "positive" segments from YouTube. After each iteration, more and more YouTube data was added to the detectors in the positive category but not in the negative. However, the improvement was limited and often degraded. After inspecting some of the rejected files, it seemed that the detectors were discriminating YouTube vs. non-YouTube audio, rather than positive vs. negative. On the contrary, when both "positive and negative" segments were added, the performance improved.

A manual inspection of some of the candidates helped us better understand the audio content used to re-train the detectors. Thumbnail examples are in Figure 3, illustrating interesting cases. For instance, some videos may have the presence of the sound even though the image does not correspond. The first thumbnail-video had the siren sound, but the image in the video was just a radio-like box. Another example was when sounds were acoustically similar but semantically different. The third thumbnail-video showed a scene from the movie "Captain America", where the audio was similar to "air conditioner", but there was no such item. These examples do not necessarily degrade the quality of the detector, as shown in [16]. In a similar manner, gun shot had, among some of the candidates, object banging sounds.

Category           SVM Baseline   SVM Best   NN Baseline   NN Best
air conditioner        39.3          45.1        49.9         53.2
car horn               52.4          53.0        51.6         52.8
children playing       53.8          54.3        65.1         65.2
dog bark               76.2          75.9        81.7         82.0
drilling               56.6          57.2        63.4         63.0
engine idling          53.8          54.1        68.0         69.8
gun shot               67.8          69.1        80.4         81.9
jackhammer             60.2          62.3        63.7         66.2
siren                  72.2          72.8        80.2         80.4
street music           46.0          46.4        58.5         59.0
Mean AP                57.8          59.0        66.3         67.5

TABLE I. The table shows the class precision and the fold average precision (Mean AP). The Mean AP of the SVM and NN baselines was improved through self-training. SVM Best corresponds to the Clarity Index threshold and NN Best corresponds to the Precision threshold.

Fig. 3. Manual inspection of selected candidates from Siren, Dog bark, Air Conditioner and Drilling.

V. LIMITATIONS AND FUTURE WORK
Classifier bias
Semi-supervised approaches have limitations regarding the extent to which they can help, as discussed in [17]. In particular, self-training has an inherent detector bias issue, which arises because the detector is trained with an initial set of data. The detector is then run on the unlabeled data, and the confidence score depends on the initial model. Once we add new segments, we are reinforcing the acoustic characteristics of the previous model and not necessarily making our models more robust. Addressing the issue was out of scope, but could be a reason for the fast convergence in our results.
Threshold type
The set of thresholds utilized is a reasonable approach supported by the literature. However, a more elaborate objective function should be considered to better select candidates.
VI. CONCLUSIONS

In this work we proposed a framework for semi-supervised self-training of audio event detectors, where the detectors were trained with the annotated US8K dataset and the self-training employed unlabeled audio from YouTube videos. The NN detectors yielded a higher baseline performance than the SVM detectors. Both detectors and almost all the classes benefited from self-training. Despite the audio mismatch conditions and the possibility of having few or no target sounds among the candidates, the performance after self-training did not degrade. Further exploration of how to select candidates offers a valuable opportunity. Unlabeled audio from videos can help audio event detection.
REFERENCES

[1] P. Schäuble, Multimedia Information Retrieval: Content-Based Information Retrieval from Large Text and Audio Databases. Springer Science & Business Media, 2012, vol. 397.
[2] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 2, no. 1, pp. 1–19, 2006.
[3] J. Maxime, X. Alameda-Pineda, L. Girin, and R. Horaud, "Sound representation and classification benchmark for domestic robots," in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 6285–6292.
[4] M. Janvier, X. Alameda-Pineda, L. Girin, and R. Horaud, "Sound-event recognition with a companion humanoid," in IEEE-RAS International Conference on Humanoid Robots (Humanoids). IEEE, 2012, pp. 104–111.
[5] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in ACM International Conference on Multimedia, Orlando, FL, USA, Nov. 2014.
[6] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
[7] M. Ravanelli, B. Elizalde, K. Ni, and G. Friedland, "Audio concept classification with hierarchical deep neural networks," in Proceedings of EUSIPCO, 2014.
[8] K. J. Piczak, "Environmental sound classification with convolutional neural networks," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2015, pp. 1–6.
[9] T. Virtanen, A. Mesaros, T. Heittola, M. Plumbley, P. Foster, E. Benetos, and M. Lagrange, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016). Tampere University of Technology, Department of Signal Processing, 2016.
[10] W. Han, E. Coutinho, H. Ruan, H. Li, B. Schuller, X. Yu, and X. Zhu, "Semi-supervised active learning for sound classification in hybrid learning environments," PLoS ONE, vol. 11, no. 9, p. e0162075, 2016.
[11] Z. Zhang and B. Schuller, "Semi-supervised learning helps in sound event classification," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 333–336.
[12] F. Metze, S. Rawat, and Y. Wang, "Improved audio features for large-scale multimedia event detection," in Multimedia and Expo (ICME), 2014 IEEE International Conference on. IEEE, 2014, pp. 1–6.
[13] F.-F. Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in IEEE CVPR, vol. 2, 2005.
[14] B. Elizalde, A. Kumar, A. Shah, R. Badlani, E. Vincent, B. Raj, and I. Lane, "Experiments on the DCASE challenge 2016: Acoustic scene classification and sound event detection in real life recording," in DCASE2016 Workshop on Detection and Classification of Acoustic Scenes and Events, 2016.
[15] T. S. Huang, C. K. Dagli, S. Rajaram, E. Y. Chang, M. I. Mandel, G. E. Poliner, and D. P. Ellis, "Active learning for interactive multimedia retrieval," Proceedings of the IEEE, vol. 96, no. 4, pp. 648–667, 2008.
[16] B. Elizalde, M. Ravanelli, and G. Friedland, "Audio concept ranking for video event detection on user-generated content," in Workshop on Speech, Language and Audio in Multimedia (SLAM 2013), 2013.
[17] A. Singh, R. Nowak, and X. Zhu, "Unlabeled data: Now it helps, now it doesn't," in Advances in Neural Information Processing Systems (NIPS), 2008.