Multi-View Networks For Multi-Channel Audio Classification
Jonah Casebeer*, Zhepei Wang*
University of Illinois at Urbana-Champaign
Department of Computer Science
jonahmc2, [email protected]
Paris Smaragdis
University of Illinois at Urbana-Champaign
Department of Computer Science
Adobe Research

*These two authors contributed equally. Supported by NSF grant
ABSTRACT
In this paper we introduce the idea of multi-view networks for sound classification with multiple sensors. We show how one can build a multi-channel sound recognition model trained on a fixed number of channels, and deploy it to scenarios with an arbitrary (and potentially dynamically changing) number of input channels without observing degradation in performance. We demonstrate that at inference time one can safely provide this model all available channels, as it can ignore noisy information and leverage new information better than standard baseline approaches. The model is evaluated in both an anechoic environment and in rooms generated by a room acoustics simulator. We demonstrate that this model can generalize to unseen numbers of channels as well as unseen room geometries.
Index Terms — Sound recognition, IoT sensing, neural networks
1. INTRODUCTION
Sound classification and detection is becoming an increasingly relevant problem, and one in which we are seeing a lot of activity and progress in the last few years. In this paper we focus on the case where we have a lot of acoustic sensors, but we are not guaranteed that they all record a clean enough signal for the task at hand, nor that they are all recording at all times. For instance, consider the case of a few people in an office setting. Each person will likely have a cell phone with a couple of microphones, a laptop with a few more, and maybe some mic-enabled wearable devices; there might be a room microphone tethered to a conferencing system, or audio-enabled desktop computers in the room, perhaps a few hearing aids, and maybe some security microphones as well. In this situation we might want to perform audio sensing tasks, e.g. diarization, and although we have access to a large number of recordings in the room, not all will be of use. For instance, some cell phones might be in purses providing a non-informative signal, whereas others might be right next to the type of sound we want to detect. Our goal is to explore algorithms that will not be misled by channels with low informational content, and that are not tethered to a fixed number of channels. In order to do so we introduce the concept of the multi-view network (MVN) for the purpose of classification. The framework that we present here considers a simple classifier, but is easily amenable to more elaborate extensions in order to facilitate state-of-the-art classification and detection models, as long as they fall under the umbrella of a deep learning model.

Deep learning models for monaural or binaural audio classification have been explored in many settings. Several deep architectures
have shown to be powerful tools for the tagging task [1, 2, 3, 4]. These techniques have been expanded to work in multi-channel scenarios as well.

Multi-channel audio classification and detection models mirror their single-channel counterparts. In the DCASE 2018 Task 5 Challenge [5], the top performers used beamforming and source separation front ends with Convolutional Neural Network (CNN) ensemble back ends, in combination with various data augmentation techniques [6, 7]. Recurrent Neural Networks (RNNs) [8] operating on spectral features have been explored by [9, 10] and others. Recently, [11, 12] used RNNs in an end-to-end fashion to predict the filters of a beamformer whose output is fed to another deep model.

In all of this work, however, when a deep audio classification model is trained to perform classification on, e.g., four channels, it typically cannot guarantee the ability to leverage information when more channels are provided, or to function when some are missing. In contrast to neural-based methods, classic beamforming approaches can scale to an arbitrary number of input channels. However, with a fixed number of channels available, learning methods are typically superior to beamforming. Here we seek to remedy this by using network architectures that accept inputs of variable sizes, allowing us to train on a fixed number of channels and deploy on any other number of channels at inference time. Our work extends the multi-view networks for denoising of [13]. That denoising model attempted to predict clean spectra from noisy recordings; the classification model presented here extends that idea to performing classification. Our results show that MVNs are fit for classification and that they can handle channel disturbances in a dynamic environment, with simulated room impulse responses several times longer than our Short-Time Fourier Transform (STFT) frames.
2. MULTI-VIEW NETWORKS
RNNs are a common starting point for single-channel audio classification. They typically operate on a chosen short-time spectral representation and unroll across time to leverage the temporal dependencies between successive spectral frames. Due to an RNN's ability to process inputs of arbitrary length, training and testing sequences for these models are not required to have a fixed length. This ability allows models to be trained on short sequences and operated on longer test sequences. In this work we study audio classification in a multi-channel scenario where the number of channels available for training might differ from the number of channels available at test time.

Multi-view networks [13] use the ability of RNNs to generalize to sequences of any length by unrolling across both time and channels. A standard RNN performs the following operation using a non-linearity \sigma:

h_t = \sigma(W_h x_t + U_h h_{t-1})
y_t = \sigma(W_x h_t)    (1)

MVNs learn a set of matrices W_h, W_x, U_h to leverage temporal information and information across channels. This is achieved with the recurrence below, where x_{k,t} is channel k's t-th spectral frame:

h_{k,t} = \sigma(W_h x_{k,t} + U_h h_{K,t-1})   if k = 1
h_{k,t} = \sigma(W_h x_{k,t} + U_h h_{k-1,t})   otherwise
y_t = \sigma(W_x h_{K,t})    (2)

This recurrence allows the model to aggregate information from all K channels at each time step before making a prediction. Because this operation is fundamentally a one-dimensional RNN, it can generalize to numbers of channels not seen before. Figure 1 demonstrates the unrolling across channels and time.

Fig. 1. MVN unrolling across channels and time. Note that the model observes all shared time steps of each channel before making a prediction. Then, the last channel's hidden state feeds into the first channel of the next time step, allowing for the propagation of temporal information.
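To make the recurrence concrete, the following is a minimal sketch of Eq. (2) using a PyTorch RNN cell. The class and variable names are ours, tanh stands in for the generic non-linearity \sigma, and the softmax output head follows the classification pipeline of Fig. 2 rather than the \sigma(W_x h_{K,t}) of Eq. (2).

import torch
import torch.nn as nn

class MultiViewRNN(nn.Module):
    # A sketch of the MVN recurrence in Eq. (2): one shared RNN cell
    # unrolled over the channels within a frame, then across frames.
    def __init__(self, n_freq, n_hidden, n_classes=2):
        super().__init__()
        self.cell = nn.RNNCell(n_freq, n_hidden)   # learns W_h and U_h
        self.out = nn.Linear(n_hidden, n_classes)  # output head

    def forward(self, x):
        # x: (K, T, F) magnitude spectra, K channels by T frames by F bins.
        K, T, _ = x.shape
        h = x.new_zeros(1, self.cell.hidden_size)
        outputs = []
        for t in range(T):
            for k in range(K):
                # Visit every channel of frame t; the last channel's state
                # is carried into channel 1 of frame t + 1 (see Fig. 1).
                h = self.cell(x[k, t].unsqueeze(0), h)
            outputs.append(self.out(h))
        # Per-frame class probabilities, shape (T, n_classes).
        return torch.cat(outputs).softmax(dim=-1)

Because the channel unrolling is an ordinary one-dimensional recurrence with shared weights, the same module accepts any number of channels K at inference time, e.g. MultiViewRNN(n_freq=513, n_hidden=512) applied to a (4, T, 513) or a (30, T, 513) input.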
3. EXPERIMENTS
Fig. 2. Audio event classification pipeline with an MVN. For a multi-channel mixture, we first take the magnitude spectra of each channel with the STFT. The network takes the magnitude spectra as input and predicts whether each frame contains speech.

In the following sections we introduce the setup of the audio classification experiments, including the data set, the baseline models, and two experiments evaluating the performance of the models. The models and the data correspond to a binary classification task in which each input is classified as either speech or not. Figure 2 illustrates the general pipeline for the experiments. These experiments model various scenarios with many microphones where most of the recorded signals are extremely noisy or contain little information. This reflects the potential application of this model in IoT sensing problems.
3.1. Data Set

We prepare the data set for the binary audio classification experiments by mixing speech and noise segments. The speech segments are selected from the TIMIT Corpus, which includes 6,300 recordings from 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences [14]. The noise segments are selected from 13 different background noise recordings, such as "Airport", "Babble" and "Restaurant" noises [15].

Our data set is made up of two-second noisy mixtures at a sampling rate of 16 kHz. To create a mixture, we randomly select a two-second segment from one of the noise recordings and a segment between zero and two seconds in length from one of the TIMIT speech recordings. Next, we normalize each of the noise and speech segments to unit variance. We then mix them by adding the speech segment at a random position within the two-second noise segment to obtain a single-channel mixture. The resulting mixtures are approximately half speech and half background noise.

Based on this single-channel mixture, we propose different ways to generate multi-channel mixtures, as described in Section 3.1.1. We apply the STFT with a 1024 pt window and a 512 pt hop on each channel of the mixture, and take the absolute value of the coefficients as the input to the models.
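As a concrete illustration of this procedure, here is a minimal sketch assuming in-memory NumPy arrays longer than two seconds; the function names and the per-sample labels are our own choices, not the authors' code.

import numpy as np
from scipy.signal import stft

FS = 16000           # sampling rate (Hz)
MIX_LEN = 2 * FS     # two-second mixtures

def make_mixture(speech, noise, rng):
    # Random two-second crop of the noise recording, unit variance.
    n0 = rng.integers(0, len(noise) - MIX_LEN + 1)
    noise_seg = noise[n0:n0 + MIX_LEN] / noise[n0:n0 + MIX_LEN].std()
    # Random speech crop, between zero and two seconds long, unit variance.
    dur = int(rng.integers(1, MIX_LEN + 1))
    s0 = rng.integers(0, len(speech) - dur + 1)
    speech_seg = speech[s0:s0 + dur] / speech[s0:s0 + dur].std()
    # Add the speech at a random position inside the noise segment.
    mix = noise_seg.copy()
    off = rng.integers(0, MIX_LEN - dur + 1)
    mix[off:off + dur] += speech_seg
    # Per-sample speech-presence labels (pooled to frames downstream).
    labels = np.zeros(MIX_LEN)
    labels[off:off + dur] = 1.0
    return mix, labels

def features(mix):
    # 1024-point window with a 512-point hop, as in the paper;
    # the magnitude spectra are the model input.
    _, _, Z = stft(mix, fs=FS, nperseg=1024, noverlap=512)
    return np.abs(Z)   # shape: (513, n_frames)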
3.1.1. SNR Scenarios

In this experiment, we set the per-channel SNR of a noisy mixture by scaling it to the desired SNR in decibels (dB). For the training and validation sets, all mixtures contain four channels whose SNR values are evenly spaced between −5 dB and 5 dB.

For the test set, the number of channels ranges between 2 and 30, emulating scenes with different numbers of available sensors in some ad-hoc network. We propose two scenarios for generating multi-channel mixtures for testing. In the first scenario, each additional channel has a lower SNR value than the prior channels. Specifically, for a mixture of K ∈ {2, 3, . . . , 30} channels, the SNR values for the channels are 0 dB, −1 dB, . . . , −(K − 1) dB. The more channels, the lower the average SNR value. We call this scenario "decreasing SNR". We also propose an "increasing SNR" scenario in which a mixture of K ∈ {2, 3, . . . , 30} channels contains SNR values of −29 dB, −28 dB, . . . , (−29 + K − 1) dB. The more channels, the higher the average SNR value. In both scenarios, the channel indices are randomly shuffled.
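A sketch of these two schedules follows; the scaling convention (holding the speech fixed and scaling the noise so the speech-to-noise power ratio hits the target) is our assumption, as is all naming.

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    g = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + g * noise

def channel_snrs(K, scenario, rng):
    # Per-channel SNR schedule for a K-channel test mixture.
    if scenario == "decreasing":
        snrs = [0 - k for k in range(K)]     # 0, -1, ..., -(K - 1) dB
    else:
        snrs = [-29 + k for k in range(K)]   # -29, -28, ..., (-29 + K - 1) dB
    rng.shuffle(snrs)                        # channel indices are shuffled
    return snrs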
By providing the model progressively worse channels we test its ability to ignore noisy information. By providing progressively better channels we test its ability to leverage new information. These setups expose the model to a diverse set of SNRs, mimicking sensor networks in a chaotic environment.

3.2. Baseline Models

We now introduce three different binary classification strategies as baselines for the experiments. Each model takes the mixture's magnitude STFT spectra as input and passes them into a GRU with 512 hidden units unrolling across time, followed by a softmax layer that classifies each input frame as either noise or signal.
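All three baselines share this same per-channel classifier, which might look as follows in PyTorch; this is a sketch with our own naming, with the hyperparameters taken from the text.

import torch.nn as nn

class FrameClassifier(nn.Module):
    # Shared baseline classifier: a GRU with 512 hidden units unrolled
    # across time, followed by a softmax layer over {noise, signal}.
    def __init__(self, n_freq, n_hidden=512, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(n_freq, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):
        # x: (batch, T, F); each channel can be treated as a batch item.
        h, _ = self.gru(x)
        return self.out(h).softmax(dim=-1)   # (batch, T, 2)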
3.2.1. Averaging Input

This model averages the magnitude spectra across channels for each mixture, and then feeds the averaged spectra into the network. The output of the softmax layer is the estimated probability distribution of each frame being noise or signal:

h_\Theta(X_t) = \arg\max_{c \in \{0,1\}} P(y_t = c \mid \bar{x}_t; \Theta)    (3)

\bar{x}_t = \frac{1}{K} \sum_{k=1}^{K} x_{k,t}    (4)

where X_t is the set of magnitude spectra at time t for a given mixture with K channels, x_{k,t} is the spectrum at time t for the k-th channel, c \in \{0,1\} is the binary label for each frame, and \Theta is the set of model parameters. We refer to this model as the Averaging Input model.

3.2.2. Averaging Output

This model takes the STFT coefficients for each channel of the mixture and feeds them into the network. For a mixture with K channels, we therefore have K different output probability distributions from the softmax layer. We obtain the prediction for each frame by averaging the K softmax probability distributions:

h_\Theta(X_t) = \arg\max_{c \in \{0,1\}} \frac{1}{K} \sum_{k=1}^{K} P(y_t = c \mid x_{k,t}; \Theta)    (5)

We refer to this model as the Averaging Output model.
3.2.3. Max Output

Similar to the Averaging Output model, this model takes the STFT coefficients for each of the K channels of the mixture and produces K output distributions for each frame. Instead of averaging the output probabilities, we use the prediction with the highest confidence:

h_\Theta(X_t) = \arg\max_{c \in \{0,1\}} \max_{k \in \{1,\dots,K\}} P(y_t = c \mid x_{k,t}; \Theta)    (6)

We refer to this model as the Max Output model.

3.3. Training

For all experiments, we train with 250 batches of 40 K-channel mixtures per epoch (10k mixtures total). We use Adam [16] with an initial learning rate of − to minimize the cross-entropy loss; we train each model for 100 epochs and use the model saved at the epoch with the lowest validation loss. The learning rate is decreased by a factor of 0.25 every 20 epochs.
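The three strategies differ only in where the channels are aggregated. A sketch of Eqs. (3)-(6), assuming the FrameClassifier above as `model` and a (K, T, F) tensor X of magnitude spectra:

import torch

def averaging_input(model, X):
    # Eqs. (3)-(4): average the magnitude spectra across channels first.
    probs = model(X.mean(dim=0, keepdim=True))   # (1, T, 2)
    return probs[0].argmax(dim=-1)               # (T,) frame labels

def averaging_output(model, X):
    # Eq. (5): classify each channel, then average the K softmax outputs.
    return model(X).mean(dim=0).argmax(dim=-1)

def max_output(model, X):
    # Eq. (6): take the most confident prediction over the K channels.
    return model(X).max(dim=0).values.argmax(dim=-1)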
3.4. Results

We now report the performance of the MVN and the baseline models in two experiments.

3.4.1. Simple Mixtures

In the first experiment, we create multi-channel mixtures by mixing speech and noise data directly according to Section 3.1.1. Besides shuffling the channel indices, we also shuffle the SNRs of the channels at each time step in the time-frequency domain. With such a setup, we aim to model a dynamic environment in which some sensors move around the signal of interest or the signal received by the sensors is disturbed intermittently.
Fig. 3. Test accuracy of the MVN and the baseline models on audio classification using simple mixtures. The top and bottom plots correspond to the decreasing and increasing SNR scenarios. The x-axis denotes the number of mixture channels. The y-axis indicates the accuracy of each model's prediction. Each dotted curve shows the mean accuracy over five runs of evaluation, and the shaded area represents one standard deviation from the mean.

Figure 3 shows the prediction accuracy of the MVN and the baseline models from 2 to 30 channels with decreasing SNR. All models are trained on four-channel mixtures with SNRs uniformly spaced between −5 and 5 dB. For decreasing SNR, the SNR value decreases by 1 dB for each additional channel. The models have similar performance when the number of channels is less than 10; however, as it grows beyond that, the performance of the MVN is more stable than that of the baseline models. The result shows that the MVN is least affected by the channels with low SNR values compared to the baseline models. For increasing SNR, each extra channel is 1 dB higher than the previous channel. The MVN takes the fewest channels to achieve a given accuracy, indicating that the MVN collects information more effectively than the baseline models. Moreover, the MVN is able to generalize from training on a fixed number of channels and a limited range of SNRs to testing on a varied number of channels with a large range of SNRs.
Fig. 4. Sample setup with one noise source, four microphones, and a moving speech source in a 20 by 20 meter room. Diffuse noise not pictured.
3.4.2. Room Simulation

In this experiment, we use pyroomacoustics [17] to model a 20 m by 20 m room with the image source model using fourth-order echoes. We train the model on many microphone, speaker, and noise-source geometries, and test on unseen ones. To construct a simulated room we first simulate a noiseless moving speech source. Then, using the same microphone geometry, we simulate a noise source randomly placed on a grid. To construct a mixture we take a speech recording and a noise recording which correspond to the same microphone geometry and mix them at some SNR. The separate simulation of noise and speech lets us mix and match to construct training and testing environments without having to explicitly simulate every combination. For training we use mixtures with per-channel SNRs between −5 and 5 dB; at test time we experiment with both the "decreasing SNR" and "increasing SNR" scenarios described in Section 3.1.1. All speakers are modeled as point sources. The simulated room impulse responses were generally five times the length of an STFT window.

Inside the simulated rooms we place 2 to 30 microphones, a noise source, and a diffuse noise. The speech source moves in a noisy linear path from one corner of the room to another. The indices of the microphones are randomized. Figure 4 shows one possible configuration of a simulated room. Figure 5 depicts the accuracy of the MVN and the baseline models for speech classification. The mixtures range from 2 to 30 channels. We observe a pattern similar to the experiment with dry sounds in Section 3.4.1. Among the four models, the MVN is least affected by noise and room impulse responses, and it collects information most effectively from channels that may have high SNRs.
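For reference, here is a minimal sketch of one such simulation with pyroomacoustics, assuming a 2-D shoebox room; the microphone count, positions, and source signal are placeholders, and the moving speech source would be approximated by simulating it at successive positions along its path.

import numpy as np
import pyroomacoustics as pra

fs = 16000
rng = np.random.default_rng(0)

# A 20 m by 20 m 2-D shoebox, image source model with fourth-order echoes.
room = pra.ShoeBox([20, 20], fs=fs, max_order=4)

# Four example microphones at random positions (the paper uses 2 to 30).
mics = rng.uniform(0.5, 19.5, size=(2, 4))
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

# A point noise source placed on a grid position, with a placeholder signal.
noise = rng.standard_normal(2 * fs)
room.add_source([5.0, 15.0], signal=noise)

# Render the per-microphone recordings.
room.simulate()
channels = room.mic_array.signals   # shape: (n_mics, n_samples)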
Fig. 5. Test accuracy of the MVN and the baseline models on audio classification using the room simulation. The top and bottom plots correspond to the decreasing and increasing SNR scenarios. Each dotted curve shows the mean accuracy over three runs of evaluation, and the shaded area represents one standard deviation from the mean.

4. CONCLUSION
We have proposed a neural network method for multi-channel audio classification using an RNN that unrolls across both channels and time. We demonstrate that the proposed architecture can be deployed
to unseen numbers of channels and unseen room geometries at test time. The system is robust to noisy channels in a highly dynamic environment, making it unnecessary to eliminate certain channels as a preprocessing step. Moreover, the system demonstrates the ability to leverage information effectively from a limited number of clean channels when they appear among many noisy channels. As such, this model is capable of being trained once and then deployed in settings with a dynamically changing number of sensors (e.g. in an IoT case), without requiring retraining or any modifications. The proposed framework is not limited to binary classification and can be used for multi-class or multi-label classification, as well as with any other type of neural network architecture, as long as the channel unrolling takes place. We hope that this will form the basis of powerful future systems that have to operate under uncertainty about the number of input channels, as opposed to resorting to simple averaging or voting schemes which are not as adept at taking the data into account.

5. REFERENCES

[1] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al., "CNN architectures for large-scale audio classification," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 131-135.

[2] Toan H. Vu and Jia-Ching Wang, "Acoustic scene and event recognition using recurrent neural networks," Detection and Classification of Acoustic Scenes and Events, vol. 2016, 2016.

[3] Keunwoo Choi, George Fazekas, and Mark Sandler, "Automatic tagging using deep convolutional neural networks," arXiv preprint arXiv:1606.00298, 2016.

[4] Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang, and Mark D. Plumbley, "Convolutional gated recurrent neural network incorporating spatial features for audio tagging," in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 3461-3466.

[5] Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers, "The SINS database for detection of daily activities in a home environment using an acoustic sensor network," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), November 2017, pp. 32-36.

[6] Tadanobu Inoue, Phongtharin Vinayavekhin, Shiqiang Wang, David Wood, Nancy Greco, and Ryuki Tachibana, "Domestic activities classification based on CNN using shuffling and mixing data augmentation," 2018.

[7] Ryo Tanabe, Takashi Endo, Yuki Nikaido, Takeshi Ichige, Phong Nguyen, Yohei Kawaguchi, and Koichi Hamada, "Multichannel acoustic scene classification by blind dereverberation, blind source separation, data augmentation, and model ensembling," Tech. Rep., DCASE2018 Challenge, September 2018.

[8] Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura, Eds. 2010, pp. 1045-1048, ISCA.

[9] Giambattista Parascandolo, Heikki Huttunen, and Tuomas Virtanen, "Recurrent neural networks for polyphonic sound event detection in real life recordings," arXiv preprint arXiv:1604.00861, 2016.

[10] Hyoung-Gook Kim and Jin Young Kim, "Acoustic event detection in multichannel audio using gated recurrent neural networks with high-resolution spectral features," ETRI Journal, vol. 39, no. 6, pp. 832-840, 2017.

[11] Bo Li, Tara N. Sainath, Ron J. Weiss, Kevin W. Wilson, and Michiel Bacchiani, "Neural network adaptive beamforming for robust multichannel speech recognition," in INTERSPEECH, 2016, pp. 1976-1980.

[12] Xiong Xiao, Shinji Watanabe, Hakan Erdogan, Liang Lu, John Hershey, Michael L. Seltzer, Guoguo Chen, Yu Zhang, Michael Mandel, and Dong Yu, "Deep beamforming networks for multi-channel speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5745-5749.

[13] Jonah Casebeer, Brian Luc, and Paris Smaragdis, "Multi-view networks for denoising of arbitrary numbers of channels," CoRR, vol. abs/1806.05296, 2018.

[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," 1993.

[15] Ding Liu, Paris Smaragdis, and Minje Kim, "Experiments on deep learning for speech denoising," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2014, pp. 2685-2689.

[16] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.

[17] Robin Scheibler, Eric Bezzam, and Ivan Dokmanić, "Pyroomacoustics: A Python package for audio room simulation and array processing algorithms," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 351-355.

[18] D. H. Johnson, "Signal-to-noise ratio,"