Multi-Channel Speech Enhancement using Graph Neural Networks
Panagiotis Tzirakis, Anurag Kumar and Jacob Donley
Facebook Reality Labs Research, [email protected], {anuragkr, jdonley}@fb.com

ABSTRACT
Multi-channel speech enhancement aims to extract clean speech from a noisy mixture using signals captured from multiple microphones. Recently proposed methods tackle this problem by incorporating deep neural network models with spatial filtering techniques such as the minimum variance distortionless response (MVDR) beamformer. In this paper, we introduce a different research direction by viewing each audio channel as a node lying in a non-Euclidean space and, specifically, a graph. This formulation allows us to apply graph neural networks (GNN) to find spatial correlations among the different channels (nodes). We utilize graph convolution networks (GCN) by incorporating them in the embedding space of a U-Net architecture. We use the LibriSpeech dataset and simulate room acoustics data to extensively experiment with our approach using different array types and numbers of microphones. Results indicate the superiority of our approach when compared to a prior state-of-the-art method.
Index Terms — Speech enhancement, deep learning, multi-channel processing, graph neural networks
1. INTRODUCTION
Humans can naturally focus their auditory system to attend to a single sound source while cognitively ignoring other sounds. The exact mechanism that the brain employs to perform such a task in difficult noisy scenarios, often termed the cocktail party problem [1], is still not completely understood. However, studies have shown that binaural processing can help alleviate this problem [2]. Spatial information helps the auditory system group sounds from specific directions and segregate them from other directional interfering sounds.

Multi-channel speech enhancement is the process of enhancing a target's speech corrupted by background interference using multiple microphones. It is crucial to many applications including, but not limited to, human-machine interfaces [3], mobile communication [4], and hearing aids [5, 6]. While the problem has been studied for a long time, it remains a challenging one. The target's speech signal can be corrupted not only by other sound sources but also by reverberation from surface reflections. Traditional approaches include spatial filtering methods [7, 8] that often make use of the spatial information from the sound scene, such as the angular position of the target's speech and the microphone array configuration. These approaches are regularly termed beamforming, a linear processing model that weights ("masks") different microphone channels in the time-frequency domain in order to suppress source signal components that are not the target sound source. In the case of the minimum variance distortionless response (MVDR) [9] beamformer, first the desired source transfer function and noise covariance matrices are estimated, often via power spectral density (PSD) matrices, and then beamforming weights are computed and applied to the signals. Although these approaches can perform well, their performance depends on reliable estimation of spatial information, which can be challenging to estimate accurately in noisy conditions.

Deep neural networks (DNN) have been widely used in a variety of audio tasks, such as emotion recognition [10], automatic speech recognition [11], and speech enhancement and separation [12]. For multi-channel processing, they have been incorporated with traditional spatial filtering methods, such as the conventional filter-and-sum beamformer. This has mainly been accomplished in two ways, both of which are applied in the frequency domain. In one approach, a DNN is used to directly predict the beamforming weights [13]. In a second approach, a DNN is used to estimate a mask which is applied to the short-time Fourier transform (STFT) of the signal such that the PSD matrices can be computed [14]. Then, a beamforming method such as MVDR is applied, which computes the filter coefficients using the PSD matrices [15, 16, 14]. These methods use DNNs in different ways; however, the end goal for each of them is the same, which is to predict the filter coefficients. Recently, however, a shift in the audio community has emerged towards incorporating attention mechanisms in the deep neural network architectures [17] to implicitly perform spatial filtering.

In this paper, rather than using traditional beamforming methods with a DNN or attention mechanism, we propose a novel approach for multi-channel speech enhancement and de-reverberation. In particular, we view each audio channel as lying in a non-Euclidean space, more specifically, a graph, which is learned from the observations.
Formulating the problem in such a manner allows us to exploit methods from the graph neural network (GNN) domain [18], and to perform our training in an end-to-end manner. In addition, learning a graph structure allows the network to adapt its structure according to the dynamic sound scene. To the best of our knowledge, this is the first method that formulates multi-channel speech enhancement and de-reverberation through a graph and uses graph neural networks to solve it.

Our approach relies on both the real and imaginary parts of the complex mixture in the short-time Fourier transform (STFT) domain and estimates a complex ratio mask (CRM) for a reference microphone. The CRM is then applied to the mixture STFT to obtain the clean speech. We apply our proposed method to simultaneous speech enhancement and de-reverberation tasks. To this end, we simulate data leveraging the LibriSpeech [19] dataset. In particular, we simulate data for different microphone array configurations (linear, circular, and distributed) while varying the number of microphones. We use Short-Time Objective Intelligibility (STOI) [20], Perceptual Evaluation of Speech Quality (PESQ) [21], and Signal to Distortion Ratio (SDR) [22] as evaluation metrics in our experiments. We also show that our approach outperforms a recently proposed neural network based multi-channel speech enhancement method.

Fig. 1. The proposed model for multi-channel speech enhancement using graph neural networks. The complex spectrogram of each microphone signal is computed and passed to the encoder. The extracted representations of each channel are passed to a graph convolution network for spatial feature learning. The extracted features of each channel are passed to a decoder, and a weighted sum of the decoder outputs is performed. The output is (complex) multiplied with a reference microphone complex spectrogram to produce the clean spectrogram.
2. GRAPH NEURAL NETWORKS
Graph neural networks (GNN) [18] are a generalization of conventional neural networks, designed to operate on non-Euclidean data in the form of graphs. Graphs provide considerable flexibility in how the data can be represented and structured, and GNNs allow one to operate on and generalize neural network methods to graph-structured data. A particular type of GNN is the convolutional GNN, which is based on the principle of learning through shared weights, similar to convolutional neural networks (CNNs) [23]. Broadly, there are two approaches for building graph convolutional neural networks: spectral GCNs and spatial GCNs [24, 25, 26]. Spectral GCNs are based on principles of spectral graph theory. More specifically, the graph processing is based on the eigendecomposition of the graph Laplacian, which is used to compute the Fourier transform of a graph signal, through which graph filtering operations are defined. Spatial GCNs define convolutions directly on graph data and try to capture information by aggregating information from neighboring nodes through shared weights. Spatial GCNs are less computationally complex and also generalize better to different graphs. While spectral GCNs operate on a fixed graph, spatial GCNs have the flexibility of working locally on each node without taking into account the full fixed graph. They do, however, require node ordering.

A key aspect of our proposed method is that we construct the graph dynamically, conditioned on the task at hand. This dynamic graph construction approach enables the framework to capture multi-channel information for each audio input in a sample-specific manner.
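To make the propagation principle concrete, the following sketch applies a single symmetrically normalized graph convolution step in the spirit of Kipf and Welling [23] to a small toy graph. It is a minimal illustration under our own assumptions (graph size, feature dimensions, ReLU activation) and not the exact layer configuration used later in this paper.

```python
import numpy as np

# One graph-convolution step, H' = g(D^-1/2 A D^-1/2 H W), on a toy graph.
rng = np.random.default_rng(0)

num_nodes, in_feats, out_feats = 4, 8, 16
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)          # adjacency of a toy graph
A = A + np.eye(num_nodes)                          # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))  # D^-1/2
A_norm = D_inv_sqrt @ A @ D_inv_sqrt               # symmetric normalization

H = rng.standard_normal((num_nodes, in_feats))     # node features
W = rng.standard_normal((in_feats, out_feats))     # trainable weights

H_next = np.maximum(0.0, A_norm @ H @ W)           # one layer with ReLU as g
print(H_next.shape)                                # (4, 16)
```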
3. MULTI-CHANNEL GRAPH PROCESSING
Our proposed framework is schematically illustrated in Fig. 1. The audio signal from each channel is transformed to a time-frequency (T-F) representation, which is fed to the neural network framework. The inputs are first passed through an encoder network that learns higher-level features from the inputs (Sec. 3.1). These feature representations for the audio channels are then used to construct a graph that captures the multi-channel information through its nodes and edges. At this point, we use GCNs (Sec. 3.3) to aggregate information from each microphone. The output representation of each node in the graph is passed as input to the decoder, which then transforms the signals back to their original dimensions. Finally, a weighted sum of the decoder outputs is computed and (complex) multiplied with the STFT of a reference microphone to compute the clean STFT.
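The following PyTorch-style sketch mirrors the data flow of Fig. 1: per-channel complex spectrograms are encoded, the channel embeddings are mixed across microphones (a simple stand-in for the GCN described below), decoded back, combined with attention weights, and the resulting complex mask is applied to the reference-microphone STFT. All module internals, names, and sizes here are our own placeholder assumptions and do not reproduce the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultiChannelGraphEnhancer(nn.Module):
    """Illustrative sketch of the Fig. 1 pipeline (placeholder sizes/modules)."""

    def __init__(self, emb_dim=64):
        super().__init__()
        # Shared encoder/decoder applied to each channel's 2 x T x F input
        # (real and imaginary parts stacked as two feature maps).
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.SELU(),
            nn.Conv2d(16, emb_dim, 3, stride=2, padding=1), nn.SELU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(emb_dim, 16, 4, stride=2, padding=1), nn.SELU(),
            nn.ConvTranspose2d(16, 2, 4, stride=2, padding=1))
        # Stand-in for the GCN over the microphone graph (Sec. 3):
        # here a single linear mixing of node (channel) embeddings.
        self.gcn = nn.Linear(emb_dim, emb_dim)
        # Attention weights used to combine the per-channel decoder outputs.
        self.attn = nn.Linear(emb_dim, 1)

    def forward(self, x, ref_idx=0):
        # x: stacked real/imag STFTs of all channels, shape (B, M, 2, T, Fq)
        B, M, _, T, Fq = x.shape
        z = self.encoder(x.reshape(B * M, 2, T, Fq))          # (B*M, C, T', F')
        _, C, Tp, Fp = z.shape
        nodes = z.reshape(B, M, C, Tp, Fp).mean(dim=(3, 4))   # (B, M, C) node features
        nodes = torch.relu(self.gcn(nodes))                   # graph mixing (placeholder)
        z = z.reshape(B, M, C, Tp, Fp) + nodes[..., None, None]
        y = self.decoder(z.reshape(B * M, C, Tp, Fp)).reshape(B, M, 2, T, Fq)
        w = torch.softmax(self.attn(nodes).squeeze(-1), dim=1)   # (B, M) attention weights
        mask = (w[:, :, None, None, None] * y).sum(dim=1)        # (B, 2, T, Fq) CRM
        ref = torch.view_as_complex(x[:, ref_idx].permute(0, 2, 3, 1).contiguous())
        crm = torch.view_as_complex(mask.permute(0, 2, 3, 1).contiguous())
        return crm * ref                                          # enhanced complex STFT

model = MultiChannelGraphEnhancer()
stfts = torch.randn(1, 4, 2, 128, 128)   # stand-in for a 4-channel STFT batch
enhanced = model(stfts)
print(enhanced.shape)                    # torch.Size([1, 128, 128]), complex dtype
```

The graph mixing step is deliberately simplified here; the learned adjacency and the actual GCN propagation rule are detailed below and sketched after Eq. (4).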
The first major step in our proposed framework is to learn representations for the audio signals from each microphone. To do this, the audio signals are first converted to a time-frequency representation through the short-time Fourier transform (STFT). The real and imaginary parts of this complex representation are stacked together to obtain a 2-channel tensor of size 2 × T × F, where T represents the total number of time segments and F represents the total number of frequency bins. Considering all M channels, this leads to an M × 2 × T × F dimensional input to the framework.

We utilize a U-Net architecture [27] to learn representations for the inputs. U-Net based architectures have been shown to work well for speech enhancement [28]. The complex spectrogram from each channel is passed to the encoder. The representations produced by the encoder are used to obtain the multi-channel graph. The decoder outputs M tensors of the same dimension as the input. We combine these tensors into a unified representation using an attention layer, i.e., a weighted sum of them.

Traditionally, signal processing based methods have been used to extract information from audio signals captured by multiple microphones. Here, we propose a novel approach which extracts multi-channel information through graph processing. The first step in this process is to construct a graph using the audio representations for the different channels obtained in the previous step. We construct an undirected graph, G = (V, E), where V represents the set of nodes, v_i (i.e., microphones), of the graph, and E represents the edges, (v_i, v_j), of the graph between two nodes. In graph theory, the graph is characterized by an adjacency matrix A ∈ R^{|V|×|V|}, where |·| indicates the cardinality, and a degree matrix D [29].
Table 1. Results (STOI, PESQ, SDR) of the enhanced signal for three array types, namely, circular, linear, and distributed configurations.

We consider a weighted adjacency matrix, where the entries of A correspond to a weighted edge (v_i, v_j) ∈ E between two nodes of the graph. Intuitively, each weight represents a similarity between the feature vectors of two nodes in the graph. In our approach, these weights, w_ij, i, j ∈ {1, ..., |V|}, of the adjacency matrix A are learned during the training process. For two nodes v_i and v_j, we first concatenate their representations, f_{v_i}, f_{v_j} ∈ R^N, as [f_{v_i} || f_{v_j}] and then pass them through a non-linear function F([f_{v_i} || f_{v_j}]). We construct our adjacency matrix by normalizing the weights of each node to sum to one. The node degree matrix D is a diagonal matrix with D_ii = Σ_j A_ij.

The graph G constructed in the previous section provides a structured way to capture the information from all microphones. We can now exploit GCNs to learn spatial relations from this graph. We apply the GCN to learn higher abstraction levels for the node features by learning representations for each node with respect to its neighbors. Given a graph G = (V, E), the GCN applies a non-linear transformation to the input feature matrix X ∈ R^{|V|×N} with N features. Mathematically, the GCN can be represented as follows:

H^{(l)} = g( D^{-1/2} A D^{-1/2} H^{(l-1)} W^{(l-1)} ),   (1)

where H^{(l)} ∈ R^{|V|×K} is the l-th layer with K features and H^{(0)} = X, D is the diagonal node degree matrix, W^{(l-1)} is the trainable weight matrix at the (l-1)-th layer, and g is an activation function.

To train the overall framework, we consider loss computation in different forms: through the magnitude spectrogram, through the complex spectrogram, and in the raw signal domain. More specifically, the following four losses are considered:

L_Mag = || M̂ − M ||,   L_Spec = || Ŝ − S ||,   (2)
L_Mag+Spec = L_Mag + L_Spec,   (3)
L_Mag+raw = L_Mag + || ŝ − s ||,   (4)

where ||·|| indicates the L1 norm, M, S, and s indicate the magnitude spectrogram, complex spectrogram, and clean signal, respectively, and ˆ indicates the corresponding predicted entities.
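As a concrete illustration of the graph construction and training objectives above, the sketch below scores every pair of node embeddings with a small MLP (our stand-in for the non-linear function F), row-normalizes the scores to obtain A, applies the propagation rule of Eq. (1) for two layers, and evaluates the L1 losses of Eqs. (2) to (4). The module sizes, the ReLU activations, and the use of a softmax for the row normalization are our assumptions rather than the authors' reported choices.

```python
import torch
import torch.nn as nn

class LearnedGraphGCN(nn.Module):
    """Sketch of the dynamic graph construction and Eq. (1); sizes are illustrative."""

    def __init__(self, feat_dim=64, hidden=64):
        super().__init__()
        self.edge_scorer = nn.Sequential(            # stand-in for F([f_vi || f_vj])
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.w1 = nn.Linear(feat_dim, hidden)        # W^(0)
        self.w2 = nn.Linear(hidden, hidden)          # W^(1)

    def adjacency(self, x):
        # x: (M, N) node features; score every ordered pair of nodes.
        M = x.shape[0]
        pairs = torch.cat([x.repeat_interleave(M, 0), x.repeat(M, 1)], dim=-1)
        scores = self.edge_scorer(pairs).view(M, M)
        return torch.softmax(scores, dim=-1)         # each row sums to one

    def propagate(self, A, H, lin):
        # Eq. (1): H^(l) = g(D^-1/2 A D^-1/2 H^(l-1) W^(l-1)), with g = ReLU here.
        d_inv_sqrt = torch.diag(A.sum(dim=-1).clamp_min(1e-8).rsqrt())
        return torch.relu(d_inv_sqrt @ A @ d_inv_sqrt @ lin(H))

    def forward(self, x):
        A = self.adjacency(x)
        h = self.propagate(A, x, self.w1)
        return self.propagate(A, h, self.w2)

def losses(s_hat, s, y_hat, y):
    # s_hat/s: complex spectrograms, y_hat/y: raw waveforms (Eqs. (2)-(4)).
    l1 = nn.L1Loss()
    l_mag = l1(s_hat.abs(), s.abs())                                  # magnitude term
    l_spec = l1(torch.view_as_real(s_hat), torch.view_as_real(s))     # complex term
    return {"mag+spec": l_mag + l_spec,                               # Eq. (3)
            "mag+raw": l_mag + l1(y_hat, y)}                          # Eq. (4)

# Example with a 4-node graph of 64-dimensional embeddings:
gcn = LearnedGraphGCN()
out = gcn(torch.randn(4, 64))    # (4, 64) refined node features
```

In the full framework, the node features x would be the per-channel U-Net embeddings, and the GCN outputs would be passed to the decoder as described above.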
4. EXPERIMENTS

4.1. Dataset
We utilize the LibriSpeech dataset, a corpus of approximately 1,000 hours of English speech, captured in an anechoic chamber, with a 16 kHz sampling rate. For our purposes, we use the dataset's sentences to simulate room acoustic data for three array types: linear, circular, and distributed. With the distributed array, we randomly place the microphones in the room. In addition, we experiment with M ∈ {4, 8} microphones in the array. The simulated observations consist of one speech signal mixed with M − 1 noise signals, selected randomly from AudioSet [30] and located randomly in the room. The SNR levels of the mixed signals are drawn from the set {−7.5, −5, 0, 5, 7.5} dB. The training set comprises 3 rooms, the development set 2 rooms, and the test set 2 rooms, each room having different dimensions (width × depth × height). All rooms have the same reverberation time (RT60).

We used the Adam optimization algorithm [31] to train the models with a fixed learning rate and mini-batches of a fixed number of frames. The input representation is the complex STFT computed with a Hanning window. The number of channels of the decoder layers is the same as that of the encoder layers in reverse order. All (de-)convolutions of the encoder (decoder) use the same kernel size and stride, with no padding. Encoder (decoder) layers are comprised of the (de-)convolution, batch normalization, and a SELU activation function. For our GCN we use two layers, with the number of hidden units equal to the dimension of the embedding space.

We compare our approach with that of Chakrabarty et al. (CRNN-C) [14], a recently proposed multi-channel deep learning based enhancement model. Their approach takes as input the phase and magnitude of the raw signal for each channel and performs convolution across the channels to predict the ideal ratio mask (IRM) used to compute the clean magnitude.
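As a side note on the data simulation described in Sec. 4.1, the snippet below shows one common way of scaling a noise signal so that it mixes with speech at a target SNR. It is our own generic sketch; the authors' exact mixing and room-acoustics simulation pipeline may differ.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + scaled noise has the target SNR in dB."""
    noise = noise[: len(speech)]                       # crop noise to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                    # 1 s at 16 kHz (placeholder signal)
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=-7.5)       # one of the SNR levels in Sec. 4.1
```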
Method       |     -7.5 dB      |     -5.0 dB      |       0 dB       |      5.0 dB      |      7.5 dB
             | STOI  PESQ  SDR  | STOI  PESQ  SDR  | STOI  PESQ  SDR  | STOI  PESQ  SDR  | STOI  PESQ  SDR
Noisy        | 0.51  1.22 -6.05 | 0.58  1.51 -3.34 | 0.65  1.74  0.30 | 0.71  1.91  5.03 | 0.78  2.09  6.96
CRNN-C [14]  | 0.55  1.28 -1.08 | 0.62  1.57  0.82 | 0.68  1.79  2.61 | 0.73  1.97  6.36 | 0.80  2.13  7.32
Proposed     | 0.60  1.67  3.84 | 0.66  1.98  5.74 | 0.72  2.11  6.84 | 0.77  2.24  8.86 | 0.81  2.36  9.94

Table 2. Results (STOI, PESQ and SDR) for different input SNR levels using a 4-microphone linear array.
Fig. 2. STOI, PESQ, and SDR trends for different SNR levels.

We also show results of the U-Net architecture without utilizing the spatial information, i.e., using a single channel for the enhancement. Table 1 shows results for the different array configurations.

Our method outperforms the single-channel approach by a large margin on all metrics. In addition, our model also surpasses the CRNN-C method. Surprisingly, we observe that the CRNN-C approach is the worst performing, with values only marginally above those of the noisy signal, even when compared with the single-channel method. In addition, our approach is robust to different microphone geometries, in contrast with conventional methods where a-priori human knowledge is essential to achieve high-performing models.

Finally, in Table 2 and Fig. 2, we analyze results at different SNR levels. Results are shown for a linear array with 4 microphones at 5 different SNR levels, [-7.5, -5, 0, 5, 7.5] dB. In general, higher performance gains are obtained for negative SNR values compared to positive ones. For -7.5 dB input SNR, we see that the proposed GCN-based method leads to more than 9 dB improvement in SDR over the noisy case. For the highest input SNR, we observe only a 2.98 dB improvement in SDR. In addition, our model outperforms CRNN-C at all SNR values. This is again most evident at very low SNR values such as -7.5 dB, where our model improves over CRNN-C by 0.05, 0.39, and 4.92 on STOI, PESQ, and SDR, respectively.

We perform an ablation study using a 4-microphone linear array to find an appropriate loss function, and to verify that the GCN improves the model's performance.
Experiments in this section use an 8-microphone circular array, and results are reported in terms of STOI, PESQ, and SDR. Table 3 shows the results on the development set of our simulated dataset. The results indicate that the loss combining the magnitude and the raw signal has the best overall performance.

Method        PESQ   STOI   SDR
Noisy         1.72   0.66   0.19
L_Mag
L_Spec
L_Mag+Spec
L_Mag+raw

Table 3. Results on the development set for different loss functions.

To verify the utility of the GCN in the embedding space of the U-Net, we perform two experiments where we: (i) discard the GCN from, and (ii) include the GCN in, the embedding space of the U-Net. Table 4 depicts the results. With the GCN included, the performance of the model is better on all metrics than without it. We should note that when we do not use the GCN we still use all channels, as we combine the extracted U-Net representations with the attention layer.

Method        PESQ   STOI   SDR
Noisy         1.72   0.66   0.19
w/o GCN       2.08   0.71   7.05
w/ GCN        2.13   0.73   7.73

Table 4. Proposed model with and without the graph representation.
5. CONCLUSIONS
In this paper, we propose to utilize graph neural networks to exploit the spatial correlations in the multi-channel speech enhancement problem. We use a U-Net type architecture where the encoder learns representations for each channel separately, and a graph is constructed using these representations. Graph convolution networks are used to propagate messages in the graph and hence learn spatial features. The features of each node are passed to the decoder to reconstruct the spectrogram of each channel. We combine these using attention weights for the final prediction at a reference microphone. An analysis of the proposed method is provided with different array geometries and microphone counts in reverberant and noisy environments. This is the first study that utilizes GCNs for speech enhancement. Results show the superiority of the proposed approach compared to a prior state-of-the-art method.

Future work could look at performing a quantitative evaluation of the trained model by inspecting the most important nodes and edges in the graph. Also, performing speech enhancement using the raw waveform (i.e., an end-to-end approach) instead of the complex spectrograms is another possible direction.
REFERENCES

[1] E. C. Cherry, "Some experiments on the recognition of speech, with one and with two ears," The Journal of the Acoustical Society of America, pp. 975-979, 1953.
[2] M. L. Hawley, R. Y. Litovsky, and J. F. Culling, "The benefit of binaural hearing in a cocktail party: Effect of location and type of interferer," The Journal of the Acoustical Society of America, pp. 833-843, 2004.
[3] M. R. Bai, J. Ih, and J. Benesty, Acoustic array systems: theory, implementation, and application, John Wiley & Sons, 2013.
[4] K. Tan, X. Zhang, and D. Wang, "Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios," in Proc. IEEE ICASSP, 2019, pp. 5751-5755.
[5] S. A. Nossier, M. R. M. Rizk, N. D. Moussa, and S. el Shehaby, "Enhanced smart hearing aid using deep neural networks," Alexandria Engineering Journal, vol. 58, no. 2, pp. 539-550, 2019.
[6] A. A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1652-1664, 2016.
[7] J. Benesty, J. Chen, and Y. Huang, Microphone array signal processing, vol. 1, Springer Science & Business Media, 2008.
[8] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, "A consolidated perspective on multimicrophone speech enhancement and source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 692-730, 2017.
[9] M. Souden, J. Benesty, and S. Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction," IEEE Transactions on Audio, Speech, and Language Processing, pp. 260-276, 2009.
[10] P. Tzirakis, G. Trigeorgis, M. Nicolaou, B. Schuller, and S. Zafeiriou, "End-to-end multimodal emotion recognition using deep neural networks," IEEE Journal of Selected Topics in Signal Processing, 2017.
[11] L. Besacier, E. Barnard, A. Karpov, and T. Schultz, "Automatic speech recognition for under-resourced languages: A survey," Speech Communication, 2014.
[12] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1702-1726, 2018.
[13] T. N. Sainath, R. J. Weiss, K. W. Wilson, B. Li, A. Narayanan, E. Variani, M. Bacchiani, I. Shafran, A. Senior, K. Chin, et al., "Multichannel signal processing with deep neural networks for automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
[14] S. Chakrabarty and E. Habets, "Time-frequency masking based online multi-channel speech enhancement with convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, 2019.
[15] T. Higuchi, K. Kinoshita, N. Ito, S. Karita, and T. Nakatani, "Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming," in Proc. IEEE ICASSP, 2018, pp. 531-535.
[16] Z.-Q. Wang and D. Wang, "All-neural multi-channel speech enhancement," in Proc. Interspeech, 2018, pp. 3234-3238.
[17] B. Tolooshams, R. Giri, A. H. Song, U. Isik, and A. Krishnaswamy, "Channel-attention dense U-Net for multichannel speech enhancement," in Proc. IEEE ICASSP, 2020.
[18] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, "Geometric deep learning: going beyond Euclidean data," IEEE Signal Processing Magazine, 2017.
[19] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, 2015.
[20] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, pp. 2125-2136, 2011.
[21] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE ICASSP, 2001, pp. 749-752.
[22] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, 2006.
[23] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[24] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[25] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[26] M. Balcilar, G. Renton, P. Héroux, B. Gauzere, S. Adam, and P. Honeine, "Bridging the gap between spectral and spatial domains in graph neural networks," arXiv preprint arXiv:2003.11702, 2020.
[27] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2481-2495, 2017.
[28] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement."
[29] X. J. Zhu, "Semi-supervised learning literature survey," Tech. Rep., University of Wisconsin-Madison Department of Computer Sciences, 2005.
[30] J. F. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. ICASSP, 2017, pp. 776-780.
[31] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980.