Deep ad-hoc beamforming
Xiao-Lei Zhang ([email protected]), Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, China; Center for Intelligent Acoustics and Immersive Communications, School of Marine Science and Technology, Northwestern Polytechnical University, China
Abstract
Far-field speech processing is an important and challenging problem. In this paper, we propose deep ad-hoc beamforming (DAB), a deep-learning-based multichannel speech enhancement framework built on ad-hoc microphone arrays, to address the problem. It contains three novel components. First, it combines ad-hoc microphone arrays with deep-learning-based multichannel speech enhancement, which significantly reduces the probability of far-field acoustic environments occurring. Second, it groups the microphones around the speech source into a local microphone array by a supervised channel selection framework based on deep neural networks. Third, it develops a simple time synchronization framework to synchronize the channels that have different time delays. Besides the above novelties and advantages, the proposed model is trained in a single-channel fashion, so that it can easily employ new developments in speech processing techniques. Its test stage is also flexible in incorporating any number of microphones without retraining or modifying the framework. We have developed many implementations of the proposed framework and conducted extensive experiments in scenarios where the locations of the speech sources are far-field, random, and blind to the microphones. Results on speech enhancement tasks show that our method outperforms its counterpart that works with linear microphone arrays by a considerable margin in both diffuse noise reverberant environments and point source noise reverberant environments.

Keywords:
Adaptive beamforming, ad-hoc microphone array, channel selection, deep learning, distributed microphone array
1. Introduction
Deep-learning-based speech enhancement has demonstrated strong denoising ability in adverse acoustic environments (Wang & Chen, 2018), and has attracted much attention since its first appearance (Wang & Wang, 2013). Current deep-learning-based techniques employ either a single microphone or a conventional microphone array to pick up speech signals, where a conventional microphone array refers to a microphone array fixed in a single device. Deep-learning-based single-channel speech enhancement, e.g. (Wang & Wang, 2013; Zhang & Wu, 2013; Lu et al., 2013; Wang et al., 2014; Huang et al., 2015; Xu et al., 2015; Williamson et al., 2016), employs a deep neural network (DNN), which is a multilayer perceptron with more than one nonlinear hidden layer, to learn a nonlinear mapping function from noisy speech to clean speech or its ideal time-frequency masks.

Deep-learning-based multichannel speech enhancement has two major forms. The first form (Jiang et al., 2014) uses a microphone array as a feature extractor to extract spatial features, e.g. the interaural time difference and interaural level difference, as the input of DNN-based single-channel enhancement. The second form (Heymann et al., 2016; Erdogan et al., 2016), which we denote as deep beamforming, estimates a monaural time-frequency (T-F) mask (Wang et al., 2014; Heymann et al., 2016; Higuchi et al., 2016) using a single-channel DNN so that the spatial covariance matrices of speech and noise can be derived for adaptive beamforming, e.g. minimum variance distortionless response (MVDR) or generalized eigenvalue beamforming. It is fundamentally a linear method, whose output does not suffer from nonlinear distortions. Due to its success in speech recognition, it has been extensively studied, including the aspects of integration with spatial-clustering-based masking (Nakatani et al., 2017), acoustic features (Wang & Wang, 2018), model training (Xiao et al., 2017; Tu et al., 2017; Higuchi et al., 2018; Zhou & Qian, 2018), mask estimation (Erdogan et al., 2016), post-processing (Zhang et al., 2017), etc.

Figure 1: Illustration of an ad-hoc microphone array.

Although many positive results have been observed, existing deep-learning-based speech enhancement and its applications were studied with a single microphone or a conventional microphone array only, such as a linear array in portable equipment. Its performance drops when the distance between the speech source and the microphone (array) is enlarged. As a result, how to maintain the enhanced speech at the same high quality throughout a physical space of interest becomes a new problem.

Ad-hoc microphone arrays provide a potential solution to the above problem. As illustrated in Fig. 1, an ad-hoc microphone array is a set of randomly distributed microphones that collaborate with each other. Compared to conventional microphone arrays, an ad-hoc microphone array has the following two advantages. First, it has a chance to enhance a speaker's voice with equally good quality anywhere in the range that the array covers. Second, its performance is not limited by the physical size of application devices, e.g. cell phones, gooseneck microphones, or smart speaker boxes. Ad-hoc microphone arrays also have a chance to become widespread in real-world environments, such as meeting rooms, smart homes, and smart cities. Research on ad-hoc microphone arrays is an emerging direction (Heusdens et al., 2012; Zeng & Hendriks, 2014; O'Connor et al., 2016; O'Connor & Kleijn, 2014; Tavakoli et al., 2016; Jayaprakasam et al., 2017; Tavakoli et al., 2017; Zhang et al., 2018; Koutrouvelis et al., 2018). It involves at least the following three fundamental problems:
• Channel selection. Because the microphones may be distributed over a large area, taking all microphones into consideration may not be a good strategy, since the microphones that are far away from the speech source may be too noisy. Channel selection aims to group a handful of microphones around a speaker into a local microphone array, out of a large number of randomly distributed microphones.
• Device synchronization. Because the microphones are distributed at different positions, and possibly on different devices, their output signals may have different time delays, clock rates, or adaptive gain controllers. Device synchronization aims to synchronize the signals from the microphones, so as to facilitate the subsequent applications.
• Task-driven multichannel signal processing. It aims to maximize the performance of a specific task by, e.g., adapting a multichannel signal processing algorithm designed for a conventional array to an ad-hoc array. The tasks include speech enhancement, multi-talker speech separation, speech recognition, speaker recognition, etc.

However, current research on ad-hoc microphone arrays is still at its very beginning. For example, some work has focused on the channel selection problem in an ideal scenario where perfect noise estimation and voice activity detection are available (Zhang et al., 2018). Although some work has tried to jointly conduct noise estimation and channel selection, it has to make many assumptions and carry out advanced mathematical formulations, such as the bi-alternating direction method of multipliers for the distributed optimization of the $\ell_1$-regularized channel-selection objective (O'Connor et al., 2016).

A possible explanation for the above difficulty is that, in the extreme case, an ad-hoc microphone array lacks so much important prior knowledge, and contains so many interferences, that we have little information about the array beyond the received signals. To overcome the difficulty, we may consider supplying enough prior knowledge to the array. Supervised deep learning, which learns prior knowledge by neural networks, provides us this opportunity, as it did for supervised speech separation with conventional microphone arrays (Wang & Chen, 2018).

In this paper, we propose a framework named deep ad-hoc beamforming (DAB), which brings deep learning to ad-hoc microphone arrays. It has the following three novelties:
• A supervised channel selection framework is proposed. It first predicts the quality of the received speech signal of each channel by a deep neural network. Then, it groups the microphones that have high speech quality and strong cross-channel signal correlation into a local microphone array. Several channel selection algorithms have been developed, including a one-best channel selection method and several N-best channel selection methods (with positive integer N ≥ 2) under different channel selection criteria.
• A simple supervised time synchronization framework is proposed. It first picks the output of the best channel as a reference signal, then estimates the relative time delays of the other channels by a traditional time delay estimator, and finally synchronizes the channels according to the estimation result, where the best channel is selected in a supervised manner.
• A speech enhancement algorithm is implemented as an example. The algorithm applies the channel selection and time synchronization frameworks to deep beamforming. It is designed to demonstrate the overall effectiveness and flexibility of DAB. Its implementation is straightforward and does not require large modifications of existing deep beamforming algorithms.

We have conducted an extensive experimental comparison between DAB and its deep-learning-based multichannel speech enhancement counterpart with linear microphone arrays, in scenarios where the speech sources and microphone arrays were placed randomly in typical physical spaces with random time delays, and where the noise sources were either diffuse noise or point source noise. Experimental results with noise-independent training show that DAB outperforms its counterpart by a large margin.

This paper is organized as follows. Section 2 presents the mathematical notations of this paper. Section 3 presents the signal model of ad-hoc microphone arrays. Section 4 presents the framework of the proposed DAB. Section 5 presents the channel selection module of DAB. Section 6 presents the application of DAB to speech enhancement. Section 7 evaluates the effectiveness of the proposed method. Finally, Section 8 concludes our findings.
Figure 2: Monte Carlo simulation of the distance distribution between a speech source and a microphone array. The physical spaces for this simulation comprise a square room, a rectangular room, and a circular room (see sFig. 1 in the supplementary materials for details of the three rooms). The farthest distance between the speech source and the microphone array in any of the rooms is set to 20 meters. Each microphone array in comparison consists of 16 microphones. (a) Probability density function (PDF) of the distance distribution of a conventional microphone array. The mean and standard deviation of this distribution are 7.28 and 3.71 meters respectively. (b) PDF of the distance distribution of an ad-hoc microphone array, where the distance is defined as the average distance between the speaker and each microphone in the ad-hoc array. The mean and standard deviation of this distribution are 7.28 and 1.68 meters respectively. (c) PDF of the distribution of the distance between the speech source and the best microphone in the ad-hoc microphone array, where "best microphone" denotes the microphone closest to the speech source. The mean and standard deviation of the distribution are 1.92 and 1.21 meters respectively. (d) Cumulative distribution functions (CDF) of the distance distributions in Figs. 2a, 2b, and 2c.
2. Notations
We first introduce some notations here. Regular lower-case letters, e.g. $s$, $f$, and $\gamma$, indicate scalars. Bold lower-case letters, e.g. $\mathbf{y}$ and $\boldsymbol{\alpha}$, indicate vectors. Bold capital letters, e.g. $\mathbf{P}$ and $\boldsymbol{\Phi}$, indicate matrices. Letters in calligraphic fonts, e.g. $\mathcal{X}$, indicate sets. $\mathbf{1}$ ($\mathbf{0}$) is a vector with all entries being 1 (0). The operator $^T$ denotes the transpose. The operator $^H$ denotes the conjugate transpose of complex numbers.
3. Signal model of ad-hoc microphone arrays
Ad-hoc microphone arrays can significantly reduce the probability of the occurrence of far-field environments. We take the case described in Fig. 2 as an example. When a speaker and a microphone array are distributed randomly in a room, the distribution of the distance between the speaker and an ad-hoc microphone array has a smaller variance than that between the speaker and a conventional microphone array (Figs. 2a and 2b). For example, the conventional array has a probability of 24% of being placed over 10 meters away from the speech source, while the corresponding number for the ad-hoc array is only 7%. In particular, the distance between the best microphone of the ad-hoc array and the speech source is only 1.9 meters on average, and the probability of this distance being larger than 5 meters is only 2% (Fig. 2c).

Here we build the signal model of an ad-hoc microphone array. All speech enhancement methods throughout the paper operate in the frequency domain on a frame-by-frame basis. Suppose that a physical space contains one target speaker, multiple noise sources, and an ad-hoc microphone array of M microphones. The physical model for the signals arriving at the ad-hoc array is assumed to be

$\mathbf{v}(t,f) = \mathbf{c}(f)s(t,f) + \mathbf{h}(t,f) + \mathbf{n}(t,f)$  (1)

where $s(t,f)$ is the short-time Fourier transform (STFT) value of the target clean speech at time $t$ and frequency $f$, and $\mathbf{c}(f)$ is the time-invariant acoustic transfer function from the speech source to the array, which is an M-dimensional complex vector:

$\mathbf{c}(f) = [c_1(f), c_2(f), \ldots, c_M(f)]^T$  (2)

$\mathbf{c}(f)s(t,f)$ and $\mathbf{h}(t,f)$ are the direct sound and the early and late reverberation of the target signal respectively, and $\mathbf{n}(t,f)$ is the additive noise:

$\mathbf{n}(t,f) = [n_1(t,f), n_2(t,f), \ldots, n_M(t,f)]^T$  (3)

$\mathbf{v}(t,f) = [v_1(t,f), v_2(t,f), \ldots, v_M(t,f)]^T$  (4)

whose m-th entries are the STFT values of the signal received by the m-th microphone at time $t$ and frequency $f$. Usually, we denote $\mathbf{x}(t,f) = \mathbf{c}(f)s(t,f)$.

After being processed by the devices $\{D_m(\cdot)\}_{m=1}^M$ in which the microphones are fixed, the signals that DAB finally receives are:

$z_m(t,f) = D_m(v_m(t,f)), \quad \forall m = 1,\ldots,M$  (5)

with $\mathbf{z}(t,f) = [z_1(t,f),\ldots,z_M(t,f)]^T$. Real-world devices $\{D_m(\cdot)\}_{m=1}^M$ may cause many problems, including unsynchronized time delays, clock rates, adaptive gain controllers, etc. Here we consider the time unsynchronization problem:

$z_m(t,f) = v_m(t+\tau_m, f) = x_m(t+\tau_m, f) + h_m(t+\tau_m, f) + n_m(t+\tau_m, f)$  (6)

where $\tau_m$ is the time delay caused by the m-th device.
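To make the signal model concrete, the following NumPy sketch simulates a simplified time-domain version of Eqs. (1) to (6). It assumes free-field propagation with only a 1/r-attenuated direct path (the reverberation term h is ignored) and integer-sample delays; the function name and interface are our own illustration, not code from the paper.

```python
import numpy as np

def simulate_adhoc_channels(s, dists, noise, dev_delays, sr=16000, c=340.0):
    """s: (Ls,) clean source; dists: (M,) source-to-mic distances in meters;
    noise: (M, T) additive noise; dev_delays: (M,) device delays in seconds.
    Returns z of shape (M, T), a simplified version of the signals in Eq. (6)."""
    M, T = noise.shape
    z = noise.astype(float).copy()
    for m in range(M):
        # total lag = propagation delay (distance / speed of sound) + device delay
        lag = int(round((dists[m] / c + dev_delays[m]) * sr))
        n = min(T - lag, len(s))
        if n > 0:
            # 1/r attenuated, delayed direct path added to this channel's noise
            z[m, lag:lag + n] += s[:n] / max(dists[m], 1.0)
    return z
```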
4. Deep ad-hoc beamforming: A system overview
A system overview of DAB is shown in Fig. 3. It contains three core components: a supervised channel selection framework, a supervised time synchronization framework, and a speech enhancement module.

Figure 3: Diagram of deep ad-hoc beamforming. The channel-selection framework is described in the red dashed box.

The core idea of the channel selection framework is to filter the received signals $\mathbf{z}(t,f)$ by a channel-selection vector $\mathbf{p} = [p_1,\ldots,p_M]^T$:

$\mathbf{z}_p(t,f) = \mathbf{p} \circ \mathbf{z}(t,f)$  (7)
such that the channels that output low-quality speech signals can be suppressed or even discarded, where $\mathbf{p}$ is the output mask of the channel-selection method described in the red box of Fig. 3, and $\circ$ denotes the element-wise product operator. Without loss of generality, we assume the selected channels are $z_1(t,f), \ldots, z_N(t,f)$.

The time synchronization module first selects the noisy signal from the best channel, assumed to be $z_k(t,f)$, as a reference signal by a supervised 1-best channel selection algorithm that will be described in Section 5.2. Then, it estimates the relative time delay of the noisy signals from the selected microphones over the reference signal by a time delay estimator:

$\hat{\tau}_n = h(z_n(t,f) \mid z_k(t,f)), \quad \forall n = 1,\ldots,N$  (8)

where $h(z_n(t,f) \mid z_k(t,f))$ is the time delay estimator with $z_k(t,f)$ as the reference signal, and $\hat{\tau}_n$ is the estimated relative time delay of $z_n(t,f)$ over $z_k(t,f)$. Finally, it synchronizes the microphones according to the estimated time delays:

$y_n(t,f) = z_n(t - \hat{\tau}_n, f), \quad \forall n = 1,\ldots,N$  (9)

which is the output of the module. Note that $\hat{\tau}_n$ includes the relative time delay caused by both the device and the transmission of the signal through the air. Because developing a new accurate time delay estimator is not the focus of this paper, we simply use the classic generalized cross-correlation phase transform (GCC-PHAT) (Knapp & Carter, 1976; Carter, 1987) as the estimator, though many other time delay estimators can be adopted as well (Chen et al., 2006).
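The following is a compact NumPy sketch of a GCC-PHAT estimator for Eq. (8), together with the shift of Eq. (9); the function name and the zero-padding and lag-windowing details are our own choices rather than the paper's implementation.

```python
import numpy as np

def gcc_phat(sig, ref, sr=16000, max_tau=None):
    """Return the estimated delay (in seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)                       # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)  # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(sr * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center lag 0
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / sr

# Synchronization as in Eq. (9): shift each selected channel back by its
# estimated delay relative to the reference channel z_k, e.g.
#   tau_n = gcc_phat(z[n], z[k], sr)
#   y_n = np.roll(z[n], -int(round(tau_n * sr)))
```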
The speech enhancement module takes $\mathbf{y}(t,f) = [y_1(t,f),\ldots,y_N(t,f)]^T$ as its input. Many deep-learning-based speech enhancement methods can be used, directly or with slight modification, as the speech enhancement module. Here we take MVDR-based deep beamforming as an example. Because the deep beamforming model is trained in a single-channel fashion, like the other parts of DAB, the overall DAB is flexible in incorporating any number of microphones in the test stage without retraining or modifying the model. This is an important requirement of real-world applications that should also be considered in other DAB implementations. Note that, if N = 1, then DAB outputs the noisy speech of the selected single channel directly, without resorting to deep beamforming.

In the following two sections, we will present the supervised channel selection framework and the speech enhancement module respectively.
5. Supervised channel selection
The channel-selection algorithm is applied to each channel independently. It contains two steps, described in the following two subsections respectively.
Suppose there is a test utterance of U frames, and suppose the received speech signal at the i-th channel is $\{\tilde{\mathbf{z}}_i(t)\}_{t=1}^U$:

$\tilde{\mathbf{z}}_i(t) = [|z_i(t,1)|, \ldots, |z_i(t,F)|]^T$  (10)

where $|z_i(t,f)|$ is the amplitude spectrogram of $z(t,f)$ at the i-th channel. We first use a DNN-based single-channel speech enhancement method, denoted as DNN1, to generate an estimated ideal ratio mask (IRM) of the direct sound of $\tilde{\mathbf{z}}_i(t)$, denoted as $\{\hat{\mathbf{x}}_i(t)\}_{t=1}^U$:

$\hat{\mathbf{x}}_i(t) = [\widehat{\mathrm{IRM}}_i(t,1), \ldots, \widehat{\mathrm{IRM}}_i(t,F)]^T$  (11)

where $\widehat{\mathrm{IRM}}_i(t,f)$ is the estimate of the IRM at the i-th channel. The IRM is the training target of DNN1:

$\mathrm{IRM}(t,f) = \frac{|x(t,f)|}{|x(t,f)| + |h(t,f) + n(t,f)|}$  (12)

where $|x(t,f)|$, $|h(t,f)|$, and $|n(t,f)|$ are the amplitude spectrograms of the direct and early reverberant speech, the late reverberant speech, and the noise components of the single-channel noisy speech respectively.

Then, we merge all noisy frames and all estimated clean-speech frames respectively into two vectors by average pooling:

$\bar{\tilde{\mathbf{z}}}_i = \frac{1}{U}\sum_{t=1}^U \tilde{\mathbf{z}}_i(t)$  (13)

$\bar{\hat{\mathbf{x}}}_i = \frac{1}{U}\sum_{t=1}^U \hat{\mathbf{x}}_i(t)$  (14)

Finally, we get the estimated channel weight $q_i$ by

$q_i = g\left(\left[\bar{\tilde{\mathbf{z}}}_i^T, \bar{\hat{\mathbf{x}}}_i^T\right]^T\right)$  (15)

where $g(\cdot)$ is a DNN-based channel-reweighting model, denoted as DNN2, and $q_i$ is the channel weight of the i-th channel. Note that we use both $\bar{\hat{\mathbf{x}}}_i$ and $\bar{\tilde{\mathbf{z}}}_i$ as the input, instead of simply using $\bar{\tilde{\mathbf{z}}}_i$, in order to improve the estimation accuracy.

To train $g(\cdot)$, we need to first define a training target. Many measurements may be used as training targets, such as performance evaluation metrics including the signal-to-noise ratio (SNR), short-time objective intelligibility (STOI) (Taal et al., 2011), etc., as well as other device-specific metrics such as the battery life of a cell phone. For example, if a cell phone is about to run out of power, then DAB should prevent the cell phone from being an activated channel of the ad-hoc microphone array, so as to save the phone's power. This paper uses a variant of SNR as the target:

$\frac{\sum_t |x_{\mathrm{time}}(t)|^2}{\sum_t |x_{\mathrm{time}}(t)|^2 + \sum_t |n_{\mathrm{time}}(t)|^2}$  (16)

where $\{x_{\mathrm{time}}(t)\}_t$ and $\{n_{\mathrm{time}}(t)\}_t$ are the direct sound and additive noise components of the received noisy speech signal in the time domain.

As presented above, both DNN1 and DNN2 are trained on single-channel data only, instead of multichannel data collected by ad-hoc microphone arrays, which is an important merit for the practical use of DAB. In practice, the training data of DNN1 and DNN2 need to be independent so as to prevent overfitting.
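As a concrete illustration of Eqs. (13) to (16), the sketch below builds the DNN2 input feature by average pooling and computes the SNR-variant training target; the array shapes and helper names are assumptions for illustration only.

```python
import numpy as np

def dnn2_input(mag, mask):
    """mag: (U, F) amplitude spectrogram of one channel; mask: (U, F)
    estimated IRM from DNN1. Returns the 2F-dimensional input of Eq. (15)."""
    z_bar = mag.mean(axis=0)              # Eq. (13): mean amplitude spectrum
    x_bar = mask.mean(axis=0)             # Eq. (14): mean estimated IRM
    return np.concatenate([z_bar, x_bar])

def snr_target(x_time, n_time):
    """Eq. (16): energy ratio of the direct sound over direct sound plus
    noise in the time domain; lies in [0, 1], matching a sigmoid output."""
    ex = np.sum(x_time ** 2)
    en = np.sum(n_time ** 2)
    return ex / (ex + en + 1e-12)
```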
Given the estimated weights $\mathbf{q} = [q_1,\ldots,q_M]^T$ of the test utterance, many advanced sparse learning methods are able to project $\mathbf{q}$ to $\mathbf{p}$, i.e. $\mathbf{p} = \delta(\mathbf{q})$, where $\delta(\cdot)$ is a channel-selection function that enforces sparsity constraints on $\mathbf{q}$. This section designs several $\delta(\cdot)$ functions as follows.

The simplest channel-selection method ("1-best") is to pick the channel with the highest SNR:

$p_i = \begin{cases}1, & \text{if } q_i = \max_{1\le k\le M} q_k\\ 0, & \text{otherwise}\end{cases} \quad \forall i = 1,\ldots,M.$  (17)

After this channel selection, DAB outputs the noisy speech from the selected channel directly.

Another simple channel-selection method ("all-channels") is to select all channels with equal importance:

$p_i = 1, \quad \forall i = 1,\ldots,M.$  (18)

This method is an extreme case of channel selection that usually performs well when the microphones are distributed in a small space.

When the microphone number M is large enough, there might exist several microphones close to the speech source whose received signals are all informative. It is better to group them together into a local array ("fixed-N-best") instead of selecting the one best channel:

$p_i = \begin{cases}1, & \text{if } q_i \in \{q'_1, q'_2, \ldots, q'_N\}\\ 0, & \text{otherwise}\end{cases} \quad \forall i = 1,\ldots,M$  (19)

where $q'_1 \ge q'_2 \ge \ldots \ge q'_M$ is $\{q_i\}_{i=1}^M$ sorted in descending order, and N is a user-defined hyperparameter, $N \le M$.

Here we develop a simple method ("auto-N-best") that determines the hyperparameter N in (19) on the fly. It first finds $q^* = \max_{i\in\{1,\ldots,M\}} q_i$, and then determines $\mathbf{p}$ by

$p_i = \begin{cases}1, & \text{if } \frac{q_i}{q^*}\cdot\frac{1-q^*}{1-q_i} > \gamma\\ 0, & \text{otherwise}\end{cases} \quad \forall i = 1,\ldots,M$  (20)

where $\gamma \in [0, 1]$ is a tunable threshold. See the Appendix for the proof of (20).
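The selection rules above reduce to a few lines of NumPy. The sketch below implements the 1-best, fixed-N-best, and auto-N-best rules of Eqs. (17), (19), and (20); the soft=True branch anticipates the soft-weight variant of Eq. (21) introduced next. Function names are illustrative.

```python
import numpy as np

def one_best(q):
    """Eq. (17): keep only the channel with the largest predicted weight."""
    p = np.zeros_like(q)
    p[np.argmax(q)] = 1.0
    return p

def fixed_n_best(q, n):
    """Eq. (19): keep the N channels with the largest predicted weights."""
    p = np.zeros_like(q)
    p[np.argsort(q)[::-1][:n]] = 1.0
    return p

def auto_n_best(q, gamma, soft=False, eps=1e-12):
    """Eq. (20); with soft=True, return the soft weights of Eq. (21)."""
    q_star = q.max()
    ratio = (q / (q_star + eps)) * ((1.0 - q_star) / (1.0 - q + eps))
    keep = ratio > gamma
    return np.where(keep, q if soft else np.ones_like(q), 0.0)
```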
One way to encode the signal quality of the selected channels in (20) is to use soft weights instead ("soft-N-best"):

$p_i = \begin{cases}q_i, & \text{if } \frac{q_i}{q^*}\cdot\frac{1-q^*}{1-q_i} > \gamma\\ 0, & \text{otherwise}\end{cases} \quad \forall i = 1,\ldots,M.$  (21)

The above channel selection methods determine the selected channels by SNR only, without considering the correlation between the channels. As we know, the correlation between the channels, which encodes environmental information and the time delays between the microphones, is important to adaptive beamforming. Here, we develop a spectral-clustering-based channel selection method ("learning-N-best") that builds the correlation into the design of the affinity matrix of the spectral clustering.

Unlike the other channel selection algorithms, "learning-N-best" should first conduct the time synchronization, which takes $\mathbf{y}(t,f) = [y_1(t,f),\ldots,y_M(t,f)]^T$ as its input. Then, it calculates the covariance matrix of the noisy speech across the channels by

$\boldsymbol{\Phi}_{\mathbf{yy}}(f) = \sum_t \mathbf{y}(t,f)\mathbf{y}(t,f)^H$  (22)

and normalizes (22) to an amplitude covariance matrix $\boldsymbol{\Phi}^{\mathrm{norm}}_{\mathbf{yy}}(f)$:

$\boldsymbol{\Phi}^{\mathrm{norm}}_{\mathbf{yy}}(f)(i,j) = \frac{|\boldsymbol{\Phi}_{\mathbf{yy}}(f)(i,j)|}{\sqrt{\boldsymbol{\Phi}_{\mathbf{yy}}(f)(i,i)\,\boldsymbol{\Phi}_{\mathbf{yy}}(f)(j,j)}}, \quad \forall i,j = 1,\ldots,M.$  (23)

It then calculates a new matrix $\mathbf{K}$ by averaging the amplitude covariance matrices along the frequency axis:

$\mathbf{K}(i,j) = \frac{1}{F}\sum_{f=1}^F \boldsymbol{\Phi}^{\mathrm{norm}}_{\mathbf{yy}}(f)(i,j), \quad \forall i,j = 1,\ldots,M$  (24)

where F is the number of DFT bins. The affinity matrix $\mathbf{A}$ of the spectral clustering is defined element-wise as

$\mathbf{A}(i,j) = \exp\left(-\frac{|\mathbf{K}(i,j) - \mathbf{I}(i,j)|}{\sigma}\right), \quad \forall i,j = 1,\ldots,M$  (25)

where $\mathbf{I}$ is the identity matrix, and $\sigma$ is a hyperparameter with a default value of 1. Following the Laplacian eigenvalue decomposition (Ng et al., 2001) of $\mathbf{A}$, it obtains a $J \times M$-dimensional representation of the channels, $\mathbf{U} = [\mathbf{u}_1,\ldots,\mathbf{u}_M]$, where $\mathbf{u}_i$ is the representation of the i-th microphone and J denotes the dimension of the representation.

"Learning-N-best" conducts agglomerative hierarchical clustering on $\mathbf{U}$, and takes the maximal lifetime of the dendrogram as the threshold to partition the microphones into B clusters ($1 \le B \le M$), denoted as $\mathcal{U}_1,\ldots,\mathcal{U}_B$. The maximum predicted SNRs of the microphones in the clusters are denoted as $q'_1,\ldots,q'_B$ respectively. Finally, it groups the microphones that satisfy the following condition into a local microphone array:

$p_i = \begin{cases}1, & \text{if } \mathbf{u}_i \in \mathcal{U}_b \text{ and } \frac{q'_b}{q'^*}\cdot\frac{1-q'^*}{1-q'_b} > \gamma\\ 0, & \text{otherwise}\end{cases} \quad \forall i = 1,\ldots,M, \ \forall b = 1,\ldots,B$  (26)

where $q'^* = \max_{1\le b\le B} q'_b$.
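A condensed sketch of the learning-N-best procedure of Eqs. (22) to (26) follows, using an Ng-style normalized spectral embedding and SciPy's agglomerative clustering. The cut of the dendrogram at its largest lifetime gap and the default J are our reading of the text, so treat this as one plausible implementation rather than the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def learning_n_best(Y, q, gamma, sigma=1.0, J=None):
    """Y: (F, T, M) synchronized multichannel STFT; q: (M,) predicted weights."""
    F, T, M = Y.shape
    K = np.zeros((M, M))
    for f in range(F):
        phi = Y[f].T @ np.conj(Y[f])                 # Eq. (22): M x M covariance
        d = np.sqrt(np.real(np.diag(phi))) + 1e-12
        K += np.abs(phi) / np.outer(d, d)            # Eq. (23): normalized amplitude
    K /= F                                           # Eq. (24)
    A = np.exp(-np.abs(K - np.eye(M)) / sigma)       # Eq. (25)
    D = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    _, V = np.linalg.eigh(D @ A @ D)                 # Ng et al. (2001) embedding
    J = J or max(M // 2, 1)                          # default J is an assumption
    U = V[:, -J:]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    Z = linkage(U, method='average')                 # agglomerative clustering
    lifetimes = np.diff(Z[:, 2])
    cut = Z[np.argmax(lifetimes) + 1, 2] if len(lifetimes) else np.inf
    labels = fcluster(Z, t=cut - 1e-12, criterion='distance')
    q_best = np.array([q[labels == b].max() for b in range(1, labels.max() + 1)])
    q_star = q_best.max()
    keep = (q_best / q_star) * ((1 - q_star) / (1 - q_best + 1e-12)) > gamma
    return np.array([1.0 if keep[labels[i] - 1] else 0.0 for i in range(M)])  # Eq. (26)
```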
6. Speech enhancement: An application case
After getting the synchronized signals $\mathbf{y}(t,f) = [y_1(t,f),\ldots,y_N(t,f)]^T$, we may use existing multichannel signal processing techniques, directly or with slight modification, for a specific application. Here we use a deep beamforming algorithm (Heymann et al., 2016) directly for speech enhancement as an example.

The deep beamforming algorithm finds a linear estimator $\mathbf{w}_{\mathrm{opt}}(f)$ to filter $\mathbf{y}(t,f)$ by the following equation:

$\hat{x}_{\mathrm{ref}}(t,f) = \mathbf{w}^H_{\mathrm{opt}}(f)\mathbf{y}(t,f)$  (27)

where $\hat{x}_{\mathrm{ref}}(t,f)$ is an estimate of the direct sound at the reference microphone of the array. For example, MVDR finds $\mathbf{w}_{\mathrm{opt}}$ by minimizing the average output power of the beamformer while maintaining the energy along the target direction:

$\min_{\mathbf{w}(f)} \ \mathbf{w}^H(f)\boldsymbol{\Phi}_{\mathbf{nn}}(f)\mathbf{w}(f) \quad \text{subject to} \quad \mathbf{w}^H(f)\mathbf{c}(f) = 1$  (28)

where $\boldsymbol{\Phi}_{\mathbf{nn}}(f)$ is the $M \times M$-dimensional cross-channel covariance matrix of the received noise signal $\mathbf{n}(f)$. Problem (28) has a closed-form solution:

$\mathbf{w}_{\mathrm{opt}}(f) = \frac{\widehat{\boldsymbol{\Phi}}^{-1}_{\mathbf{nn}}(f)\hat{\mathbf{c}}(f)}{\hat{\mathbf{c}}^H(f)\widehat{\boldsymbol{\Phi}}^{-1}_{\mathbf{nn}}(f)\hat{\mathbf{c}}(f)}$  (29)

where $\widehat{\boldsymbol{\Phi}}_{\mathbf{nn}}(f)$ and $\hat{\mathbf{c}}(f)$ are the estimates of $\boldsymbol{\Phi}_{\mathbf{nn}}(f)$ and $\mathbf{c}(f)$ respectively, derived by the following equations according to (Heymann et al., 2015; Zhang et al., 2017; Wang & Wang, 2018):

$\widehat{\boldsymbol{\Phi}}_{\mathbf{xx}}(f) = \frac{1}{\sum_t \eta(t,f)}\sum_t \eta(t,f)\,\mathbf{y}(t,f)\mathbf{y}(t,f)^H$  (30)

$\widehat{\boldsymbol{\Phi}}_{\mathbf{nn}}(f) = \frac{1}{\sum_t \xi(t,f)}\sum_t \xi(t,f)\,\mathbf{y}(t,f)\mathbf{y}(t,f)^H$  (31)

$\hat{\mathbf{c}}(f) = \mathrm{principal}\left(\widehat{\boldsymbol{\Phi}}_{\mathbf{xx}}(f)\right)$  (32)

where $\widehat{\boldsymbol{\Phi}}_{\mathbf{xx}}(f)$ is an estimate of the covariance matrix of the direct sound $\mathbf{x}(t,f)$, $\mathrm{principal}(\cdot)$ is a function returning the first principal component of the input square matrix, and $\eta(t,f)$ and $\xi(t,f)$ are defined as the products of the individually estimated T-F masks:

$\eta(t,f) = \prod_{i=1}^M \widehat{\mathrm{IRM}}_i(t,f)$  (33)

$\xi(t,f) = \prod_{i=1}^M \left(1 - \widehat{\mathrm{IRM}}_i(t,f)\right)$  (34)

Note that, in our experiments, when we calculate $\eta(t,f)$ and $\xi(t,f)$, we take all channels of the ad-hoc array into consideration, which empirically results in a slight performance improvement over taking only the selected channels into the calculation.
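The mask-based MVDR beamformer of Eqs. (27) to (34) can be sketched in NumPy as follows, assuming Y holds the synchronized multichannel STFT and masks holds DNN1's estimated IRMs; the diagonal loading eps and all names are our own additions for numerical stability and illustration.

```python
import numpy as np

def mvdr_from_masks(Y, masks, eps=1e-6):
    """Y: (F, T, N) synchronized multichannel STFT; masks: (N, F, T) estimated
    IRMs. Returns the (F, T) STFT of the enhanced reference signal, Eq. (27)."""
    F, T, N = Y.shape
    eta = np.prod(masks, axis=0)                   # Eq. (33): speech T-F weights
    xi = np.prod(1.0 - masks, axis=0)              # Eq. (34): noise T-F weights
    X_hat = np.zeros((F, T), dtype=complex)
    load = eps * np.eye(N)                         # diagonal loading (assumption)
    for f in range(F):
        Yf = Y[f]                                  # (T, N)
        phi_xx = (eta[f][:, None] * Yf).T @ np.conj(Yf) / (eta[f].sum() + eps)  # Eq. (30)
        phi_nn = (xi[f][:, None] * Yf).T @ np.conj(Yf) / (xi[f].sum() + eps)    # Eq. (31)
        _, vecs = np.linalg.eigh(phi_xx)
        c_hat = vecs[:, -1]                        # Eq. (32): principal component
        num = np.linalg.solve(phi_nn + load, c_hat)
        w = num / (np.conj(c_hat) @ num + eps)     # Eq. (29): MVDR weights
        X_hat[f] = np.conj(w) @ Yf.T               # Eq. (27): w^H y(t, f)
    return X_hat
```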
7. Experiments
In this section, we study the effectiveness of DAB in diffuse noise and point source noise environments under the situation where the output signals of the channels have random time delays caused by devices. Specifically, we first present the experimental settings in Section 7.1, then present the experimental results in the diffuse noise and point source noise environments in Section 7.2, and finally discuss the effects of hyperparameter settings on performance in Sections 7.3 and 7.4.

7.1. Experimental settings

The clean speech was generated from the TIMIT corpus. We randomly selected half of the training speakers to construct the database for training DNN1, and the remaining half for training DNN2. We used all test speakers for the test. The noise source for the training database was a large-scale sound effect library which contains over 20,000 sound effects. The additive noise for the test database was the babble, factory1, and volvo noise respectively from the NOISEX-92 database.

Training data:
We simulated a rectangular room for each training utterance. The length and width of the room were generated randomly from a range of [5, 30] meters. The height was generated randomly from [2.?, 4] meters. The reverberant environment was simulated by an image-source model (https://github.com/ehabets/RIR-Generator). Its T60 was selected randomly from a range of [0, 1] second. A speech source, a noise source, and a single microphone were placed randomly in the room. The SNR, which is the energy ratio between the speech and the noise at the locations of their sources, was randomly selected from a range of [−?, 20] dB. We synthesized 50,000 noisy utterances to train DNN1, and 100,000 noisy utterances to train DNN2.
Test data:
We constructed a rectangular room for each test utterance. The length, width, and height of the room were randomly generated from [10, ?], [?, ?], and [?, ?.5] meters respectively. The additive noise is assumed to be either diffuse noise or point source noise.

For the diffuse noise environment, a speech source and a microphone array were placed randomly in the room. The T60 for generating the reverberant speech was selected randomly from a range of [0.?, 0.8] second. To simulate uncorrelated diffuse noise, the noise segments at different microphones do not overlap, and they were added directly to the reverberant speech at the microphone receivers without reverberation. The noise power at the locations of all microphones was maintained at the same level, which was calculated from the SNR of the direct sound over the additive noise at a place 1 meter away from the speech source, denoted as the SNR at the origin (SNRatO). Note that the SNRs at different microphones were different due to the energy decay of the speech signal during its propagation. The SNRatO was selected from 10, 15, and 20 dB respectively. We generated 1,000 test utterances for each SNRatO, each noise type, and each kind of microphone array, which amounts to 9 test scenarios and 18,000 test utterances.

For the point source noise environment, a speech source, a point noise source, and a microphone array were placed randomly in the room. The T60 of the room was selected randomly from a range of [0.?, 0.8] second for generating the reverberant speech and the reverberant noise at the microphone receivers. The SNRatO was defined as the log ratio of the speech power over the noise power at their respective source locations. It was chosen from {−5, 5, 15} dB. As in the diffuse noise environment, we also generated 1,000 test utterances for each SNRatO, each noise type, and each kind of microphone array.

For both of the test environments, we generated a random time delay τ from a range of [0, 0.5] second at each microphone of an ad-hoc microphone array to simulate the time delay caused by devices.
Comparison methods:
The baseline is the MVDR-based DB (Heymann et al., 2016) with a linear array of 16 microphones, which is described in Section 6. All DB models employed DNN1 for the single-channel noise estimation. The aperture size of the linear microphone array (i.e. the distance between two neighboring microphones) was set to 10 centimeters. DAB also employed an ad-hoc array of 16 microphones. We denote the DAB with different channel selection algorithms as:

• DAB+1-best.
• DAB+all-channels.
• DAB+fixed-N-best. We set N = √M.
• DAB+auto-N-best. We set γ to a fixed value (its effect is studied in Section 7.4).
• DAB+soft-N-best. We set γ to a fixed value.
• DAB+learning-N-best. We set J proportional to M, σ = 1, and γ to a fixed value.

To study the effectiveness of the time synchronization (TS) module, we further compared the following systems:

• DAB+channel selection method. It does not use the TS module.
• DAB+channel selection method+GT. It uses the ground-truth (GT) time delay caused by the devices to synchronize the microphones.
• DAB+channel selection method+TS. It uses the TS module to estimate the time delay caused by both the different locations of the microphones and the different devices on which the microphones are installed.

We implemented the above comparison methods with the different channel selection algorithms. For example, if the channel selection algorithm is "auto-N-best", then the comparison systems are "DAB+auto-N-best", "DAB+auto-N-best+GT", and "DAB+auto-N-best+TS" respectively.
For each comparison method, we set the framelength and frame shift to 32 and 16 milliseconds respectively,and extracted 257-dimensional STFT features. We used thesame DNN1 for DS, DB, and DAB. DNN1 is a standard feed-forward DNN. It contains two hidden layers. Each hidden layerhas 1024 hidden units. The activation functions of the hiddenunits and output units are rectified linear unit and sigmoid func-tion, respectively. The number of epochs was set to 50. Thebatch size was set to 512. The scaling factor for the adaptivestochastic gradient descent was set to 0.0015, and the learningrate decreased linearly from 0.08 to 0.001. The momentum ofthe first 5 epochs was set to 0.5, and the momentum of otherepochs was set to 0.9. A contextual window was used to ex-pand each input frame to its context along the time axis. Thewindow size was set to 7. DNN2 has the same parameter set-ting as DNN1 except that DNN2 does not need a contextualwindow and was trained with a batch size of 32. All DNNswere well-tuned. Note that although bi-directional long short-term memory may lead to better performance, we simply usedthe feedforward DNN since the type of the DNN models is notthe focus of this paper.
Evaluation metrics:
The performance evaluation metrics in-clude STOI (Taal et al., 2011), perceptual evaluation of speechquality (PESQ) (Rix et al., 2001), and signal to distortion ra-tio (SDR) (Vincent et al., 2006). STOI evaluates the objectivespeech intelligibility of time-domain signals. It has been shownempirically that STOI scores are well correlated with humanspeech intelligibility scores (Wang et al., 2014; Du et al., 2014;Huang et al., 2015; Zhang & Wang, 2016). PESQ is a testmethodology for automated assessment of the speech qualityas experienced by a listener of a telephony system. SDR is ametric similar to SNR for evaluating the quality of enhance-ment. The higher the value of an evaluation metric is, the betterthe performance is.
The performance evaluation metrics include STOI (Taal et al., 2011), the perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), and the signal-to-distortion ratio (SDR) (Vincent et al., 2006). STOI evaluates the objective speech intelligibility of time-domain signals; it has been shown empirically that STOI scores are well correlated with human speech intelligibility scores (Wang et al., 2014; Du et al., 2014; Huang et al., 2015; Zhang & Wang, 2016). PESQ is a test methodology for the automated assessment of speech quality as experienced by a listener of a telephony system. SDR is a metric similar to SNR for evaluating the quality of the enhancement. The higher the value of an evaluation metric, the better the performance.

7.2. Results in the diffuse noise and point source noise environments

We list the performance of the comparison methods in the diffuse noise and point source noise environments in Tables 1 and 2 respectively. From the tables, we see that the DAB variants given the TS module or the ground-truth time delay outperform the DB baseline significantly in terms of all evaluation metrics. Even the simplest "DAB+1-best" outperforms DB.

Figure 4: Examples of channel-selection results in the babble point source noise environment at the SNRatO of −5 dB, in three rooms of different sizes ((b) is 13 meters long and (c) is 14 meters long). Each panel plots the speech source, the point noise source, the microphones, and the microphones selected by one channel-selection algorithm, together with the resulting STOI score.

We compare the DAB variants with the TS module or the ground-truth time delay to study the effectiveness of the channel selection algorithms. We find that "auto-N-best" performs the best among the channel selection algorithms in most cases, followed by "soft-N-best". The "learning-N-best" and "fixed-N-best" algorithms perform equally well in general, and both perform better than the "all-channels" algorithm. Although "1-best" performs the poorest in terms of STOI and PESQ, it usually produces good SDR scores that are comparable to those produced by "auto-N-best". Note that although "learning-N-best" seems an advanced technique, this advantage does not transfer to superior performance. This may be caused by (26), which is an expansion of the channel selection result of "auto-N-best". This problem needs further investigation in the future. Comparing "auto-N-best" and "soft-N-best", we further find that the different amplitude ranges of the channels affect the performance, though this phenomenon is not very obvious, because the nonzero weights do not vary over a large range. As a byproduct, the idea of "soft-N-best" is a way of synchronizing the adaptive gain controllers of the devices. The synchronization of adaptive gain controllers is not the focus of this paper, hence we leave it for future study.

We compare "DAB+channel selection method", "DAB+channel selection method+GT", and "DAB+channel selection method+TS" given the different channel selection methods to study the effectiveness of the time synchronization module. We find that DAB without the TS module does not work at all when there exists a serious time unsynchronization problem caused by devices. "DAB+channel selection method+TS" performs better than "DAB+channel selection method+GT" in terms of STOI, and is comparably good in terms of PESQ and SDR at all SNRatO levels, even though the latter was given the ground-truth time delay caused by the devices.
This phenomenon demonstrates the effectiveness of the proposed TS module. It also implies that the time unsynchronization caused by the different locations of the microphones affects the performance, though not seriously.

Figure 4 shows three examples of the channel selection results, in which we find some interesting phenomena after looking into the details. Figure 4a is a typical scenario where the speech source is far away from the point noise source. We see clearly from the figure that, although "1-best" is much poorer than the other algorithms, all comparison algorithms perform reasonably well according to the absolute STOI scores, since the SNRs at many selected microphones are relatively high. Figure 4b is a special scenario where the speech source is very close to the point noise source; therefore, the SNRs at all microphones are low. As shown in the figure, it is better to select most microphones, as "auto-N-best" and "learning-N-best" do in this case; otherwise the performance is rather poor, as "1-best" shows. Figure 4c is a special scenario where one microphone is very close to the speech source. The channel selection results show that the best strategy is to select the closest microphone, while "all-channels" performs much more poorly than the other channel selection algorithms. To summarize the above phenomena, the adaptive channel selection algorithms, i.e. "auto-N-best" and "learning-N-best", always produce top performance among the comparison algorithms.

Table 1: Results with 16 microphones per array in diffuse noise environments. Each cell lists STOI / PESQ / SDR (dB).

SNRatO = 10 dB:
Method | Babble | Factory | Volvo
Noisy | 0.5989 / 1.86 / 1.12 | 0.5969 / 1.80 / 1.20 | 0.6785 / 2.10 / 1.62
DB | 0.6911 / 1.87 / 2.75 | 0.6900 / 1.86 / 3.42 | 0.7766 / 2.16 / 3.95
DAB (1-best) | 0.7154 / 2.06 / 5.14 | 0.7143 / 2.00 / 5.13 | 0.7892 / 2.31 / 5.23
DAB (all-channels) | 0.5824 / 1.83 / -0.93 | 0.5760 / 1.78 / -1.55 | 0.6061 / 1.88 / -1.49
DAB (all-channels+GT) | 0.7206 / 2.06 / 4.40 | 0.7137 / 2.00 / 4.47 | 0.7831 / 2.39 / 4.42
DAB (all-channels+TS) | 0.7405 / 2.00 / 3.49 | 0.7388 / 1.95 / 3.54 | 0.8039 / 2.32 / 3.25
DAB (fixed-N-best) | 0.6026 / 1.87 / -0.12 | 0.6022 / 1.82 / -0.33 | 0.6351 / 1.92 / -0.32
DAB (fixed-N-best+GT) | 0.7451 / 2.12 / 5.10 | 0.7437 / 2.07 / 5.16 | 0.8117 / 2.42 / 5.54
DAB (fixed-N-best+TS) | 0.7675 / 2.11 / 5.18 | 0.7634 / 2.05 / 5.01 | 0.8460 / 2.43 / 5.97
DAB (auto-N-best) | 0.5982 / 1.87 / -0.13 | 0.5927 / 1.83 / -0.61 | 0.6573 / 2.00 / 0.82
DAB (auto-N-best+GT) | 0.7531 / 2.14 / 5.74 | 0.7518 / 2.09 / 5.73 | 0.8164 / 2.46 / 5.97
DAB (auto-N-best+TS) | 0.7696 / 2.12 / 5.45 | 0.7641 / 2.06 / 5.36 | 0.8405 / 2.44 / 5.85
DAB (soft-N-best) | 0.5999 / 1.85 / -0.28 | 0.5952 / 1.83 / -0.76 | 0.6645 / 2.01 / 0.84
DAB (soft-N-best+GT) | 0.7463 / 2.13 / 5.22 | 0.7455 / 2.07 / 5.26 | 0.8055 / 2.42 / 5.50
DAB (soft-N-best+TS) | 0.7659 / 2.12 / 5.11 | 0.7610 / 2.05 / 5.09 | 0.8363 / 2.43 / 5.65
DAB (learning-N-best) | 0.5973 / 1.86 / -0.29 | 0.5892 / 1.81 / -0.99 | 0.6488 / 1.98 / 0.21
DAB (learning-N-best+GT) | 0.7405 / 2.12 / 5.22 | 0.7387 / 2.06 / 5.34 | 0.8026 / 2.43 / 5.35
DAB (learning-N-best+TS) | 0.7631 / 2.07 / 4.55 | 0.7606 / 2.02 / 4.59 | 0.8330 / 2.41 / 4.97

SNRatO = 15 dB:
Method | Babble | Factory | Volvo
Noisy | 0.6410 / 1.97 / 3.05 | 0.6400 / 1.93 / 2.79 | 0.6847 / 2.10 / 2.87
DB | 0.7350 / 2.02 / 4.37 | 0.7396 / 1.99 / 4.61 | 0.7804 / 2.19 / 4.86
DAB (1-best) | 0.7496 / 2.17 / 6.59 | 0.7527 / 2.14 / 6.45 | 0.7906 / 2.31 / 6.58
DAB (all-channels) | 0.5977 / 1.85 / -0.52 | 0.5990 / 1.84 / -0.87 | 0.6102 / 1.88 / -0.75
DAB (all-channels+GT) | 0.7575 / 2.22 / 5.45 | 0.7588 / 2.18 / 5.39 | 0.7887 / 2.42 / 5.23
DAB (all-channels+TS) | 0.7809 / 2.15 / 4.53 | 0.7869 / 2.12 / 4.54 | 0.8091 / 2.35 / 4.32
DAB (fixed-N-best) | 0.6218 / 1.89 / 0.27 | 0.6189 / 1.88 / 0.03 | 0.6463 / 1.93 / 0.33
DAB (fixed-N-best+GT) | 0.7788 / 2.26 / 6.15 | 0.7832 / 2.22 / 6.11 | 0.8172 / 2.43 / 6.37
DAB (fixed-N-best+TS) | 0.8074 / 2.26 / 6.65 | 0.8095 / 2.21 / 6.36 | 0.8518 / 2.44 / 7.04
DAB (auto-N-best) | 0.6188 / 1.90 / 0.44 | 0.6142 / 1.87 / -0.13 | 0.6641 / 2.00 / 1.44
DAB (auto-N-best+GT) | 0.7877 / 2.30 / 6.83 | 0.7946 / 2.26 / 6.92 | 0.8219 / 2.48 / 6.85
DAB (auto-N-best+TS) | 0.8082 / 2.27 / 6.82 | 0.8140 / 2.23 / 6.81 | 0.8476 / 2.47 / 6.93
DAB (soft-N-best) | 0.6179 / 1.90 / 0.20 | 0.6140 / 1.88 / -0.29 | 0.6625 / 2.00 / 1.13
DAB (soft-N-best+GT) | 0.7792 / 2.27 / 6.10 | 0.7858 / 2.24 / 6.18 | 0.8094 / 2.43 / 6.06
DAB (soft-N-best+TS) | 0.8045 / 2.26 / 6.35 | 0.8090 / 2.22 / 6.28 | 0.8428 / 2.46 / 6.54
DAB (learning-N-best) | 0.6187 / 1.90 / 0.31 | 0.6154 / 1.87 / -0.24 | 0.6562 / 1.98 / 0.93
DAB (learning-N-best+GT) | 0.7768 / 2.27 / 6.40 | 0.7799 / 2.24 / 6.30 | 0.8096 / 2.46 / 6.35
DAB (learning-N-best+TS) | 0.8049 / 2.24 / 5.98 | 0.8100 / 2.20 / 5.90 | 0.8394 / 2.45 / 5.98

SNRatO = 20 dB:
Method | Babble | Factory | Volvo
Noisy | 0.6622 / 2.03 / 3.81 | 0.6653 / 2.01 / 3.73 | 0.6860 / 2.12 / 3.71
DB | 0.7539 / 2.09 / 4.90 | 0.7619 / 2.09 / 5.23 | 0.7792 / 2.21 / 5.29
DAB (1-best) | 0.7768 / 2.25 / 7.25 | 0.7790 / 2.25 / 7.25 | 0.7967 / 2.31 / 7.13
DAB (all-channels) | 0.6196 / 1.88 / -0.31 | 0.6212 / 1.89 / -0.19 | 0.6213 / 1.89 / -0.50
DAB (all-channels+GT) | 0.7784 / 2.33 / 5.74 | 0.7834 / 2.32 / 5.68 | 0.7937 / 2.43 / 5.44
DAB (all-channels+TS) | 0.8057 / 2.27 / 5.21 | 0.8113 / 2.25 / 5.02 | 0.8161 / 2.38 / 4.72
DAB (fixed-N-best) | 0.6583 / 1.96 / 1.03 | 0.6487 / 1.95 / 0.72 | 0.6553 / 1.94 / 0.57
DAB (fixed-N-best+GT) | 0.8011 / 2.35 / 6.60 | 0.8046 / 2.34 / 6.33 | 0.8183 / 2.42 / 6.65
DAB (fixed-N-best+TS) | 0.8352 / 2.37 / 7.36 | 0.8346 / 2.34 / 7.08 | 0.8551 / 2.45 / 7.47
DAB (auto-N-best) | 0.6632 / 1.99 / 1.78 | 0.6504 / 1.97 / 1.20 | 0.6816 / 2.03 / 2.19
DAB (auto-N-best+GT) | 0.8098 / 2.40 / 7.28 | 0.8134 / 2.39 / 7.10 | 0.8257 / 2.48 / 7.21
DAB (auto-N-best+TS) | 0.8361 / 2.39 / 7.67 | 0.8383 / 2.36 / 7.21 | 0.8515 / 2.48 / 7.35
DAB (soft-N-best) | 0.6610 / 1.99 / 1.52 | 0.6479 / 1.97 / 0.95 | 0.6810 / 2.03 / 2.00
DAB (soft-N-best+GT) | 0.8018 / 2.37 / 6.70 | 0.8034 / 2.36 / 6.30 | 0.8164 / 2.45 / 6.59
DAB (soft-N-best+TS) | 0.8317 / 2.38 / 7.16 | 0.8338 / 2.35 / 6.74 | 0.8473 / 2.46 / 7.01
DAB (learning-N-best) | 0.6564 / 1.97 / 1.30 | 0.6491 / 1.96 / 0.93 | 0.6733 / 2.00 / 1.66
DAB (learning-N-best+GT) | 0.7968 / 2.38 / 6.74 | 0.8006 / 2.37 / 6.53 | 0.8139 / 2.47 / 6.57
DAB (learning-N-best+TS) | 0.8309 / 2.36 / 6.74 | 0.8338 / 2.33 / 6.37 | 0.8447 / 2.47 / 6.43
Table 2: Results with 16 microphones per array in point source noise environments. Each cell lists STOI / PESQ / SDR (dB).

SNRatO = -5 dB:
Method | Babble | Factory | Volvo
Noisy | 0.4465 / 1.29 / -6.75 | 0.4336 / 1.19 / -6.08 | 0.6286 / 1.90 / -0.20
DB | 0.5429 / 1.63 / -3.50 | 0.5250 / 1.51 / -2.22 | 0.7406 / 2.04 / 3.82
DAB (1-best) | 0.5741 / 1.73 / -1.96 | 0.5512 / 1.59 / -1.52 | 0.7647 / 2.25 / 5.16
DAB (all-channels) | 0.4246 / 1.95 / -8.97 | 0.4194 / 1.73 / -8.10 | 0.5106 / 1.71 / -3.70
DAB (all-channels+GT) | 0.5756 / 1.70 / -2.41 | 0.5487 / 1.50 / -2.07 | 0.7424 / 2.22 / 3.95
DAB (all-channels+TS) | 0.5954 / 1.70 / -2.43 | 0.5488 / 1.50 / -2.20 | 0.7775 / 2.22 / 3.30
DAB (fixed-N-best) | 0.4665 / 1.82 / -6.69 | 0.4619 / 1.66 / -5.83 | 0.5745 / 1.79 / -1.85
DAB (fixed-N-best+GT) | 0.5891 / 1.74 / -1.99 | 0.5619 / 1.58 / -1.68 | 0.7736 / 2.30 / 5.16
DAB (fixed-N-best+TS) | 0.6065 / 1.74 / -1.65 | 0.5692 / 1.58 / -1.25 | 0.8124 / 2.33 / 5.57
DAB (auto-N-best) | 0.4753 / 1.93 / -6.60 | 0.4547 / 1.75 / -6.48 | 0.5773 / 1.86 / -1.18
DAB (auto-N-best+GT) | 0.6029 / 1.76 / -1.26 | 0.5707 / 1.55 / -1.21 | 0.7745 / 2.32 / 5.55
DAB (auto-N-best+TS) | 0.6160 / 1.75 / -1.24 | 0.5696 / 1.55 / -1.23 | 0.8047 / 2.32 / 5.21
DAB (soft-N-best) | 0.4806 / 1.95 / -6.26 | 0.4601 / 1.76 / -6.08 | 0.5822 / 1.87 / -1.05
DAB (soft-N-best+GT) | 0.6035 / 1.77 / -1.15 | 0.5725 / 1.57 / -1.05 | 0.7681 / 2.29 / 5.08
DAB (soft-N-best+TS) | 0.6164 / 1.76 / -1.15 | 0.5719 / 1.58 / -1.10 | 0.8013 / 2.32 / 4.92
DAB (learning-N-best) | 0.4606 / 1.92 / -7.34 | 0.4465 / 1.75 / -6.94 | 0.5654 / 1.83 / -1.80
DAB (learning-N-best+GT) | 0.5915 / 1.73 / -1.73 | 0.5621 / 1.53 / -1.54 | 0.7617 / 2.29 / 4.91
DAB (learning-N-best+TS) | 0.6086 / 1.72 / -1.74 | 0.5604 / 1.53 / -1.72 | 0.7967 / 2.29 / 4.39

SNRatO = 5 dB:
Method | Babble | Factory | Volvo
Noisy | 0.5678 / 1.68 / 0.11 | 0.5607 / 1.61 / 0.50 | 0.6550 / 1.99 / 2.69
DB | 0.6975 / 1.87 / 2.80 | 0.6856 / 1.85 / 3.19 | 0.7695 / 2.14 / 4.78
DAB (1-best) | 0.7232 / 2.05 / 5.22 | 0.7187 / 2.00 / 5.53 | 0.7939 / 2.31 / 7.42
DAB (all-channels) | 0.4942 / 1.79 / -3.59 | 0.4806 / 1.74 / -3.62 | 0.5207 / 1.74 / -2.69
DAB (all-channels+GT) | 0.7263 / 2.12 / 4.71 | 0.7168 / 2.06 / 4.98 | 0.7770 / 2.37 / 5.71
DAB (all-channels+TS) | 0.7602 / 2.11 / 4.22 | 0.7481 / 2.05 / 4.31 | 0.8075 / 2.33 / 4.79
DAB (fixed-N-best) | 0.5522 / 1.80 / -1.80 | 0.5421 / 1.74 / -1.82 | 0.5889 / 1.83 / -0.92
DAB (fixed-N-best+GT) | 0.7473 / 2.14 / 5.28 | 0.7425 / 2.09 / 5.55 | 0.8117 / 2.42 / 7.22
DAB (fixed-N-best+TS) | 0.7768 / 2.15 / 5.61 | 0.7734 / 2.11 / 5.81 | 0.8492 / 2.44 / 7.60
DAB (auto-N-best) | 0.5820 / 1.89 / -0.36 | 0.5901 / 1.86 / 0.39 | 0.6118 / 1.93 / 0.55
DAB (auto-N-best+GT) | 0.7601 / 2.17 / 5.94 | 0.7568 / 2.12 / 6.29 | 0.8187 / 2.46 / 7.68
DAB (auto-N-best+TS) | 0.7835 / 2.17 / 6.02 | 0.7751 / 2.12 / 6.32 | 0.8455 / 2.45 / 7.54
DAB (soft-N-best) | 0.5834 / 1.90 / -0.47 | 0.5915 / 1.85 / 0.25 | 0.6122 / 1.93 / 0.42
DAB (soft-N-best+GT) | 0.7534 / 2.16 / 5.48 | 0.7510 / 2.10 / 5.83 | 0.8081 / 2.42 / 6.85
DAB (soft-N-best+TS) | 0.7797 / 2.16 / 5.61 | 0.7714 / 2.11 / 5.91 | 0.8420 / 2.44 / 7.09
DAB (learning-N-best) | 0.5605 / 1.86 / -1.27 | 0.5617 / 1.82 / -0.86 | 0.5975 / 1.90 / -0.09
DAB (learning-N-best+GT) | 0.7462 / 2.15 / 5.37 | 0.7397 / 2.09 / 5.71 | 0.8034 / 2.43 / 6.95
DAB (learning-N-best+TS) | 0.7806 / 2.16 / 5.40 | 0.7672 / 2.10 / 5.38 | 0.8378 / 2.43 / 6.57

SNRatO = 15 dB:
Method | Babble | Factory | Volvo
Noisy | 0.6394 / 1.92 / 2.71 | 0.6405 / 1.90 / 2.76 | 0.6700 / 2.02 / 3.16
DB | 0.7534 / 2.11 / 4.85 | 0.7596 / 2.10 / 5.01 | 0.7767 / 2.21 / 5.21
DAB (1-best) | 0.7868 / 2.26 / 7.48 | 0.7886 / 2.23 / 7.39 | 0.8024 / 2.32 / 7.63
DAB (all-channels) | 0.5215 / 1.76 / -2.52 | 0.5152 / 1.73 / -2.68 | 0.5183 / 1.76 / -2.69
DAB (all-channels+GT) | 0.7770 / 2.36 / 6.16 | 0.7763 / 2.33 / 6.12 | 0.7871 / 2.43 / 5.98
DAB (all-channels+TS) | 0.8173 / 2.35 / 5.77 | 0.8189 / 2.31 / 5.66 | 0.8173 / 2.41 / 5.14
DAB (fixed-N-best) | 0.5924 / 1.82 / -0.69 | 0.5886 / 1.80 / -0.78 | 0.5911 / 1.83 / -0.81
DAB (fixed-N-best+GT) | 0.8015 / 2.35 / 6.81 | 0.7999 / 2.32 / 6.77 | 0.8177 / 2.44 / 7.23
DAB (fixed-N-best+TS) | 0.8434 / 2.40 / 7.68 | 0.8419 / 2.35 / 7.54 | 0.8591 / 2.48 / 7.87
DAB (auto-N-best) | 0.6503 / 2.01 / 2.22 | 0.6042 / 1.89 / 0.39 | 0.6602 / 2.03 / 2.39
DAB (auto-N-best+GT) | 0.8156 / 2.40 / 7.93 | 0.8088 / 2.38 / 7.48 | 0.8292 / 2.46 / 7.94
DAB (auto-N-best+TS) | 0.8405 / 2.41 / 8.08 | 0.8422 / 2.37 / 7.50 | 0.8502 / 2.47 / 8.12
DAB (soft-N-best) | 0.6499 / 2.01 / 2.13 | 0.6042 / 1.90 / 0.30 | 0.6595 / 2.03 / 2.27
DAB (soft-N-best+GT) | 0.8088 / 2.39 / 7.44 | 0.8009 / 2.35 / 6.84 | 0.8226 / 2.44 / 7.45
DAB (soft-N-best+TS) | 0.8379 / 2.40 / 7.79 | 0.8385 / 2.37 / 7.09 | 0.8477 / 2.46 / 7.85
DAB (learning-N-best) | 0.6272 / 1.95 / 1.25 | 0.5862 / 1.85 / -0.33 | 0.6340 / 1.98 / 1.18
DAB (learning-N-best+GT) | 0.8028 / 2.39 / 7.30 | 0.7966 / 2.36 / 6.97 | 0.8148 / 2.45 / 7.28
DAB (learning-N-best+TS) | 0.8415 / 2.42 / 7.28 | 0.8392 / 2.37 / 6.89 | 0.8500 / 2.49 / 7.22

7.3. Effect of the number of the microphones in an array

To study how the number of the microphones in an array affects the performance, we repeated the experimental setting in Section 7.1, except that the number of the microphones in an array was reduced to 4. Because the experimental phenomena were consistent across the different SNRatO levels and noise types, we list the comparison results of only one test scenario in Tables 3 and 4 to save space. From the tables, we see that, even when the number of microphones in an array is limited, the DAB variants still perform equally well with DB, except "DAB+1-best". Take the babble diffuse noise environment as an example: the STOI score of "DAB+auto-N-best+TS" improves by a relative 20.22% when the number of the microphones is increased from 4 to 16, while the relative improvement of DB is only 2.56%.
Table 3: Results with 4 microphones per array in the babble diffuse noise environment at an SNRatO of 10 dB.

Method | STOI | PESQ | SDR (dB)
Noisy | 0.5919 | 1.80 | 0.99
DB | 0.6830 | 1.91 | 3.14
DAB (1-best) | 0.6400 | 1.86 | 2.38
DAB (all-channels+TS) | 0.7154 | 1.92 | 2.70
DAB (fixed-N-best+TS) | 0.6821 | 1.90 | 2.58
DAB (auto-N-best+TS) | 0.7112 | 1.93 | 3.01
DAB (soft-N-best+TS) | 0.7013 | 1.91 | 2.27
DAB (learning-N-best+TS) | 0.7135 | 1.92 | 2.82
Table 4: Results with 4 microphones per array in the babble point source noise environment at an SNRatO of −5 dB.

Method | STOI | PESQ | SDR (dB)
Noisy | 0.4576 | 1.56 | -6.14
DB | 0.5079 | 1.47 | -5.70
DAB (1-best) | 0.4996 | 1.57 | -5.10
DAB (all-channels+TS) | 0.5056 | 1.49 | -6.80
DAB (fixed-N-best+TS) | 0.5068 | 1.55 | -5.72
DAB (auto-N-best+TS) | 0.5111 | 1.51 | -6.28
DAB (soft-N-best+TS) | 0.5140 | 1.53 | -6.19
DAB (learning-N-best+TS) | 0.5064 | 1.50 | -6.67

7.4. Effect of hyperparameter γ

To study how the hyperparameter γ affects the performance of "DAB+auto-N-best+TS", "DAB+soft-N-best+TS", and "DAB+learning-N-best+TS", we tuned γ over five values between 0 and 1. To save space, we only show the results in the babble noise environments at the lowest SNRatO levels in Figs. 5 and 6. From the figures, we observe that "DAB+auto-N-best+TS" and "DAB+soft-N-best+TS" perform similarly when γ is well tuned, and both are better than "DAB+learning-N-best+TS". The working range of γ extends up to 0.7 for "DAB+auto-N-best+TS" and "DAB+soft-N-best+TS", and up to 0.9 for "DAB+learning-N-best+TS".

Figure 5: Effect of hyperparameter γ on STOI, PESQ, and SDR in the babble diffuse noise environment at the SNRatO of 10 dB, for "DAB (auto-N-best+TS)", "DAB (soft-N-best+TS)", and "DAB (learning-N-best+TS)".

Figure 6: Effect of hyperparameter γ on STOI, PESQ, and SDR in the babble point source noise environment at the SNRatO of −5 dB, for the same three systems.
8. Conclusions
In this paper, we have proposed deep ad-hoc beamforming, which is, to our knowledge, the first deep-learning-based beamforming method for ad-hoc microphone arrays. DAB has the following novel aspects. First, DAB employs an ad-hoc microphone array to pick up speech signals, which has the potential to enhance speech signals with equally high quality anywhere in the range that the array covers. It may also significantly improve the SNR at the microphone receivers, since, with high probability, some microphones are physically close to the speech source. Second, DAB employs a channel-selection algorithm to reweight the estimated speech signals with a sparsity constraint, which groups a handful of microphones around the speech source into a local microphone array; we have developed several channel-selection algorithms as well. Finally, we have developed a time synchronization framework based on time delay estimators and the supervised 1-best channel selection.

Besides the above novelties and advantages, the proposed DAB is flexible in incorporating new developments of DNN-based single-channel speech processing techniques, since its model is trained in a single-channel fashion. Its test process is also flexible in incorporating any number of microphones without retraining or revising the model, which meets the requirements of real-world applications. Moreover, although we applied DAB to speech enhancement as an example, we may apply it to other tasks as well by replacing the deep beamforming module with other task-specific algorithms.

We have conducted extensive experiments in scenarios where the location of the speech source is far-field, random, and blind to the microphones. Experimental results in both the diffuse noise and point source noise environments demonstrate that DAB outperforms its MVDR-based deep beamforming counterpart by a large margin, given enough microphones.

Acknowledgments
The author would like to thank Prof. DeLiang Wang for helpful discussions.
Appendix A.
Proof.
We denote the energy of the direct sound and additive noise components of the test utterance at the i-th channel as $X_i$ and $N_i$ respectively, i.e. $X_i = \sum_t |x_{\mathrm{time}}(t)|^2$ and $N_i = \sum_t |n_{\mathrm{time}}(t)|^2$. Our core idea is to filter out the signals of the channels whose clean speech satisfies:

$X_i < \gamma X^*$  (A.1)

where $X^*$ denotes the direct-sound energy of the best channel. Under the assumptions that the estimated weights are perfect and that the statistics of the noise components are consistent across the channels (i.e. $N_i = N^*$ for all $i$), we have

$q_i = \frac{X_i}{X_i + N^*}, \qquad q^* = \frac{X^*}{X^* + N^*}$  (A.2)

Solving (A.2) for the energies gives $X_i = \frac{q_i}{1-q_i}N^*$ and $X^* = \frac{q^*}{1-q^*}N^*$, so the complement of (A.1), $X_i \ge \gamma X^*$, is equivalent to $\frac{q_i}{q^*}\cdot\frac{1-q^*}{1-q_i} \ge \gamma$, which yields the selection rule in (20).

References

Carter, G. C. (1987). Coherence and time delay estimation. Proceedings of the IEEE, 236–255.
Chen, J., Benesty, J., & Huang, Y. A. (2006). Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Advances in Signal Processing, 026503.

Du, J., Tu, Y., Xu, Y., Dai, L., & Lee, C.-H. (2014). Speech separation of a target speaker based on deep neural networks. In Proc. IEEE Int. Conf. Signal Process. (pp. 473–477).

Erdogan, H., Hershey, J. R., Watanabe, S., Mandel, M. I., & Le Roux, J. (2016). Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech (pp. 1981–1985).

Heusdens, R., Zhang, G., Hendriks, R. C., Zeng, Y., & Kleijn, W. B. (2012). Distributed MVDR beamforming for (wireless) microphone networks using message passing. In Proc. Int. Workshop on Acoustic Signal Enhancement (IWAENC) (pp. 1–4). VDE.

Heymann, J., Drude, L., Chinaev, A., & Haeb-Umbach, R. (2015). BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 444–451). IEEE.

Heymann, J., Drude, L., & Haeb-Umbach, R. (2016). Neural network based spectral mask estimation for acoustic beamforming. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP) (pp. 196–200). IEEE.

Higuchi, T., Ito, N., Yoshioka, T., & Nakatani, T. (2016). Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise. In Proc. IEEE ICASSP (pp. 5210–5214). IEEE.

Higuchi, T., Kinoshita, K., Ito, N., Karita, S., & Nakatani, T. (2018). Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming. In Proc. IEEE ICASSP (pp. 531–535). IEEE.

Huang, P.-S., Kim, M., Hasegawa-Johnson, M., & Smaragdis, P. (2015). Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 2136–2147.

Jayaprakasam, S., Rahim, S. K. A., & Leow, C. Y. (2017). Distributed and collaborative beamforming in wireless sensor networks: Classifications, trends, and research directions. IEEE Communications Surveys & Tutorials, 2092–2116.

Jiang, Y., Wang, D., Liu, R., & Feng, Z. (2014). Binaural classification for reverberant speech segregation using deep neural networks. IEEE/ACM Trans. Audio, Speech, Lang. Process., 2112–2121.

Knapp, C., & Carter, G. (1976). The generalized correlation method for estimation of time delay. IEEE Trans. Acoust., Speech, Signal Process., 320–327.

Koutrouvelis, A. I., Sherson, T. W., Heusdens, R., & Hendriks, R. C. (2018). A low-cost robust distributed linearly constrained beamformer for wireless acoustic sensor networks with arbitrary topology. IEEE/ACM Trans. Audio, Speech, Lang. Process., 1434–1448.

Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising autoencoder. In Interspeech (pp. 436–440).

Nakatani, T., Ito, N., Higuchi, T., Araki, S., & Kinoshita, K. (2017). Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming. In Proc. IEEE ICASSP (pp. 286–290). IEEE.

Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In NIPS.

O'Connor, M., & Kleijn, W. B. (2014). Diffusion-based distributed MVDR beamformer. In Proc. IEEE ICASSP (pp. 810–814). IEEE.

O'Connor, M., Kleijn, W. B., & Abhayapala, T. (2016). Distributed sparse MVDR beamforming using the bi-alternating direction method of multipliers. In Proc. IEEE ICASSP (pp. 106–110). IEEE.

Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2001). Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs. In Proc. IEEE ICASSP (pp. 749–752).

Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio, Speech, Lang. Process., 2125–2136.

Tavakoli, V. M., Jensen, J. R., Christensen, M. G., & Benesty, J. (2016). A framework for speech enhancement with ad hoc microphone arrays. IEEE/ACM Trans. Audio, Speech, Lang. Process., 1038–1051.

Tavakoli, V. M., Jensen, J. R., Heusdens, R., Benesty, J., & Christensen, M. G. (2017). Distributed max-SINR speech enhancement with ad hoc microphone arrays. In Proc. IEEE ICASSP (pp. 151–155). IEEE.

Tu, Y.-H., Du, J., Sun, L., & Lee, C.-H. (2017). LSTM-based iterative mask estimation and post-processing for multi-channel speech enhancement. In Proc. APSIPA ASC (pp. 488–491). IEEE.

Vincent, E., Gribonval, R., & Févotte, C. (2006). Performance measurement in blind audio source separation. IEEE Trans. Audio, Speech, Lang. Process., 1462–1469.

Wang, D., & Chen, J. (2018). Supervised speech separation based on deep learning: An overview. IEEE/ACM Trans. Audio, Speech, Lang. Process.

Wang, Y., Narayanan, A., & Wang, D. L. (2014). On training targets for supervised speech separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 1849–1858.

Wang, Y., & Wang, D. L. (2013). Towards scaling up classification-based speech separation. IEEE Trans. Audio, Speech, Lang. Process., 1381–1390.

Wang, Z.-Q., & Wang, D. (2018). All-neural multichannel speech enhancement. In Interspeech.

Williamson, D. S., Wang, Y., & Wang, D. L. (2016). Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 483–492.

Xiao, X., Zhao, S., Jones, D. L., Chng, E. S., & Li, H. (2017). On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition. In Proc. IEEE ICASSP (pp. 3246–3250). IEEE.

Xu, Y., Du, J., Dai, L.-R., & Lee, C.-H. (2015). A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio, Speech, Lang. Process., 7–19.

Zeng, Y., & Hendriks, R. C. (2014). Distributed delay and sum beamformer for speech enhancement via randomized gossip. IEEE/ACM Trans. Audio, Speech, Lang. Process., 260–273.

Zhang, J., Chepuri, S. P., Hendriks, R. C., & Heusdens, R. (2018). Microphone subset selection for MVDR beamformer based noise reduction. IEEE/ACM Trans. Audio, Speech, Lang. Process., 550–563.

Zhang, X., Wang, Z.-Q., & Wang, D. (2017). A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR. In Proc. IEEE ICASSP (pp. 276–280). IEEE.

Zhang, X.-L., & Wang, D. (2016). A deep ensemble learning method for monaural speech separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 967–977.

Zhang, X.-L., & Wu, J. (2013). Denoising deep neural networks based voice activity detection. In Proc. IEEE ICASSP (pp. 853–857).

Zhou, Y., & Qian, Y. (2018). Robust mask estimation by integrating neural network-based and clustering-based approaches for adaptive acoustic beamforming. In Proc. IEEE ICASSP, in press.