A Machine Learning Approach to DoA Estimation and Model Order Selection for Antenna Arrays with Subarray Sampling
Andreas Barthelme, Wolfgang Utschick
Professur für Methoden der Signalverarbeitung, Technical University Munich, 80290 Munich, Germany
Email: {a.barthelme, utschick}@tum.de

Abstract—In this paper, we study the problem of direction of arrival estimation and model order selection for systems employing subarray sampling. Thereby, we focus on scenarios where the number of active sources may exceed the number of simultaneously sampled antenna elements. For this purpose, we propose new schemes based on neural networks and estimators that combine neural networks with gradient steps on the likelihood function. These methods are able to outperform existing estimators in terms of mean squared error and model selection accuracy, especially in the low snapshot domain, at a drastically lower computational complexity.
I. INTRODUCTION
In the last decade, data-driven approaches have become increasingly popular in the signal processing community. Driven by astonishing results from image and speech processing, deep neural networks have become a tool that finds its way to many different signal processing applications. Wherever there are model imperfections or the existing solutions are very complex to compute, data-based approaches may improve the performance considerably.

Traditionally, direction of arrival (DoA) estimation is a field where appropriate stochastic models and potent algorithms are available. However, there are still some areas where existing solutions lead to a rather limited performance. Subarray sampling is one of these applications for which classical methods do not provide fully satisfying results. The idea behind subarray sampling is to reduce costs by sequentially sampling subarrays instead of sampling the whole antenna array simultaneously, which means that fewer radio frequency (RF) chains than antennas are needed. Specifically, in the domain where there are more sources than sampled antenna elements per time step [1], [2], the performance of existing DoA estimation algorithms is, as we will show, only acceptable for a prohibitively high number of snapshots. Therefore, we investigate the suitability of machine learning based approaches for systems with subarray sampling. In particular, we discuss neural networks (NNs) for the tasks of DoA estimation and model order selection.

Subarray sampling can be seen as a special form of time-varying arrays. Computable methods for DoA estimation for time-varying arrays fall into two categories depending on the ratio of transmitting sources to simultaneously sampled antenna elements. For fewer sources than sampled antennas per time step, previous work goes back to [3], where the single source case is studied. In [4], the same authors extend their analysis of time-varying arrays to multiple sources and propose eigenstructure methods based on array interpolation and focusing matrices. A more direct approach to employing MVDR and MUSIC for these systems is studied in [5], [6]. The more demanding scenario, where the number of sources exceeds the number of simultaneously sampled antennas, is discussed in [1]. There, the authors propose to use a cost function that matches the subarray covariance matrices to the observed sample covariance matrices in a general least squares (GLS) sense. In [2], the covariance matrix of the full antenna array is estimated by a special subarray sampling scheme. Afterwards, the DoAs are estimated from the reconstructed, full covariance matrix with MUSIC [7]. Lastly, the recent work in [8] on non-coherent processing of partly calibrated arrays yields an estimator that utilizes a sparse signal representation of the system model and is applicable to subarray sampling.

Utilizing data-based machine learning techniques for DoA estimation goes back to the '80s to Rastogi et al. [9]. A review of the work from the last century in this field can be found in [10]. More recently, with the increase in computing resources, the methods have shifted towards larger fully connected and convolutional multilayer NNs. The proposed NN approaches can be assigned to three different groups. One group poses the DoA estimation problem as a classification problem by a discretization of the angular domain in several sectors (e.g., [11]–[13]). The DoA estimation problem then reduces to determining if a source is present in a specific angular sector.
For the next group, the idea is to estimate a discretized spectrum (MUSIC [14], [15] or transmit power [16]) as a proxy by means of a regression network, and to derive the DoA estimates by determining the maxima of the respective spectrum. A disadvantage of the aforementioned methods is that a minimal angular spread between two sources has to be assumed, such that each grid point or sector is only associated with one possible source. The third group, in which our proposed method falls, does not suffer from this. There, the NN should directly produce the DoA estimates at its output. These models are trained directly on the cost function of interest, e.g., the mean squared error (MSE). In [17], the authors propose to use a signal-to-noise-ratio (SNR) classification network to choose between two different DoA regression networks with the goal to resolve two narrowly spaced sources. A more general approach is presented in [18]. There, a NN is discussed that is able to simultaneously estimate the number of sources and their DoAs. The authors show that this network is able to achieve the same performance as a maximum likelihood (ML) estimator in a scenario with two sources and a single snapshot.

For model order selection, NNs have first been proposed in [19]. Recently, the model order selection problem has been revisited for state-of-the-art fully connected, feedforward network architectures with different input data formats [18], [20], [21]. In the context of subarray sampling, previous work on model order selection is limited to [22], where the applicability of information criteria to the time-varying preprocessing case is discussed.

In this paper, we discuss several DoA estimation methods for systems with subarray sampling in Section III. There, our contributions lie in the proposal of a NN-based DoA estimator and the modification of the GLS estimator for a small number of snapshots. Moreover, we provide a thorough comparison of the newly proposed and existing schemes for the critical case of more sources than simultaneously sampled antennas by means of Monte Carlo simulations. These simulations show that the proposed NN-based estimation scheme is able to outperform the state-of-the-art estimators in terms of estimation accuracy and computational complexity. In Section IV, the model order selection problem for subarray sampling is discussed. We present a new estimation scheme for the model order based on a NN and are, to the best of our knowledge, the first to provide simulation results for the achievable selection accuracy for these systems. Again, the proposed NN-based approach is able to provide a significantly better performance compared to existing methods based on information criteria, as it provides a higher selection accuracy at a fraction of the computational cost.

II. SYSTEM MODEL
Let us consider antenna arrays consisting of M antennas. Throughout this work, we investigate systems which only use W < M RF chains and a switching network to sample the received signals, i.e., only a subset of the antenna elements is sampled simultaneously. In the following, we assume that at any given time each RF chain is connected to exactly one antenna element and that there are K different states of the switching network, i.e., we have K different subarrays consisting of W antennas. For each subarray, we collect N snapshots for a joint processing. Then, the n-th sample of the received signal for the k-th subarray can be written as

$$\mathbf{y}^{(k)}(n) = \mathbf{G}^{(k)} \left( \mathbf{A}(\boldsymbol{\theta}) \mathbf{s}^{(k)}(n) + \boldsymbol{\eta}^{(k)}(n) \right), \quad (1)$$

where the steering matrix $\mathbf{A}(\boldsymbol{\theta}) \in \mathbb{C}^{M \times L}$ captures the response of the whole array on the DoAs $\boldsymbol{\theta}$ of the L far-field sources, $\mathbf{s}^{(k)}$ denotes the narrow-band transmit signals, and $\boldsymbol{\eta}^{(k)} \sim \mathcal{CN}(\mathbf{0}, \sigma_\eta^2 \mathbf{I}_M)$ is some additive white Gaussian noise. The matrix $\mathbf{G}^{(k)} \in \{0, 1\}^{W \times M}$ represents the connections between the RF chains and the antenna elements that form the k-th subarray.
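To make the subarray sampling model in (1) concrete, the following minimal sketch generates synthetic subarray snapshots for a uniform circular array. The array geometry, the particular selection matrices, and all parameter values are illustrative assumptions, not the exact configuration used later in the paper.

```python
import numpy as np

def uca_steering(doas, M, radius_over_lambda):
    """Steering matrix A(theta) of an M-element uniform circular array (azimuth only)."""
    ant_angles = 2 * np.pi * np.arange(M) / M          # antenna positions on the circle
    phase = 2 * np.pi * radius_over_lambda * np.cos(doas[None, :] - ant_angles[:, None])
    return np.exp(1j * phase)                          # M x L

def sample_subarrays(doas, G_list, N, sigma_eta2=0.1, powers=None, radius_over_lambda=1.0, rng=None):
    """Draw N snapshots y^(k)(n) = G^(k) (A(theta) s^(k)(n) + eta^(k)(n)) for each subarray."""
    rng = np.random.default_rng(rng)
    M = G_list[0].shape[1]
    L = len(doas)
    powers = np.ones(L) if powers is None else powers
    A = uca_steering(np.asarray(doas), M, radius_over_lambda)
    Y = []
    for G in G_list:                                   # one block of snapshots per switching state k
        s = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) \
            * np.sqrt(powers / 2)[:, None]             # uncorrelated Gaussian sources (SML model)
        eta = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) \
            * np.sqrt(sigma_eta2 / 2)
        Y.append(G @ (A @ s + eta))                    # W x N received samples of subarray k
    return Y, A

# Example: M = 9 antennas, W = 3 RF chains, K = 4 subarrays (selection pattern is a placeholder).
M, W = 9, 3
subsets = [[0, 1, 8], [0, 2, 7], [0, 3, 6], [0, 4, 5]]
G_list = [np.eye(M)[idx, :] for idx in subsets]
Y, _ = sample_subarrays(doas=[0.3, 1.1, 2.5], G_list=G_list, N=10)
```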
III. DOA ESTIMATION

In this section, we briefly discuss existing DoA estimation approaches that are derived from the underlying stochastic model, before we present a new data-driven NN approach. Afterwards, we compare these methods by means of Monte Carlo simulations.
A. Model-Based DoA Estimation
Traditionally, DoA estimation methods have been derived from the underlying stochastic model. In the scope of DoA estimation, two different stochastic models are commonly associated with the received signals y given in (1) that differ in the assumed distribution of s [23]. On the one hand, we may treat s as an unknown parameter of the stochastic model. On the other hand, we may assume that s itself follows some probability distribution, which leads to a stochastic model for y that no longer depends on the individual realizations of s, but on the parameters of its distribution.
1) Maximum Likelihood Estimator:
As the most prominent model-based estimation method, we start by discussing the ML estimator, which finds its estimates by maximizing the probability density function at the observed received signals y with respect to the model parameters. Deriving the ML estimator under the aforementioned stochastic models, we obtain the deterministic ML (DML) for the former model and the stochastic ML (SML) for the latter model. In the case of L < W, i.e., fewer sources than RF chains, the computation of the DML estimates is straightforward. As it is well summarized in [24], we can find closed form estimates of the signal and noise parameters for fixed angles θ. Plugging these estimates into the likelihood function gives a non-convex function in θ. To find the global maximum of this concentrated likelihood, a multi-dimensional grid search over θ followed by any kind of gradient ascent technique can be employed.

More challenging is the case with a number of sources equal to or greater than the number of RF chains, L ≥ W, which we will focus on throughout this section. For the DML case, we have

$$\bar{\mathbf{y}}(n) = \begin{bmatrix} \mathbf{y}^{(1)}(n) \\ \vdots \\ \mathbf{y}^{(K)}(n) \end{bmatrix} \sim \mathcal{CN}\left( \bar{\mathbf{A}} \bar{\mathbf{s}}(n), \sigma_\eta^2 \mathbf{I}_{KW} \right), \quad (2)$$

with

$$\bar{\mathbf{A}}(\boldsymbol{\theta}) = \operatorname{blockdiag}\left\{ \mathbf{G}^{(1)} \mathbf{A}(\boldsymbol{\theta}), \ldots, \mathbf{G}^{(K)} \mathbf{A}(\boldsymbol{\theta}) \right\}, \quad (3)$$

$$\bar{\mathbf{s}}(n) = \left[ \mathbf{s}^{(1),\mathrm{H}}(n), \ldots, \mathbf{s}^{(K),\mathrm{H}}(n) \right]^{\mathrm{H}}. \quad (4)$$

Since in general $\bar{\mathbf{A}}(\boldsymbol{\theta})$ does not have full column rank for L ≥ W, there is a manifold of solutions for θ and $\bar{\mathbf{s}}(n)$ that give the same distribution for $\bar{\mathbf{y}}(n)$, i.e., θ cannot be uniquely estimated with the DML model [25].

In contrast, the optimization problem corresponding to the SML estimator, where $\mathbf{s}(t) \sim \mathcal{CN}(\mathbf{0}, \mathbf{R}_s)$, cannot be reduced to an optimization which only depends on θ in the general case [4]. Instead, the SML estimates are the solution to

$$\max_{\boldsymbol{\theta},\, \mathbf{R}_s \succeq \mathbf{0},\, \sigma_\eta^2 \geq 0} \; - \sum_{k=1}^{K} \left[ \ln\!\left( \det\!\left( \mathbf{R}_y^{(k)} \right) \right) + \operatorname{tr}\!\left( \mathbf{R}_y^{(k),-1} \hat{\mathbf{R}}_y^{(k)} \right) \right], \quad (5)$$

with the covariance matrices

$$\mathbf{R}_y^{(k)} = \mathbf{G}^{(k)} \mathbf{A}(\boldsymbol{\theta}) \mathbf{R}_s \mathbf{A}^{\mathrm{H}}(\boldsymbol{\theta}) \mathbf{G}^{(k),\mathrm{T}} + \sigma_\eta^2 \mathbf{I}_W, \quad (6)$$

and the sample covariance matrices

$$\hat{\mathbf{R}}_y^{(k)} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{y}^{(k)}(n)\, \mathbf{y}^{(k),\mathrm{H}}(n). \quad (7)$$

This optimization problem has almost surely a unique maximizer if $\mathbf{R}_s$ is diagonal and the number of sources L does not exceed a bound determined by the Kruskal rank ρ of the co-array manifold [8]

$$\breve{\mathbf{V}}(\boldsymbol{\theta}) = \begin{bmatrix} \left( \mathbf{G}^{(1)} \mathbf{A}^*(\boldsymbol{\theta}) \right) \circ \left( \mathbf{G}^{(1)} \mathbf{A}(\boldsymbol{\theta}) \right) \\ \vdots \\ \left( \mathbf{G}^{(K)} \mathbf{A}^*(\boldsymbol{\theta}) \right) \circ \left( \mathbf{G}^{(K)} \mathbf{A}(\boldsymbol{\theta}) \right) \end{bmatrix}, \quad (8)$$

which uses the Khatri-Rao product, denoted by ∘. For the correlated source case with a dense covariance matrix $\mathbf{R}_s$, the extension of the identifiability proof in [8] is non-trivial and, as of yet, remains an open problem.

Unfortunately, for the optimization of (5), we cannot find closed form solutions for $\mathbf{R}_s$ and $\sigma_\eta^2$ for fixed θ. Therefore, the optimization of this non-convex function is over $L^2 + L + 1$ variables, which is computationally very expensive even for a small number of sources. To overcome this problem, different methods that replace the likelihood objective with a covariance-matching criterion have been proposed to estimate θ specifically for L ≥ W [1], [2], [8]. As the method introduced in [2] only works for a special sampling scheme, in which every lag in the covariance matrix needs to be sampled by at least one of the subarrays, we focus instead on the GLS [1] and sparse signal representation (SSR) [8] methods that do not suffer from this restriction.
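As a concrete illustration of the SML objective in (5)–(7), the following sketch evaluates the negative of that cost for given candidate parameters. It assumes uncorrelated sources (diagonal R_s) and reuses the hypothetical uca_steering helper and the Y and G_list variables from the sketch in Section II.

```python
import numpy as np

def subarray_covariances(theta, source_powers, sigma_eta2, G_list, radius_over_lambda=1.0):
    """Model covariances R_y^(k) from (6) for a diagonal R_s = diag(source_powers)."""
    M = G_list[0].shape[1]
    W = G_list[0].shape[0]
    A = uca_steering(np.asarray(theta), M, radius_over_lambda)
    Rs = np.diag(source_powers).astype(complex)
    return [G @ A @ Rs @ A.conj().T @ G.T + sigma_eta2 * np.eye(W) for G in G_list]

def sml_neg_log_likelihood(theta, source_powers, sigma_eta2, R_hat_list, G_list):
    """Negative of the SML objective in (5), up to constants; smaller is better."""
    cost = 0.0
    models = subarray_covariances(theta, source_powers, sigma_eta2, G_list)
    for R, R_hat in zip(models, R_hat_list):
        sign, logdet = np.linalg.slogdet(R)
        cost += logdet + np.trace(np.linalg.solve(R, R_hat)).real
    return cost

# Usage: sample covariances (7) from the snapshots generated earlier.
R_hat_list = [Yk @ Yk.conj().T / Yk.shape[1] for Yk in Y]
cost = sml_neg_log_likelihood([0.3, 1.1, 2.5], [1.0, 1.0, 1.0], 0.1, R_hat_list, G_list)
```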
2) GLS Estimator:
The GLS estimator has been shown to be an asymptotically consistent and efficient estimator. The idea is to obtain the estimates by a covariance fitting criterion. The GLS approach solves the following optimization problem

$$\min_{\boldsymbol{\theta},\, \mathbf{R}_s \succeq \mathbf{0},\, \sigma_\eta^2 \geq 0} \; \sum_{k=1}^{K} \left\| \mathbf{T}^{(k)} \left( \hat{\mathbf{R}}_y^{(k)} - \mathbf{R}_y^{(k)} \right) \mathbf{T}^{(k),\mathrm{H}} \right\|_{\mathrm{F}}^2, \quad (9)$$

where $\mathbf{T}^{(k)}$ is a whitening filter, for which the choice $\mathbf{T}^{(k)} = \hat{\mathbf{R}}_y^{(k),-1/2}$ can be motivated by asymptotic considerations [1]. In the optimization above, the signal and noise estimates can be computed for fixed θ, such that the non-convex optimization problem results in a search for the optimal θ, which can again be solved by a grid-search approach.

Note that in contrast to the original paper [1], we include the positive-semidefiniteness constraints on $\mathbf{R}_s$ and $\sigma_\eta^2$ in the optimization problem (9), because otherwise we obtain some infeasible results for the estimates of $\mathbf{R}_s$ and $\sigma_\eta^2$ when working in the low snapshot domain. This, in turn, means that determining the noise and signal estimates for fixed θ is not a simple least squares problem, but requires the solution of a semidefinite program in the general case and of a non-negative least squares problem (quadratic program) in the case of uncorrelated sources.
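The following sketch, under the same assumptions as the previous snippets (uncorrelated sources, the hypothetical uca_steering helper from above), evaluates the whitened covariance-fitting cost (9) for a fixed set of candidate angles. The non-negative source powers and the noise variance are obtained by a non-negative least squares fit, as discussed in the text for diagonal R_s.

```python
import numpy as np
from scipy.optimize import nnls

def sqrtm_psd_inv(R):
    """Inverse Hermitian matrix square root R^{-1/2} via eigendecomposition."""
    w, V = np.linalg.eigh(R)
    return (V / np.sqrt(np.clip(w, 1e-12, None))) @ V.conj().T

def gls_cost_fixed_theta(theta, R_hat_list, G_list, radius_over_lambda=1.0):
    """GLS cost (9) for fixed angles theta, assuming uncorrelated sources."""
    M, W = G_list[0].shape[1], G_list[0].shape[0]
    A = uca_steering(np.asarray(theta), M, radius_over_lambda)   # hypothetical helper from above
    blocks, rhs = [], []
    for G, R_hat in zip(G_list, R_hat_list):
        T = sqrtm_psd_inv(R_hat)                                 # whitening filter R_hat^{-1/2}
        basis = [np.outer(G @ A[:, l], (G @ A[:, l]).conj()) for l in range(A.shape[1])]
        basis.append(np.eye(W, dtype=complex))                   # noise term sigma_eta^2 * I_W
        cols = []
        for B in basis:
            Bw = T @ B @ T.conj().T
            cols.append(np.concatenate([Bw.real.ravel(), Bw.imag.ravel()]))
        blocks.append(np.stack(cols, axis=1))
        Rw = T @ R_hat @ T.conj().T
        rhs.append(np.concatenate([Rw.real.ravel(), Rw.imag.ravel()]))
    x, rnorm = nnls(np.vstack(blocks), np.concatenate(rhs))      # x = [p_1, ..., p_L, sigma_eta^2]
    return rnorm**2, x
```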
3) SSR Estimator:
The SSR estimator has been derived in [8] from the SPICE estimator [26] for DoA estimation in partly calibrated arrays. Due to its non-coherent processing, i.e., phase offsets between the subarrays are not estimated, it is directly applicable to the subsampling system model at hand. Similar to the GLS approach, the SSR method is based on a covariance-matching cost function given by

$$\sum_{k=1}^{K} \left\| \mathbf{R}_y^{(k),-1/2} \left( \hat{\mathbf{R}}_y^{(k)} - \mathbf{R}_y^{(k)} \right) \hat{\mathbf{R}}_y^{(k),-1/2} \right\|_{\mathrm{F}}^2. \quad (10)$$

Using a sparse representation of the covariance matrices $\mathbf{R}_y^{(k)}$ for uncorrelated signals, a second-order cone program (SOCP) can be derived from (10). Although the derivation of the SSR estimator is based on the assumption of uncorrelated signals, the authors argue that, due to the robustness of sparse signal models, the method can also be used for the correlated case. The resulting SOCP can either be solved by a general purpose solver or, as has been proposed for SPICE in [26], a cost-effective alternating optimization method can be used (for details see Appendix A).
B. Purely Data-Based DoA Estimation

In contrast to the previously discussed model-based methods, in this subsection, we discuss a purely data-based DoA estimation approach. In particular, we present an end-to-end, feedforward NN that is trained on artificial training data sampled from the SML signal model. In the following, we will refer to this NN as MCENet. As the assumption of uncorrelated transmit signals, i.e., a diagonal R_s, provides sufficient identifiability conditions, is at the core of the SSR estimator, and reduces the complexity of the GLS estimator, we will focus on the uncorrelated case. Note that an extension of the proposed scheme to the correlated case simply means sampling data from the respective stochastic model.
1) Data and preprocessing:
For our training set, we sample data from the system model in (1). Hereby, the entries of the DoA vector θ are drawn from a uniform distribution over the complete field of view, i.e., θ ∼ U(0, 2π). For now, let us assume that the entries in θ are sorted in ascending order, which will be discussed later. The noise and transmit signal realizations are drawn from uncorrelated Gaussian distributions according to the SML model. Thereby, we fix the power of the strongest source to σ²_{s,max} = 1, and for each realization, we draw the power of each weaker source in decibel from a uniform distribution between 0 dB and σ²_{s,min}. The noise power also follows a uniform distribution between σ²_{η,min} and σ²_{η,max}. Each data sample consists of KN i.i.d. received signal realizations y^{(k)}(n), n = 1, ..., N, k = 1, ..., K. Since we use artificial data, we can feed new, previously unseen realizations to the NN in each step of the gradient descent of the learning algorithm, which makes the training inherently robust towards overfitting. Additionally, we know the true DoAs for each realization; therefore, we can use them as the label for each data sample in a supervised learning approach.

We pass sample covariance matrix information to the input layer of the NN, which has been shown to be a good preprocessing step for the complex received signals in the DoA context [21], [27]. To this end, we form the K subarray sample covariance matrices R̂_y^{(k)}, k = 1, ..., K, and stack their real parameters, i.e., their diagonal elements and the real and imaginary parts of their upper triangle, in one large vector per data sample. Note that the input size of the NN is KW², and thus does not depend on the number of snapshots N.
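A minimal sketch of this preprocessing step is given below: it maps each subarray sample covariance matrix to its W² real parameters and concatenates them into the KW²-dimensional input vector. The function name is again an illustrative choice.

```python
import numpy as np

def covariance_features(Y_list):
    """Stack the real parameters of all K subarray sample covariance matrices.

    For each W x W Hermitian matrix we keep the W real diagonal entries and the
    real and imaginary parts of the strictly upper triangle, i.e., W^2 real values,
    giving a K * W^2 dimensional feature vector that is independent of N.
    """
    feats = []
    for Yk in Y_list:                                   # Yk has shape W x N
        R_hat = Yk @ Yk.conj().T / Yk.shape[1]          # sample covariance of subarray k
        iu = np.triu_indices(R_hat.shape[0], k=1)
        feats.append(np.concatenate([np.diag(R_hat).real,
                                     R_hat[iu].real,
                                     R_hat[iu].imag]))
    return np.concatenate(feats)

x = covariance_features(Y)      # Y from the sampling sketch above; len(x) == K * W**2
```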
2) Architecture and cost function:
Due to its simplicity, we use a fully connected, feedforward NN with N_h hidden layers, each consisting of N_u neurons. For the non-linear activation function of the hidden layers, we employ the rectified linear unit (ReLU). The output layer produces L outputs, which are the estimates of the DoAs θ̂.

The most common cost function for parameter estimation is the MSE. However, for DoA estimation the 2π-periodicity of the angles has to be taken into account. To that end, the mean squared periodic error (MSPE), given by

$$\mathrm{MSPE}(\theta, \hat{\theta}) = \mathrm{E}_{\theta}\!\left[ \left| \operatorname{mod}_{[-\pi, \pi)}\!\left( \theta - \hat{\theta} \right) \right|^2 \right], \quad (11)$$

has been proposed [28]. An alternative, which is differentiable at every point and is equivalent to the MSE in the small error region, is the mean cyclic error (MCE) [29]. The MCE can be calculated according to

$$\mathrm{MCE}(\theta, \hat{\theta}) = \mathrm{E}_{\theta}\!\left[ 2 \left( 1 - \cos\!\left( \theta - \hat{\theta} \right) \right) \right]. \quad (12)$$

Although the non-differentiability of the MSPE is only at one point, and hence can simply be replaced by its left derivative without any adverse impact on the learning procedure, we use the MCE with its continuous derivative for the cost function of the NN.

For L > 1, the order of the elements in θ̂ should be irrelevant for the value of the cost function. Therefore, the minimum of the sum of the element-wise errors between the true DoAs and all permutations of θ̂ is used for the cost function f(θ, θ̂), i.e.,

$$f(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = \min_{\boldsymbol{\Pi}} \sum_{\ell=1}^{L} f\!\left( \theta_\ell, \boldsymbol{\pi}_\ell^{\mathrm{T}} \hat{\boldsymbol{\theta}} \right), \quad (13)$$

where $\boldsymbol{\Pi} = [\boldsymbol{\pi}_1, \ldots, \boldsymbol{\pi}_L]^{\mathrm{T}}$ is a permutation matrix. Including this minimization over all permutation matrices in the cost function of the NN adds a significant computational load on the training procedure. However, in our studies we observed that if the network is fed with sorted labels, it converges to a point where the optimal permutation matrix Π is constant. This means that the minimizer of (13) is the same for every input sample, i.e., the output of the network follows a fixed order. Further studies showed that we can even omit the minimization over all permutations and simply use the sum of the element-wise errors. The network will then converge to a point where it produces the outputs in the correct order that minimizes the sum MCE.
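A minimal sketch of the permutation-invariant MCE training loss in (12)–(13) is shown below; it is written in plain numpy for clarity, whereas in practice the same expression would be implemented in an automatic-differentiation framework.

```python
import numpy as np
from itertools import permutations

def cyclic_error(theta, theta_hat):
    """Element-wise cyclic error 2*(1 - cos(theta - theta_hat)), cf. (12)."""
    return 2.0 * (1.0 - np.cos(theta - theta_hat))

def permutation_invariant_loss(theta, theta_hat):
    """Cost (13): minimum over all orderings of the estimates of the summed cyclic errors."""
    L = len(theta)
    return min(np.sum(cyclic_error(np.asarray(theta), np.asarray(theta_hat)[list(p)]))
               for p in permutations(range(L)))

# With sorted labels the optimal permutation becomes constant during training,
# so the plain (ordered) sum of cyclic errors can replace the L!-fold minimization.
true_doas = np.sort(np.array([0.3, 1.1, 2.5]))
est_doas = np.array([1.05, 0.35, 2.4])
print(permutation_invariant_loss(true_doas, est_doas))
```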
C. Hybrid DoA Estimation

By hybrid DoA estimation we understand the combination of two different estimation approaches in a two-stage process. In our case, this combination uses one of the model-based methods GLS and SSR or the purely data-based NN method as an initialization step, followed by a consecutive gradient ascent method on the SML likelihood. For the model-based approaches, the consecutive gradient steps alleviate the grid mismatch problem that is inherent to any grid-based approach [30].

TABLE I
SUBARRAY SAMPLING SCHEME

k   Antenna Elements
1   1, 2, 9
2   1, 3, 8
3   1, 4, 7
4   1, 5, 6

The NN method, as posed above, does not suffer from this grid mismatch problem due to its formulation as a regression problem. However, by utilizing a purely data-based method in a scenario where an appropriate stochastic model exists, we ignore a significant amount of information about the problem at hand. Hence, we propose the combination of NN-based initialization and model-aware gradient steps on the SML likelihood function to improve the DoA estimate.

To combine the NN initialization with the SML gradient steps, an additional intermediate step is necessary. The NN does not directly yield estimates for the noise variance and signal covariance matrix, which are needed for the consecutive gradient approach. To that end, we propose to use the GLS estimates of these nuisance parameters for the fixed angular estimates θ̂, which requires the solution of a convex optimization problem as discussed in Section III-A2.

For the gradient approach on the SML likelihood, we observed that the gradient of the log-likelihood function is often dominated by the derivative w.r.t. the noise variance σ²_η. This, in turn, can lead to slow progress in the parameters of interest, viz., the DoA estimates, during a simple gradient ascent approach. Instead, a block coordinate ascent method can be applied that alternates between updating the DoA estimates, the estimate of the signal covariance R_s, and the estimate of σ²_η. This led to a much faster convergence in our simulations. A sketch of such a block coordinate ascent refinement is given below.
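The following sketch outlines this refinement under the uncorrelated-source assumption; it performs simple alternating (projected) gradient steps on the SML log-likelihood (5) with respect to the DoAs, the source powers, and the noise variance. Step sizes, the numerical gradient, and the sml_neg_log_likelihood helper from the earlier sketch are illustrative placeholders rather than the exact procedure used in the paper.

```python
import numpy as np

def block_coordinate_ascent(theta0, p0, sigma0, R_hat_list, G_list,
                            n_iter=200, step=1e-2, eps=1e-6):
    """Alternating refinement of (theta, p, sigma_eta^2) on the SML likelihood (5)."""
    def cost(th, p, s2):                                  # negative log-likelihood, see earlier sketch
        return sml_neg_log_likelihood(th, p, s2, R_hat_list, G_list)

    def num_grad(f, x):                                   # simple central finite-difference gradient
        g = np.zeros_like(x)
        for i in range(len(x)):
            d = np.zeros_like(x)
            d[i] = eps
            g[i] = (f(x + d) - f(x - d)) / (2 * eps)
        return g

    theta, p, s2 = np.array(theta0, float), np.array(p0, float), float(sigma0)
    for _ in range(n_iter):
        theta -= step * num_grad(lambda t: cost(t, p, s2), theta)                    # DoA block
        p = np.maximum(p - step * num_grad(lambda q: cost(theta, q, s2), p), 0.0)    # power block
        s2 = max(s2 - step * (cost(theta, p, s2 + eps) - cost(theta, p, s2 - eps)) / (2 * eps), 1e-8)
    return theta, p, s2
```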
D. Simulations

To assess the performance of the previously presented algorithms, we provide some simulation results. The considered system consists of M = 9 omnidirectional antennas that form a uniform circular array (UCA). For simplicity, we assume that all of the L = 3 sources lie in the same horizontal plane as the antenna array, such that the steering vector of the UCA only depends on the azimuth. In our simulations, we use a fixed ratio of the array radius R to the wavelength λ. The switching network selects K = 4 subarrays consisting of W = 3 antennas according to the configuration given in Table I, which uses a clockwise numbering of the antenna elements of the UCA.

The parameters for the training of our algorithms, as well as the MCENet parameters, have been chosen according to Table II. For the test set data, we use received signal realizations stemming from equally powered signals. Note that the parameters for the signal and noise variances in the training set have been chosen such that the resulting parameter space covers a reasonably broad operating range for the DoA estimation task. The knowledge about this limited parameter space is not incorporated in the model-based algorithms. However, the MCENet is trained with data from this range, which might introduce a certain advantage. By choosing the range broad enough, we want to make sure that this advantage is not too significant.

TABLE II
SIMULATION PARAMETERS DOA ESTIMATION

Parameter                  Value
σ²_{s,min}                 – dB
σ²_{η,min}                 – dB
σ²_{η,max}                 – dB
N_h                        –
N_u                        –
Weight initialization      Glorot [31]
Batch size                 –
Optimizer                  Adam [32]
Learning rate              –
Samples per training set   –
Samples per test set       –

We denote the SSR method, for which we use YALMIP [33] in combination with the MOSEK solver [34] to solve the internal SOCP, simply by "SSR", whereas the alternating optimization variant (see Appendix A) is referred to as "SSR iter.". The "SSR iter." variant uses a fixed number of update steps. For the presented SSR variants, we chose a fixed oversampling factor, i.e., we use a multiple of M equidistant grid points to cover the whole field of view. In contrast, the oversampling factor for the GLS method is set to a smaller value, because otherwise the computational complexity for L = 3 sources becomes prohibitively large. As a reference, we added the results for a Genie ML approach, which consists of an initialization with the true DoAs followed by a block coordinate ascent on the SML likelihood. The Cramér-Rao bound (CRB) is, as a stochastic bound on the error variance, not really applicable if only one noise and signal realization is paired with each DoA realization, unless an immense number of realizations is considered. Additionally, without enforcing any minimal distance between the DoAs, the calculation of the CRB suffers from numerical issues for closely spaced angles. Therefore, this Genie ML estimator gives a more reasonable performance bound than the CRB for our simulations.

In Fig. 1, we depict the root MSPE (RMSPE) of the different DoA estimators over the SNR, defined as 1/σ²_η, for N = 10 snapshots. The plot shows the hybrid approaches, consisting of an initialization step and block coordinate ascent on the SML likelihood, as solid lines and the plain results of the discussed methods without subsequent gradient steps as dashed lines. We can see that none of the proposed methods comes close to the performance of the Genie ML. Nevertheless, we can identify a large advantage of the MCENet approach over the model-based approaches.

Fig. 1. RMSPE vs. SNR, N = 10.

Fig. 2. RMSPE vs. SNR, top realizations, N = 10.

To understand where this performance advantage of the MCENet stems from, we plot the RMSPE of the top realizations (i.e., excluding the worst outliers) for each DoA estimator in Fig. 2. Now we can see that the hybrid MCENet approach almost achieves the performance of the Genie ML above a certain SNR. Meanwhile, the model-based approaches are still falling behind. Interestingly, the fixed number of update steps is not enough for the alternating optimization approach "SSR iter." to achieve the same performance as the general purpose solver solution. In last place is the GLS approach, which might perform better for a denser grid, which, however, is computationally intractable. This reduced gap between the Genie ML performance and the performance of the other algorithms shows that the results in Fig. 1 are dominated by the suboptimal performance for some of the realizations, which we will refer to as outliers, whereas for the majority of the realizations the algorithms achieve an acceptable accuracy. The hybrid MCENet approach suffers from fewer outliers than the model-based approaches, which can be seen in Fig.
3 as well, where we plot the empirical cumulative density of the RMSPE per DoA estimator at 20 dB SNR for the Genie ML, the SSR, and the MCENet methods. When we compare the hybrid MCENet results with the plain MCENet output, we see in both Fig. 2 and Fig. 3 that the combination of the data-based MCENet with the model-based gradient steps is crucial. The MCENet alone cannot provide estimates that can compete with a model-based approach for non-outlier realizations. This comes as a trade-off between high SNR accuracy and outlier robustness, which is also reflected in the cost function of the NN. By design, the NN tries to minimize the average MCE over all realizations. In that sense, minimizing the occurrence of outliers that lead to large errors weighs more than further improving the accuracy for realizations with a small error, such that during training the emphasis lies first and foremost on the robustness against outliers.

Fig. 3. Empirical cumulative density function, SNR = 20 dB, N = 10, M = 9.

Fig. 4. Empirical cumulative density function, SNR = 20 dB, varying N, M = 9.

The higher susceptibility to outliers of the model-based approaches vanishes for a higher number of snapshots N, as can be seen in Fig. 4. There, we compare the relevant cut-out of the empirical cumulative density of the hybrid approaches for N = 10 and N = 1000. For high N, almost no outliers occur and the performance of SSR is on par with the MCENet. This result is not surprising, as the SSR method is based on a covariance-matching criterion. The sample covariance matrices are consistent estimates of the true subarray covariance matrices, which in turn justifies the validity of the covariance-matching objective for high N. Note that the GLS estimator, which is based on a similar objective as the SSR method, has been proven to be a consistent estimator (for a sufficiently dense grid) [1].

Fig. 5. Empirical cumulative density function, SNR = 20 dB, N = 10, varying M.

An increase in the number of antennas M improves the average performance of all discussed methods, as is shown in Fig. 5. However, for M = 25 antennas, we still see a significant amount of outliers for the MCENet and SSR approaches. Furthermore, increasing M not only comes with additional hardware expenditures, but the number of subarrays K, which have to be sampled, and therefore the time to scan the whole array, grows as well.

On a final note, we want to briefly discuss the complexity of the presented estimators. To this end, we show in Table III the computation times of the individual estimators with and without consecutive gradient steps. The presented times are measured in MATLAB at a fixed SNR on a simulation server equipped with two Intel Xeon Gold 6134 processors. Although we know that computation times do not achieve the same validity as a rigorous complexity analysis in Landau notation, due to their dependence on the used hardware and implementation, they still yield some qualitative insights. From our simulations, we see that the MCENet inference steps took about a tenth of the evaluation time of the "SSR" estimator, which itself is again about ten times faster than the "SSR iter." variant with its fixed iteration count and the GLS estimator, whose complexity grows exponentially with the number of sources. Taking the consecutive gradient steps into account, we see that these steps, implemented by a block coordinate ascent, take roughly the same time for the hybrid MCENet and SSR methods. In contrast, for the hybrid GLS approach the gradient steps take much longer to converge, which can be explained by the numerous poor initial estimates provided by the GLS estimator. Note that the required time for the gradient steps heavily depends on the target accuracy; a looser stopping criterion may significantly reduce the required computation times. For the presented simulations, the stopping criterion for the gradient steps is very tight (a very small absolute change in the log-likelihood), to achieve meaningful results for the MDL approach discussed in the next section.

TABLE III
COMPUTATION TIMES OF DOA ESTIMATORS

                      MCENet   SSR   SSR iter.   GLS
w/o gradient steps    – s      – s   – s         – s
w/ gradient steps     – s      – s   – s         – s

IV. MODEL ORDER SELECTION
Knowledge about the number of sources L in the transmission environment is essential in any of the previously presented DoA estimation approaches. With an inaccurate estimate of the model order, we base our algorithms on the wrong stochastic model or choose the wrong NN, which has been trained on mismatched data. Hence, we discuss the problem of model order selection in this section. Again, we follow the structure of the previous section and discuss model-based approaches first, namely information criteria (IC). Then we present a data-based approach, which uses a classification NN, and a performance comparison based on Monte Carlo simulations.

A. Information Criteria
The most common model-based model order selection methods are based on ICs [35]. All variants of these IC follow a common structure of their underlying optimization problem. For the considered system model, this optimization problem is given by

$$\hat{L} = \operatorname*{argmax}_{\ell \in \{0, \ldots, L_{\max}\}} \; \ln\!\left( p_\ell\!\left( \mathbf{Y}; \hat{\boldsymbol{\theta}}, \hat{\mathbf{R}}_s, \hat{\sigma}_\eta^2 \right) \right) - c(\ell), \quad (14)$$

where Y contains all observations y^{(k)}(n), k = 1, ..., K, n = 1, ..., N, the likelihood function of the received signals under the assumption that the model order is equal to ℓ and parameterized by the ML estimates of the model parameters is denoted by p_ℓ(·), and c(ℓ) is a penalty term that combats overfitting of the model order.

In the fully sampled case, the likelihood function can be reparameterized by the eigenvalues of the sample covariance matrix. This leads to a very convenient expression for the value of the likelihood function that depends only on these eigenvalues and no longer on the DoA estimates for each considered model order [36]. Therefore, the computational load is basically reduced to an eigenvalue decomposition, in contrast to evaluating ML estimates for very high model orders up to L_max. Unfortunately, this reparameterization is no longer available when we consider systems with subsampling. This is made visible by looking at eigendecompositions of the sample covariances R̂_y^{(k)}, where the true model order L is larger than the number of RF chains W. Here, the eigenspace can no longer be decomposed into a signal and a noise subspace. Additionally, as we see from the discussion in Section III-A, the ML estimates that are generally needed for the evaluation of the IC cannot be obtained directly for L ≥ W.

Instead, we can replace the ML estimates of the model parameters in (14) by the GLS estimates, as has been proposed in [22]. Applying the same rationale, the SSR estimator or any hybrid version can be used to evaluate the IC.
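A minimal sketch of the selection rule (14) with an MDL-style penalty (cf. (16) below) is given here; the maximized log-likelihood values per candidate order are assumed to be supplied by one of the (hybrid) DoA estimators discussed above, e.g., the hybrid SSR method.

```python
import numpy as np

def select_model_order(log_likelihoods, K, N):
    """Pick the model order per (14) with the MDL penalty c(l) = (2l + 1)/2 * ln(KN).

    log_likelihoods[l] is the maximized log-likelihood under the assumption of
    l sources, l = 0, ..., L_max, obtained from any of the (hybrid) estimators.
    """
    orders = np.arange(len(log_likelihoods))
    penalties = (2 * orders + 1) / 2 * np.log(K * N)
    return int(np.argmax(np.asarray(log_likelihoods) - penalties))

# Hypothetical example with L_max = 4 candidate orders and K = 4, N = 10:
print(select_model_order([-310.2, -265.7, -240.1, -239.0, -238.6], K=4, N=10))
```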
B. Purely Data-Based Model Order Selection

As we have seen in Section III-D, the DoA estimates in the low SNR and low number of snapshots region are heavily affected by outliers. In [21], it is shown that in the fully sampled case a NN-based model order selection approach can outperform classical IC in exactly this region, while simultaneously performing equally well for high SNR and many snapshots. Therefore, we follow the lines of [21] and discuss a similar NN, to which we refer as CovNet, for model order selection for systems with subsampling.
1) Data and preprocessing:
For the NN, we use the same kind of preprocessing based on the sample covariance matrices as described in Section III-B1. Again, due to the artificial data we use, we can sample from the underlying stochastic model as described for MCENet. However, the network is now not only fed with data stemming from a stochastic model with fixed model order L, but we have to provide data for varying model orders L = 0, ..., L_max. This model order is used in the form of a one-hot encoded vector as the label for each data sample. During training, each batch consists of an equal number of realizations from the varying model orders such that no bias towards one model order is introduced.
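A small sketch of this labeling and batch-balancing step is shown below; the sampling and feature helpers are the hypothetical ones from the Section III sketches, and the batch layout and noise-power range are illustrative choices.

```python
import numpy as np

def one_hot(order, L_max):
    """One-hot label for a model order in {0, ..., L_max}."""
    label = np.zeros(L_max + 1)
    label[order] = 1.0
    return label

def balanced_batch(G_list, N, L_max, per_order, rng=None):
    """Build a training batch with an equal number of samples per model order."""
    rng = np.random.default_rng(rng)
    feats, labels = [], []
    for order in range(L_max + 1):
        for _ in range(per_order):
            doas = rng.uniform(0.0, 2 * np.pi, size=order)            # order = number of sources L
            Yk, _ = sample_subarrays(doas, G_list, N,                  # helper from Section III sketch
                                     sigma_eta2=rng.uniform(0.01, 1.0), rng=rng)
            feats.append(covariance_features(Yk))
            labels.append(one_hot(order, L_max))
    return np.stack(feats), np.stack(labels)
```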
2) Architecture and cost function:
Again, we use a fully connected, feedforward NN with N_h hidden layers with N_u neurons each and ReLU activation. The output layer consists of L_max + 1 neurons and applies a softmax operation to form the outputs z(ℓ), ℓ = 0, ..., L_max [37]. By training based on the cross-entropy loss, which for one-hot encoded labels is given by

$$\max_{\mathbf{w}} \; \ln\!\left( z(\ell^* \mid \mathbf{x}; \mathbf{w}) \right), \quad (15)$$

the output values z(ℓ) can be interpreted as estimates of the posterior probabilities for model order ℓ. The training procedure can be seen as a heuristic approach to the optimal maximum a posteriori (MAP) estimator, because the training adapts the weights w such that the estimate of the posterior probability z(L) of the true model order L of the input x is maximized [38].
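For illustration, a compact numpy sketch of such a classification network (one hidden ReLU layer as a stand-in for the N_h layers, softmax output, cross-entropy loss as in (15)) might look as follows; the layer sizes and initialization are placeholders, and x and one_hot come from the sketches above.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_covnet(in_dim, hidden, n_classes):
    """Single-hidden-layer stand-in for the fully connected CovNet."""
    return {"W1": rng.standard_normal((hidden, in_dim)) * np.sqrt(2.0 / in_dim),
            "b1": np.zeros(hidden),
            "W2": rng.standard_normal((n_classes, hidden)) * np.sqrt(2.0 / hidden),
            "b2": np.zeros(n_classes)}

def covnet_forward(params, x):
    """ReLU hidden layer followed by a softmax output, cf. [37]."""
    h = np.maximum(params["W1"] @ x + params["b1"], 0.0)
    logits = params["W2"] @ h + params["b2"]
    z = np.exp(logits - logits.max())
    return z / z.sum()                      # estimated posterior probabilities z(l)

def cross_entropy(params, x, one_hot_label):
    """Negative log-probability of the true model order, cf. (15)."""
    return -np.log(covnet_forward(params, x) @ one_hot_label + 1e-12)

params = init_covnet(in_dim=len(x), hidden=64, n_classes=5)   # x from the feature sketch above
loss = cross_entropy(params, x, one_hot(3, L_max=4))
```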
C. Simulations

We conducted simulations for model order selection with the same data generating model as introduced in Section III-D. The maximal number of sources L_max that we consider for our simulations is chosen such that we are operating in the range of L_max ≥ W. For CovNet, we use a smaller network than MCENet. CovNet has the same structure as its counterpart in [21] with N_h = 3 layers with N_u = 1024 neurons and has been trained on mini-batches by using the Adam optimizer [32] with a fixed learning rate. As a reference, we use the minimum description length (MDL) estimator, whose penalty term in (14) under the assumption of uncorrelated transmit signals is given as [22]

$$c(\ell) = \frac{2\ell + 1}{2} \ln(KN), \quad (16)$$

and which uses the hybrid SSR method, which is computationally tractable and achieves a better DoA estimation performance than the GLS approach (cf. Section III-D), to obtain the necessary parameter estimates.

In Fig. 6, we show the model order selection accuracy of the discussed methods for varying SNR. To that end, we use a test set with a fixed SNR and an equal number of data points from all model orders. Note that due to our SNR definition, a fixed SNR means that the power ratio of the strongest source to the noise is constant, but the transmit powers of the weaker sources are still randomly drawn, i.e., the ratio of transmit power and noise power for these sources is smaller than the stated SNR. Similarly, we show the achieved accuracy of the different methods for a different number of snapshots N in Fig. 7, where the respective test sets consist of data samples with a fixed number of snapshots N and randomly drawn SNR. In both cases, CovNet achieves a significantly higher accuracy than the MDL estimator. As we are operating in a low snapshot region, the SSR estimators are prone to outliers, as discussed for L = 3 in Section III-D, which leads to the suboptimal performance of the MDL estimator compared to the NN-based approach.

Fig. 6. Model order selection accuracy vs. SNR, N = 10.

Note that, in Fig. 7, we show two different CovNet results. The solid red line shows the accuracy for a NN where the number of snapshots in the training set and test set are matching, i.e., N_train = N, whereas the dashed orange line shows the performance of a CovNet model that has been trained on data with N_train = 10 snapshots, which means that for this model the data in the test and training sets are different. Interestingly, the CovNet model trained on N_train = 10 snapshots is able to generalize well to data with a different number of snapshots and achieves almost the same performance over all N as the NNs that have been trained on the same number of snapshots as in the test set, N = N_train. This means that for the implementation in a direction finder, NNs for each possible number of snapshots do not have to be stored, but a certain realization can cover different N.

Again, we end this section with a short comparison of the required computation times of each presented model order selection algorithm. For realizations with varying SNR evaluated on the same simulation server as discussed in Section III-D, the inference from the CovNet model takes only a small fraction of the time required by the MDL estimator, since for the latter a DoA estimate for all possible model orders (including the computationally expensive high model orders) has to be computed for every realization.
Fig. 7. Model order selection accuracy vs. N.

Although the CovNet approach does not automatically yield a DoA estimate like the MDL approach, its execution time combined with the time for a consecutive DoA estimation for the estimated model order (cf. Section III-D for L = 3) is still significantly smaller.

V. CONCLUSION
From the simulation results in Sections III-D and IV-C, we see that NN-based approaches to DoA estimation and model order selection are viable alternatives to existing model-based techniques for systems with subarray sampling. In terms of selection accuracy and DoA estimation error, the proposed NN schemes are able to outperform model-based techniques when the number of available snapshots is small. Hereby, a combination of NN-based initialization and model-based gradient steps was crucial to achieve competitive DoA estimates, although improvements on the architecture or training procedure may further improve the purely NN-based estimates (cf. results in [18]). Additionally, the computational complexity of a NN inference is considerably lower than the evaluation of model-based estimators, which enables completely NN-based DoA estimation chains for time-critical applications.

However, there are still some open questions that need to be addressed: How do NN-based approaches cope with array calibration? And how robust are these methods when model imperfections come into play? One idea to tackle these problems is to use an online learning procedure to adapt a pretrained NN to the changed model, as has already been proposed in [21].
APPENDIX A
ALTERNATING UPDATE FOR SSR ESTIMATOR
The optimization problem for the SSR estimator, according to [8, Equation (46)], is given by

$$\min_{\mathbf{p}, \sigma_\eta^2} \; \sum_{k=1}^{K} \operatorname{tr}\!\left( \check{\mathbf{R}}_y^{(k),-1} \hat{\mathbf{R}}_y^{(k)} \right) \quad \text{s.t.} \;\; \mathbf{p} \geq \mathbf{0}, \; \sigma_\eta^2 \geq 0, \; \sum_{g=1}^{G} w_g p_g + \bar{w} \sigma_\eta^2 = 1, \quad (17)$$

with the sparse representation of the covariance matrix

$$\check{\mathbf{R}}_y^{(k)} = \check{\mathbf{A}}^{(k)} \operatorname{diag}\{\mathbf{p}\} \check{\mathbf{A}}^{(k),\mathrm{H}} + \sigma_\eta^2 \mathbf{I}, \quad (18)$$

where $\check{\mathbf{A}}^{(k)}$ is a dictionary containing the G subarray steering vectors $\mathbf{G}^{(k)} \mathbf{a}(\check{\theta}_g)$, g = 1, ..., G, and p contains the respective power values.

The weights in (17) are given as

$$w_g = \frac{1}{KW} \sum_{k=1}^{K} \mathbf{a}^{\mathrm{H}}(\check{\theta}_g) \mathbf{G}^{(k),\mathrm{H}} \hat{\mathbf{R}}_y^{(k),-1} \mathbf{G}^{(k)} \mathbf{a}(\check{\theta}_g), \quad (19)$$

$$\bar{w} = \frac{1}{KW} \sum_{k=1}^{K} \operatorname{tr}\!\left( \hat{\mathbf{R}}_y^{(k),-1} \right). \quad (20)$$

Note that we added a missing factor of 1/K compared to [8, Equation (46)], as (cf. [26, Equation (17)])

$$\sum_{k=1}^{K} \sum_{g=1}^{G} p_g \mathbf{a}^{\mathrm{H}}(\check{\theta}_g) \mathbf{G}^{(k),\mathrm{H}} \hat{\mathbf{R}}_y^{(k),-1} \mathbf{G}^{(k)} \mathbf{a}(\check{\theta}_g) + \sum_{k=1}^{K} \sigma_\eta^2 \operatorname{tr}\!\left( \hat{\mathbf{R}}_y^{(k),-1} \right) \xrightarrow{N \to \infty} KW. \quad (21)$$

Following the lines of [26, Section III], we obtain the alternating update rules in the (i+1)-th iteration as

$$p_g^{[i+1]} = p_g^{[i]} \frac{ \left\| \sum_{k=1}^{K} \mathbf{a}^{\mathrm{H}}(\check{\theta}_g) \mathbf{G}^{(k),\mathrm{H}} \check{\mathbf{R}}_y^{(k),-1} \hat{\mathbf{R}}_y^{(k),1/2} \right\| }{ w_g^{1/2} \, \xi^{[i]} }, \quad (22)$$

$$\sigma_\eta^{2,[i+1]} = \sigma_\eta^{2,[i]} \frac{ \left( \sum_{k=1}^{K} \operatorname{tr}\!\left( \check{\mathbf{R}}_y^{(k),-1} \hat{\mathbf{R}}_y^{(k)} \check{\mathbf{R}}_y^{(k),-1} \right) \right)^{1/2} }{ \bar{w}^{1/2} \, \xi^{[i]} }, \quad (23)$$

with

$$\xi^{[i]} = \sum_{g=1}^{G} w_g^{1/2} p_g^{[i]} \left\| \sum_{k=1}^{K} \mathbf{a}^{\mathrm{H}}(\check{\theta}_g) \mathbf{G}^{(k),\mathrm{H}} \check{\mathbf{R}}_y^{(k),-1} \hat{\mathbf{R}}_y^{(k),1/2} \right\| + \bar{w}^{1/2} \sigma_\eta^{2,[i]} \left( \sum_{k=1}^{K} \operatorname{tr}\!\left( \check{\mathbf{R}}_y^{(k),-1} \hat{\mathbf{R}}_y^{(k)} \check{\mathbf{R}}_y^{(k),-1} \right) \right)^{1/2}. \quad (24)$$
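A compact sketch of these alternating updates (22)–(24) is given below, reusing the hypothetical uca_steering helper from the Section II sketch; the grid, initialization, and iteration count are illustrative, and the matrix square roots are computed via eigendecomposition.

```python
import numpy as np

def ssr_alternating(R_hat_list, G_list, grid, n_iter=100, radius_over_lambda=1.0):
    """Alternating SSR/SPICE-style updates (22)-(24) for the powers p and sigma_eta^2."""
    def psd_sqrt(R):
        vals, vecs = np.linalg.eigh(R)
        return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.conj().T

    K, W, M = len(G_list), G_list[0].shape[0], G_list[0].shape[1]
    A_grid = uca_steering(np.asarray(grid), M, radius_over_lambda)       # dictionary steering vectors
    steer = [G @ A_grid for G in G_list]                                 # G^(k) a(theta_g), W x G
    R_inv = [np.linalg.inv(R) for R in R_hat_list]
    R_sqrt = [psd_sqrt(R) for R in R_hat_list]                           # R_hat^{1/2}
    w = np.array([sum(np.real(steer[k][:, g].conj() @ R_inv[k] @ steer[k][:, g])
                      for k in range(K)) for g in range(len(grid))]) / (K * W)   # (19)
    w_bar = sum(np.trace(Ri).real for Ri in R_inv) / (K * W)                     # (20)
    p = np.full(len(grid), 1.0 / len(grid))
    s2 = 0.1
    for _ in range(n_iter):
        R_chk = [S @ np.diag(p) @ S.conj().T + s2 * np.eye(W) for S in steer]    # (18)
        R_chk_inv = [np.linalg.inv(R) for R in R_chk]
        num_p = np.array([np.linalg.norm(sum(steer[k][:, g].conj() @ R_chk_inv[k] @ R_sqrt[k]
                                             for k in range(K))) for g in range(len(grid))])
        num_s = np.sqrt(sum(np.trace(Ci @ Rh @ Ci).real
                            for Ci, Rh in zip(R_chk_inv, R_hat_list)))
        xi = np.sum(np.sqrt(w) * p * num_p) + np.sqrt(w_bar) * s2 * num_s        # (24)
        p = p * num_p / (np.sqrt(w) * xi)                                        # (22)
        s2 = s2 * num_s / (np.sqrt(w_bar) * xi)                                  # (23)
    return p, s2
```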
REFERENCES

[1] J. Sheinvald and M. Wax, "Direction finding with fewer receivers via time-varying preprocessing," IEEE Trans. Signal Process., vol. 47, pp. 2–9, Jan. 1999.
[2] E. Fishler and H. Messer, "Multiple source direction finding with an array of M sensors using two receivers," in Proc. Workshop on Statistical Signal and Array Process., Aug. 2000, pp. 86–89.
[3] A. Zeira and B. Friedlander, "Direction finding with time-varying arrays," IEEE Trans. Signal Process., vol. 43, no. 4, pp. 927–937, Apr. 1995.
[4] B. Friedlander and A. Zeira, "Eigenstructure-based algorithms for direction finding with time-varying arrays," IEEE Trans. Aerosp. Electron. Syst., vol. 32, no. 2, pp. 689–701, Apr. 1996.
[5] D. Rieken and D. Fuhrmann, "Generalizing MUSIC and MVDR for multiple noncoherent arrays," IEEE Trans. Signal Process., vol. 52, pp. 2396–2406, Sep. 2004.
[6] J. Worms, "RF direction finding with a reduced number of receivers by sequential sampling," in Proc. Int. Conf. on Phased Array Syst. and Technol., May 2000, pp. 165–168.
[7] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Trans. Antennas Propag., vol. 34, no. 3, pp. 276–280, Mar. 1986.
[8] W. Suleiman, P. Parvazi, M. Pesavento, and A. M. Zoubir, "Non-coherent direction-of-arrival estimation using partly calibrated arrays," IEEE Trans. Signal Process., vol. 66, no. 21, pp. 5776–5788, Nov. 2018.
[9] R. Rastogi, P. Gupta, and R. Kumaresan, "Array signal processing with interconnected neuron-like elements," in Proc. ICASSP, vol. 12, Apr. 1987, pp. 2328–2331.
[10] K. L. Du, A. K. Y. Lai, K. K. M. Cheng, and M. N. S. Swamy, "Neural methods for antenna array signal processing: a review," Signal Process., vol. 82, no. 4, pp. 547–561, Apr. 2002.
[11] S. Chakrabarty and E. A. P. Habets, "Multi-speaker DOA estimation using deep convolutional networks trained with noise signals," IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 8–21, Mar. 2019.
[12] E. Ozanich, P. Gerstoft, and H. Niu, "A feedforward neural network for direction-of-arrival estimation," J. Acoustical Soc. Amer., vol. 147, no. 3, pp. 2035–2048, Mar. 2020.
[13] Y. Yao, H. Lei, and W. He, "A-CRNN-based method for coherent DOA estimation with unknown source number," Sensors, vol. 20, no. 8, p. 2296, Jan. 2020.
[14] Z.-M. Liu, C. Zhang, and P. S. Yu, "Direction-of-arrival estimation based on deep neural networks with robustness to array imperfections," IEEE Trans. Antennas Propag., vol. 66, no. 12, pp. 7315–7327, Dec. 2018.
[15] A. M. Elbir, "DeepMUSIC: Multiple signal classification via deep learning," IEEE Sensors Lett., vol. 4, no. 4, pp. 1–4, Apr. 2020.
[16] L. Wu, Z.-M. Liu, and Z.-T. Huang, "Deep convolution network for direction of arrival estimation with sparse prior," IEEE Signal Process. Lett., vol. 26, no. 11, pp. 1688–1692, Nov. 2019.
[17] Y. Guo, Z. Zhang, Y. Huang, and P. Zhang, "DOA estimation method based on cascaded neural network for two closely spaced sources," IEEE Signal Process. Lett., vol. 27, pp. 570–574, Apr. 2020.
[18] O. Bialer, N. Garnett, and T. Tirer, "Performance advantages of deep neural networks for angle of arrival estimation," in Proc. ICASSP, May 2019, pp. 3907–3911.
[19] P. Costa-Hirschauer, J. Grouffaud, P. Larzabal, and H. Clergeot, "A neural network for model order selection in signal processing," in Proc. ICNN, vol. 6, Nov. 1995, pp. 3057–3061.
[20] Y. Yang, F. Gao, C. Qian, and G. Liao, "Model-aided deep neural network for source number detection," IEEE Signal Process. Lett., vol. 27, pp. 91–95, Jan. 2020.
[21] A. Barthelme, R. Wiesmayr, and W. Utschick, "Model order selection in DoA scenarios via cross-entropy based machine learning techniques," in Proc. ICASSP, May 2020, pp. 4622–4626.
[22] J. Sheinvald and M. Wax, "Detection and localization of multiple signals using subarrays data," in Advances in Spectr. Anal. and Array Process. (vol. III), S. Haykin, Ed. USA: Prentice-Hall, Inc., 1995, pp. 324–351.
[23] H. Krim and M. Viberg, "Two decades of array signal processing research: the parametric approach," IEEE Signal Process. Mag., vol. 13, no. 4, pp. 67–94, July 1996.
[24] A. Zeira and B. Friedlander, "On the performance of direction finding with time-varying arrays," Signal Process., vol. 43, no. 2, pp. 133–147, May 1995.
[25] B. Hochwald and A. Nehorai, "Identifiability in array processing models with vector-sensor applications," IEEE Trans. Signal Process., vol. 44, no. 1, pp. 83–95, Jan. 1996.
[26] P. Stoica, P. Babu, and J. Li, "SPICE: A sparse covariance-based estimation method for array processing," IEEE Trans. Signal Process., vol. 59, no. 2, pp. 629–638, Feb. 2011.
[27] P. Costa, J. Grouffaud, P. Larzabal, and H. Clergeot, "Estimation of the number of signals from features of the covariance matrix: a supervised approach," IEEE Trans. Signal Process., vol. 47, no. 11, pp. 3108–3115, Nov. 1999.
[28] R. Reggiannini, "A fundamental lower bound to the performance of phase estimators over Rician-fading channels," IEEE Trans. Commun., vol. 45, no. 7, pp. 775–778, Jul. 1997.
[29] T. Routtenberg and J. Tabrikian, "Non-Bayesian periodic Cramér-Rao bound," IEEE Trans. Signal Process., vol. 61, no. 4, pp. 1019–1032, Feb. 2013.
[30] Z. Yang, J. Li, P. Stoica, and L. Xie, "Sparse methods for direction-of-arrival estimation," in Academic Press Library in Signal Processing, Volume 7. Elsevier, 2018, pp. 509–581.
[31] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. Int. Conf. Artif. Intell. Statist., vol. 9, May 2010, pp. 249–256.
[32] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," Dec. 2014, arXiv preprint arXiv:1412.6980v9.
[33] J. Lofberg, "YALMIP: a toolbox for modeling and optimization in MATLAB," in Proc. ICRA, Sep. 2004, pp. 284–289.
[34] MOSEK ApS, The MOSEK optimization toolbox for MATLAB manual. Version 9.0., 2019. [Online]. Available: http://docs.mosek.com/9.0/toolbox/index.html
[35] P. Stoica and Y. Selen, "Model-order selection: a review of information criterion rules," IEEE Signal Process. Mag., vol. 21, no. 4, pp. 36–47, July 2004.
[36] M. Wax and T. Kailath, "Detection of signals by information theoretic criteria," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 387–392, April 1985.
[37] J. S. Bridle, "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters," in Advances Neural Inf. Process. Syst., Nov. 1989, pp. 211–217.
[38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.