Reinforcement learning-based waveform optimization for MIMO multi-target detection
Li Wang, Stefano Fortunati, Maria Sabrina Greco, Fulvio Gini
Li Wang
Dept. of Electronic Engineering, Tsinghua University, Beijing
[email protected]

Stefano Fortunati, Maria S. Greco, Fulvio Gini
Dept. of Information Engineering, University of Pisa, Pisa, Italy
{s.fortunati, f.gini, m.greco}@iet.unipi.it

Abstract—A cognitive beamforming algorithm for colocated MIMO radars, based on the Reinforcement Learning (RL) framework, is proposed. We analyse an RL-based optimization protocol that allows the MIMO radar, i.e. the agent, to iteratively sense the unknown environment, i.e. the radar scene involving an unknown number of targets at unknown angular positions, and, consequently, to synthesize a set of transmitted waveforms whose related beam pattern is tailored to the acquired knowledge. The performance of the proposed RL-based beamforming algorithm is assessed through numerical simulations in terms of Probability of Detection ($P_D$).

I. INTRODUCTION
Reinforcement learning (RL) is an area of machine learning with connections to control theory, optimization, and cognitive sciences that potentially has a wide variety of applications in Radar Signal Processing. In short, the RL framework is a machine learning technique that allows an active learner, usually called agent, to learn through experience (see e.g. the survey [1] and references therein). Specifically, the agent learns how to choose the best action to achieve its goal by interacting with an unknown environment, without any pre-assigned control policy. As this brief introduction suggests, the "learning loop" characterizing the RL framework has a strong similarity with the Cognitive Radar (CR) iterative feedback control system [2,3].

In this paper, by drawing and merging elements from both the RL and CR frameworks, we analyse a possible application of some basic RL tools to a classical problem in colocated Multiple-Input Multiple-Output (MIMO) radar systems: the (cognitive) waveform optimization for multi-target detection. One of the main advantages of a (colocated) MIMO configuration with respect to the classical phased array is that each antenna element of the MIMO array is allowed to transmit a different probing signal [4]–[6]. As a consequence of this waveform diversity, the (colocated) MIMO beam pattern can be arbitrarily shaped in order to focus the transmitted power mainly in the angular directions of the potential targets. Using the RL formalism, we can say that our agent is the MIMO radar and the environment is the radar scene involving an unknown number of targets with unknown (angular) positions embedded in an unknown disturbance. The RL learning loop can therefore be exploited to optimize the transmitted waveforms in order to maximize the detection capabilities of the MIMO radar acting in this unknown scenario. In this work, the statistics of the disturbance are assumed to be known, and we focus our attention on the uncertainties regarding the targets.
Specifically, by transmitting a first set of orthonormal waveforms, which corresponds to an omnidirectional beam pattern, the radar can acquire a first partial knowledge about the number and the positions of the targets. Then, according to the acquired information, the beam pattern can be shaped towards the targets by optimizing the second set of waveforms to be transmitted. This learning process can be iterated in order to continuously improve the radar detection capability and to adapt the waveforms to possible changes in the environment, e.g. a change in the number of targets.

The rest of the paper is organized as follows. Sec. II provides the signal model of a colocated MIMO radar, while in Sec. III the Generalized Likelihood Ratio Test (GLRT) for the detection problem at hand is derived. Sec. IV is the core of this work: it presents an original RL-based beamforming algorithm in the presence of white Gaussian noise with possibly unknown power. The performance of the proposed method is assessed in Sec. V by means of numerical simulations. Finally, some conclusions are collected in Sec. VI.
Notation: Italics indicates scalar quantities ($a$); lower-case and upper-case boldface indicate column vectors ($\mathbf{a}$) and matrices ($\mathbf{A}$), respectively. The relation $\mathbf{A} \succeq \mathbf{B}$ means that $\mathbf{A} - \mathbf{B}$ is a positive semi-definite matrix. The superscripts $*$, $T$ and $H$ indicate the complex conjugate, the transpose and the Hermitian operators. $\mathbf{I}_{N \times N}$ is the $N \times N$ identity matrix. With $\otimes$, $\mathrm{tr}(\cdot)$ and $\mathrm{vec}(\cdot)$, we indicate the Kronecker product, the trace and the vectorization operators. Finally, $\|\cdot\|$ indicates the Euclidean norm.

II. COLOCATED MIMO RADAR SIGNAL MODEL
Let us consider a MIMO radar system equipped with $N_T$ transmitters and $N_R$ receivers [4]. Both the receive and transmit subarrays are uniform linear arrays with half-wavelength element separation. Let $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_{N_T}]^T$ denote the transmitted signal matrix. In particular, the $q$-th row of $\mathbf{W}$, i.e. $\mathbf{w}_q \in \mathbb{C}^N$, contains $N$ discrete samples of the waveform transmitted from the $q$-th element, with $q = 1, \ldots, N_T$. Following [6], let us assume that the transmitted waveforms can be expressed as a weighted sum of $N_T$ independent orthonormal baseband waveforms $\mathbf{\Phi} = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_{N_T}]^T$, where $\boldsymbol{\phi}_q \in \mathbb{C}^N$ is the $q$-th orthonormal baseband waveform. Then the transmitted signal matrix can be expressed as $\mathbf{W} = \mathbf{C}\mathbf{\Phi}$, where the weighting matrix is $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_{N_T}]^T$ and $\mathbf{c}_q \in \mathbb{C}^{N_T}$ is the weighting vector of the $q$-th transmit element, whose power is given by $\|\mathbf{c}_q\|^2$. The total transmitted power is then $P_T = \sum_{q=1}^{N_T} \|\mathbf{c}_q\|^2$. The beam pattern generated by the transmitted waveforms is given by $B(\theta) = \mathbf{a}_T^T(\theta) \mathbf{R}_W \mathbf{a}_T^*(\theta)$ [5,6], where $\mathbf{R}_W = \mathbf{C}\mathbf{C}^H$ is the covariance matrix of the transmitted waveforms, $\mathbf{a}_T(\theta) = [1, e^{j\pi \sin\theta}, \ldots, e^{j\pi(N_T-1)\sin\theta}]^T$ is the transmitter array steering vector, and $\theta$ is the Direction of Arrival (DOA) of the target. By considering a particular angle-range cell and a single transmitted pulse, after the matched filter, the measurement model can be expressed as [6,7]:

$\mathbb{C}^{N_R \times N_T} \ni \mathbf{Y} = \alpha\, \mathbf{a}_R(\theta) \mathbf{a}_T^T(\theta) \mathbf{C} + \mathbf{N}$, (1)

where $\alpha \in \mathbb{C}$ accounts for the Radar Cross Section (RCS) of the target, the two-way path loss and the straddling losses, and $\mathbf{a}_R(\theta) = [1, e^{j\pi \sin\theta}, \ldots, e^{j\pi(N_R-1)\sin\theta}]^T$ is the receiver array steering vector. The noise term $\mathbf{N}$ is a matrix whose columns are mutually independent, zero-mean, circular complex Gaussian vectors with covariance matrix equal to $\sigma^2 \mathbf{I}_{N_R}$. The model in (1) can be rewritten in a more convenient vectorial form as:

$\mathbb{C}^{N_T N_R} \ni \mathbf{y} = \mathrm{vec}(\mathbf{Y}) = \alpha\, \mathbf{h}(\theta) + \mathbf{n}$, (2)

where, by using the properties of the Kronecker product, the vector $\mathbf{h}$ has been defined as:

$\mathbf{h}(\theta) = (\mathbf{C}^T \mathbf{a}_T(\theta)) \otimes \mathbf{a}_R(\theta)$, (3)

while the noise vector is:

$\mathbf{n} = \mathrm{vec}(\mathbf{N}) \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_{N_T N_R})$. (4)
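The matrix and vectorized models (1)–(4) are easy to check numerically. Below is a minimal numpy sketch (not from the paper): the element counts match Sec. V, while the target amplitude $\alpha$ and the DOA $\theta$ are made-up illustrative values. It builds the half-wavelength steering vectors and an omnidirectional weighting matrix $\mathbf{C}$, and verifies that $\mathrm{vec}(\alpha\,\mathbf{a}_R \mathbf{a}_T^T \mathbf{C}) = \alpha\,\mathbf{h}(\theta)$ with $\mathbf{h}(\theta)$ as in (3).

```python
import numpy as np

rng = np.random.default_rng(0)

N_T, N_R = 16, 16          # transmit / receive elements (values from Sec. V)
P_T = 1.0                  # total transmitted power (assumed)
sigma2 = 1.0               # noise power
theta = np.deg2rad(14.0)   # target DOA (one of the Sec. V target angles)
alpha = 0.3 + 0.2j         # complex target amplitude (hypothetical)

def steering(n, th):
    # Half-wavelength ULA steering vector [1, e^{j*pi*sin(th)}, ...]
    return np.exp(1j * np.pi * np.arange(n) * np.sin(th))

a_T, a_R = steering(N_T, theta), steering(N_R, theta)

# Orthonormal-waveform (omnidirectional) weighting: C = sqrt(P_T/N_T) * I
C = np.sqrt(P_T / N_T) * np.eye(N_T)

# Matrix model (1): Y = alpha * a_R a_T^T C + N
N = np.sqrt(sigma2 / 2) * (rng.standard_normal((N_R, N_T))
                           + 1j * rng.standard_normal((N_R, N_T)))
Y = alpha * np.outer(a_R, a_T) @ C + N

# Vectorized model (2)-(3): y = alpha * h(theta) + n, h = (C^T a_T) kron a_R
h = np.kron(C.T @ a_T, a_R)
y = Y.flatten(order="F")   # vec() stacks the columns of Y

# Consistency check: vec(alpha * a_R a_T^T C) equals alpha * h(theta)
assert np.allclose((alpha * np.outer(a_R, a_T) @ C).flatten(order="F"), alpha * h)

# Beam pattern B(theta) = a_T^T R_W a_T^* with R_W = C C^H
R_W = C @ C.conj().T
B = lambda th: np.real(steering(N_T, th) @ R_W @ steering(N_T, th).conj())
```

With the omnidirectional choice $\mathbf{C} = \sqrt{P_T/N_T}\,\mathbf{I}$, the beam pattern evaluates to the constant $P_T$ for every angle, which is what motivates calling this first transmission omnidirectional.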
III. TARGET DETECTION AND GLRT

Let us now assume a discrete radar field of view divided into $LG$ angle-range cells. Specifically, we suppose that the angle $\theta$ in the signal model given in (2) is a discrete variable with values in the set $\mathcal{L} \triangleq \{l\pi/L - \pi/2 \mid l = 0, \ldots, L-1\}$. For each discrete angle (or angle bin) $\theta_l \in \mathcal{L}$, we have $G$ range resolution cells, each of which will be indexed with the index $g \in \mathcal{G} \triangleq \{1, \ldots, G\}$. Moreover, we assume that the radar system transmits $K$ pulses (indexed by $k = 1, \ldots, K$), each of which is characterized by the signal matrix $\mathbf{W}$ defined in Sec. II. Under these assumptions, our aim is to handle the following Hypothesis Testing (HT) problem:

$\begin{cases} H_0: \mathbf{y}_{l,g}^k = \mathbf{n}_{l,g}^k \\ H_1: \mathbf{y}_{l,g}^k = \alpha_{l,g}^k \mathbf{h}_l + \mathbf{n}_{l,g}^k. \end{cases}$ (5)

By considering the noise parameter $\sigma^2$ as a priori known, the GLR statistic $\Lambda_{l,g}^k \equiv \Lambda(\mathbf{y}_{l,g}^k)$ can be expressed as:

$\Lambda_{l,g}^k \triangleq 2 \ln \max_{\alpha_{l,g}^k \in \mathbb{C}} \dfrac{p_{H_1}(\mathbf{y}_{l,g}^k \mid \alpha_{l,g}^k)}{p_{H_0}(\mathbf{y}_{l,g}^k)} = \dfrac{2}{\sigma^2} \dfrac{|\mathbf{h}_l^H \mathbf{y}_{l,g}^k|^2}{\|\mathbf{h}_l\|^2}$. (6)

From Wilks's theorem [8,9], the asymptotic distribution under $H_0$ of the GLR statistic in (6) is given by:

$\Lambda(\mathbf{y}_{l,g}^k \mid H_0) \overset{d.}{\sim} \chi_2^2$ as $N_T N_R \to \infty$, (7)

where $\chi_2^2$ indicates the central chi-squared distribution with 2 degrees of freedom (dof). Similarly, under $H_1$, we have that:

$\Lambda(\mathbf{y}_{l,g}^k \mid H_1) \overset{d.}{\sim} \chi_2^2(\delta_{l,g}^k)$ as $N_T N_R \to \infty$, (8)

where $\chi_2^2(\delta)$ indicates the non-central chi-squared distribution with 2 dof and non-centrality parameter $\delta_{l,g}^k = 2|\alpha_{l,g}^k|^2/\sigma^2$. The Probability of False Alarm ($P_{FA}$) is defined as:

$P_{FA} = \Pr\{\Lambda_{l,g}^k > \lambda \mid H_0\} \simeq \int_\lambda^\infty p_{\Lambda_{l,g}^k}(a \mid H_0)\, da \triangleq H_{\chi_2^2}(\lambda)$, (9)

where $p_\Lambda(\cdot \mid H_0) \equiv \chi_2^2$. Consequently, given a desired value of the $P_{FA}$, say $\bar{P}_{FA}$, the threshold $\bar{\lambda}$ can be set as:

$\bar{\lambda} = H_{\chi_2^2}^{-1}(\bar{P}_{FA})$, (10)

where $H_{\chi_2^2}^{-1}$ is the inverse of the function $H_{\chi_2^2}$ defined in (9). It is worth noting here that the noise power $\sigma^2$ is, in general, unknown. For this reason, we have to replace its true value in (6) with a consistent estimate, say $\hat{\sigma}^2$. Remarkably, if $\hat{\sigma}^2$ is a $\sqrt{N_T N_R}$-consistent estimator, the asymptotic distributions of the GLR statistic given in (7) and (8) remain unchanged.
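As a sanity check on (6)–(10), the sketch below simulates the detector under $H_0$ and compares the empirical false-alarm rate against the nominal one obtained from the asymptotic $\chi_2^2$ threshold (10). The dimension, trial count and nominal $P_{FA}$ are arbitrary illustrative choices, not the Sec. V settings.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Hypothetical sizes; h is the known signature vector of one angle bin
n_dim, trials, pfa_nom = 256, 20000, 1e-2
sigma2 = 1.0
h = rng.standard_normal(n_dim) + 1j * rng.standard_normal(n_dim)

# Threshold from the asymptotic chi-squared(2) null distribution, eq. (10)
thr = chi2.ppf(1 - pfa_nom, df=2)

# Monte Carlo under H0 (y = n only), GLR statistic of eq. (6)
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((trials, n_dim))
                               + 1j * rng.standard_normal((trials, n_dim)))
lam = 2.0 / sigma2 * np.abs(noise @ h.conj())**2 / np.linalg.norm(h)**2
pfa_emp = np.mean(lam > thr)
print(f"nominal PFA = {pfa_nom}, empirical PFA = {pfa_emp:.4f}")
```

Under $H_0$ the statistic is exactly $\chi_2^2$-distributed here (the model is Gaussian with known $\sigma^2$), so the empirical rate should match the nominal one up to Monte Carlo error.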
IV. AN RL-BASED BEAMFORMING ALGORITHM FOR MULTI-TARGET DETECTION
In this section, we provide a full description of the RL-based beamforming algorithm. The basic RL notions and tools are first recalled; their application to the specific MIMO multi-target detection problem is then discussed and analysed.

Let us introduce a (finite) Markov decision process (MDP) as presented in [10, Ch. 14] and [1]. The MDP is a suitable model to describe the closed loop of interactions between the agent (the MIMO radar) and the environment, i.e. the radar scenario involving multiple targets and Gaussian noise. Formally, an MDP model is a tuple $\{\mathcal{S}, \mathcal{A}, P, \rho\}$, where:
• $\mathcal{S}$ is the (finite) sample space of the set of random states,
• $\mathcal{A}$ is a (finite) set of actions,
• $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition probability,
• $\rho : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function.

Finally, let us introduce the policy $\pi : \mathcal{S} \to \mathcal{A}$ as the function that determines which action has to be taken at each state. The learning process can be summarized as follows. At time $k$, the agent observes the state $s_k \in \mathcal{S}$, which is considered to be a random variable with values in $\mathcal{S}$. Then, according to a specific policy $\pi$, the agent decides to take action $a_k = \pi(s_k) \in \mathcal{A}$. As a consequence of the action $a_k$, the agent observes the state $s_{k+1} = \delta(s_k, a_k) \in \mathcal{S}$ with probability $P(s_k, a_k, s_{k+1}) \triangleq \Pr\{s_{k+1} \mid s_k, a_k\}$, receiving a reward $r_{k+1} = \rho(s_k, a_k) \in \mathbb{R}$. The function $P$ depends on the environment to be sensed and is generally unknown. At this point, the agent has to choose the next action $a_{k+1} \in \mathcal{A}$ according to the policy $\pi$, and so on. Clearly, the critical point of this learning procedure is the choice of the optimal policy $\pi$. As with any other optimality criterion, the definition of the "best" policy, say $\pi_{opt}$, relies on a score function, which in the RL framework is the so-called expected return $V^\pi(s)$ [10, Ch. 14]. Specifically, given a sequence of (random) states $s_k, \ldots, s_K$, the expected return for the policy $\pi$ is defined as:

$V^\pi(s) = E\Big\{ \sum_{\tau=0}^{K-k} \gamma^\tau \rho(s_{k+\tau}, \pi(s_{k+\tau})) \,\Big|\, s_k = s \Big\} = \rho(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} \Pr\{s' \mid s, \pi(s)\}\, V^\pi(s')$, (11)

where $s' = \delta(s, \pi(s))$ and $\gamma \in (0, 1]$ is a parameter that trades off short-term against long-term reward [1]. Specifically, if $\gamma$ is close to zero, only immediate rewards are considered. Given the score function $V^\pi(s)$, the task of the learning procedure is, therefore, to figure out the optimal policy $\pi_{opt}(s) = \arg\max_\pi V^\pi(s)$, which achieves the maximum expected return $V_{opt}(s) = \max_\pi V^\pi(s)$ for each possible state value $s \in \mathcal{S}$. To this end, let us introduce the optimal state-action value function $Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ as [10, Ch. 14],[1]:

$Q(s, a) \triangleq \rho(s, a) + \gamma \sum_{s' \in \mathcal{S}} \Pr\{s' \mid s, a\}\, V_{opt}(s')$. (12)

When both $\mathcal{S}$ and $\mathcal{A}$ are finite sets, a state-action value matrix $\mathbf{Q}$, whose entries are $[\mathbf{Q}]_{s,a} = Q(s, a)$, can be introduced. Given all the entries of $\mathbf{Q}$, the optimal policy at the state $s$ can be simply defined as a sort of lookup table:

$\forall s \in \mathcal{S},\ \pi_{opt}(s) = \arg\max_{a \in \mathcal{A}} [\mathbf{Q}]_{s,a}$. (13)

However, as already pointed out, the conditional probability $\Pr\{s' \mid s, a\}$ in (12) is generally unknown. Then, the function $Q(s, a)$, at least in the form presented in (12), cannot be used directly to find the optimal policy. Fortunately, some additional manipulation is possible. By definition, $V_{opt}(s') = \max_{a' \in \mathcal{A}} Q(s', a')$, so that (12) can be rewritten in terms of a conditional expectation as:

$Q(s, a) = E_{s'}\{\rho(s, a) + \gamma \max_{a' \in \mathcal{A}} Q(s', a') \mid s\}$, (14)

which does not explicitly depend on $\Pr\{s' \mid s, a\}$. Finally, we can exploit some well-known stochastic approximation algorithms to iteratively obtain all the entries of the matrix $\mathbf{Q}$, as discussed e.g. in [10, Ch. 14]. The aim of the next subsections is to show how to apply this iterative learning algorithm to the MIMO beamforming problem at hand.
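When $P$ and $\rho$ are known, (12)–(14) can be solved directly by fixed-point iteration; the stochastic approximation machinery is needed precisely because $P$ is unknown. As a reference point, here is a toy numpy sketch with a randomly generated MDP (all sizes and values hypothetical) that iterates the Bellman recursion for $Q$ and extracts the lookup-table policy (13).

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy finite MDP (hypothetical sizes): |S| states, |A| actions
nS, nA, gamma = 4, 3, 0.5
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)   # rows of Pr{s'|s,a} sum to one
rho = rng.random((nS, nA))          # reward rho(s,a)

# Value iteration on Q(s,a) = rho(s,a) + gamma * sum_s' P(s,a,s') max_a' Q(s',a')
Q = np.zeros((nS, nA))
for _ in range(200):
    Q = rho + gamma * P @ Q.max(axis=1)

# Optimal policy as a lookup table, eq. (13)
pi_opt = Q.argmax(axis=1)
print("pi_opt =", pi_opt)
```

Since $\gamma < 1$ the recursion is a contraction, so 200 iterations are far more than enough to reach the fixed point on a 4-state problem.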
A. The state space
The MIMO radar detection scheme consists in testing all $LG$ angle-range cells one by one using the GLR statistic $\Lambda_{l,g}^k$ introduced in (6). Starting from the values assumed by $\Lambda_{l,g}^k$ in each $(l, g)$ angle-range cell at time $k$, the state space $\mathcal{S}$ can be set up as follows. Let us first define the statistic:

$\bar{\Lambda}_l^k \triangleq \begin{cases} 1 & \exists g \in \mathcal{G} : \Lambda_{l,g}^k > \bar{\lambda} \\ 0 & \text{otherwise.} \end{cases}$ (15)

In words, the statistic $\bar{\Lambda}_l^k$ is equal to 1 if the decision statistic $\Lambda_{l,g}^k$ exceeds the threshold in at least one range cell, at time step $k$, for the given $l$-th angular bin. Roughly speaking, $\bar{\Lambda}_l^k$ tells us whether or not the $l$-th angular bin may contain some targets at time $k$. Let us now define the discrete random variable $s_k$ as:

$s_k \triangleq \sum_{l=0}^{L-1} \bar{\Lambda}_l^k$, (16)

which tells us how many angular bins may contain targets. The state space $\mathcal{S} = \{0, \ldots, T_{max}\} \subset \mathbb{N}$ is then set equal to the sample space of the random variable $s_k$. Note that $T_{max}$ is the maximum number of targets, in different angle bins, that can be identified by the MIMO radar [11].

B. The set $\mathcal{A}$ of the actions

An action $a \in \mathcal{A} = \{a_i \mid i \in \{1, 2, \ldots, T_{max}\}\}$ in the MIMO beamforming problem at hand can be defined as the combination of two radar tasks: collecting a data snapshot $\mathbf{y}_{l,g}^k$ defined in (2), and optimizing the weighting matrix $\mathbf{C}$ in order to focus the transmitted power on the $i$ angle bins that contain potential targets. The cardinality $|\mathcal{A}|$ of the set of actions $\mathcal{A}$ should then be set equal to the maximum number of identifiable targets $T_{max}$.

Let us now describe in detail a typical action that the MIMO radar at hand has to perform at each time step $k$. As said before, an action $a_k \in \mathcal{A}$ consists firstly in the acquisition of the data snapshot $\mathbf{y}_{l,g}^k$ and then in a more involved optimization task. Specifically, the agent (the MIMO radar) first has to figure out the set of $i$ angle bins $\bar{\Theta}_i = \{\bar{\theta}_1, \ldots, \bar{\theta}_i\} \subset \mathcal{L}$ that most likely contains the targets, and then it has to optimize the beamforming procedure, i.e. it has to find the best matrix $\mathbf{C}$ to synthesize a beam pattern with a power distribution based on $\bar{\Theta}_i$. It is worth underlining here that the index $i$ used to define the sequence of sets $\{\bar{\Theta}_i\}_{i=1}^{T_{max}}$ is the same index that characterizes each element of the set $\mathcal{A}$. In the following, we describe these two steps more accurately.
1) Step 1:
Let $\mathbf{\Lambda}^k \in \mathbb{R}^{L \times G}$ be the matrix whose entries are the values of the GLR statistic at time $k$ for each angle-range cell, i.e. $[\mathbf{\Lambda}^k]_{l,g} \triangleq \Lambda_{l,g}^k$. Then, let $\mathbf{t} \in \mathbb{R}^L$ be the vector whose $l$-th entry represents the maximum value of the decision statistic over the range cells for the $l$-th angular bin, i.e. $[\mathbf{t}]_l = \max_{g \in \mathcal{G}} \Lambda_{l,g}^k$. Finally, let $\mathcal{T}_i$ be the set of indices of the $i$ largest entries of $\mathbf{t}$, i.e.:

$\mathcal{T}_i \triangleq \overset{(i)}{\underset{l \in \{0,\ldots,L-1\}}{\arg\max}}\ \mathbf{t}$. (17)

Consequently, $\bar{\Theta}_i \triangleq \{\bar{\theta}_j \in \mathcal{L} \mid j \in \mathcal{T}_i\}$.
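Steps (15)–(17) reduce to a few array operations. The sketch below uses stand-in GLR values (exponential draws, with hypothetical sizes and threshold) to form the per-bin flags $\bar{\Lambda}_l^k$, the state $s_k$, and the index set $\mathcal{T}_i$ of the $i$ top-scoring bins; in the actual loop, $i$ would be dictated by the chosen action $a_i$.

```python
import numpy as np

rng = np.random.default_rng(3)

L, G, thr = 20, 100, 9.21           # angle bins, range cells, threshold (hypothetical)
lam = rng.exponential(2.0, (L, G))  # stand-in GLR statistics Lambda^k_{l,g}

# Eq. (15)-(16): per-bin detection flags and the state s_k
bar_lam = (lam > thr).any(axis=1).astype(int)
s_k = int(bar_lam.sum())

# Eq. (17): indices of the i largest entries of t_l = max_g Lambda^k_{l,g}
t = lam.max(axis=1)
i = max(s_k, 1)                     # illustrative choice of the action index
T_i = np.argsort(t)[::-1][:i]       # the i top-scoring angular bins
print("s_k =", s_k, "T_i =", T_i)
```

Sorting in descending order and truncating is the direct array translation of the "$i$ argmax" operator in (17).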
2) Step 2:
After having obtained the set $\bar{\Theta}_i$, the weighting matrix $\mathbf{C}$ has to be designed in order to focus the transmitted power on the angle bins indexed in $\bar{\Theta}_i$. The aim of the resulting optimization problem is to maximize the minimum of $B(\bar{\theta}_j) = \mathbf{a}_T^T(\bar{\theta}_j) \mathbf{R}_W \mathbf{a}_T^*(\bar{\theta}_j)$ with $\bar{\theta}_j \in \bar{\Theta}_i$, under the energy constraint $\mathrm{tr}(\mathbf{R}_W) = P_T$, where $\mathbf{R}_W = \mathbf{C}\mathbf{C}^H$. This optimization problem can be expressed as:

$\max_{\mathbf{C}} \min_{j \in \mathcal{T}_i} \{\mathbf{a}_T^T(\bar{\theta}_j) \mathbf{C}\mathbf{C}^H \mathbf{a}_T^*(\bar{\theta}_j)\}$ s.t. $\mathrm{tr}(\mathbf{C}\mathbf{C}^H) = P_T$, (18)

or, equivalently, as a semi-definite program (SDP) [12,13]:

$\max_{\mathbf{R}_W, \zeta} \zeta$ s.t. $\mathbf{R}_W \succeq \mathbf{0}$, $\zeta \geq 0$, $\mathrm{tr}(\mathbf{R}_W) = P_T$, $\mathbf{a}_T^T(\bar{\theta}_j) \mathbf{R}_W \mathbf{a}_T^*(\bar{\theta}_j) \geq \zeta$, $\forall j \in \mathcal{T}_i$, (19)

where, after having obtained $\mathbf{R}_W$, the weighting matrix $\mathbf{C}$ can be derived using the algorithm in [14].

C. The reward
We indicate with $r_{k+1}$ the immediate reward obtained when the action $a_k \in \mathcal{A}$ is taken in the state $s_k \in \mathcal{S}$. For the MIMO radar detection problem at hand, a reasonable reward has to be related to the overall Probability of Detection ($P_D$) for all targets. Since, in each $(l, g)$ angle-range cell at time $k$, $P_D$ can be asymptotically approximated as:

$(P_D)_{l,g}^k \simeq \int_{\bar{\lambda}}^\infty p_{\Lambda_{l,g}^k}(a \mid H_1)\, da$, (20)

where $\bar{\lambda}$ is the threshold defined in (10) and $p_{\Lambda_{l,g}^k}(\cdot \mid H_1) \equiv \chi_2^2(\hat{\delta}_{l,g}^k)$ (see (8)), a possible reward function may be:

$r_{k+1} = \rho(s_k, a_k) \triangleq \sum_{l=0}^{L-1} \sum_{g=1}^{G} \psi(\hat{\delta}_{l,g}^k)$, (21)

where:

$\psi(\hat{\delta}_{l,g}^k) \triangleq \begin{cases} p_{\Lambda_{l,g}^k}(\hat{\delta}_{l,g}^k \mid H_1) & \hat{\delta}_{l,g}^k > \bar{\lambda} \\ 0 & \text{otherwise,} \end{cases}$ (22)

$\hat{\delta}_{l,g}^k = 2|\hat{\alpha}_{l,g}^k|^2/\sigma^2, \quad \hat{\alpha}_{l,g}^k = (\mathbf{h}_l^k)^H \mathbf{y}_{l,g}^k / \|\mathbf{h}_l^k\|^2$, (23)

and $\hat{\alpha}_{l,g}^k$ is the Maximum Likelihood (ML) estimate of $\alpha$ in the $(l, g)$-th angle-range cell at time step $k$.

D. The SARSA $Q$-learning algorithm

As discussed before, the crucial step of any RL-based procedure is the choice of the optimal policy $\pi_{opt}$ defined in (13). To this end, the state-action value function $Q$ in (14) has to be estimated first. Among all the possible stochastic approximation-based algorithms available for this task, in this work we choose to apply the so-called SARSA (state-action-reward-state-action) algorithm (see e.g. [15] or [10, Sec. 14.5.4]). In brief, SARSA is an iterative procedure involving three main steps:

1) Obtain a new state $s_{k+1} = \delta(s_k, a_k)$.
2) Choose a new action $a_{k+1}$ through an $\epsilon$-greedy algorithm according to the current value of the $Q$-function, i.e. $Q_k$. In particular, let $a_{opt} \triangleq \arg\max_{a \in \mathcal{A}} Q_k(s_{k+1}, a)$ be the optimal action; then

$a_{k+1} = \begin{cases} a_{opt} & \text{with prob. } \epsilon \\ a \in \mathcal{A} \setminus \{a_{opt}\} & \text{with prob. } 1 - \epsilon. \end{cases}$ (24)

Note that this $\epsilon$-greedy selection of $a_{k+1}$ is required to guarantee the convergence of the SARSA algorithm (see [10, Ch. 14] for more details).
3) Starting from an initial value, say $Q_0$, update the $Q$-function according to the following iteration:

$Q_{k+1} \leftarrow \beta Q_k(s_k, a_k) + (1 - \beta)\,[r_{k+1} + \gamma Q_k(s_{k+1}, a_{k+1}) - Q_k(s_k, a_k)]$, (25)

where $r_{k+1}$ is the reward defined in (21), $\beta \in (0, 1)$ controls the convergence speed, and $\gamma$ has already been defined in (11).

It is worth noting that the SARSA algorithm has three free parameters, i.e. $\gamma$, $\beta$ and $\epsilon$, that need to be chosen heuristically.
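For concreteness, here is a tabular SARSA sketch on the same kind of toy random MDP (all sizes and parameter values hypothetical, not the radar environment of Sec. V). It uses the common textbook form of the update, $Q \leftarrow Q + \beta(r_{k+1} + \gamma Q' - Q)$, and an $\epsilon$-greedy step that takes the greedy action with probability $1-\epsilon$; both are standard variants of the update and selection rules discussed above.

```python
import numpy as np

rng = np.random.default_rng(4)

nS, nA = 5, 3
gamma, lr, eps = 0.9, 0.1, 0.1
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)   # transition kernel Pr{s'|s,a}
rho = rng.random((nS, nA))          # rewards in [0, 1]

Q = np.zeros((nS, nA))
s = 0
a = int(rng.integers(nA))
for _ in range(20000):
    r = rho[s, a]
    s_next = int(rng.choice(nS, p=P[s, a]))
    # epsilon-greedy choice of the next action
    a_next = int(Q[s_next].argmax()) if rng.random() > eps else int(rng.integers(nA))
    # textbook SARSA update: Q <- Q + lr * (r + gamma * Q(s',a') - Q(s,a))
    Q[s, a] += lr * (r + gamma * Q[s_next, a_next] - Q[s, a])
    s, a = s_next, a_next

print("greedy policy:", Q.argmax(axis=1))
```

Because the environment is sampled rather than known, only the on-policy quadruple $(s_k, a_k, r_{k+1}, s_{k+1}, a_{k+1})$ enters each update, which is what makes SARSA applicable when $\Pr\{s' \mid s, a\}$ is unknown.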
V. NUMERICAL RESULTS

In this section, the performance of the proposed RL-based algorithm is assessed through Monte Carlo simulations in two different radar scenarios. In the first study case, we simulate 4 targets with different Signal-to-Noise Ratios (SNRs) that maintain the same angular position for the whole observation period of $K$ time steps. In the second study case, a scenario in which the number and the positions of the targets change during the observation period is considered. For both study cases, we consider a colocated MIMO radar system with a uniform linear transceiver array with $N_T = N_R = 16$ elements. The SNR of the $t$-th simulated target is $\mathrm{SNR}_t \triangleq E\{|\alpha_t^k|^2\}/\sigma^2$, where $\alpha_t^k \sim \mathcal{CN}(0, \sigma_t^2)$. Note that $\alpha_t^k$ has a different realization from time step to time step. The noise power $\sigma^2$ is chosen to be equal to 1. To save some computational time, the radar scene is restricted to a uniform angular grid of $L = 22$ angle bins between $-°$ and $°$ with $G = 100$ range cells. The maximum number of identifiable targets is loosely set as $T_{max} = 10$. The decision threshold $\bar{\lambda}$ is chosen for a nominal $P_{FA} = 10^{-}$. The free parameters of the SARSA algorithm are: $\beta = 0.$, $\gamma = 0.$, $\epsilon = 0.$. The number of Monte Carlo runs is $M_C = 1000$.

Study case 1:
The four "fixed" targets have been generated according to the following ordered couples of angular position and SNR: Target 1 ($-°$, $-10$ dB); Target 2 ($14°$, $-$); Target 3 ($-°$, $-$); Target 4 ($30°$, $-$). In Fig. 1, we plot the averaged (over $M_C$ Monte Carlo runs) normalized beam pattern, defined as $D(\theta) \triangleq 10\log_{10}(B(\theta)/\max_\theta B(\theta))$, for each time step $k = 1, \ldots, K$. As Fig. 1 shows, the proposed RL-based beamforming is able to exploit the information about the radar scene provided by the GLRT to focus the transmitted power on the four angle bins containing the targets. Fig. 3a shows the absolute value of the difference between the $Q$-functions estimated by the SARSA algorithm at two consecutive time steps, i.e. $\xi_k \triangleq |Q_k - Q_{k-1}|$. As we can see, $\xi_k \to 0$ as $k \to \infty$, which is a heuristic indication that the SARSA algorithm converges. Finally, in Table I, we compare the Probability of Detection of the four targets, averaged over $K$ time steps, when the proposed RL-based beamforming is used against a classical omnidirectional beamformer. As expected, better detection performance is achieved by using the proposed beamforming method.

$P_D$      T1    T2    T3    T4
RL-based   0.22  0.55  0.78  0.91
Omni       0.14  0.41  0.69  0.86

TABLE I: Detection performance comparison: study case 1.
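The gap between the two rows of Table I is consistent with the asymptotic $P_D$ expression (20): focusing the beam on a target raises the received $|\alpha|^2$ through the beam pattern and hence the non-centrality $\delta = 2|\alpha|^2/\sigma^2$, which increases the tail mass of $\chi_2^2(\delta)$ above $\bar{\lambda}$. A quick scipy check, with made-up $\delta$ values and an illustrative nominal $P_{FA}$:

```python
from scipy.stats import chi2, ncx2

# Asymptotic P_D of eq. (20): tail of a noncentral chi2(2, delta) above thr
pfa = 1e-2
thr = chi2.ppf(1 - pfa, df=2)

# Hypothetical non-centralities: a larger delta (more power on target) gives
# a larger detection probability at the same false-alarm threshold
for delta in (2.0, 8.0):
    print(f"delta = {delta}: P_D = {ncx2.sf(thr, df=2, nc=delta):.3f}")
```

The monotonicity of the tail in the non-centrality parameter is exactly what the RL loop exploits when it reshapes the beam pattern.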
Study Case 2:
In this second study case, we assess the performance of the proposed RL-based beamforming in a dynamic environment, where the number of targets and their angular positions change over the observation time. The SNR of each target is assumed to be equal to $-8$ dB.

Fig. 1: Normalized beam pattern $D(\theta)$: study case 1.

The radar scene is generated as follows:
• $k = 1 \to$ : two targets located at $-°$ and $°$,
• $k = 101 \to$ : no targets are present,
• $k = 201 \to$ : three targets at $-°$, $-°$ and $°$,
• $k = 351 \to$ : two targets at $-°$ and $°$,
• $k = 451 \to$ : four targets at $-°$, $-°$, $°$ and $°$.

As clearly shown by Fig. 2, the proposed RL-based beamforming algorithm is able to handle this dynamic scenario by reshaping the beam pattern according to the changes in the number of targets present in the radar scene and in their angular locations. Finally, Fig. 3b shows the progress of the index $\xi_k$ previously defined. It can be noted that a transition in the estimate of the $Q$-function is present every time a change in the scenario occurs. However, it eventually converges after some time steps.
Fig. 2: Normalized beam pattern $D(\theta)$: study case 2.

Fig. 3: The convergence index $\xi_k \triangleq |Q_k - Q_{k-1}|$: (a) study case 1; (b) study case 2.

VI. CONCLUSIONS
In this paper, we have shown that some basic RL tools can be successfully applied to a dynamic MIMO radar detection problem, in the presence of an environment with an unknown, time-varying number of targets at unknown angular positions. Specifically, an RL-based waveform optimization algorithm, capable of focusing the transmitted power only in the angle bins that contain possible targets, has been proposed and analysed under the assumption of a priori known disturbance statistics. Future work will explore the possibility of extending the proposed algorithm by endowing it with the capability of learning the disturbance model and, consequently, of optimizing the set of transmitted waveforms in order to mitigate its impact on the detection performance.

REFERENCES

[1] L. Busoniu, R. Babuska, and B. D. Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, March 2008.
[2] S. Haykin, "Cognitive radar: a way of the future," IEEE Signal Processing Magazine, vol. 23, no. 1, pp. 30–40, Jan 2006.
[3] S. Haykin and J. M. Fuster, "On cognitive dynamic systems: Cognitive neuroscience and engineering learning from each other," Proceedings of the IEEE, vol. 102, no. 4, pp. 608–628, April 2014.
[4] J. Li and P. Stoica, "MIMO radar with colocated antennas," IEEE Signal Processing Magazine, vol. 24, no. 5, pp. 106–114, Sept 2007.
[5] D. R. Fuhrmann and G. S. Antonio, "Transmit beamforming for MIMO radar systems using signal cross-correlation," IEEE Transactions on Aerospace and Electronic Systems, vol. 44, no. 1, pp. 171–186, January 2008.
[6] B. Friedlander, "On transmit beamforming for MIMO radar," IEEE Transactions on Aerospace and Electronic Systems, vol. 48, no. 4, pp. 3376–3388, October 2012.
[7] L. Xu, J. Li, and P. Stoica, "Radar imaging via adaptive MIMO techniques," in , Sept 2006, pp. 1–5.
[8] S. S. Wilks, "The large-sample distribution of the likelihood ratio for testing composite hypotheses," Ann. Math. Statist., vol. 9, no. 1, pp. 60–62, 03 1938.
[9] I. Bekkerman and J. Tabrikian, "Target detection and localization using MIMO radars and sonars," IEEE Transactions on Signal Processing, vol. 54, no. 10, pp. 3873–3883, Oct 2006.
[10] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. The MIT Press, Cambridge, Massachusetts, London, England, 2012.
[11] J. Li, P. Stoica, L. Xu, and W. Roberts, "On parameter identifiability of MIMO radar," IEEE Signal Processing Letters, vol. 14, no. 12, pp. 968–971, Dec 2007.
[12] A. B. Gershman, N. D. Sidiropoulos, S. Shahbazpanahi, M. Bengtsson, and B. Ottersten, "Convex optimization-based beamforming," IEEE Signal Processing Magazine, vol. 27, no. 3, pp. 62–75, May 2010.
[13] Z. Q. Luo, W. K. Ma, A. M. C. So, Y. Ye, and S. Zhang, "Semidefinite relaxation of quadratic optimization problems," IEEE Signal Processing Magazine, vol. 27, no. 3, pp. 20–34, May 2010.
[14] L. Wang, Y. Zhang, Q. Liao, J. Tang, and J. Pan, "Robust waveform design for multi-target detection in cognitive MIMO radar," in