Joint Design of Radar Waveform and Detector via End-to-end Learning with Waveform Constraints
Wei Jiang, Student Member, IEEE, Alexander M. Haimovich, Fellow, IEEE, and Osvaldo Simeone, Fellow, IEEE
Abstract
The problem of data-driven joint design of transmitted waveform and detector in a radar system is addressed in this paper. We propose two novel learning-based approaches to waveform and detector design based on end-to-end training of the radar system. The first approach consists of alternating supervised training of the detector for a fixed waveform and reinforcement learning of the transmitter for a fixed detector. In the second approach, the transmitter and detector are trained simultaneously. Various operational waveform constraints, such as peak-to-average-power ratio (PAR) and spectral compatibility, are incorporated into the design. Unlike traditional radar design methods that rely on rigid mathematical models with limited applicability, it is shown that radar learning can be robustified by training the detector with synthetic data generated from multiple statistical models of the environment. Theoretical considerations and results show that the proposed methods are capable of adapting the transmitted waveform to environmental conditions while satisfying design constraints.
Index Terms
Waveform design, radar detector design, waveform constraints, reinforcement learning, supervised learning.
W. Jiang and A. M. Haimovich are with the Center for Wireless Information Processing (CWiP), Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA (e-mail: [email protected]; [email protected]). Osvaldo Simeone is with the King's Communications, Information Processing & Learning (KCLIP) Lab, Department of Engineering, King's College London, London WC2R 2LS, UK (e-mail: [email protected]). His work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Programme (Grant Agreement No. 725731).

I. INTRODUCTION
A. Context and Motivation
Design of radar waveforms and detectors has been a topic of great interest to the radar community (see, e.g., [1]-[4]). For best performance, radar waveforms and detectors should be designed jointly [5], [6]. Traditional joint design of waveforms and detectors typically relies on mathematical models of the environment, including targets, clutter, and noise. In contrast, this paper proposes data-driven approaches based on end-to-end learning of radar systems, in which reliance on rigid mathematical models of targets, clutter and noise is relaxed.

Optimal detection in the Neyman-Pearson (NP) sense guarantees the highest probability of detection for a specified probability of false alarm [1]. The NP detection test relies on the likelihood (or log-likelihood) ratio, which is the ratio of the probability density functions (PDFs) of the received signal conditioned on the presence or absence of a target. The mathematical tractability of models of the radar environment plays an important role in determining the ease of implementation of an optimal detector. For some target, clutter and noise models, the structure of optimal detectors is well known [7]-[9]. For example, closed-form expressions of the NP test metric are available when the applicable models are Gaussian [9], and, in some cases, even for non-Gaussian models [10]. However, in most cases involving non-Gaussian models, the structure of optimal detectors generally involves intractable numerical integrations, making the implementation of such detectors computationally intensive [11], [12]. For instance, it is shown in [11] that the NP detector requires a numerical integration with respect to the texture variable of the K-distributed clutter, thus precluding a closed-form solution. Furthermore, detectors designed based on a specific mathematical model of the environment suffer performance degradation when the actual environment differs from the assumed model [13], [14].
Attempts to robustify performance by designing optimal detectors based on mixtures of random variables quickly run aground due to mathematical intractability.

Alongside optimal detectors, optimal radar waveforms may also be designed based on the NP criterion. Solutions are known for some simple target, clutter and noise models (see, e.g., [2], [4]). However, in most cases, waveform design based on direct application of the NP criterion is intractable, leading to various suboptimal approaches. For example, mutual information, J-divergence and Bhattacharyya distance have been studied as objective functions for waveform design in multistatic settings [15]-[18].

In addition to target, clutter and noise models, waveform design may have to account for various operational constraints. For example, transmitter efficiency may be improved by constraining the peak-to-average-power ratio (PAR) [19]-[22]. A different constraint relates to the requirement of coexistence of radar and communication systems in overlapping spectral regions. The National Telecommunications and Information Administration (NTIA) and the Federal Communications Commission (FCC) have allowed sharing of some of the radar frequency bands with commercial communication systems [23]. In order to protect the communication systems from radar interference, radar waveforms should be designed subject to specified compatibility constraints. The design of radar waveforms constrained to share the spectrum with communication systems has recently developed into an active area of research with a growing body of literature [24]-[28].

Machine learning has been successfully applied to solve problems for which mathematical models are unavailable or too complex to yield optimal solutions, in domains such as computer vision [29], [30] and natural language processing [31], [32].
Recently, a machine learning approach has been proposed for implementing the physical layer of communication systems. Notably, in [33], it is proposed to jointly design the transmitter and receiver of communication systems via end-to-end learning. Reference [35] proposes an end-to-end learning-based approach for jointly minimizing PAR and bit error rate in orthogonal frequency division multiplexing systems. This approach requires the availability of a known channel model. For the case of an unknown channel model, reference [36] proposes an alternating training approach, whereby the transmitter is trained via reinforcement learning (RL) on the basis of noiseless feedback from the receiver, while the receiver is trained by supervised learning. In [37], the authors apply simultaneous perturbation stochastic optimization to approximate the gradient of a transmitter's loss function. A detailed review of the state of the art can be found in [38] (see also [39]-[41] for recent work).

In the radar field, learning machines trained in a supervised manner based on a suitable loss function have been shown to approximate the performance of the NP detector [42], [43]. As a representative example, in [43], a neural network trained in a supervised manner using data that includes Gaussian interference has been shown to approximate the performance of the NP detector. Note that the design of the NP detector requires explicit knowledge of the Gaussian nature of the interference, whereas the neural network is trained with data that happens to be Gaussian, but the machine has no prior knowledge of the statistical nature of the data.
B. Main Contributions
In this work, we introduce two learning-based approaches for the joint design of the waveform and detector in a radar system. Inspired by [36], in the first approach, end-to-end learning of the radar system is implemented by alternating supervised learning of the detector for a fixed waveform and RL-based learning of the transmitter for a fixed detector. In the second approach, the detector and waveform are learned simultaneously, potentially speeding up training, in terms of the radar transmissions required to yield the training samples, compared to alternating training. In addition, we extend the problem formulation to include training of waveforms with PAR or spectral compatibility constraints.

The main contributions of this paper are summarized as follows:
1) We formulate a radar system architecture based on the training of the detector and the transmitted waveform, both implemented as feedforward multi-layer neural networks.
2) We develop two end-to-end learning algorithms for detection and waveform generation. In the first learning algorithm, the detector and the transmitted waveform are trained alternately: for a fixed waveform, the detector is trained using supervised learning so as to approximate the NP detector; and for a fixed detector, the transmitted waveform is trained via policy gradient-based RL. In the second algorithm, the detector and transmitter are trained simultaneously.
3) We extend the learning algorithms to incorporate waveform constraints, specifically PAR and spectral compatibility constraints.
4) We provide theoretical results that relate alternating and simultaneous training by computing the gradients of the loss functions optimized by both methods.
5) We provide theoretical results that justify the use of RL-based transmitter training by comparing the gradient used by this procedure with the gradient of the ideal model-based likelihood function.

This work extends previous results presented in the conference version [44].
In particular, reference [44] proposes a learning algorithm whereby supervised training of the radar detector is alternated with RL-based training of unconstrained transmitted waveforms. As compared to the conference version [44], this paper also studies simultaneous training; it develops methods for learning radar waveforms under various operational waveform constraints; and it provides theoretical results regarding the relationship between alternating and simultaneous training, as well as regarding the adoption of RL-based training of the transmitter.

The rest of this paper is organized as follows. A detailed description of the end-to-end radar system is presented in Section II. Section III proposes two iterative algorithms for jointly training the transmitter and receiver. Section IV provides theoretical properties of the gradients. Numerical results are reported in Section V. Finally, conclusions are drawn in Section VI.

Throughout the paper, bold lowercase and uppercase letters represent vectors and matrices, respectively. The conjugate, transpose, and conjugate transpose operators are denoted by $(\cdot)^*$, $(\cdot)^T$, and $(\cdot)^H$, respectively. The notations $\mathbb{C}^K$ and $\mathbb{R}^K$ represent the sets of $K$-dimensional vectors of complex and real numbers, respectively. The notation $|\cdot|$ indicates the modulus, $\|\cdot\|$ indicates the Euclidean norm, and $\mathbb{E}_{x\sim p_x}\{\cdot\}$ indicates the expectation of the argument with respect to the distribution of the random variable $x \sim p_x$. $\Re(\cdot)$ and $\Im(\cdot)$ stand for the real and imaginary parts of a complex-valued argument, respectively. The letter $j$ represents the imaginary unit, i.e., $j = \sqrt{-1}$. The gradient of a function $f: \mathbb{R}^n \to \mathbb{R}^m$ with respect to $\mathbf{x} \in \mathbb{R}^n$ is $\nabla_{\mathbf{x}} f(\mathbf{x}) \in \mathbb{R}^{n \times m}$.

II. PROBLEM FORMULATION
Consider a pulse-compression radar system that uses the baseband transmit signal
$$x(t) = \sum_{k=1}^{K} y_k \zeta\big(t - [k-1]T_c\big), \quad (1)$$
where $\zeta(t)$ is a fixed basic chip pulse, $T_c$ is the chip duration, and $\{y_k\}_{k=1}^{K}$ are complex deterministic coefficients. The vector $\mathbf{y} \triangleq [y_1, \ldots, y_K]^T$ is referred to as the fast-time waveform of the radar system, and is subject to design.

The backscattered baseband signal from a stationary point-like target is given by
$$z(t) = \alpha x(t - \tau) + c(t) + n(t), \quad (2)$$
where $\alpha$ is the target complex-valued gain, accounting for target backscattering and channel propagation effects; $\tau$ represents the target delay, which is assumed to satisfy the target detectability condition $\tau \gg K T_c$; $c(t)$ is the clutter component; and $n(t)$ denotes signal-independent noise comprising an aggregate of thermal noise, interference, and jamming. The clutter component $c(t)$ associated with a detection test performed at $\tau = 0$ may be expressed as
$$c(t) = \sum_{g=-K+1}^{K-1} \gamma_g\, x\big(t - g T_c\big), \quad (3)$$
where $\gamma_g$ is the complex clutter scattering coefficient at time delay $\tau = 0$ associated with the $g$th range cell relative to the cell under test. Following chip matched filtering with $\zeta^*(-t)$, and sampling at $T_c$-spaced time instants $t = \tau + [k-1]T_c$ for $k \in \{1, \ldots, K\}$, the $K \times 1$ discrete-time received signal $\mathbf{z} = [z(\tau), z(\tau + T_c), \ldots, z(\tau + [K-1]T_c)]^T$ for the range cell under test, containing a point target with complex amplitude $\alpha$, clutter and noise, can be written as
$$\mathbf{z} = \alpha \mathbf{y} + \mathbf{c} + \mathbf{n}, \quad (4)$$
where $\mathbf{c}$ and $\mathbf{n}$ denote, respectively, the clutter and noise vectors.

Detection of the presence of a target in the range cell under test is formulated as the following binary hypothesis testing problem:
$$\mathcal{H}_0: \mathbf{z} = \mathbf{c} + \mathbf{n}, \qquad \mathcal{H}_1: \mathbf{z} = \alpha\mathbf{y} + \mathbf{c} + \mathbf{n}. \quad (5)$$

In traditional radar design, the gold standard for detection is provided by the NP criterion of maximizing the probability of detection for a given probability of false alarm. Application of the NP criterion leads to the likelihood ratio test
$$\Lambda(\mathbf{z}) = \frac{p(\mathbf{z}\,|\,\mathbf{y}, \mathcal{H}_1)}{p(\mathbf{z}\,|\,\mathbf{y}, \mathcal{H}_0)} \;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\; T_\Lambda, \quad (6)$$
where $\Lambda(\mathbf{z})$ is the likelihood ratio, and $T_\Lambda$ is the detection threshold set based on the probability of false alarm constraint [5]. The NP criterion is also the gold standard for designing a radar waveform that adapts to the given environment, although, as discussed earlier, a direct application of this design principle is often intractable.

The design of optimal detectors and/or waveforms under the NP criterion relies on channel models of the radar environment, namely, knowledge of the conditional probabilities $p(\mathbf{z}\,|\,\mathbf{y}, \mathcal{H}_i)$ for $i \in \{0, 1\}$. The channel model $p(\mathbf{z}\,|\,\mathbf{y}, \mathcal{H}_i)$ is the likelihood of the observation $\mathbf{z}$ conditioned on the transmitted waveform $\mathbf{y}$ and hypothesis $\mathcal{H}_i$. In the following, we introduce an end-to-end radar system in which the detector and waveform are jointly learned in a data-driven fashion.
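To make the hypotheses in (5) concrete, the sketch below simulates the discrete-time model (4) under placeholder Gaussian clutter and noise distributions. The paper deliberately leaves the environment model $p(\mathbf{z}\,|\,\mathbf{y}, \mathcal{H}_i)$ unspecified, so the distributions, dimensions, and scale parameters used here are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16  # number of chips in the fast-time waveform

def received_signal(y, target_present, clutter_std=0.1, noise_std=0.5):
    """Draw z from a toy environment: z = alpha*y + c + n under H1,
    z = c + n under H0, following (4)-(5). Gaussian c and n are assumptions."""
    K = len(y)
    c = rng.normal(scale=clutter_std, size=K) + 1j * rng.normal(scale=clutter_std, size=K)
    n = rng.normal(scale=noise_std, size=K) + 1j * rng.normal(scale=noise_std, size=K)
    if target_present:
        alpha = rng.normal() + 1j * rng.normal()  # complex target gain
        return alpha * y + c + n
    return c + n

# unit-energy constant-modulus waveform standing in for the designed waveform y
y = np.exp(1j * 2 * np.pi * rng.random(K)) / np.sqrt(K)
z0 = received_signal(y, target_present=False)  # a sample from H0
z1 = received_signal(y, target_present=True)   # a sample from H1
```

Such a simulator is exactly the kind of synthetic data generator that can feed the data-driven training described next.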
Fig. 1. An end-to-end radar system operating over an unknown radar operating environment. Transmitter and receiver are implemented as two separate parametric functions $f_{\theta_T}(\cdot)$ and $f_{\theta_R}(\cdot)$ with trainable parameter vectors $\theta_T$ and $\theta_R$, respectively.

A. End-to-End Radar System
The end-to-end radar system illustrated in Fig. 1 comprises a transmitter and a receiver that seek to detect the presence of a target. Transmitter and receiver are implemented as two separate parametric functions $f_{\theta_T}(\cdot)$ and $f_{\theta_R}(\cdot)$ with trainable parameter vectors $\theta_T$ and $\theta_R$, respectively. As shown in Fig. 1, the input to the transmitter is a user-defined initialization waveform $\mathbf{s} \in \mathbb{C}^K$. The transmitter outputs a radar waveform obtained through a trainable mapping $\mathbf{y}_{\theta_T} = f_{\theta_T}(\mathbf{s}) \in \mathbb{C}^K$. The environment is modeled as a stochastic system that produces the vector $\mathbf{z} \in \mathbb{C}^K$ from a conditional PDF $p(\mathbf{z}\,|\,\mathbf{y}_{\theta_T}, \mathcal{H}_i)$ parameterized by a binary variable $i \in \{0, 1\}$. The absence or presence of a target is indicated by the values $i = 0$ and $i = 1$, respectively, and hence $i$ is referred to as the target state indicator. The receiver passes the received vector $\mathbf{z}$ through a trainable mapping $p = f_{\theta_R}(\mathbf{z})$, which produces the scalar $p \in (0, 1)$. The final decision $\hat{i} \in \{0, 1\}$ is made by comparing the output of the receiver $p$ to a hard threshold in the interval $(0, 1)$.

B. Transmitter and Receiver Architectures
As discussed in Section II-A, the transmitter and the receiver are implemented as two separate parametric functions $f_{\theta_T}(\cdot)$ and $f_{\theta_R}(\cdot)$. We now detail an implementation of the transmitter $f_{\theta_T}(\cdot)$ and receiver $f_{\theta_R}(\cdot)$ based on feedforward neural networks.

A feedforward neural network is a parametric function $\tilde{f}_\theta(\cdot)$ that maps an input real-valued vector $\mathbf{u}_{\mathrm{in}} \in \mathbb{R}^{N_{\mathrm{in}}}$ to an output real-valued vector $\mathbf{u}_{\mathrm{out}} \in \mathbb{R}^{N_{\mathrm{out}}}$ via $L$ successive layers, where $N_{\mathrm{in}}$ and $N_{\mathrm{out}}$ represent, respectively, the numbers of neurons of the input and output layers. Noting that the input to the $l$th layer is the output of the $(l-1)$th layer, the output of the $l$th layer is given by
$$\mathbf{u}_l = \tilde{f}_{\theta^{[l]}}(\mathbf{u}_{l-1}) = \phi\big(\mathbf{W}^{[l]}\mathbf{u}_{l-1} + \mathbf{b}^{[l]}\big), \quad \text{for } l = 1, \ldots, L, \quad (7)$$
where $\phi(\cdot)$ is an element-wise activation function, and $\theta^{[l]} = \{\mathbf{W}^{[l]}, \mathbf{b}^{[l]}\}$ contains the trainable parameters of the $l$th layer, comprising the weight $\mathbf{W}^{[l]}$ and bias $\mathbf{b}^{[l]}$. The vector of trainable parameters of the entire neural network comprises the parameters of all layers, i.e., $\theta = \mathrm{vec}\{\theta^{[1]}, \cdots, \theta^{[L]}\}$.

The architecture of the end-to-end radar system with transmitter and receiver implemented based on feedforward neural networks is shown in Fig. 2. The transmitter applies a complex initialization waveform $\mathbf{s}$ to the function $f_{\theta_T}(\cdot)$. The complex-valued input $\mathbf{s}$ is processed by a complex-to-real conversion layer, followed by a real-valued neural network $\tilde{f}_{\theta_T}(\cdot)$. The output of the neural network is converted back to complex values, and an output layer normalizes the transmitted power. As a result, the transmitter generates the radar waveform $\mathbf{y}_{\theta_T}$.

The receiver applies the received signal $\mathbf{z}$ to the function $f_{\theta_R}(\cdot)$. Similar to the transmitter, a first layer converts complex-valued to real-valued vectors. The neural network at the receiver is denoted $\tilde{f}_{\theta_R}(\cdot)$.
The task of the receiver is to generate a scalar $p \in (0, 1)$ that approximates the posterior probability of the presence of a target conditioned on the received vector $\mathbf{z}$. To this end, the last layer of the neural network $\tilde{f}_{\theta_R}(\cdot)$ is selected as a logistic regression layer operating on a linear combination of outputs from the previous layer. The presence or absence of the target is determined based on the output of the receiver and a threshold set according to a false alarm constraint.

III. TRAINING OF END-TO-END RADAR SYSTEMS
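As a minimal illustration of the receiver mapping, the sketch below composes a complex-to-real conversion layer, one hidden layer of the form (7), and a final logistic-regression layer producing $p \in (0, 1)$. The layer sizes and the choice of a ReLU activation are assumptions of this sketch, not specified by the paper.

```python
import numpy as np

def c2r(z):
    """Complex-to-real conversion layer: stack real and imaginary parts."""
    return np.concatenate([z.real, z.imag])

def init_layer(rng, n_in, n_out):
    """One trainable layer theta^[l] = {W, b} as in (7)."""
    return {"W": rng.normal(scale=0.1, size=(n_out, n_in)), "b": np.zeros(n_out)}

def receiver(theta_R, z):
    """f_{theta_R}: received vector z -> p in (0,1). Hidden layers follow (7)
    with phi = ReLU; the last layer is a logistic-regression unit."""
    u = c2r(z)
    for layer in theta_R[:-1]:
        u = np.maximum(layer["W"] @ u + layer["b"], 0.0)  # ReLU activation
    last = theta_R[-1]
    v = last["W"] @ u + last["b"]
    return 1.0 / (1.0 + np.exp(-v[0]))  # sigmoid output, posterior estimate

rng = np.random.default_rng(0)
K = 16
theta_R = [init_layer(rng, 2 * K, 64), init_layer(rng, 64, 1)]  # sizes assumed
z = rng.normal(size=K) + 1j * rng.normal(size=K)
p = receiver(theta_R, z)
```

Thresholding `p` against a level chosen for the desired false alarm rate then yields the hard decision.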
This section discusses the joint optimization of the trainable parameter vectors $\theta_T$ and $\theta_R$ to meet application-specific performance requirements. Two training algorithms are proposed to train the end-to-end radar system. The first algorithm alternates between training of the receiver and of the transmitter. This algorithm is referred to as alternating training, and is inspired by the approach used in [36] to train the encoder and decoder of a digital communication system. In contrast, the second algorithm trains the receiver and transmitter simultaneously. This approach is referred to as simultaneous training. Note that the two proposed training algorithms are applicable to other differentiable parametric functions implementing the transmitter $f_{\theta_T}(\cdot)$ and the receiver $f_{\theta_R}(\cdot)$, such as recurrent neural networks and their variants [45]. In the following, we first discuss alternating training and then we detail simultaneous training.

Fig. 2. Transmitter and receiver architectures based on feedforward neural networks.

A. Alternating Training: Receiver Design
Alternating training consists of iterations encompassing separate receiver and transmitter updates. In this subsection, we focus on the receiver updates. A receiver training update optimizes the receiver parameter vector $\theta_R$ for a fixed transmitter waveform $\mathbf{y}_{\theta_T}$. Receiver design is supervised in the sense that we assume the target state indicator $i$ to be available to the receiver during training. Supervised training of the receiver for a fixed transmitter parameter vector $\theta_T$ is illustrated in Fig. 3.
Fig. 3. Supervised training of the receiver for a fixed transmitted waveform.
The standard cross-entropy loss [43] is adopted as the loss function for the receiver. For a given transmitted waveform $\mathbf{y}_{\theta_T} = f_{\theta_T}(\mathbf{s})$, the receiver average loss function is accordingly given by
$$L_R(\theta_R) = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z}|\mathbf{y}_{\theta_T},\mathcal{H}_i)}\big\{\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\big\}, \quad (8)$$
where $P(\mathcal{H}_i)$ is the prior probability of the target state indicator $i$, and $\ell\big(f_{\theta_R}(\mathbf{z}), i\big)$ is the instantaneous cross-entropy loss for a pair $\big(f_{\theta_R}(\mathbf{z}), i\big)$, namely,
$$\ell\big(f_{\theta_R}(\mathbf{z}), i\big) = -i\ln f_{\theta_R}(\mathbf{z}) - (1 - i)\ln\big[1 - f_{\theta_R}(\mathbf{z})\big]. \quad (9)$$
For a fixed transmitted waveform, the receiver parameter vector $\theta_R$ should ideally be optimized by minimizing (8), e.g., via gradient descent or one of its variants [46]. The gradient of the average loss (8) with respect to the receiver parameter vector $\theta_R$ is
$$\nabla_{\theta_R} L_R(\theta_R) = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z}|\mathbf{y}_{\theta_T},\mathcal{H}_i)}\big\{\nabla_{\theta_R}\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\big\}. \quad (10)$$
This being a data-driven approach, rather than assuming known prior probability $P(\mathcal{H}_i)$ and likelihood $p(\mathbf{z}|\mathbf{y}_{\theta_T},\mathcal{H}_i)$, the receiver is assumed to have access to $Q_R$ independent and identically distributed (i.i.d.) samples $\mathcal{D}_R = \big\{\mathbf{z}^{(q)} \sim p(\mathbf{z}|\mathbf{y}_{\theta_T},\mathcal{H}_{i^{(q)}}),\; i^{(q)} \in \{0,1\}\big\}_{q=1}^{Q_R}$.

Given the output of the receiver function $f_{\theta_R}(\mathbf{z}^{(q)})$ for a received sample vector $\mathbf{z}^{(q)}$ and the indicator $i^{(q)} \in \{0,1\}$, the instantaneous cross-entropy loss is computed from (9), and the estimated receiver gradient is given by
$$\nabla_{\theta_R}\widehat{L}_R(\theta_R) = \frac{1}{Q_R}\sum_{q=1}^{Q_R} \nabla_{\theta_R}\ell\big(f_{\theta_R}(\mathbf{z}^{(q)}), i^{(q)}\big). \quad (11)$$
Using (11), the receiver parameter vector $\theta_R$ is adjusted according to the stochastic gradient descent updates
$$\theta_R^{(n+1)} = \theta_R^{(n)} - \epsilon\,\nabla_{\theta_R}\widehat{L}_R(\theta_R^{(n)}) \quad (12)$$
across iterations $n = 1, 2, \cdots$, where $\epsilon > 0$ is the learning rate.

B. Alternating Training: Transmitter Design
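For intuition, the following sketch instantiates the supervised updates (9), (11) and (12) for the simplest possible receiver, a single logistic-regression layer, for which the per-sample gradient of the cross-entropy loss has the closed form $(p - i)\,[\mathbf{u}; 1]$. The toy real-valued features standing in for received-signal samples are an assumption of this sketch.

```python
import numpy as np

def xent(p, i):
    """Instantaneous cross-entropy loss (9)."""
    return -i * np.log(p) - (1 - i) * np.log(1 - p)

def sgd_step(w, b, batch, lr=0.1):
    """One update (12) using the empirical gradient (11); for a logistic
    layer the per-sample gradient of (9) is (p - i) * [u; 1]."""
    gw, gb = np.zeros_like(w), 0.0
    for u, i in batch:  # batch plays the role of the sample set D_R
        p = 1.0 / (1.0 + np.exp(-(w @ u + b)))
        gw += (p - i) * u
        gb += (p - i)
    Q = len(batch)
    return w - lr * gw / Q, b - lr * gb / Q

rng = np.random.default_rng(1)
w, b = np.zeros(4), 0.0
# toy separable features standing in for received-signal samples under H0/H1
batch = [(rng.normal(size=4) + 2 * i, i) for i in (0, 1) for _ in range(50)]
for _ in range(300):
    w, b = sgd_step(w, b, batch)
```

In the paper's architecture the same gradient flows through all layers of $\tilde{f}_{\theta_R}(\cdot)$ via backpropagation; the closed form above covers only the final logistic layer.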
In the transmitter training phase of alternating training, the receiver parameter vector $\theta_R$ is held constant, and the function $f_{\theta_T}(\cdot)$ implementing the transmitter is optimized. The goal of transmitter training is to find an optimized parameter vector $\theta_T$ that minimizes the cross-entropy loss function (8) seen as a function of $\theta_T$.

As illustrated in Fig. 4, a stochastic transmitter outputs a waveform $\mathbf{a}$ drawn from a distribution $\pi(\mathbf{a}|\mathbf{y}_{\theta_T})$ conditioned on $\mathbf{y}_{\theta_T} = f_{\theta_T}(\mathbf{s})$. The introduction of the randomization $\pi(\mathbf{a}|\mathbf{y}_{\theta_T})$ of the designed waveform $\mathbf{y}_{\theta_T}$ is useful to enable exploration of the design space in a manner akin to standard RL policies. To train the transmitter, we aim to minimize the average cross-entropy loss
$$L_T^\pi(\theta_T) = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \mathbb{E}_{\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\, \mathbf{z}\sim p(\mathbf{z}|\mathbf{a},\mathcal{H}_i)}\big\{\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\big\}. \quad (13)$$
Note that this is consistent with (8), with the caveat that an expectation is taken over the policy $\pi(\mathbf{a}|\mathbf{y}_{\theta_T})$. This is indicated by the superscript "$\pi$".
Fig. 4. RL-based transmitter training for a fixed receiver design.
Assume that the policy $\pi(\mathbf{a}|\mathbf{y}_{\theta_T})$ is differentiable with respect to the transmitter parameter vector $\theta_T$, i.e., that the gradient $\nabla_{\theta_T}\pi(\mathbf{a}|\mathbf{y}_{\theta_T})$ exists. The policy gradient theorem [47] states that the gradient of the average loss (13) can be written as
$$\nabla_{\theta_T} L_T^\pi(\theta_T) = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \mathbb{E}_{\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\, \mathbf{z}\sim p(\mathbf{z}|\mathbf{a},\mathcal{H}_i)}\big\{\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\,\nabla_{\theta_T}\ln\pi(\mathbf{a}|\mathbf{y}_{\theta_T})\big\}. \quad (14)$$
The gradient (14) has the important advantage that it may be estimated via $Q_T$ i.i.d. samples $\mathcal{D}_T = \big\{\mathbf{a}^{(q)} \sim \pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\; \mathbf{z}^{(q)} \sim p(\mathbf{z}|\mathbf{a}^{(q)},\mathcal{H}_{i^{(q)}}),\; i^{(q)} \in \{0,1\}\big\}_{q=1}^{Q_T}$, yielding the estimate
$$\nabla_{\theta_T}\widehat{L}_T^\pi(\theta_T) = \frac{1}{Q_T}\sum_{q=1}^{Q_T} \ell\big(f_{\theta_R}(\mathbf{z}^{(q)}), i^{(q)}\big)\,\nabla_{\theta_T}\ln\pi(\mathbf{a}^{(q)}|\mathbf{y}_{\theta_T}). \quad (15)$$
With the estimate (15), in a manner similar to (12), the transmitter parameter vector $\theta_T$ may be optimized iteratively according to the stochastic gradient descent update rule
$$\theta_T^{(n+1)} = \theta_T^{(n)} - \epsilon\,\nabla_{\theta_T}\widehat{L}_T^\pi(\theta_T^{(n)}) \quad (16)$$
over iterations $n = 1, 2, \cdots$. The alternating training algorithm is summarized as Algorithm 1. The training process is carried out until a stopping criterion is satisfied. For example, a prescribed number of iterations may have been reached, or a number of iterations may have elapsed during which the training loss (13), estimated using the samples $\mathcal{D}_T$, has not decreased by more than a given amount.

Algorithm 1: Alternating Training

Input: initialization waveform $\mathbf{s}$; stochastic policy $\pi(\cdot|\mathbf{y}_{\theta_T})$; learning rate $\epsilon$
Output: learned parameter vectors $\theta_R$ and $\theta_T$

initialize $\theta_R^{(0)}$ and $\theta_T^{(0)}$, and set $n = 0$
while stopping criterion not satisfied do
  /* receiver training phase */
  with the stochastic transmitter policy turned off, evaluate the receiver loss gradient $\nabla_{\theta_R}\widehat{L}_R(\theta_R^{(n)})$ from (11) with $\theta_T = \theta_T^{(n)}$
  update the receiver parameter vector via $\theta_R^{(n+1)} = \theta_R^{(n)} - \epsilon\,\nabla_{\theta_R}\widehat{L}_R(\theta_R^{(n)})$
  /* transmitter training phase */
  evaluate the transmitter loss gradient $\nabla_{\theta_T}\widehat{L}_T^\pi(\theta_T^{(n)})$ from (15) with $\theta_R = \theta_R^{(n+1)}$
  update the transmitter parameter vector via $\theta_T^{(n+1)} = \theta_T^{(n)} - \epsilon\,\nabla_{\theta_T}\widehat{L}_T^\pi(\theta_T^{(n)})$
  $n \leftarrow n + 1$
end
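The score-function estimator (15) and update (16) can be illustrated in isolation with a Gaussian exploration policy and a toy quadratic loss standing in for the receiver cross-entropy. The target vector `h`, the policy spread, and the baseline subtraction (a standard variance-reduction device not present in (15)) are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma, lr, Q = 8, 0.1, 0.05, 64
h = rng.normal(size=K)   # hypothetical waveform minimizing the toy loss
y = np.zeros(K)          # transmitter output, standing in for f_{theta_T}(s)

def loss(a):
    """Toy surrogate for the receiver loss l(f_{theta_R}(z), i)."""
    return np.sum((a - h) ** 2)

for _ in range(500):
    # Gaussian exploration policy pi(a|y) = N(y, sigma^2 I), whose score is
    # grad_y log pi(a|y) = (a - y) / sigma^2
    A = y + sigma * rng.normal(size=(Q, K))
    vals = np.array([loss(a) for a in A])
    vals = vals - vals.mean()   # baseline subtraction (variance reduction)
    g = np.mean(vals[:, None] * (A - y) / sigma**2, axis=0)  # estimator (15)
    y -= lr * g                 # update rule (16)
```

The estimator never differentiates through `loss`, mirroring how (15) avoids differentiating through the unknown environment $p(\mathbf{z}|\mathbf{a}, \mathcal{H}_i)$.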
We extend the transmitter training discussed in the previous section to incorporate waveform constraints on PAR and spectral compatibility. To this end, we introduce penalty functions that are used to modify the training criterion (13) to meet these constraints.
1) PAR Constraint:
Low-PAR waveforms are preferred in radar systems due to hardware limitations related to waveform generation. A lower PAR entails a lower dynamic range of the power amplifier, which in turn allows an increase in average transmitted power. The PAR of a radar waveform $\mathbf{y}_{\theta_T} = f_{\theta_T}(\mathbf{s})$ may be expressed as
$$J_{\mathrm{PAR}}(\theta_T) = \frac{\max_{k=1,\cdots,K} |y_{\theta_T,k}|^2}{\|\mathbf{y}_{\theta_T}\|^2/K}, \quad (17)$$
which is bounded according to $1 \le J_{\mathrm{PAR}}(\theta_T) \le K$.
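The PAR in (17) is simple to evaluate directly; the sketch below checks the two extreme cases of the bound $1 \le J_{\mathrm{PAR}} \le K$: a constant-modulus waveform attains the lower bound, while a single-spike waveform attains the upper bound. The example waveforms are arbitrary choices for the check.

```python
import numpy as np

def par(y):
    """PAR of a waveform per (17): peak chip power over average chip power."""
    K = len(y)
    return np.max(np.abs(y) ** 2) / (np.linalg.norm(y) ** 2 / K)

K = 8
rng = np.random.default_rng(0)
cm = np.exp(1j * 2 * np.pi * rng.random(K))  # constant-modulus chips: PAR -> 1
spike = np.zeros(K, dtype=complex)
spike[0] = 1.0                               # all energy in one chip: PAR = K
```
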
2) Spectral Compatibility Constraint:
A spectral constraint is imposed when a radar system is required to operate over a spectrum partially shared with other systems, such as wireless communication networks. Suppose there are $D$ frequency bands $\{\Gamma_d\}_{d=1}^{D}$ shared by the radar and the coexisting systems, where $\Gamma_d = [f_{d,l}, f_{d,u}]$, with $f_{d,l}$ and $f_{d,u}$ denoting the lower and upper normalized frequencies of the $d$th band, respectively. The amount of interfering energy generated by the radar waveform $\mathbf{y}_{\theta_T}$ in the $d$th shared band is
$$\int_{f_{d,l}}^{f_{d,u}} \Bigg|\sum_{k=0}^{K-1} y_{\theta_T,k}\, e^{-j2\pi f k}\Bigg|^2 df = \mathbf{y}_{\theta_T}^H \mathbf{\Omega}_d\, \mathbf{y}_{\theta_T}, \quad (18)$$
where
$$\big[\mathbf{\Omega}_d\big]_{v,h} = \begin{cases} f_{d,u} - f_{d,l} & \text{if } v = h, \\[4pt] \dfrac{e^{j2\pi f_{d,u}(v-h)} - e^{j2\pi f_{d,l}(v-h)}}{j2\pi(v-h)} & \text{if } v \ne h, \end{cases} \quad (19)$$
for $(v, h) \in \{1, \cdots, K\}^2$. Let $\mathbf{\Omega} = \sum_{d=1}^{D} \omega_d \mathbf{\Omega}_d$ be a weighted interference covariance matrix, where the weights $\{\omega_d\}_{d=1}^{D}$ are assigned based on practical considerations regarding the impact of interference in the $D$ bands. These include the distance between the radar transmitter and the interfered systems, and the tactical importance of the coexisting systems [48]. Given a radar waveform $\mathbf{y}_{\theta_T} = f_{\theta_T}(\mathbf{s})$, we define the spectral compatibility penalty function as
$$J_{\mathrm{spectrum}}(\theta_T) = \mathbf{y}_{\theta_T}^H \mathbf{\Omega}\, \mathbf{y}_{\theta_T}, \quad (20)$$
which is the total interfering energy from the radar waveform produced on the shared frequency bands.
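As a sanity check of (18)-(19), the sketch below implements the matrix (19) for a single band and compares the quadratic form $\mathbf{y}^H\mathbf{\Omega}_d\mathbf{y}$ against a direct numerical integration of the waveform spectrum over that band. The band edges and the random waveform are arbitrary choices made for the check.

```python
import numpy as np

def omega_d(f_l, f_u, K):
    """Interference matrix (19) for a shared band [f_l, f_u] (normalized)."""
    idx = np.arange(K)
    d = idx[:, None] - idx[None, :]  # d = v - h
    with np.errstate(divide="ignore", invalid="ignore"):
        M = (np.exp(1j * 2 * np.pi * f_u * d)
             - np.exp(1j * 2 * np.pi * f_l * d)) / (1j * 2 * np.pi * d)
    M[d == 0] = f_u - f_l            # diagonal entries, v = h
    return M

K, f_l, f_u = 16, 0.2, 0.3
rng = np.random.default_rng(0)
y = (rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2 * K)

Om = omega_d(f_l, f_u, K)
J = np.real(y.conj() @ Om @ y)       # in-band energy via the right side of (18)

# direct trapezoidal integration of |sum_k y_k exp(-j 2 pi f k)|^2 over the band
f = np.linspace(f_l, f_u, 2001)
S = np.abs(np.exp(-1j * 2 * np.pi * np.outer(f, np.arange(K))) @ y) ** 2
J_num = np.sum((S[:-1] + S[1:]) / 2 * np.diff(f))
```

The matrix is Hermitian, so the quadratic form (20) is real-valued as a penalty should be.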
3) Constrained Transmitter Design:
For a fixed receiver parameter vector $\theta_R$, the average loss (13) is modified by introducing a penalty function $J \in \{J_{\mathrm{PAR}}, J_{\mathrm{spectrum}}\}$. Accordingly, we formulate the transmitter loss function, encompassing (13), (17) and (20), as
$$L_{T,c}^\pi(\theta_T) = L_T^\pi(\theta_T) + \lambda J(\theta_T) = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \mathbb{E}_{\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\, \mathbf{z}\sim p(\mathbf{z}|\mathbf{a},\mathcal{H}_i)}\big\{\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\big\} + \lambda J(\theta_T), \quad (21)$$
where $\lambda$ controls the weight of the penalty $J(\theta_T)$, and is referred to as the penalty parameter. When the penalty parameter $\lambda$ is small, the transmitter is trained to improve its ability to adapt to the environment, while placing less emphasis on reducing the PAR level or the interference energy from the radar waveform; and vice versa for large values of $\lambda$. Note that the waveform penalty function $J(\theta_T)$ depends only on the transmitter trainable parameters $\theta_T$. Thus, imposing the waveform constraint does not affect the receiver training.

It is straightforward to write the estimated version of the gradient of (21) with respect to $\theta_T$ by introducing the penalty as
$$\nabla_{\theta_T}\widehat{L}_{T,c}^\pi(\theta_T) = \nabla_{\theta_T}\widehat{L}_T^\pi(\theta_T) + \lambda\nabla_{\theta_T}J(\theta_T), \quad (22)$$
where the gradient of the penalty function $\nabla_{\theta_T}J(\theta_T)$ is provided in Appendix A. Substituting (15) into (22), we finally have the estimated gradient
$$\nabla_{\theta_T}\widehat{L}_{T,c}^\pi(\theta_T) = \frac{1}{Q_T}\sum_{q=1}^{Q_T} \ell\big(f_{\theta_R}(\mathbf{z}^{(q)}), i^{(q)}\big)\,\nabla_{\theta_T}\ln\pi(\mathbf{a}^{(q)}|\mathbf{y}_{\theta_T}) + \lambda\nabla_{\theta_T}J(\theta_T), \quad (23)$$
which is used in the stochastic gradient update rule
$$\theta_T^{(n+1)} = \theta_T^{(n)} - \epsilon\,\nabla_{\theta_T}\widehat{L}_{T,c}^\pi(\theta_T^{(n)}) \quad \text{for } n = 1, 2, \cdots. \quad (24)$$

D. Simultaneous Training
This subsection discusses simultaneous training, in which the receiver and transmitter are updated simultaneously, as illustrated in Fig. 5. To this end, the objective function is the average loss
$$L^\pi(\theta_R, \theta_T) = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \mathbb{E}_{\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\, \mathbf{z}\sim p(\mathbf{z}|\mathbf{a},\mathcal{H}_i)}\big\{\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\big\}. \quad (25)$$
This function is minimized over both parameter vectors $\theta_R$ and $\theta_T$ via stochastic gradient descent.
Fig. 5. Simultaneous training of the end-to-end radar system. The receiver is trained by supervised learning, while the transmitter is trained by RL.
The gradient of (25) with respect to $\theta_R$ is
$$\nabla_{\theta_R} L^\pi(\theta_R, \theta_T) = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \mathbb{E}_{\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\, \mathbf{z}\sim p(\mathbf{z}|\mathbf{a},\mathcal{H}_i)}\big\{\nabla_{\theta_R}\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\big\}, \quad (26)$$
and the gradient of (25) with respect to $\theta_T$ is
$$\nabla_{\theta_T} L^\pi(\theta_R, \theta_T) = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \nabla_{\theta_T}\mathbb{E}_{\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\, \mathbf{z}\sim p(\mathbf{z}|\mathbf{a},\mathcal{H}_i)}\big\{\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\big\} = \sum_{i\in\{0,1\}} P(\mathcal{H}_i)\, \mathbb{E}_{\mathbf{a}\sim\pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\, \mathbf{z}\sim p(\mathbf{z}|\mathbf{a},\mathcal{H}_i)}\big\{\ell\big(f_{\theta_R}(\mathbf{z}), i\big)\,\nabla_{\theta_T}\ln\pi(\mathbf{a}|\mathbf{y}_{\theta_T})\big\}. \quad (27)$$
To estimate the gradients (26) and (27), we assume access to $Q$ i.i.d. samples $\mathcal{D} = \big\{\mathbf{a}^{(q)} \sim \pi(\mathbf{a}|\mathbf{y}_{\theta_T}),\; \mathbf{z}^{(q)} \sim p(\mathbf{z}|\mathbf{a}^{(q)},\mathcal{H}_{i^{(q)}}),\; i^{(q)} \in \{0,1\}\big\}_{q=1}^{Q}$. From (26), the estimated receiver gradient is
$$\nabla_{\theta_R}\widehat{L}^\pi(\theta_R, \theta_T) = \frac{1}{Q}\sum_{q=1}^{Q} \nabla_{\theta_R}\ell\big(f_{\theta_R}(\mathbf{z}^{(q)}), i^{(q)}\big). \quad (28)$$
Note that, in (28), the received vector $\mathbf{z}^{(q)}$ is obtained based on a waveform $\mathbf{a}^{(q)}$ sampled from the policy $\pi(\mathbf{a}|\mathbf{y}_{\theta_T})$. Thus, the estimated receiver gradient (28) is averaged over the stochastic waveforms $\mathbf{a}$. This is in contrast to alternating training, in which the receiver gradient depends directly on the transmitted waveform $\mathbf{y}_{\theta_T}$.

From (27), the estimated transmitter gradient is given by
$$\nabla_{\theta_T}\widehat{L}^\pi(\theta_R, \theta_T) = \frac{1}{Q}\sum_{q=1}^{Q} \ell\big(f_{\theta_R}(\mathbf{z}^{(q)}), i^{(q)}\big)\,\nabla_{\theta_T}\ln\pi(\mathbf{a}^{(q)}|\mathbf{y}_{\theta_T}). \quad (29)$$
Finally, denoting the parameter set $\theta = \{\theta_R, \theta_T\}$, from (28) and (29), the trainable parameter set $\theta$ is updated according to the stochastic gradient descent rule
$$\theta^{(n+1)} = \theta^{(n)} - \epsilon\,\nabla_{\theta}\widehat{L}^\pi(\theta_R^{(n)}, \theta_T^{(n)}) \quad (30)$$
across iterations $n = 1, 2, \cdots$. The simultaneous training algorithm is summarized in Algorithm 2.
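The sketch below runs the simultaneous updates (28)-(30) on a deliberately simplified model: the transmitter parameters are the waveform itself, the environment is a toy real-valued channel, and the receiver is a single logistic unit acting on a scalar matched-filter-like feature. All of these simplifications, together with a baseline subtracted from the loss in (29) for variance reduction, are assumptions of this illustration, not the paper's architecture. Note how a single shared batch drives both gradient estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma, lr, Q = 8, 0.2, 0.1, 128
hch = rng.normal(size=K); hch /= np.linalg.norm(hch)  # toy channel direction
y = 0.5 * hch                 # transmitter parameters = the waveform itself
w, b = 1.0, 0.0               # one-feature logistic receiver

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(1000):
    # one shared batch D of Q samples drives both updates, as in (28)-(29)
    i = rng.integers(0, 2, Q)                           # target state indicators
    A = y + sigma * rng.normal(size=(Q, K))             # a ~ pi(a|y)
    Z = i[:, None] * A + 0.5 * rng.normal(size=(Q, K))  # toy channel: z = i*a + n
    feat = Z @ hch                                      # scalar receiver feature
    p = sigmoid(w * feat + b)
    l = -i * np.log(p) - (1 - i) * np.log(1 - p)        # cross-entropy (9)
    # receiver gradient (28): closed form for the logistic layer
    gw, gb = np.mean((p - i) * feat), np.mean(p - i)
    # transmitter gradient (29), with a baseline subtracted from the loss
    gy = np.mean((l - l.mean())[:, None] * (A - y) / sigma**2, axis=0)
    w, b, y = w - lr * gw, b - lr * gb, y - lr * gy     # simultaneous update (30)
```

Because the received samples come from the randomized waveforms `A` rather than from `y`, the receiver update here is averaged over the policy, exactly the distinction from alternating training noted above.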
Like alternating training, simultaneous training can be directly extended to incorporate prescribed waveform constraints by adding the penalty term $\lambda J(\theta_T)$ to the average loss (25).

Algorithm 2: Simultaneous Training
Input: initialization waveform $s$; stochastic policy $\pi(\cdot|y_{\theta_T})$; learning rate $\epsilon$
Output: learned parameter vectors $\theta_R$ and $\theta_T$
initialize $\theta_R^{(0)}$ and $\theta_T^{(0)}$, and set $n = 0$
while stopping criterion not satisfied do
    evaluate the receiver gradient $\nabla_{\theta_R} \widehat{L}_\pi(\theta_R^{(n)}, \theta_T^{(n)})$ and the transmitter gradient $\nabla_{\theta_T} \widehat{L}_\pi(\theta_R^{(n)}, \theta_T^{(n)})$ from (28) and (29), respectively
    update the receiver and transmitter parameter vectors simultaneously via
        $\theta_R^{(n+1)} = \theta_R^{(n)} - \epsilon\, \nabla_{\theta_R} \widehat{L}_\pi(\theta_R^{(n)}, \theta_T^{(n)})$
        $\theta_T^{(n+1)} = \theta_T^{(n)} - \epsilon\, \nabla_{\theta_T} \widehat{L}_\pi(\theta_R^{(n)}, \theta_T^{(n)})$
    $n \leftarrow n + 1$
end

IV. THEORETICAL PROPERTIES OF THE GRADIENTS
In this section, we discuss two useful theoretical properties of the gradients used for learning the receiver and the transmitter.
A. Receiver Gradient
As discussed previously, end-to-end learning of the transmitted waveform and detector may be accomplished either by alternating or by simultaneous training. The main difference between alternating and simultaneous training concerns the update of the receiver trainable parameter vector $\theta_R$. Alternating training of $\theta_R$ relies on a fixed waveform $y_{\theta_T}$ (see Fig. 3), while simultaneous training relies on random waveforms $a$ generated in accordance with a preset policy, i.e., $a \sim \pi(a|y_{\theta_T})$, as shown in Fig. 5. The relation between the gradient applied by alternating training, $\nabla_{\theta_R} L_R(\theta_R)$, and the gradient of simultaneous training, $\nabla_{\theta_R} L_\pi(\theta_R, \theta_T)$, with respect to $\theta_R$ is stated by the following proposition.

Proposition 1.
For the loss function (8) computed based on a waveform $y_{\theta_T}$ and the loss function (13) computed based on a stochastic policy $\pi(a|y_{\theta_T})$ continuous in $a$, the following equality holds:
$$\nabla_{\theta_R} L_R(\theta_R) = \nabla_{\theta_R} L_\pi(\theta_R, \theta_T). \tag{31}$$

Proof.
See Appendix B. ∎
Proposition 1 states that the gradient of simultaneous training, $\nabla_{\theta_R} L_\pi(\theta_R, \theta_T)$, equals the gradient of alternating training, $\nabla_{\theta_R} L_R(\theta_R)$, even though simultaneous training applies a random waveform $a \sim \pi(a|y_{\theta_T})$ to train the receiver. Note that this result applies only to the ensemble means in (8) and (26), and not to the empirical estimates used by Algorithms 1 and 2. Nevertheless, Proposition 1 suggests that the training updates of the receiver are unaffected by the choice of alternating or simultaneous training. That said, given the distinct updates of the transmitter's parameters, the overall trajectory of the parameters $(\theta_R, \theta_T)$ during training may differ between the two algorithms.

B. Transmitter Gradient
As shown in the previous section, the gradients used for learning the receiver parameters $\theta_R$ by alternating training (11) or simultaneous training (28) may be estimated directly from the channel output samples $z^{(q)}$. In contrast, the gradient used for learning the transmitter parameters $\theta_T$ according to (8) cannot be estimated directly from the channel output samples. To address this problem, in Algorithms 1 and 2, the transmitter is trained by exploring the space of transmitted waveforms according to a policy $\pi(a|y_{\theta_T})$. We refer to the transmitter loss gradient obtained via the policy gradient (27) as the RL transmitter gradient. The benefit of RL-based transmitter training is that access to the likelihood function $p(z|y_{\theta_T}, H_i)$ is not needed to evaluate the RL transmitter gradient; rather, the gradient is estimated from samples. We now formalize the relation between the RL transmitter gradient (27) and the transmitter gradient for a known likelihood obtained according to (8).

As mentioned, if the likelihood $p(z|y_{\theta_T}, H_i)$ were known, and if it were differentiable with respect to the transmitter parameter vector $\theta_T$, the transmitter parameter vector $\theta_T$ could be learned by minimizing the average loss (8), which we rewrite as a function of both $\theta_R$ and $\theta_T$ as
$$L(\theta_R, \theta_T) = \sum_{i \in \{0,1\}} P(H_i)\, \mathbb{E}_{z \sim p(z|y_{\theta_T}, H_i)} \big\{ \ell\big( f_{\theta_R}(z), i \big) \big\}. \tag{32}$$
The gradient of (32) with respect to $\theta_T$ is expressed as
$$\nabla_{\theta_T} L(\theta_R, \theta_T) = \sum_{i \in \{0,1\}} P(H_i)\, \mathbb{E}_{z \sim p(z|y_{\theta_T}, H_i)} \big\{ \ell\big( f_{\theta_R}(z), i \big)\, \nabla_{\theta_T} \ln p(z|y_{\theta_T}, H_i) \big\}, \tag{33}$$
where the equality leverages the relation
$$\nabla_{\theta_T} p(z|y_{\theta_T}, H_i) = p(z|y_{\theta_T}, H_i)\, \nabla_{\theta_T} \ln p(z|y_{\theta_T}, H_i). \tag{34}$$
The relation between the RL transmitter gradient $\nabla_{\theta_T} L_\pi(\theta_R, \theta_T)$ in (27) and the transmitter gradient $\nabla_{\theta_T} L(\theta_R, \theta_T)$ in (33) is elucidated by the following proposition.

Proposition 2.
If the likelihood function $p(z|y_{\theta_T}, H_i)$ is differentiable with respect to the transmitter parameter vector $\theta_T$ for $i \in \{0, 1\}$, the following equality holds:
$$\nabla_{\theta_T} L_\pi(\theta_R, \theta_T) = \nabla_{\theta_T} L(\theta_R, \theta_T). \tag{35}$$

Proof.
See Appendix C. ∎
Proposition 2 establishes that the RL transmitter gradient $\nabla_{\theta_T} L_\pi(\theta_R, \theta_T)$ equals the transmitter gradient $\nabla_{\theta_T} L(\theta_R, \theta_T)$ for any given receiver parameters $\theta_R$. Proposition 2 hence provides a theoretical justification for replacing the gradient $\nabla_{\theta_T} L(\theta_R, \theta_T)$ with the RL gradient $\nabla_{\theta_T} L_\pi(\theta_R, \theta_T)$ to perform transmitter training, as done in Algorithms 1 and 2.

V. NUMERICAL RESULTS
This section first introduces the simulation setup and then presents numerical examples of waveform design and detection performance that compare the proposed data-driven methodology with existing model-based approaches. While the simulation results presented in this section rely on various models of target, clutter, and interference, this work expressly distinguishes data-driven learning from model-based design. Learning schemes rely solely on data and not on model information. In contrast, model-based design implies a system structure that is based on a specific and known model. Furthermore, learning may rely on synthetic data generated according to a variety of models, whereas model-based design typically relies on a single model. For example, as we will see, a synthetic dataset for learning may contain multiple clutter sample sets, each generated according to a different clutter model. Conversely, a single clutter model is typically assumed for model-based design.

A. Models, Policy, and Parameters

1) Models of Target, Clutter, and Noise:
The target is stationary and has a Rayleigh envelope, i.e., $\alpha \sim \mathcal{CN}(0, \sigma_\alpha^2)$. The noise has a zero-mean Gaussian distribution with correlation matrix $[\Omega_n]_{v,h} = \sigma_n^2 \rho^{|v-h|}$ for $(v,h) \in \{1, \cdots, K\}^2$, where $\sigma_n^2$ is the noise power and $\rho$ is the one-lag correlation coefficient. The clutter vector in (4) is the superposition of returns from $2K-1$ consecutive range cells, reflecting all clutter illuminated by the $K$-length signal as it sweeps in range across the target. Accordingly, the clutter vector may be expressed as
$$c = \sum_{g=-K+1}^{K-1} \gamma_g J_g y, \tag{36}$$
where $J_g$ represents the shifting matrix at the $g$th range cell, with elements
$$[J_g]_{v,h} = \begin{cases} 1 & \text{if } v - h = g \\ 0 & \text{if } v - h \neq g \end{cases}, \quad (v,h) \in \{1, \cdots, K\}^2. \tag{37}$$
The magnitude $|\gamma_g|$ of the $g$th clutter scattering coefficient is generated according to a Weibull distribution [5]
$$p(|\gamma_g|) = \frac{\beta}{\nu^\beta}\, |\gamma_g|^{\beta-1} \exp\left( -\frac{|\gamma_g|^\beta}{\nu^\beta} \right), \tag{38}$$
where $\beta$ is the shape parameter and $\nu$ is the scale parameter of the distribution. Let $\sigma_{\gamma_g}^2$ represent the power of the clutter scattering coefficient $\gamma_g$. The relation between $\sigma_{\gamma_g}^2$ and the Weibull distribution parameters $\{\beta, \nu\}$ is [50]
$$\sigma_{\gamma_g}^2 = \mathbb{E}\big\{ |\gamma_g|^2 \big\} = \frac{2\nu^2}{\beta}\, \Gamma\left( \frac{2}{\beta} \right), \tag{39}$$
where $\Gamma(\cdot)$ is the Gamma function. The nominal range of the shape parameter is $0.25 \le \beta \le 2$ [51]. In the simulation, the complex-valued clutter scattering coefficient $\gamma_g$ is obtained by multiplying a real-valued Weibull random variable $|\gamma_g|$ by the factor $\exp(j\psi_g)$, where $\psi_g$ is the phase of $\gamma_g$, distributed uniformly in the interval $(0, 2\pi)$. When the shape parameter is $\beta = 2$, the clutter scattering coefficient $\gamma_g$ follows the Gaussian distribution $\gamma_g \sim \mathcal{CN}(0, \sigma_{\gamma_g}^2)$.

Based on the assumed mathematical models of the target, clutter, and noise, it can be shown that the optimal detector in the NP sense is the square law detector [9], and the adaptive waveform for target detection can be obtained by maximizing the signal-to-clutter-plus-noise ratio at the receiver output at the time of target detection (see Appendix A of [44] for details).
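The clutter model (36)-(38) can be exercised numerically. The sketch below uses illustrative parameter values ($\beta$, $\nu$, and the waveform are assumptions, not values from the paper): it samples Weibull-magnitude, uniform-phase scattering coefficients, checks the power relation (39), and verifies that the shifting matrix (37) delays the waveform by one chip.

```python
import numpy as np
from math import gamma as gamma_fn

rng = np.random.default_rng(1)
K = 8
beta, nu = 2.0, 0.7   # illustrative Weibull shape and scale for (38)

# Magnitudes |gamma_g|: numpy's weibull() has unit scale, so multiply by nu
mag = nu * rng.weibull(beta, size=500_000)
# Uniform phase on (0, 2*pi) gives the complex scattering coefficients
gamma = mag * np.exp(1j * rng.uniform(0.0, 2 * np.pi, size=mag.size))

# Empirical power vs. the closed form (39): sigma^2 = (2 nu^2 / beta) Gamma(2 / beta)
power_mc = np.mean(np.abs(gamma) ** 2)
power_th = (2 * nu ** 2 / beta) * gamma_fn(2 / beta)

# Shifting matrix (37): [J_g]_{v,h} = 1 iff v - h = g, i.e., a delay by g chips
J = lambda g: np.eye(K, k=-g)
y = rng.standard_normal(K) + 1j * rng.standard_normal(K)
shifted = J(1) @ y   # one clutter range cell: gamma_1 * J_1 y (coefficient omitted)
```

For $\beta = 2$ the sampled coefficients reduce to complex Gaussian, consistent with the remark above.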
2) Transmitter and Receiver Models:
Waveform generation and detection are implemented using feedforward neural networks, as explained in Section II-B. The transmitter $\tilde{f}_{\theta_T}(\cdot)$ is a feedforward neural network with four layers, i.e., an input layer with $K$ neurons, two hidden layers with $M = 24$ neurons each, and an output layer with $K$ neurons. The activation function is the exponential linear unit (ELU) [52]. The receiver $\tilde{f}_{\theta_R}(\cdot)$ is implemented as a feedforward neural network with four layers, i.e., an input layer with $K$ neurons, two hidden layers with $M$ neurons each, and an output layer with one neuron. The sigmoid function is chosen as the activation function. The layout of the transmitter and receiver networks is summarized in Table I.

Table I
LAYOUT OF THE TRANSMITTER AND RECEIVER NETWORKS

             Transmitter $\tilde{f}_{\theta_T}(\cdot)$    Receiver $\tilde{f}_{\theta_R}(\cdot)$
Layer        1    2-3    4                                1    2-3       4
Dimension    K    M      K                                K    M         1
Activation   -    ELU    Linear                           -    Sigmoid   Sigmoid
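A minimal real-valued sketch of the Table I layout is given below. The actual system operates on complex baseband samples and learns the weights by training; here the weights are random and only the layer dimensions and activations of Table I are reproduced, with the transmitter output normalized to unit power as described in Section II-B.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 8, 24   # layer dimensions from Table I

elu = lambda x: np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def init(sizes):
    # One (W, b) pair per layer transition, with small random weights
    return [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def transmitter(s, params):
    h = s
    for W, b in params[:-1]:
        h = elu(W @ h + b)            # hidden layers: ELU
    W, b = params[-1]
    y = W @ h + b                     # output layer: linear
    return y / np.linalg.norm(y)      # normalize transmitted power to one

def receiver(z, params):
    h = z
    for W, b in params:
        h = sigmoid(W @ h + b)        # hidden and output layers: sigmoid
    return h                          # scalar in (0, 1): soft detection decision

tx_params = init([K, M, M, K])
rx_params = init([K, M, M, 1])
y = transmitter(rng.standard_normal(K), tx_params)
p = receiver(rng.standard_normal(K), rx_params)
```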
3) Gaussian Policy:
A Gaussian policy $\pi(a|y_{\theta_T})$ is adopted for RL-based transmitter training. Accordingly, the output of the stochastic transmitter follows a complex Gaussian distribution $a \sim \pi(a|y_{\theta_T}) = \mathcal{CN}\big( \sqrt{1-\sigma_p}\, y_{\theta_T},\, \frac{\sigma_p}{K} I_K \big)$, where the per-chip variance $\sigma_p$ is referred to as the policy hyperparameter. When $\sigma_p = 0$, the stochastic policy becomes deterministic [49], i.e., the policy is governed by a Dirac delta function at $y_{\theta_T}$. In this case, the policy does not explore the space of transmitted waveforms, but "exploits" the current waveform. At the opposite end, when $\sigma_p = 1$, the output of the stochastic transmitter is independent of $y_{\theta_T}$, and the policy becomes zero-mean complex Gaussian noise with covariance matrix $I_K / K$. Thus, the policy hyperparameter $\sigma_p$ is selected in the range $(0, 1]$, and its value sets a trade-off between exploration of new waveforms and exploitation of the current waveform.
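The Gaussian policy can be sketched as follows. The code checks two properties implied by the text: the expected transmitted power $\mathbb{E}\{\|a\|^2\} = (1-\sigma_p)\|y_{\theta_T}\|^2 + \sigma_p$ equals one for a unit-power waveform, and $\sigma_p = 0$ recovers the deterministic waveform. The waveform and the value of $\sigma_p$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
K, sigma_p = 8, 0.3   # illustrative policy hyperparameter

# Unit-power reference waveform standing in for y_theta_T
y = rng.standard_normal(K) + 1j * rng.standard_normal(K)
y /= np.linalg.norm(y)

def sample_policy(y, sigma_p, n, rng):
    # a ~ CN(sqrt(1 - sigma_p) * y, (sigma_p / K) * I_K): circularly symmetric
    # complex Gaussian with per-chip variance sigma_p / K around the scaled mean
    K = y.size
    noise = rng.standard_normal((n, K)) + 1j * rng.standard_normal((n, K))
    noise *= np.sqrt(sigma_p / (2 * K))
    return np.sqrt(1.0 - sigma_p) * y + noise

a = sample_policy(y, sigma_p, 100_000, rng)
mean_power = np.mean(np.sum(np.abs(a) ** 2, axis=1))  # = (1 - sigma_p) + sigma_p
```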
4) Training Parameters:
The initialization waveform $s$ is a linear frequency modulated pulse with $K = 8$ complex-valued chips and chirp rate $R = (100 \times 10^3)/(40 \times 10^{-6})$ Hz/s. Specifically, the $k$th chip of $s$ is given by
$$s(k) = \frac{1}{\sqrt{K}} \exp\big\{ j\pi R (k/f_s)^2 \big\} \tag{40}$$
for $k \in \{0, \ldots, K-1\}$, where $f_s = 200$ kHz. The signal-to-noise ratio (SNR) is defined as
$$\mathrm{SNR} = 10 \log_{10} \left\{ \frac{\sigma_\alpha^2}{\sigma_n^2} \right\}. \tag{41}$$
Training was performed at an SNR of $12.5$ dB. The clutter environment is uniform with $\sigma_{\gamma_g}^2 = -11.76$ dB, $\forall g \in \{-K+1, \ldots, K-1\}$, such that the overall clutter power is $\sum_{g=-(K-1)}^{K-1} \sigma_{\gamma_g}^2 = 0$ dB. The noise power is $\sigma_n^2 = 0$ dB, and the noise is correlated with one-lag correlation coefficient $\rho$. Denote by $\beta_{\text{train}}$ and $\beta_{\text{test}}$ the shape parameters of the clutter distribution (38) applied in the training and test stages, respectively. Unless stated otherwise, we set $\beta_{\text{train}} = \beta_{\text{test}} = 2$.

To obtain a balanced classification dataset, the training set is populated by samples belonging to either hypothesis with equal prior probability, i.e., $P(H_0) = P(H_1) = 0.5$. The number of training samples used in the estimated gradients (11), (15), (28), and (29) is set as $Q_R = Q_T = Q$. Unless stated otherwise, the policy hyperparameter $\sigma_p$ is set to the value that minimizes the empirical training loss (see Fig. 6), and the penalty parameter is $\lambda = 0$, i.e., there are no waveform constraints. The Adam optimizer [53] is adopted to train the system with a fixed learning rate $\epsilon$ over a number of iterations chosen by trial and error. In the testing phase, independent Monte Carlo samples are used to estimate the probability of false alarm ($P_{fa}$) under hypothesis $H_0$ and the probability of detection ($P_d$) under hypothesis $H_1$. Receiver operating characteristic (ROC) curves are obtained via Monte Carlo simulations by varying the threshold applied at the output of the receiver. Results are obtained by averaging over fifty trials. Numerical results presented in this section assume simultaneous training, unless stated otherwise.
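The initialization waveform (40) can be generated directly; factoring the chirp rate as a 100 kHz sweep over the 40 μs pulse length $K/f_s$ is an assumption consistent with the stated value of $R$.

```python
import numpy as np

K, fs = 8, 200e3          # number of chips and sampling rate from the text
B, T = 100e3, K / fs      # assumed sweep bandwidth and pulse duration (40 us)
R = B / T                 # chirp rate in Hz/s

k = np.arange(K)
s = np.exp(1j * np.pi * R * (k / fs) ** 2) / np.sqrt(K)   # eq. (40)
```

The $1/\sqrt{K}$ scaling gives the unit transmitted power assumed throughout.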
B. Results and Discussion

1) Simultaneous Training vs. Training with Known Likelihood:
We first analyze the impact of the choice of the policy hyperparameter $\sigma_p$ on the performance on the training set. Fig. 6 shows the empirical cross-entropy loss of simultaneous training versus the policy hyperparameter $\sigma_p$ upon completion of the training process. The empirical loss of the system trained with a known channel (32) is plotted as a comparison. It is seen that there is an optimal policy hyperparameter $\sigma_p$ for which the empirical loss of simultaneous training approaches the loss with a known channel.

Fig. 6. Empirical training loss versus policy hyperparameter $\sigma_p$ for the simultaneous training algorithm and training with a known channel, respectively.

As the policy hyperparameter $\sigma_p$ tends to $0$, the output of the stochastic transmitter $a$ is close to the waveform $y_{\theta_T}$, which leads to no exploration of the space of transmitted waveforms. In contrast, when the policy hyperparameter $\sigma_p$ tends to $1$, the output of the stochastic transmitter becomes complex Gaussian noise with zero mean and covariance matrix $I_K/K$. In both cases, the RL transmitter gradient is difficult to estimate accurately.

While Fig. 6 evaluates the performance on the training set in terms of empirical cross-entropy loss, the choice of the policy hyperparameter $\sigma_p$ should be based on validation data and on the testing criterion that is ultimately of interest. To elaborate on this point, ROC curves obtained by simultaneous training with different values of the policy hyperparameter $\sigma_p$ and by training with a known channel are shown in Fig. 7. As shown in the figure, simultaneous training with the best-performing value of $\sigma_p$ achieves a similar ROC to training with a known channel. This value of $\sigma_p$ also yields the lowest empirical training loss in Fig. 6. These results suggest that training is not subject to overfitting [34].
2) Simultaneous Training vs. Alternating Training:
We now compare simultaneous and alternating training in terms of the ROC curves in Fig. 8. ROC curves based on the optimal detector in the NP sense, namely the square law detector [9], with the adaptive/initialization waveform are plotted as benchmarks. As shown in the figure, simultaneous training provides detection performance similar to alternating training. Furthermore, both simultaneous and alternating training are seen to result in significant improvements compared to training of only the receiver, and provide detection performance comparable to the adaptive waveform [44] and square law detector.

Fig. 7. ROC curves for training with a known channel and simultaneous training with different values of the policy hyperparameter $\sigma_p$.

Fig. 8. ROC curves with and without transmitter training.
3) Learning Gaussian and Non-Gaussian Clutter:
Two sets of ROC curves under different clutter statistics are illustrated in Fig. 9. Each set contains two ROC curves with the same clutter statistics: one curve is obtained by simultaneous training, and the other is based on model-based design. For simultaneous training, the shape parameter of the clutter distribution (38) in the training stage is the same as that in the test stage, i.e., $\beta_{\text{train}} = \beta_{\text{test}}$. In the test stage, for Gaussian clutter ($\beta_{\text{test}} = 2$), the model-based ROC curve is obtained with the adaptive waveform and the optimal detector in the NP sense. As expected, simultaneous training provides detection performance comparable to the adaptive waveform and square law detector (also shown in Fig. 8). In contrast, when the clutter is non-Gaussian ($\beta_{\text{test}} < 1$), the optimal detector in the NP sense is mathematically intractable. In this scenario, the data-driven approach is beneficial since it relies on data rather than on a model. As observed in the figure, for non-Gaussian clutter with shape parameter $\beta_{\text{test}} < 1$, simultaneous training outperforms the adaptive waveform and square law detector.

Fig. 9. ROC curves for Gaussian/non-Gaussian clutter. The end-to-end radar system is trained and tested with the same clutter statistics, i.e., $\beta_{\text{train}} = \beta_{\text{test}}$.
4) Simultaneous Training with Mixed Clutter Statistics:
The robustness of the trained radar system to the clutter statistics is investigated next. As discussed previously, model-based design relies on a single clutter model, whereas data-driven learning depends on a training dataset. The dataset may contain samples from multiple clutter models. Thus, a system based on data-driven learning may be robustified by drawing samples from a mixture of clutter models. In the test stage, the clutter model need not coincide with any of the clutter models used in the training stage. For the simultaneous training results shown in Fig. 10, the training dataset contains clutter samples generated from (38) with four different values of the shape parameter $\beta_{\text{train}}$. The test data is generated with a clutter shape parameter $\beta_{\text{test}}$ not included in the training dataset. The end-to-end learning radar system trained by mixing clutter samples provides performance gains compared to a model-based system using an adaptive waveform and square law detector.

Fig. 10. ROC curves for non-Gaussian clutter. To robustify detection performance, the end-to-end learning radar system is trained with mixed clutter statistics, while testing on a clutter model different from those used for training.
5) Simultaneous Training under PAR Constraint:
Detection performance with waveforms learned subject to a PAR constraint is shown in Fig. 11. The end-to-end system trained with no PAR constraint, i.e., $\lambda = 0$, serves as the reference. It is observed that the detection performance degrades as the value of the penalty parameter $\lambda$ increases. Moreover, the PAR values of waveforms trained with different $\lambda$ are shown in Table II. As shown in Fig. 11 and Table II, there is a tradeoff between detection performance and PAR level. For instance, at a fixed probability of false alarm, training the transmitter with the largest penalty parameter yields the lowest $P_d$ together with the lowest PAR value of $0.17$ dB. In contrast, training the transmitter with no PAR constraint, i.e., $\lambda = 0$, yields the best detection with the largest PAR value of $3.92$ dB. Fig. 12 compares the normalized modulus of waveforms learned with different values of the penalty parameter $\lambda$. As shown in Fig. 12 and Table II, the larger the penalty parameter $\lambda$ adopted in simultaneous training, the smaller the PAR value of the waveform.

Table II
PAR VALUES OF WAVEFORMS WITH DIFFERENT VALUES OF THE PENALTY PARAMETER $\lambda$

                $\lambda = 0$ (reference)    smaller $\lambda > 0$    larger $\lambda > 0$
PAR [dB] (17)   3.92                         1.76                     0.17
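The PAR values in Table II can be reproduced for any candidate waveform once (17) is implemented. The sketch below assumes the standard definition $\mathrm{PAR} = K \max_k |y_k|^2 / \|y\|^2$ (peak chip power over average chip power); under this definition a constant-modulus waveform attains the minimum of $0$ dB and a single-chip pulse the maximum of $10\log_{10} K$ dB.

```python
import numpy as np

def par_db(y):
    # Assumed PAR definition: K * max_k |y_k|^2 / ||y||^2, in dB
    K = y.size
    par = K * np.max(np.abs(y) ** 2) / np.sum(np.abs(y) ** 2)
    return 10.0 * np.log10(par)

K = 8
k = np.arange(K)
# Constant-modulus (quadratic-phase) waveform: best possible PAR of 0 dB
constant_modulus = np.exp(1j * 2 * np.pi * k ** 2 / K) / np.sqrt(K)
# All energy in one chip: worst-case PAR of 10*log10(K) dB
spiky = np.zeros(K, dtype=complex)
spiky[0] = 1.0

par_cm = par_db(constant_modulus)
par_spiky = par_db(spiky)
```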
Fig. 11. ROC curves under the PAR constraint with different values of the penalty parameter $\lambda$.

Fig. 12. Normalized modulus of transmitted waveforms with different values of the penalty parameter $\lambda$.
6) Simultaneous Training under Spectral Compatibility Constraint:
ROC curves under the spectral compatibility constraint with different values of the penalty parameter $\lambda$ are illustrated in Fig. 13. The shared frequency bands are two normalized-frequency intervals $\Gamma_1$ and $\Gamma_2$. The end-to-end system trained with no spectral compatibility constraint, i.e., $\lambda = 0$, serves as the reference. Training the transmitter with a large value of the penalty parameter $\lambda$ is seen to result in performance degradation. The interfering energy of radar waveforms trained with different values of $\lambda$ is shown in Table III. It is observed that $\lambda$ plays an important role in controlling the tradeoff between detection performance and spectral compatibility of the waveform. For instance, at a fixed probability of false alarm, training the transmitter with $\lambda = 0$ yields the highest $P_d$ but places an interfering energy of $-5.79$ dB in the shared frequency bands, while training the transmitter with $\lambda = 1$ creates notches in the spectrum of the transmitted waveform at the shared frequency bands. Energy spectral densities of transmitted waveforms with different values of $\lambda$ are illustrated in Fig. 14. A larger penalty parameter $\lambda$ results in a lower amount of interfering energy in the prescribed shared frequency regions. Note, for instance, that the nulls of the energy spectral density of the waveform for $\lambda = 1$ are much deeper than their counterparts for the smaller nonzero value of $\lambda$.

Fig. 13. ROC curves under the spectral compatibility constraint for different values of the penalty parameter $\lambda$.

Table III
INTERFERING ENERGY OF RADAR WAVEFORMS WITH DIFFERENT VALUES OF THE PENALTY PARAMETER $\lambda$

                              $\lambda = 0$ (reference)    intermediate $\lambda$    $\lambda = 1$
Interfering energy [dB] (20)  -5.79                        -10.39                    -17.11
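The interfering energy can be computed as the quadratic form $y^H \Omega y$ once the interference matrix $\Omega$ over the shared bands is formed. The band edges below are hypothetical, and the closed-form $\Omega$ follows the standard construction of integrating complex exponentials over each shared band (cf. [48]).

```python
import numpy as np

K = 8
bands = [(0.30, 0.35), (0.50, 0.60)]  # hypothetical shared bands Gamma_1, Gamma_2

def interference_matrix(K, bands):
    # Omega_{v,h} = sum over shared bands of the integral of e^{j 2 pi f (v-h)} df
    v = np.arange(K)
    d = v[:, None] - v[None, :]
    omega = np.zeros((K, K), dtype=complex)
    for fl, fu in bands:
        with np.errstate(divide="ignore", invalid="ignore"):
            term = (np.exp(1j * 2 * np.pi * fu * d) -
                    np.exp(1j * 2 * np.pi * fl * d)) / (1j * 2 * np.pi * d)
        term[d == 0] = fu - fl  # the d = 0 limit is just the bandwidth
        omega += term
    return omega

omega = interference_matrix(K, bands)
y = np.ones(K, dtype=complex) / np.sqrt(K)   # unit-power reference waveform
energy = np.real(y.conj() @ omega @ y)       # interfering energy y^H Omega y
energy_db = 10.0 * np.log10(energy)
```

Since $y^H \Omega y = \int_{\Gamma_1 \cup \Gamma_2} |Y(f)|^2 df$ for a unit-power waveform, the interfering energy always lies strictly between $0$ and $1$.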
Fig. 14. Energy spectral density of waveforms with different values of the penalty parameter $\lambda$.

VI. CONCLUSIONS
In this paper, we have formulated the radar design problem as end-to-end learning of waveform generation and detection. We have developed two training algorithms, both of which are able to incorporate various waveform constraints into the system design. Training may be implemented either as simultaneous supervised training of the receiver and RL-based training of the transmitter, or by alternating between training of the receiver and of the transmitter. Both training algorithms yield similar performance. We have also robustified the detection performance by training the system with mixed clutter statistics. Numerical results have shown that the proposed end-to-end learning approaches are beneficial under non-Gaussian clutter, and successfully adapt the transmitted waveform to the actual statistics of environmental conditions, while satisfying operational constraints.

APPENDIX A
GRADIENT OF PENALTY FUNCTIONS
In this appendix, we derive the gradients of the penalty functions (17) and (20) with respect to the transmitter parameter vector $\theta_T$. To facilitate the presentation, let $\bar{y}_{\theta_T}$ represent the $2K \times 1$ real vector comprising the real and imaginary parts of the waveform $y_{\theta_T}$, i.e., $\bar{y}_{\theta_T} = \big[ \Re(y_{\theta_T})^T, \Im(y_{\theta_T})^T \big]^T$.
1) Gradient of PAR Penalty Function:
As discussed in Section II-B, the transmitted power is normalized such that $\| y_{\theta_T} \| = \| \bar{y}_{\theta_T} \| = 1$. Let the subscript "max" denote the chip index associated with the PAR value (17). By leveraging the chain rule, the gradient of (17) with respect to $\theta_T$ is written as
$$\nabla_{\theta_T} J_{\text{PAR}}(\theta_T) = \nabla_{\theta_T} \bar{y}_{\theta_T} \cdot g_{\text{PAR}}, \tag{A.1}$$
where $g_{\text{PAR}}$ represents the gradient of the PAR penalty function $J_{\text{PAR}}(\theta_T)$ with respect to $\bar{y}_{\theta_T}$, and is given by
$$g_{\text{PAR}} = \big[ 0, \ldots, 0, 2K \Re(y_{\theta_T, \max}), 0, \ldots, 0, \ldots, 0, 2K \Im(y_{\theta_T, \max}), 0, \ldots, 0 \big]^T. \tag{A.2}$$
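The analytic gradient (A.2) can be validated numerically. The sketch below assumes the PAR penalty takes the form $J_{\text{PAR}} = K \max_k |y_k|^2$ on the unit-power sphere (consistent with the normalization $\|\bar{y}_{\theta_T}\| = 1$) and compares (A.2) against central finite differences.

```python
import numpy as np

K = 8

# Unit-norm stacked real vector [Re(y); Im(y)] with a clearly unique peak chip
ybar = np.ones(2 * K)
ybar[0], ybar[K] = 2.0, 2.0
ybar /= np.linalg.norm(ybar)

def par_penalty(ybar):
    # Assumed PAR penalty on the unit sphere: K * max_k |y_k|^2
    y = ybar[:K] + 1j * ybar[K:]
    return K * np.max(np.abs(y) ** 2)

# Analytic gradient (A.2): nonzero only at the real/imaginary parts of the peak chip
m = int(np.argmax(ybar[:K] ** 2 + ybar[K:] ** 2))
g = np.zeros(2 * K)
g[m] = 2 * K * ybar[m]
g[K + m] = 2 * K * ybar[K + m]

# Central finite differences for comparison
eps = 1e-6
fd = np.zeros(2 * K)
for j in range(2 * K):
    e = np.zeros(2 * K)
    e[j] = eps
    fd[j] = (par_penalty(ybar + e) - par_penalty(ybar - e)) / (2 * eps)
```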
2) Gradient of Spectral Compatibility Penalty Function:
According to the chain rule, the gradient of (20) with respect to $\theta_T$ is expressed as
$$\nabla_{\theta_T} J_{\text{spectrum}}(\theta_T) = \nabla_{\theta_T} \bar{y}_{\theta_T} \cdot g_{\text{spectrum}}, \tag{A.3}$$
where $g_{\text{spectrum}}$ denotes the gradient of the spectral compatibility penalty function $J_{\text{spectrum}}(\theta_T)$ with respect to $\bar{y}_{\theta_T}$, and is given by
$$g_{\text{spectrum}} = 2 \begin{bmatrix} \Re\big\{ (\Omega y_{\theta_T})^* \big\} \\ -\Im\big\{ (\Omega y_{\theta_T})^* \big\} \end{bmatrix}. \tag{A.4}$$

APPENDIX B
PROOF OF PROPOSITION 1

Proof.
The average loss function of simultaneous training, $L_\pi(\theta_R, \theta_T)$ in (25), can be expressed as
$$L_\pi(\theta_R, \theta_T) = \sum_{i \in \{0,1\}} P(H_i) \int_{\mathcal{A}} \pi(a|y_{\theta_T}) \int_{\mathcal{Z}} \ell\big( f_{\theta_R}(z), i \big)\, p(z|a, H_i)\, dz\, da. \tag{B.1}$$
As discussed in Section II-B, the last layer of the receiver implementation consists of a sigmoid activation function, which leads to a receiver output $f_{\theta_R}(z) \in (0, 1)$. Thus, there exists a constant $b$ such that $\sup_{z,i} \ell\big( f_{\theta_R}(z), i \big) < b < \infty$. Furthermore, for $i \in \{0, 1\}$, the instantaneous values of the cross-entropy loss $\ell\big( f_{\theta_R}(z), i \big)$, the policy $\pi(a|y_{\theta_T})$, and the likelihood $p(z|a, H_i)$ are continuous in the variables $a$ and $z$. By leveraging Fubini's theorem [54] to exchange the order of integration in (B.1), we have
$$L_\pi(\theta_R, \theta_T) = \sum_{i \in \{0,1\}} P(H_i) \int_{\mathcal{Z}} \ell\big( f_{\theta_R}(z), i \big) \int_{\mathcal{A}} p(z|a, H_i)\, \pi(a|y_{\theta_T})\, da\, dz. \tag{B.2}$$
Note that, for a waveform $y_{\theta_T}$ and a target state indicator $i$, the product of the likelihood $p(z|a, H_i)$ and the policy $\pi(a|y_{\theta_T})$ is the joint PDF of the two random variables $a$ and $z$, namely,
$$p(z|a, H_i)\, \pi(a|y_{\theta_T}) = p(a, z|y_{\theta_T}, H_i). \tag{B.3}$$
Substituting (B.3) into (B.2), we obtain
$$\begin{aligned} L_\pi(\theta_R, \theta_T) &= \sum_{i \in \{0,1\}} P(H_i) \int_{\mathcal{Z}} \ell\big( f_{\theta_R}(z), i \big) \int_{\mathcal{A}} p(a, z|y_{\theta_T}, H_i)\, da\, dz \\ &= \sum_{i \in \{0,1\}} P(H_i) \int_{\mathcal{Z}} \ell\big( f_{\theta_R}(z), i \big)\, p(z|y_{\theta_T}, H_i)\, dz, \end{aligned} \tag{B.4}$$
where the second equality holds by integrating the joint PDF $p(a, z|y_{\theta_T}, H_i)$ over the random variable $a$, i.e., $\int_{\mathcal{A}} p(a, z|y_{\theta_T}, H_i)\, da = p(z|y_{\theta_T}, H_i)$.

Taking the gradient of (B.4) with respect to $\theta_R$, we have
$$\nabla_{\theta_R} L_\pi(\theta_R, \theta_T) = \sum_{i \in \{0,1\}} P(H_i) \int_{\mathcal{Z}} p(z|y_{\theta_T}, H_i)\, \nabla_{\theta_R} \ell\big( f_{\theta_R}(z), i \big)\, dz = \nabla_{\theta_R} L_R(\theta_R), \tag{B.5}$$
where the second equality holds via (10). This completes the proof of Proposition 1. ∎

APPENDIX C
PROOF OF PROPOSITION 2

Proof.
According to (B.4), the gradient of the average loss function of simultaneous training with respect to $\theta_T$ is given by
$$\nabla_{\theta_T} L_\pi(\theta_R, \theta_T) = \sum_{i \in \{0,1\}} P(H_i) \int_{\mathcal{Z}} \ell\big( f_{\theta_R}(z), i \big)\, \nabla_{\theta_T} p(z|y_{\theta_T}, H_i)\, dz = \nabla_{\theta_T} L(\theta_R, \theta_T), \tag{C.1}$$
where the last equality holds by (33) and (34). This completes the proof of Proposition 2. ∎

REFERENCES

[1] S. M. Kay,
Fundamentals of Statistical Signal Processing, Vol. II: Detection Theory. New York, NY, USA: Pearson, 1998.[2] D. F. DeLong and E. M. Hofstetter, "On the design of optimum radar waveforms for clutter rejection,"
IEEE Trans. Inf. Theory ,vol. 13, no. 7, pp. 454-463, Jul. 1967.[3] P. Stoica, H. He, and J. Li, “Optimization of the receive filter and transmit sequence for active sensing,”
IEEE Trans. SignalProcess. , vol. 60, no. 4, pp. 1730-1740, Apr. 2012.[4] S. M. Kay, “Optimal signal design for detection of Gaussian point targets in stationary Gaussian clutter/reverberation,”
IEEE J. Sel. Topics Signal Process. , vol. 1, no. 1, pp. 31-41, Jun. 2007.[5] M. A. Richards, J. A. Scheer, and W. A. Holm,
Principles of Modern Radar.
Edison, NJ, USA: Scitech Pub., 2010.[6] F. Gini, A. De Maio, and L. Patton,
Waveform Design and Diversity for Advanced Radar Systems . London, UK: Inst. Eng.Technol., 2012.[7] H. L. Van Trees,
Detection, Estimation, and Modulation Theory, Part I: Detection, Estimation and Linear Modulation Theory.
New York, NY, USA: John Wiley & Sons, 2004.[8] D. P. Meyer and H. A. Mayer,
Radar Target Detection.
New York, NY, USA: Academic Press, 1973.[9] M. A. Richards,
Fundamentals of Radar Signal processing.
New York, NY, USA: McGraw-Hill, 2005.[10] K. J. Sangston and K. R. Gerlach, “Coherent detection of radar targets in a non-Gaussian background,”
IEEE Trans. Aerosp.Electron. Syst. , vol. 30, no. 2, pp. 330-340, Apr. 1994.[11] F. Gini, “Sub-optimum coherent radar detection in a mixture of K-distributed and Gaussian clutter,” in
Proc. IEE Radar,Sonar and Navigat. , vol. 144, no. 1, pp. 39-48, Feb. 1997.[12] K. J. Sangston, F. Gini, M. V. Greco, and A. Farina, “Structures for radar detection in compound Gaussian clutter,”
IEEETrans. Aerosp. Electron. Syst. , vol. 35, no. 2, pp. 445-458, Apr. 1999.[13] A. Farina, A. Russo, and F. A. Studer, “Coherent radar detection in log-normal clutter,”
IEE Proc. F, Commun. Radar andSignal Process. , vol. 133, no. 1, Feb. 1986.[14] F. A. Pentini, A. Farina, and F. Zirilli, “Radar detection of targets located in a coherent K-distributed clutter background,”
IEE Proc. F, Radar and Signal Process. , vol. 139, no. 3, pp. 239-245, Jun. 1992.[15] S. M. Kay, “Waveform design for multistatic radar detection,”
IEEE Trans. Aerosp. Electron. Syst. , vol. 45, no. 4, pp.1153-1165, Jul. 2009. [16] M. M. Naghsh, M. Modarres-Hashemi, S. Shahbazpanahi, M. Soltanalian, and P. Stoica, “Unified optimization frameworkfor multi-static radar code design using information-theoretic criteria,” IEEE Trans. Signal Process. , vol. 61, no. 21, pp.5401-5416, Nov. 2013.[17] S. Khalili, O. Simeone and A. Haimovich, “Cloud radio-multistatic radar: Joint optimization of code vector and backhaulquantization,”
IEEE Signal Process. Lett., vol. 22, no. 4, pp. 494-498, Apr. 2015.[18] S. Jeong, S. Khalili, O. Simeone, A. Haimovich, and J. Kang, "Multistatic cloud radar systems: Joint sensing and communication design,"
Trans. Emerg. Telecommun. Technol. , vol. 27, no. 5, pp. 716-730, Feb. 2016.[19] A. De Maio, Y. Huang, M. Piezzo, S. Zhang, and A. Farina, “Design of optimized radar codes with a peak to averagepower ratio constraint,”
IEEE Trans. Signal Process. , vol. 59, no. 6, pp. 2683-2697, Jun. 2011.[20] B. Tang, M. M. Naghshm, and J. Tang, “Relative entropy-based waveform design for MIMO radar detection in the presenceof clutter and interference,”
IEEE Trans. Signal Process. , vol. 63, no. 14, pp. 3783-3796, Apr. 2015.[21] M. M. Naghsh, M. Modarres-Hashemi, M. A. Kerahroodi, and E. H. M. Alian, “An information theoretic approach torobust constrained code design for MIMO radars,”
IEEE Trans. Signal Process., vol. 65, no. 14, pp. 3647-3661, Jul. 2017.[22] L. Wu, P. Babu, and D. P. Palomar, "Transmit waveform/receive filter design for MIMO radar with multiple waveform constraints,"
IEEE Trans. Signal Process.
IEEE Aerosp. Electron. Syst. Mag. , vol. 31, no. 12, pp. 14-25, Dec. 2016.[25] W. Jiang and A. M. Haimovich, “Joint optimization of waveform and quantization in spectral congestion conditions,” in
Proc. 52nd Asilomar Conf. Signals, Syst., Comput. , Oct. 2018, pp. 1897-1898.[26] L. Zheng, M. Lops, X. Wang, and E. Grossi, “Joint design of overlaid communication systems and pulse radars,”
IEEE Trans. Signal Process., vol. 66, no. 1, pp. 139-154, Jan. 2018.
[27] W. Jiang and A. M. Haimovich, “Waveform optimization in cloud radar with spectral congestion constraints,” in Proc. IEEE Radar Conf., Apr. 2019, pp. 1-6.
[28] B. Tang and J. Li, “Spectrally constrained MIMO radar waveform design based on mutual information,” IEEE Trans. Signal Process., vol. 67, no. 3, pp. 821-834, Feb. 2019.
[29] N. Sebe, I. Cohen, A. Garg, and T. S. Huang, Machine Learning in Computer Vision. Heidelberg, Germany: Springer, 2005.
[30] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Jun. 2009, pp. 248-255.
[31] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” IEEE Comput. Intell. Mag., vol. 13, no. 3, pp. 55-75, Aug. 2018.
[32] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[33] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563-575, Dec. 2017.
[34] O. Simeone, “A brief introduction to machine learning for engineers,” Found. Trends® Signal Process., vol. 12, no. 3-4, pp. 200-431, Aug. 2018.
[35] M. Kim, W. Lee, and D. H. Cho, “A novel PAPR reduction scheme for OFDM system based on deep learning,” IEEE Commun. Lett., vol. 22, no. 3, pp. 510-513, Mar. 2018.
[36] F. A. Aoudia and J. Hoydis, “Model-free training of end-to-end communication systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 11, pp. 2503-2516, Nov. 2019.
[37] V. Raj and S. Kalyani, “Backpropagating through the air: Deep learning at physical layer without channel models,” IEEE Commun. Lett., vol. 22, no. 11, pp. 2278-2281, Nov. 2018.
[38] O. Simeone, “A very brief introduction to machine learning with applications to communication systems,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 4, pp. 648-664, Dec. 2018.
[39] O. Simeone, S. Park, and J. Kang, “From learning to meta-learning: Reduced training overhead and complexity for communication systems,” in Proc. 6G Wireless Summit, Mar. 2020.
[40] S. Park, O. Simeone, and J. Kang, “Meta-learning to communicate: Fast end-to-end training for fading channels,” in Proc. IEEE 45th Int. Conf. Acoustics, Speech and Signal Process. (ICASSP), May 2020.
[41] S. Park, O. Simeone, and J. Kang, “End-to-end fast training of communication links without a channel model via online meta-learning,” arXiv preprint arXiv:2003.01479, 2020.
[42] M. P. Jarabo-Amores, M. Rosa-Zurera, R. Gil-Pita, and F. Lopez-Ferreras, “Study of two error functions to approximate the Neyman-Pearson detector using supervised learning machines,” IEEE Trans. Signal Process., vol. 57, no. 11, pp. 4175-4181, Nov. 2009.
[43] M. P. Jarabo-Amores, D. de la Mata-Moya, R. Gil-Pita, and M. Rosa-Zurera, “Radar detection with the Neyman-Pearson criterion using supervised-learning-machines trained with the cross-entropy error,” EURASIP Journal on Advances in Signal Process., vol. 2013, no. 1, pp. 44-54, Mar. 2013.
[44] W. Jiang, A. M. Haimovich, and O. Simeone, “End-to-end learning of waveform generation and detection for radar systems,” in Proc. IEEE 53rd Asilomar Conf. on Signals, Systems, and Computers, Nov. 2019, pp. 1672-1676.
[45] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[46] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[47] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Adv. Neural Inf. Process. Syst., vol. 12, 2000.
[48] A. Aubry, A. De Maio, Y. Huang, M. Piezzo, and A. Farina, “A new radar waveform design algorithm with improved feasibility for spectral coexistence,” IEEE Trans. Aerosp. Electron. Syst., vol. 51, no. 2, pp. 1029-1038, Apr. 2015.
[49] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. Int. Conf. Machine Learning, 2014, pp. 387-395.
[50] A. Farina, A. Russo, F. Scannapieco, and S. Barbarossa, “Theory of radar detection in coherent Weibull clutter,” IEE Proc. F, Commun., Radar and Signal Process., vol. 134, no. 2, pp. 174-190, Apr. 1987.
[51] D. A. Shnidman, “Generalized radar clutter model,” IEEE Trans. Aerosp. Electron. Syst., vol. 35, no. 3, pp. 857-865, Jul. 1999.
[52] D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” in Proc. Int. Conf. Learn. Represent., May 2016, pp. 1-14.
[53] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Represent., May 2015, pp. 1-15.
[54] W. Rudin,