Multi-Person Continuous Tracking and Identification from mm-Wave micro-Doppler Signatures
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING
Jacopo Pegoraro, Graduate Student Member, IEEE, Francesca Meneghello, Graduate Student Member, IEEE, and Michele Rossi, Senior Member, IEEE
Abstract—In this work, we investigate the use of backscattered mm-wave radio signals for the joint tracking and recognition of identities of humans as they move within indoor environments. We build a system that effectively works with multiple persons concurrently sharing and freely moving within the same indoor space. This leads to a complicated setting, which requires one to deal with the randomness and complexity of the resulting (composite) backscattered signal. The proposed system combines several processing steps: at first, the signal is filtered to remove artifacts, reflections and random noise that do not originate from humans. Hence, a density-based classification algorithm is executed to separate the Doppler signatures of different users. The final blocks are trajectory tracking and user identification, respectively based on Kalman filters and deep neural networks. Our results demonstrate that the integration of the last-mentioned processing stages is critical towards achieving robustness and accuracy in multi-user settings. Our technique is tested both on a single-target public dataset, for which it outperforms state-of-the-art methods, and on our own measurements, obtained with a GHz radar on multiple subjects simultaneously moving in two different indoor environments. The system works in an online fashion, permitting the continuous identification of multiple subjects with accuracies up to %, e.g., with four subjects sharing the same physical space, and with a small accuracy reduction when tested with unseen data from a challenging real-life scenario that was not part of the model learning phase.

Index Terms—multi-person identification, convolutional neural networks, density-based clustering, mm-wave radar, micro-Doppler, indoor monitoring, human tracking.
I. INTRODUCTION

Radar devices for indoor spaces have recently gathered considerable attention. They work by emitting radio waves and analyzing the signal that is reflected by the environment and collected at their receiving antennas. In contrast with camera surveillance systems, they are insensitive to poor light conditions and are more privacy preserving, as no video of the scene is collected [1]. Radars are also energy efficient compared to other technologies such as LIDARs [2]. In this work, we propose a multi-person online identification framework that is based on the analysis of the
This work has been supported, in part, by MIUR (Italian Ministry of Education, University and Research) through the initiative "Departments of Excellence" (Law 232/2016) and by the EU MSCA ITN project MINTS "MIllimeter-wave NeTworking and Sensing for Beyond 5G" (grant no. 861222). The authors are with the Department of Information Engineering at the University of Padova, via Gradenigo 36/b, 35131, Padova, Italy. Digital Object Identifier 10.1109/TGRS.2020.3019915

(reflected) signal received by a millimeter-wave (mm-wave) low power frequency-modulated continuous-wave (FMCW) radar. Our work stems from the observation that reflected signals collected as a subject walks in near proximity of the radar are person-specific, as radio reflections depend on the body shape and, in time, on the movement. As such, they can be used to recognize the identity of humans moving in proximity of the radar device. Our system achieves accuracies as high as % with four persons moving within a relatively small indoor place. Such performance is achieved in an online fashion (continuous tracking and identification), allowing one to recognize user identities as these share the same physical space, without relying on any visual representation of the scene. We stress that previous work [1], [3], [4] has coped with a single-person identification problem, and the multi-user case has only been addressed in an offline fashion through the superposition of multiple single-person signals. In contrast, we build a system that effectively works when multiple persons concurrently share and freely move within the same indoor space, directly working on the composite reflected signal that they generate. To distinguish different persons from their way of walking (gait), we analyze their micro-Doppler signature (µD), i.e., the small scale Doppler effect caused by the human motion.
In the interest of developing a low-complexity system, we first extract µD features by performing range-Doppler (RD) processing (i.e., distance and velocity) of the signal gathered from a single receiving antenna. After that, we address the limitations of RD processing by tackling the so called range-Doppler-azimuth (RDA) space, through the integration of the angle-of-arrival (AoA) of the received radio reflections, estimated using multiple receiving antennas. The AoA information allows resolving targets which are at the same distance from the radar device, and that move with the same velocity; these targets would hardly be separable in the simpler RD space.

The simultaneous identification of multiple targets requires tracking and separating the subjects (namely, their contributions to the composite backscattered signal) in order to extract their µD (temporal) traces. Our technique operates in either the RD or RDA spaces, integrating tracking and identification through the following steps: 1) detection: random noise is removed and a density-based clustering algorithm (on either RD or RDA maps) is applied for target detection, 2) tracking: a dedicated Kalman filtering (KF) algorithm is utilized to track the detected target points in the RD (RDA) space, and 3) identification: a deep convolutional neural network (DCNN) is exploited to carry out the final identification. We stress that the joint estimation of user movement (the tracking step 2) and computation of identification features (step 3) is key to correctly disentangle the RD/RDA signals from multiple subjects.
As we experimentally verify in Section V, tracking errors and consequent wrong identifications critically depend on this joint processing.

When processing radar data for identification purposes, the analytical models of the propagation and backscattering phenomena often fail to handle the high randomness of mm-wave reflections and hardware non-idealities. To cope with this, we exploit a deep learning architecture (i.e., the DCNN), as it enables a data-driven system training. This technique has become dominant for this type of processing task [1], [5]. Differently from previous research efforts, the proposed framework is evaluated by measuring its online accuracy in the simultaneous identification of multiple targets, taking into account the additional disturbances, blockages and spurious reflections that are due to the presence of other people, and using experiments designed to reproduce a worst-case scenario for target tracking. To this end, we have emulated a real-life setting, letting subjects walk freely within the scene, at a distance that ranges from to meters. In addition, we test the generality of the proposed approach in a room for which no training data was recorded, i.e., this environment is unseen from the perspective of model learning.

The main contributions of the paper are summarized next.

1) We propose a system for the simultaneous indoor identification of multiple targets from µD signatures of gait using only RD information, reaching an average online accuracy of % when three subjects walk concurrently within the same physical environment.
The approach that we devise for this scenario (RD signal space) works up to long distances ( m) in indoor environments. To the best of our knowledge, no other study in the literature proposes a working system for the considered multi-target online identification task.

2) We introduce a novel DCNN for µD processing and quantify its performance improvement with respect to other models presented in the literature by evaluating it on a publicly available dataset (IDRad [1]), obtaining an accuracy of %.

3) We design a new approach for tracking that is robust to trajectory tracking errors thanks to the feedback on the subject identity provided by the DCNN classifier. Our design entails the integration of tracking and identification blocks, which leads to a significant improvement in terms of online identification accuracy.

4) We show how the proposed processing pipeline can also be applied to RDA data, solving some limitations of the RD signals. This allows one to achieve higher target detection capabilities at the cost of a higher computational complexity and of a reduced detection range. With RDA information, we reach an online accuracy of up to % for four subjects. The RDA-based system is also evaluated in an unseen environment, with furniture and static objects, achieving an accuracy of % on two subjects. To the best of our knowledge, it is the first time that this evaluation is conducted for the mm-wave radar multi-target tracking and identification problem.

The rest of the paper is organized as follows. In Section II, the existing literature is reviewed, underlining the novel aspects underpinning our approach. In Section III, the FMCW radar signal model and the computation of RD, RDA maps and µD signatures are detailed. The new framework is thoroughly presented in Section IV. In Section V, experimental results are presented, while concluding remarks are given in Section VI.

II. RELATED WORK
Human identification from radar sensors is a research theme that is rapidly gaining momentum. Some papers target the classification of the subject identity from the µD signature of gait using radio signals [1], [3], [6]–[10]. Other studies focus on human activity recognition from the backscattered radio signal for security or smart-home applications [5], [11], [12]. Respiration rate and heartbeat can also be tracked, as they cause a detectable movement of the subject's chest [13], [14]. As the focus of this paper is on gait recognition and person identification, in the following we briefly review the most important contributions on this topic.

In [6], the authors employ for the first time a classifier based on the deep CNN AlexNet [15] to identify a person from her/his µD signature of gait, reaching an accuracy of about % with four subjects. Differently from our setup, their experiments take place in an outdoor environment, where correlated noisy reflections from static objects are typically weak: walls in indoor environments are significantly close to the target of interest in most scenarios, and they cause the noise level to increase, making the extraction of the useful signal features much harder.

Chen et al. [9] utilize a multi-static radar with three nodes and a pre-trained deep CNN for image recognition, in order to detect whether a person carries a weapon or to identify a person between two subjects. The authors of [7] address identification using the µD signature of six different movements including walking and running. Running turned out to be the most discriminative action, providing an identification accuracy of % with subjects. In [8], instead, a treadmill placed at different distances from the radar device is used, and a ResNet50 [16] neural network is trained to classify subjects.

The above studies focused on simplified experimental scenarios, where the person was required to walk on a straight line, in a radial direction from the radar device.
This approach can be useful to simplify the classification task, by making gait features more evident, but it is not realistic and lacks the generality that would be required by a practical application. In our current work, we focus on a more realistic setup, letting the subjects walk in an unconstrained, free manner within the monitored physical space.

Vandersmissen et al. [1] train a CNN classifier on a dataset featuring five subjects who randomly walk in two different rooms, in an attempt to implement a more robust learning phase. However, each subject needs to be alone in the room in order for the system to work, as no method to separate the different target contributions in the backscattered signal is provided. This heavily limits the applicability of the proposed algorithm to real situations, where multiple targets are likely to share and concurrently move in the physical space. The same authors also propose two improvements over their algorithm, to improve its accuracy, but the single-target limitation is still present [3], [10].

A first attempt at performing multi-subject identification can be found in [4], where 3-dimensional radar point clouds obtained by RDA processing are used in place of µD signatures, in combination with a recurrent neural network with long short-term memory (LSTM) cells for a-posteriori identification. The overall accuracy obtained for subjects is around %, and evidence that the system is able to distinguish between two subjects is provided. However, no evaluation of the accuracy is conducted when more than two subjects share the same environment, and no evaluation is conducted to assess the generality of the classifier by testing it in a different indoor environment (e.g., a new room) after the training. The sparsity of radar point-cloud data can become a source of inaccuracy when a high number of subjects has to be tracked, due to failures in the clustering procedure.
To date, no method exists to deal with the superposition of the signal clusters caused by the proximity of the subjects, thus limiting the working range of identification systems to a radius of − meters.

In this work, we improve over previous studies by first showing the feasibility of identifying multiple persons only using RD information, with a lightweight processing workflow and limited hardware requirements. Further, we extend the proposed system to deal with RDA data, in case a higher detection performance is required, e.g., to handle more targets, or in case precise tracking of the subjects in the x−y space is sought. We also show how the complex task of reliably separating the different users' reflections (especially in RD images) can be successfully tackled by feeding back the identification output into the users' trajectory tracking module, combining these two processing stages. Improvements and drawbacks of our approach are duly quantified and discussed.

III. MM-WAVE RADAR SIGNAL MODEL
An FMCW radar allows the joint estimation of the distance and the radial velocity of the target with respect to the radar device. This is achieved by transmitting sequences of chirps, i.e., sinusoidal waves with frequency that varies in time, and measuring the frequency shift of the backscattered signal at the receiver.

In this paper, we use a linear FMCW (LFMCW) radar for which the frequency of the transmitted chirp signal (TX) is linearly increased from a base value $f_o$ to a maximum $f_1$ in $T$ seconds. Defining the bandwidth of the chirp as $B = f_1 - f_o$, bandwidth $B$ and transmission duration $T$ are related through $\zeta = B/T$, and the transmitted signal can be expressed as

$$ s(t) = \exp\left\{ j 2\pi \left( f_o + \tfrac{1}{2}\zeta t \right) t \right\}, \quad 0 \le t \le T. \quad (1) $$

The chirps are transmitted every $T_{rep}$ seconds in sequences of $P$ chirps each, so that the total duration of a transmitted sequence is $P T_{rep}$. At the receiver, a mixer combines the received signal (RX) with the transmitted one, generating the intermediate frequency (IF) signal, i.e., a sinusoid whose instantaneous frequency is the difference between the frequencies of the TX and RX signals. Each chirp is sampled with sampling period $T_s$ (referred to as fast time sampling), obtaining $N$ points, while $P$ samples, one per chirp from adjacent chirps, are taken with period $T_{rep}$ (slow time sampling).

The use of multiple-input multiple-output (MIMO) radar devices allows the additional estimation of the AoA of the reflections, by computing the phase shifts between the receiver antenna elements due to their different positions (i.e., their different distances from the target). This is referred to as spatial sampling, and it enables the localization of the targets in the physical space using polar and cartesian coordinates. In the present work we employ a linear receiver antenna array, i.e., the RX antennas are aligned along a single dimension and spaced apart by a distance $\delta$.

A. Range, Doppler and azimuth information
The transmitted signal hits the target at some spatial point, generating a backscattered signal that can be detected at the receiver. This reflected signal is equal to the transmitted waveform with a delay $\tau$ that depends on the distance between the target and the radar, their relative radial velocity, and on the additional distance due to the different positions of the receiving antenna elements. Considering the most general case where $Q$ targets are present in the radar illumination range and $L$ antennas are available at the linear receiver array, and indicating with $c$ the speed of light, letting $R_q$, $v_q$ and $\theta_q$ respectively be the range, velocity and azimuth angle with respect to the device of target $q$, the delay measured at antenna element $\ell$ for the signal reflection coming from target $q$ can be computed as

$$ \tau_{\ell q} = \frac{2(R_q + v_q t) + \ell \delta \sin\theta_q}{c}. \quad (2) $$

After mixing and sampling, the IF signal is expressed as [17]

$$ y(n, p, \ell) = \sum_{q=0}^{Q-1} \alpha_q \exp\{ j 2\pi \phi_q(n, p, \ell) \} + w(n, p, \ell), \quad (3) $$

where $\alpha_q$ is a coefficient that accounts for the attenuation effects due to the antenna gains, path loss and radar cross section (RCS) of the target, and $w$ is a Gaussian noise term. The phase $\phi_q(n, p, \ell)$ depends on the target, the fast time, slow time and spatial sampling indices. By neglecting the terms giving a small contribution, its approximate expression can be written by introducing the quantities $f_{d_q} = 2 f_o v_q / c$ and $f_{b_q} = 2 \zeta R_q / c$, which respectively represent the Doppler frequency and the beat frequency of the signal reflected from target $q$:

$$ \phi_q(n, p, \ell) \approx \frac{2 f_o R_q}{c} + f_{d_q}\, p\, T_{rep} + \frac{f_o \ell \delta \sin\theta_q}{c} + \left( f_{d_q} + f_{b_q} \right) n T_s. \quad (4) $$

Samples of $y$ can be arranged into a 3-dimensional tensor called radar data cube, which contains all the information provided by the radar device for a given time frame.
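As a numeric illustration of the beat-frequency relation $f_{b_q} = 2\zeta R_q/c$ introduced above, the sketch below simulates an idealized, noiseless single-target IF signal and recovers the range from the fast-time DFT peak. All radar parameters here are hypothetical placeholders, not the settings of the radar used in the paper.

```python
import numpy as np

# Hypothetical LFMCW parameters (illustrative only)
c = 3e8            # speed of light [m/s]
B = 1e9            # chirp bandwidth [Hz]
T = 50e-6          # chirp duration [s]
zeta = B / T       # chirp slope [Hz/s]
N = 256            # fast-time samples per chirp
Ts = T / N         # fast-time sampling period [s]

R_true = 4.0                      # target range [m]
f_b = 2 * zeta * R_true / c       # beat frequency of the IF signal [Hz]

t = np.arange(N) * Ts
if_signal = np.exp(1j * 2 * np.pi * f_b * t)   # idealized IF sinusoid

# Range estimation: zero-padded FFT along fast time, then R = f_b c / (2 zeta)
spectrum = np.abs(np.fft.fft(if_signal, n=4096))[:2048]
f_hat = np.argmax(spectrum) / (4096 * Ts)      # peak bin -> beat frequency
R_hat = f_hat * c / (2 * zeta)                 # estimated range [m]
```

The zero-padding refines the frequency-bin spacing; the residual error of `R_hat` with respect to `R_true` is bounded by the range resolution $c/(2B)$.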
Fig. 1: Visual representation of the RD map (a), RDA map (b) and µD signature (c) after a thresholding operation is applied.

The frequency shifts of interest, which reveal the target range, velocity and angular position, can be extracted after applying a discrete Fourier transform (DFT) along the fast time, slow time and spatial dimensions (beamforming). In the resulting signal, the position of the peak along the fast time dimension reveals the frequency of the IF signal, $f_{d_q} + f_{b_q} \approx f_{b_q}$, while the peak along slow time gives the Doppler frequency $f_{d_q}$. From the peak of the DFT along the spatial dimension we get the phase shift due to the angular displacement of the target, $\varphi_{a_q}$. The desired quantities are then estimated as follows (we indicate with the symbol $\Delta$ the corresponding resolution):

$$ \hat{R}_q = \frac{f_{b_q}\, c}{2\zeta}, \quad \Delta\hat{R}_q = \frac{c}{2B}, \quad (5) $$

$$ \hat{v}_q = \frac{f_{d_q}\, c}{2 f_o}, \quad \Delta\hat{v}_q = \frac{c}{2 f_o P T_{rep}}, \quad (6) $$

$$ \hat{\theta}_q = \sin^{-1}\left( \frac{\varphi_{a_q}\, c}{2\pi \delta f_o} \right), \quad \Delta\hat{\theta}_q = \frac{c}{f_o \delta L \cos(\hat{\theta}_q)}. \quad (7) $$

In the following, the radar cube after applying the DFT in the three dimensions will be referred to as the range-Doppler-azimuth (RDA) map. An example of the RDA map for four subjects is shown in Fig. 1b.

In the case of a single receiving antenna, spatial sampling is not possible, and we can only estimate the range and velocity of the targets with the same approach used above, with the difference that Eq. (2) and Eq. (4) do not depend on the antenna element $\ell$. The result of the 2-dimensional DFT is called the range-Doppler (RD) map, see the example in Fig. 1a.

B. µ-Doppler map

Human targets present different moving parts, therefore their overall motion is more complex than just translation. The small-scale vibrations or rotations of their body parts introduce a Doppler shift that is time dependent and that can be represented as a frequency modulation on the reflected signal, which carries unique features depending on the specific target considered.
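The time-varying Doppler modulation described above can be visualized by stacking Doppler profiles over time. Below is a minimal sketch of assembling a µD spectrogram by integrating a sequence of RD maps along range; the map sizes and the synthetic, gait-like target are illustrative assumptions, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_range, n_doppler = 100, 64, 32

# Toy sequence of RD maps: a weak noise floor plus a single target whose
# Doppler bin oscillates over time, mimicking the periodic gait modulation.
rd_maps = rng.random((n_frames, n_range, n_doppler)) * 1e-3
for t in range(n_frames):
    doppler_bin = int(16 + 8 * np.sin(2 * np.pi * t / 25))  # gait-like swing
    rd_maps[t, 30, doppler_bin] = 1.0                        # target echo

# mu-D spectrogram: integrate out the range axis -> time on one axis,
# Doppler on the other
mu_d = rd_maps.sum(axis=1)   # shape (n_frames, n_doppler)
```

Plotting `mu_d` as an image (time vs. Doppler) would show the oscillating Doppler track that identification networks consume as input.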
A model for this phenomenon is presented in [18], [19], where it is shown that the sensitivity to µD effects is higher when using small wavelengths: mm-wave radios are therefore more suited for applications where fine grained information is needed.

The extraction of the µD signature from the received signal can be performed by computing a short-time Fourier transform (STFT) on the slow-time sampled waveform to estimate the power spectral density (PSD) along the Doppler dimension, as done in [8]. An alternative is to compute the RD (or RDA) map first, and subsequently integrate along the range (or range and angular) dimensions [1], as shown in Fig. 1c. This second option is computationally more expensive, but it is preferred here because the RD (respectively RDA) map can be used to locate the targets and separate their contributions in a 2D (resp. 3D) space, while this separation would be very hard from the µD spectrogram, as it lacks the range (resp. range and angle) information.

IV. PROPOSED ALGORITHM
In this section, we offer a general overview of the proposed algorithm. The blocks that are presented here are used for both RD and RDA processing, with minor differences in the implementation details of each algorithm, due to the different properties of the two maps.
A. Overview of the signal processing pipeline
The extraction of the gait features from the µD spectrogram can be very difficult, and the results are heavily affected by environment and hardware non-idealities. In addition, in the case of multiple targets, the µD is a composite temporal signal resulting from the superimposed contributions of all moving entities. The separation of such contributions is very hard, whereas it is easier in the RD or RDA spaces, as the reflections from different users are further spaced out by the distance of the users from the radar (RD) or by their distance and angle of arrival (RDA), resulting in point clouds as shown in Fig. 1. For this reason, our dynamic processing framework works on either the RD or RDA spaces, through the following steps (see Fig. 2).

1)
Detection. At first, a pre-processing step is applied to the raw data at the output of the radar mixer, to remove
Fig. 2: Signal processing workflow.

static reflections and noise (see Section IV-C). Hence, a clustering scheme from the family of "density-based spatial clustering for applications with noise" (DBSCAN) algorithms is executed to separate the RD/RDA contributions of distinct subjects from the composite signal (see Section IV-D).

2)
Tracking. A Kalman filter operating on subsequent RD or RDA frames is applied to obtain a reliable estimation of the true subject's state (i.e., its location, see Section IV-E). The association of the RD/RDA clusters detected in the current time-frame with the right user trajectories is performed using the Hungarian algorithm (see Section IV-F).

3)
Identification. Feature extraction and user identification are performed with a DCNN model based on inception blocks (IBs) that takes as input portions of the µD spectrogram of each subject (obtained from the RD/RDA data of the subject, after the use of DBSCAN and trajectory tracking). In case tracking fails and the RD/RDA clusters of some subjects cannot be separated, the DCNN output is used to re-establish the correct labeling of the targets, by feeding back the identity information to the trajectory tracking block (see Section IV-J for details).

Multi-person identification from backscattered mm-wave signals presents several challenges. First, an effective and reliable separation of the different targets is difficult to achieve due to the high level of randomness in mm-wave indoor propagation environments. Second, a robust classification based on µD signatures requires high generalization capabilities from the DCNN identity classifier. Indeed, we seek to differentiate subjects from their way of moving rather than from properties that may be less person-specific, such as their average walking speed. A distinctive and key feature of the proposed approach is the dynamic integration of trajectory tracking and identification, which allows correcting trajectory tracking errors based on the output of the identification block. As a result, our system is suited to online processing, and is robust to the superposition of user clusters in the RD/RDA spaces, to variable walking speeds, to fake targets due to reflecting objects/surfaces, to classification instability and to targets appearing on (disappearing from) the scene.
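The cluster-to-trajectory association of the tracking step can be sketched as a minimum-cost assignment between predicted track positions and detected centroids. The toy values below are illustrative; for this tiny problem size, a brute-force search over permutations returns the same optimal matching that the Hungarian algorithm used in the paper computes in polynomial time.

```python
import numpy as np
from itertools import permutations

# Predicted track states and detected cluster centroids, as [range, velocity]
# rows (made-up values for illustration).
tracks = np.array([[4.0, 1.0], [7.5, -0.5]])
detections = np.array([[7.4, -0.4], [4.1, 0.9]])

# Cost matrix: Euclidean distance between each track and each detection
cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=2)

# Brute-force minimum-cost assignment (Hungarian-equivalent for small sizes):
# best[i] is the detection index assigned to track i
best = min(permutations(range(len(detections))),
           key=lambda p: sum(cost[i, p[i]] for i in range(len(tracks))))
```

Here track 0 is matched to detection 1 and track 1 to detection 0, since the naive identity pairing would cross the two trajectories.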
B. Notation
The system operates at discrete time increments, $t = 1, 2, \ldots, T$, where time steps have a fixed duration of $\Delta t$ seconds, corresponding to the radar frame period. In the remainder, the sequential evolution of the algorithms is interchangeably expressed in terms of time steps and radar frames. The RD/RDA clusters detected in the current time step $t$ are marked with indices $d = 0, 1, \ldots, D_t - 1$ and are $D_t$ in total. Similarly, the $K_t$ trajectories that are currently maintained by the trajectory tracking block are indexed using variable $k = 0, 1, \ldots, K_t - 1$. With $U$, we indicate the number of classes (identities) on which the system is trained, i.e., the identities that will be recognized as known, represented through index $u = 0, 1, \ldots, U - 1$. Boldface, capital letters refer to matrices, e.g., $X$, with elements $X_{ij}$, whereas boldface lowercase letters refer to vectors, e.g., $x$. Symbol $\otimes$ denotes the Kronecker product between matrices, $X^{-1}$ denotes the inverse of matrix $X$, and $x^T$ denotes the transpose of vector $x$. $\mathcal{N}(\mu, \sigma^2)$ indicates a Gaussian random variable with mean $\mu$ and variance $\sigma^2$.

C. Pre-processing
The pre-processing involves two different phases, namely removal of static reflections and denoising.
1) Removal of Static Reflections:
This is the first block in the processing pipeline: it receives as input the raw radar data, i.e., the radar cube containing the 3-dimensional signal (see Eq. (3)) that the radar outputs at every time step. As discussed in Section III-A, a DFT is applied to this signal to obtain the RD or RDA map. In the RD case, only one channel is collected (one receiving antenna), and the DFT is applied along the range dimension first and then along the Doppler dimension, resulting in a matrix containing range and Doppler information on the targets. In the RDA processing case, an additional DFT along the angular dimension is computed. Before the DFT, a Hanning window is applied along each dimension.

The RD and RDA maps are further processed to remove reflections from static targets. As fixed objects are mapped onto a vertical line in correspondence of the 0 m/s velocity value, we remove their contributions by cutting the Doppler channels related to negligible velocities from both the RD and RDA maps. This processing step is of key importance, as the static clutter would dominate the RDA maps if not removed, causing the reflections from the subjects to be merged with static contributions, with consequent severe difficulties in the tracking process. As an additional benefit, the algorithm becomes less dependent on the environment characteristics.
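A minimal sketch of this static-reflection removal: the Doppler bins around the 0 m/s column of an RD map are zeroed out, discarding the static clutter. The map size and the guard width are illustrative choices, not the paper's parameters.

```python
import numpy as np

# Toy RD map: rows are range bins, columns are Doppler bins
n_range, n_doppler = 64, 32
rd = np.abs(np.random.default_rng(1).normal(size=(n_range, n_doppler)))

zero_bin = n_doppler // 2   # Doppler bin corresponding to 0 m/s
guard = 1                   # also drop the immediately adjacent (slow) bins

# Cut the near-zero-velocity Doppler channels: static clutter is removed
rd[:, zero_bin - guard: zero_bin + guard + 1] = 0.0
```

After this step, only bins associated with non-negligible radial velocities survive, so walls and furniture no longer dominate the map.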
2) Denoising:
Denoising is applied in two phases. In the first phase, a received power threshold is applied along the range dimension, keeping only the signal values that lie above it. The threshold is decreased linearly in the logarithmic domain as the range increases, going from − dBm at minimum range to − dBm at maximum range. This is motivated by the fact that targets further away from the radar device would be penalized by using a fixed threshold, due to the smaller power they receive. In case of RDA processing, a further thresholding is applied along the angular dimension, discarding the angular bins where the received power level is weaker than dB with respect to the peak value. This is implemented to mitigate the effects of the side lobes generated by the beamforming procedure. The resulting data points represent the locations in the 2-dimensional (RD) or 3-dimensional (RDA) maps where a sufficiently high reflected power is received. These points represent candidate reflections from the targets.

D. Target clustering in RD/RDA spaces – DBSCAN
Density-based clustering, as opposed to distance-based clustering, groups input samples depending on their density. One of the most widely used algorithms belonging to this category is DBSCAN [20], which has been previously applied to clusterize radar point clouds in [4], [21]. The algorithm operates a sequential scanning of all the data points, expanding a cluster until a certain density condition is no longer satisfied. It requires one to specify two input parameters, $\epsilon$ and $m_{pts}$, respectively representing a radius around each point and the minimum number of other points inside of it needed to satisfy the density condition. In this work, we use $\epsilon = 0.$ and $m_{pts} = 40$.

Each point of the radar map, after denoising, is mapped onto a vector of coordinates $p_i = [r_i, v_i]^T$ (range and velocity) for RD processing and $p_i = [r_i, v_i, \theta_i]^T$ (range, velocity and angle) for RDA processing, with an associated received power $P_{RX}(p_i)$. To simplify the selection of the distance threshold parameter, $\epsilon$, the range, angle and velocity coordinates of the points $p_i$ are normalized in the interval $[0, 1]$ before the actual clustering step. DBSCAN is applied on the normalized set of points: some, having low density, are classified as noise and discarded, while a partition of the remaining ones is outputted at each time step $t$. We denote by $C_0, C_1, \ldots, C_{D_t - 1}$ the resulting clusters, one for each of the $D_t$ detections. After the clustering operation, the point clouds are re-mapped onto the original range of values. For each cluster, we select its centroid as a noisy observation of the true coordinates (range and velocity for RD; range, velocity and angle for RDA) of the person. Centroids $z_d$, $d = 0, 1, \ldots, D_t - 1$, are computed by weighting each cluster point by the corresponding normalized reflected power value, namely,

$$ z_d = \frac{\sum_{p_i \in C_d} p_i\, P_{RX}(p_i)}{\sum_{p_j \in C_d} P_{RX}(p_j)}. \quad (8) $$

In this way, the centroid tends towards those points with a higher power, assigning them more importance in representing the actual target position. Note that DBSCAN performs the detection of the clusters by solely operating on the present time step, i.e., points in previous time steps are not considered. While this is desirable, as it leads to a low complexity clustering algorithm, it presents some drawbacks. In fact, not all the clusters that are detected in any specific time step may represent actual subjects: noisy reflections and ghost targets often appear (at random) in the RD/RDA space. When their power is comparable with that of the actual target reflections, DBSCAN may enroll them among the detected clusters. To compensate for this, we use a further tracking procedure, described in the following Section IV-E, that analyzes the movement of the clusters in the RD/RDA space across subsequent frames. This allows detecting and removing spurious clusters, as these are likely to appear (and disappear soon after) at random times, whereas the clusters generated by actual targets tend to be persistent across frames.

E. Trajectory tracking – Kalman filter
Trajectory tracking is carried out by applying a discrete Kalman filter (KF) to the past positions of the targets, which are represented by the cluster centroids z_0, ..., z_{D_t - 1}. Note that the number of maintained trajectories at the beginning of time step t, K_{t-1}, may differ from the number of clusters D_t detected by DBSCAN, due to errors in the clustering procedure or to subjects entering or leaving the monitored environment. These cases need to be carefully handled through dedicated strategies, which are detailed in Section IV-F. Next, the KF tracking procedure is presented for a single trajectory, but the same procedure is applied in parallel to each trajectory. Also, for improved clarity, the RD and RDA processing cases are treated separately.
1) RD system model:
The KF model relates the actual distance (from the radar device) and velocity of the target, x_t = [r_t, v_t]^T, i.e., the hidden system state, to the centroid values z_t, i.e., the (noisy) observations. The model of motion is defined by two matrices, F and H. F is the transition matrix, relating the system state in the current time step, x_t, to x_{t-1}, while H is the observation matrix, which relates z_t to x_t. Referring to u_t and r_t as the process noise and observation noise, respectively, a dynamic model of the system is obtained as follows:

x_t = F x_{t-1} + u_t,  (9)
z_t = H x_t + r_t.  (10)

Assuming a constant velocity model, the transition and observation matrices are

F = [1, Δt; 0, 1],  (11)
H = I,  (12)

where I is the 2 × 2 identity matrix. We assume the process noise u_t to be caused by a random acceleration a_t that follows a Gaussian distribution with zero mean and variance σ_a², i.e., a_t ∼ N(0, σ_a²), leading to

u_t = g a_t,  (13)
g = [Δt²/2, Δt]^T.  (14)
The process noise covariance matrix is computed as

Q = E[u_t u_t^T] = σ_a² g g^T,  (15)

while the observation noise covariance matrix is

R = E[r_t r_t^T] = diag(σ_r², σ_v²).  (16)

Suitable values for σ_a, σ_r and σ_v are difficult to compute analytically; in this work, we determined them empirically from experimental observations. A new KF model is initialized for each cluster detected in the first frame received by the radar. In successive frames, the trajectories are maintained through the KF predict-update steps, computing the estimates of the state x̂_t and of the state covariance matrix P̂_t, from which the estimated posterior distribution of the state is derived as p̂(x_t | z_0, ..., z_t) = N(x̂_t, P̂_t) [22].
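The RD predict-update recursion of Eqs. (9)-(16) can be sketched in a few lines of numpy. The frame period and the noise standard deviations below are illustrative placeholders, not the paper's calibrated values.

```python
# Constant-velocity Kalman filter for the RD case (Eqs. (9)-(16)).
# dt, sigma_a, sigma_r, sigma_v are illustrative placeholders.
import numpy as np

dt = 1.0 / 15.0                               # assumed frame period
F = np.array([[1.0, dt], [0.0, 1.0]])         # Eq. (11)
H = np.eye(2)                                 # Eq. (12)
g = np.array([[0.5 * dt**2], [dt]])           # Eq. (14)
sigma_a2 = 0.1**2                             # placeholder sigma_a^2
Q = sigma_a2 * (g @ g.T)                      # Eq. (15)
R = np.diag([0.1**2, 0.1**2])                 # Eq. (16), placeholder sigma_r, sigma_v

def kf_step(x, P, z):
    # Predict.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the observation z (cluster centroid).
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new
```

Each trajectory runs its own instance of this recursion, as stated above.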
2) RDA system model:
In the RDA case, tracking is only performed using the observations of range and azimuth, as the introduction of radial velocity in the model would make the system too non-linear to obtain reliable estimates with a KF. In detail, the observation vector z_t contains the range and the angular position of the target, z_t = [r_t, θ_t]^T. The system state is defined as x_t = [x, v_x, y, v_y]^T, where x and y are the target Cartesian coordinates, and v_x and v_y the velocities along the two axes. The resulting non-linear model is

x_t = F x_{t-1} + u_t,  (17)
z_t = h(x_t) + r_t,  (18)

with h(x_t) = [√(x² + y²), arctan(y/x)]^T. To handle the non-linearity in Eq. (18), upon receiving a new observation z_t, we compute a transformed observation vector z'_t = [r_t cos θ_t, r_t sin θ_t]^T. Using z'_t, the model becomes linear as in Eq. (9) and Eq. (10), with matrices

F = I ⊗ [1, Δt; 0, 1],  (19)
H = [1, 0, 0, 0; 0, 0, 1, 0].  (20)

The covariance matrices of the process and observation noises are

Q = I ⊗ σ_a² g g^T,  (21)
R = diag(σ_x², σ_y²),  (22)

where I is the 2 × 2 identity matrix. Again, a direct computation of the noise variances is difficult to obtain, so we used the empirical values for human subjects proposed in [21], with σ_a = 8 m/s² and equal values for σ_x and σ_y. The equations of the prediction and update steps are the same as in the linear KF of the RD processing case, thanks to the use of the transformation from polar coordinates. The constant velocity model provided good approximations of the movement of a walking human target: with walking speeds on the order of one meter per second and the considered frame rate, the KF was able to track the targets reliably.

F. Matching trajectories to clusters – Hungarian algorithm
To match trajectories to clusters, we use an approach based on the nearest neighbor standard filter (NNSF). At each frame, we must associate the D_t new cluster detections with the K_{t-1} previous trajectories, which is a combinatorial problem. The procedure consists of two steps: first, we compute a K_{t-1} × D_t cost matrix J that relates trajectories at time step (t - 1) with cluster detections at time step t. Each element of J, J_ij, encodes the cost of associating trajectory i with cluster detection j. Given the slightly different properties of RD and RDA data, we found that the best choice for the cost function differs in the two cases, as described below.
1) RD cost matrix: In the RD case, from each target state x_i we define a box B_i that contains the subject's reflections, centered on the state and with fixed dimensions h_B and w_B. We assume that, given the high frame rate with respect to the velocity of the subjects, over two subsequent frames the box with reflections from a given target should significantly overlap with her/his box at the previous time step. Let B_i and B_j respectively denote the box of the cluster that was associated with trajectory i at the previous time step (t - 1) and the one associated with a new target detection j at the current time step t, centered on z_j. The cost of the association between trajectory i and the newly detected cluster j is computed via the negative intersection over union (IOU) score, defined as

J_ij^RD = -IOU(B_i, B_j) = -Area(B_i ∩ B_j) / Area(B_i ∪ B_j).  (23)

The idea underpinning this is that the more the two boxes overlap, the more likely they are to represent two clusters containing the reflected signal components from the same target user as she/he moves from (t - 1) to t.
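The negative-IOU cost of Eq. (23) for two fixed-size boxes can be sketched as below; the box dimensions are illustrative placeholders, and the helper name is ours.

```python
# Negative-IOU association cost (Eq. (23)) for axis-aligned boxes centered on
# the RD states. w_B and h_B below are illustrative, not the paper's values.

def neg_iou_cost(c_i, c_j, w_B=0.5, h_B=1.0):
    """c_i, c_j: box centers [range, velocity]; both boxes have size w_B x h_B."""
    def box(c):
        return (c[0] - w_B / 2, c[0] + w_B / 2, c[1] - h_B / 2, c[1] + h_B / 2)
    ax0, ax1, ay0, ay1 = box(c_i)
    bx0, bx1, by0, by1 = box(c_j)
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))   # intersection width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))   # intersection height
    inter = iw * ih
    union = 2 * w_B * h_B - inter                  # |A| + |B| - |A ∩ B|
    return -inter / union                          # J_ij = -IOU
```

Shrinking w_B relative to h_B penalizes overlap along the range axis more than along the sparse velocity axis, which is the effect discussed below for the RD space.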
2) RDA cost matrix: In the RDA case, the cost matrix elements are defined as the squared Mahalanobis distance between the predicted observation from the trajectory state and the real observation (detection):

J_ij^RDA = (z_t^j - H x_t^i)^T S_t^{-1} (z_t^j - H x_t^i),  (24)

where z_t^j - H x_t^i is the innovation process and S_t its covariance matrix, computed as H P_t^i H^T + R; both are obtained as part of the KF update step. The choice of two different score functions for RD and RDA processing is motivated by the different properties of the radar maps in the two cases. In the RDA space, trajectory tracking uses range and angle information, which leads to compact clusters around the centroids. Conversely, the velocity information used in the RD space leads to sparse clusters along this dimension, and the IOU score allows one to control the box shape, i.e., its form factor, through h_B and w_B, so as to weigh a superposition along the velocity axis less than one along the range axis. This significantly mitigates the tracking errors due to cluster sparsity in the RD space. Given the cost matrix, the Hungarian algorithm [23] is used to efficiently obtain the associations yielding the lowest total cost, with complexity polynomial in K_{t-1} and D_t. The algorithm takes the cost matrix as input and pairs each trajectory with only one cluster detection.
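The Mahalanobis cost of Eq. (24) combined with the Hungarian matching step can be sketched as follows, using scipy's linear_sum_assignment as the Hungarian solver; the function names are ours.

```python
# Mahalanobis cost (Eq. (24)) plus minimum-cost trajectory/detection matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mahalanobis_cost(z_pred, S, z_obs):
    """Squared Mahalanobis distance between predicted and real observations."""
    innov = z_obs - z_pred                       # innovation
    return float(innov @ np.linalg.inv(S) @ innov)

def match(traj_preds, S_list, detections):
    """Rows: K_{t-1} trajectory predictions, columns: D_t detections."""
    J = np.array([[mahalanobis_cost(zp, S, zd) for zd in detections]
                  for zp, S in zip(traj_preds, S_list)])
    rows, cols = linear_sum_assignment(J)        # Hungarian: min total cost
    return list(zip(rows, cols))
```

The same matching call works for the RD case by filling J with the negative IOU scores instead.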
G. Trajectory management
During trajectory tracking we must deal with (i) undetected trajectories and new cluster detections (i.e., the case of a non-square matrix J), (ii) trajectory instability due to missed detections, and (iii) the presence of ghost targets generated by reflections from metal objects. To deal with these problems, we conceived the following trajectory management measures.
1) Unmatched trajectories (RD and RDA):
All past trajectories that are not associated with any current cluster detection are marked as undetected and are maintained for max_age = 10 frames before being deleted. During these frames, their state is updated using Eq. (9). Cluster detections that are not associated with any existing trajectory are called unmatched, and are initialized as new trajectories if they are detected for min_det = 15 consecutive frames. This mechanism makes the system robust to subjects that randomly appear in and disappear from the environment, whose tracks are created or deleted as needed with an affordable delay. We stress that a subject's reflection could go undetected, and its trajectory be subsequently deleted and reinitialized, for any reason, e.g., due to blockage at the beginning of the measurement sequence (or, more generally, at any point of the monitored sequence), or due to targets moving in and out of the scene. These temporary effects do not affect the correct tracking of the proposed system, whose trajectories are continuously refined and reinitialized as soon as a reliable measurement is obtained.
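The max_age / min_det bookkeeping above can be sketched with simple counters; the Track class and its method names are hypothetical, and only the two constants come from the text.

```python
# Sketch of the trajectory lifecycle rules described above.
MAX_AGE = 10     # frames an unmatched trajectory survives (max_age)
MIN_DET = 15     # consecutive detections before a new trajectory is confirmed (min_det)

class Track:
    def __init__(self):
        self.misses = 0          # consecutive frames without a matched cluster
        self.hits = 0            # consecutive detections (for confirmation)
        self.confirmed = False

    def on_match(self):
        self.misses = 0
        self.hits += 1
        if self.hits >= MIN_DET:
            self.confirmed = True

    def on_miss(self):
        self.hits = 0
        self.misses += 1

    def expired(self):
        return self.misses > MAX_AGE
```

Unconfirmed tracks that stop being detected simply never reach MIN_DET, which is how short-lived spurious clusters are filtered out.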
2) Ameliorating trajectory instability (RDA):
Trajectory instability and merging trajectories due to missed detections are a problem in the RDA case, where clutter is more significant. For this reason, we introduced a gating region around each trajectory, i.e., a detection is never associated with a trajectory if the cost (squared Mahalanobis distance) of the association at time step t is higher than a threshold value denoted by γ. This operation discards all the possible associations between a trajectory and clusters that lie outside of an ellipsoidal region whose shape and size are determined by the innovation covariance S_t and the threshold γ, which is typically chosen according to a desired level of confidence from an inverse χ² distribution with 2 degrees of freedom [24]. In this work, we use a confidence level that leads to γ ≈ 4.
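The gating threshold can be obtained from the inverse χ² CDF with 2 degrees of freedom (the observations are 2-D), as sketched below; the confidence level passed to chi2.ppf is illustrative, not the paper's setting.

```python
# Gating-region sketch: gamma from the inverse chi-square CDF (2 dof).
from scipy.stats import chi2

def gate_threshold(confidence=0.99, dof=2):
    # ppf is the inverse CDF; for dof=2 it equals -2*ln(1 - confidence).
    return chi2.ppf(confidence, df=dof)

def inside_gate(sq_mahalanobis, gamma):
    # Associations with squared Mahalanobis cost above gamma are discarded.
    return sq_mahalanobis <= gamma
```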
3) Dealing with merging trajectories (RDA):
Merging trajectories are detected by checking the Euclidean distance between their estimated states. If the distance between two trajectories drops below a minimum distance d_min, the trajectory with the highest variance in the last state estimations is deleted, in order to favor stability.
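This merging rule can be sketched as a small helper; the d_min value and the function name are placeholders of ours.

```python
# Sketch of the merging-trajectory rule: if two estimated states come closer
# than d_min, drop the trajectory with the larger recent state variance.
import numpy as np

D_MIN = 0.5   # meters; illustrative placeholder

def resolve_merge(pos_a, pos_b, recent_var_a, recent_var_b, d_min=D_MIN):
    """Return index (0 or 1) of the trajectory to delete, or None if separate."""
    if np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b)) >= d_min:
        return None                       # trajectories are still separate
    return 0 if recent_var_a > recent_var_b else 1
```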
4) Removal of “ghost” targets (RDA):
As a last trajectory management measure, we eliminate all trajectories whose estimated state lies outside of the room boundaries. This has a significant positive effect in removing ghost targets due to multipath reflections on metal objects and wide flat surfaces. These unwanted reflections often closely resemble the direct ones from the real subjects, but appear at different angular positions and at a longer distance, due to the longer path followed by the signal. The method used in this work for the removal of ghost targets exploits some prior information about the dimensions of the room. In practice, a coarse knowledge is sufficient: the room dimensions along the x and y reference axes are enough. These data could be obtained from a floor plan, as we did, or via some pre-processing performed on the radar signal. The latter approach is left for future research.

H. Computation of µD time series
The µD signature of each target is computed by selecting the points belonging to the cluster that is currently associated with her/his trajectory. This allows obtaining a separate signature for each subject. Each signature is input into a DCNN-based classifier to perform identification, see Section IV-I. For the computation of the µD vector in a given time step, the received power over the range (RD) or range and angle (RDA) dimensions is accumulated, producing vectors with dimension equal to the number of considered Doppler bins, n_chn. These vectors are then stacked over time and passed to the DCNN classifier as a spectrogram image. This image is the input X for the following identification block, see Section IV-I.

I. Identification – DCNN
The proposed classifier architecture is a DCNN. This kind of neural network is suited for classification and feature extraction when the input data exhibit spatial structure, as in image processing applications. The main components of the DCNN are convolutional layers, where the input is convolved with a filter (or kernel) of learned weights in order to recognize certain patterns, organized into so-called feature maps, which become more and more complex and abstract with the depth of the layer. DCNNs have been broadly utilized in recent years for feature extraction from spectrogram data, e.g., in speech recognition and audio processing applications [25]. The proposed DCNN is based on inception and residual network structures, two architectures that are commonly used in state-of-the-art image classification tasks. IBs are a DCNN structure developed for complex feature extraction at different scales, using different kernel sizes at every layer of the DCNN, in a parallel fashion, and concatenating the resulting feature maps [26]. In our case, 1 × 1, 3 × 3 and 5 × 5 kernel filters are used at each layer, to extract small- and wide-scale characteristics of the µD signature. An efficient implementation of the single inception block is shown in Fig. 4: the top branch uses 1 × 1 convolutions, extracting small scale features; the two following branches from the top use 3 × 3 and 5 × 5 convolutions, which are preceded by 1 × 1 convolutions to reduce their complexity, i.e., the number of feature maps, and prevent the number of parameters from becoming too large. The bottom branch performs a 3 × 3 max pooling operation, still extracting small scale features, but from a downsampled representation of the input. Residual networks instead rely on skip connections between the input and the output of convolutional blocks [16], in order to make the network learn the residual representation of the data with respect to the input. This has been shown to allow deeper networks to be trained faster and with significant performance gains.
In our case, skip connections are placed
Fig. 3: Architecture of the proposed classifier.
Fig. 4: Structure of the Inception Block.

between the input and the output of each IB, summing the respective tensors. A 1 × 1 convolution is applied to each skip connection to adjust the number of feature maps, so that it matches that at the output. The input signal X is a sequence of W_c = 30 frames of µD vectors, corresponding to W_c T_seq = 2 seconds of measurement time for each subject. The number of Doppler bins that were selected is n_chn = 200 (see Section V for a detailed description of the evaluation setup), so the input image has dimension n_chn × W_c = 200 × 30. The input X is passed through the three blocks composing the DCNN, namely, an encoder, a decoder and a fully connected (FC) network. The encoder network, E, is composed of four stacked IBs; the blocks are separated by max pooling layers, which perform dimensionality reduction. The flattened output of the encoder, c, is a latent representation of the input with lower dimensionality, i.e., a code, and is fed to both the decoder and the FC network. In detail,

1) the decoder network D learns to reconstruct the input image. D is a CNN with four layers and 3 × 3 filters in each layer; an upsampling step is carried out before each convolution. The reconstructed copy of the input is called X̂. This branch of the classifier does not directly contribute to the classification result, but it is used during the training phase to guide the network towards extracting meaningful features, acting as a regularizer. To the best of our knowledge, the use of a decoder network for this class of problems is an original contribution of our design.
We found its use to be effective, leading to accuracy improvements of a few percentage points on the test set.

2) The FC network F outputs a U-dimensional vector containing the probabilities that the input belongs to each class, using a one-of-U encoding, i.e., ŷ = [ŷ_1, ..., ŷ_U]^T, with ŷ_u ∈ [0, 1] and Σ_u ŷ_u = 1. The network is composed of one hidden layer. ELU activation functions [27] connect the input to the hidden layer neurons, while a SoftMax layer is used to compute the U output probabilities.

The following equations formalize the input-output relations for the encoder, decoder and FC blocks:

c = E(X),  X̂ = D(c),  ŷ = F(c).  (25)

The loss function of the full architecture is a weighted sum of the loss function of the decoder, which measures the difference between the original input image X and the reconstructed one X̂, and the loss of the FC branch (classification). For the former, we choose the average per-pixel binary cross-entropy loss, while the categorical cross-entropy loss between the predicted and the true labels y is used for the latter. We call n_X = n_chn W_c the number of elements in the µD input image, U the number of classes (the known user identities) and α_rec a weighting factor. The p-th pixels of the input and reconstructed images, with values in [0, 1], are denoted respectively by X_p and X̂_p, and the total weighted loss function is

L(X̂, X, y) = -(1 - α_rec) Σ_{u=1}^{U} y_u log(ŷ_u)                               [classification branch term]
             - (α_rec / n_X) Σ_{p=1}^{n_X} [X_p log(X̂_p) + (1 - X_p) log(1 - X̂_p)]  [reconstruction branch term].  (26)

Fig. 3 shows the complete structure of the classifier. As a regularization measure, after each layer of the encoder and the FC branch we apply batch normalization [28].
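The weighted loss of Eq. (26) can be written as a short numpy sketch; the α_rec value is an illustrative placeholder, and the clipping constant is added for numerical stability.

```python
# Numpy sketch of Eq. (26): categorical cross-entropy on the class branch plus
# average per-pixel binary cross-entropy on the reconstruction branch,
# mixed by alpha_rec (placeholder value).
import numpy as np

def total_loss(X, X_hat, y, y_hat, alpha_rec=0.3, eps=1e-7):
    X_hat = np.clip(X_hat, eps, 1 - eps)     # avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    cls = -np.sum(y * np.log(y_hat))                                  # classification term
    rec = -np.mean(X * np.log(X_hat) + (1 - X) * np.log(1 - X_hat))   # reconstruction term
    return (1 - alpha_rec) * cls + alpha_rec * rec
```

With α_rec = 0 the reconstruction branch is ignored and the loss reduces to plain categorical cross-entropy.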
All the hidden nodes in the network use the ELU activation function [27].

J. Labeling and trajectory correction procedure
Previous approaches to human identification from mm-waves rely on the sole KF output to obtain the trajectories for the entire movement and, in a subsequent step, perform the classification on the pre-computed trajectories using, e.g., a neural network of some kind [4]. Now, consider the trajectories of two users that, at a certain point in time, intersect in the considered RD/RDA space. At this point, the two users cannot be distinguished, as their clusters largely overlap, and the trajectories are tracked again by the KF from the moment in which their clusters set apart. Beyond this point, however, the target association procedure may wrongly associate trajectories with detections, i.e., swap the trajectories of the two users. This problem can hardly be corrected with previous algorithms, whereas it is solved with the procedure that we designed and detail in this section. With our technique, identities are obtained in an online manner. Moreover, although the association of trajectories to clusters (see Section IV-F) may be erroneous due to the overlap of the user clusters, as soon as the trajectories set apart again the association is corrected using the output of the DCNN classifier. Note that this is not possible by solely exploiting the KF, as its memory amounts to a single time step, which is insufficient to solve this issue. Next, the procedure is formally described. Applying the classifier to the µD signatures from the K_t current trajectories returns K_t U-dimensional vectors, which contain the probabilities that each trajectory belongs to one of the U (known) user classes. Hence, we build a K_t × U matrix, Γ_t, by stacking these vectors. The matrix contains in position (i, j) the probability that trajectory i belongs to subject j ∈ {1, ..., U}. Following a reasoning similar to that in Section IV-F, we can interpret the matrix Γ_t as a score matrix for the associations between trajectories and classes.
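This score-matrix association can be sketched as follows, again using scipy's linear_sum_assignment; the rejection threshold and the function name are placeholders of ours.

```python
# Sketch of the identity-assignment step: Hungarian matching on the score
# matrix Gamma_t (trajectories x known users), rejecting low-score matches.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_identities(gamma, min_score=0.5):
    """gamma: (K_t, U) DCNN class probabilities, one row per trajectory."""
    rows, cols = linear_sum_assignment(-gamma)     # maximize the total score
    labels = {}
    for i, j in zip(rows, cols):
        # Low-probability associations are rejected: trajectory stays unknown.
        labels[int(i)] = int(j) if gamma[i, j] >= min_score else None
    return labels
```

Because the assignment is one-to-one, no class can be given to two trajectories at once, and trajectories left without a row (K_t > U) or below the threshold remain unknown.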
Therefore, the optimal assignment of the labels in the current time step is obtained by applying the Hungarian algorithm again on -Γ_t, which represents the association costs. This approach makes it possible to jointly label all the trajectories. From the properties of the Hungarian algorithm, it follows that the same class is never assigned to more than one subject. A subject is classified as unknown in case no label is assigned to her/him by the algorithm (which happens if K_t > U) or when the score output by the DCNN is lower than a threshold that we set to avoid low-probability associations. To enhance the stability of the identification process, the current labels output at time t by the DCNN are combined with the past ones as follows:
• for each trajectory, we store the past labels output by the DCNN in a list;
• at t = 0, subjects are identified using the instantaneous labels, as no past information is available;
• at time step t > 0, each trajectory i is classified considering the most recent W_h labels output by the
DCNN classifier up to and including time t, i.e., at time steps (t - W_h + 1), ..., (t - 1), t. If all these W_h labels match, we assign their common value to the trajectory; this is the final identity label output at time t. If instead different values appear in this list, we keep the final label that was previously assigned, at time (t - 1), to trajectory i. Note that, in case the W_h values in the list for any trajectory i differ, the procedure maintains the previous label until the DCNN outputs a sequence of W_h matching labels.

We remark that the value of W_h encodes the level of temporal stability required to accept a change in the identity output by the algorithm for any trajectory. In fact, this procedure introduces additional stability in the identification, as misclassifications that only last a few time steps are avoided. A cost is however paid in terms of correction speed when a tracking error occurs, e.g., when trajectories are swapped between users. As such, a desirable tradeoff has to be identified between the stability of the identification results (large W_h) and the delay in compensating for tracking errors (small W_h).

Measurement parameters
Antenna el. spacing δ [mm]
Number of receiving antennas L = 16
Start frequency f_o [GHz]
Chirp bandwidth B [GHz]
Chirp duration T [µs]
Chirp repetition time T_rep [µs]
No. samples per chirp N = 512
No. chirps per seq. P
Frame rate 1/Δt [fps]
ADC sampling frequency F_s [MHz]
Range resolution ΔR [cm]
Velocity resolution Δv [cm/s]

TABLE 1: Summary of the radar working parameters used in the evaluation session.

V. EXPERIMENTAL RESULTS
A. Measurement setup and parameters
The proposed framework is evaluated using an INRAS RadarLog device working at a mm-wave center frequency. The front-end features transmitting and receiving antennas organized as a linear array. The device working parameters are set up as in Tab. 1. Operating in LFMCW mode, we utilize one transmitting and L = 16 receiving antennas. To obtain ground truth values for the multi-target measurements we used a camera, time-synchronized with the radar device: the resulting video was used to identify and track the users within the indoor space. To thoroughly evaluate the proposed system, several measurement campaigns were conducted, using two measurement rooms with very different dimensions, shapes and propagation environments. Measurement room A is a corridor, where the radar was positioned on the short edge, as depicted in Fig. 5. The presence of several large windows and some radiators that become sources of unwanted reflections and ghost targets makes this evaluation room very challenging. Room A was the main environment considered in this work, where the
majority of tests was carried out, along with the collection of the training data for the classifier.

Fig. 5: Measurement room A: (a) map; (b) camera view.

Room B is a research laboratory, with furniture and devices left in place in order to mimic a real-life indoor scenario. In addition, other people not involved in the measurements were also present in the room, outside the tracking area, further increasing the difficulty of the tracking and identification tasks. Room B was utilized to validate our approach and investigate the generalization capabilities of the classifier, which was only trained on data from room A. We collected radar data for the training and validation of our algorithm in the following experiments.

1) Training the classifier on single subjects (room A).
We collected RDA data from 4 different subjects (S1, S2, S3 and S4) with different ages, body shapes and weights. Each subject was asked to walk alone, freely and without any restrictions, within the measurement room, in order to collect multiple sequences of frames per subject. The sequences were acquired on two different days, to reduce the effects of clothing or physical conditions.

2) Evaluating the performance of RD multi-person identification (room A).
We acquired test sequences of RD-only frames, some of them with two targets (S1 and S2) and the others with three targets (S1, S2 and S3). All subjects were asked to walk freely, without space constraints and varying their walking speed.

3) Evaluating the performance of RDA multi-person identification (room A).
We acquired test sequences of RDA frames with four targets. To make the test more challenging, we had the targets walking in a square-like fashion, with the first two subjects and the second two being at the same distance from the radar device, and with a small distance of about one meter between the two pairs, as shown in Fig. 5. All targets were constrained to walk at (approximately) the same speed. This setup was intentionally selected, as it represents a worst case for the RDA-based system. In fact, in this case the subjects cannot be distinguished by their different velocities or their range, and the detection/tracking has to mainly rely on the angular information, which is less accurate than the range or the velocity. Moreover, the classifier is forced to make the identity decision based on the features of the µD spectrogram that encode the way of walking of the subjects, as their speed is the same. We stress that this type of analysis is new: often, µD classifiers based on neural networks include the non-informative average walking speed as a discriminative feature, leading to poor accuracy when subjects have similar velocities, e.g., [1].

4) Validating the performance in a different measurement room (room B).
We collected RDA data with two subjects in a different environment, in order to assess whether the system, and in particular the DCNN classifier, generalizes to an unseen domain. No data from room B was used during the training of the classifier. The walking patterns of the subjects were constrained similarly to point 3). The room boundary values, see Section IV-G, were modified to comply with the new room dimensions, so as to effectively deal with ghost targets. Note that only two subjects were involved in the experiments for room B, as the available walking space was reduced with respect to room A. However, the density of users per square meter was higher for room B, which again corresponds to a more challenging setup.

The angular resolution degrades as the angle of arrival of the reflections approaches ±π/2, see Eq. (7).

With the considered parameters, raw radar frames have a shape of N × L × P points along the fast-time, antenna element, and slow-time dimensions, respectively, with N = 512 and L = 16. Zero-padded DFTs were computed along the angle, Doppler and range dimensions. The contributions due to static objects were removed by cutting the central Doppler channels, corresponding to near-zero velocities; the first and last channels, corresponding to velocities outside the interval of interest, were also removed, as they did not contain any useful information. The resulting radar maps after DFT processing are considerably larger for RDA than for RD. We performed µD extraction by summing over the range and angular dimensions, obtaining spectrograms with
n_chn = 200 Doppler bins and variable time length, depending on the sequence. When using range-Doppler DFT processing, a common assumption is that the target covers a smaller distance than the length of a range bin during a single frame period, i.e., vPT_rep < ΔR, with v being the velocity of the subject. If this assumption is not satisfied, the echo from the target spreads across multiple range bins (range migration) [29], [30]. However, this effect is not harmful in our case, where the typical extension of human subjects along the range dimension spans several range bins. Indeed, the average walking speed of human subjects in indoor environments is on the order of one meter per second [31], which leads to a ranging inaccuracy of at most a few range bins, deemed negligible with respect to the target size.

B. Training phase
We implemented the classifier network using TensorFlow and the Keras API. Training was performed on an NVIDIA RTX GPU. The µD sequences per target obtained from the measurements in room A were split into windows of W_c = 30 frames along the time dimension, with partial overlap between subsequent windows. The resulting images were divided into training and validation sets, and testing was carried out on the multi-target sequences. Data augmentation was applied to enlarge the training set: for each training image we generated additional images by

1) adding zero-mean Gaussian noise,
2) setting pixels in the image to zero with a fixed probability (random corruption),
3) setting to zero adjacent columns (time frames) starting from an index selected uniformly at random (time masking),
4) setting to zero adjacent Doppler bins starting from an index selected uniformly at random (frequency masking).

These images were used as input X of the encoder, setting the reconstruction target X̂ at the output of the decoder to be the original image, to force the encoder-decoder pair to learn key structural properties of the input (the same strategy is exploited to train denoising auto-encoders (DAE) [32]). The model was trained until convergence of the loss L(X̂, X, y) in Eq. (26) on the validation set, using the Stochastic Gradient Descent (SGD) optimizer with Nesterov momentum. The learning rate was adaptively lowered when the validation loss stopped improving for several consecutive epochs. We applied L2 regularization on the network weights and dropout in the fully connected layers, to reduce overfitting on the training data. In terms of inference time, the proposed DCNN takes a few milliseconds on average to perform the classification of a 2-second-long µD input. The µD signatures of all the tracked targets are fed
to the network in a single batch, so that only one forward pass is performed per time step.

Classifier / Accuracy on IDRad (%)
DCNN [1]: 78.46
RCN [10]: 75.65
SIN + LSTM [3]: 89.56
DCNN with IBs (our approach): 90.69

TABLE 2: Comparison between the proposed classifier and available benchmarks from the literature on the IDRad test set.
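The four augmentation operations listed in the training-phase description can be sketched as follows; all magnitudes (noise variance, corruption probability, mask widths) are illustrative placeholders, not the paper's settings.

```python
# Sketch of the four augmentations: Gaussian noise, random corruption,
# time masking, and frequency masking, applied to a micro-Doppler spectrogram.
import numpy as np

rng = np.random.default_rng(0)

def augment(spec, noise_std=0.1, p_corrupt=0.1, t_mask=4, f_mask=8):
    """spec: (n_chn, W_c) micro-Doppler spectrogram with values in [0, 1]."""
    out = []
    out.append(spec + rng.normal(0.0, noise_std, spec.shape))     # 1) Gaussian noise
    out.append(spec * (rng.random(spec.shape) >= p_corrupt))      # 2) random corruption
    tm = spec.copy()                                              # 3) time masking
    t0 = rng.integers(0, spec.shape[1] - t_mask)
    tm[:, t0:t0 + t_mask] = 0.0
    out.append(tm)
    fm = spec.copy()                                              # 4) frequency masking
    f0 = rng.integers(0, spec.shape[0] - f_mask)
    fm[f0:f0 + f_mask, :] = 0.0
    out.append(fm)
    return out
```

Each call returns four corrupted copies of the input; the originals serve as the reconstruction targets X̂ of the decoder branch, as in denoising auto-encoder training.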
C. DCNN evaluation on the IDRad dataset (single-target)
As a first evaluation phase, we trained and validated the proposed DCNN on IDRad, a publicly available dataset of GHz radar µD signatures [1]. The dataset contains RD frames from different subjects walking one at a time in the environment; hence, multi-target identification is not possible using this dataset. Training and validation/test data were collected in two different rooms.

Using the IDRad dataset, we assessed the performance of our framework on the single-person identification problem and compared it with available benchmarks [1], [3], [10]. For a fair comparison against previous work, we adapted the DCNN to accept as input µD sequences with a length of frames instead of . We found that our classifier generalizes well, with an overall average accuracy of 90.69%, with slight variations across different targets, but always above %. The comparison between our approach and the schemes from the literature is presented in Tab. 2. Our classifier is the most accurate: it significantly outperforms the previous DCNN approach [1] and the one based on reservoir computing networks (RCN) [10], and performs slightly better than [3], where a structured inference network (SIN) and long short-term memory (LSTM) recurrent neural networks are used. We believe this improvement is due to the use of IBs, which allow feature extraction at different scales without significantly increasing the network complexity, which would easily lead to overfitting.

D. Performance metrics
To train and test the proposed processing pipeline in a multi-target setting, we collected our own RD and RDA data across several measurement campaigns (see Section V-A). The performance of the final classifier is evaluated in terms of (i) the accuracy, i.e., the ratio between the number of frames in which the target is correctly identified and the number of frames in which it is detected and assigned a label different from unknown (see Section IV-J); (ii) the undetected ratio (r_und), i.e., the ratio between the number of frames in which a target is undetected and the total number of frames collected; and (iii) the unknown ratio (r_unk), i.e., the ratio between the number of frames in which the target is labeled as unknown and the total number of frames collected. This last metric measures the uncertainty of the identification framework in providing a classification for the targets. A target is said to be undetected if the number of consecutive missed detections is sufficient to eliminate its trajectory from those being tracked by the algorithm; as such, the target is no longer identified.
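For concreteness, the three metrics above can be computed from the per-frame outcome sequence of a single target as follows; the sentinel labels 'undetected' and 'unknown' are illustrative names, not identifiers from our implementation:

```python
from collections import Counter

def frame_metrics(outcomes, true_id):
    """Compute accuracy, r_und and r_unk for one target.

    outcomes: per-frame classifier outcome for the target; each entry is
    either a predicted identity label, 'unknown', or 'undetected'.
    """
    n = len(outcomes)
    counts = Counter(outcomes)
    n_und = counts['undetected']
    n_unk = counts['unknown']
    # frames that received an actual identity label
    labeled = [o for o in outcomes if o not in ('undetected', 'unknown')]
    correct = sum(o == true_id for o in labeled)
    # accuracy is conditioned on the target being detected and labeled
    accuracy = correct / len(labeled) if labeled else float('nan')
    return accuracy, n_und / n, n_unk / n
```

Note that the accuracy denominator excludes undetected and unknown frames, so the three metrics are complementary rather than summing to one.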
TABLE 3: RD and RDA average performance over the test sequences from room A (W_h = 9).

                 Range-Doppler (2 targets)   Range-Doppler (3 targets)          Range-Doppler-Azimuth (4 targets)
                 S1     S2     Avg.          S1     S2     S3     Avg.          S1     S2     S3     S4     Avg.
Accuracy [%]     98.24  97.69  97.96         95.75  98.65  91.38  95.26         99.52  98.26  100.0  95.56  98.27
r_und [%]        0      0      0             6.65   27.31  0      11.32         6.51   6.17   18.64  6.08   9.35
r_unk [%]        4.54   2.53   3.54          0.75   2.79   9.51   4.34          0      0      0      0      0

E. Results for the RD signal (multi-target)
In Tab. 3, we report the per-subject results using the metrics of Section V-D, averaged over the test sequences. In the evaluation, we discard the initial phase in which the trajectories need to accumulate frames of µD data before the first image can be provided to the DCNN classifier.

With RD maps, the two-target case achieves the highest accuracy, with an average of 97.96%. With three targets, r_und increases for some subjects, as one may expect: having more targets in the same area leads to a higher probability of superposition of their clusters. In this case, the reflection coming from one of the targets is often undetectable, because a large share of the frames for this user overlap with those of other users in the RD space (as they have similar range and speed). An interesting point, however, is that the identification accuracy and r_unk are not significantly impacted with respect to the two-target case, meaning that the identification framework can recover from missed detections, still providing high accuracy when targets become detectable again.

A detailed analysis of the errors revealed that the main problem with RD processing is the superposition of clusters in the RD space: this occurs when subjects have similar range and speed, and are thus likely to be detected as a single cluster. This is an intrinsic limitation of the RD space, and it is not influenced by any of the system parameters. However, thanks to the proposed processing method, which allows re-establishing trajectories once clusters separate and correcting errors using the identification outcomes (see Section IV-J), the system still provides correct results for a very high percentage of the time. Other techniques from the literature treat tracking and identification separately, and are therefore unable to deal with multi-target RD identification because of their inability to recover from erroneous tracking.

As a last result, in Fig. 6 we show the impact of changing the box dimension along the Doppler axis, w_B, averaging the accuracy obtained on two targets.
As expected, there is a trade-off between capturing most of the target's Doppler information (large w_B) and avoiding unnecessary overlap between boxes (small w_B), which may lead to classification errors. The value chosen for the results in Tab. 3 is . m/s, as it provided the highest accuracy. The dimension of the box along the range dimension, h_B, is instead kept fixed at m.
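To make the box extraction concrete, the following is a minimal sketch of cropping a box of width w_B (Doppler) and height h_B (range) from an RD map around a target's estimated range and velocity. The function name, bin-resolution parameters, and the assumption that zero Doppler sits at the central bin are illustrative, not details of our implementation:

```python
import numpy as np

def crop_rd_box(rd_map, r_tgt, v_tgt, h_B, w_B, dr, dv):
    """Crop a range-Doppler box centered on a target's estimated state.

    rd_map: 2D array (range bins x Doppler bins); dr and dv are the
    range and Doppler resolutions per bin (assumed values).
    """
    n_r, n_v = rd_map.shape
    # convert the physical box size to a number of bins on each side
    half_r = int(round(h_B / (2 * dr)))
    half_v = int(round(w_B / (2 * dv)))
    # target position in bin coordinates (zero Doppler at central bin)
    i_r = int(round(r_tgt / dr))
    i_v = n_v // 2 + int(round(v_tgt / dv))
    # clip the box to the map boundaries
    r_lo, r_hi = max(0, i_r - half_r), min(n_r, i_r + half_r + 1)
    v_lo, v_hi = max(0, i_v - half_v), min(n_v, i_v + half_v + 1)
    return rd_map[r_lo:r_hi, v_lo:v_hi]
```

Enlarging w_B widens the Doppler slice assigned to a target, which captures more of its signature but increases the chance that two targets' boxes overlap.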
Fig. 6: Accuracy of RD identification by varying the box width w_B along the Doppler dimension for two subjects, S1 and S2.

F. Results for the RDA signal (multi-target)
Tab. 3 shows the results of RDA processing averaged over the test sequences: our system achieves an accuracy of 98.27% over 4 targets. We recall that the initial phase in which the DCNN has to collect the first µD vectors is neglected in the computation; only frames after this initial transient period are considered, as in the RD analysis.

The relatively high people density ( . person/m²) with respect to that in the RD analysis causes blockage to become more frequent, i.e., some subjects block the signal path to other targets during some frames, which explains the non-negligible average r_und of 9.35%. Conversely, r_unk is always zero for all subjects and all sequences, meaning that once a target is detected, the network always has enough data and confidence to produce a classification result. Remarkably, although r_und is greater than zero for all subjects, the identification accuracy is still very high (see in particular S3), which confirms once again the framework's ability to recover from missed detections. This is possible thanks to the correction algorithm of Section IV-J.

G. Impact of training parameters
The proposed DCNN architecture, in terms of the number of inception blocks, the number of FC layers, and the presence of skip connections, was obtained using a greedy search over the hyperparameters, i.e., we repeated the model training changing one parameter at a time and selecting the value that led to the minimum loss. This procedure is suboptimal with respect to an exhaustive search or to Bayesian methods, but was preferred because of its lower computational cost. In the following, we focus on the most interesting and influential parameters, namely the value of α_rec and the use of data augmentation, analyzing their impact on the final test accuracy.

Fig. 7: Accuracy on the RDA test data depending on the reconstruction weight α_rec (room A).

The reconstruction weight α_rec ∈ [0, ] tunes the relative importance of the classification and reconstruction branches in the training loss. Using the reconstruction branch yields a slight improvement in the generalization capabilities of the DCNN, similar to what is commonly achieved through regularization. Specifically, in Fig. 7 we plot the accuracy values obtained varying α_rec from 0, i.e., with the reconstruction branch disabled, to . . The best result is obtained for α_rec = 0. . Remarkably, these improvements are only possible when using the encoder-decoder structure in combination with the data augmentation strategy described in Section V-B. Indeed, with no data augmentation (e.g., noise addition and random signal deletion/corruption) the reconstruction at the output of the auto-encoder becomes much easier, and the feature extraction is less effective in capturing the true signal manifold; as a result, no major benefit is observed.

TABLE 4: Impact of training set size and data augmentation on accuracy (RDA with four targets) and training time.

Training set used         %      %      %      %     No augm.
Acc. [%] (RDA, 4 trg.)    84.08  88.97  95.45  98.27  89.97
Training time [min.]      8      15     23     30     9

In Tab. 4, we show the effect of using only a portion of the available training data on the time required to complete the training and on the accuracy on the RDA test data. While the training time increases almost linearly when using %, %, % or % of the full training set ( minutes per subject), the accuracy shows the smallest improvement when going from % to %. This is a saturation effect on the model's performance that is customary in neural network training. In the last column of Tab. 4, we show the results of training the model on the whole available dataset without exploiting data augmentation.
The accuracy decreases significantly, motivating the use of the augmentation techniques described in Section V-B.

H. Integrated vs separate tracking and identification
As described in Section IV-J, the proposed system jointly performs tracking and identification. To quantify the improvement of this design with respect to obtaining trajectories and identities separately, we compare the average accuracy of the two approaches (joint vs. separate processing) over all the considered subjects and RD/RDA test sequences.
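As a simplified illustration of why joint processing helps, consider assigning identities to tracks by maximizing the total classifier confidence over all one-to-one assignments, rather than labeling each track independently. The brute-force sketch below is only a stand-in for the integration of Section IV-J (which also exploits tracking history), and a Hungarian solver would scale better:

```python
from itertools import permutations

def assign_identities(score):
    """Assign one identity per track by maximizing total confidence.

    score[t][k]: average softmax score of identity k for track t
    (illustrative data structure). Returns a tuple with the identity
    index assigned to each track, or None if there are no tracks.
    """
    n_tracks = len(score)
    n_ids = len(score[0])
    best, best_total = None, float('-inf')
    # enumerate all one-to-one assignments of identities to tracks
    for perm in permutations(range(n_ids), n_tracks):
        total = sum(score[t][perm[t]] for t in range(n_tracks))
        if total > best_total:
            best, best_total = perm, total
    return best
```

With per-track labeling, two tracks can claim the same identity; the joint assignment resolves such conflicts globally, which is one reason the integrated design recovers better from tracking errors.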
Fig. 8: Accuracy comparison between joint (our approach) and separate (previous work) tracking (tr.) and identification (id.).
TABLE 5: Accuracy results obtained in room B on two subjects using RDA processing, after training the classifier on data from room A only.

              Seq. 1   Seq. 2   Average
Acc. [%]      100.00   92.00    96.00
r_und [%]     20.40    16.80    18.60
r_unk [%]     0        0        0

Fig. 8 confirms that our integrated approach is of key importance to enable precise RD identification, with improvements of . % and . % in the 2- and 3-subject cases, respectively. For RDA processing, the improvement is smaller ( . %), due to the higher detection capabilities of the system in the RDA space, which makes cluster superposition and the subsequent tracking errors less frequent. The improvement is however non-negligible, and the proposed combined architecture remains very effective.

I. Dimensioning the classification window W_h

As anticipated in Section IV-J, the classification window parameter W_h plays an important role in the trade-off between online classification accuracy and speed in recovering from errors. In Fig. 9, we show the effect of varying W_h from to frames for the RDA signal. All the sequences are considered, and we observe a monotonically increasing accuracy. This may not always be the case, however: if the initial guess of the classifier is wrong, even in the absence of tracking errors, a large window would lead to a wrong classification for many frames. For this reason, a good selection approach is to pick the lowest W_h that guarantees a given, application-dependent, accuracy target. For the results in Tab. 3, we picked W_h = 9 frames, leading to a delay of . s, as this is the lowest value of W_h for which the accuracy is above % for all the sequences. Still, all values up to W_h = 15 frames would be good choices, as the delay is below s for all of them. The same value of W_h also led to the best results in the RD case.
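The selection rule above (pick the smallest W_h whose accuracy exceeds the target on every test sequence) can be sketched as follows; the data layout is an illustrative assumption:

```python
def smallest_window(acc_per_seq, target_acc):
    """Pick the smallest classification window W_h whose accuracy
    meets a target on every test sequence.

    acc_per_seq: dict mapping candidate W_h -> list of per-sequence
    accuracies (illustrative structure). Returns the chosen W_h, or
    None if no candidate qualifies.
    """
    for w in sorted(acc_per_seq):
        if all(a >= target_acc for a in acc_per_seq[w]):
            return w
    return None
```

Because the classification delay grows with W_h, choosing the smallest qualifying window keeps the system as responsive as possible for the required accuracy level.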
Fig. 9: Accuracy of the identification in RDA processing with respect to the length (i.e., number of frames) of the classification window W_h (in room A). The red solid curve represents the average over all sequences, with uncertainty in terms of one standard deviation (shaded area).

J. Validation in a different indoor environment
To analyze the generalization capabilities of the proposed system, we evaluated its performance in a different environment, room B, described in Section V-A, considering RDA processing. Here, we investigate whether a pretrained DCNN can generalize well to new rooms, as the training procedure would be too long and costly to repeat for each new environment. The detection and tracking parameters for ghost-target removal and the KF matrices either are environment independent (e.g., the KF parameters), can be easily obtained from side information (room dimensions), or can be estimated from a few preliminary measurements of the empty room (denoising threshold). Specifically, the new range-dependent denoising threshold goes from − dBm at minimum range to − dBm, which is expected given the smaller room size and the considerable presence of static objects. The threshold on the azimuth dimension is instead left unchanged.

In Tab. 5, we show the results of testing, in room B, the classifier trained in room A. The average accuracy for two subjects over the considered sequences is 96.00%, which is lower than that obtained in room A with four subjects, but is deemed satisfactory given the difficult propagation conditions of the new environment. Indeed, we stress that room B contains furniture and several static objects which cause severe multi-path effects and clutter, in addition to other people who were working in the room and were not involved in the experiment. These harsh conditions are reflected in the high percentage of time in which a subject is undetectable, r_und = 18.60%, i.e., almost double that of room A (see Tab. 3). We conclude that the classifier is able to generalize to unseen environments even in realistic conditions; more reliable detection schemes would further enhance the model's robustness, and we leave their study for future work.

VI. CONCLUSIONS
In this work, we have presented a system for indoor multi-person identification from mm-wave radar µ-Doppler signatures. The proposed approach has been designed to work with range-Doppler (RD) and range-Doppler-azimuth (RDA) data, requiring only small modifications to deal with these two signals, and being able to trade working range and computational speed (RD) for detection and tracking accuracy (RDA). The processing steps are: removal of static reflections and random noise, a target detection phase using density-based clustering (DBSCAN), a tracking procedure using Kalman filtering, and a final classification step exploiting deep convolutional neural networks (DCNNs). In our novel design, we have integrated the identification information with the trajectory tracking block. This allows for much higher identification accuracies when working with both RD and RDA signals in multi-target scenarios, i.e., where multiple subjects share and move within the same physical space. The proposed framework has been tested on real measurements involving single as well as multiple targets moving concurrently in an indoor space (a lacking aspect in the literature), obtaining an identification accuracy of . % for RD, with targets, and of 98.27% for RDA, with four targets. The framework has a maximum working range of m for RD and of - m for RDA. A further evaluation was conducted to assess the generality of the proposed approach, by capturing additional test data in a different room that was not used by the system at training time. Despite the new environment being more challenging, e.g., with furniture and other people present, we obtained an accuracy of 96% with two subjects. Future research avenues include characterizing the indoor space by (automatically) mapping static objects and ghost reflections, which is expected to lead to higher accuracies, and using multiple time-synchronized radar devices and 2D antenna arrays (elevation angle).

REFERENCES

[1] B. Vandersmissen, N. Knudde, A. Jalalvand, I. Couckuyt, A. Bourdoux, W. De Neve, and T. Dhaene, "Indoor person identification using a low-power FMCW radar," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, pp. 3941–3952, Jul 2018.
[2] S. A. Shah and F. Fioranelli, "RF sensing technologies for assisted daily living in healthcare: A comprehensive review," IEEE Aerospace and Electronic Systems Magazine, vol. 34, pp. 26–44, Nov 2019.
[3] V. Polfliet, N. Knudde, B. Vandersmissen, I. Couckuyt, and T. Dhaene, "Structured inference networks using high-dimensional sensors for surveillance purposes," in International Conference on Engineering Applications of Neural Networks (EANN), (Crete, Greece), May 2018.
[4] P. Zhao, C. X. Lu, J. Wang, C. Chen, W. Wang, N. Trigoni, and A. Markham, "mID: Tracking and Identifying People with Millimeter Wave Radar," in (Santorini Island, Greece), May 2019.
[5] M. S. Seyfioğlu, A. M. Özbayoğlu, and S. Z. Gürbüz, "Deep convolutional autoencoder for radar-based classification of similar aided and unaided human activities," IEEE Transactions on Aerospace and Electronic Systems, vol. 54, pp. 1709–1723, Feb 2018.
[6] P. Cao, W. Xia, M. Ye, J. Zhang, and J. Zhou, "Radar-ID: human identification based on radar micro-Doppler signatures using deep convolutional neural networks," IET Radar, Sonar & Navigation, vol. 12, pp. 729–734, Jul 2018.
[7] Y. Yang, C. Hou, Y. Lang, G. Yue, Y. He, and W. Xiang, "Person Identification Using Micro-Doppler Signatures of Human Motions and UWB Radar," IEEE Microwave and Wireless Components Letters, vol. 29, pp. 366–368, May 2019.
[8] S. Abdulatif, F. Aziz, K. Armanious, B. Kleiner, B. Yang, and U. Schneider, "Person identification and body mass index: A deep learning-based study on micro-Dopplers," in IEEE Radar Conference (RadarConf), (Boston, Massachusetts, USA), Apr 2019.
[9] Z. Chen, G. Li, F. Fioranelli, and H. Griffiths, "Personnel recognition and gait classification based on multistatic micro-Doppler signatures using deep convolutional neural networks," IEEE Geoscience and Remote Sensing Letters, vol. 15, pp. 669–673, May 2018.
[10] A. Jalalvand, B. Vandersmissen, W. De Neve, and E. Mannens, "Radar signal processing for human identification by means of reservoir computing networks," in IEEE Radar Conference (RadarConf), (Boston, Massachusetts, USA), Apr 2019.
[11] F. Luo, S. Poslad, and E. Bodanese, "Human Activity Detection and Coarse Localization Outdoors Using Micro-Doppler Signatures," IEEE Sensors Journal, pp. 1–1, May 2019.
[12] R. Trommel, R. Harmanny, L. Cifola, and J. Driessen, "Multi-target human gait classification using deep convolutional neural networks on micro-Doppler spectrograms," in European Radar Conference (EuRAD), (London, United Kingdom), Oct 2016.
[13] F. Weishaupt, I. Walterscheid, O. Biallawons, and J. Klare, "Vital Sign Localization and Measurement Using an LFMCW MIMO Radar," in (Bonn, Germany), Jun 2018.
[14] C.-H. Hsieh, Y.-F. Chiu, Y.-H. Shen, T.-S. Chu, and Y.-H. Huang, "A UWB radar signal processing platform for real-time human respiratory feature extraction based on four-segment linear waveform model," IEEE Transactions on Biomedical Circuits and Systems, vol. 10, pp. 219–230, Feb 2015.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), (Lake Tahoe, Nevada, USA), Dec 2012.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Las Vegas, Nevada, USA), Jun 2016.
[17] S. M. Patole, M. Torlak, D. Wang, and M. Ali, "Automotive radars: A review of signal processing techniques," IEEE Signal Processing Magazine, vol. 34, pp. 22–35, Mar 2017.
[18] V. Chen, "Analysis of radar micro-Doppler with time-frequency transform," in Proceedings of the Tenth IEEE Workshop on Statistical Signal and Array Processing, (Pocono Manor, Pennsylvania, USA), Aug 2000.
[19] V. Chen, F. Li, S. Ho, and H. Wechsler, "Micro-Doppler effect in radar: phenomenon, model, and simulation study," IEEE Transactions on Aerospace and Electronic Systems, vol. 42, pp. 2–21, Aug 2006.
[20] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," in (Portland, Oregon, USA), Aug 1996.
[21] T. Wagner, R. Feger, and A. Stelzer, "Radar signal processing for jointly estimating tracks and micro-Doppler signatures," IEEE Access, vol. 5, pp. 1220–1238, Feb 2017.
[22] R. E. Kalman, "A new approach to linear filtering and prediction problems," ASME Transactions, Journal of Basic Engineering, vol. 82 (Series D), no. 1, pp. 35–45, 1960.
[23] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[24] Y. Bar-Shalom, F. Daum, and J. Huang, "The probabilistic data association filter," IEEE Control Systems Magazine, vol. 29, pp. 82–100, Dec 2009.
[25] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," in The 9th ISCA Speech Synthesis Workshop, (Sunnyvale, California, USA), Sep 2016.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Boston, Massachusetts, USA), Jun 2015.
[27] D. A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in International Conference on Learning Representations (ICLR), (San Juan, Puerto Rico), May 2016.
[28] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in International Conference on Machine Learning (ICML), (Lille, France), Jul 2015.
[29] D. E. Barrick, "FM/CW radar signals and digital processing," Report AD-774 829, NOAA, Jul 1973.
[30] M. A. Richards, J. Scheer, W. A. Holm, and W. L. Melvin, Principles of Modern Radar. Citeseer, 2010.
[31] C. Willen, K. Lehmann, and K. Sunnerhagen, "Walking speed indoors and outdoors in healthy persons and in persons with late effects of polio," Journal of Neurology Research, vol. 3, pp. 62–67, Apr 2013.
[32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, Dec 2010.
Jacopo Pegoraro (S’20) received the B.Sc. degreein information engineering and the M.Sc. degree inICT for Internet and Multimedia engineering fromthe University of Padova, Padua, Italy, in 2017and 2019, respectively. He is currently pursuing thePh.D. degree with the SIGNET Research Group,Department of Information Engineering, at the sameUniversity. His research interests include deep learn-ing and signal processing with applications to radiofrequency sensing and, specifically, mm-wave radarsensing.
Francesca Meneghello (S'19) received the B.Sc. degree in information engineering and the M.Sc. degree in telecommunication engineering from the University of Padova, Italy, in 2016 and 2018, respectively. She is currently pursuing the Ph.D. degree with the Department of Information Engineering at the same university. Her current research interests include deep-learning architectures and signal processing with application to remote radio frequency sensing and wireless networks. She was a recipient of the Best Student Paper Award at WUWNet 2016 and the Best Student Presentation Award at the IEEE Italy Section SSIE 2019, and received an honorary mention in the 2019 IEEE ComSoc Student Competition.