Vision-Aided 6G Wireless Communications: Blockage Prediction and Proactive Handoff
Gouranga Charan, Muhammad Alrabeiah, and Ahmed Alkhateeb
Abstract
The sensitivity to blockages is a key challenge for high-frequency (5G millimeter wave and 6G sub-terahertz) wireless networks. Since these networks mainly rely on line-of-sight (LOS) links, sudden link blockages highly threaten the reliability of the networks. Further, when the LOS link is blocked, the network typically needs to hand off the user to another LOS basestation, which may incur critical time latency, especially if a search over a large codebook of narrow beams is needed. A promising way to tackle the reliability and latency challenges lies in enabling proaction in wireless networks. Proaction basically allows the network to anticipate blockages, especially dynamic blockages, and initiate user hand-off beforehand. This paper presents a complete machine learning framework for enabling proaction in wireless networks relying on visual data captured, for example, by RGB cameras deployed at the basestations. In particular, the paper proposes a vision-aided wireless communication solution that utilizes bimodal machine learning to perform proactive blockage prediction and user hand-off. The bedrock of this solution is a deep learning algorithm that learns, from visual and wireless data, how to predict incoming blockages. The predictions of this algorithm are used by the wireless network to proactively initiate hand-off decisions and avoid any unnecessary latency. The algorithm is developed on a vision-wireless dataset generated using the ViWi data-generation framework. Experimental results on two basestations with different cameras indicate that the algorithm is capable of accurately detecting incoming blockages more than ∼ of the time. Such blockage prediction ability is directly reflected in the accuracy of proactive hand-off, which also approaches . This highlights a promising direction for enabling high reliability and low latency in future wireless networks.

I. INTRODUCTION
Millimeter-wave (mmWave) and sub-terahertz communications are becoming dominant directions for current and future wireless networks [2], [3]. With their large bandwidths, they
The authors are with the School of Electrical, Computer, and Energy Engineering, Arizona State University. Emails: {gcharan, malrabei, alkhateeb}@asu.edu. Part of this work was submitted to IEEE ICC Workshops [1].

have the ability to satisfy the high data rate demands of several applications such as wireless Virtual/Augmented Reality (VR/AR) and autonomous driving. Communication in these bands, however, faces several challenges at both the physical and network layers. One of the key challenges stems from the sensitivity of high-frequency signals (i.e., mmWave and sub-terahertz) to blockages [4]. These signals suffer from high penetration loss and attenuation, resulting in strong dips in the received Signal-to-Noise Ratio (SNR) whenever an object is present in-between a basestation and a user. Such dips lead to sudden disruptions of the communication channel, which severely impact the reliability of wireless networks. Re-establishing LOS connection is usually done reactively, which brings about a hefty latency burden considering the Ultra-Reliable Low-Latency (URLL) requirements of future networks [5]. Given all that, high-frequency wireless networks need not only maintain line-of-sight (LOS) connections but also do so proactively, which implies a critical need for a sense of surrounding.

The aforementioned reliance on LOS draws a striking and important parallel with computer vision, in which visual data (e.g., images and video sequences) only captures visible, i.e., LOS, objects. This parallel is very interesting as computer vision systems rely on machine learning and visible objects to perform a variety of visual tasks depending on object appearance (object detection [6], [7]) and/or behavior (action recognition [8], [9]).
In a wireless network, visible objects in the environment are usually the cause of link blockages, and, hence, a computer vision system powered with machine learning could be utilized to provide a much needed sense of surrounding to the network; it enables the network to identify objects in its environment and their behavior and utilize that to proactively detect possible blockages. Such capability helps alleviate the strain of link blockages, and as such, this work focuses on developing a vision-aided dynamic blockage prediction solution for high-frequency wireless networks.

A. Prior Works
The problem of LOS link blockage has long been acknowledged as a critical challenge to high-frequency wireless networks [3], [10]–[12]. In those networks, the quality of service highly deteriorates with link blockages. Therefore, solutions centered around multi-connectivity are a major avenue to handle that problem [11]. For instance, [12] proposes a multi-cell measurement reporting system to keep track of the link quality between a mmWave user and multiple basestations. All basestations in that system feed their measurements to a central unit that takes care of cell selection and scheduling. This system is further studied and tested in [11] under realistic dynamic scenarios. A slightly different look at multi-connectivity is presented in [13], [14]. In [13], the authors propose a few approaches for multi-connectivity, all of which focus on utilizing low-frequency bands (sub-6 GHz) to support the mmWave network. [14], on the other hand, develops a multi-connectivity algorithm that does not only factor in network reliability but also latency. Collectively, the work on multi-connectivity has its promise and elegance, yet it is lacking on two important fronts. First, it is inherently wasteful in terms of resource utilization; multiple basestations schedule resources for one user as a precaution against probable LOS blockages. The other is its reactive nature; the majority of multi-connectivity algorithms are designed to react to link blockages, not anticipate them.

A new trend in addressing LOS blockages has been developing in recent years, in which the driving power is machine learning [15]–[18]. Some studies such as [15], [16] have shown that using wireless sensory data (such as channels and received power), a machine learning model can efficiently differentiate LOS and Non-LOS (NLOS) links. They both address the link blockage problem from a reactive perspective, where uni-modal sensory data is first acquired, and then the status of the current link is predicted.
The work in [18], however, takes a step forward towards a proactive treatment of the problem. It studies proactive blockage prediction and hand-off for a single moving mmWave user in the presence of stationary blockages. The proposed solution utilizes observed sequences of mmWave beamforming vectors (beams) and uses a Gated Recurrent Unit (GRU) network to learn beam patterns that precede link blockages. Again, despite its appeal, it still falls short of meeting the latency and reliability requirements, as the sensory data is only expressive of stationary blockages. On a different note, the work in [19] explores a new dimension for blockage prediction in single-user communication settings. It proposes a modified residual network [20] that uses visual data to predict stationary blockages. However, like its wireless-data counterparts, it struggles in dealing with complex scenarios with dynamic blockages.
B. Contribution
In this paper, inspired by the recently proposed Vision-Aided Wireless Communication (VAWC) framework in [19] and [21], the link-blockage and user hand-off problems are addressed from a proactive perspective. Images and video sequences usually speak volumes about the environment they depict, and this is supported by the empirical evidence in [19]. As such, this work develops a deep neural network that learns proactive blockage prediction from sequences of jointly observed mmWave beams and video frames. The main contributions of this paper could be summarized in the following few points:

• A novel two-component deep learning architecture is proposed to utilize sequences of observed RGB frames and beamforming vectors and learn proactive link-blockage prediction. The architecture harnesses the power of Convolutional Neural Networks (CNNs) [6], [7] and Gated Recurrent Unit (GRU) networks [22], [23].

• The proposed architecture is leveraged to build a proactive hand-off solution. The solution deploys the two-stage architecture in different basestations, where it is used to predict possible future blockages from the perspective of each basestation. Those predictions are streamed to a central unit that determines whether the communication session of a certain user in the environment will need to be handed over to a different basestation or not.

• Based on the ViWi data-generation framework [21], the ViWi-BT challenge scenario [24] has been expanded to generate two datasets, namely the blockage-prediction and object-detection datasets. The former is an extension of the ViWi-BT challenge dataset. It provides multi-modal data samples in the form of a 4-tuple of image, mmWave beam, link status, and position information, while the latter is a small object-detection dataset that provides samples of images, object bounding boxes, and object classes. Both datasets are derived from the ViWi “ASUDT1 28” scenario [21], [25].
• The performance of the two-stage architecture and user hand-off solution are evaluated using a blockage-prediction dataset. The evaluation results confirm the importance of vision-aided blockage prediction in highly dynamic environments; the proposed architecture is shown to be capable of learning proactive blockage prediction from multi-modal data, and it achieves a noticeable gain over models that rely solely on wireless data (mmWave beam sequences).

The rest of this paper is organized as follows. Section II presents the system and channel models adopted in this work. Section III provides a formal description of the link blockage and user hand-off problems addressed in this paper. Section IV introduces a detailed description of the proposed solutions for both problems. Sections V and VI, respectively, present the description of the experimental setup used to evaluate the performance of the proposed solutions and the main results of that evaluation. Finally, Section VII concludes the paper by discussing the main takeaways. Datasets will be made public upon the publication of this work.

Fig. 1. [Figure: an outdoor mmWave basestation equipped with a camera and a beamforming codebook {f_q}; the LOS link of a user is blocked by a bus.]
II. SYSTEM AND CHANNEL MODELS
To illustrate the potential of deep learning and VAWC in mitigating the link blockage problem, this work considers a high-frequency communication network where basestations utilize RGB cameras to monitor their environment. The following two subsections provide a detailed description of the system and wireless channel models adopted in this work.
A. System model
The communication system considers a small-cell mmWave basestation deployed in an outdoor environment, as depicted in Fig. 1. The basestation is equipped with a uniform linear array (ULA) with $M$ elements and a standard-resolution RGB camera. For practicality [16], the basestation is assumed to employ an analog-only architecture with a single RF chain and $M$ phase shifters. As a result of this architecture, the basestation adopts a predefined beamforming codebook $\mathcal{F} = \{\mathbf{f}_q\}_{q=1}^{Q}$, where $\mathbf{f}_q \in \mathbb{C}^{M\times 1}$ and $Q$ is the total number of beamforming vectors. The choice for $\mathcal{F}$ in this paper is a beam-steering codebook that follows from the choice of the antenna array, i.e., a ULA. For such a codebook, each beamforming vector $\mathbf{f}_q$, $\forall q \in \{1, \dots, Q\}$, is given by

$$\mathbf{f}_q = \frac{1}{\sqrt{M}}\left[1,\ e^{j\frac{2\pi}{\lambda} d \sin(\phi_q)},\ \dots,\ e^{j(M-1)\frac{2\pi}{\lambda} d \sin(\phi_q)}\right]^T, \qquad (1)$$

where $\lambda$ is the wavelength, $d$ is the antenna spacing, and $\phi_q \in \{\frac{2\pi q}{Q}\}_{q=0}^{Q-1}$ is a uniform quantization of the azimuth angle. The communication system in this work adopts OFDM with a cyclic prefix of length $D$ and $K$ subcarriers. For any mmWave user in the wireless environment, its received downlink signal is given by

$$y_{u,k} = \mathbf{h}_{u,k}^T \mathbf{f}_q x + n_k, \qquad (2)$$

where $y_{u,k} \in \mathbb{C}$ is the received signal of the $u$-th user at the $k$-th subcarrier, $\mathbf{h}_{u,k} \in \mathbb{C}^{M\times 1}$ is the channel between the BS and the $u$-th user at the $k$-th subcarrier, $x \in \mathbb{C}$ is a transmitted complex symbol that satisfies the constraint $\mathbb{E}[|x|^2] = P$, where $P$ is a power budget per symbol, and finally $n_k$ is a noise sample drawn from a complex Gaussian distribution $\mathcal{N}_{\mathbb{C}}(0, \sigma^2)$.

B. Channel model
The channel model adopted throughout this paper is a geometric mmWave channel model with $L$ clusters. This choice of model comes as a result of two facts: (i) the model captures the limited scattering property of the mmWave band [26], [27], and (ii) the experimental results in this paper are all based on data samples that are partially obtained from a ray-tracing simulator, as will be described in Section V. The channel vector of the $u$-th user at the $k$-th subcarrier is given by

$$\mathbf{h}_{u,k} = \sum_{d=0}^{D-1} \sum_{\ell=1}^{L} \alpha_\ell\, e^{-j\frac{2\pi k}{K} d}\, p(dT_S - \tau_\ell)\, \mathbf{a}(\theta_\ell, \phi_\ell), \qquad (3)$$

where $L$ is the number of channel paths, and $\alpha_\ell$, $\tau_\ell$, $\theta_\ell$, and $\phi_\ell$ are, respectively, the path gain (including the path-loss), the delay, the azimuth angle of arrival, and the elevation angle of arrival of the $\ell$-th channel path. $T_S$ represents the sampling time, while $D$ denotes the cyclic prefix length (assuming that the maximum delay is less than $DT_S$).

III. PROBLEM FORMULATION
Two significant problems faced in high-frequency wireless networks are LOS link blockages and the ability to perform low-latency user hand-offs. The severity of those two problems mostly revolves around the mixed dynamics of the wireless environment, i.e., it is characterized by a mixture of dynamic and stationary objects. Developing a solution to the two is tightly linked to equipping the wireless network with a sense of its surroundings; such a sense transforms the network from being reactive to its environment to being proactive in it. This simply means having a network able to predict incoming blockages and initiate hand-off procedures beforehand. With that in mind, this work attempts to utilize machine learning and a fusion of visual and wireless sensory data, e.g., video frames and mmWave beams, to enable that sense of surrounding in a wireless network. The objective is to observe a sequence of a user’s image-beam pairs at a basestation and use that sequence to predict whether that user will be blocked within a window of future instances or not. Such a prediction task is made possible by two important facts: (i) images, or visual data in general, are rich with information about the scene they depict, e.g., the type of objects, their relative positions to one another, and, in the case of videos, the object motion; and (ii) beamforming vectors usually provide directional information that, for well-calibrated antenna arrays, summarizes major signal directions. The following two subsections lay the groundwork for the proposed solutions by providing formal definitions for the problems of proactive blockage and user hand-off predictions.
A. Blockage Prediction
The primary objective of this paper is to utilize sequences of RGB images and beam indices and develop a machine learning model that learns to predict link blockages proactively, i.e., transitions from LOS to NLOS. Formally, this learning problem could be posed as follows. For any user $u$ in the environment, a sequence of image and beam-index pairs is observed over a time interval of $r$ instances. At any time instance $\tau \in \mathbb{Z}$, that sequence is given by

$$\mathcal{S}_u = \{(\mathbf{X}_u[t], b_u[t])\}_{t=\tau-r+1}^{\tau}, \qquad (4)$$

where $b_u[t]$ is the index of the beamforming vector in codebook $\mathcal{F}$ used to serve user $u$ at the $t$-th time instance¹, $\mathbf{X}_u[t] \in \mathbb{R}^{W\times H\times C}$ is an RGB image of the environment taken at the $t$-th time instance, $W$, $H$, and $C$ are, respectively, the width, height, and number of color channels of the image, and $r \in \mathbb{Z}$ is the extent of the observation interval.

¹Since the system model assumes a predefined beamforming codebook, the indices of those beams are used instead of the complex-valued vectors themselves.

For robust network operation, the objective is to observe $\mathcal{S}_u$ and predict whether a blockage will occur within a window of $r' \in \mathbb{Z}$ future instances or not, without focusing on the exact future instance. Let $\mathcal{A}_u = \{a_u[t]\}_{t=\tau+1}^{\tau+r'}$ represent the window (sequence) of $r'$ future link statuses of the $u$-th user, where $a_u[t] \in \{0, 1\}$ represents the link status at the $t$-th future time instance; 0 and 1 are, respectively, LOS and NLOS links. Then, the user’s future link status $s_u$ in the window $\mathcal{A}_u$ (henceforth referred to as the future link status) could be defined as

$$s_u = \begin{cases} 0, & a_u[t] = 0,\ \forall t \in \{\tau+1, \dots, \tau+r'\} \\ 1, & \text{otherwise} \end{cases} \qquad (5)$$

where 0 indicates a LOS connection is maintained throughout the window $\mathcal{A}_u$ and 1 indicates the occurrence of a link blockage within that window.

The primary objective is attained using a machine learning model.
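The labeling rule in (5) reduces to a simple check over the future window; a minimal sketch (the function name and NumPy usage are ours, not part of the paper's pipeline):

```python
import numpy as np

def future_link_status(link_window):
    """Label a window of future link statuses per Eq. (5):
    0 if the LOS link (a_u[t] = 0) holds for every instance in the
    window, 1 if any instance is blocked (NLOS)."""
    return int(np.any(np.asarray(link_window) == 1))

# A window of r' = 5 future instances with one blockage -> label 1.
print(future_link_status([0, 0, 1, 0, 0]))  # 1
# An all-LOS window -> label 0.
print(future_link_status([0, 0, 0, 0, 0]))  # 0
```

Note that the label deliberately discards when the blockage occurs inside the window; only its occurrence matters.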
This model is developed to learn a prediction function $f_\Theta(\mathcal{S})$ that takes in the observed image-beam pairs and produces a prediction of the future link status $\hat{s} \in \{0, 1\}$. This function is parameterized by a set $\Theta$ representing the model parameters and is learned from a dataset of labeled sequences. To put this in formal terms, let $P(\mathcal{S}, s)$ represent a joint probability distribution governing the relation between the observed sequence of image-beam pairs $\mathcal{S}$ and the future link status $s$ in some wireless environment, which reflects the probabilistic nature of link blockages in the environment. A dataset of independent pairs $\mathcal{D} = \{(\mathcal{S}_u, s_u)\}_{u=1}^{U}$ is collected, where each $(\mathcal{S}_u, s_u)$ is sampled at random from $P(\mathcal{S}, s)$ — $s_u$ serves as a label for the observed sequence $\mathcal{S}_u$. This dataset is then used to train the prediction function $f_\Theta(\mathcal{S})$ such that it maintains high-fidelity predictions for any dataset drawn from $P(\mathcal{S}, s)$. This could be mathematically expressed as

$$\max_{f_\Theta(\mathcal{S})}\ \prod_{u=1}^{U} P(\hat{s}_u = s_u \mid \mathcal{S}_u), \qquad (6)$$

where the joint probability in (6) is factored out as a result of the independent and identically distributed samples in $\mathcal{D}$. This conveys an implicit assumption that, for any user $u$ in the environment, the success probability of $f_\Theta(\mathcal{S}_u)$ predicting $s_u$ only depends on its observed sequence $\mathcal{S}_u$.

B. Proactive Hand-off
A direct consequence of proactively predicting blockages is the ability to do proactive user hand-off. In this work, the problem is studied for the case of hand-off between two high-frequency basestations, and it is solely based on the availability of a LOS link to a user².

²Cases that prompt hand-off, like a user approaching the edge of a cell, are not considered here.

Let $\mathcal{S}_u^{(n)} = \{(\mathbf{X}_u^{(n)}[t], b_u^{(n)}[t])\}_{t=\tau-r+1}^{\tau}$ and $\mathcal{S}_u^{(n')} = \{(\mathbf{X}_u^{(n')}[t], b_u^{(n')}[t])\}_{t=\tau-r+1}^{\tau}$ respectively represent
Then, the event of successful hand-offcould be formally expressed as follows H = { ˆ z nn (cid:48) u = z nn (cid:48) } = , (ˆ s ( n ) u , ˆ s ( n (cid:48) ) u , s ( n ) u , s ( n (cid:48) ) u ) ∈ E , (ˆ s ( n ) u , ˆ s ( n (cid:48) ) u , s ( n ) u , s ( n (cid:48) ) u ) ∈ E (7)where E = { (0 , , , , (0 , , , , (1 , , , , (1 , , , , (1 , , , , (0 , , , , (0 , , , } in-dicates the set of tuples (or events) that amount to a successful no hand-off decision while E = { (1 , , , } is the set of tuples amounting to a successful hand-off decision. Guided by(7), the conditional probability of successful hand-off could be written as P (cid:16) ˆ z nn (cid:48) u = z nn (cid:48) |S ( n ) u , S ( n (cid:48) ) u (cid:17) = P (ˆ s ( n ) u = 1 , ˆ s ( n (cid:48) ) u = 0 | s ( n ) u = 1 , s ( n (cid:48) ) u = 0 , S ( n ) u , S ( n (cid:48) ) u ) P ( s ( n ) u = 1 , s ( n (cid:48) ) u = 0)+ P (ˆ s ( n ) u = 1 , ˆ s ( n (cid:48) ) u = 1 | s ( n ) u = 1 , s ( n (cid:48) ) u = 1 , S ( n ) u , S ( n (cid:48) ) u ) P ( s ( n ) u = 1 , s ( n (cid:48) ) u = 1)+ P (ˆ s ( n ) u = 1 , ˆ s ( n (cid:48) ) u = 1 | s ( n ) u = 0 , s ( n (cid:48) ) u = 0 , S ( n ) u , S ( n (cid:48) ) u ) P ( s ( n ) u = 1 , s ( n (cid:48) ) u = 1)+ P (ˆ s ( n ) u = 1 , ˆ s ( n (cid:48) ) u = 1 | s ( n ) u = 0 , s ( n (cid:48) ) u = 1 , S ( n ) u , S ( n (cid:48) ) u ) P ( s ( n ) u = 0 , s ( n (cid:48) ) u = 1)+ P (ˆ s ( n ) u = 0 | s ( n ) u = 0 , S ( n ) u ) P ( s ( n ) u = 0) (8)(8) is lower bounded by the probability of joint successful link-status prediction given S ( n ) u and Detecting Relevant ObjectDeep learning Model for Future Blockage Prediction
Fig. 2. An illustration of the main idea behind the proposed solution. It shows the two notions of detecting relevant objects and zeroing in on those most likely to be the user and its future blockage.

That is,

$$P(\hat{z}_u^{nn'} = z_u^{nn'} \mid \mathcal{S}_u^{(n)}, \mathcal{S}_u^{(n')}) \geq \sum_{v=0}^{1} \sum_{v'=0}^{1} P(\hat{s}_u^{(n)} = v, \hat{s}_u^{(n')} = v' \mid s_u^{(n)} = v, s_u^{(n')} = v', \mathcal{S}_u^{(n)}, \mathcal{S}_u^{(n')})\, P(s_u^{(n)} = v, s_u^{(n')} = v') = P(\hat{s}_u^{(n)} = s_u^{(n)}, \hat{s}_u^{(n')} = s_u^{(n')} \mid \mathcal{S}_u^{(n)}, \mathcal{S}_u^{(n')}). \qquad (9)$$

Using two blockage-prediction functions $f_\Theta(\mathcal{S}^{(n)})$ and $f_\Theta(\mathcal{S}^{(n')})$ (one per basestation), successful proactive hand-off could be viewed through the lens of blockage prediction. More specifically, (9) indicates that maximizing the conditional probability of joint successful link-status prediction guarantees high-fidelity hand-off prediction. Thus, the two functions $f_\Theta(\mathcal{S}^{(n)})$ and $f_\Theta(\mathcal{S}^{(n')})$ need to be learned such that

$$\max_{f_\Theta(\mathcal{S}^{(n)}),\, f_\Theta(\mathcal{S}^{(n')})}\ \prod_{u=1}^{U} P(\hat{s}_u^{(n)} = s_u^{(n)}, \hat{s}_u^{(n')} = s_u^{(n')} \mid \mathcal{S}_u^{(n)}, \mathcal{S}_u^{(n')}), \qquad (10)$$

where $U$ is the total number of samples drawn from the probability distribution $P(s_u^{(n)}, s_u^{(n')}, \mathcal{S}_u^{(n)}, \mathcal{S}_u^{(n')})$ that governs the relation between the observed sequences $\mathcal{S}_u^{(n)}$ and $\mathcal{S}_u^{(n')}$ and the future link statuses $s_u^{(n)}$ and $s_u^{(n')}$.

IV. VISION-AIDED DYNAMIC BLOCKAGE PREDICTION AND PROACTIVE HANDOFF
In this section, we explain our proposed solutions for vision-aided blockage prediction and proactive hand-off using deep learning. The discussion is organized as follows. We first start by highlighting the key idea behind our blockage prediction solution. We further develop that idea by going into the details of our proposed deep learning algorithm. Then, we show how that blockage prediction algorithm is used to address the user hand-off problem.

A. Blockage Prediction: Key Idea
This work aims to predict future link blockages using deep learning algorithms and a fusion of both vision and wireless data. As we progress from a single user and stationary blockage [19] to a more realistic scenario with multiple moving objects and dynamic blockages, the task of future blockage prediction becomes far more challenging. A successful prediction of future link blockages in a realistic scene hinges, to a large extent, on the following two notions. First, the ability to detect and identify relevant objects in the wireless environment, i.e., objects that could be wireless users or possible link blockages. This includes detecting humans in the scene; different vehicles such as cars, buses, trucks, etc.; and other probable blockages such as trees, lamp posts, etc. Second, the ability to zero in on the objects of interest, i.e., the wireless user and its most likely future blockage. Only detecting relevant objects is not sufficient to predict future blockages; it needs to be augmented with the ability to recognize which of those objects is the probable user and which of them is the probable blockage. This recognition narrows the analysis of the scene to the two objects of interest and helps answer the questions of whether and when a blockage will happen. Those two high-level notions are illustrated in Fig. 2.

Guided by the above notions, the prediction function $f_\Theta(\mathcal{S})$ (or the proposed solution) is designed to break down blockage prediction into two sequential sub-tasks with an intermediate embedding stage. The first sub-task attempts to embody the first notion mentioned above. A machine learning algorithm could detect relevant objects in the scene by relying on visual data alone, as it has the needed appearance and motion information to do the job.
Given recent advances in deep learning [9], [20], [28], this sub-task is expected to be well-handled with CNN-based object detectors; they have been setting the bar for state-of-the-art object detection for about a decade now, see for instance [6], [29]. The next sub-task embodies the second notion, recognizing the objects of interest among all the detected objects. Wireless data is brought into the picture for this sub-task. More specifically, mmWave beamforming vectors could be utilized to help with that recognition process. They provide a sense of direction in the 3D space (i.e., the wireless environment), whether it is an actual physical direction for well-calibrated and designed phased arrays or a virtual direction for arrays with hardware impairments [30]. That sense of direction could be coupled with the set of relevant objects using an embedding stage. In particular, we propose to observe multiple bimodal tuples of beams and relevant objects over a sequence of consecutive time instances, embed each tuple into high-dimensional features, and
Fig. 3. An overall block diagram illustrating the problem and its solution. The proposed solution takes in a sequence of bimodal image-beam tuples and produces a prediction of the future link status.

boil down the second sub-task to a sequence modeling problem. Recurrent neural networks define the state-of-the-art [31] for such problems. Hence, we design a recurrent network to implicitly learn the recognition sub-task and produce predictions of future blockages, which is the ultimate goal of our solution. The overall concept of the proposed blockage prediction solution is illustrated in Fig. 3, and it is going to be detailed in the following couple of sections.
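To make the sequence-modeling view concrete, a single GRU step can be illustrated in isolation; this is a minimal NumPy sketch with made-up dimensions, not the trained network described later:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step: x is the current embedded feature vector, h the
    previous hidden state; W, U, b hold the update (z), reset (r),
    and candidate (c) gate parameters."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])         # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])         # reset gate
    c = np.tanh(W["c"] @ x + U["c"] @ (r * h) + b["c"])   # candidate state
    return (1 - z) * h + z * c                            # new hidden state

# Toy dimensions (illustrative): N-dimensional embeddings in, hidden size 8.
rng = np.random.default_rng(0)
N, H = 6, 8
W = {k: rng.standard_normal((H, N)) * 0.1 for k in "zrc"}
U = {k: rng.standard_normal((H, H)) * 0.1 for k in "zrc"}
b = {k: np.zeros(H) for k in "zrc"}

# Run the cell over a sequence of r = 4 embedded image-beam features;
# the final hidden state would feed a classifier for the link status.
h = np.zeros(H)
for t in range(4):
    h = gru_step(rng.standard_normal(N), h, W, U, b)
print(h.shape)  # (8,)
```

The gating structure is what lets the cell carry information across the observation interval, which is why GRUs suit learning patterns that precede a blockage.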
B. Proposed Blockage Prediction Solution
This section takes a deeper dive into the three-component architecture of the machine learning algorithm shown in Fig. 3. The inner workings of the architecture are shown in Fig. 4 and detailed in the following three subsections.
1) Object detector:
The object detector in the proposed solution needs to meet two essential requirements: (i) detecting a wide range of objects and (ii) producing quick and accurate predictions. These two requirements have been addressed in many of the state-of-the-art object detectors. A good example of a fast and accurate object-detection neural network is the You Only Look Once (YOLO) detector, proposed first in [6] and then improved in [32]. The latest YOLO architecture, YOLOv3, is the best in terms of detection accuracy [33], and as such, we adopt it as the object detector in our proposed solution.
Fig. 4. A block diagram showing the proposed neural network. It shows the three main components of the architecture: (i) the object detector, (ii) the embedding component with a CNN network and a beam-embedding block, and (iii) the recurrent prediction network.
Choice of object detector:
The YOLOv3 detector is a fast and reliable end-to-end object detection system targeted for real-time processing. It is a fully convolutional neural network with a feature-extraction layer and an output-processing layer. Darknet-53 is the backbone feature extractor in YOLO, and the output-processing layer is similar to the Feature Pyramid Network (FPN). Darknet-53 comprises 53 convolutional layers, each followed by a batch normalization layer and a Leaky ReLU activation. In the convolutional layers, 1×1 and 3×3 filters are used. Instead of a conventional pooling layer, convolutional filters with stride 2 are used to downsample the feature maps. This prevents the loss of fine-grained features, as the layers learn to downsample the input during training. YOLO makes detections at 3 different scales in order to accommodate different object sizes, by using strides of 32, 16, and 8. This method of performing detection at different layers helps address the issue of detecting smaller objects. The features learned by the convolutional layers are passed on to the classifier, which generates the detection prediction. Since the prediction in YOLOv3 is performed using a convolutional layer consisting of 1×1 filters, the output of the network is a feature map consisting of the bounding-box co-ordinates, the objectness score, and the class prediction, see [33]. The list of bounding-box co-ordinates consists of the top-left co-ordinates and the height and width of the bounding box. In this work, we compute the center co-ordinates of each bounding box from its top-left co-ordinates and its height and width.

Integration of object detector:
Instead of building and training YOLOv3 from scratch, the proposed solution utilizes a pre-trained YOLO network and integrates it into its architecture with some minor modifications. First, the network architecture is modified to detect the objects of interest, e.g., cars, buses, trucks, trees, etc.; the number of those objects and their types (classes) are selected based on the target wireless environment in which the proposed solution is going to be deployed. For any choice of the number of objects and classes, the modification of the YOLOv3 architecture only affects the size of the classifier layer, which allows us to take advantage of the other trained layers. Second, the YOLOv3 network with the modified classifier is then fine-tuned using a dataset resembling the target wireless environment. This step adjusts the classifier and, as the name suggests, fine-tunes the rest of the architecture to be more attentive to the objects of interest.
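In standard YOLOv3, each of the 3 anchors per scale predicts 4 box offsets, 1 objectness score, and C class scores, so resizing the classifier for a custom class set amounts to changing the number of 1×1 output filters. A small sketch of this bookkeeping (the street-scenario class list is illustrative, not the paper's exact set):

```python
def yolo_head_filters(num_classes, anchors_per_scale=3):
    """Number of 1x1 output filters in a YOLOv3 prediction layer:
    each anchor predicts 4 box offsets + 1 objectness + C class scores."""
    return anchors_per_scale * (4 + 1 + num_classes)

# COCO-pretrained head: 80 classes -> 255 filters per scale.
print(yolo_head_filters(80))  # 255

# Hypothetical fine-tuning class set for a street scenario.
classes = ["person", "car", "bus", "truck", "tree", "lamp post"]
print(yolo_head_filters(len(classes)))  # 33
```

Only this final layer changes size; all Darknet-53 backbone weights carry over to fine-tuning.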
2) Bounding Box Extraction and Beam Embedding:
The prediction function relies on dual-modality observed data, i.e., visual and wireless data. Although such data is expected to be rife with information, its dual nature brings about a heightened level of difficulty from the learning perspective. In an effort to overcome that difficulty, the proposed solution incorporates an embedding component that processes the extracted bounding-box values and the beam indices separately, as shown in the embedding component of Fig. 4. It transforms them into the same N-dimensional space before they are fed to the next component. For beam indices in the input sequence, the adopted embedding is simple and does not require any training. It generates a lookup table of |F| real-valued vectors b[t] ∈ R^N, where t ∈ {τ − r + 1, . . . , τ}. The elements of each vector are randomly drawn from a Gaussian distribution with zero mean and unit standard deviation. The bounding boxes output by the object detector undergo a simple transform-and-stack operation. In particular, each bounding box is transformed into a 6-dimensional vector comprising the center coordinates [x_cent, y_cent], the bottom-left coordinates [x_bl, y_bl], and the top-right coordinates [x_tr, y_tr]. The coordinates are normalized to fall in the interval [0, 1]. Collectively, they help mark the exact location of an object in the scene. Then, the transformed bounding boxes of one image (or video frame) are stacked to form one high-dimensional vector d̃[t] ∈ R^{M×6}, where M is the number of objects detected in an image and t ∈ {τ − r + 1, . . . , τ}. Since the solution is proposed for dynamic wireless environments, the number of objects in each image is not fixed, resulting in a variable-length d̃[t]. Therefore, d̃[t] is padded by N − M zeros to transform it into a fixed-length vector d[t] ∈ R^{N×6}.
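A minimal sketch of the two embeddings described above. The beam embedding is a fixed Gaussian lookup table; the bounding-box embedding stacks per-object 6-vectors and zero-pads the variable number of detections up to a fixed row count. The exact padded shape is not fully specified in the text, so the fixed-row-count reading is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)

def beam_embedding_table(num_beams, dim):
    """Training-free beam embedding: a lookup table of |F| vectors whose
    elements are drawn from a zero-mean, unit-variance Gaussian."""
    return rng.standard_normal((num_beams, dim))

def embed_bounding_boxes(boxes, max_objects):
    """Transform-and-stack: each detected object contributes a 6-dim vector
    [x_cent, y_cent, x_bl, y_bl, x_tr, y_tr] (coordinates assumed already
    normalized to [0, 1]); the variable-length stack is zero-padded to a
    fixed number of rows so every frame yields a fixed-size d[t]."""
    d = np.zeros((max_objects, 6))
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 6)
    n = min(len(boxes), max_objects)
    d[:n] = boxes[:n]
    return d
```

A beam index t then selects row `table[beam_index]` as its N-dimensional embedding b[t].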
3) Recurrent prediction:
CNNs inherently fail to capture sequential dependencies in input data; thereby, they are not expected to learn the relations within a sequence of embedded features. To overcome this, the third component of the proposed architecture utilizes Recurrent Neural Networks (RNNs) and performs future blockage prediction based on the learned relations among those features. In particular, the recurrent component has two layers of Gated Recurrent Units (GRUs) separated by a dropout layer. These two layers are followed by a fully-connected layer that acts as a classifier. The recurrent component receives a sequence of length r of bounding-box and beam embeddings, i.e., a sequence of the form {d[τ − r + 1], . . . , d[τ], b[τ − r + 1], . . . , b[τ]}. Hence, it implements r GRUs per layer. The output of the last unit in the second GRU layer is fed to the classifier to predict the future link status ŝ_u.

C. Proactive Hand-off
A major advantage of proactive blockage prediction is that it enables mitigation measures for LOS-link blockages in small-cell high-frequency wireless networks, such as proactive user hand-off. The predictions of a vision-aided blockage prediction algorithm could serve as a means to anticipate blockages and re-assign users based on LOS-link availability. To illustrate this, the deep learning architecture presented in Section IV-B is deployed in a simple network setting, which embodies the setting adopted in Section III. Two adjacent small-cell high-frequency basestations are assumed to operate in the same wireless environment. They are both equipped with RGB cameras that monitor the environment, and they each run a copy of the proposed deep architecture. A common central unit is assumed to control both basestations and to have access to their running deep architectures. Each user in the environment is connected to both basestations but is only served by one of them at any time, i.e., both basestations keep a record of the user's best beamforming vector at any coherence time, but only one of them is serving that user. The objective in this setting is to learn two blockage-prediction functions and use their predictions in the central unit to perform proactive user hand-off. More formally, we aim to learn the two prediction functions f_Θ(S^(n)) and f_Θ(S^(n′)) that maximize (10). Proposed user hand-off solution:
From (10), the functions f_Θ(S^(n)) and f_Θ(S^(n′)) need to maximize the joint conditional probability of successful link-blockage prediction. Such a requirement, albeit accurate, may not be computationally practical, as it requires a joint learning process that may not scale well in an environment with multiple small-cell basestations. (It is important to note that: (i) the extension of the proposed solution to more than two basestations is straightforward, and (ii) the two-basestation example is used for clarity.) Thus, we propose to train two independent copies of the blockage-prediction architecture on two separate datasets, each of which is collected by one basestation. This choice could be formally translated into a conditional independence assumption in (10). More specifically, for the u-th user, the event of successful link-status prediction at basestation n, i.e., {ŝ_u^(n) = s_u^(n) | S_u^(n)}, is independent of that of the same user at basestation n′, i.e., {ŝ_u^(n′) = s_u^(n′) | S_u^(n′)}. The intuition behind this assumption is rooted in the camera orientation at each basestation; each camera could view the environment from a different view-angle, which could result in different object positions, object orientations, motion directions, and image backgrounds. The trained deep architectures are deployed once they reach some satisfying generalization performance. At any time instance, the two architectures feed their predictions to the central unit, and the unit uses them to decide whether a user should be handed off or not (i.e., z_{nn′} = 1, ∀ n, n′ ∈ {1, 2} and n ≠ n′). A hand-off is only initiated when the LOS link at the serving basestation is predicted to be blocked while the LOS link at the other basestation is predicted to be maintained.

V. EXPERIMENTAL SETUP
To evaluate the proposed solution for both blockage prediction and proactive hand-off, this section discusses the communication scenario considered for the evaluation experiments, the process of generating the development and evaluation datasets, the evaluation metrics used to assess the performance, and the training procedure of the proposed two-stage neural network.
A. Communication Scenario and Datasets
We first describe the communication scenario used for our development and evaluation experiments. Then, we give a detailed description of the two datasets generated from that scenario.
1) Scenario description:
The scenario considered in this paper is the ViWi multi-user scenario "ASUDT1 28" [21], [25], which is an outdoor mmWave communication environment built using a game engine and ray-tracing software. It is developed using the ViWi data-generation framework. The scenario depicts a typical downtown street with its various elements: vehicles, pedestrians, lamp posts, skyscrapers, etc.; see Fig. 5.

Fig. 5. A top view (a) and a perspective view (b) of the simulated outdoor scenario. It is modeled after a busy downtown street with a variety of moving and stationary objects, such as cars, buses, trucks, pedestrians, and high-rises. The view also depicts the two basestations.

Each vehicle on the street represents a possible user, and large vehicles, like buses and trucks, act as dynamic blockages to smaller ones, i.e., cars. The scenario has a total of 60 vehicles: 2 trucks, 8 buses, and 50 cars, all moving at different speeds. The scenario also considers two small-cell basestations operating at 28 GHz. They are set 80 m apart on opposite sides of the street, as Fig. 5 shows. Each one is 4.5 m above the ground and equipped with three differently-oriented cameras providing two side views and a central view of the street. Using the ViWi data-generation script, a raw dataset of 4-tuples is generated, where each tuple has co-existing vision-wireless information concerning one user at a certain time instance in the environment, i.e., for a user u and time instance t, a 4-tuple consists of an image, mmWave channels, a link status, and a location. This raw dataset is henceforth referred to as the seed dataset. It is further processed to obtain two development and evaluation datasets. The first is for training and testing the object detector, while the other is used to train and evaluate the proposed deep neural network that addresses the blockage prediction and user hand-off problems. The following two subsections shed more light on each of these two datasets.
2) Tiny Object Detection Dataset:
A dataset of samples selected from the seed dataset generated using the ViWi ASUDT1 28 scenario is used to fine-tune the COCO pre-trained YOLOv3. As mentioned above, there are a total of six cameras, three per basestation. Each of these cameras covers a different portion of the street and views the objects from a different orientation. Image samples are selected from both the central cameras and the side cameras to incorporate the difference in orientations, resulting in a diverse dataset. The ViWi ASUDT1 28 scenario is not object-detection ready, i.e., it does not contain the labels and bounding boxes of the objects present in the scene. Therefore, the samples are manually labeled to create the object-detector dataset. This dataset has bounding-box labels for various cars, trucks, and buses. It is split into training and validation sets.
3) Blockage Prediction Dataset:
To generate a blockage and user hand-off dataset, the seed dataset undergoes a processing pipeline that eventually generates a dataset of observed-sequence and label pairs, i.e., D. Recognizing that the proposed architecture requires sequences of image-beam inputs, the first step in the pipeline is to generate beamforming vectors from the mmWave channels in the seed dataset. The result is a dataset equal in size to the seed dataset but with beamforming vectors instead of mmWave channels. The second step in the pipeline then creates the input sequences and their corresponding labels. For every user in the environment, every 13 consecutive 3-tuples of image, beam, and link status are stacked to form one raw sequence. In that sequence, the first 8 images (i.e., r = 8) and beams are paired to form the observed sequence S_u^(β), where β ∈ {n, n′}. The last 5 (i.e., r′ = 5) link statuses in the raw sequence are used to construct the label of the observed sequence as described by (5). The final result of the second pipeline step is a large collection of observed sequences and their labels, a little shy of 2 million sequences. That large collection comprises multiple cases in terms of labels, the majority of which have a LOS future link status (i.e., s_u = 0). Thus, the third pipeline step attempts to reduce the size of the dataset and balance out its labels. This is done by randomly and equally sampling observed sequences. More specifically, approximately equal numbers of observed sequences with LOS labels (henceforth referred to as non-pivotal sequences) and with NLOS labels (henceforth referred to as pivotal sequences) are randomly sampled from the large collection. These sampled sequences are divided equally among the 6 cameras (2 basestations) in the scenario, resulting in equal numbers of pivotal and non-pivotal sequences per camera.
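The windowing in the second pipeline step can be sketched as follows. Treating the label as 1 whenever any of the r′ future link statuses is NLOS is an assumption about the labeling rule in (5), which is not reproduced in this section.

```python
def make_sequences(samples, r=8, r_future=5):
    """Slide a window of r + r_future consecutive (image, beam, link_status)
    tuples over one user's trace: the first r tuples form the observed
    sequence, and the last r_future link statuses form the label.  Here the
    label is 1 (future blockage) if any of the r_future statuses is NLOS,
    one natural reading of the labeling rule in (5)."""
    sequences = []
    win = r + r_future
    for start in range(len(samples) - win + 1):
        window = samples[start:start + win]
        observed = [(img, beam) for img, beam, _ in window[:r]]
        label = int(any(status == 1 for _, _, status in window[r:]))
        sequences.append((observed, label))
    return sequences
```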
The final outcome of the third step is the blockage-prediction and user hand-off dataset, i.e., D = {(S_u, s_u)}_{u=1}^U, where U = 54000. This dataset is split into training and validation sets.

B. Evaluation Metrics
The success of future blockage prediction is heavily dependent on the performance of the object detector; correctly detecting the objects in the scene, especially the user and the blockage, is of paramount importance. The evaluation metric for the object detection task is different from that of blockage prediction. Therefore, in this subsection, we present the metrics used for evaluating the object detection task and the future link-blockage prediction task separately.
1) Object Detection Evaluation Metric:
The objectives of an object detection model are twofold: (i)
Classification:
The object detection model needs to identify the objects present in the image and the respective class of each object. (ii)
Localization:
The object detection model also needs to predict a bounding box around each detected object, thereby locating the objects in the image. Therefore, it is necessary to evaluate the performance of both the classification of different objects and their localization using bounding boxes in the image. Several mathematical tools, such as the Intersection over Union (IoU) and the confidence score, are used to quantify the quality of the object detector outputs. The confidence score is the probability that a bounding box predicted by the model contains an object. The IoU is the area of the intersection divided by the area of the union of the two bounding boxes, i.e., the ground-truth box (B_g) and the predicted bounding box (B_p). It can be defined as

IoU = area(B_g ∩ B_p) / area(B_g ∪ B_p).   (11)

Both the IoU and the confidence score are utilized to evaluate the detection performance, more specifically to calculate the precision, recall, average precision (AP), and mean average precision (mAP) scores of the model. For more details on these metrics, the reader is referred to [34], [35].
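Eq. (11) maps directly to code for axis-aligned boxes represented by their min and max corners (the corner representation is an assumption of this sketch):

```python
def iou(box_g, box_p):
    """Intersection over Union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max), per Eq. (11)."""
    ix1, iy1 = max(box_g[0], box_p[0]), max(box_g[1], box_p[1])
    ix2, iy2 = min(box_g[2], box_p[2]), min(box_g[3], box_p[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    union = area_g + area_p - inter
    return inter / union if union > 0 else 0.0
```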
2) Blockage Prediction Evaluation Metric:
In this section, we present the evaluation metric used to assess the performance of our proposed blockage prediction solution. After training, we evaluate each network on the validation set. The primary method of evaluating the model performance is the top-1 accuracy, which is defined as follows:
Acc_top-1 = (1/U′) Σ_{u=1}^{U′} 1{ŝ_u = s_u},   (12)

where ŝ_u is the predicted blockage value for user u when provided with the sequence of observations S_u as defined in Section III, s_u is the ground-truth value of the same data sample, U′ is the total number of data samples in the validation dataset, and 1{.} is the indicator function, which equals 1 only when the provided condition is satisfied. We also resort to precision and recall for a more detailed look into the blockage-prediction performance.

TABLE I
DESIGN AND TRAINING HYPER-PARAMETERS

Design
  Number of GRUs per layer (r):
  Embedding dimension (N):
  Hidden-state dimension:
  Number of classes:
  Dropout percentage: 0.3
Training
  Optimizer: ADAM
  Learning rate:
  Batch size:
  Number of training epochs:
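A minimal PyTorch sketch of the recurrent predictor of Section IV-B3 configured with the Table I settings that are stated (dropout of 0.3, ADAM optimizer); the embedding dimension, hidden-state dimension, sequence length, and learning rate below are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class RecurrentPredictor(nn.Module):
    """Two GRU layers with dropout in between, followed by a fully-connected
    classifier applied to the output of the last recurrent unit."""
    def __init__(self, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        # num_layers=2 with dropout=0.3 places dropout between the two GRU layers
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=2,
                          batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, sequence, embed_dim)
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :])     # classify from the last unit

model = RecurrentPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is illustrative
criterion = nn.CrossEntropyLoss()
```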
C. Network Training
In this subsection, we present the training methodology for both the object detector network and the proposed blockage prediction solution. All experiments were performed on a single NVIDIA RTX Titan GPU using the PyTorch deep learning framework.
1) Fine-tuning object detector:
The parameters of YOLOv3 are pre-trained on the COCO dataset [36], a large-scale image dataset with around 80 object classes. However, the blockage-prediction dataset mainly contains the few object classes of interest mentioned in Section V-A2. The COCO pre-trained YOLOv3 object detector is fine-tuned on the tiny object-detection dataset to improve its performance in detecting the objects of interest. To achieve a robust object detection model, separate models are trained for each camera in the scenario. The YOLOv3 detector is trained with the Adam optimizer. The bounding boxes are extracted using fixed confidence and non-maximum suppression (NMS) thresholds.
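The confidence and NMS thresholding step can be sketched as a greedy filter. The box representation and the greedy formulation are assumptions of this sketch, and the specific threshold values are not fixed here.

```python
def filter_detections(boxes, scores, conf_thresh, nms_thresh):
    """Keep boxes whose confidence exceeds conf_thresh, then apply greedy
    non-maximum suppression: repeatedly take the highest-scoring box and
    drop any remaining box whose IoU with it exceeds nms_thresh.  Boxes
    are (x_min, y_min, x_max, y_max)."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    cand = sorted(((s, b) for s, b in zip(scores, boxes) if s >= conf_thresh),
                  key=lambda t: t[0], reverse=True)
    kept = []
    while cand:
        _, best = cand.pop(0)
        kept.append(best)
        cand = [(s, b) for s, b in cand if iou(best, b) <= nms_thresh]
    return kept
```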
2) Training the blockage-prediction architecture:
This paper studies the ability of the proposed deep architecture to perform blockage prediction using sequences of RGB images and observed beams. The blockage prediction dataset described in Section V-A3 is used to train the second component of the architecture, the recurrent predictor, while the first component is kept fixed after the fine-tuning of the object detector. The training hyper-parameters are listed in Table I, and a cross-entropy loss is used to guide the training process [31]. To highlight the potential of vision-aided link-blockage prediction, we develop a baseline model that performs the same task without the visual data, using only the beam sequences. This baseline is simply the recurrent component of the proposed solution described in Section IV-B3. It takes in the 8-beam sequences and predicts the link status. The training of this baseline solution is also performed with a cross-entropy loss.

Fig. 6. A visualization of the output of the YOLOv3 object detector: (a) input image; (b) output of the pre-trained detector, with a bounding box missing; (c) output of the fine-tuned detector. In this example, the user and the blockage are the red bus and the truck in the fourth and third lanes. The fine-tuned model clearly improves the object detection quality compared to the pre-trained one.

VI. PERFORMANCE EVALUATION
In this section, we first analyze the performance of the fine-tuned YOLOv3 object detector before delving into the performance of our proposed solution for future blockage prediction and proactive hand-off.
A. Object Detector Performance
The object detector plays a crucial role in the proposed architecture; it performs the first sub-task of identifying the relevant objects in a wireless environment. The correct prediction of future blockages is heavily dependent on the performance of this object detector. Although the YOLOv3 pre-trained on the COCO dataset can detect the objects of interest in our dataset, i.e., the cars, buses, and trucks, its performance is relatively poor on the blockage-prediction dataset. Therefore, as described in Section V-C1, the pre-trained YOLOv3 is further fine-tuned on the tiny object-detection dataset. To show the difference in performance, Fig. 6 depicts an example output of both the COCO pre-trained YOLOv3 and the fine-tuned YOLOv3 model.

TABLE II
COMPARISON OF THE AP AND mAP SCORES OF THE COCO PRE-TRAINED AND VIWI FINE-TUNED YOLOV3 MODELS

YOLOv3 model           | Confidence threshold | NMS threshold | AP (Car) | AP (Truck) | AP (Bus) | mAP
COCO Pre-Trained Model |                      |               |          |            |          |
ViWi Fine-Tuned Model  | 0.6                  |               | 0.8444   | 0.8968     | 0.9929   | 0.9114
In this specific scene, the user is the red bus, and the blockage is the truck in the third lane.
This particular example is very interesting for our analysis, as the user is about to get blocked by the truck in the near future. The pre-trained YOLOv3 model fails to detect both the user and the blockage, as observed in Fig. 6(b), which the fine-tuned model overcomes. Such a failure propagates downstream to the next component in the proposed architecture, and it is likely to lead to a wrong blockage prediction, as the recurrent component is left completely oblivious to the presence of the user and its blockage.

To quantify the performance of the object detector, we calculate the AP and the mAP on the validation dataset. The YOLOv3 model is separately trained and validated for the images coming from the central and side cameras. This is done to ensure robust object detection and classification performance. Table II shows the average AP of the pre-trained and fine-tuned models on all classes of relevant objects, i.e., car, bus, and truck, across all the cameras. The performance of the fine-tuned model surpasses that of the pre-trained one, with the fine-tuned YOLOv3 achieving a substantially higher mAP. The pre-trained model's performance likely drops on the validation dataset because of the shift in data distribution between COCO and the ViWi dataset. Such an improvement in detector performance significantly reflects on the overall blockage prediction and proactive hand-off capabilities of our proposed solution, as discussed in the next few sections.

Fig. 7. The confusion matrices for the task of future blockage prediction for both (a) the baseline (beam-only) solution and (b) the proposed vision-aided deep learning approach. The proposed solution achieves a precision and recall of 84% and 92%, respectively, compared to 67% and 74% for the baseline (beam-only) solution. This highlights the accurate prediction capability of the proposed vision-aided approach.

B. Blockage Prediction
In this subsection, we present a detailed analysis of our proposed deep architecture for future blockage prediction. As mentioned in Section V-A3, there are two basestations, each equipped with three cameras pointing in different directions, covering the whole street. Each of the three cameras presents a different set of challenges to the prediction task. In order to develop a thorough understanding of the proposed architecture, it is trained and tested on the data samples of each camera separately. The results of all those tests are presented and discussed in the following hierarchy: we first analyze the overall performance on all validation samples (i.e., all cameras); then we narrow the scope of the discussion to the camera-wise performance; and we conclude this section with a discussion of the performance of the proposed architecture on the pivotal sequences.
1) Overall Analysis:
As mentioned above, six copies of the proposed architecture are trained, one for each camera. In Fig. 7, we present the confusion matrices for both the baseline and the proposed solutions. Given an almost balanced validation set, i.e., an equal number of samples per camera for each label, the figure shows that beamforming vectors alone do not reveal enough information about future blockages; the baseline solution achieves a modest top-1 accuracy, leaving significant room for improvement. The proposed solution, on the other hand, achieves a considerable improvement in accuracy over the baseline solution. This highlights the importance of visual data in tackling the link-blockage prediction problem. Accuracy is a holistic measure that may not reflect the intricacies of blockage prediction. Hence, we take a closer look at the performance by evaluating the precision-recall performance. The proposed deep architecture achieves a precision of 84% at a recall of 92%. This reflects a high level of trustworthiness in the blockage predictions of the architecture; it successfully identifies 92% of all blockage cases while maintaining accurate predictions in mixed LOS and NLOS cases. In contrast, the precision of the baseline model falls to 67% at a recall of only 74%, which re-affirms the value of vision for proactive blockage prediction.

Fig. 8. The far-peripheral view of the street captured from both basestations: representative images from (a) Camera 3 of Basestation 1 and (b) Camera 6 of Basestation 2. The figure highlights the limited-visibility issue encountered as the distance of the user from the basestation increases, which makes it even more challenging to predict future blockages correctly.
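For reference, top-1 accuracy (Eq. (12)), precision, and recall all follow from the binary confusion-matrix counts, with NLOS (blockage) as the positive class. The counts below are illustrative values chosen to reproduce a precision near 84% and a recall of 92%, not the paper's actual tallies.

```python
def prediction_metrics(tp, fp, fn, tn):
    """Top-1 accuracy, precision, and recall from binary confusion-matrix
    counts, treating NLOS (blockage) as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```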
2) Camera-wise Analysis:
There are a total of six cameras capturing images in the dataset, each covering a different segment of the street. Cameras 1 and 3 at basestation 1 and cameras 4 and 6 at basestation 2 capture the peripheral views of the street; see Fig. 8 for examples. In Fig. 9, we present the prediction accuracy per camera for the baseline and the proposed solutions. This accuracy is reported for the pivotal sequences only, since they are the ones that present a clear challenge to a wireless network. The proposed deep architecture performs significantly better than the baseline approach for every camera; on average, it exceeds the performance of the baseline solution by a clear margin. However, an interesting observation concerns the performance of the central cameras, i.e., cameras 2 and 5. Compared to the peripheral cameras, the prediction accuracy slightly degrades for the central cameras. The general expectation is that the solution would perform better for the central cameras than for the peripheral cameras, because vehicular motion in the central cameras is parallel to the image plane, making any displacement clearly visible. That dip in accuracy is instead due to the difference in data distribution between the cameras: a smaller fraction of the validation samples of cameras 2 and 5 are pivotal. This slight data imbalance for these two cameras leads to a lower prediction accuracy compared to the other cameras.

Fig. 9. The individual validation accuracy on the pivotal sequences for each camera, for both the baseline (beam-only) solution and the proposed vision-aided approach. The future blockage prediction performance of the proposed solution surpasses that of the beam-only solution for all cameras, highlighting the efficacy of incorporating the visual component in the proposed solution.
3) Discussion on pivotal sequences:
In the previous two subsections, the performance of the proposed solution for future link-blockage prediction was presented for all types of data samples. A common observation is that the proposed deep architecture achieves a clear improvement over the baseline solution, especially for the pivotal samples, for which it achieves a high blockage-prediction accuracy on the validation set. Despite this good performance, we recognize that the proposed solution still fails to predict a fraction of the blockages, and we seek answers for why that is the case. Unlike CNNs, it is difficult to study and visualize the reasons behind the predictions of the recurrent component. Thus, any further improvement to the proposed solution warrants a detailed analysis of the failing cases. In particular, we analyze the architecture's failures, and we present some probable causes of those failures. For this specific study, we only consider the pivotal sequences from the front cameras, as they represent the main target of a proactive blockage-prediction solution.

Fig. 10. The impact of the future blockage instance on the proposed vision-aided model's prediction performance. The top-1 accuracy versus the future blockage instance (Future-1 through Future-5) is presented for the three cameras at each of the two basestations. It is observed that, generally, the further in the future the blockage happens, the more difficult it is to predict.
We identify two major reasons behind failed predictions, namely the instant at which the blockage occurs and object-detection failure. As mentioned in Section V-A3, we consider a sequence of future instances for generating the ground-truth link status of a user. We observe that the performance of the proposed solution varies with the instant at which the blockage occurs, e.g., whether the blockage happens at the first or the fifth future instance. In Fig. 10, we present the top-1 validation accuracy versus the position of the blockage in the future window for all three cameras per basestation. The top-1 validation accuracy of the proposed solution increases as the blockage happens closer in time to the beginning of the future window. In Fig. 11(a), an example sequence is shown where the blockage happens at the 5th future instance. It is observed that even at the 1st future instance, there is a significant distance between the user and the blockage. Generally, the further into the future we try to predict, the more difficult it becomes for the deep neural network architecture to generalize.

The second reason behind failed predictions can be traced back to the output of the object detector, i.e., the bounding-box values. If the object detector fails to detect a relevant object in any frame of the input sequence, the deep architecture is very likely to struggle; this is a consequence of the sequential nature of the proposed architecture. In Fig. 11(b), an example of such a case is highlighted: the object detector fails to detect the incoming blockage, leading to a misprediction. This particular cause of failed predictions could be addressed through an end-to-end design, where blockage prediction is learned by training all components of the deep architecture together.

Fig. 11. An illustration of the main reasons that may lead to failed predictions: (a) impact of the user-blockage distance, where a significant gap between the blockage and the user at the observed time instants results in a failed prediction; (b) object-detection failure, where failing to detect the blocking bus results in failed future predictions.

C. Proactive Hand-off Prediction
As discussed in Sections III-B and IV-C, successful proactive hand-off can be viewed through the lens of blockage prediction. Two copies of the proposed architecture are deployed at basestations 1 and 2. The central unit receives the predictions for cameras 3 and 4, which have overlapping fields of view. To evaluate the performance of proactive hand-off, we select a subset of sequences from the blockage prediction dataset. The subset is selected such that, for each user, there are two conjugate sequences coming from basestations 1 and 2, i.e., the user gets blocked (NLOS) for one basestation and remains LOS for the other. Furthermore, the conjugate sequences in the subset are divided into two categories based on the serving basestation. The first category considers the case where a user is being served by basestation 1 and requires a hand-off to basestation 2 in the future (n = 1, n′ = 2), while the second category is the opposite case (n = 2, n′ = 1).

In Table III, we present the performance of both the baseline and the proposed solutions for user hand-off. We present both the overall hand-off accuracy and the more fine-grained NLOS and LOS prediction accuracy for each case. The proposed solution achieves a markedly higher hand-off prediction accuracy for both categories, whereas the beam-only solution achieves 73.41% and 72.67% for categories 1 and 2, respectively. This reflects the good proactive blockage-prediction performance the deep architecture is capable of achieving. The last four columns of Table III confirm those results; they show the LOS and NLOS prediction accuracy per basestation, where the proposed solution achieves an approximately consistent improvement over the baseline.

TABLE III
PROACTIVE HAND-OFF

                               | Hand-off Accuracy             | Blockage Prediction Accuracy
Model                          | n = 1, n′ = 2 | n = 2, n′ = 1 | BS 1 NLOS | BS 1 LOS | BS 2 NLOS | BS 2 LOS
Beam-only                      | 73.41%        | 72.67%        | 77.25%    | 74.82%   | 78.81%    | 76.24%
Vision-aided Proposed Solution |               |               |           |          |           |

VII. CONCLUSION
Enabling low-latency and high-reliability high-frequency wireless networks calls for the development of innovative solutions that overcome the challenges of LOS-link blockages. In this paper, we propose a bimodal machine learning solution capable of learning future link blockages in multi-user communication environments. It is based on deep neural networks, more specifically an object detector and a GRU network, and it relies on observing sequences of consecutive RGB frames and mmWave beams. The proposed solution is capable of proactively predicting blockages, and hence it enables mitigation measures for LOS blockages, such as proactive user hand-off between basestations. This is demonstrated by developing and testing the proposed solution on a synthetic dataset of co-existing vision-wireless data, generated using the ViWi framework. The proposed solution achieves good blockage prediction performance, with a high overall average test accuracy that increases further when only pivotal sequences are considered. This accuracy reflects well on the user hand-off performance; a wireless network adopting the proposed solution at two different basestations is shown to achieve a high hand-off test accuracy. This not only emphasizes the importance of proactive blockage prediction for improving the reliability and latency performance of wireless networks but also sheds some light on the role machine learning and multi-modal data could play in shaping the future of those networks.

REFERENCES

[1] G. Charan, M. Alrabeiah, and A. Alkhateeb, "Vision-aided dynamic blockage prediction for 6G wireless communication networks," 2020.
[2] R. W. Heath, N. Gonzalez-Prelcic, S. Rangan, W. Roh, and A. M. Sayeed, "An overview of signal processing techniques for millimeter wave MIMO systems," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 3, pp. 436–453, 2016.
[3] T. S. Rappaport, Y. Xing, O. Kanhere, S. Ju, A. Madanayake, S. Mandal, A. Alkhateeb, and G. C. Trichopoulos, "Wireless communications and applications above 100 GHz: Opportunities and challenges for 6G and beyond," IEEE Access, vol. 7, pp. 78729–78757, 2019.
[4] J. G. Andrews, T. Bai, M. Kulkarni, A. Alkhateeb, A. Gupta, and R. W. Heath Jr, "Modeling and analyzing millimeter wave cellular systems," submitted to IEEE Transactions on Communications, arXiv preprint arXiv:1605.04283, 2016.
[5] M. Bennis, M. Debbah, and H. V. Poor, "Ultrareliable and low-latency wireless communication: Tail, risk, and scale," Proceedings of the IEEE, vol. 106, no. 10, pp. 1834–1853, 2018.
[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[8] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[9] H. Xu, A. Das, and K. Saenko, "R-C3D: Region convolutional 3D network for temporal activity detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017.
[10] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong, and J. C. Zhang, "What will 5G be?" IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1065–1082, 2014.
[11] M. Polese, M. Giordani, M. Mezzavilla, S. Rangan, and M. Zorzi, "Improved handover through dual connectivity in 5G mmWave mobile networks," IEEE Journal on Selected Areas in Communications, vol. 35, no. 9, pp. 2069–2084, 2017.
[12] M. Giordani, M. Mezzavilla, S. Rangan, and M. Zorzi, "Multi-connectivity in 5G mmWave cellular networks," 2016, pp. 1–7.
[13] D. Aziz, J. Gebert, A. Ambrosy, H. Bakker, and H. Halbauer, "Architecture approaches for 5G millimetre wave access assisted by 5G low-band using multi-connectivity," 2016, pp. 1–6.
[14] N. H. Mahmood and H. Alves, "Dynamic multi-connectivity activation for ultra-reliable and low-latency communication," 2019, pp. 112–116.
[15] J. Choi, W. Lee, J. Lee, J. Lee, and S. Kim, "Deep learning based NLOS identification with commodity WLAN devices,"
IEEE Transactions on Vehicular Technology , vol. 67, no. 4, pp. 3295–3303, 2018.[16] M. Alrabeiah and A. Alkhateeb, “Deep learning for mmwave beam and blockage prediction using sub-6 GHz channels,”
IEEE Transactions on Communications , vol. 68, no. 9, pp. 5504–5518, 2020.[17] C. Huang, A. F. Molisch, R. He, R. Wang, P. Tang, B. Ai, and Z. Zhong, “Machine learning-enabled LOS/NLOSidentification for MIMO systems in dynamic environments,”
IEEE Transactions on Wireless Communications , vol. 19,no. 6, pp. 3643–3657, 2020. [18] A. Alkhateeb, I. Beltagy, and S. Alex, “Machine learning for reliable mmwave systems: Blockage prediction and proactivehandoff,” in , Nov 2018, pp. 1055–1059.[19] M. Alrabeiah, A. Hredzak, and A. Alkhateeb, “Millimeter wave base stations with cameras: Vision aided beam and blockageprediction,” arXiv preprint arXiv:1911.06255 , 2019.[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conferenceon computer vision and pattern recognition , 2016, pp. 770–778.[21] M. Alrabeiah, A. Hredzak, Z. Liu, and A. Alkhateeb, “ViWi: A deep learning dataset framework for vision-aided wirelesscommunications,” arXiv preprint arXiv:1911.06257 , 2019.[22] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversationalspeech recognition,” in , 2016,pp. 4960–4964.[23] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in
Internationalconference on machine learning , 2014, pp. 1764–1772.[24] M. Alrabeiah, J. Booth, A. Hredzak, and A. Alkhateeb, “ViWi vision-aided mmwave beam tracking: Dataset, task, andbaseline solutions,” arXiv preprint arXiv:2002.02445
IEEE Transactions on Wireless Communications , vol. 14, no. 11, pp. 6481–6494, 2015.[27] M. Alrabeiah and A. Alkhateeb, “Deep learning for TDD and FDD massive MIMO: Mapping channels in space andfrequency,” in , 2019, pp. 1465–1470.[28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprintarXiv:1409.1556 , 2014.[29] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,”
IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 39, no. 6, pp. 1137–1149, 2017.[30] M. Alrabeiah, Y. Zhang, and A. Alkhateeb, “Neural networks based beam codebooks: Learning mmwave massive MIMObeams that adapt to deployment and hardware,” arXiv preprint arXiv:2006.14501
Proceedings of the IEEE conference on computervision and pattern recognition , 2017, pp. 7263–7271.[33] ——, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767 , 2018.[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al. ,“Imagenet large scale visual recognition challenge,”
International journal of computer vision , vol. 115, no. 3, pp. 211–252,2015.[35] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,”
International journal of computer vision , vol. 88, no. 2, pp. 303–338, 2010.[36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick, “Microsoft COCO: Commonobjects in context,” in