Human-Perception-Oriented Pseudo Analog Video Transmissions with Deep Learning
Xiao-Wei Tang, Student Member, IEEE, Xin-Lin Huang*, Senior Member, IEEE, Fei Hu, Member, IEEE, and Qingjiang Shi
Abstract—Recently, pseudo analog transmission has gained increasing attention due to its ability to alleviate the cliff effect in video multicast scenarios. Existing pseudo analog systems are optimized under the minimum mean squared error (MMSE) criterion; however, their power allocation strategies do not take perceptual video quality into consideration. In this paper, we propose a human-perception-based pseudo analog video transmission system named ROIC-Cast, which aims to intelligently enhance the transmission quality of the region-of-interest (ROI) parts. Firstly, a classic deep-learning-based saliency detection algorithm is adopted to decompose the continuous video sequences into ROI and non-ROI blocks. Secondly, an effective compression method is used to reduce the amount of side information generated by the ROI extraction module. Then, the power allocation scheme is formulated as a convex problem, and the optimal transmission power for both ROI and non-ROI blocks is derived in closed form. Finally, simulations are conducted to validate the proposed system by comparing it with several existing systems, e.g., KMV-Cast, SoftCast, and DAC-RAN. The proposed ROIC-Cast achieves over 4.1 dB peak signal-to-noise ratio gains on the ROI compared with the other systems, given the channel signal-to-noise ratio as -5 dB, 0 dB, 5 dB, and 10 dB, respectively. This significant performance improvement is due to the automatic ROI extraction, high-efficiency data compression, as well as adaptive power allocation.
Index Terms—Pseudo analog transmissions, Human perception, Deep learning, Video multicast.
I. INTRODUCTION

With the development of mobile terminals, a large number of wireless video applications have emerged, such as augmented reality/virtual reality (AR/VR), unmanned aerial vehicle (UAV) video surveillance, 4K live video streaming, and so on [1]. In this paper, we mainly focus on the video multicast scenario characterized by low delay and high quality. The two biggest challenges facing the video multicast scenario are the rigorous delay requirement and the drastically fluctuating channel conditions [2]. The conventional digital systems typically divide a video clip into groups of pictures (GoPs) and compress the GoPs into a bit stream through a standard video encoder (e.g., JPEG
Xiao-Wei Tang (email: [email protected]) is with the Department of Control Science and Engineering, Tongji University, Shanghai 201804, China. Xin-Lin Huang (email: [email protected]) is with the Department of Information and Communication Engineering, Tongji University, Shanghai 201804, China (corresponding author). Fei Hu (email: [email protected]) is with the Department of Electrical and Computer Engineering, University of Alabama, Tuscaloosa, AL 35487 USA. Qingjiang Shi (email: [email protected]) is with the School of Software Engineering, Tongji University, Shanghai 201804, China.

which represents different operation states. Therefore, in this paper we will investigate how to enhance the transmission quality of the ROI part in content-focused missions through the pseudo analog transmission technology.

How to accurately extract the ROI part from the transmitted image/video is the primary problem to be solved in this research. Inspired by the development of saliency detection in video compression, we propose to apply saliency detection in pseudo analog video transmissions to extract the salient objects for subsequent ROI coding [11]. Conventional saliency detection methods often adopt hand-crafted low-level features to differentiate the foreground and background with some specific separation models [12-14], including the principal component analysis (PCA) based low-rank decomposition model [12], the Gaussian mixture model [13], and other color/texture-based models [14]. Generally, these methods have low accuracy and can only work on specific data sets. Recently, deep learning (DL) has been used for saliency detection due to its fast processing speed, high detection accuracy, as well as strong adaptability [15-17]. In [15], a fully convolutional neural network called DeepFix was proposed to automatically learn the features in a hierarchical fashion and predict the location of the salient objects in an end-to-end manner. In [16], Szegedy et al.
investigated various pre-trained models in terms of feature extraction for saliency detection, including AlexNet, VGG-16, and GoogLeNet. In [17], a regression-based saliency detection method named YOLOv2 was proposed, which used the whole image as the input of the network and generated the positions of the salient objects.

In this paper, we will propose a novel ROI-Characterized pseudo analog multicast system named ROIC-Cast to enhance the transmission quality of the ROI part in content-focused missions. In the proposed ROIC-Cast, the video frames will first be decomposed into ROI and non-ROI blocks based on the DL-based saliency detection model. The non-ROI blocks will be treated as the background, while the ROI blocks are delivered via the routing paths with better channel quality and higher transmission power. To the best of our knowledge, this is the first work that seamlessly integrates a DL algorithm into pseudo analog video transmission. To sum up, the main difference between our work and the existing work is that we optimize the content-focused system by taking the perceptual video quality of the ROI part into consideration, rather than relying solely on the MMSE criterion.
Contributions: Our contributions can be summarized as the following three-fold:
1) The classic YOLOv2 algorithm is adopted in the ROI extraction to automatically decompose the video sequences into ROI and non-ROI blocks at the transmitter.
2) An effective compression method is proposed to reduce the amount of side information generated by ROI detection.
3) The power allocation scheme is formulated as a convex problem where the optimal transmission power for both ROI and non-ROI blocks is given in closed form.
The rest of the paper is organized as follows. Section II provides a brief review of related work, including YOLOv2 and pseudo analog transmission technology. In Section III, the three main components of the proposed ROIC-Cast framework are described in detail, including ROI extraction, side information compression, and unequal power allocation. The implementation details and simulation results, as well as the comparisons with conventional schemes, are covered in Section IV, followed by the conclusions in Section V. All the variables/notations to be used in the remaining sections are listed in Table I.
TABLE I
NOTATION LIST

Notation : Meaning
S : the grid size (the input is divided into S × S grid cells) in YOLOv2
(x, y) : the coordinate of the center of the bounding box
(w, h) : the width and height of the bounding box
confidence : the credibility of the detected object
B : the number of bounding boxes and confidences
B_i : the i-th DCT coefficient block
g_i : the power scaling factor of B_i
λ_i : the variance of B_i
P_t : the total transmission power
K_i : the correlation factor between B_i and the similar block
ℓ(K_i) : the correlation coefficient
M : the size of B_i
F : the whole input image
t_i : the i-th detected salient object
B_s : the first block of the detected object
B_e : the last block of the detected object
H : the height of the transmitted frame
W : the width of the transmitted frame
P_s : the transmission power allocated for side information
β : the corresponding SNR that meets the lowest BLER requirement
σ² : the variance of the channel noise
θ_i : the transmitted signal set of B_i
N_s : the total number of blocks in a frame
r_i : the received signal set of B_i
n : the additive white Gaussian noise
D_i : the reconstruction distortion of B_i
P_d : the transmission power allocated for DCT coefficients
P_dr : the transmission power allocated for ROI DCT coefficients
P_dnr : the transmission power allocated for non-ROI DCT coefficients
S(r) : the size of ROI blocks
S(nr) : the size of non-ROI blocks
η : the ratio between the average transmission power for non-ROI coefficients and ROI coefficients
N_r : the number of ROI blocks
N_nr : the number of non-ROI blocks

II. RELATED WORK
In this section, we will give a brief review of YOLOv2 and pseudo analog transmission technology. Firstly, we will briefly summarize the advantages of the YOLO series over other DL-based saliency detection algorithms, and highlight the differences among the YOLO series. Then, we will make a literature review of several well-known pseudo analog schemes, and introduce the power optimization strategies in these schemes.
A. YOLOv2 Algorithm
The YOLO series (including YOLOv1 [18], YOLOv2 [17], and YOLOv3 [19]) are state-of-the-art saliency detection algorithms proposed by Redmon et al. in [17-19]. The YOLO series treat saliency detection as a regression problem, in which the whole image is regarded as the input of the network. The positions, categories, and corresponding confidence probabilities of all salient objects contained in the input can be obtained through a single inference pass.

The YOLO series have many advantages over other DL-based saliency detection algorithms, e.g., R-CNN [20], Fast R-CNN [21], and Faster R-CNN [22]. Firstly, the training and detection processes in the YOLO series are implemented together in an end-to-end neural network, whereas the R-CNN series adopt two separate steps to obtain the candidate boxes. Therefore, the YOLO series are faster and more accurate than the R-CNN series; specifically, they are capable of processing 67 pictures per second while maintaining high detection accuracy. Secondly, the YOLO series have a lower background error detection rate compared to the R-CNN series. Thirdly, the YOLO series have stronger versatility: their detection rate on non-natural image objects is much higher than that of the R-CNN series.

The performance difference within the YOLO series is also obvious in terms of detection speed and detection rate. For example, YOLOv1 has higher detection accuracy for large objects than for small ones. Compared to YOLOv1, YOLOv2 can run at various image sizes, offering an easy tradeoff between speed and accuracy. YOLOv3 addresses the shortcoming of YOLOv1 and YOLOv2, which have low detection accuracy for small objects. However, the network structure of YOLOv3 is more complicated than that of YOLOv2, so its detection speed is not as fast. To achieve real-time wireless video communications, we choose the YOLOv2 algorithm to perform the saliency detection in this paper; its neural network structure is shown in Fig. 1.

The YOLOv2 network is an extended version of the Darknet-19 classification network, which is composed of 22 convolutional layers and 5 pooling layers (see Fig. 1). Unlike the R-CNN series, the output layer of YOLOv2 is no longer a softmax function but a tensor. Specifically, YOLOv2 divides the input image into S × S grid cells in the training process, and each grid cell predicts whether or not the center of an object falls into its interior. If a grid cell predicts that the center of an object falls within it, the grid cell continues to predict B bounding boxes and B confidences for the object. Each bounding box contains five parameters: x, y, w, h, and confidence. Specifically, (x, y) represents the coordinate of the center of the bounding box, (w, h) represents the width and height of the bounding box, and confidence reflects the credibility of the object detected by the bounding box.

B. Pseudo Analog Transmission
Many novel pseudo analog transmission schemes have been proposed in recent years. In [8], Katabi et al. proposed a cross-layer design for wireless video broadcast named SoftCast, which was the first work on pseudo analog transmission. SoftCast removes quantization and entropy coding from the conventional digital systems, and directly transmits the power-scaled discrete cosine transform (DCT) coefficients. Specifically, DCT coefficients with larger variances are allocated more transmission power. In addition, when the bandwidth is not sufficient, SoftCast can discard the least important DCT coefficients (i.e., those with smaller variances) to save bandwidth.

The work in [23] indicated that the correlation of videos can be fully utilized via a pseudo analog scheme called D-Cast. In D-Cast, the received frames are regarded as side information to assist with the reconstruction of the current frames, which greatly improves the energy efficiency. A data-assisted cloud radio access network for visual communications, named DAC-RAN, was proposed in [24]. It separates the control and data planes in the conventional digital transmission infrastructure, and integrates a new data plane (specifically designed for video communications) into the virtual base station. The correlated information retrieved from video big data is utilized as prior knowledge in video reconstruction. However, the quality of the reconstructed video does not increase linearly with the increase of the signal-to-noise ratio (SNR), due to mutual interference. In [25], He et al. proposed a structure-preserving video delivery system named SharpCast to improve both the objective and subjective visual quality. In [26], Huang et al. proposed a knowledge-enhanced wireless video transmission system called KMV-Cast, which exploits a hierarchical Bayesian model to integrate the correlated information into the video reconstruction. In [27], a maximum a posteriori (MAP) decoding was proposed for KMV-Cast pseudo analog video transmission to further eliminate the residual noise in the received video/image. In [28], the well-known block-matching and three-dimensional filtering (BM3D) algorithm was adopted to remove the noise for KMV-Cast. The above studies [25-28] mainly focused on reducing the effect of noise on the demodulation quality of the image/video at the receiver, while ignoring the effect of video content on the quality of experience (QoE) performance.

Next, we use SoftCast and KMV-Cast as examples to introduce the power optimization concept in pseudo analog transmission schemes. In SoftCast, all DCT coefficients are divided evenly into blocks (i.e., B_i) with a uniform size (e.g., 8 × 8). Each DCT coefficient block B_i is assigned a power scaling factor g_i according to its variance, in order to obtain symbols satisfying the power constraints. In SoftCast, the optimal power scaling factor g_i can be denoted as [8]:

g_i = \lambda_i^{-1/4} \sqrt{ \frac{P_t}{\sum_i \sqrt{\lambda_i}} },   (1)

where P_t denotes the total transmission power and λ_i denotes the variance of B_i. From Eqn. (1), one can conclude that the transmission power for B_i is determined by its variance in SoftCast.

There are two main differences between KMV-Cast and SoftCast: 1) in order to reduce the transmission delay, each block is de-correlated through a 2D-DCT transform instead of a 3D-DCT transform; 2) KMV-Cast can make full use of the video resources already stored in the cloud to assist with the video demodulation at the receiver.
Specifically, assume that the transmitted images/videos share the same statistical distribution as the reference ones stored in the cloud database. It has been proved in [26] that the performance of KMV-Cast is affected by the correlation factor K_i, which represents the similarity between the transmitted block B_i and the reference block.

Fig. 1 The architecture of YOLOv2.

Similar to SoftCast, the optimal power scaling factor in KMV-Cast can be denoted as [26]:

g_i = \lambda_i^{-1/4} \sqrt{ \frac{P_t \sqrt{\ell(K_i)}}{\sum_i \sqrt{\lambda_i} \sqrt{\ell(K_i)}} },   (2)

where ℓ(K_i) represents the correlation coefficient, which is related only to K_i and can be denoted as [26]:

\ell(K_i) = \left( K_i + \sqrt{M - 1 - K_i^2} \right)^2,   (3)

where M represents the size of the block B_i, which is a constant.

Please note that the side information (i.e., K_i and λ_i) is transmitted in digital mode due to its indispensability to the video reconstruction at the receiver, whereas the scaled DCT coefficients are transmitted in pseudo analog mode. Comparing Eqn. (2) with Eqn. (1), one can find that the scaling factor of KMV-Cast is determined by both λ_i and K_i.

III. FRAMEWORK OF
ROIC-CAST

In this section, we will describe the details of the three main steps of the proposed ROIC-Cast system: ROI extraction, side information compression, and unequal power allocation. The system model is shown in Fig. 2. Firstly, the ROI is extracted from each frame with the YOLOv2 structure. Then, each video frame is divided into blocks uniformly, and each block is de-correlated through a 2D-DCT transform. Next, the blocks are categorized into ROI and non-ROI ones, and an unequal power allocation scheme is proposed to enhance the transmission quality of the ROI blocks. Finally, the DCT coefficients and the compressed side information are mapped into the I/Q components of the transmitted signals.
A. ROI Extraction
Let F be a whole frame and F_ROI ⊂ F be the ROI part contained in frame F. In F_ROI, we usually have a set of salient objects with rectangular shapes, and each object t_i has a center (x_i, y_i), width w_i, and height h_i. At the output layer of YOLOv2, the position information (i.e., x_i, y_i, w_i, h_i) of each object t_i can be obtained. This position information can help us effectively locate the ROI at the receiver, as shown in Fig. 3(a). Note that each video frame is divided into blocks of uniform size (e.g., 8 × 8).

In order to transmit the location information concisely, we further decrease the data amount and compress the position information of the ROI by using run-length coding (RLC). Specifically, RLC is used to record the positions of the starting and ending blocks of each object t_i. We assume that the size of the video frame is H × W (H and W represent the height and the width of the video frame, respectively). From Fig. 3(b), one can see that the ROI can be determined by just the sequential numbers of the block B_s at the upper-left corner and the block B_e at the lower-right corner of the ROI. That is, the ROI location can be represented as in Fig. 3(c), where the first red bit represents the starting block B_s and the second red bit represents the ending block B_e.

B. Side Information Compression
In the pseudo analog video multicast system, we have pointed out that the side information (K_i and λ_i) is needed at the receiver to assist with video reconstruction (see Section II, Part B). To prevent the side information from bit errors, it is coded in digital mode, and the standard LTE system is used to transmit it. The LTE system can adjust the modulation and coding scheme (MCS) adaptively, according to the block error ratio (BLER) desired by the system. In [29], 15 different channel quality indicators (CQI) are defined, where different CQI values correspond to different MCSs. The smaller the CQI value is, the stronger the protection for the transmitted information will be, but the number of bits that can be carried by each symbol will also decrease. In the case of insufficient bandwidth, it is necessary to discard some high-frequency DCT coefficients with small variances to meet the bandwidth requirements when using an MCS with a low CQI.

In ROIC-Cast, there are three kinds of side information. The first is the ROI location information (i.e., B_s and B_e), which tells the receiver where the ROI is. The second is the variance of the pixel block (i.e., λ_i), which has a large impact on the visual quality of the reconstructed video. The third is the correlation factor (i.e., K_i) between the transmitted block and the reference block stored in the cloud. All the side information (i.e., B_s, B_e, λ_i, and K_i) is first compressed into a bitstream using Huffman coding (note that FPGA-based Huffman coding can be extremely efficient, with processing time at the nanosecond level [30]). Since Huffman coding can only use integers to represent a single symbol, there will be irreversible quantization errors in the practical transmission process.

Pseudo analog transmission is often used in multicasting and over fast-fluctuating channels, and the system must protect the side information so that it can be decoded as correctly as possible.
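As a concrete illustration of the two coding steps above (the helper names and the toy bounding box are ours, not from the paper), the B_s/B_e block indices of Fig. 3 and a Huffman codebook for the side-information symbols might be computed as follows:

```python
import heapq
from collections import Counter

def roi_block_indices(x, y, w, h, frame_w, block=8):
    """Map a detected box (center (x, y), size (w, h)) to the sequential
    indices of its upper-left (B_s) and lower-right (B_e) blocks."""
    blocks_per_row = frame_w // block
    left, top = int(x - w / 2), int(y - h / 2)
    right, bottom = int(x + w / 2) - 1, int(y + h / 2) - 1
    b_s = (top // block) * blocks_per_row + left // block
    b_e = (bottom // block) * blocks_per_row + right // block
    return b_s, b_e

def huffman_code(symbols):
    """Build a Huffman codebook {symbol: bitstring} for integer symbols."""
    heap = [[freq, idx, {sym: ""}] for idx, (sym, freq) in
            enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    idx = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], idx, merged])
        idx += 1
    return heap[0][2]

# Toy example: one detected object in a 176-pixel-wide frame.
b_s, b_e = roi_block_indices(x=88, y=72, w=48, h=32, frame_w=176)
codebook = huffman_code([b_s, b_e, b_s])
```

The receiver only needs the two indices per object to recover the ROI rectangle, which is why this representation is compact compared with sending per-pixel masks.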
In order to serve most users, we choose the MCS which can satisfy the lowest BLER requirement. Assume that σ² represents the variance of the channel noise; then the transmission power allocated for the side information must satisfy:

P_s \triangleq \beta \sigma^2,   (4)

where β represents the corresponding SNR that meets the lowest BLER requirement. Eqn. (4) gives the required minimum power (i.e., P_s) for the side information in the ROIC-Cast transmission system.

Fig. 2 Overview of the proposed ROIC-Cast scheme.

Fig. 3 Run-length coding of ROI position information.

C. Unequal Power Allocation
The ROIC-Cast scheme proposed in this paper can be applied in scenarios with a relatively static background, such as video surveillance applications which mainly detect moving objects in front of a static background. Therefore, the background of the whole video remains almost the same. Long-term repetitive reconnaissance can be used to build a comprehensive cloud database [31,32], which can provide reference blocks while transmitting the videos. Thus, the transmission quality of non-ROI blocks can be guaranteed by fully utilizing the reference blocks provided by the cloud database, and more transmission power can be used to protect the transmission quality of ROI blocks. Therefore, the power allocation scheme of ROIC-Cast needs to make a trade-off between ROI and non-ROI blocks, which is more challenging than that of KMV-Cast.

Similar to the aforementioned pseudo analog systems, we assume that all DCT coefficients in ROIC-Cast are divided evenly into blocks with M DCT coefficients in each block. The DCT coefficients in B_i can be regarded as independent samples generated by a zero-mean Gaussian variable, which can be denoted as {θ_i | θ_i ∼ N(0, λ_i), i = 1, ..., N_s}, where N_s represents the total number of blocks in a frame and λ_i represents the variance of B_i. Each B_i is assigned a power scaling factor g_i in order to obtain symbols satisfying the power constraints. For ease of derivation, we assume that the signals are transmitted over an additive white Gaussian noise (AWGN) channel. Then, the received signal (i.e., r_i[m], m = 1, ..., M) can be expressed as:

r_i[m] = g_i \cdot \theta_i[m] + n,   (5)

where θ_i[m] represents the m-th coefficient in B_i, and n represents the AWGN with variance σ². Thus, the decoded signal can be denoted as:

\hat{\theta}_i[m] = \frac{r_i[m]}{g_i} = \theta_i[m] + \frac{n}{g_i}.   (6)

Then the expected reconstruction distortion of each block B_i can be denoted as:

D_i = E\left[ \sum_{m=1}^{M} \left(\hat{\theta}_i[m] - \theta_i[m]\right)^2 \right] = \frac{M E[n^2]}{g_i^2} = \frac{M \sigma^2}{g_i^2}.   (7)

Since ROI blocks are more important than non-ROI blocks, the quality degradation of ROI blocks has a big impact on people's understanding of the video contents. Thus, more power should be allocated to the ROI blocks. Assume that the transmission power for each frame is a constant denoted as P_t. Then, the transmission power for the DCT coefficients (i.e., P_d) can be denoted as:

P_d = P_t - P_s,   (8)

where P_s represents the power allocated for the side information, whose expression has been given in Eqn. (4). Let P_dr denote the transmission power allocated for the ROI DCT coefficients, and P_dnr the transmission power for the non-ROI DCT coefficients. Thus, P_d can also be denoted as:

P_d = P_{dr} + P_{dnr}.   (9)

In this paper, we define a preference parameter η ranging from 0 to 1, which represents the ratio between 1) the average transmission power for each non-ROI pixel and 2) the average transmission power for each ROI pixel. The definition of η is:

\eta = \frac{P_{dnr}}{S(nr)} \Big/ \frac{P_{dr}}{S(r)} = \frac{P_{dnr}}{P_{dr}} \cdot \frac{S(r)}{S(nr)},   (10)

where S(r) represents the size of the ROI blocks and S(nr) represents the size of the non-ROI blocks. Both S(r) and S(nr) can be calculated at the receiver according to the well-protected side information. From Eqn. (10), one can see that the transmission power allocated for the ROI DCT coefficients increases as η decreases.

According to Eqn. (9) and Eqn. (10), P_dr and P_dnr can be derived as:

P_{dr} = \frac{S(r)}{\eta S(nr) + S(r)} \cdot P_d,   (11)

P_{dnr} = \frac{\eta S(nr)}{\eta S(nr) + S(r)} \cdot P_d.   (12)

In ROIC-Cast, the goal is to minimize the distortion of both the ROI and non-ROI parts of the received video frames. Thus, its optimization problems are similar to those in KMV-Cast, and can be formulated as follows:

\min_{g_i} \sum_{i=1}^{N_r} \ell(K_i) D_i, \quad \text{s.t.} \; \sum_{i=1}^{N_r} g_i^2 \lambda_i \le P_{dr}, \; g_i \ge 0, \quad \text{if } B_i \in F_{ROI},

\min_{g_i} \sum_{i=1}^{N_{nr}} \ell(K_i) D_i, \quad \text{s.t.} \; \sum_{i=1}^{N_{nr}} g_i^2 \lambda_i \le P_{dnr}, \; g_i \ge 0, \quad \text{otherwise},   (13)

where N_r represents the number of ROI pixel blocks and N_nr represents the number of non-ROI pixel blocks. ℓ(K_i) is related only to K_i, and has been expressed in Eqn. (3).

Note that the above two parallel sub-problems are both convex optimization problems with respect to the variable g_i. Therefore, we can apply the Karush-Kuhn-Tucker (KKT) conditions to obtain the optimal solution [33]. Specifically, we can derive the optimal power scaling factor for ROIC-Cast in closed form via the Lagrange multiplier method as follows [34]:

g_i = \begin{cases} \lambda_i^{-1/4} \sqrt{ \dfrac{P_{dr} \sqrt{\ell(K_i)}}{\sum_{i=1}^{N_r} \sqrt{\lambda_i} \sqrt{\ell(K_i)}} }, & \text{if } B_i \in F_{ROI}, \\[2ex] \lambda_i^{-1/4} \sqrt{ \dfrac{P_{dnr} \sqrt{\ell(K_i)}}{\sum_{i=1}^{N_{nr}} \sqrt{\lambda_i} \sqrt{\ell(K_i)}} }, & \text{otherwise}. \end{cases}   (14)

From Eqn. (14), one can see that the optimal power scaling factor of block B_i is determined by three aspects: 1) the variance λ_i, 2) the correlation factor K_i, and 3) whether or not the block B_i belongs to the ROI. Compared with the power allocation strategies of SoftCast and KMV-Cast (as shown in Eqn. (1) and Eqn. (2)), one can conclude that the proposed unequal power allocation can provide more protection for the ROI coefficients, which have a large effect on the QoE performance. The proposed scheme might have a slightly higher computation overhead than some of the existing works due to the extraction of the ROI part from the video via the YOLOv2 algorithm.

Algorithm 1 Proposed unequal power allocation algorithm.
Input: Reference frame, transmission frame, preference parameter η, total transmission power P_t, and noise power variance σ².
Output: Optimal power allocation g_i.
1. ROI extraction: Extract the ROI from the transmitted frame using YOLOv2, and record the position information (i.e., B_s and B_e) of the detected salient objects.
2. Source processing: Divide the transmitted frame into blocks of uniform size (e.g., 8 × 8), de-correlate each block through a 2D-DCT transform, and calculate the variance of each block (i.e., λ_i).
3. Block matching: Find the reference block that is most similar to the transmitted block, and calculate their correlation factor (i.e., K_i).
4. MCS selection: Compress the side information (i.e., B_s, B_e, λ_i, and K_i) using Huffman coding, select the MCS which satisfies the predetermined BLER requirement, and calculate P_s according to Eqn. (4).
5. Power allocation: Calculate P_d, P_dr, and P_dnr according to Eqn. (8), Eqn. (11), and Eqn. (12), respectively, and then calculate g_i according to Eqn. (14).
return g_i
However, with today's high-performance computing hardware/software, such an additional computation overhead should be a minor issue. The detailed solution process of the proposed unequal power allocation scheme is provided in Algorithm 1.

IV. PERFORMANCE ANALYSIS
We first provide the implementation details for the standard LTE transmission system. Then the parameter settings are given. Finally, we analyze the simulation results and compare our scheme with three other typical pseudo analog schemes.
A. Implementation Details
In the standard LTE system, the channel is divided into 64 subcarriers, of which 4 subcarriers are used to transmit pilot signals for channel estimation, and 48 subcarriers are used to transmit user data. In ROIC-Cast, we separate the transmitted video into two parts: the side information and the DCT coefficients. The side information is first compressed into a bit stream by using Huffman coding. Then, according to the required BLER, the corresponding MCS is selected to modulate the side information into complex signals. Afterwards, we map each group of two DCT coefficients to the I/Q components to generate a complex signal. An IFFT is then performed on the complex signals generated by the side information and the DCT coefficients together. At the receiver, the DCT coefficients and side information are restored after signal synchronization, channel estimation, channel equalization, and the FFT. Finally, the video is reconstructed from the side information and DCT coefficients. The specific implementation details are shown in Fig. 4, where the modules of ROIC-Cast are marked with a red dashed box.
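The pairing of two real DCT coefficients per complex symbol described above can be sketched as follows (a minimal illustration; LTE framing, pilots, and the IFFT are omitted, and the function names are ours):

```python
import numpy as np

def dct_pairs_to_iq(coeffs):
    """Map each group of two power-scaled DCT coefficients to the
    I and Q components of one complex symbol (zero-pad an odd tail)."""
    c = np.asarray(coeffs, dtype=float).ravel()
    if c.size % 2:
        c = np.append(c, 0.0)
    return c[0::2] + 1j * c[1::2]

def iq_to_dct_pairs(symbols):
    """Inverse mapping at the receiver: interleave I and Q back."""
    s = np.asarray(symbols)
    return np.column_stack((s.real, s.imag)).ravel()

syms = dct_pairs_to_iq([3.0, -1.0, 0.5, 2.0])
# syms -> [3.0 - 1.0j, 0.5 + 2.0j]
```

Because the mapping is linear and invertible, channel noise added to I/Q translates directly into additive noise on the recovered coefficients, which is the premise of the MMSE decoding step.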
B. Parameter Settings
We have used three natural video sequences ("Coastguard", "Highway", and "Container") and three non-natural video sequences ("Foreman", "Akiyo", and "Hall") to verify the effectiveness of the proposed algorithm. These video sequences are open source and have been widely used for simulation analysis in multimedia research [35-41]. The details of these six video sequences are provided in Table II, including the resolution, frame number, frame rate, and size. All of them have an 8-bit pixel depth (i.e., the pixel values range from 0 to 255).

In standard video sequences, there is usually a high spatial-temporal correlation between frames: the closer the video frames are, the higher the correlation, and vice versa. Therefore, we take the first frame of each video sequence as the reference, which can provide similar blocks, and we take the second frame and a much later frame to simulate the strong-correlation case and the weak-correlation case, respectively. Both the reference frame and the transmitted frames are evenly divided into uniform blocks with M = 64 DCT coefficients in each block.
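The blocking step just described can be sketched as follows, assuming 8 × 8 tiles (M = 64) and an orthonormal 2D-DCT built from the DCT-II basis (the helper names are ours):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C *= np.sqrt(2.0 / n)
    C[0, :] = np.sqrt(1.0 / n)
    return C

def block_dct_variances(frame, block=8):
    """Split a frame into block x block tiles, apply the separable
    2D-DCT to each, and return the per-block variance lambda_i."""
    C = dct_matrix(block)
    h, w = frame.shape
    lams = []
    for r in range(0, h - h % block, block):
        for c in range(0, w - w % block, block):
            tile = frame[r:r + block, c:c + block].astype(float)
            coeffs = C @ tile @ C.T  # separable 2D-DCT
            lams.append(coeffs.var())
    return np.array(lams)
```

Since the basis is orthonormal, the transform preserves block energy, so the variances λ_i directly reflect how much transmission power each block's coefficients would demand.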
TABLE II
Details of the testing video sequences.

Sequence    Resolution  Frame number  Frame rate  Size
Coastguard              300           25 fps      5.74 MB
Highway                 300           25 fps      4.06 MB
Container               300           25 fps
Foreman                 300           25 fps      5.88 MB
Akiyo                   300           25 fps      1.86 MB
Hall                    300           25 fps      5.72 MB
In the simulations, the target BLER is set as in [42]. We test the performance of all the schemes over an AWGN channel with a fixed noise variance σ², and set the channel SNR range from -5 dB to 25 dB. Table III shows the corresponding MCSs that satisfy the BLER requirement under different channel conditions in the LTE system, where ECR stands for the effective code rate. In this paper, we ensure that all the schemes share the same bandwidth (i.e., the number of complex signals to be transmitted is the same) and the same transmission power. When the bandwidth is insufficient, the high-frequency components of the DCT coefficients are discarded to meet the bandwidth budget.

TABLE III
The MCSs adopted for different channel conditions.

β       CQI  Modulation  ECR
-5 dB   1    4QAM        0.0762
0 dB    4    4QAM        0.3008
5 dB    7    16QAM       0.3691
10 dB   9    16QAM       0.6016
15 dB   12   64QAM       0.6504
20 dB   15   64QAM       0.9258
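Table III's operating points can be applied with a simple lookup; the helper below is our illustration (the table values are from the paper), and it also evaluates the side-information power P_s = βσ² of Eqn. (4) with β converted from dB:

```python
# (channel SNR threshold in dB) -> (CQI, modulation, effective code rate)
MCS_TABLE = [(-5, 1, "4QAM", 0.0762), (0, 4, "4QAM", 0.3008),
             (5, 7, "16QAM", 0.3691), (10, 9, "16QAM", 0.6016),
             (15, 12, "64QAM", 0.6504), (20, 15, "64QAM", 0.9258)]

def select_mcs(channel_snr_db):
    """Pick the most efficient MCS whose SNR threshold the channel meets."""
    feasible = [row for row in MCS_TABLE if channel_snr_db >= row[0]]
    if not feasible:
        raise ValueError("channel below the lowest supported SNR")
    return max(feasible, key=lambda row: row[3])

def side_info_power(beta_db, noise_var):
    """P_s = beta * sigma^2 (Eqn. (4)), with beta given in dB."""
    return 10 ** (beta_db / 10) * noise_var

beta, cqi, mod, ecr = select_mcs(7.0)  # a 7 dB channel supports CQI 7
```

Selecting the highest feasible ECR mirrors the paper's policy of serving most users: the chosen row's β is the SNR threshold that still meets the BLER target.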
We analyze the performance of different schemes in terms of the peak signal-to-noise ratio (PSNR) [43] and the subjective visual quality. The PSNR is a standard objective measurement of video/image quality, and is defined as a function of the MSE between all pixels of the reconstructed video and the original version as follows [43]:

PSNR = 20 log₁₀(255 / √MSE),   (15)

where MSE represents the mean square error of the reconstructed image.
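Eqn. (15) can be computed directly from the two images; a minimal Python sketch for 8-bit content:

```python
import numpy as np

def psnr(original, reconstructed):
    """PSNR in dB for 8-bit images, following Eqn. (15)."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 20 * np.log10(255.0 / np.sqrt(mse))

a = np.zeros((8, 8))
b = np.full((8, 8), 5.0)     # every pixel off by 5 -> MSE = 25
print(round(psnr(a, b), 2))  # -> 34.15
```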
C. Performance Analysis of ROI Extraction
The ROI extraction results determine the performance of the proposed ROIC-Cast scheme. In this part, we investigate the visual quality of reconstructed video frames, and evaluate the performance of ROI extraction via the YOLOv2 structure. The results are shown in Fig. 5. In Fig. 5, the three rows show the ROI extraction results of 12 selected frames via the YOLOv2 structure. In Coastguard …

D. Performance Analysis of Unequal Power Allocation
To investigate the performance of the proposed unequal power allocation algorithm for ROI and non-ROI data, we observe the overall PSNR of the reconstructed images as well as the PSNR of the ROI, under different values of the preference parameter η (see Eqn. (10)). The channel SNR is set to −5dB, 0dB, 5dB, and 10dB, respectively. The results are shown in Fig. 6. (The YOLOv2 structure is available at http://pjreddie.com/darknet/yolov2.)

[Fig. 4 The implementation scheme based on the LTE system. Tx chain: ROI extraction, block matching, DCT transform, unequal power allocation, and Huffman/Turbo coding of the side information, followed by the mapper, IFFT, and CP insertion. Rx chain: CP removal, FFT, timing/frequency synchronization and channel estimation, channel equalization, demapper, Turbo/Huffman decoding, block rematching, MMSE decoding, and IDCT transform.]

[Fig. 5 ROI extraction results via YOLOv2 structure. The first row: (a) Coastguard …]

From Fig. 6, one can see that the overall PSNR improves with the increase of η, while the PSNR of the ROI declines with the increase of η under different channel conditions. The larger η is, the less transmission power the ROI will get. Hence, the protection of the ROI data is weakened with the increase of η. Fig. 6 also shows that the essence of the unequal power allocation scheme is its capability of improving the quality of the ROI region at the expense of the quality of the non-ROI region.

E. Performance Analysis of Processing Time of Each Step
We then perform a delay analysis to evaluate the processing time of the three main steps, i.e., 1) ROI extraction, 2) side information compression, and 3) unequal power allocation. We evaluate the processing time of ROI extraction using three different types of GPUs, including GTX 1070 [44], GTX 1080 Ti [45], and RTX 2080 Ti [46] (where GTX 1070 < GTX 1080 Ti < RTX 2080 Ti in terms of computation performance), on an Ubuntu 16.04 system with a 4-core CPU, 16GB memory, Cuda 10.0, and Cudnn 7.0.5. The processing time of side information compression and unequal power allocation is evaluated using Matlab 2018b. The testing results are shown in Table IV.

[Fig. 6 The impact of the preference parameter η on video recovery quality under different channel conditions.]

TABLE IV Processing time of each step.
Frame        ROI extraction time (s)                  Side information    Unequal power
             GTX 1070   GTX 1080 Ti   RTX 2080 Ti    compression (s)     allocation (s)
Coastguard   —          —             —              —                   —
From Table IV, we can see that the processing time of ROI extraction is mainly determined by the GPU performance, regardless of the image contents. The average processing time of each image decreases as the GPU performance improves. With the decrease of GPU prices, high-performance GPUs are becoming more and more widely used in video transmission to address issues such as image recognition, image enhancement, etc.
The average processing times of side information compression and unequal power allocation are 0.00317s and 0.01589s, respectively, which indicates good performance compared to the existing digital transmission schemes. With the RTX 2080 Ti GPU, the average total processing time of each frame is 0.03243s, which means that the proposed scheme can support a frame rate of 30 fps.
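The claimed frame rate follows directly from the per-frame time budget. A quick arithmetic check using the reported averages (the per-step ROI-extraction time on the RTX 2080 Ti is implied here, derived from the total rather than reported separately):

```python
# Reported average per-frame times (seconds)
TOTAL_PER_FRAME = 0.03243   # RTX 2080 Ti, all three steps
T_SIDE_INFO = 0.00317       # side information compression
T_POWER_ALLOC = 0.01589     # unequal power allocation

# Implied ROI-extraction time (assumption: total = sum of the three steps)
t_roi = TOTAL_PER_FRAME - T_SIDE_INFO - T_POWER_ALLOC

total = t_roi + T_SIDE_INFO + T_POWER_ALLOC
fps = 1.0 / total
print(round(total, 5), round(fps, 1))  # -> 0.03243 30.8
```

A budget of 0.03243s per frame supports roughly 30.8 fps, consistent with the 30 fps claim above.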
F. Performance Analysis of Reconstructed Images
In this part, we compare the performance of the proposed ROIC-Cast with three benchmark methods, including SoftCast, DAC-RAN, and KMV-Cast, in terms of the PSNR and the visual quality of the reconstructed images. SoftCast is the first pseudo analog scheme, in which the optimal transmission power allocated to each block is determined by its variance. DAC-RAN utilizes the correlation information retrieved from video big data as prior knowledge to assist with the video demodulation. KMV-Cast exploits a hierarchical Bayesian model to further eliminate the mutual interference in DAC-RAN.
We focus on two cases: one with a strong correlation between the transmitted frame and the reference frame (i.e., the second frame to be transmitted), and one with a weak correlation (i.e., the …th frame to be transmitted). The three baselines are used for the comparisons of image reconstruction results under different conditions, which are shown in the remaining four rows of Fig. 7.
The second row and the third row of Fig. 7 show the simulation results under the condition that the transmitted frame has a strong correlation with the reference frame. One can see that the proposed scheme outperforms the others. Compared with the others, there is a gain of more than 4.1dB at high channel SNR (i.e., 10dB) and a gain of more than 4.9dB at low channel SNR (i.e., 0dB), in terms of the PSNR of the ROI. From the second row, one can find that all four schemes perform well in terms of subjective visual quality at high SNRs. But in the case of low SNR, the ROI image part of the proposed ROIC-Cast is much clearer than in the other three schemes.
The fourth row and the fifth row of Fig. 7 show the weak correlation case. The proposed ROIC-Cast achieves more than 1.0dB of gain at high SNR (i.e., 10dB) and more than 0.9dB of gain at low channel SNR in terms of the PSNR of the ROI, compared with the other three schemes. We can also see from the fifth row that DAC-RAN performs the worst in terms of subjective visual quality at high SNR (i.e., 10dB), while there is not much difference among ROIC-Cast, KMV-Cast, and SoftCast.
From the fifth row, one can clearly see the ship belonging to the ROI part in the proposed scheme, instead of an obscure image in the other schemes under low SNR. We then make a more intuitive performance comparison for different frames (…).

[Fig. 7 Row 1: three baselines chosen from Coastguard for comparisons of image reconstruction results; Rows 2-5: the reconstructed images of different methods (DAC-RAN, SoftCast, KMV-Cast, ROIC-Cast) under different conditions (SNR = 10dB and SNR = 0dB).]
G. Performance Analysis under the Rayleigh Fading Channel
We also test and evaluate the performance of the proposed ROIC-Cast scheme under Rayleigh fading channel conditions. The simulation results are shown in Fig. 9 and Fig. 10. The top graph in Fig. 9 shows the SNR of the received packets under the Rayleigh fading channel with an average SNR of 5dB. The bottom graph of Fig. 9 plots the overall PSNR of each frame as well as the PSNR of the ROI part in each frame. From Fig. 9, one can see that the proposed scheme can achieve adaptive PSNR performance in dramatically fluctuating channels. In order to reduce the effect of fading on the transmission quality, we apply whitening (i.e., Hadamard matrix multiplication) to the transmitted signals. Whitening is similar to the conventional pseudo-random scrambling and interleaving of the bit stream, but operates on the real input samples. The aim of whitening is to transform the input signal so that the transformed signal has the distribution of a memoryless Gaussian random variable. Therefore, whitening can mask fading and can improve performance beyond simple interleaving. From Fig. 9, we can also conclude that the proposed ROIC-Cast scheme provides more protection for the ROI part than for the non-ROI part.

[TABLE V PSNR performance of different methods under different conditions (dataset, methods, strong correlation case, …).]

[Fig. 8 The performance of different schemes under different SNRs: (a) Coastguard, (b) …, (c) ….]

Fig. 10 shows the visual quality of four randomly selected reconstructed frames (…, …, …, …) of the testing sequence “Coastguard”. From Fig. 9, we can find that the …th frame suffers a sharp decline in terms of PSNR due to the channel fading. However, we can see from Fig. 10 that this frame can still provide a good visual quality. From Fig. 10, one can conclude that the proposed ROIC-Cast scheme can still achieve high-quality reconstructed frames although channel conditions vary a lot. In particular, the ROI content can be clearly extracted in each frame.

V. CONCLUSIONS
In this paper, a novel pseudo analog video transmission strategy called ROIC-Cast has been proposed to enhance the communication quality of ROI parts. Firstly, YOLOv2 has been utilized to extract the ROI from each video frame, and the ROI’s location information is compressed by run-length coding. The video frames can thus be successfully classified into ROI and non-ROI pixel blocks. Secondly, a compression scheme for side information has been designed to support effective transmissions. Finally, an unequal power allocation algorithm has been proposed to protect the ROI transmissions, so that ROI blocks obtain more transmission power than non-ROI blocks. The simulation results have shown that the proposed ROIC-Cast scheme achieves the best performance compared with three other typical schemes, i.e., KMV-Cast, SoftCast, and DAC-RAN.

[Fig. 10 The visual quality of four randomly selected reconstructed frames (…, …, …, …) of the testing sequence “Coastguard” in the Rayleigh fading channel: (a) (b) (c) (d).]

VI. FUTURE WORK
The difficulty of achieving low-delay video communication lies in the large amount of video data and low bandwidth utilization. To handle the large data amount, we will study novel image/video compression algorithms to further reduce the amount of data to be transmitted [47]. To solve the problem of low bandwidth utilization, we will combine pseudo analog transmission technology with cognitive radio networks (CRNs), a promising technology to explore the idle spectrum and greatly improve the bandwidth utilization [48].
Our future work will weaken the effect of fluctuating channel conditions on video transmission quality through the following two approaches. On one hand, we can improve the channel quality of pseudo analog transmission by virtue of a UAV’s high mobility and line-of-sight (LoS) channels [49]. On the other hand, the combination of pseudo analog technology and intelligent reflecting surfaces (IRSs) will greatly mitigate the video quality degradation caused by channel fading and shadowing [50].
In addition, although some prototypes of the IEEE 802.11 series have been developed, these prototypes cannot be used to directly verify the correctness of newly proposed pseudo analog wireless video transmission algorithms, due to the limited modulation modes they support. In our future work, we plan to design a software defined radio (SDR)-based pseudo analog wireless video transceiver which is completely transparent and allows users to learn all the implementation details [51, 52].

VII. ACKNOWLEDGEMENT
The authors would like to thank all reviewers for their efforts in reviewing this manuscript.

REFERENCES
[1] Y. Sun, L. Xu, Y. Tang, and W. Zhuang, “Traffic Offloading for Online Video Service in Vehicular Networks: A Cooperative Approach,” IEEE Transactions on Vehicular Technology, vol. 67, no. 8, pp. 7630-7642, Aug. 2018.
[2] Z. Chen, P. V. Pahalawatta, A. W. Tourapis, and D. Wu, “Improved Estimation of Transmission Distortion for Error-Resilient Video Coding,” IEEE Transactions on Circuits & Systems for Video Technology, vol. 22, no. 4, pp. 636-647, Apr. 2012.
[3] A. Agarwal and P. Kumar, “Analysis of Variable Bit Rate SOFDM Transmission Scheme Over Multi-Relay Hybrid Satellite-Terrestrial System in the Presence of CFO and Phase Noise,” IEEE Transactions on Vehicular Technology, vol. 68, no. 5, pp. 4586-4601, May 2019.
[4] G. Alnwaimi and H. Boujemaa, “Adaptive Packet Length and MCS Using Average or Instantaneous SNR,” IEEE Transactions on Vehicular Technology, vol. 67, no. 11, pp. 10519-10527, Nov. 2018.
[5] Y. Yan, B. Zhang, and C. Li, “Network Coding Aided Collaborative Real-Time Scalable Video Transmission in D2D Communications,” IEEE Transactions on Vehicular Technology, vol. 67, no. 7, pp. 6203-6217, Jul. 2018.
[6] L. Wu, Y. Zhong, W. Zhang, and M. Haenggi, “Scalable Transmission Over Heterogeneous Networks: A Stochastic Geometry Analysis,” IEEE Transactions on Vehicular Technology, vol. 66, no. 2, pp. 1845-1859, Feb. 2017.
[7] X. Jiang and H. Lu, “Joint Rate and Resource Allocation in Hybrid Digital-Analog Transmission Over Fading Channels,” IEEE Transactions on Vehicular Technology, vol. 67, no. 10, pp. 9528-9541, Oct. 2018.
[8] J. Szymon and K. Dina, “A Cross-Layer Design for Scalable Mobile Video,” in Proc. ACM MobiCom, Sept. 2011, pp. 1-12.
[9] X. Liu, W. Hu, C. Luo, Q. Pu, F. Wu, and Y. Zhang, “ParCast+: Parallel Video Unicast in MIMO-OFDM WLANs,” IEEE Transactions on Multimedia, vol. 16, no. 7, pp. 2038-2051, Nov. 2014.
[10] L. Zhang, M. Liu, L. Chen, L. Qiu, C. Zhao, Y. Hu, and R. Zimmermann, “Online Modeling of Esthetic Communities Using Deep Perception Graph Analytics,” IEEE Transactions on Multimedia, vol. 20, no. 6, pp. 1462-1474, Jun. 2018.
[11] R. Cong, J. Lei, H. Fu, M. Cheng, W. Lin, and Q. Huang, “Review of Visual Saliency Detection With Comprehensive Information,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 10, pp. 2941-2959, Oct. 2019.
[12] X. Zhou, C. Yang, H. Zhao, and W. Yu, “Low-rank Modeling and Its Applications in Image Analysis,” ACM Comput. Surveys, vol. 47, no. 2, pp. 1-33, Dec. 2014.
[13] T. Akilan, Q. Wu, and W. Zhang, “Video Foreground Extraction Using Multi-View Receptive Field and Encoder-Decoder DCNN for Traffic and Surveillance Applications,” IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9478-9493, Oct. 2019.
[14] P. Chiranjeevi and S. Sengupta, “Detection of Moving Objects using Multi-channel Kernel Fuzzy Correlogram based Background Subtraction,” IEEE Transactions on Cybernetics, vol. 44, no. 6, pp. 870-881, Jun. 2014.
[15] S. Kruthiventi, K. Ayush, and R. V. Babu, “DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4446-4456, Jan. 2017.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper with Convolutions,” in Proc. IEEE CVPR, Oct. 2015, pp. 1-9.
[17] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” in Proc. IEEE CVPR, Dec. 2016, pp. 6517-6525.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-time Object Detection,” in Proc. CVPR, Jun. 2015, pp. 779-788.
[19] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” in Proc. IEEE CVPR, Apr. 2018, pp. 1-6.
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in Proc. IEEE CVPR, Oct. 2014, pp. 580-587.
[21] R. Girshick, “Fast R-CNN,” in Proc. IEEE CVPR, Dec. 2015, pp. 1440-1448.
[22] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, Jun. 2017.
[23] X. P. Fan, F. Wu, D. B. Zhao, O. C. Au, and W. Gao, “Distributed Soft Video Broadcast (DCAST) with Explicit Motion,” in Proc. IEEE DCC, Apr. 2012, pp. 199-208.
[24] J. Wu, D. Liu, X.-L. Huang, and C. Luo, “DaC-RAN: A Data-assisted Cloud Radio Access Network for Visual Communications,” IEEE Wireless Communications, vol. 22, no. 3, pp. 130-136, Jun. 2015.
[25] D. He, C. Luo, C. Lan, F. Wu, and W. Zeng, “Structure-Preserving Hybrid Digital-Analog Video Delivery in Wireless Networks,” IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1658-1670, Sept. 2015.
[26] X.-L. Huang, J. Wu, and F. Hu, “Knowledge Enhanced Mobile Video Broadcasting (KMV-Cast) Framework with Cloud Support,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 1, pp. 6-18, Jan. 2017.
[27] X.-W. Tang, X. N. Huan, and X.-L. Huang, “Maximum a Posteriori Decoding for KMV-Cast Pseudo-Analog Video Transmission,” Mobile Networks & Applications, vol. 23, no. 2, pp. 318-325, Apr. 2018.
[28] X.-L. Huang, X.-W. Tang, X. N. Huan, P. Wang, and J. Wu, “Improved KMV-Cast with BM3D Denoising,” Mobile Networks & Applications, vol. 23, no. 1, pp. 100-107, Feb. 2018.
[29] 3GPP Team, “Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Layer Procedures,” Tech. Rep. TS 36.213, Mar. 2009.
[30] A. J. Carlos, F. A. Carlos, O. M. Reyes, and V. C. Javier, “FPGA Implementation of a Huffman Decoder for High Speed Seismic Data Decompression,” in Proc. DCC, Jun. 2014, pp. 1-1.
[31] X. Song, X. Peng, J. Xu, G. Shi, and F. Wu, “Distributed Compressive Sensing for Cloud-Based Wireless Image Transmission,” IEEE Transactions on Multimedia, vol. 19, no. 6, pp. 1351-1364, Jun. 2017.
[32] B. H. K. Chen, P. Y. S. Cheung, P. Y. K. Cheung, and Y. K. Kwok, “CypherDB: A Novel Architecture for Outsourcing Secure Database Processing,” IEEE Transactions on Cloud Computing, vol. 6, no. 2, pp. 372-386, Apr. 2018.
[33] M. Li, “Generalized Lagrange Multiplier Method and KKT Conditions With an Application to Distributed Optimization,” IEEE Transactions on Circuits and Systems, vol. 66, no. 2, pp. 252-256, Feb. 2019.
[34] X. Zhang, H. Huang, W. Ying, H. Wang, and J. Xiao, “An Indirect Range-Doppler Algorithm for Multireceiver Synthetic Aperture Sonar Based on Lagrange Inversion Theorem,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 6, pp. 3572-3587, Jun. 2017.
[35] X.-L. Huang, X.-W. Tang, and F. Hu, “Dynamic Spectrum Access for Multimedia Transmission over Multi-User, Multi-Channel Cognitive Radio Networks,” IEEE Transactions on Multimedia, vol. 22, no. 1, pp. 201-214, Jan. 2020.
[36] P. Seeling and M. Reisslein, “Video Transport Evaluation With H.264 Video Traces,” IEEE Communications Surveys & Tutorials, vol. 14, no. 4, pp. 1142-1165, Dec. 2012.
[37] M. Y. Naderi, H. R. Rabiee, M. Khansari, and M. Salehi, “Error Control for Multimedia Communications in Wireless Sensor Networks: A Comparative Performance Analysis,” Ad Hoc Networks, vol. 10, no. 6, pp. 1028-1042, Aug. 2012.
[38] D. Kesrarat and V. Patanavijit, “A Novel Robust and High Reliability for Lucas-Kanade Optical Flow Algorithm Using Median Filter and Confidence Based Technique,” in Proc. IEEE WAINA, Mar. 2012, pp. 312-317.
[39] A. Pulipaka, P. Seeling, M. Reisslein, and L. J. Karam, “Traffic and Statistical Multiplexing Characterization of 3-D Video Representation Formats,” IEEE Transactions on Broadcasting, vol. 59, no. 2, pp. 382-389, Jun. 2013.
[40] R. Gupta, A. Pulipaka, P. Seeling, L. J. Karam, and M. Reisslein, “H.264 Coarse Grain Scalable (CGS) and Medium Grain Scalable (MGS) Encoded Video: A Trace Based Traffic and Quality Evaluation,” IEEE Transactions on Broadcasting, vol. 58, no. 3, pp. 428-439, Sept. 2012.
[41] S. Patrick and R. Martin, “Video Traffic Characteristics of Modern Encoding Standards: H.264/AVC with SVC and MVC Extensions and H.265/HEVC,” The Scientific World Journal, vol. 2014, pp. 1-16, Feb. 2014.
[42] V. Sgardoni and A. R. Nix, “Raptor Code-Aware Link Adaptation for Spectrally Efficient Unicast Video Streaming over Mobile Broadband Networks,” IEEE Transactions on Mobile Computing, vol. 14, no. 6, pp. 401-415, Jun. 2014.
[43] H. Alain and D. Ziou, “Image Quality Metrics: PSNR vs. SSIM,” in Proc. ICPR, Aug. 2010, pp. 2366-2369.
[44] B. Mao, Z. M. Fadlullah, F. Tang, N. Kato, O. Akashi, T. Inoue, and K. Mizutani, “Routing or Computing? The Paradigm Shift Towards Intelligent Computer Network Packet Transmission Based on Deep Learning,” IEEE Transactions on Computers, vol. 66, no. 11, pp. 1946-1960, Nov. 2017.
[45] D. W. Kim, Y. H. Lee, and Y. H. Seo, “High-speed Computer-generated Hologram based on Resource Optimization for Block-based Parallel Processing: Publisher’s Note,” Applied Optics, vol. 57, no. 16, pp. 3511-3518, May 2018.
[46] D. H. C. Pérez, N. Neufeld, and A. R. Núñez, “A Fast Local Algorithm for Track Reconstruction on Parallel Architectures,” in Proc. IPDPSW, May 2019, pp. 698-707.
[47] M. Xu, C. Li, S. Zhang, and P. L. Callet, “State-of-the-Art in 360 Video/Image Processing: Perception, Assessment and Compression,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 1, pp. 5-26, Jan. 2020.
[48] X. Huang, Y. Gao, X. Tang, and S. Wang, “Spectrum Mapping in Large-Scale Cognitive Radio Networks With Historical Spectrum Decision Results Learning,” IEEE Access, vol. 6, pp. 21350-21358, 2018.
[49] C. You and R. Zhang, “3D Trajectory Optimization in Rician Fading for UAV-Enabled Data Harvesting,” IEEE Transactions on Wireless Communications, vol. 18, no. 6, pp. 3192-3207, Jun. 2019.
[50] B. Zheng and R. Zhang, “Intelligent Reflecting Surface-Enhanced OFDM: Channel Estimation and Reflection Optimization,” IEEE Wireless Communications Letters, vol. 9, no. 4, pp. 518-522, Apr. 2020.
[51] B. Mao, F. Tang, Z. M. Fadlullah, and N. Kato, “An Intelligent Route Computation Approach Based on Real-Time Deep Learning Strategy for Software Defined Communication Systems,” IEEE Transactions on Emerging Topics in Computing, Early Access, pp. 1-12, Feb. 2019.
[52] B. Mao, F. Tang, Z. M. Fadlullah, N. Kato, O. Akashi, T. Inoue, and K. Mizutani, “A Novel Non-Supervised Deep-Learning-Based Network Traffic Control Method for Software Defined Wireless Networks,” IEEE Wireless Communications, vol. 25, no. 4, pp. 74-81, Aug. 2018.
Xiao-Wei Tang (S’16, IEEE) received the B.E. degree in Communication Engineering from Tongji University in 2016, where he is currently pursuing the Ph.D. degree. He has published several research papers in IEEE Transactions on Multimedia, IEEE Access, IEEE Globecom, and Mobile Networks & Applications. He was a recipient of the Excellent Bachelor Thesis of Tongji University in 2016, the National Scholarship for Graduate Students by the Ministry of Education of China in 2017, the Outstanding Students Award of Tongji University in 2017, the Outstanding Freshman Scholarship of Tongji University in 2018, the Chinese Government Scholarship by the China Scholarship Council in 2019, the Outstanding Students Award of Tongji University in 2019, and the National Scholarship for Graduate Students by the Ministry of Education of China in 2019. Since Aug. 2019, he has been doing research on UAV-enabled wireless video transmission in the Department of Electrical and Computer Engineering, National University of Singapore, as a visiting scholar. His research interests include pseudo-analog video transmission, UAV communication, convex optimization, and deep learning.

Xin-Lin Huang (S’09-M’12-SM’16, IEEE) is currently a professor and vice-head of the Department of Information and Communication Engineering, Tongji University, Shanghai, China. He received the M.E. and Ph.D. degrees in information and communication engineering from Harbin Institute of Technology (HIT) in 2008 and 2011, respectively. His research focuses on cognitive radio networks, multimedia transmission, and machine learning. He has published over 70 research papers and holds 8 patents in these fields. Dr. Huang was a recipient of the Scholarship Award for Excellent Doctoral Students granted by the Ministry of Education of China in 2010, the Best PhD Dissertation Award from HIT in 2013, the Shanghai High-level Overseas Talent Program in 2013, and the Shanghai Rising-Star Program for Distinguished Young Scientists in 2019. From Aug. 2010 to Sept. 2011, he was supported by the China Scholarship Council to do research in the Department of Electrical and Computer Engineering, University of Alabama (USA), as a visiting scholar. He was invited to serve as a Session Chair for IEEE ICC 2014. He served as a Guest Editor for IEEE Wireless Communications and Chief Guest Editor for the international journals MONET and WCMC. He serves as IG co-chair for IEEE ComSoc MMTC, and Associate Editor for IEEE Access. He is a Fellow of the EAI.

Fei Hu (M’02, IEEE) received the Ph.D. degree in signal processing from Tongji University, Shanghai, China, in 1999, and the Ph.D. degree in electrical and computer engineering from Clarkson University, Potsdam, NY, USA, in 2002. He is a Professor with the Department of Electrical and Computer Engineering, University of Alabama, Tuscaloosa, AL, USA. He has authored more than 200 journal/conference papers and book chapters in wireless networks, security, and machine learning. His research interests include cognitive radio networks, AI, and cyber security.