Adaptive Video Configuration and Bitrate Allocation for Teleoperated Vehicles
Andreas Schimpe, Simon Hoffmann and Frank Diermeyer
Abstract — Vehicles with autonomous driving capabilities are present on public streets. However, edge cases remain that still require a human in-vehicle driver. Assuming the vehicle manages to come to a safe state in an automated fashion, teleoperated driving technology enables a human to resolve the situation remotely via a control interface connected over a mobile network. While this is a promising solution, it also introduces technical challenges, one of them being the necessity to transmit video data from multiple cameras on the vehicle to the human operator. In this paper, an adaptive video streaming framework specifically designed for teleoperated driving is proposed and demonstrated. The framework enables automatic reconfiguration of the video streams of the multi-camera system at runtime. Predictions of variable transmission service quality are taken into account. With the objective of maximizing visual quality, the framework uses so-called rate-quality models to dynamically allocate bitrates and select resolution scaling factors. Results from deploying the proposed framework on an actual teleoperated driving system are presented.
I. INTRODUCTION
Great effort has been put into the development of fully automated and driverless vehicles. In particular, improvements have been made in perception systems, planning, and control algorithms. Currently, automated vehicles are already being tested with safety drivers on public streets. When the automation detects that a traffic situation cannot be resolved independently, teleoperated driving offers a solution.
A. Teleoperated Driving
With teleoperated driving (ToD) technology, a vehicle can be controlled remotely. Sensor and vehicle data, e.g., video feeds, are transmitted via mobile networks from the vehicle to a remote control center. There, the data are displayed to a human operator, who generates control commands. These are then transmitted back to the vehicle for execution. ToD comes with a number of challenges. First of all, teleoperation is subject to latency due to delays in system components and in the transmission of data via a mobile network. However, with advances in computational power and in sensor and actuator technologies, delays can be reduced significantly [1]. In addition, novel mobile network standards promise even greater reductions [2]. Situational awareness of the operator poses another great challenge to ToD. Not being physically located in the vehicle, the operator's perception is primarily based on multiple video streams from the cameras on the vehicle. A display based on a spherical video canvas is proposed in [3]. It is applied in a long-term study [4], reporting an increase in operator immersion. The study also indicates an influence of different video quality settings. With the objective of maximizing the visual quality of the video streams, this paper addresses the task of automated dynamic adaptation of video stream parameters, depending on predictions of bitrate availability.

The authors are with the Institute for Automotive Technology at the Technical University of Munich (TUM), 85748 Garching bei München, Germany. [email protected]
B. Predictive Quality of Service
Through predictive Quality of Service (pQoS), an application can receive predictions of the communication quality of a mobile network, e.g., the available bitrate. It enables proactive adaptation of the data transmission, instead of reacting to occurrences of increased network latencies or packet loss. Predicting communication quality is affected by multiple parameters [5]. If pQoS is not provided by the mobile network operator, data- and machine learning-based models can be used to generate predictions [6]. Coupled with 5G, the topic attracts a lot of interest in industry, especially for automotive applications [2, 7]. With its demand for low latency and high data rates on the mobile network link, ToD is an application that can greatly benefit from pQoS. The framework in this paper assumes that predictions of the allocatable uplink bitrate are available; these are used to perform video stream adaptations of the ToD system.
C. Adaptive Multi-Camera Video Streaming
While video streaming is well-established in many domains, teleoperation, and therein especially ToD, is a video streaming application with special demands. As the operator controls the vehicle remotely in real time, the video streams must be delivered with the lowest possible delay. Therefore, it is not possible to make use of the full compression capabilities of today's codecs. Also, the bandwidth of the mobile network is limited and varies. These factors compromise the achievable visual quality and make adaptive video streaming necessary. Over the years, adaptive video streaming was primarily investigated for the purpose of entertainment services. For instance, the approach in [8] optimizes playback bitrates using pre-generated rate-quality models and other statistics. Other rate control methods, such as look-ahead approaches [9, 10], enable variable allocation of a constant total bitrate among multiple videos. In the field of teleoperation, the work presented in [11] performs variable bitrate allocation for a multi-camera system under varying radio conditions. This is related to some of the objectives in this paper. However, the camera weighting was based more on an intuition for the priorities of the cameras.
Fig. 1. Architecture of Adaptive Video Streaming Framework
The resulting visual quality of the videos was not taken into account. A thorough assessment of visual quality in the ToD simulation setup TELECARLA [12] is presented in [13]. Therein, the video streams are parametrized depending on their orientation and the driving scenario. Dynamic adjustment of the video parameters is not considered.
D. Contributions
In this paper, a video streaming framework is proposed that aims at providing the flexibility required by ToD. This includes variable prioritization of cameras and dynamic adjustment of video parameters. The videos can be parametrized in different ways. The framework either takes manual input from the operator or predictions of the currently available bitrate. In the latter case, bitrates are allocated and resolution scaling factors are selected automatically. This is performed based on so-called rate-quality models, which were generated for each camera of the system. Insights into the deployment of the proposed framework on an actual ToD system with eight cameras are given. The generated rate-quality models are discussed, and the functionality of the framework during operation is demonstrated in an experimental test drive. The source code of the framework is open-source and available as a Robot Operating System (ROS) package on GitHub.

II. ADAPTIVE VIDEO STREAMING FRAMEWORK
To control and adapt multiple video streams, the architecture shown in Fig. 1 is proposed. For each camera on the vehicle, the raw pixel data of the videos are compressed by an encoder. The compressed data are transmitted via a mobile network to the operator side, where they are decoded and displayed. A scene manager user interface offers options for the operator to switch between three video rate control modes: Single (S), Automatic (A) and Collective (C). In mode S, the videos can be turned on and off, or reconfigured individually. This means that the region of interest (ROI), characterized by width, height, and horizontal and vertical offset in pixels, can be set for each video. Furthermore, the video resolution scaling factor and the target bitrate of the encoder can be set. In mode A, the ROIs are kept constant, i.e., as the operator has set them in mode S. Based on predictions of the (total) bitrate available from the pQoS client, the bitrate allocation and resolution scaling factor selection for the individual video streams are performed and applied by the bandwidth manager automatically. Video rate control mode C is similar to mode A, with the difference that the total bitrate is not given by the pQoS client, but by the operator through the scene manager. In the following, the procedures of bitrate allocation and resolution scaling factor selection, as performed by the bandwidth manager, are described in detail.

https://github.com/TUMFTM/tod_perception

A. Bitrate Allocation
Given the operator's selected regions of interest and a prediction of the currently available bitrate, each camera of the multi-view video system is allocated a bitrate. At first, from the ROI of each camera $i$, the bitrate demand $b_{\mathrm{dem},i}$ is computed by

$$b_{\mathrm{dem},i} = \frac{p_{\mathrm{set},i}}{p_{\mathrm{full},i}} \, b_{\mathrm{full},i}, \quad (1)$$

where $p_{(\cdot)} = w_{(\cdot)} h_{(\cdot)}$, with image width $w_{(\cdot)}$ and height $h_{(\cdot)}$, denotes the number of pixels of the ROI set by the operator and of the full image of the camera, respectively. For a camera that is turned off, $p_{\mathrm{set},i}$ is set to zero. The maximum bitrate that should be allocated to the camera for the full image is denoted by $b_{\mathrm{full},i}$. This parameter, implicitly formulating a weight of the respective camera, is set manually and will be discussed in more detail in Sec. V-A. In addition, the bitrate allocation can be complemented by a set of explicit camera weights. From the bitrate demand, the allocated bitrate $b_{\mathrm{alloc},i}$ is calculated as

$$b_{\mathrm{alloc},i} = \frac{b_{\mathrm{dem},i}}{\sum_{j=1}^{N} b_{\mathrm{dem},j}} \, b_{\mathrm{pred}}, \quad (2)$$

where $N$ and $b_{\mathrm{pred}}$ denote the total number of cameras and the prediction of the bitrate available from the pQoS client, respectively.

B. Resolution Scaling Selection
From the set ROI and allocated bitrate, the resolution scaling factor that maximizes the average visual quality is selected for each camera. This selection is based on rate-quality models. Their generation and the metrics for visual quality are described in more detail in the next section. For better readability, the camera index $i$ is omitted in this section. Generally speaking, for each camera, the rate-quality model yields a set $S$ of resolution scaling factors

$$S = \{ s_1, s_2, \ldots, s_R \}. \quad (3)$$

Within $S$, the resolution scaling factors $s_r$ are $R$ real numbers in the range $(0, 1]$. Each factor maximizes the average visual quality in a range $B_r$ of encoder target bitrates. Given the allocated bitrate for a camera from (2), the procedure selects the factor $s_r$ for which

$$b_{\mathrm{alloc}} \, \frac{p_{\mathrm{full}}}{p_{\mathrm{set}}} \in B_r = \begin{cases} \left[\, b_{\mathrm{min},r},\, b_{\mathrm{min},r+1} \,\right) & \text{if } r < R \\ \left[\, b_{\mathrm{min},R},\, b_{\mathrm{full}} \,\right] & \text{if } r = R \end{cases} \quad (4)$$

holds. This aims at incorporating the selection of a greater resolution scaling factor in case the operator has set an ROI smaller than the full image of the camera.

Fig. 2. Conceptual Illustration of the Rate-Quality Model

III. GENERATION OF RATE-QUALITY MODELS
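Before turning to model generation, the allocation rule of Sec. II-A, Eqs. (1)-(2), can be summarized as a proportional split of the predicted uplink bitrate among the cameras. A minimal sketch; the function name and all camera values are hypothetical:

```python
def allocate_bitrates(p_set, p_full, b_full, b_pred):
    """Eqs. (1)-(2): each camera i demands b_full[i] scaled by its ROI
    share p_set[i] / p_full[i]; the predicted bitrate b_pred is then
    split proportionally to these demands. A camera that is turned
    off has p_set[i] = 0 and is therefore allocated nothing."""
    b_dem = [ps / pf * bf for ps, pf, bf in zip(p_set, p_full, b_full)]
    total = sum(b_dem)
    return [bd / total * b_pred for bd in b_dem]

# Two cameras with equal b_full of 4 Mbit/s: one at full ROI, one at
# half ROI. Splitting a predicted 6 Mbit/s yields about 4 and 2 Mbit/s.
alloc = allocate_bitrates([1.0, 0.5], [1.0, 1.0], [4.0, 4.0], 6.0)
```

Note that the split preserves the total: the allocations always sum to the predicted bitrate, regardless of the individual demands.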
To perform the previously described resolution scaling factor selection, rate-quality models are generated beforehand. This follows a concept presented in [8], which splits a video into five-second, non-overlapping chunks. Each chunk is encoded at different resolutions and compression rates. The average visual quality of each chunk is stored in a database. These data then enable the selection of parameters that maximize the visual quality of each chunk during playback. Fig. 2 conceptually illustrates the rate-quality model for three resolutions, i.e., scaling factors $s_r$. Each curve represents the visual quality of the video at a certain scaling factor over varying bitrate. Each scaling factor achieves the highest visual quality in a certain bitrate range. From the points of intersection of the rate-quality curves, the values $b_{\mathrm{min},r}$ are determined. In ToD, the streamed videos are not recordings, but live feeds. In consequence, the described concept is adapted to the problem at hand. For the presented adaptive video streaming framework, rate-quality models are obtained not for short video chunks, but for each camera of the ToD system. To this end, a representative driving sequence with straight sections and turns is recorded with each camera in a raw, non-compressed format. This is then downscaled and encoded with a constant encoder target bitrate. Afterwards, the compressed video is decoded and scaled up to its original resolution. From this, the visual quality of each video frame is assessed through a full-reference metric. Well-known metrics are the Peak Signal-to-Noise Ratio (PSNR) and the Mean Structural Similarity (MSSIM) [14]. Alternatively, a no-reference image quality assessment method, such as the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) [15], may be used. Recently, Netflix's Video Multi-Method Assessment Fusion (VMAF) [16] has also gained in popularity. The visual quality is computed for each video frame of the driving sequence.
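This per-frame scoring and the subsequent averaging can be sketched as follows. Note that a simplified, single-window variant of SSIM is used here as an assumed stand-in; the actual MSSIM of [14] averages locally windowed scores:

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    # Single-window SSIM over whole frames, with the common constants
    # C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for dynamic range L.
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)
    return num / den

def rate_quality_point(ref_frames, dec_frames):
    # One data point of the rate-quality model: the average quality
    # over all reference/decoded frame pairs of the driving sequence.
    return float(np.mean([global_ssim(r, d)
                          for r, d in zip(ref_frames, dec_frames)]))
```

An unimpaired frame scores 1.0 against itself; compression artifacts push the score below 1.0.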
Finally, the average over all frames represents one data point of the rate-quality model. The procedure is repeated for multiple resolution scaling factors and encoder target bitrates. This yields the full camera rate-quality model, which is used in the adaptive video streaming framework, as described in the previous section.

Fig. 3. Camera Setup and Fields of View on Experimental Vehicle

IV. EXPERIMENTAL SYSTEM SETUP
The video streaming framework presented in this paper has been deployed on the experimental vehicle for ToD described in [3]. Since then, the camera setup has been complemented by two front-mounted cameras. As illustrated in Fig. 3, the setup now consists of a total of eight cameras. These are
• three front-mounted and one rear-mounted camera operating at 40 Hz,
• and four top view cameras operating at 30 Hz, each with an opening angle of 180 deg, for monitoring the close surroundings of the vehicle.
The vehicle camera parameters are summarized in Tab. I. The reported resolution values partially do not correspond to the native camera resolutions, but already incorporate cropping of the videos due to overlap or regions, such as the sky, which certainly do not contain valuable information and thus do not need to be transmitted to the operator. The obtained values of the rate-quality models are discussed in more detail in the next section.
The adaptive video streaming framework is implemented within ROS [17]. The video compression and transmission are handled using GStreamer [18]. The GStreamer Real-Time Streaming Protocol (RTSP) Server Library [19], implementing an RTSP-based client-server model, is used for establishing and controlling the video streaming sessions between the operator and the vehicle. GStreamer is a modular framework based on so-called plugins, each providing a certain functionality. By putting plugins together in a pipeline, different multimedia streaming applications can be created. Within the presented video streaming framework, the pipeline for each camera on the vehicle side consists of the following plugins: appsrc → videocrop → videoscale → capsfilter → x264enc → rtph264pay → rtsp server. From the ROS image callback, the raw video frames are pushed to the appsrc. The x264enc compresses the raw video using the H.264 codec [20]. For the transmission via the rtsp server, the compressed data are split into Real-Time Transport Protocol (RTP) packets by the rtph264pay. Adapting the configuration and control of the video stream at runtime happens through the parametrization of certain plugins.
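Such a per-camera pipeline could, for instance, be assembled as a GStreamer launch description at runtime. The element and property names below are real GStreamer plugins; the concrete values and the helper function are illustrative only:

```python
def vehicle_pipeline(crop, scale_wh, bitrate_kbps):
    # crop: (top, bottom, left, right) in pixels, applied by videocrop;
    # scale_wh: output resolution forced onto videoscale via a capsfilter;
    # bitrate_kbps: x264enc target bitrate in kbit/s.
    top, bottom, left, right = crop
    width, height = scale_wh
    return (
        "appsrc ! "
        f"videocrop top={top} bottom={bottom} left={left} right={right} ! "
        "videoscale ! "
        f"video/x-raw,width={width},height={height} ! "
        f"x264enc bitrate={bitrate_kbps} speed-preset=superfast tune=zerolatency ! "
        "rtph264pay"
    )

# e.g., crop 160 px of sky, scale to 960x520, encode at 1500 kbit/s
desc = vehicle_pipeline((160, 0, 0, 0), (960, 520), 1500)
```

In the actual framework, such a description would be attached to the GStreamer RTSP server rather than launched directly.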
These are the videocrop element to adapt the ROI, the capsfilter element to force the videoscale to perform a certain resolution scaling, and the x264enc to encode the video stream at the (target) bitrate. On the operator side, each video is received with the following pipeline: rtspsrc → rtph264depay → avdec_h264 → appsink, where the rtspsrc creates the RTSP client that connects to the server to receive the RTP packets. These are passed on to the rtph264depay to reassemble the H.264-compressed data. The avdec_h264 decodes these into raw video frames, which are retrieved from the appsink and published to ROS for the operator display.
Several hardware and software design choices for the ToD system were driven by the goal to minimize the age of the information that is being displayed to the operator. An extensive and thorough analysis of hardware components and their latencies is presented in [1]. The design choices include high camera frame rates and, despite lower compression rates, the use of the faster H.264 over H.265 [21]. With H.264, low latency is achieved by parametrizing the encoder with the speed-preset=superfast and tune=zerolatency properties. Also, to take advantage of the continuous, cut-free videos during ToD, the intra-refresh property is used to periodically distribute the refresh of the video key frames.
With the described parameter settings and other hardware component choices, such as gaming monitors with minimal input lag and high refresh rates, the latency of the video streaming system can be reduced effectively. With a wired connection, the end-to-end latency, measured from an event happening in front of the camera until it is displayed on the operator monitor, is below 120 ms for the three front-mounted cameras [1]. Although this paper does not explicitly aim at optimizing latency, these characteristics of the given ToD system are described here, as they significantly influence the obtained rate-quality models that are presented in the following section.

V. RESULTS
In this section, rate-quality models generated for the experimental vehicle are presented. Furthermore, results of an experimental drive give insights into the operation of the framework at runtime.

Fig. 4. Rate-Quality Models for Front-Mounted Cameras: (a) Front Center, (b) Front Left and Right Average

A. Rate-Quality Models
For the described camera setup, rate-quality models were generated with the methodology described in Sec. III. Sequences of approximately … seconds during regular driving at moderate speeds were recorded in raw image format. The recordings were then run through the described compression-decompression pipeline at eight resolution scaling factors $S = \{0.125, 0.250, \ldots, 1.000\}$ and encoder target bitrates between … and … kbit/s. From the average MSSIMs for all parameter settings, the resolution scaling factors that yield the highest quality in a certain bitrate range are selected. The rate-quality models of the front-mounted center camera and the average of the front-mounted left and right cameras are shown in Fig. 4. All three cameras are of the same type. For better clarity, only selected resolution scaling factors are displayed. It can be observed that each factor yields a bitrate range in which it maximizes the MSSIM, motivating dynamic scaling in the first place. Another observation is the major impact of the camera orientation on the quality model. For instance, as the motion in the videos of the side-facing cameras is greater, the bitrate ranges of the respective resolution scaling factors differ from those of the center-facing camera. Furthermore, overall lower quality scores are achieved. In consequence, this should be compensated for in the rate-quality model by assigning a greater $b_{\mathrm{full}}$ to the side-facing cameras. Plots of the quality models for the other cameras can be found in Fig. 8 in the Appendix.
From the rate-quality models, it is concluded that not all scaling factors are worthwhile to apply. For instance, it turned out that some scaling factors are the best choice only in a comparably narrow bitrate range. Also, resolution scaling factors close to 1.0, requiring greater computational power and encoding time, are only reasonable if the related camera bitrates are maintainable in the 4G/LTE network that is shared with multiple users.
The rate-quality model parameters used for the experimental vehicle are reported in Tab. I. Different resolution scaling factors turned out to be preferred for the different types of cameras. Also, only scaling factors up to … came into use. However, with further advancements of mobile network standards and computing technologies, this is expected to change.

TABLE I
CAMERA AND RATE-QUALITY MODEL PARAMETERS OF EXPERIMENTAL VEHICLE

                      Front Left / Right | Front Center | Rear Center | Top View Front | Top View Left / Right | Top View Rear
p_full (MPx)          … | … | … | … | … | …
w_full, h_full (Px)   1920, 1040 | 1920, 1200 | 1280, 800 | … | … | …
S                     {…, …, …} | {…, …, …} | … | … | … | …
b_full (Mbit/s)       … | … | … | … | … | …
b_min,r (Mbit/s)      {…, …, …} | {…, …, …} | {…, …, …} | {…, …, …} | {…, …, …} | {…, …, …}

Fig. 5. Actual Total Camera Bitrates over GPS Location of Vehicle
B. Driving Test
After the generation of the rate-quality models for all cameras on the vehicle, the framework is put into operation. In the following, insights into the functioning of mode A are given for a driving test of approximately 100 s. The predictions of the available bitrate are artificial, being read from a pre-recorded bandwidth map. As stated in Sec. I-B, the development and deployment of more sophisticated regression and prediction models is beyond the scope of this paper.
Fig. 5 shows the total bitrate of all cameras, plotted on the GPS location of the vehicle during the driving test. Great variance of the bitrate can be observed, ranging from … to … kbit/s. The actual camera bitrates over time are plotted in Fig. 6. The front-mounted cameras used for the left and right views have a greater $b_{\mathrm{full}}$ and are always allocated higher bitrates. This is the consequence of the lower visual quality levels achieved for these cameras, due to their orientation, as described in the previous section. The bitrates of the top view cameras are controlled equally, at overall lower bitrate levels compared to the front- and rear-mounted cameras. Given the variation in allocated bitrates, the framework also varies the resolution of the videos. The resolution scaling factors are plotted over time in Fig. 7. On two occasions, around … and 55 s, only lower bitrates are available for several seconds. This results in the selection of lower resolutions for all cameras, except the front-mounted center camera. Between … and 90 s, a lower bitrate is predicted as well. However, with the given rate-quality models, the resolutions of the videos are not changed.
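This behavior follows directly from the selection rule (4): each camera keeps its resolution scaling factor as long as its allocated bitrate stays inside the current range $B_r$, and the per-camera thresholds $b_{\mathrm{min},r}$ differ. A self-contained sketch, with hypothetical thresholds and a full-image ROI assumed:

```python
def scale_for(b_alloc, s, b_min):
    # Eq. (4) for a full-image ROI: pick the factor s[r] whose bitrate
    # range [b_min[r], b_min[r+1]) contains the allocated bitrate.
    r = max(k for k in range(len(s)) if b_alloc >= b_min[k])
    return s[r]

# Hypothetical models (kbit/s): the center camera's lowest range is
# wide, so a drop from 1500 to 800 kbit/s rescales only the side camera.
s_side, bmin_side = [0.25, 0.5, 1.0], [0, 1000, 2500]
s_center, bmin_center = [0.5, 1.0], [0, 600]
```

With these values, the side camera moves from 0.5 to 0.25 while the center camera stays at full resolution, mirroring the qualitative behavior observed in the driving test.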
Fig. 6. Actual Camera Bitrates over Time: (a) Front- and Rear-Mounted, (b) Top Views

Fig. 7. Video Resolution Factors over Time: (a) Front- and Rear-Mounted, (b) Top Views
VI. DISCUSSION AND OUTLOOK
The rate-quality models form a good basis to adequately parametrize the videos of the ToD system. However, the rate-quality models were generated from only one driving sequence. To account for deviations from the models, an assessment of the actual image complexity at runtime is required. This could be done through a pre-processor or through feedback available from the encoder. In addition, extended parameter models could be generated, depending on the longitudinal and lateral motion of the vehicle. Alternatively, the generation of rate-quality models for specific driving scenarios is also conceivable. For instance, unprotected left turns yield a greater importance of the front-mounted camera facing to the left. In parking scenarios, the top view cameras, capturing the close surroundings of the vehicle, should be allocated higher priority.
Another aspect that can be addressed in future work is the selection of the visual quality metric for generating the rate-quality models. For instance, the PSNR, which estimates absolute errors, corresponds to the perceived quality only to some extent. The MSSIM, which was used in this paper, aims to capture the similarity of images regarding structural information. However, it remains an open question whether this is the correct metric for the task of ToD.

VII. CONCLUSION
This paper presented a flexible video streaming framework for teleoperated driving. It offers the ability to dynamically configure the individual video streams of the system. Given the total available uplink bitrate, the framework is capable of automatically handling the bitrate allocation and the optimization of the video resolution scaling factors. This is based on rate-quality models that were generated for each camera of the system. The operation of the framework is demonstrated on an experimental vehicle with eight cameras. A discussion of the proposed methodology points out several directions for future work and enhancements of the framework. Ultimately, this research aims to explore these possibilities with the goal to further improve the video streaming quality for the application of teleoperated driving.

VIII. APPENDIX

Fig. 8. Rate-Quality Models for Other Cameras: (a) Top View Front, (b) Top View Left and Right Average, (c) Top View Rear, (d) Rear-Mounted Center

IX. ACKNOWLEDGEMENTS