2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)

A Structured Analysis of the Video Degradation Effects on the Performance of a Machine Learning-enabled Pedestrian Detector

 

Abstract


Machine Learning (ML)-enabled software systems have been incorporated in many public demonstrations of automated driving (AD) systems. Such solutions are also considered a crucial approach towards SAE Level 5 systems, in which passengers no longer have to interact with the vehicle at all. Already in 2016, Nvidia demonstrated an end-to-end approach for training the entire software stack covering perception, planning and decision making, and the actual vehicle control. While such approaches show the great potential of ML-enabled systems, other demonstrations have shown that changes to even single pixels in a video frame can lead to completely different decisions, with dangerous consequences in the worst case. In this paper, a structured analysis is conducted to explore video degradation effects on the performance of an ML-enabled pedestrian detector. Firstly, a baseline was established by applying “You only look once” (YOLO) to 1,026 frames with pedestrian annotations in the KITTI Vision Benchmark Suite. Next, video degradation candidates for each of these frames were generated using the leading video compression codecs libx264, libx265, Nvidia HEVC, and AV1: 52 variants across the compression presets for color frames and 52 for gray-scale frames, resulting in 104 degradation candidates per original KITTI frame and 426,816 images in total. YOLO was applied to each image and the intersection-over-union (IoU) metric was computed to compare the performance against the original baseline. While aggressively lossy compression settings result in significant performance drops as expected, it was also observed that some configurations actually yield slightly better IoU results than the baseline.
Hence, while related work in the literature has demonstrated the potentially negative consequences of even simple modifications to video data when using ML-enabled systems, the findings from this work show that carefully chosen lossy video configurations preserve decent performance of particular ML-enabled systems while allowing for substantial savings when storing or transmitting data. Such aspects are of crucial importance when, for example, video data needs to be collected wirelessly from multiple vehicles, where lossy video codecs are required to cope with bandwidth limitations.
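The per-frame comparison described above relies on the standard intersection-over-union metric between a detection on a degraded frame and the corresponding baseline detection. A minimal sketch of this metric follows; the function name and the (x1, y1, x2, y2) box layout are illustrative assumptions, not the paper's actual implementation:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means the degraded-frame detection matches the baseline exactly, while values approaching 0 indicate that compression artifacts shifted, shrank, or suppressed the detection.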

Pages 357-362
DOI 10.1109/SEAA53835.2021.00053
Language English
Journal 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)
