ISSAFE: Improving Semantic Segmentation in Accidents by Fusing Event-based Data
Jiaming Zhang, Kailun Yang and Rainer Stiefelhagen

Abstract—To bring autonomous vehicles closer to real-world applications, a major task is to ensure the safety of all traffic participants. In addition to high accuracy under controlled conditions, an assistance system must maintain robust perception in extreme situations, especially in accident scenarios, which involve object collisions, deformations, overturns, etc. However, models trained on common datasets may suffer a large performance degradation when applied in these challenging scenes. To tackle this issue, we present a rarely addressed task regarding semantic segmentation in accident scenarios, along with an associated large-scale dataset
DADA-seg. Our dataset contains 313 sequences with 40 frames each, of which the time windows are located before and during a traffic accident. For benchmarking the segmentation performance, every 11th frame is manually annotated with reference to Cityscapes. Furthermore, we propose a novel event-based multi-modal segmentation architecture
ISSAFE. Our experiments indicate that event-based data can provide complementary information to stabilize semantic segmentation under adverse conditions by preserving the fine-grained motion of fast-moving foreground (crash objects) in accidents. Compared with state-of-the-art models, our approach achieves 30.0% mIoU with a 9.9% performance gain on the proposed evaluation set.

Index Terms—Semantic scene understanding, robot safety, robustness, event-based vision, autonomous driving.
I. INTRODUCTION

AUTONOMOUS vehicles benefit from breakthroughs in deep learning algorithms. In particular, image semantic segmentation, one of the fundamental tasks of computer vision, can provide a pixel-wise understanding of driving scenes, covering object categories, shapes, and locations. In recent years, many state-of-the-art segmentation models [1], [2], [3] have achieved impressive accuracy on major segmentation benchmarks. Other works [4], [5] centered on improving the efficiency of the model, in order to deploy real-time semantic segmentation on mobile platforms.

Unfortunately, driving environments in the real world are more complicated than in most existing datasets and can be divided into normal, critical and accidental situations. In addition to natural factors in normal driving scenes, such as weather and illumination, many human-centered crisis incidents caused by other traffic participants may occur.
This work was supported in part through the AccessibleMaps project by the Federal Ministry of Labor and Social Affairs (BMAS) under Grant No. 01KM151112 and in part by Hangzhou SurImage Company Ltd. (Corresponding author: Kailun Yang.) The authors are with the Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany (e-mail: [email protected]; [email protected]; [email protected]). The code and dataset are available at: https://github.com/jamycheung/ISSAFE

Fig. 1: Accident sequences from the proposed DADA-seg dataset include diverse hazards, e.g. (a) motion blurs, (b) overturns, (c) back light, (d) object occlusions. From top to bottom are timestamps before and during an accident, where the t frame is the ground-truth segmentation for quantitative evaluation, and the others are predictions of our model.

For example, vehicles overtaking irregularly, pedestrians dashing across the road, or cyclists riding out of lanes: these critical situations are all potential causes of traffic accidents, but are never seen in vision datasets. Furthermore, the initial accident scene ahead is also defined as an accidental situation, such as an overturned truck or a knocked-down motorcycle lying on the road, which should be correctly recognized by passing vehicles in time; only then can pileups be avoided. However, these abnormalities result in a large performance drop of segmentation models when they are taken from public training imagery to the wild. Consequently, this makes current semantic segmentation algorithms less stable and reliable in self-driving applications.

To satisfy the rigorous requirements of safety-relevant autonomous vehicles, a semantic segmentation model should be thoroughly tested to verify its robustness and reliability. To address this issue, this work creates an alternative benchmark based on a new task, namely semantic segmentation in accident scenarios. As a supplement to classic benchmarks [6], [7], our evaluation samples are collected from real-world traffic accident situations which involve highly dynamic scenes and extremely adverse factors. Some cases are shown in Fig. 1, covering diverse situations: motion blur while a pedestrian is dashing across the road, overturning of a motorcyclist during a collision, back-lighting at an intersection, and occlusions caused by windshield reflection. As far as we know, these factors are still challenging for most segmentation algorithms and even harmful to their performance. The objective of creating this benchmark is to provide a set of edge cases (critical and accidental) for testing the robustness of models before deployment in real applications.

In addition to traditional cameras, event cameras are bio-inspired novel sensors, such as the Dynamic Vision Sensor (DVS) [8], which encode changes of intensity at each pixel asynchronously and offer a higher dynamic range, microsecond-level time resolution, and immunity to motion blur [9]. Hence, we consider that event cameras are more sensitive in capturing motion information during driving, especially for fast-moving objects (foreground) in extreme or accident scenarios, where classic cameras suffer from the delay between frames. In low-lighting environments, event cameras still stably provide sufficient perceptual information. Under these assumptions, complementary information can be extracted from the event-based data to address shortcomings of the intensity image in both normal and abnormal scenes.

Finally, as a preliminary exploration of this new task, we propose a light-weight event-aware network branch, which serves as the event-based fusion architecture of the multi-modal model, as well as a domain bridge connecting the source and target datasets.
In accordance with our ISSAFE architecture, the robustness of semantic segmentation algorithms in accident scenarios can be significantly improved.

In summary, our main contributions are:
• We present a rarely solved task concerning semantic segmentation in accident scenarios, with the ultimate goal to robustify the perception algorithm against abnormal situations during highly dynamic driving.
• We provide an accompanying large-scale dataset DADA-seg with respect to real-world traffic accidents, of which the evaluation set has pixel-level annotations for benchmarking the robustness of algorithms.
• We propose a multi-modal segmentation architecture ISSAFE to exploit complementary features from event-based data according to two approaches, i.e. event-aware fusion and event-aware adaptation. To analyze the benefits and drawbacks of event data, comprehensive comparisons and ablation studies are conducted between various models and between data modalities.

II. RELATED WORKS
A. Semantic Segmentation
Since FCN [10] used fully convolutional layers for pixel-wise prediction on images, a massive number of models [1], [2], [3] have achieved remarkable performance in image semantic segmentation. In addition to high accuracy, other works, such as ERFNet [4] and SwiftNet [5], proposed simplified architectures to improve efficiency. Regarding generalizability, domain adaptation (DA) strategies have been extensively applied to adapt segmentation algorithms to new scenes, for example, the day-night conversion in [11] and the adaptation between diverse weather conditions like rainy [12] and snowy [13] scenes. However, apart from these natural conditions in real driving scenes, there are many uncontrollable factors in the interaction with other traffic participants. The core purpose of our work is to fill the gap of semantic segmentation in abnormal situations.

Any ambiguity in machine vision algorithms may cause fatal consequences in autonomous driving, thus robustness testing conducted in diverse driving conditions is essential. For this reason, WildDash [7] provided ten different hazards, such as blurs, underexposures or lens distortions, as well as negative test cases against the overreaction of segmentation algorithms. Inspired by this work, we create a new dataset to extend the robustness test from ordinary to accident scenarios. In our DADA-seg dataset, most of the critical or accidental scenes are even more difficult, exhibiting a large variety of adverse hazards similar to WildDash.

On the other hand, to improve robustness, some solutions constructed multi-modal segmentation models by fusing additional information, such as depth information in RFNet [14], thermal information in RTFNet [15] and optical flow in [16]. Differing from these classic modalities, in this paper, event-based data is explored as a novel auxiliary modality.
B. Event-based Vision
In recent years, event cameras have been increasingly used in visual analysis due to their properties complementary to traditional cameras, such as high dynamic range, no motion blur, and response in microseconds [9]. Instead of capturing images at a fixed rate, event cameras asynchronously encode the intensity change at each pixel with its position, time, and polarity: (x, y, t, p). Typically, for processing in a convolutional network, the original event stream is converted into an image form, such as a two-channel event frame by Maqueda et al. [17], a four-dimensional grid [18], or a discretized event volume (DEV) by Zhu et al. [19].

Based on these image-like representations, Alonso et al. [20] constructed a semantic segmentation model Ev-SegNet and trained it on an extended event dataset DDD17 [21], whose semantic labels are generated by a model pre-trained on Cityscapes and only contain 6 categories. In contrast, our models are trained with the ground-truth labels of Cityscapes and perform semantic segmentation on all 19 classes. Additionally, instead of stacking images at the input stage, event data is adaptively fused with the RGB image through an attention mechanism, which is more effective for combining two heterogeneous modalities.

While labeled event data for semantic segmentation is scarce in the state of the art, other works leveraged existing labeled image data by simulating the corresponding event data. Rebecq et al. [22] proposed ESIM to combine a rendering engine with an event simulator. Instead, without a rendering engine, EventGAN [23] presented a self-supervised approach to generate events from associated images using only modern GPUs. In this work, we utilize the EventGAN model to extend the source and target datasets by generating their associated event data, so as to investigate the benefit of event sensing in dynamic accident scenes. Finally, the event-aware domain adaptation between both datasets is performed by fusing RGB images and the synthesized events.

III. METHODOLOGY
In this section, we state the details of the new task and the relevant dataset, as well as our ISSAFE architecture, which attempts to solve the performance drop of image semantic segmentation algorithms in accident scenes.

TABLE I: Distribution of the total 313 sequences of the DADA-seg dataset under different conditions in terms of light (day, night), weather (sunny, rainy) and occasion (highway, urban, rural, tunnel).
A. Task Definition
Aiming to provide an extensive evaluation of the robustness of semantic segmentation models, we create a new task, i.e. semantic segmentation in accident scenarios. Besides, an associated evaluation set following the same labelling rules as Cityscapes [6] is provided for quantitative comparison and analysis. All test cases are collected from real-world traffic accidents and contain adverse situations. We explicitly study the robustness in challenging accident scenarios based on the assumption that the less the performance degradation of an algorithm on this unseen dataset, the better its robustness.
B. Accident Scenarios Dataset
Data Collection.
Our proposed dataset DADA-seg is selected from the large-scale DADA-2000 [24] dataset, which was collected from mainstream video sites. Only sequences with large watermarks or low resolution were removed, while most of the typical adverse scenes were retained, such as those with motion blurs, over-/underexposure, weak illumination, occlusions, etc. All other conditions are described in Table I. Concentrating on accident scenes, we retain the 10 frames before the accident and the 30 frames during the accident. After selection, the final DADA-seg dataset comprises 313 sequences with a total of 12,520 frames at a width of 1584 pixels.

Data Annotation.
For quantitative analysis, based on the same 19 classes as defined in Cityscapes, we perform full pixel-wise annotation on the 11th frame of each sequence by using polygons to delineate individual semantic classes, as shown in the t frame in Fig. 1. Comparatively, our dataset is 2× as large as Cityscapes, and all images are taken in broad regions by different cameras from various viewpoints. Moreover, all sequences focus on accident scenarios, comprising normal, critical, and accidental situations. In such a way, the evaluation performed on the DADA-seg dataset reflects more thoroughly the robustness of semantic segmentation algorithms.

Event Data Synthesis.
Bringing event data to the image semantic segmentation task, there is still a lack of event-based labeled datasets. Thus, we utilize the EventGAN [23] model to synthesize highly reliable event data for the two datasets. Different from the fixed frame rate (17Hz) of the Cityscapes [6] dataset, the sequences of the DADA-seg dataset were acquired with diverse cameras and frame rates, which means that the intensity of motion in the synthesized event data varies with the different time intervals. After verification, the penultimate frame was selected and stacked with its anchor frame for event data synthesis. Two cases of the generated event data are visualized in Fig. 2. From them, we can see how the event data benefits sensing in driving scenes with moving objects or in low-lighting environments, meanwhile providing a higher time resolution in volumetric form.

Fig. 2: Visualization of generated event data in B × H × W space, where B, H and W denote the time bins, image height and width. From left to right are the RGB image, the event volume and the event frame, where blue and red colors indicate positive and negative events (examples from Cityscapes and DADA-seg).

C. ISSAFE: Event-aware Fusion
Starting from various event representations, two diverse event fusion approaches are presented in the multi-modal segmentation model to excavate complementary informative features from the event data, which are more sensitive to motion and more stable in low-lighting scenes.
Event Representation.
Event cameras asynchronously encode an event at an individual pixel (x, y) at the corresponding triggering timestamp t, if the change of logarithmic intensity L within the time interval ∆t exceeds a preset threshold C:

L(x, y, t) − L(x, y, t − ∆t) ≥ pC,  p ∈ {−1, +1},   (1)

where the polarity p indicates the positive or negative direction of change. A typical volumetric representation of a continuous event stream of size N is a set of 4-tuples:

V = {e_i}, i = 1, ..., N, where e_i = (x_i, y_i, t_i, p_i).   (2)

However, it is still arduous to feed the asynchronous event spikes to a convolutional network while retaining a sufficient time resolution. Hence, we perform a dimensionality reduction in the time dimension, similar to [23]. The original volume is discretized with a fixed length for positive and negative events separately, and each event is locally linearly embedded into the nearest time-series panels. According to the number of positive time bins B+, a discretized spatial-temporal volume V+ is represented as:

t̃_i = (B+ − 1)(t_i − t_1) / (t_N − t_1),   (3)

V+(x, y, t̃) = Σ_i max(0, 1 − |t̃ − t̃_i|),   (4)

where t_1 and t_N are the timestamps of the first and the last event. When the positive and negative volumes are concatenated along the time dimension, the entire volume is represented as V ∈ R^(B×W×H), where B, W and H are the total number of time bins, the width and the height of the spatial resolution, respectively. The detailed setting of the time bins will be discussed in the experiments section.

Fig. 3: Model architectures of the two event fusion strategies: (a) s2d, event fusion from sparse to dense, where event data is fused into the RGB branch adaptively; (b) d2s, event fusion from dense to sparse, where the event map is extracted from the dense image and learned from the sparse ground truth. Legend: RGB/event residual layers (RL), event layers (L), upsampling modules (UP), spatial pyramid pooling (SPP), attention modules (A), k×k convolutions, element-wise addition and product, and a BCE loss on the predicted events.
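For illustration, the following is a minimal sketch (not the authors' code) of how a time-sorted event list of one polarity could be discretized into such a volume following Eq. (3)-(4); `events` is assumed to be an N×3 tensor of (x, y, t) rows.

```python
import torch

def events_to_volume(events, num_bins, height, width):
    """Discretize a time-sorted event list of one polarity into a
    (num_bins, height, width) volume in the spirit of Eq. (3)-(4):
    each event is split linearly between its two nearest time bins."""
    x = events[:, 0].long()
    y = events[:, 1].long()
    t = events[:, 2].float()
    # normalize timestamps to [0, num_bins - 1], cf. Eq. (3)
    t_norm = (num_bins - 1) * (t - t[0]) / max((t[-1] - t[0]).item(), 1e-9)
    left = t_norm.floor().long().clamp(0, num_bins - 1)
    right = (left + 1).clamp(0, num_bins - 1)
    w_right = t_norm - left.float()        # bilinear kernel max(0, 1 - |t - t_i|)
    w_left = 1.0 - w_right
    volume = torch.zeros(num_bins, height, width)
    volume.index_put_((left, y, x), w_left, accumulate=True)
    volume.index_put_((right, y, x), w_right, accumulate=True)
    return volume

# Positive and negative events are discretized separately and concatenated
# along the bin dimension to obtain the final B x H x W input volume.
```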
Sparse-to-dense.
After the event data is converted into an image-like representation, the most straightforward fusion strategy is to stack the event volume and the RGB image I ∈ R^(C×W×H) channel-wise as:

RGB-E : R^(C×W×H) ⊕ R^(B×W×H) → R^((C+B)×W×H),   (5)

which can replace the 3-channel RGB image at the input stage of the network, as introduced in Ev-SegNet [20]. In this work, we mainly explore the adaptive fusion of these two modalities between layers. As shown in Fig. 3a, the s2d fusion model, short for sparse-to-dense, includes dual branches, i.e. an RGB branch and an event branch, constructed with the ResNet-18 [25] backbone to maintain real-time speed. In the event branch, fine-grained motion features are extracted from the event data with its high time resolution. After each residual layer of both branches, inspired by RFNet [14], a channel-wise attention module is employed for feature selection, in which the motion features are emphasized in the event branch and added element-wise into the RGB branch. In other words, the higher time resolution of the event data complements the motion-related features in the blurred RGB image. Additionally, its high dynamic range enhances over-/underexposed images. After four residual layers, the event feature serves as an additional stream in the Spatial Pyramid Pooling (SPP) module [1] and is concatenated with the other high-level features for long-range context sensing. Finally, a light-weight decoder, composed of three upsampling modules with 1×1 convolutions, recovers the full-resolution prediction.
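Under our reading of Fig. 3a, each per-stage fusion can be sketched as an SE-style channel attention over the event features followed by an element-wise addition into the RGB stream; the module below is an assumption-based sketch inspired by RFNet [14], not the exact implementation.

```python
import torch
import torch.nn as nn

class EventAwareFusion(nn.Module):
    """Channel-wise attention fusion after one residual stage (a sketch of
    our reading of Fig. 3a); the authors' module may differ in details."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze spatial dims
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                   # per-channel weights
        )

    def forward(self, rgb_feat, event_feat):
        # emphasize informative event channels, then add them into the RGB stream
        return rgb_feat + event_feat * self.attention(event_feat)

# Example: fusing the 128-channel features after the second residual layer (RL2)
# fuse = EventAwareFusion(128)
# rgb_feat = fuse(rgb_feat, event_feat)
```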
On the other hand, inspired by video restoration from a single blurred image and event data as in [26], [27], we alternatively leverage the dense-to-sparse fusion approach, named d2s for short, as shown in Fig. 3b. Differing from the classic residual layers of the previous s2d fusion mode, a more light-weight encoder with 4 layers is selected as the event branch, which is similar to that of Gated-SCNN [28]. Instead of multiple residual blocks, each layer only contains a 3×3 convolution; the predicted event map ê is then used for supervised learning. Under the supervision of the BCE loss function, the training is explored with the B=1 event representation, which is divided into two cases, where P refers to the positive event data only and P+N denotes positive and negative event data, as described in the experiments section. Furthermore, aiming to learn the whole model in an end-to-end fashion, the Cross Entropy (CE) loss from the RGB branch is merged with the BCE loss as:

L = L_BCE(e, ê) + L_CE(y, ŷ),   (6)

where e, ê, y and ŷ are the ground-truth and the predicted event, and the segmentation ground truth and prediction, respectively.

D. ISSAFE: Event-aware Adaptation
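A minimal sketch of the joint objective in Eq. (6), assuming `seg_logits`/`event_logits` come from the RGB decoder and the event branch respectively (argument names are placeholders, not the authors' code):

```python
import torch.nn.functional as F

def issafe_d2s_loss(seg_logits, seg_gt, event_logits, event_gt):
    """Joint objective of the d2s model as written in Eq. (6): cross-entropy on
    the segmentation plus binary cross-entropy between the event map predicted
    from the dense image and the sparse event ground truth."""
    loss_ce = F.cross_entropy(seg_logits, seg_gt, ignore_index=255)
    # event_gt is the B = 1 frame (P or P + N), with values in [0, 1]
    loss_bce = F.binary_cross_entropy_with_logits(event_logits, event_gt)
    return loss_ce + loss_bce
```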
When the source data has labels but the target data does not, unsupervised domain adaptation (UDA) is a vital approach to perform transfer learning from normal to abnormal scenes, which can be investigated at different levels, i.e. the image level [29] and/or the feature level [30], as well as across modalities. Compared to textured RGB images, the monochromatic event data, capturing only changes of intensity, is semantically more consistent in both domains; these homogeneous event features can thus serve as a bridge to assist the domain adaptation of the RGB modality at the feature level. Based on this assumption, as shown in Fig. 4, the entire event-aware adaptation model ISSAFE-CLAN consists of two branches, where the light-weight event-aware branch is the same as the aforementioned d2s fusion and the RGB branch is constructed with the ResNet-101 [25] backbone, following the CLAN [30] model. To our knowledge, we are making an early attempt to jointly perform cross-modal unsupervised domain adaptation from normal to abnormal driving scenes between two heterogeneous modalities. In order to maintain the consistency of the ground-truth labels of both branches, the corresponding experiment mainly discusses the d2s event fusion mode, in which the original event data is applied as a supervision signal instead of as an input.

Fig. 4: Architecture of the ISSAFE-CLAN model with the event-aware branch (EA-branch) in d2s fusion mode (legend: RGB and event layers, source/target/shared flows, image transfer, discriminator, and CE/BCE/adversarial losses).

To distinguish and eliminate the impact of diverse domain adaptation strategies, we utilize the CycleGAN [29] model to translate the style of images from Cityscapes to DADA-seg and perform image-level adaptation between the two domains.
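As a rough illustration of the feature-level alignment, a generic output-space adversarial adaptation step in the spirit of CLAN [30] could look as follows; this is a simplified sketch, omitting CLAN's category-level weighting and the event-branch losses, and `seg_net`, `disc`, `opt_g`, `opt_d` are hypothetical handles rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def adaptation_step(seg_net, disc, opt_g, opt_d, src_img, src_lbl, tgt_img, lam=1e-3):
    """One generic adversarial UDA step: supervised segmentation on source,
    plus an adversarial term that pushes target predictions towards the
    source distribution, followed by a discriminator update."""
    # 1) supervised segmentation loss on the labeled source domain
    src_logits = seg_net(src_img)
    loss_seg = F.cross_entropy(src_logits, src_lbl, ignore_index=255)

    # 2) adversarial loss: make target predictions indistinguishable from source
    tgt_logits = seg_net(tgt_img)
    d_tgt = disc(F.softmax(tgt_logits, dim=1))
    loss_adv = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))

    opt_g.zero_grad()
    (loss_seg + lam * loss_adv).backward()
    opt_g.step()

    # 3) discriminator update on detached predictions from both domains
    d_src = disc(F.softmax(src_logits.detach(), dim=1))
    d_tgt = disc(F.softmax(tgt_logits.detach(), dim=1))
    loss_d = F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src)) + \
             F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_seg.item(), loss_adv.item(), loss_d.item()
```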
IV. EXPERIMENTS
This section describes the experiments with different models and the implementation details. Initially, the performance gaps of various semantic segmentation models are investigated. Afterwards, comprehensive experiments verify the effectiveness of the proposed ISSAFE architecture in reducing the performance gap, including event fusion and event adaptation.
A. Datasets
The Cityscapes [6] dataset, with 2975 training, 500 validation and 1525 test images from normal driving scenes, is selected as the source domain. Meanwhile, as the target domain, our proposed DADA-seg dataset has 313 evaluation images from abnormal driving scenes, which are labeled with the 19 classes defined in Cityscapes. The unlabeled data of DADA-seg is used to perform unsupervised domain adaptation in the CLAN [30] model. As mentioned above, both datasets were extended with synthesized event data corresponding to each RGB image. Note that in all experiments, the Source results correspond to the performance calculated on the validation set of Cityscapes, while the Target results are computed on the evaluation set of DADA-seg.
B. Performance Gap
To quantitatively evaluate the robustness of semantic segmentation algorithms, accuracy- and efficiency-oriented models are tested on the target dataset, as shown in Table II. For a fair comparison, when applicable, the results and model weights are those provided by the respective publications. Overall, the large gap shows that semantic segmentation in accident scenarios is still a challenging task for these top-performing models. As expected, although both large [2], [3] and light-weight [4], [5] models attain high accuracy in the source domain, they heavily depend on the consistency between the training and the testing data, which are all normal scenes. This hinders their generalization ability and leads to a large performance degradation once they are taken to abnormal scenes. The model OCRNet [3] with HRNet [31] backbone from the benchmark [32] obtains the highest mIoU of 80.6% on Cityscapes, but only reaches 24.9% on DADA-seg. Nonetheless, this comparison also indicates that higher performance in the source domain still benefits performance in the target domain in most cases. In the subsequent subsections, we perform ablation studies to verify the effectiveness of our proposed methods for reducing the large gap and improving the robustness in accident scenarios.

TABLE II: Performance gap of models, which are trained and validated on the source domain (Cityscapes) and then tested on the target domain (DADA-seg), both at 1024 × 512 resolution (mIoU in %).
Network | Backbone | Source | Target | Gap
ERFNet [4] | ResNet-18 | 72.1 | 9.0 | -63.1
SwiftNet [5] | ResNet-18 | 75.4 | 20.5 | -54.9
DeepLabV3+ [2] | ResNet-50 | 79.0 | 19.0 | -60.0
DeepLabV3+ [2] | ResNet-101 | 79.4 | 23.6 | -55.8
OCRNet [3] | HRNetV2p-W18 | 77.7 | 23.8 | -53.9
OCRNet [3] | HRNetV2p-W48 | 80.6 | 24.9 | -55.7

TABLE III: Comparison of different event representations and event fusion approaches. All models use ResNet-18 as backbone and are tested at 1024 × 512 resolution. The s2d and d2s represent the sparse-to-dense and the dense-to-sparse fusion approaches. B, P, and N are short for the time bins, the positive and the negative event frame, respectively. The RGB-only SwiftNet model is selected as the baseline. The compared configurations are:
Network | Input | Fusion | Event data
SwiftNet [5] | RGB | - | -
SwiftNet [5] | Event | - | B = 1 / B = 2 / B = 18
ISSAFE-RFNet | RGB+Event | s2d | B = 1 / B = 2 / B = 18
ISSAFE-SwiftNet | RGB+Event | d2s | P / P + N

C. Ablation of Event Fusion
In the first ablation study, we explore the event representations and the event fusion strategies. For efficiency reasons, we choose ResNet-18 [25] as the backbone. All models in this subsection are constructed with the encoder-decoder structure and implemented on a single 1080Ti GPU with CUDA 10.0, CUDNN 7.6.0 and PyTorch 1.1. The detailed settings and hyperparameters are consistent with SwiftNet [5] and RFNet [14], which are selected as the baseline models for adding the event branch.

As shown in Table III, starting with the event-only SwiftNet, where the event data is processed alone from sparse to dense without the RGB image, a higher number of time bins B brings better performance, attaining an mIoU of 36.6% in the source domain and 19.8% in the target domain. This indicates that the event data has a certain interpretability for the segmentation of driving scenes; its segmentation results are presented in Fig. 5. As a baseline, we train SwiftNet with RGB only from scratch, which obtains 20.1% mIoU in the target domain. Compared with the RGB-only SwiftNet, our s2d event fusion ISSAFE-RFNet obtains an mIoU improvement of +2.9% in the target domain, while maintaining better performance in the source domain. When the event data is used as auxiliary information for the RGB branch, the model improves most with the moderate event representation (B=2), because the other settings provide either too few time bins or too sparse events for the RGB image.

Likewise, we implement the dense-to-sparse fusion model, named ISSAFE-SwiftNet, based on the two different event representations mentioned before, of which the P+N event data brings over +8.2% gain in mIoU in the target domain when compared with the RGB-only baseline. As shown in Fig. 5, our event-aware branch concentrates on the motion information, especially the foreground objects, such as the motorcycle and the truck in the accident scenes. However, the segmentation of night scenes is still challenging, although our method greatly benefits from event data, in contrast to the baseline. A case of an initial accident scene is presented in the last row of Fig. 5, where an overturned car is lying on the road after a fence collision; here our approach also clearly performs more robustly.

To summarize briefly, the inputs from the two data domains are clearly complementary. While event cameras are not triggered in static scenes, conventional cameras can perfectly capture the entire scene and provide sufficient textures. When RGB cameras struggle in adverse scenes, i.e. with fast-moving objects or low-lighting environments, the event data can provide auxiliary information, which is particularly important for the segmentation of accident scenes. Fig. 5 demonstrates that the model performs significantly better by fusing events and RGB images in those challenging situations.

Fig. 5: Contrastive examples between the Event-only SwiftNet, the RGB-only SwiftNet and our ISSAFE-SwiftNet, which fuses the events in d2s mode with the P+N event representation. The columns show the input image, the event data (visualized as a gray-scale frame here), the predictions of the Event-only SwiftNet, the RGB-only SwiftNet and our model, and the ground truth. From top to bottom are accident scenarios in different situations: motorist collision, car-truck collision, car collision at night time, and an initial accident with an overturned car.

TABLE IV: Performance comparison of domain adaptation strategies, where f and i represent the feature and the image level. The Source and Target results are tested at 1024 × 512 resolution, while Target† is at 512 × 256. The per-class IoU results of the 10 foreground classes in the Target result are listed: Traffic Light (TLi), Traffic Sign (TSi), Pedestrian (Ped), Rider (Rid), Car, Truck (Tru), Bus, Train (Tra), Motorcycle (Mot) and Bicycle (Bic). Note that the target dataset does not contain any Train. Our ISSAFE-CLAN model is the event-aware adaptation at the feature and image level fusing event data in d2s mode, while DOF-CLAN fuses dense optical flow data instead.
Network | Level | TLi TSi Ped Rid Car Tru Bus Tra Mot Bic | Target† Acc/mIoU/fwIoU | Source Acc/mIoU/fwIoU | Target Acc/mIoU/fwIoU
CLAN [30] | - | 15.2 5.3 4.0 3.4 32.6 8.8 28.8 - 4.2 0.1 | 34.0/19.4/45.5 | 56.3/43.7/77.2 | 28.1/16.8/38.3
CLAN [30] | f | 17.2 ...
ISSAFE-CLAN | f+i | 17.0 19.5 ...

D. Ablation of Event Adaptation
The purpose of the second ablation is to verify the effect of our ISSAFE domain adaptation approach for further reducing the domain shift between normal and abnormal data. To compare the diverse strategies, based on the recent CLAN [30] model, the event-aware method is applied at two different levels, i.e. the feature and/or the image level. For an extensive quantitative analysis, we adopt three different metrics [10], namely pixel accuracy (Acc), mean intersection over union (mIoU) and frequency weighted intersection over union (fwIoU), as shown in Table IV.

Initially, the CLAN model adapted from the virtual to the real domain was tested directly on the DADA-seg dataset without any adjustments; this source-only CLAN gains an mIoU of 16.8% at 1024 × 512 resolution and 19.4% at 512 × 256 resolution, respectively. Note that here a smaller input resolution obtains higher accuracy in the target domain. There are two main reasons: the images of DADA-seg are originally of low resolution, and a smaller resolution yields a larger receptive field with wider context understanding, which indicates that correct classification is more critical in accident scenes than delineating the boundaries. Afterwards, we train the CLAN model from scratch on the Cityscapes and DADA-seg datasets to verify the feature-level and the feature-image-level domain adaptation, whereby the latter obtains the highest mIoU of 64.8% in the source domain.

Fig. 6: Semantic segmentation results of our ISSAFE-CLAN model on the DADA-seg dataset. The columns correspond to the input images and the output predictions of the sequence.

Next, our event-aware branch in the d2s mode is applied to the CLAN model and jointly adapted from source to target, similar to the RGB branch. As a result, our ISSAFE-CLAN model obtains the highest performance on all three metrics on the DADA-seg dataset, and achieves the top accuracy of 30.0% in mIoU, 42.1% in Acc and 64.5% in fwIoU at the higher resolution. Compared to the RGB-only SwiftNet, this model obtains a +9.9% performance gain in mIoU. In order to understand the impact of event fusion, we list the per-class IoU results of all 10 foreground classes in Table IV. This demonstrates that the foreground classes can indeed benefit more from event data, which is consistent with our assumptions.

Comparing various motion data based on our approach, we replace the event-based data with dense optical flow (DOF) computed by the Farneback [33] function in OpenCV. The DOF-CLAN model also obtains accuracy improvements, despite being clearly lower than our ISSAFE-CLAN approach at the high resolution. This further illustrates the effectiveness of motion features as complementary information for segmenting RGB images. Although both kinds of data are synthesized, motion features with a higher time resolution can be extracted from the event-based data to boost the foreground segmentation. Besides, event cameras have a high dynamic range to enhance perception in low-light conditions, which better conforms with our ISSAFE subject of improving driving safety.
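For reference, the three metrics follow the standard definitions of [10] and can be computed from a per-class confusion matrix; the snippet below is an illustrative sketch rather than the exact evaluation code used for the tables above.

```python
import numpy as np

def segmentation_metrics(conf):
    """Acc, mIoU and fwIoU from a (num_classes x num_classes) confusion matrix,
    where conf[i, j] counts pixels of ground-truth class i predicted as class j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    gt = conf.sum(axis=1)                   # pixels per ground-truth class
    pred = conf.sum(axis=0)                 # pixels per predicted class
    iou = tp / np.maximum(gt + pred - tp, 1)
    acc = tp.sum() / max(conf.sum(), 1)
    valid = gt > 0                          # ignore classes absent from the GT
    miou = iou[valid].mean()
    fwiou = (gt[valid] / gt.sum() * iou[valid]).sum()
    return acc, miou, fwiou
```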
E. Qualitative Analysis
For a qualitative study, Fig. 1 shows some segmentation results of our ISSAFE-CLAN model on the DADA-seg dataset, which are not only from traffic accidents but also include different adverse situations, such as motion blurs caused by pedestrian collisions, the overturning of a motorcyclist, back-lighting at intersections, and obstructions caused by windshield reflection. Moreover, Fig. 5 shows a qualitative comparison of segmentation examples with and without event data. These indicate that our model can significantly robustify and stabilize the segmentation in normal and abnormal scenes by fusing event data, especially for foreground objects. More semantic segmentation results on sequences from the DADA-seg dataset are presented in Fig. 6. All these qualitative studies help to throw insightful hints on how to obtain reliable perception in accident scenes for autonomous vehicles.

V. CONCLUSIONS
In this paper, we present a new task and its relevant evaluation dataset with pixel-wise annotations, which serves as a benchmark to assess the robustness and applicability of semantic segmentation algorithms. The main objective is to improve the segmentation of complex scenes in autonomous driving applications, and ultimately to reduce traffic accidents and ensure the safety of all traffic participants. As a baseline solution, we have constructed a multi-modal segmentation model based on our ISSAFE architecture by using event-based data in different modes. Our experiments show that event data can provide complementary information under normal and extreme driving situations to enhance the RGB images, such as fine-grained motion information and low-light sensitivity. Even though our study is limited to synthetic events, due to the lack of corresponding event data for the common annotated image sets, we have observed consistent and large accuracy gains.

Nevertheless, the semantic segmentation of traffic accident scenes is so complicated and full of challenges that the current segmentation performance still leaves large room for improvement. Thus, the unlabeled data in the DADA-seg dataset may be explored in future work through other learning paradigms, such as unsupervised and self-supervised learning, so that we can gain more insights from the accident scenarios. In theory, video semantic segmentation can benefit even more from the high time resolution of event data, since the driving process is highly dynamic and temporal. An equally intriguing possibility is accident prediction based on the combination of video semantic segmentation and event regression algorithms, which is a significant and promising approach to avoid traffic hazards and further ensure road traffic safety.

REFERENCES
[1] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in CVPR. IEEE, 2017, pp. 6230–6239.
[2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[3] Y. Yuan, X. Chen, and J. Wang, "Object-contextual representations for semantic segmentation," arXiv preprint arXiv:1909.11065, 2019.
[4] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2018.
[5] M. Oršić, I. Krešo, P. Bevandić, and S. Šegvić, "In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images," in CVPR. IEEE, 2019, pp. 12599–12608.
[6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR. IEEE, 2016, pp. 3213–3223.
[7] O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. Fernandez Dominguez, "WildDash - creating hazard-aware benchmarks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 402–416.
[8] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor," IEEE Journal of Solid-State Circuits, vol. 43, pp. 566–576, 2008.
[9] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis et al., "Event-based vision: A survey," arXiv preprint arXiv:1904.08405, 2019.
[10] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR. IEEE, 2015, pp. 3431–3440.
[11] L. Sun, K. Wang, K. Yang, and K. Xiang, "See clearer at night: Towards robust nighttime semantic segmentation through day-night image conversion," in Artificial Intelligence and Machine Learning in Defense Applications, vol. 11169. International Society for Optics and Photonics, 2019, p. 111690A.
[12] F. Pizzati, R. de Charette, M. Zaccaria, and P. Cerri, "Domain bridge for unpaired image-to-image translation and unsupervised domain adaptation," in WACV. IEEE, 2020, pp. 2979–2987.
[13] Z. Liu, Z. Miao, X. Pan, X. Zhan, D. Lin, S. X. Yu, and B. Gong, "Open compound domain adaptation," in CVPR, 2020, pp. 12406–12415.
[14] L. Sun, K. Yang, X. Hu, W. Hu, and K. Wang, "Real-time fusion network for RGB-D semantic segmentation incorporating unexpected obstacle detection for road-driving images," IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5558–5565, 2020.
[15] Y. Sun, W. Zuo, and M. Liu, "RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes," IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2576–2583, 2019.
[16] H. Rashed, S. Yogamani, A. El-Sallab, P. Krizek, and M. El-Helw, "Optical flow augmented semantic segmentation networks for automated driving," arXiv preprint arXiv:1901.07355, 2019.
[17] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, "Event-based vision meets deep learning on steering prediction for self-driving cars," in CVPR. IEEE, 2018, pp. 5419–5427.
[18] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, "EV-FlowNet: Self-supervised optical flow estimation for event-based cameras," arXiv preprint arXiv:1802.06898, 2018.
[19] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, "Unsupervised event-based learning of optical flow, depth, and egomotion," in CVPR. IEEE, 2019, pp. 989–997.
[20] I. Alonso and A. C. Murillo, "Ev-SegNet: Semantic segmentation for event-based cameras," in CVPR Workshops. IEEE, 2019, pp. 1624–1633.
[21] J. Binas, D. Neil, S.-C. Liu, and T. Delbruck, "DDD17: End-to-end DAVIS driving dataset," arXiv preprint arXiv:1711.01458, 2017.
[22] H. Rebecq, D. Gehrig, and D. Scaramuzza, "ESIM: An open event camera simulator," in Conference on Robot Learning, 2018, pp. 969–982.
[23] A. Z. Zhu, Z. Wang, K. Khant, and K. Daniilidis, "EventGAN: Leveraging large scale image datasets for event cameras," arXiv preprint arXiv:1912.01584, 2019.
[24] J. Fang, D. Yan, J. Qiao, and J. Xue, "DADA: A large-scale benchmark and model for driver attention prediction in accidental scenarios," arXiv preprint arXiv:1912.12148, 2019.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR. IEEE, 2016, pp. 770–778.
[26] L. Pan, R. Hartley, C. Scheerlinck, M. Liu, X. Yu, and Y. Dai, "High frame rate video reconstruction based on an event camera," arXiv preprint arXiv:1903.06531, 2019.
[27] M. Jin, G. Meishvili, and P. Favaro, "Learning to extract a video sequence from a single motion-blurred image," in CVPR. IEEE, 2018, pp. 6334–6342.
[28] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, "Gated-SCNN: Gated shape CNNs for semantic segmentation," in ICCV. IEEE, 2019, pp. 5228–5237.
[29] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV. IEEE, 2017, pp. 2242–2251.
[30] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang, "Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation," in CVPR. IEEE, 2019, pp. 2502–2511.
[31] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in CVPR. IEEE, 2019, pp. 5686–5696.
[32] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu et al., "MMDetection: Open MMLab detection toolbox and benchmark," arXiv preprint arXiv:1906.07155, 2019.
[33] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Scandinavian Conference on Image Analysis, 2003.