Adaptable Deformable Convolutions for Semantic Segmentation of Fisheye Images in Autonomous Driving Systems
Clément Playout
Department of Informatics, École Polytechnique de Montréal, Montréal, Canada. [email protected]
Ola Ahmad
CortAIx, Thales, Montréal, Canada. [email protected]
Freddy Lecue
CortAIx, Thales, Montréal, Canada and Inria, Sophia Antipolis, France. [email protected]
Farida Cheriet
Department of Informatics, École Polytechnique de Montréal, Montréal, Canada. [email protected]
February 23, 2021

Abstract
Advanced Driver-Assistance Systems rely heavily on perception tasks such as semantic segmentation, where images are captured from large field of view (FoV) cameras. State-of-the-art works have made considerable progress toward applying Convolutional Neural Networks (CNNs) to standard (rectilinear) images. However, the large FoV cameras used in autonomous vehicles produce "fisheye" images characterized by strong geometric distortion. This work demonstrates that a CNN trained on standard images can be readily adapted to fisheye images, which is crucial in real-world applications where time-consuming, real-time data transformation must be avoided. Our adaptation protocol mainly relies on modifying the support of the convolutions by using their deformable equivalents on top of pre-existing layers. We prove that tuning an optimal support only requires a limited amount of labeled fisheye images, as a small number of training samples is sufficient to significantly improve an existing model's performance on wide-angle images. Furthermore, we show that finetuning the weights of the network is not necessary to achieve high performance once the deformable components are learned. Finally, we provide an in-depth analysis of the effect of the deformable convolutions, bringing elements of discussion on the behavior of CNN models.
Convolutional Neural Networks (CNNs) have established state-of-the-art performances on numerous vision-based tasks (object detection, instance and semantic segmentation) and datasets, and have therefore become a standard for many applications. In particular, these models are major assets for Advanced Driver-Assistance Systems (ADAS) and autonomous vehicles, which require models that are highly efficient while allowing a good understanding of their operation. ADAS primarily rely on a precise identification of the environment surrounding the vehicle. In order to capture this environment, different types of sensors are used, among which fisheye cameras. These are built with specific lenses to achieve an extremely wide field-of-view (FoV), reaching angles far greater than regular lenses: the latter are considered wide-angle when covering angles from 64° to 84°, whereas the former can reach values up to 270° (but usually between 100° and 180°). However, such high FoVs come at the expense of the rectilinear property provided by regular lenses (straight features in the scene remain straight in the image). Fisheye lenses capture curvilinear images, often roughly described as deformed by barrel distortion (the magnification decreases with the distance from the optical axis, usually the center of the image). Due to their nature, fisheye images are usually harder to analyse with conventional recognition models, requiring these to be tuned and/or retrained. In the context of deep learning, this usually comes down to changing the training data, but this raises the question of the number of labelled fisheye images needed and whether the model can still benefit from existing rectilinear datasets. In this work, we demonstrate the impact of using these datasets to improve fisheye segmentation by analysing different ways to adapt an existing "rectilinear" CNN model to fisheye images. We consider different possibilities, from training a model from scratch with distorted fisheye-like images to simply modifying the support of the convolutions using trainable convolution offsets (known as deformable convolutions). As fisheye data is not a widely available resource in the Machine Learning community, we also investigate the minimal number of samples needed for model adaptation. For the sake of simplicity, the focus of this paper is on semantic segmentation, but we hypothesize that most of the techniques presented here would generalize to object detection.
Contribution 1:
We propose a novel and simple learning mechanism to adapt standard CNN models to fisheye images using the concept of deformable convolutions. To capture non-linear transformations, we demonstrate through this mechanism the possibility of learning the spatial support of convolution filters independently from their weights while maintaining almost similar performance. This insight could potentially motivate a rethinking of the data augmentation process in deep learning applications.
Contribution 2:
We demonstrate that adapting a CNN model to the non-linear spatial distortion induced by ultra-wide-angle geometry can be achieved with only a few training images. In this way, we can alleviate the lack of available training datasets of real fisheye images for different perception tasks.
Contribution 3:
As advanced driver-assistance systems commonly make use of narrow as well as large FoV cameras to perform different autonomous tasks, we propose a flexible semantic segmentation model that can be deployed with both narrow and large FoVs.
From a computer vision researcher's perspective, the field of fisheye images is relatively recent, in particular when considering tasks involving object recognition (detection or semantic segmentation). The initial studies on the subject were mainly focused on calibration (finding the intrinsic parameters of the camera) by building a geometrical or analytical model. [1] proposed a parametric model following a polynomial form to describe the projection function; a fourth-order polynomial was proved to be an accurate model. This model is still regularly used, including to synthesize fisheye distortion in rectilinear images. Calibration is a potential first step toward image rectification, whose goal is to remove the distortion such that straight-line features appear as straight lines in the image. Recently, [2] proposed to use a CNN to directly rectify fisheye images, by training the network to predict the distortion parameters. The authors demonstrated that using a semantic context as an input to the distortion parameters predictor significantly improves the system's performance. This suggests that rectification could be used as a preprocessing step before object recognition using an existing conventional model. Nonetheless, there is not much work in the literature proposing to combine both processes. As pointed out by [3], this can be explained by the fact that undistortion causes major issues: a typical fisheye with a FoV approaching 180° cannot be fully mapped onto a rectilinear image, leading to a reduction of the FoV in the rectified image. Moreover, the re-warping operation leads to a non-uniform sampling across the resulting image, creating blurry areas. Therefore, current research is shifting toward model adaptation rather than undistorting images. To our knowledge, [4] were the first to use a standard CNN architecture for object detection in fisheye-like images. In their case, the images are treated as normal ones and used to finetune a network trained on rectilinear images; the adaptation therefore consists in changing the training data rather than the model itself. This approach is comparable to what [5, 6] proposed, which is to rely on synthetic fisheye generation to extend the training distribution. By sampling different distortion parameters, this type of approach replicates the effect of data augmentation. We argue that adapting a model pretrained on rectilinear images should not require a full retraining, and propose the idea that only a limited amount of new data is needed to implicitly learn the parameters of the image distortion and thereby adapt the model's semantic prediction. This intuition is motivated by [7], who showed that it is possible to predict extrinsic and intrinsic camera distortion parameters from a single image. By extension, our work aims to explore the number of training samples needed to adapt a semantic segmentation model to interpret distorted images. For this task, using a single distorted image for finetuning would likely bias the internal statistics of the model and eventually lead to strong overfitting. Therefore, instead of modifying the weights and biases of the model itself, we propose to change the way convolutions are computed using deformable convolutions, as originally introduced by [8]. The deformable components should theoretically be able to capture the distortion parameters of the image, while the regular convolutions extract meaningful semantic features.
This idea has also been explored by [9], who transformed deformable convolutions into their restricted equivalent and tested them on fisheye images. Their work also analyzes the placement of the deformable layers within the network to achieve optimal performance. Our approach differs from theirs as we only adapt the deformable part of the convolutions, rather than training a complete model. We also study in depth the effect of the positioning of deformable convolutions within the existing network and the effect of batch normalization, and explore few-shot training approaches.
Problem Statement:
Given a set S = {X, Y}, where X = {x⁽ⁱ⁾}ᵢ₌₁..N is a set of rectilinear images, Y = {y⁽ⁱ⁾}ᵢ₌₁..N the set of their associated 2D semantic segmentation groundtruths and N the number of samples in the dataset, given a semantic segmentation model φ with prediction φ(x) = ỹ, and given a parametrized conversion function ρ that associates a rectilinear image to its fisheye equivalent, this work aims to find the adaptation function Ψ, with φ̂ = Ψ(φ), that optimizes:

$$\hat{\phi} = \Psi(\phi), \qquad \min_{\Psi} \left\lVert \hat{\phi}(\rho(x)) - \rho(\tilde{y}) \right\rVert \tag{1}$$

Note that the definition of ρ is voluntarily vague, since ρ can represent a synthetic projection function just as well as a change in the actual camera lens. In the real world, ρ is most likely unknown (and unused). However, we will assume that we know its form in the rest of the paper, as ρ is needed in the absence of an existing labelled fisheye dataset. To summarize, we want the adapted model to predict, from a wide-angle image, an output as close as possible to ρ(ỹ), which is the distortion of the prediction obtained from a rectilinear image with the original model. The rest of this section describes each of the functions in Equation 1, i.e. the segmentation model φ, the conversion function ρ and finally the components of the adaptation function Ψ.

Semantic segmentation of rectilinear images has been thoroughly studied and refined to a point where many efficient CNN architectures are now easily deployable. The choice of a particular model is mainly based on a trade-off between performance requirements and available computational resources. We favour the former criterion and choose the DeepLabV3+ model as our baseline network. DeepLabV3+ is an architecture introduced by [10], extending their previous work on large-scale networks using atrous convolutions. The architecture is composed of three main components. It starts with an encoder network that reduces the spatial resolution of the input while increasing its depth; the encoder is used to extract features from the input image. The lowest-level features are fed to the second component, the Atrous Spatial Pyramid Pooling (ASPP). It consists of several parallel convolution layers using different dilation rates, working overall as a multi-scale convolutional layer. The last component of the architecture is a decoder module that expands the features from the encoder and the ASPP back up to the input dimensions. Inspired by the skip connections proposed by [11], the decoder concatenates low- and mid-level features from the encoder and outputs a segmentation map. We experimented with two different backbones for the encoder, the resnet101 proposed by [12] and the Aligned-Xception introduced by [13]. We did not observe any improvements with the latter and thus kept the former. The resnet101 was pretrained on a subset of the COCO train2017 dataset provided with the Torchvision library.

The lack of existing ultra wide-angle datasets has motivated many researchers to generate approximations from rectilinear images. One of the two following approaches is usually chosen:

• Simulating a fisheye-like distortion on rectilinear images.
• Rendering images from a 3D scene using a virtual fisheye camera.

The first option, while being easy to set up, suffers from the drawbacks related to grid sampling mentioned in the previous section. The second option is significantly more time-consuming and requires knowledge of 3D graphics tools. On the other hand, it has the advantage of generating more realistic images in terms of their distortion and their FoV. Nonetheless, the 3D rendered images are far from being as detailed as real-world images. For the sake of completeness and reproducibility, we have experimented with both approaches.
Figure 1: Example of distortion using a parametric polynomial to synthesize fisheye-like images. (a) Original; (b) f = 125; (c) f = 75.
Simulation
To simulate fisheye distortion on rectilinear images, we rely on the same model as the one used in the open-source library OpenCV (https://docs.opencv.org/master/db/d58/group__calib3d__fisheye.html). Noting (x, y) a pair of normalized coordinates in the rectilinear image, the distortion function maps them to normalized fisheye coordinates (x′, y′) using the following equations:

$$
\begin{aligned}
r &= \sqrt{x^2 + y^2}\\
\theta &= \arctan(r)\\
\theta_d &= \theta \cdot (1 + k_1\theta^2 + k_2\theta^4 + k_3\theta^6 + k_4\theta^8)\\
x' &= f \cdot (\theta_d / r) \cdot x + x'_0\\
y' &= f \cdot (\theta_d / r) \cdot y + y'_0
\end{aligned} \tag{2}
$$

The parameters {f, kᵢ}ᵢ₌₁..₄ are tunable, and (x′₀, y′₀) can be adjusted to change the distortion center. We experimentally limit ourselves to variations of f, which corresponds to a scale factor (as an approximation of a varying focal length), as depicted in Figure 1. Using this set of equations, we are able to apply the distortion to real images from the Cityscapes dataset freely provided by [14].
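To make Equation 2 concrete, the sketch below warps a rectilinear image into its fisheye-like equivalent. It is our illustration rather than the authors' code: it assumes k₁ = k₂ = k₃ = k₄ = 0 (our experiments only vary f) and that the normalized coordinates are pixel offsets from the image center divided by f. Under these assumptions θ_d = θ, so the mapping inverts in closed form, which is exactly what cv2.remap requires (for each output pixel, the source coordinates).

```python
import numpy as np
import cv2

def simulate_fisheye(image, f=125.0, center=None):
    """Warp a rectilinear image into a fisheye-like image following
    Equation 2 with k1 = k2 = k3 = k4 = 0. For a fisheye pixel at
    radius rho (in units of f) from the center, theta_d = rho and the
    rectilinear radius is r = tan(theta_d)."""
    h, w = image.shape[:2]
    cx, cy = center if center is not None else ((w - 1) / 2.0, (h - 1) / 2.0)

    # Destination (fisheye) pixel grid, expressed relative to the center.
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    dx, dy = (u - cx) / f, (v - cy) / f

    rho = np.sqrt(dx ** 2 + dy ** 2)              # equals theta_d here
    rho = np.clip(rho, 1e-8, np.pi / 2 - 1e-3)    # keep tan() finite
    scale = np.tan(rho) / rho                     # r / theta_d

    # Source (rectilinear) coordinates for each fisheye pixel; pixels
    # beyond the valid cone fall outside the source image and stay black.
    map_x = (dx * scale * f + cx).astype(np.float32)
    map_y = (dy * scale * f + cy).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT)

# Smaller f -> stronger barrel distortion, as in Figure 1:
# distorted = simulate_fisheye(cv2.imread("frame.png"), f=125.0)
```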
Generation
Our generated dataset is based on an extension of the 3D urban scene provided by [15]. Instead of the original depth maps provided, we configure the render engine to output semantic maps and rendered images. Moreover, we enrich the scene by adding different 3D assets obtained from a free assets provider (all assets used are free for non-commercial purposes). We thereby added the following objects to the existing scene: cars (6 different models), pedestrians (3 models), bikes (2 models), cyclists, a bus and a bus station. The limited variability of these scene elements might limit the usefulness of this dataset for real-world applications, but it can still provide important insights into fisheye segmentation. In total, the generated semantic maps are composed of 11 different classes. Images are rendered at a fixed resolution. Two renders are done, the first one with a rectilinear camera (FoV = 80°) and the second one with a fisheye camera (FoV = 180°). As the same scene, with the same contents, is represented in both, our comparative study focuses solely on the effect of the fisheye distortion, all else being equal. Figure 2 shows the type of results we obtain. We refer to this dataset as BlenDataset.

Figure 2: 3D renders from the same spatial camera position but two different lenses. (a) Rectilinear, FoV = 80°; (b) Fisheye, FoV = 180°.

Deformable convolutions (DCN) were introduced by [8] to extend the regular grid sampling locations used in convolutions with 2D offsets. The offsets are learned at every location from the feature maps of a previous layer, leading to dense representations that capture free-form deformations of the spatial kernel applied in the current convolution layer. An offset simply refers to the displacement vector δpᵢ of a point pᵢ, i ∈ {1, ..., n × m}, taken on the n × m spatial grid of the kernel. Given a location p where a standard convolution kernel applies, the output feature value of the deformable convolution layer at p becomes:

$$y(p) = \sum_{p_i \in \mathcal{R}} w(p_i) \cdot x(p + p_i + \delta p_i) \tag{3}$$

where R is the grid support of the convolution. Because of the fractional nature of the offset values, [8] used bilinear interpolation to apply Equation 3. Originally, the weights of the kernel were learned simultaneously with the offsets, and the DCN was applied to rectilinear images to enhance convolutions in CNNs. In this work, we leverage the DCN's capability to take into account the non-linear transformations brought by fisheye geometric distortion (as shown in Figure 1) and propose an adaptable approach for fisheye image recognition tasks. To do so, we transfer the weights of a base model (trained on rectilinear images) to the task of fisheye segmentation by converting regular convolutions to DCN and only training the offset layers that precede the main convolution layer (as shown in Figure 3). We call the resulting model an adaptable deformable convolution, as it adapts an existing convolution to the extrinsic deformations of the grid while preserving the intrinsic properties of the objects. Following Equation 1, the objective is to find the optimal adaptation parameters of Ψ that minimize the error between the predicted output and the groundtruth. The offsets are implicitly learned by backpropagating this error while the parameters of the base convolution layers remain fixed.

Figure 3: Deformable convolutions (DCN) insert themselves within a regular convolution layer with a k × k kernel. Our adaptation process only requires training the prediction of the offsets (in light orange).
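The following sketch illustrates this scheme. It is a minimal re-implementation, not the authors' code (which relies on the MMDetection DCN implementation): a pretrained nn.Conv2d is wrapped so that its weights are frozen inside a torchvision DeformConv2d, while a zero-initialized offset predictor remains the only trainable part. With zero offsets the module is exactly equivalent to the original convolution.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AdaptableDeformConv(nn.Module):
    """Hypothetical wrapper illustrating the scheme of Figure 3: the
    pretrained kernel weights are frozen inside a deformable convolution
    and only the offset-predicting layer is trained."""

    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        kh, kw = conv.kernel_size
        # Offset predictor: 2 values (dy, dx) per kernel location.
        self.offset_conv = nn.Conv2d(
            conv.in_channels, 2 * kh * kw, kernel_size=conv.kernel_size,
            stride=conv.stride, padding=conv.padding, dilation=conv.dilation)
        # Zero-init so the module initially behaves exactly like `conv`.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)

        self.deform_conv = DeformConv2d(
            conv.in_channels, conv.out_channels, conv.kernel_size,
            stride=conv.stride, padding=conv.padding,
            dilation=conv.dilation, groups=conv.groups,
            bias=conv.bias is not None)
        # Reuse and freeze the pretrained parameters.
        self.deform_conv.weight = conv.weight
        if conv.bias is not None:
            self.deform_conv.bias = conv.bias
        for p in self.deform_conv.parameters():
            p.requires_grad = False

        self.adapt = True  # False -> zero offsets, i.e. the original conv

    def forward(self, x):
        offset = self.offset_conv(x)
        if not self.adapt:
            offset = torch.zeros_like(offset)
        return self.deform_conv(x, offset)
```

A pretrained layer can then be swapped in place, e.g. `layer.conv2 = AdaptableDeformConv(layer.conv2)` (the attribute path is hypothetical and depends on the backbone).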
Figure 4: Comparison of the predictions obtained from the same image with three different models: (a) rect-DL3; (b) fish-DL3; (c) adpt-DL3. For the sake of readability, only 5 classes are shown. The second and third images are distorted with f = 125. rect-DL3 has been trained on rectilinear images, fish-DL3 has been trained from scratch on distorted images and adpt-DL3 is the adapted version of rect-DL3 (where convolutions are replaced by DCN but weights are kept equal).

Two datasets were used independently for our experiments, meaning that we trained different models using the same protocol for each dataset.
BlenDataset:
This dataset is composed of 4000 pairs of rectilinear and fisheye synthetic images, as well as their corresponding groundtruth semantic segmentations. The dataset corresponds to a single video sequence taken in the 3D virtual scene. The first 3000 images were used for training purposes and the remaining 1000 for testing; this split avoids having consecutive (and therefore highly correlated) frames in the two distinct sets. The training set was furthermore randomly split into two subsets (training and validation), following a 0.8/0.2 ratio.
Cityscapes:
The Cityscapes dataset comprises 5000 images divided into training, validation and test sets (2975, 500 and 1525 images respectively). Groundtruth maps are not publicly available for the test set, therefore we only used the first two sets: the validation set was used as the test set, and the original training set was split in two (0.9/0.1 ratio) for training and validation purposes. For all experiments, the images were kept at their original resolution and random patches of fixed size were extracted from them.

Rectilinear segmentation training:
We trained the DeepLabV3+ architecture on 4 GPUs, using synchronized batch normalization as proposed by [16], mixed precision, and a learning rate of 0.005 for the encoder and 0.05 for the decoder. Both learning rates were updated during training using the "poly" schedule policy introduced by [17]. Weights were updated using the Adam solver. As a loss function, the weighted cross-entropy was used to alleviate the issue of class imbalance. We also observed that data augmentation (random scaling, rotation and horizontal flipping) helped to improve the model's performance. Training ran for 100 epochs with a batch size of 8. We refer to this model as rect-DL3 (for "rectilinear DeepLabV3+").
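For reference, the "poly" policy of [17] decays each rate as a power of the remaining training progress. A minimal sketch follows; the exponent 0.9 is the value conventionally used with DeepLab-style models (our assumption, the paper does not state it):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' schedule of [17]: decays from base_lr to 0 over max_iter
    iterations. power=0.9 is a common convention, not stated in the paper."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Separate base rates for encoder (0.005) and decoder (0.05), as in the text:
# for group, base in zip(optimizer.param_groups, (0.005, 0.05)):
#     group["lr"] = poly_lr(base, cur_iter, max_iter)
```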
Adaptive training:
In order to adapt the model, we added deformable convolutions on top of the trained rectilinear model, using the efficient implementation from the MMDetection toolbox provided by [18]. Following the principle of an ablation study, we tested different configurations in order to understand the mechanisms underlying the different network components and demonstrate their respective effects. For each configuration, the adaptation was trained for 25 epochs on fisheye images (generated or simulated). To limit the scope of this paper, we used a fixed distortion level, parameterized by f = 125. We found that the prediction of DCN offsets was unstable when using high learning rates; consequently, we reduced the rates to 0.01 for the decoder and 0.001 for the encoder. All other training parameters were kept identical to the rectilinear training procedure described in the previous paragraph.
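In code, this adaptation setup amounts to freezing everything except the offset predictors (and, optionally, the batch normalization layers, whose effect is examined below). A sketch using the AdaptableDeformConv wrapper introduced earlier; the returned parameters would then be split between encoder and decoder groups with the reduced rates given above:

```python
import torch.nn as nn

def configure_adaptation(model, tune_bn=True):
    """Freeze every pretrained parameter, then re-enable gradients only
    for the DCN offset predictors and (optionally) the BN layers.
    Returns the trainable parameters to pass to the optimizer."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = []
    for m in model.modules():
        if isinstance(m, AdaptableDeformConv):
            for p in m.offset_conv.parameters():
                p.requires_grad = True
            trainable += list(m.offset_conv.parameters())
        elif tune_bn and isinstance(m, nn.BatchNorm2d):
            for p in m.parameters():
                p.requires_grad = True
            trainable += list(m.parameters())
    return trainable
```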
Evaluation procedure:
As an evaluation metric, we adopted the mean Intersection-over-Union (mIoU) proposed by [14]. The IoU is computed per class over the whole test set and then averaged across classes. For both datasets, a "void" class, corresponding to unsegmented objects and/or borders of the image, was included, and the corresponding pixels were discarded from the metric computations.
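For completeness, the metric can be computed by accumulating a confusion matrix over the test set and averaging the per-class IoU. A sketch, assuming void pixels are marked with the label 255 (a common convention; any sentinel works):

```python
import numpy as np

def confusion(gt, pred, n_cls, void=255):
    """Accumulate an n_cls x n_cls confusion matrix over one image pair,
    ignoring pixels labelled with the 'void' sentinel."""
    valid = gt != void
    idx = n_cls * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=n_cls ** 2).reshape(n_cls, n_cls)

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN), averaged across classes."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(0) + conf.sum(1) - tp
    return float(np.mean(tp / np.maximum(union, 1)))  # guard empty classes
```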
The need for model adaptation stems from the difficulty regular rectilinear models have in segmenting wide-angle images. This can be illustrated quite simply by directly testing rect-DL3 (without adaptation) on different fisheye distortions. The performances are reported in Table 1: as expected, they quickly deteriorate in comparison to the model's performance on rectilinear images, and the degradation is correlated with the strength of the distortion.

Table 1: Performance of rect-DL3 (one per dataset) tested under different configurations: rectilinear and fisheye images on BlenDataset, and simulated fisheye with f = 150, f = 125 and f = 75 on Cityscapes, where a lower value of f indicates a stronger distortion.

An Upper Limit to the Adaptation's Efficiency
As the adaptation phase is only aimed at tackling the distortion problem, we can hypothesize an upper limit on the adapted model's performance. Given a metric function Ω (for example the mIoU), a rectilinear groundtruth map y, a prediction ỹ from a rectilinear input with rect-DL3 and a distortion function ρ, the best performance we can expect with the prediction y_adpt from the adapted model on the corresponding distorted input is:

$$\Omega(y_{adpt}, \rho(y)) \leq \Omega(\rho(\tilde{y}), \rho(y)) = \Lambda_{\Omega} \tag{4}$$

In other words, for a distorted input, the best prediction possible corresponds to the distortion of the prediction obtained from the equivalent non-distorted input. This upper limit is denoted Λ_Ω. We computed Λ_mIoU for rect-DL3 with f = 125, the distortion level used in the following experiments. As expected, none of our adapted models exceeded this limit, but the adaptation process proved its efficiency by approaching the upper limit very closely (as shown in Table 4 below).
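Estimating Λ_Ω only requires the rectilinear model and the distortion function ρ, with no adapted model involved. A sketch reusing the helpers above (function and variable names are ours; rect_model is assumed to return a C × H × W score map, and distort stands for ρ and must use nearest-neighbour sampling so label indices are not blended):

```python
import numpy as np

def adaptation_upper_limit(rect_model, distort, samples, n_cls):
    """Equation 4: predict on each rectilinear image, distort the
    prediction with rho, and score it against the distorted groundtruth."""
    conf = np.zeros((n_cls, n_cls), dtype=np.int64)
    for image, gt in samples:               # rectilinear image/label pairs
        pred = rect_model(image).argmax(0)  # H x W class map
        conf += confusion(distort(gt), distort(pred), n_cls)
    return mean_iou(conf)
```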
Adapting with Batch Normalization
Following the observations of [19], we noticed that finetuning the Batch Normalization (BN) layers during the adaptation phase helps to reach better performance than only learning the offset predictions. Hence, we investigated the effect of each component of the adaptation process; the results of this comparison are reported in Table 2.

Table 2: Performance on Cityscapes (f = 125) of rect-DL3 and its adapted variants (+BN, +DCN, +DCN+BN), compared to the upper limit Λ_mIoU.

Note that finetuning the BN layers alters the behavior of the original rect-DL3 model on rectilinear images, whereas the DCN offsets can very easily be turned off, restoring the adapted model back to its original state. In autonomous vehicles that rely on both wide-angle and regular lenses, this ensures a very simple way to use the same model for both imaging modalities.
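Concretely, switching the deployed model between lens types is a one-line toggle over the wrappers introduced earlier (a sketch; if the BN layers were finetuned, their original statistics must also be restored to fully recover rect-DL3):

```python
def set_rectilinear_mode(model, rectilinear: bool):
    """Zeroed offsets make every AdaptableDeformConv behave exactly like
    its original pretrained convolution (see the wrapper sketch above)."""
    for m in model.modules():
        if isinstance(m, AdaptableDeformConv):
            m.adapt = not rectilinear
```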
What to Adapt?
Similarly to [8] and [20], we studied which parts of the original model needed to be adapted (i.e. adding DCN and finetuning the batch normalization) to obtain optimal performance. Rather than a layer-by-layer approach, we restricted the experiments to the two main blocks of the rect-DL3 model, the encoder and the decoder. Results are reported in Table 3.

Table 3: Comparison of the effects of adapting different components of the model on Cityscapes (f = 125): 0.414 mIoU when adapting the decoder only, and 0.639 when adapting the encoder only. For each case, the adaptation consists in adding deformable convolutions and tuning the batch normalization.

As we can see, adapting the encoder and the decoder jointly seems to offer very little improvement over adapting only the encoder. This opens up interesting leads for future experiments, as a given encoder can be finetuned for different tasks (for example, ours could be tuned for object detection). In light of this, we believe that our adaptation method is likely to be suited for other tasks than semantic segmentation, meaning that the weights of the offset predictions could be re-used, with no further tuning, in an encoder trained for a different task such as object detection. The verification of this hypothesis is left for future work.

Figure 5: Performance (mIoU) with respect to the number of training samples.
Few-shot adaptive training
One of the central ideas of this work is that learning to adapt a model to distorted samples is significantly easier than learning semantic segmentation on them from scratch. We explored this notion by turning our adaptation protocol into a few-shot learning problem: subsets of different sizes (1, 50, 100, 1000) were sampled from the initial training set, and we evaluated the performance of the adapted model on the same test set. Training and testing were done on Cityscapes (f = 125). Results of this experiment are shown in Figure 5. The graph shows that training on a single image does not bring any improvement over the non-adapted model (the mIoU drops from 0.420 to 0.390). Nonetheless, even a relatively small number of samples (n = 50 in Figure 5) brings the model's performance relatively close to the level reached when using the full training set (composed of 2675 images).
Given the variability of existing fisheye simulation procedures as well as the heavy computational requirements of retraining different semantic models, we do not compare our adapted models directly with the state of the art. Nonetheless, we evaluate how they compare with a DeepLabV3+ trained on fisheye images with variable f as a form of data augmentation, as suggested by [5, 6]. This model is referred to as fish-DL3, to be distinguished from the adapted model, referred to as adpt-DL3. Results are shown in Table 4. In addition, a qualitative comparison between the different models is shown in Figure 4.

Table 4: Comparison of rect-DL3, fish-DL3 and adpt-DL3 on BlenDataset (fisheye) and on Cityscapes (f = 125), with the upper limit Λ_mIoU given for reference.
Discussion
The main experiments in this study served to demonstrate the effectiveness of the proposed adaptable deformable convolutions on fisheye images for semantic segmentation tasks. By learning only the weights of the offset layers, the DCN-based model was able to adapt faster to non-linear spatial distortion and capture more accurate feature representations than finetuning or retraining a standard CNN. The proposed approach should remain valid for any choice of f, but experiments on dynamic values of f should be conducted to explore the model's capacity to generalize to the different types of fisheye cameras used in ADAS and autonomous vehicles. The offsets are learned by backpropagating the cross-entropy loss function, and an open question emerges: can we learn the offsets in a self-supervised way, thereby removing the need for explicit data annotation? Given that real-world fisheye images are often difficult to annotate because of their distortion, self-supervised learning could save significant time and costs in adapting CNNs to vision-based tasks on ultra wide-angle images. This work contributes to that goal by providing the concept of an upper performance limit on model adaptation to fisheye images; we leave this research direction to future work. We see two limitations of our current work: (1) the lack of direct comparisons with related work and (2) the lack of validation on real fisheye images. The rationale behind these limitations is the availability of data. To the best of our knowledge, the few existing methods have used fisheye simulations with different experimental setups and trained on larger datasets to reach high performance. In the absence of a unified dataset, or at least a standardized distortion approach, fair comparisons are hard to make. To address this in the future, we plan to release our code and trained models soon.

Conclusion

Deformable convolutions have been shown to be a significant improvement over regular convolutions in many tasks. This work focuses on proving that they can be effectively used on top of an existing CNN without modifying its pre-trained weights. This opens up interesting applications for systems relying on multiple imaging modalities, as a single model can be reliably adapted to different tasks by means of marginal modifications rather than full retraining. Moreover, we demonstrate that training the deformable components can be done independently from the rest of the model (even if finetuning the batch normalization is advised), that it does not require a large number of samples, and that it alleviates the need to build large datasets of labeled fisheye images. These observations open different avenues worth exploring in future work. In particular, since autonomous vehicles require real-time object detection, we plan to investigate whether our adaptation protocol can be applied to existing object detection models.
Acknowledgment
This work was supported by NSERC (Natural Sciences and Engineering Research Council of Canada). The authors gratefully acknowledge Philippe Debanné for revising this manuscript.
References

[1] D. Scaramuzza, A. Martinelli, and R. Siegwart. A Flexible Technique for Accurate Omnidirectional Camera Calibration and Structure from Motion. In Fourth IEEE International Conference on Computer Vision Systems (ICVS'06), pages 45–45, January 2006.

[2] Xiaoqing Yin, Xinchao Wang, Jun Yu, Maojun Zhang, Pascal Fua, and Dacheng Tao. FishEyeRecNet: A Multi-Context Collaborative Deep Network for Fisheye Image Rectification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 469–484, 2018.

[3] Senthil Yogamani, Ciaran Hughes, Jonathan Horgan, Ganesh Sistu, Sumanth Chennupati, Michal Uricar, Stefan Milz, Martin Simon, Karl Amende, Christian Witt, Hazem Rashed, Sanjaya Nayak, Saquib Mansoor, Padraig Varley, Xavier Perrotton, Derek Odea, and Patrick Perez. WoodScape: A Multi-Task, Multi-Camera Fisheye Dataset for Autonomous Driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9307–9317, Seoul, Korea (South), October 2019. IEEE.

[4] Fucheng Deng, Xiaorui Zhu, and Jiamin Ren. Object detection on panoramic images based on deep learning. In 2017 3rd International Conference on Control, Automation and Robotics (ICCAR), pages 375–380, April 2017.

[5] Álvaro Sáez, Luis M. Bergasa, Elena López-Guillén, Eduardo Romera, Miguel Tradacete, Carlos Gómez-Huélamo, and Javier del Egido. Real-Time Semantic Segmentation for Fisheye Urban Driving Images Based on ERFNet. Sensors (Basel, Switzerland), 19(3), January 2019.

[6] Yaozu Ye, Kailun Yang, Juan Wang, and Kaiwei Wang. Universal Semantic Segmentation for Fisheye Urban Driving Images. arXiv:2002.03736 [cs, stat], January 2020.

[7] Manuel López, Roger Marí, Pau Gargallo, Yubin Kuang, Javier Gonzalez-Jimenez, and Gloria Haro. Deep Single Image Camera Calibration With Radial Distortion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11809–11817, June 2019.

[8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.

[9] Liuyuan Deng, Ming Yang, Hao Li, Tianyi Li, Bing Hu, and Chunxiang Wang. Restricted Deformable Convolution-Based Road Scene Semantic Segmentation Using Surround View Cameras. IEEE Transactions on Intelligent Transportation Systems, pages 1–13, 2019.

[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, Lecture Notes in Computer Science, pages 833–851, Cham, 2018. Springer International Publishing.

[11] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, pages 234–241, Cham, 2015. Springer International Publishing.

[12] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.

[13] François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807, July 2017.

[14] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

[15] Zichao Zhang, Henri Rebecq, Christian Forster, and Davide Scaramuzza. Benefit of large field-of-view cameras for visual odometry. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 801–808, May 2016.

[16] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. MegDet: A Large Mini-Batch Object Detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6181–6189, Salt Lake City, UT, June 2018. IEEE.

[17] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking Wider to See Better. arXiv:1506.04579 [cs], November 2015.

[18] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv preprint arXiv:1906.07155, 2019.

[19] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting Batch Normalization For Practical Domain Adaptation. arXiv:1603.04779 [cs], November 2016.

[20] Liuyuan Deng, Hao Yang, Tianyi Li, Bing Hu, and Chunxiang Wang. Restricted Deformable Convolution-based Road Scene Semantic Segmentation Using Surround View Cameras.