Sequential image processing methods for improving semantic video segmentation algorithms
Beril Sirmacek, Nicolò Botteghi, Santiago Sanchez Escalonilla Plaza
Beril Sirmacek, Nicolò Botteghi, Santiago Sanchez Escalonilla Plaza
Robotics and Mechatronics, University of Twente, Enschede, The Netherlands
(b.sirmacek, n.botteghi, s.sanchezescalonillaplaza)@utwente.nl
October 30, 2019

Abstract
Recently, semantic video segmentation has gained high attention, especially for supporting autonomous driving systems. Deep learning methods have made it possible to implement real-time segmentation and object identification algorithms on videos. However, most of the available approaches process each video frame independently, disregarding their sequential relation in time. Therefore, their results suddenly miss some of the object segments in some frames, even if those segments were detected properly in earlier frames. Herein we propose two sequential probabilistic video frame analysis approaches to improve the segmentation performance of existing algorithms. Our experiments show that, by using the information of past frames, we increase the performance and consistency of state-of-the-art algorithms.

Keywords: Artificial Intelligence · Semantic Segmentation · Conditional Probability · Temporal Consistency
In an era of automation, it is not far-fetched to imagine a scenario where transportation no longer supposes a hassle for the driver. Regarding commuting statistics, it is interesting to look at the American panorama, as living patterns there are more standardized than across the different countries of Europe. A recent study by Statista [1] states that in America in 2016, an estimated 85.4 percent of 150M workers drove to their workplace in an automobile, while only 5.1 percent used public transportation for this purpose. Out of the 85.4 percent, a total of 77 percent (115M people) drove alone to work every day [2].

Although some people enjoy the act of driving, it is fair to generalize that driving during rush hour is considered one of the most stressful scenarios of the daily commute. While passengers can just sit and relax, the driver has to be constantly conscious of his actions during the whole ride. Self-driving cars aim to free the driver from this activity, allowing him to spend his time on more valuable tasks. Autonomous cars (ACs) can also transform the current traffic system by making it safer and more efficient to navigate, extending the benefits of automation to non-AC users.

Autonomous navigation, however, is not a recent invention. In 1912, Lawrence Sperry successfully demonstrated the implementation of an autopilot system in aviation. At an aircraft exhibition celebrated in Paris in 1914, Sperry performed numerous in-flight tricks in front of an audience to test the autonomy of the navigation system under no-pilot conditions.

However, solving autonomous navigation problems for cars, drones, buses, trucks, etc. is not a trivial problem, for different reasons:

• From the structural point of view, to list some examples: non-standardized roads (undefined or different lane sizes), inconsistent driving conditions (changes in weather, deteriorating driving surfaces), obstacles or debris, ambiguous drivable space, undefined traffic sign locations.
• From the non-structural point of view, other factors come into play, such as human or animal interaction (unpredictable behavior).

The eruption of deep learning in the last decade has made it possible to create safer and more intelligent pilot systems that enable autonomous vehicles to operate better under previously unseen scenarios. Deep learning, together with the motivation of some companies to take autonomous navigation systems into mass production, makes the present the perfect time to solve the autonomous vehicles enigma.
Although any system that requires autonomous navigation (cars, drones or any other mobile robot) can be considered for this topic, this document will focus on autonomous cars. The reason for this focus is that the recent outburst of autonomous navigation in the automobile industry is promoting scientific interest in autonomous cars, resulting in new studies and data sets that cover this application.

When talking about autonomy, the National Highway Traffic Safety Administration (NHTSA) has defined the following levels of car automation:

• Level 0: No Automation. The driver performs all driving tasks.
• Level 1: Driver Assistance. The vehicle is controlled by the driver, but some driving-assist features may be included in the vehicle design (such as ESP, airbags, lane keeping, ...).
• Level 2: Partial Automation. Driver-assist systems control both steering and acceleration/deceleration, but the driver must remain engaged at all times (e.g. cruise control or parking assistance).
• Level 3: Conditional Automation. The driver is a necessity but not required at all times; he must be ready to take control of the vehicle at all times with notice.
• Level 4: High Automation. The vehicle is capable of performing all driving functions under certain conditions. The driver may have the option to control the vehicle.
• Level 5: Full Automation. The vehicle is capable of performing all driving functions under all conditions. The driver may have the option to control the vehicle.

There are different companies trying to adapt classic vehicles to the different levels of automation. Nowadays, most of the available cars have at least level 2 automation, making level 3 the next step of the challenge. Level 3 is currently dominated by Tesla: since the release of Autopilot in 2016, Tesla has been manufacturing new vehicles and has surpassed the 1 billion mark of miles driven autonomously (followed by Waymo with 10 million miles).
Despite this big improvement, further levels of automation require a deeper study of the current technology and the gathering of big amounts of data on driving patterns and uncommon situations.

In order to grant cars autonomy, former cars need to be upgraded both on the hardware and on the software side. A key piece of this upgrade is the choice of the car's equipment. Cameras are the most common sensor present in autonomous vehicles, and along with other types of sensors they are able to recreate a virtual representation of the surroundings.

The application domain of autonomous driving extends to any place with a drivable area (figure 1). Apart from the variety of roads, the difficulty of automation is increased by the bounds of the problem: outdoor application. This loose definition of the domain specifications is what makes autonomous driving so challenging.

Figure 1: Different types of roads. From left to right: urban road, highway and rural road.

Granting a machine the capacity to take over from humans in tasks such as transportation is a non-trivial problem. Driving is a life-risk activity and therefore needs a meticulous study, testing and evaluation of these new autonomous technologies.
Computer vision is the field of engineering that focuses on the extraction of the information encoded in images for its use in different applications. Classical computer vision extracts this information through the calculation of different image descriptors. The calculation of the image descriptors is conditioned by the system characteristics: image resolution, object shape, light conditions and application domain. The process that derives the image descriptors is called feature extraction.

Image descriptors are usually designed as hand-engineered filters, providing solutions that are rather rigid (application-specific) and reliable only under very restricted conditions. Unfortunately, autonomous navigation falls into a completely opposite scenario, requiring applications that can perform robustly under very dynamic circumstances.

The main advantage of deep learning is its flexibility to generalize to previously unseen data. Since the application of Convolutional Neural Networks (CNNs) [3] to image processing, deep learning has been the protagonist of countless computer vision conferences and research papers. CNNs allow the extraction of features in a more efficient and meaningful way than classical image descriptors based on image gradient calculations. Standing out due to their capacity to automate image descriptors, CNNs allow the creation of image processing applications with a high level of abstraction and accuracy.

Semantic image segmentation (figure 2) is just one of the many Deep Neural Network (DNN) applications. The goal of a semantic segmentation application is to detect and classify the objects in the image frame by applying pixel-level classification of an input image into a set of predefined categories. Semantic image segmentation provides a level of scene understanding much richer than any other detection algorithm, as it includes detailed information about the shape of the object and its orientation.
Semantic segmentation can be used in autonomous navigation to precisely define the road (or drivable space) and its conditions (erosion, presence of obstacles or debris). It is also very useful for navigation in crowded areas, being able to accurately calculate the gap between obstacles and even make predictions of the future position of an obstacle based on its shape and trajectory. Semantic image segmentation models can generally be divided into two parts: the feature extraction layers (hidden layers) and the output layers. The feature extraction layers use CNNs along with other techniques, such as pooling or skip connections, to obtain a low-level representation of the image, and the output layers create the necessary relations to draw out the pixel classification.

The scope of the project is restricted to the analysis of a hypothetical video feed coming from the frontal camera of an autonomous car. The purpose of this camera is to elaborate a frontal representation of the environment that can be used for navigation. Flying objects, dirt or sun glare are some of the external factors that can affect the correct performance of cameras. In order to guarantee the passengers' safety, the detection system of autonomous cars must stand out for the robustness and consistency of its results, and all these situations need to be considered. An additional observation is that, when applied to autonomous navigation, the segmentation should prioritize the detection of obstacles over driving space to ensure the avoidance of collisions.

Figure 2: Ideal result of a semantic image segmentation. In this figure, all the objects that conform the image are perfectly classified into the different colors that define each category: road, sidewalk, pedestrian, tree, building or traffic sign. This image is part of the Cityscapes ground-truth densely annotated dataset.
Cityscapes is a large-scale data set created as a tool to evaluate the performance of vision algorithms intended for semantic urban scene understanding and to help researchers exploit large volumes of annotated data. Image source: [4].
Semantic segmentation allows autonomous cars to obtain an accurate representation of the outside world. This representation is used to define the available navigation space and the presence of obstacles, which are necessary to calculate navigation trajectories.
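As a minimal numeric illustration of pixel-level classification (hypothetical array shapes and scores, not the interface of any particular model): given the per-class score map produced by a segmentation head, the predicted label map is simply the per-pixel argmax.

```python
import numpy as np

# Hypothetical per-pixel class scores for a 2x3 image and 3 classes
# (e.g. road, pedestrian, building), as a segmentation head would produce.
scores = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.9, 0.05, 0.05]],
])  # shape (H, W, C)

# Pixel-level classification: the most likely class per pixel.
label_map = scores.argmax(axis=-1)  # shape (H, W)
print(label_map)
```

Every pixel receives a semantic label, which is what distinguishes segmentation from global image classification or bounding-box detection.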
Figure 2 shows an example of a perfect semantic image segmentation; however, it is very difficult to obtain a segmentation with such a high level of detail. The deep learning model would require large amounts of high-resolution, finely annotated and varied data to allow the training optimization algorithm to reach the desired accuracy while not overfitting. In contrast, figure 3 shows a real example of an image processed using an out-of-the-box state-of-the-art semantic image segmentation model (DeepLabv3 [5]) trained on the Cityscapes data set [6].

Figure 3: Illustration of the segmentation level obtained by DeepLabv3 [5], the current state-of-the-art semantic image segmentation model, trained on the Cityscapes dataset [6]. This figure illustrates two different levels of segmentation imperfection. Left image: example of a totally missed classification of the cars in front of the camera, added to a noisy classification of the road and the walls. Right image: example of a partial segmentation of the pedestrians.

Figure 3 illustrates how the performance of a current state-of-the-art semantic image segmentation model differs from the ground-truth example shown in figure 2. Although a partial classification might be good enough for obstacle avoidance, in some cases the semantic image segmentation model completely misses the classification of an obstacle and can therefore cause an accident. For this reason, autonomous cars have numerous sensors that allow the detection of obstacles at different distance ranges, not relying on bare semantic image segmentation models as the main source of information.

Another effect that can be observed when applying semantic image segmentation models to the analysis of videos is temporal inconsistency.
Analyzing a video frame by frame produces a segmentation that is not consistent over time: small variations in the frame produce high variances in the segmentation. This study examines how to reduce the incorrect classifications produced by semantic image segmentation models by combining the information of neighbouring frames.
In an attempt to improve obstacle detection, this study can be broken down into the following research questions:

• Analysis of the state-of-the-art: what is the current state-of-the-art for semantic image segmentation?
• Temporal extension: how can semantic image segmentation models be extended to the analysis of sequences?
• Reducing missed classifications: what kind of mechanisms can be applied to reduce the number of false classifications?
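As a toy sketch of the kind of neighbouring-frame combination under study (an illustrative example only, not the method proposed in this paper): a per-pixel majority vote over the label maps of the last few frames suppresses labels that flicker in a single frame.

```python
import numpy as np

def majority_vote(label_maps):
    """Per-pixel majority vote over a sequence of (H, W) label maps.

    Each pixel keeps the class it was assigned most often over the window,
    which filters out classifications that appear or vanish for one frame."""
    stack = np.stack(label_maps)  # (T, H, W)
    counts = np.apply_along_axis(np.bincount, 0, stack,
                                 minlength=stack.max() + 1)  # (C, H, W)
    return counts.argmax(axis=0)  # (H, W)

# Three consecutive 2x2 label maps; the middle frame misses the object (class 1).
frames = [np.array([[1, 0], [0, 0]]),
          np.array([[0, 0], [0, 0]]),
          np.array([[1, 0], [0, 0]])]
print(majority_vote(frames))  # the flickering pixel is restored to class 1
```

This simple vote already illustrates why temporal context helps: a detection that is correct in most frames is no longer lost because of one bad frame.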
As previously stated, one of the main goals of this article is to extend semantic image segmentation models to the analysis of sequences (videos are sequences of images). The main difference between images and videos is that the latter consist of groups of images (frames) that are adjacent in time, indirectly encoding a temporal history. In order to exploit the sequential information present in videos, this section introduces the available tools capable of modeling sequences. In section 5, some of these techniques will be used with a semantic image segmentation model in an attempt to add the video temporal information to the segmentation.

Given a causal system, sequence modeling consists of elaborating a model that is able to reproduce the dynamic behavior present in the observed data. From probabilistic methods to neural networks, this section summarizes different procedures used to capture temporal dynamics.

The methods reviewed in this section can be divided into two different groups depending on the tools used for sequence modeling: conditional probability and deep learning architectures. The first group covers causal modeling using probability relations. The second introduces deep neural networks that have been specifically designed for modeling videos.
This section analyzes how to model the association between variables using probabilistic relations. Given two observed variables x_1 and x_2, the conditional probability of x_2 taking a value given that x_1 takes another value (two different events) is defined as [7]:

P(x_2 | x_1) = P(x_1, x_2) / P(x_1),  if P(x_1) > 0     (1)

where the numerator of equation 1 is the joint probability of both events happening at the same time, and the denominator is the marginal probability of event x_1. It is possible to extend this notation to cover a bigger set of events (or a sequence of events). For a set of variables X_t = (x_1, x_2, ..., x_t) (for t > 1), the probability of the variable x_t conditioned on the rest of the variables in X_t is:

P(x_t | {X_τ, τ ≠ t}) = P(x_1, ..., x_t) / P(x_1, ..., x_(t-1)) = P(X_t) / P({X_τ, τ ≠ t}),  if P({X_τ, τ ≠ t}) > 0     (2)

Besides expressing the relation between variables using conditional probability notation, it is also possible to use graphical models. A Probabilistic Graphical Model (PGM) is a graph that expresses the conditional dependence relations between random variables. Conditional probability notation in combination with probabilistic graphical models is commonly used in fields such as probability theory, statistics and machine learning. A possible graphical representation of equation 2 for t = 4 can be found in figure 4.

Figure 4: Possible probabilistic graphical model of equation 2 for t = 4. In this graph, x_4 depends on x_1, x_2 and x_3; x_2 depends on x_1; and x_1 and x_3 are independent.

There are two main approaches that can be followed when defining the probability model of a dynamic system: a generative approach or a discriminative approach.
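Equation 1 can be checked numerically on a small example (the joint distribution below is an invented toy table, chosen only for illustration):

```python
import numpy as np

# Toy joint distribution P(x1, x2) over two binary variables.
# Rows index x1, columns index x2; the entries sum to 1.
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_x1 = joint.sum(axis=1)       # marginal P(x1), obtained by summing out x2
cond = joint / p_x1[:, None]   # equation 1: P(x2 | x1) = P(x1, x2) / P(x1)

print(cond[0])                 # P(x2 | x1 = 0)
print(cond.sum(axis=1))        # each conditional row sums to 1
```

Note that the division by the marginal is exactly what renormalizes each row of the joint table into a valid conditional distribution.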
Although the final purpose of both approaches is the same, namely to sample data from a probability distribution, each approach is different.

Generative models focus on modeling how the data is generated, in other words, modeling the joint probability P(X_t), where X_t is the set of variables involved in the process. For example, in the analysis of videos, each variable in X_t can represent the value of a pixel over consecutive time steps, and t the frame index. A generative model is able to calculate the joint distribution P(X_t) of all the present variables. For the simple case of having 4 variables, modeling P(X_4) allows finding all the possible combinations of these variables: having observed the pixel at times 1, 2 and 3, it is possible to estimate x_4; or any other combination, such as computing x_2 given x_1, x_3 and x_4.

On the other hand, discriminative models only focus on modeling the conditional relation between variables (equation 2), paying no attention to how the data is generated. E.g., having observed x_1, x_2 and x_3 it is possible to estimate x_4, but no other combination of the variables can be computed.

By definition, generative models may appear to be more general and insightful about the present dynamic system than discriminative ones. However, discriminative models are often preferred for classification tasks, such as logistic regression [8]. The reason for this preference is that the generalization performance of generative models is often found to be poorer than that of discriminative models, due to differences between the model and the true distribution of the data [9].

The most common procedures to create probability models will be reviewed in the following order: Naive Bayes classifiers, Markov chains and Hidden Markov Models (HMM).
The Naive Bayes classifier is a generative approach because it models the joint probability P(X_t) and afterwards calculates the conditional probability by applying the Bayes rule. Starting from the definition of conditional probability (equation 2), it is possible to apply the product rule of probability to the numerator P(X_t):

P(X_t) = P({X_τ, τ ≠ t}, x_t) = P({X_τ, τ ≠ t} | x_t) · P(x_t)     (3)

and the sum rule to the denominator, to define P({X_τ, τ ≠ t}) as a marginal distribution of P(X_t):

P({X_τ, τ ≠ t}) = Σ_T P({X_τ, τ ≠ t}, x_t = T) = Σ_T P({X_τ, τ ≠ t} | x_t = T) · P(x_t = T)     (4)

where T ranges over all the possible states of x_t.

The Bayes rule is the result of applying these two properties to the definition of conditional probability (equation 2):

P(x_t | {X_τ, τ ≠ t}) = P(X_t) / P({X_τ, τ ≠ t}) = P(x_t) · P({X_τ, τ ≠ t} | x_t) / Σ_T P(x_t = T) · P({X_τ, τ ≠ t} | x_t = T)     (5)

The general form of the Bayes theorem says that the posterior probability of an event is proportional to the prior probability of that event times the likelihood of the observation conditioned on that event.
In other words, if the probability of a given set of variables P({X_τ, τ ≠ t}) is fixed, the posterior probability (equation 5) can be expressed as proportional to the numerator. Using the previous example that tracks the value of a pixel over 3 consecutive frames, the value of the pixel at time frame 4 is given by:

P(x_4 | X_3) ∝ P(x_4) · P(X_3 | x_4)     (6)

In a more general form, the conditional probability of a state x_t given a set of previous observations from x_1 to x_(t-1) is:

P(x_t | {X_τ, τ ≠ t}) ∝ P(x_t) · P({X_τ, τ ≠ t} | x_t)     (7)

The Naive Bayes assumption states that the features (the observations x_1, ..., x_(t-1)) are conditionally independent given the class label x_t [7]. Applying this assumption allows the second term of equation 7 to be factorized into t − 1 different terms:

P({X_τ, τ ≠ t} | x_t) = P(x_1 | x_t) · P(x_2 | x_t) · ... · P(x_(t-1) | x_t)

P(x_t | {X_τ, τ ≠ t}) ∝ P(x_t) · Π_(n=1)^(t-1) P(x_n | x_t)     (8)

Equation 8 defines a model that predicts the value of the state x_t for a set of observed states X_(t-1) = (x_1, x_2, ..., x_(t-1)). It is the final form of the Naive Bayes classifier, which, as a consequence of the Naive Bayes assumption, does not capture dependencies between the observed states in X_(t-1) (figure 5). Even though this conditional independence assumption might sound unrealistic for real-case scenarios, empirical results have shown good performance in multiple domains with attribute dependencies [10]. These positive findings can be explained by the loose relation between classification and probability estimation: 'correct classification can be achieved even when the probability estimates used contain large errors' [10].

Figure 5: Graphical representation of the final form of the Naive Bayes classifier.
It is based on the Naive Bayes assumption, which states that the observations (x_1, ..., x_(t-1)) are conditionally independent given the value of x_t.

Markov Models and Hidden Markov Models do not make any assumption about the independence of the variables and will be illustrated next.
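A minimal numeric sketch of equation 8, with invented toy probabilities for a binary pixel state observed over three frames:

```python
import numpy as np

# Toy Naive Bayes model for a binary state x_t in {0, 1}.
prior = np.array([0.5, 0.5])         # P(x_t)
# likelihood[n][v, c] = P(x_n = v | x_t = c): one invented table per observation.
likelihood = [np.array([[0.9, 0.2],  # observation 1
                        [0.1, 0.8]]),
              np.array([[0.8, 0.3],  # observation 2
                        [0.2, 0.7]]),
              np.array([[0.7, 0.4],  # observation 3
                        [0.3, 0.6]])]

def naive_bayes_posterior(observations):
    """Equation 8: posterior ∝ prior × product of per-observation likelihoods."""
    post = prior.copy()
    for table, obs in zip(likelihood, observations):
        post *= table[obs]           # multiply in P(x_n = obs | x_t)
    return post / post.sum()         # normalize to a distribution

print(naive_bayes_posterior([1, 1, 1]))  # observing 1 three times favors x_t = 1
```

Because of the independence assumption, each observed frame contributes one independent factor to the product, which is what keeps the model cheap to evaluate.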
Markov chains or Markov Models (MM) are stochastic generative models for dynamic systems that follow the Markov property. The Markov property states that the future state of a variable depends only on the current observation (there is only dependence between adjacent periods). In the framework of video processing, the Markov property can be interpreted as: the value of a pixel in the present depends only on its immediate past (figure 6).

Using probabilistic notation, the Markov property can be written as:

P(x_t | {X_τ, τ ≠ t}) = P(x_t | x_(t-1))     (9)

where {X_τ, τ ≠ t} contains all the previous states from x_1 to x_(t-1). The resulting joint probability of a Markov chain like the one in figure 6 is defined as:

P(X_t) = P(x_1) P(x_2 | x_1) P(x_3 | x_2) ... = P(x_1) Π_(t=2)^T P(x_t | x_(t-1))     (10)

In discrete MMs, the variables can only take certain values from a set of distinct possible states. For a set of N possible states, there is an N-by-N transition matrix that contains the probabilities of transitioning between states. Figure 6 shows an example of an MM with two possible states, x_1 and x_2; the transition probabilities that define this MM can be found in table 1.

Figure 6: Graphical representation of a Markov Model with two possible states: x_1 and x_2. The connections between states represent the possible paths that the current state can follow. The values that condition each path are usually contained in a transition table (table 1). The dark circle is the current state, transitioning from x_1 to x_2. Image adaptation from: [11].

        x_1            x_2
x_1   P(x_1 | x_1)   P(x_2 | x_1)
x_2   P(x_1 | x_2)   P(x_2 | x_2)

Table 1: Transition matrix of a Markov Model with two possible states (x_1 and x_2).

In a discrete stochastic process, the rows of a transition probability matrix have to sum up to one, which means that a state has a finite number of possible successor states.
The values inside the transition matrix can be: given; calculated by gathering samples from the process, doing a statistical analysis of the data and assuming that the process follows a certain distribution; or approximated using probability distribution approximation methods such as the Metropolis-Hastings algorithm, which assumes an initial probability distribution and, through several iterations, moves it closer to the real distribution [7].

The strong assumption made in equation 9 can be relaxed by adding dependence on more than one past state, transforming the MM into a k-th-order Markov chain. A second-order Markov chain is illustrated in figure 7.

Figure 7: Graphical representation of a second-order Markov chain. Image adaptation from: [7]

The corresponding joint probability of a second-order Markov chain follows the next equation:

P(X_t) = P(x_1, x_2) P(x_3 | x_1, x_2) P(x_4 | x_2, x_3) ... = P(x_1, x_2) Π_(t=3)^T P(x_t | x_(t-1), x_(t-2))     (11)

Equations 9 and 10 can be applied to processes where the state of the system can be directly observed. However, in many applications the state is not directly observable; such models are called Hidden Markov Models and will be defined next.
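Equation 10 can be sketched numerically with an invented two-state transition matrix (each row sums to one, as required of a discrete stochastic process):

```python
import numpy as np

# Invented transition matrix for a two-state Markov chain.
# T[i, j] = P(next state = j | current state = i); each row sums to 1.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
initial = np.array([0.5, 0.5])  # P(x_1)

def sequence_probability(states):
    """Equation 10: P(x_1) * product over t of P(x_t | x_{t-1})."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= T[prev, cur]
    return p

# Staying in state 0 is much more likely than flipping back and forth.
print(sequence_probability([0, 0, 0]))  # 0.5 * 0.9 * 0.9 = 0.405
print(sequence_probability([0, 1, 0]))  # 0.5 * 0.1 * 0.3 = 0.015
```

In a video interpretation, a "sticky" diagonal like this encodes the expectation that a pixel's class rarely changes between consecutive frames.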
Hidden Markov Models (HMM) also belong to the category of stochastic generative models. They differ from Markov chains in that the state variables Z_t = (z_1, z_2, ..., z_t) are no longer directly accessible, i.e. they are hidden variables, and only the variables X_t = (x_1, x_2, ..., x_t) are observable.

Figure 8: Graphical representation of a first-order hidden Markov model. Image adaptation from: [7]

Figure 8 shows the representation of a first-order HMM. Two equations are necessary to define this HMM: the relation between the observable variables X_t and the hidden process Z_t, and the relation between the hidden states themselves:

P(x_t | {X_τ, τ ≠ t}, Z_t) = P(x_t | z_t)
P(z_t | {Z_τ, τ ≠ t}) = P(z_t | z_(t-1))     (12)

resulting in the following joint distribution:

P(X_t, Z_t) = P(z_1) P(x_1 | z_1) Π_(t=2)^T P(x_t | z_t) P(z_t | z_(t-1))     (13)

The probabilities that relate the hidden states z_t (equation 12) are called transition probabilities, while the probabilities that associate the hidden states with the observable variables x_t (equation 12) are the emission probabilities. Both of them can be calculated in an analogous way to the transition probabilities of Markov chains (table 1).

Markov Models and Hidden Markov Models, although more general than the Naive Bayes classifier, are also limited by definition. Each state is defined to be affected only by a finite number k of previous states, and the effect of any state before t − k is assumed to be encoded in this period; this limitation is often described as a short-term memory problem. Trying to find patterns in the sequence to determine the k-gram dependencies beforehand can help to alleviate this issue [12].

This section has introduced different methods that are used to model sequential information from the point of view of probability theory.
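Equation 13 can be illustrated with invented transition and emission tables for a two-state HMM emitting two observable symbols:

```python
import numpy as np

# Invented HMM: 2 hidden states, 2 observable symbols.
initial = np.array([0.6, 0.4])       # P(z_1)
transition = np.array([[0.7, 0.3],   # P(z_t | z_{t-1})
                       [0.4, 0.6]])
emission = np.array([[0.9, 0.1],     # P(x_t | z_t)
                     [0.2, 0.8]])

def joint_probability(hidden, observed):
    """Equation 13: P(z_1) P(x_1|z_1) * prod_t P(x_t|z_t) P(z_t|z_{t-1})."""
    p = initial[hidden[0]] * emission[hidden[0], observed[0]]
    for t in range(1, len(hidden)):
        p *= transition[hidden[t - 1], hidden[t]] * emission[hidden[t], observed[t]]
    return p

# Probability that hidden path (0, 0) emits observations (0, 0).
print(joint_probability([0, 0], [0, 0]))  # 0.6 * 0.9 * 0.7 * 0.9 = 0.3402
```

Summing this joint over all hidden paths would give the likelihood of the observation sequence, which is what algorithms such as the forward algorithm compute efficiently.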
Considering videos as sequences of static images, the next section introduces different approaches that use deep learning architectures to add temporal context to the analysis of videos.
Seeking to solve the semantic segmentation inconsistency characteristic of evaluating video segmentation with individual image segmentation methods, [13] presented a method that combines nearby frames and gated operations for the estimation of a more precise present-time segmentation.

The Spatio-Temporal Transformer GRU (STGRU) in [13] is a network architecture that adopts multi-purpose learning methods with the final purpose of video segmentation. Figure 9 shows a scheme of the STGRU architecture. Inside the STGRU, FlowNet is in charge of calculating the optical flow for N consecutive frames. A warping function (φ) uses this optical flow to create a prediction of the posterior frame. Then, a GRU compares the discrepancies between the estimated frame (w_t) and the current frame evaluated by a baseline semantic segmentation model (x_t), keeping the areas with higher confidence while resetting the rest of the image.

The STGRU presented in [13] was evaluated both quantitatively and qualitatively (figure 10), exhibiting high performance compared with other segmentation methods.
Figure 9: Overview of the Spatio-Temporal Transformer Gated Recurrent Unit. Pairs of raw input images are used to calculate the optical flow of the image (FlowNet). This optical flow is then combined with the semantic segmentation of the previous frame, obtaining a prediction of the present segmentation (blue box). A segmentation map of the present frame is then passed together with the prediction to a GRU unit that combines them based on the sequence. Image source: [13]

Figure 10: Image adaptation from Nilsson et al. [13]. This image shows a comparison between the GRFP method, static semantic segmentation and the ground-truth segmentation. From left to right, GRFP achieves a segmentation improvement for the left car, the right wall and the left pole.
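The flow-based prediction step can be sketched with a toy nearest-neighbor warp (a simplification of the learned warping used in [13]; the flow field and label map here are invented):

```python
import numpy as np

def warp_labels(prev_labels, flow):
    """Warp the previous frame's label map forward along an optical-flow field.

    prev_labels: (H, W) integer label map of the previous frame.
    flow: (H, W, 2) per-pixel displacement (dy, dx) from previous to current frame.
    Nearest-neighbor version: each current pixel looks up the previous pixel
    it came from; out-of-bounds lookups are clamped to the image border.
    """
    h, w = prev_labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return prev_labels[src_y, src_x]

# A 3x3 map with one object pixel at (1, 1), moving one pixel to the right.
prev = np.zeros((3, 3), dtype=int)
prev[1, 1] = 1
flow = np.zeros((3, 3, 2))
flow[..., 1] = 1.0                 # dx = +1 everywhere
print(warp_labels(prev, flow))     # the object pixel is propagated to (1, 2)
```

The warped map plays the role of the prediction w_t, which a gating mechanism can then reconcile with the segmentation of the current frame.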
The current trend for semantic video segmentation models consists of combining multi-purpose neural networks (sequence modeling with feature extraction networks) into advanced models capable of efficiently performing this task. Some other video segmentation architectures include:

• Feature Space Optimization for Semantic Video Segmentation [14].
• Multiclass semantic video segmentation with object-level active inference [15].
• Efficient temporal consistency for streaming video scene analysis [16].
Semantic image segmentation is not the only computer vision application that can benefit from leveraging temporal context; tracking also uses temporal analysis tools to achieve better performance. The main reasons why temporal context is necessary for tracking are to guarantee the detection of the object even through occlusion and to reduce the number of identity switches (during multiple object tracking). These kinds of applications are very common in surveillance or sport events. In 2017, [17] combined image appearance information with other tracking methods (Simple Online Real-time Tracker [18]), based on a Kalman filter and the Hungarian algorithm, to obtain state-of-the-art detection at high rates (40 Hz).
This section provides a detailed description of the problem and the materials that will be used for the study.
Semantic segmentation is considered one of the hardest computer vision applications. It differs from image classification and object detection in how the classification is performed (figure 11). Image classification models classify the image globally: they assign a label to the whole image (e.g. a cat-or-dog image classifier). Object detection models look for patterns in the image and assign a bounding box to the region of the image that is most likely to match the target description (providing classification and location within the image). Semantic segmentation produces a pixel-level classification of an image; it describes each pixel of the image semantically, providing a more insightful description of how the image is composed than the other two methods.

Figure 11: Comparison of the three computer vision classification applications. From left to right: image classification (global classification), object detection (local classification) and semantic segmentation (pixel-level classification). Image source: [19]

Humans are very good at segmenting images, even without knowing what the objects are. This is the main reason why semantic image segmentation is necessary for autonomous navigation. Although other detection models are able to classify obstacles and locate them in space, they can only find the obstacles they have previously seen. E.g. an obstacle detector used to avoid pedestrians in autonomous cars will only be able to alert the vehicle in the presence of pedestrians (it was trained only to learn how the pedestrian category is modeled).
However, the obstacles that can be found in an unconstrained driving scenario may be objects of any kind and shape, and it is not feasible to create a data set that covers them all; this is the reason why semantic image segmentation is present in autonomous navigation. An ideal semantic image segmentation model will be able to define the boundaries of any object, even when these objects have not been previously 'seen' (figure 12). Apart from being a good obstacle detector, a perfect semantic image segmentation model has the ability to store these previously unseen objects, tag them and use them to re-train the network and improve the accuracy.

There are some requirements that need to be met when applying semantic segmentation to autonomous navigation. The vehicle receives the data as a stream of images (it does not have access to a video of the route beforehand) and it has to perform inference in real-time. The model should be very sensitive in the detection of obstacles; e.g. in an ambiguous situation where the segmentation of the road is not perfectly clear, due to imperfections or the presence of objects, the classification of obstacles must prevail.
The most common programming languages used for computer vision and deep learning applications are Python and C++. The former is preferred in research, while the latter is mainly used in commercial applications. Apart
Figure 12: Adverse situations that can be solved using semantic image segmentation. Object detection models can only detect objects that the model is familiar with. However, it is very difficult to create a dataset that includes all the possible types of obstacles, imperfections or debris that may appear on the road. Semantic image segmentation aims to achieve a perfect definition of the image even when the objects are unknown.

from the programming languages, there are different frameworks that provide the developer with the tools required to handle big amounts of data: Theano, PyTorch, TensorFlow and Keras are some of the frameworks compatible with both Python and C++. Although the programming language and deep learning framework affect the performance of the application, the final performance depends mainly on the implemented algorithm (the semantic image segmentation model). As a matter of preference, this study is developed using Python and TensorFlow.
Depending on its inner structure, each semantic image segmentation model obtains a different trade-off between segmentation accuracy and inference speed. Figure 13, although not up to date, shows some of the available models arranged by accuracy and inference speed on the Cityscapes dataset [6]. Cityscapes is a large-scale urban-scene dataset that contains high-resolution, fully annotated segmentation images for semantic segmentation applications (section 4.1.5).

Figure 13: Classification of different semantic image segmentation models according to inference speed and accuracy (mIoU) on the Cityscapes test set [6]. It can be observed how faster models (bottom-right corner) usually achieve a lower level of accuracy than slower ones (upper-left corner). Image source: [20].

In figure 13, the inference speed was measured by counting the number of frames that the segmentation model is able to process each second. The mIoU (mean Intersection over Union) measures the mean accuracy of all the frames processed at each speed. As a result, the upper-left corner contains the most accurate (but slower) models, while the less accurate (but faster) models are grouped in the bottom-right corner.

Later in the same year as the release of the study in figure 13, Chen et al. [5], in their paper Rethinking Atrous Convolution for Semantic Image Segmentation, introduced DeepLabv3, a new iteration of the DeepLab model series that became the state-of-the-art for semantic image segmentation on the Cityscapes test set (table 2).
In order to continue with state-of-the-art efficiency, DeepLabv3 [5] with weights pretrained on the Cityscapes dataset [6] is chosen as the baseline model for this study.
Method                          mIOU
DeepLabv2-CRF [21]              70.4
Deep Layer Cascade [22]         71.1
ML-CRNN [23]                    71.2
Adelaide_context [24]           71.6
FRRN [25]                       71.8
LRR-4x [26]                     71.8
RefineNet [27]                  73.6
FoveaNet [28]                   74.1
Ladder DenseNet [29]            74.3
PEARL [30]                      75.4
Global-Local-Refinement [31]    77.3
SAC_multiple [32]               78.1
SegModel [33]                   79.2
TuSimple_Coarse [34]            80.1
Netwarp [35]                    80.5
ResNet-38 [36]                  80.6
PSPNet [37]                     81.2
DeepLabv3 [5]

Table 2: Comparison of the performance of different semantic image segmentation models on the Cityscapes dataset [6]. Table adapted from [5].
DeepLabv3 [5], developed by Google, is the latest iteration of the DeepLab model series for semantic image segmentation (previous versions: DeepLabv1 [38] and DeepLabv2 [21]).

DeepLab is based on a fully convolutional network (FCN) architecture that employs atrous convolution with upsampled filters to extract dense feature maps and capture long range context [5]. [39] showed how powerful convolutional networks are at elaborating feature models and defined an FCN architecture that achieved state-of-the-art performance for semantic segmentation on the PASCAL VOC benchmark [40]. Another advantage of FCNs is that the architecture is independent of the input size: they can take input of arbitrary size and produce correspondingly-sized output [39]. In contrast, architectures that combine convolutional networks (for feature extraction) with fully-connected Conditional Random Fields (for classification) [21, 38] are designed for a fixed input size, as a result of the particular size necessary to pass through these classification layers.

One main limitation of solving semantic segmentation using Deep Convolutional Neural Networks (DCNNs) is the consecutive pooling operations or convolution striding often applied in DCNNs, which reduce the size of the feature map. These operations are necessary in order to increasingly learn new feature abstractions [41], but may impede dense prediction tasks, where detailed spatial information is desired. Chen et al. [5] suggest the use of atrous convolution as a substitute for the operations that reduce the size of the input (figure 14).

Atrous convolution is also known as dilated convolution. Apart from the kernel size, dilated convolutions are specified by the dilation rate, which establishes the gap between each of the kernel weights.
A dilation rate equal to one corresponds to a standard convolution, while a dilation rate equal to two means that the filter takes every second element (leaving a gap of size 1), and so on (figure 15). The gaps between the values of the filter weights are filled with zeros; the term 'trous' means holes in French. [21, 38, 42] show how effective the application of dilated convolution is in maintaining the context of the features.

A second limitation faced by semantic image segmentation models is that they have to detect objects at multiple scales. This is a problem when using regular sized filters (normally 3x3) because they can only 'see' regions of 9 pixels at a time, which makes it very difficult to capture the overall context of big objects. DeepLabv3 [5] employs Atrous Spatial Pyramid Pooling (ASPP) to overcome this issue: it consists of applying atrous convolution with different dilation rates over the same feature map and concatenating the results before passing them to the next layer (figure 16). This approach helps capture feature context at different ranges without the need to add more parameters to the architecture (larger filters) [21, 43, 44]. Figure 17 shows the final architecture of DeepLabv3. Blocks 1 to 3 contain a copy of the original last block in ResNet [5], which consists of six layers with 256 3x3 kernel convolution filters (stride=2), batch
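The mechanics of dilation can be illustrated with a small NumPy sketch (this is an illustration of the concept, not the DeepLabv3 implementation): a one-dimensional dilated convolution where rate=1 reduces to the standard convolution and a larger rate enlarges the receptive field without adding weights.

```python
import numpy as np

def dilated_conv1d(signal, kernel, rate=1):
    """1-D dilated ('atrous') convolution with 'valid' padding.

    With rate=1 this is a standard convolution; with rate=r the kernel
    taps are applied to every r-th input sample (gaps of r-1 in between).
    """
    k = len(kernel)
    span = (k - 1) * rate + 1               # receptive field of the dilated kernel
    out = []
    for start in range(len(signal) - span + 1):
        taps = signal[start:start + span:rate]   # every r-th sample
        out.append(float(np.dot(taps, kernel)))
    return np.array(out)

x = np.arange(8, dtype=float)               # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])               # 3-tap averaging-style kernel

standard = dilated_conv1d(x, w, rate=1)     # receptive field 3
dilated = dilated_conv1d(x, w, rate=2)      # receptive field 5, same 3 weights
```

Note how rate=2 grows the receptive field from 3 to 5 samples while keeping the same three weights, which is exactly the parameter-free context enlargement that ASPP exploits at several rates in parallel.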
Figure 14: This figure compares the effect of consecutive pooling or striding on the feature map. (a) Shows an example where the feature map in the last layers is condensed to a size 256 times smaller than the input image; this is harmful for semantic segmentation since detail information is decimated [5]. (b) Applies atrous convolution to preserve the output stride, obtaining equivalent levels of abstraction [5]. Image source: [5]

Figure 15: Atrous convolution of a filter with kernel size 3x3 at different dilation rates. The dilation rate determines the space between the different cells of the kernel. A rate=1 corresponds to the standard convolution; a rate=6 means that the weights of the kernel are applied every sixth element (gap of 5 units), and so on. Image source: [5].

normalization right after each convolution and skip connections every 2 layers [46]. Block 4 is equivalent to the first 3, but it applies atrous convolution with a dilation rate of 2 as a substitute for downsampling the image with convolutions of stride 2, maintaining the output stride at 16. The next block applies ASPP at different rates and global average pooling of the last feature map; all the results of this block are then concatenated and passed forward. The resulting features from all the branches are concatenated and passed through a 1x1 convolution before the final 1x1 convolution that generates the final logits [5]. The output consists of a HxWxC matrix, where H and W correspond to the height and width of the output image and C is the number of categories in the dataset. Every pixel is assigned a real number for each category that represents the likelihood (or logits) of that pixel belonging to that category; this is called the score map. The score map is then reduced by means of an argmax operation that determines the index of the category with the highest likelihood, obtaining the semantic segmentation map.
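The score-map-to-label-map reduction described above can be sketched in NumPy on a toy 2x2 image with three hypothetical categories:

```python
import numpy as np

# Toy score map: H=2, W=2, C=3 (per-pixel logits for 3 categories)
score_map = np.array([
    [[0.1, 2.0, -1.0], [3.0, 0.5, 0.0]],
    [[-0.5, 0.2, 1.5], [0.0, 0.0, 4.0]],
])

# argmax over the category axis reduces HxWxC -> HxW (the label map):
# each pixel keeps only the index of its highest-scoring category.
label_map = np.argmax(score_map, axis=-1)
# label_map == [[1, 0], [2, 2]]
```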
Cityscapes is the state-of-the-art dataset for urban scene understanding. It was created out of the lack of available datasets that adequately captured the complexity of real-world urban scenes. Despite the existence of generic datasets for visual scene understanding such as PASCAL VOC [40], the authors of Cityscapes claim that "serious progress in urban scene understanding may not be achievable through such generic datasets" [6], referring to the difficulty of creating a dataset that can cover any type of application.

Nonetheless, Cityscapes is not the only dataset of its kind. Other datasets such as CamVid [47], DUS [48] or KITTI [49] also gather semantic pixel-wise annotations for application in autonomous driving. Cityscapes is the largest and most diverse dataset of street scenes to date [6]: it contains 25000 images (figure 18), of which 5000 are densely annotated (pixel-level annotation) while the remaining 20000 are coarsely annotated (using bounding polygons, which offer a lower level of detail). Compared to the other datasets intended for autonomous driving, Cityscapes has the largest range of traffic participants (up to 90 different labels may appear in the same frame) [6] and the largest range of object distances, covering objects up to 249 meters away from the camera [6].
Figure 16: Graphical representation of ASPP. Atrous convolution with different dilation rates is applied on the same feature map; the result of each convolution is then concatenated and passed to the next layer. Spatial pyramid pooling is able to capture feature context at different ranges [21, 43, 44]. Image adaptation: [45]

Figure 17: DeepLabv3 architecture. The first 3 blocks are a replica of the last block of the original residual neural network [46]. The following blocks incorporate the use of atrous convolution and ASPP, which stops the output stride reduction at 16, diminishing the negative effects of consecutive pooling. Image source: [5]

All these characteristics make the Cityscapes dataset the most challenging urban-scene understanding benchmark to date: "algorithms need to take a larger range of scales and object sizes into account to score well in our benchmark" [6]. Yet there are some limitations that need to be considered when evaluating the performance of a model that has been trained using this dataset. Cityscapes only captures urban areas (inner-city) of cities primarily from Germany or neighbouring countries [6], which may result in lower performance when the model is applied to highways, rural areas or other countries (due to differences in architecture). The original images were taken during the spring and summer seasons and do not cover situations with adverse weather conditions and/or poor illumination. More information about the composition and a statistical analysis of this dataset can be found in [6].
After several test runs of the baseline model on image sequences, it was observed that the production of wrong classifications (completely or partially missed object classifications) is a transitory effect. When applied to a sequence of images, the baseline model is usually able to detect most of the objects, producing segmentations of different quality in each frame. Differences in the lighting conditions or noise in the image may be the cause of this variation from frame to frame. However, this is an effect that can be exploited in order to achieve a better segmentation.

The next conclusion came after analyzing a moving object over consecutive frames. The moving object was recorded using a regular camera (at 30 fps), and it was noticed how the displacement of the subject from frame to frame was very small, depending on its relative speed with respect to the motion of the camera and its distance from the camera. This small displacement produced by objects moving at relatively low-medium speeds (walking person, moving bike or moving car) motivates the idea that the segmentation of neighbouring frames can be combined in order to obtain a more accurate segmentation of the present frame. In the next sections, this concept is regarded as the Image Buffer approach; figure 26 shows an illustration of an Image Buffer of size 2: it holds 2 frames from the past and merges them with the segmentation of the present frame.
Figure 18: Different types of annotations present in the Cityscapes dataset [6]. The upper figure shows an example of a densely annotated image (richer in detail). The bottom figure shows an example of a coarsely annotated image (lower level of detail). Image source: [4].

Apart from a straightforward combination of the pixel-classification output of neighbouring time frames (Image Buffer, section 5.1), a second approach that computes a weighted combination of the pixel-classification logits (section 4.1.4) produced by the baseline model will be introduced next.

As explained in section 4.1.4, DeepLab assigns the final classification labels to each pixel via a previous calculation of a C-dimensional score array for each pixel, which together form a HxWxC score map for the input image (C is the number of possible categories, in the case of Cityscapes; H and W correspond to the height and width of the input image respectively). The C-dimensional array contains the likelihood of each pixel belonging to each one of the possible categories of the data set. The final pixel labels are assigned by reducing the C-dimensional array of each pixel into one value that represents the index of the maximum value of the array; this is done by means of an argmax operation.

The weighted combination of the classification scores as an approximated version of a conditional probability (section 3.1) is the second approach that will be tested in the following sections. This method is referred to as the Attention Module (section 6).

In order to stay faithful to the nominal conditions of the baseline model, it would be necessary to test it using images with the same characteristics as the Cityscapes data set [6], i.e. 2048x1024 pixel images. However, it was not possible to find video sources with the same resolution as the Cityscapes data set, and a 1920x1080 pixel resolution was adopted for the different tests.
Although the difference in the number of total pixels between both formats is less than 2 percent, using lower resolution images than the ones used for training the weights of the baseline model might have an effect on the final segmentation. The study of this issue has not been covered and is added as one of the limitations in the discussion section (section 8).

The performance of both of the suggested approaches listed before, Image Buffer (5.1) and Attention Module (6), as well as the baseline performance (4.1.3), will be evaluated both quantitatively and qualitatively. It is necessary to have a ground-truth label annotation data set to quantitatively measure the segmentation performance. However, ground-truth annotations for multi-label semantic video segmentation are very costly and no available data set that covers this need was found. The Densely Annotated VIdeo Segmentation (DAVIS) 2016 benchmark [50] was chosen as an approximation for this requirement. It is formed by 50 densely annotated single-object short sequences, of which only 10 are suitable for the evaluation of this exercise (due to compatibility with the Cityscapes preset categories). The DAVIS categories that will be used for this study are:
• Breakdance-flare. A single person moving rapidly in the middle of the screen.
• Bus. A bus centered in the picture frame in a dynamic environment.
• Car-shadow. A car moving out from a shadow area.
• Car-turn. A car moving towards and away from the camera.
• Hike. A person moving slowly in the center of the frame.
• Lucia. A person moving slowly in the center of the frame.
• Rollerblade. A person moving fast from left to right of the frame.
• Swing. A person swinging back and forth and being temporarily occluded in the middle of the screen.
• Tennis. A person moving fast from right to left in the frame.

Since the goal of this study is to improve the detection and the segmentation over time by reducing the number of missed classifications and maintaining a consistent segmentation, the metrics used for the evaluation cover the temporal consistency and the accuracy. A semantic image segmentation model that is consistent over time will produce a segmentation area with a smooth transition from frame to frame (depending on whether the tracked subject is moving or not). The segmentation area is calculated by counting the number of pixels classified with the target label at each time step. Afterwards, the frame-to-frame area variation is calculated as the difference of the area between consecutive frame pairs. A final computation of the standard deviation of these differences gives a global metric for the segmentation fluctuations (a lower number is expected for more temporally consistent methods) that will be used for comparison of the different approaches. The accuracy is calculated using the Intersection over Union (described in section 4.1.3).
In the same way as with the area, the frame-to-frame fluctuations of the accuracy are calculated as a comparison metric for all the approaches.

The qualitative evaluation consists of the observation and interpretation of the segmentation result of each one of the methods using different video sources. The videos used for the qualitative evaluation are:
• Citadel - University of Twente
• Carré (modified) - University of Twente. One out of every ten frames was removed from the original clip to simulate a temporary occlusion or the faulty behavior of the camera sensor.
• Driving in a tunnel. Video source: [51].
• Driving under the rain. Video source: [52].
• Driving in the night. Video source: [53].
• Driving in low sun. Video source: [54]
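The two metrics described above, the standard deviation of the frame-to-frame area differences and the per-frame IoU, can be sketched as follows; the label maps are hypothetical toy examples, not data from the study:

```python
import numpy as np

def segmentation_area(label_map, target_label):
    """Number of pixels classified with the target label."""
    return int((label_map == target_label).sum())

def area_fluctuation(label_maps, target_label):
    """Std. dev. of the frame-to-frame area differences.

    Lower values indicate a more temporally consistent segmentation.
    """
    areas = np.array([segmentation_area(m, target_label) for m in label_maps])
    return float(np.std(np.diff(areas)))

def iou(prediction, ground_truth, target_label):
    """Intersection over Union of one label for a single frame."""
    pred = prediction == target_label
    gt = ground_truth == target_label
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

# Hypothetical example: a perfectly stable sequence has zero fluctuation
stable = [np.full((4, 4), 1) for _ in range(5)]
fluct = area_fluctuation(stable, 1)          # 0.0

pred = np.array([[1, 1], [0, 0]])            # predicted label map
truth = np.array([[1, 0], [1, 0]])           # ground-truth label map
overlap = iou(pred, truth, 1)                # 1 shared pixel / 3 in the union
```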
Videos are a very powerful source of information. In contrast with the analysis of still pictures, videos provide objects with a temporal context, as a series of frames that can be exploited to benefit segmentation. In order to do so, two different approaches will be evaluated: Image Buffer and Attention Module (section 4.2). These two approaches build on top of DeepLabv3 [5] pre-trained on the Cityscapes data set [6], which is chosen as the baseline for this study due to its performance as the state-of-the-art semantic image segmentation model (section 4.1.3). The results will be evaluated both qualitatively and quantitatively on a series of videos chosen to cover a wide variety of scenarios (section 4.2.1). The metrics used for the comparison of the different approaches are chosen to cover both the accuracy and the temporal consistency of the predictions (section 4.2.1).

The machine learning framework and programming language are fixed together: Python and TensorFlow. Python allows for rapid script prototyping and debugging and comes with many libraries that make working with images and arrays very natural. TensorFlow includes large amounts of documentation online, a very active community, and is constantly updated with new utilities that offer new possibilities for the use of deep learning (section 4.1.2).
This section covers in detail both of the suggested approaches used to provide temporal context to semantic image segmentation: Image Buffer (section 5.1) and Attention Module (section 6).
Image Buffer is the first of the approaches studied to address the temporal inconsistency of semantic image segmentation models. It is the result of analyzing an object's location from frame to frame and the quality of the segmentation provided by DeepLabv3 (section 4.1.3) over consecutive frames.

The first premise follows from figure 21: the translation of an object recorded at 30 fps over small batches of consecutive frames is negligible.

The second premise follows from figure 26: the combination of the segmentation in neighbouring frames helps complete the segmentation and avoid temporary misclassifications. In figure 26, 3 consecutive segmented frames were overlapped in order to obtain an 'augmented segmentation'. The result of this experiment showed how the figure of a person was completed using the information of the 3 involved frames. The origins of this fault in segmentation could be of a different nature, from temporary physical occlusion, such as flying objects covering the camera lens, to sensor saturation due to sun glare, or information loss due to system performance problems.

The approach works as follows:
1. The present time frame is extracted from the video feed and passed through DeepLabv3, which computes the segmentation map.
2. The segmentation map is concatenated with the segmentation of the 3 previous time frames contained in the Segmentation Buffer (figure 19).
3. The segmentation of the 4 evaluated frames is combined, obtaining the augmented segmentation for the present time (figure 20).
4. Lastly, the Segmentation Buffer is updated by dropping the oldest frame and storing the last frame drawn from the original video.

To delimit this approach it is necessary to define the size of the segmentation buffer. This follows from the observation made in figure 21, which shows how the relative displacement over three consecutive frames is very small. For simplicity, the size of the image buffer is fixed to four: it covers the present frame and three frames from the past. This technique is only applied to a predefined subset of categories, the target categories. The reason for this discrimination is the concern towards the segmentation of the obstacles that can be found on the road. As a proof of concept, the target categories for the experiments are: 'person', 'rider', 'car' and 'bicycle'. It is expected that this approach alleviates the defect of partially or completely missed classifications over the target categories.
Nevertheless, the effect of this approach will only be apparent if the labels are missing for fewer frames than the size of the Image Buffer (if the labels are missing for longer than that, this approach is not able to recover them). Overlapping past information for the inference of the present segmentation imposes a temporal lower bound on the object detection; at the same time, it reduces the time inconsistency, alleviating the 'flickery' effect. Static elements, or elements that move slowly with respect to the image frame, are expected to benefit from this approach. The results of this application are presented in the next section.
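A minimal sketch of the Image Buffer loop (steps 1-4 above), assuming the segmentation maps have already been produced by the baseline model; the label ids for the target categories are hypothetical placeholders, and the masks are merged with a simple pixel-wise recovery rule:

```python
from collections import deque

import numpy as np

BUFFER_SIZE = 3                      # three past frames + the present frame
TARGET_LABELS = {11, 12, 13, 18}     # hypothetical ids for person/rider/car/bicycle

def augment_segmentation(seg_map, buffer):
    """Combine the present segmentation map with the buffered past ones.

    A pixel keeps its present label; pixels that carry no target label now
    but carried one in a buffered frame inherit that past label. The buffer
    is then updated by dropping the oldest map and storing the present one.
    """
    augmented = seg_map.copy()
    target_list = list(TARGET_LABELS)
    for past in buffer:
        recover = np.isin(past, target_list) & ~np.isin(augmented, target_list)
        augmented[recover] = past[recover]
    buffer.append(seg_map)           # deque(maxlen=...) drops the oldest map
    return augmented

buffer = deque(maxlen=BUFFER_SIZE)

# Toy 2x2 example: a 'car' pixel detected in one frame, missed in the next
frame_a = np.zeros((2, 2), int)
frame_a[0, 0] = 13                   # car detected here
frame_b = np.zeros((2, 2), int)      # detection missed here

out_a = augment_segmentation(frame_a, buffer)
out_b = augment_segmentation(frame_b, buffer)   # recovers the pixel from frame_a
```

The recovery rule is one possible reading of the combination step; it recovers missed target labels from the past without overwriting any label the present frame already assigned.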
Apart from a bare combination of the segmentation maps obtained at different time frames (section 5.1), there is also the possibility of making a more educated combination by modifying the probability map of each frame prior to the combination, resulting in an augmented segmentation. This approach is inspired by the transition probabilities in Markov Models, which model the probability of an event transitioning between different possible states (section 3.1.2). The intuition for this approach comes from the following line of thought: given that machine learning models are able to classify by calculating a confidence score (the logits) over a set of predefined labels, can those values be extended over time (video inference) and used to influence the classification of consecutive frames? In other words, is it possible to establish a causal relation between past and present frame pixel classifications?
Figure 19: Block diagram of the Image Buffer approach (section 5.1). This block diagram depicts how the segmentation maps of 4 consecutive frames are combined into one (augmented segmentation).

Figure 20: Conceptual representation of the Image Buffer approach. In this figure, the output is calculated as a combination of the segmentation in 3 consecutive past frames plus the segmentation of the present frame.

There are two problems that need to be addressed in order to create a temporal relation in the semantic image segmentation classification. First, the logit-map of each frame needs to be obtained and analyzed. Second, a relation between consecutive frames has to be defined.
In semantic segmentation, the logit-map is a HxWxC matrix that contains, for each pixel in the image (HxW), a vector of real numbers (the logits) that represents the likelihood of belonging to each one of the predefined categories (C). The logit-map is located one step before the final assignment of labels (figure 22) and is formed by real numbers without any apparent bounds.

In order to establish a relation between consecutive frames, it is necessary to transform the logits of each pixel to a common scale for comparison. This can be done by means of a softmax function (equation 14) that transfers the raw prediction values of the neural network (the logits) into probabilities (logit-map into probability-map):

S(C_i) = e^{C_i} / Σ_j e^{C_j}    (14)

where C_i refers to each element in the logits vector C. Once the probabilities have been determined, it is possible to establish a relation between consecutive frames.
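Equation 14, applied along the category axis of the logit-map, can be sketched in NumPy (a numerically stable variant that subtracts the per-pixel maximum before exponentiating, which leaves the result unchanged):

```python
import numpy as np

def softmax(logit_map):
    """Convert a HxWxC logit-map into a HxWxC probability-map (eq. 14)."""
    shifted = logit_map - logit_map.max(axis=-1, keepdims=True)  # stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[[1.0, 2.0, 3.0]]])   # one pixel, C=3 categories
probs = softmax(logits)
# The per-pixel probabilities sum to one and preserve the argmax of the logits
```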
Figure 21: This figure compares the translation of a moving object during 3 consecutive frames. Each of the two series shown in this figure (bike and truck) corresponds to 3 consecutive frames extracted from two videos recorded at 30 fps. The upper series (from left to right) shows a subject moving in a close plane horizontal to the camera. The lower series (from left to right) shows a truck moving in a far plane horizontal to the camera. The rightmost image of each series is a mask overlap of each target over the 3 consecutive frames (zoomed in); it represents the relative displacement over time of these objects for a video recorded at 30 fps. It is observed how the relative displacement is minimal for both examples. The bike series introduces how the combination of the history of an object with larger relative displacements may introduce a 'ghosting effect' into the resulting image. Apart from being dependent on the recording speed (fps) and the distance to the camera, the relative displacement is also dependent on the relative speed between camera and object (although a speed comparison is not elaborated, this was observed along the different tests for this thesis study).

Figure 22: Graphical interpretation of the DeepLabv3 [5] semantic image segmentation tool-chain. The convolutional layers produce a HxWxC matrix (the logit-map) that is reduced by means of an argmax function into a HxWx1 matrix with the final labels of each pixel (the label-map). Finally, an external function assigns a color value to each label.
The Attention Module models the classification probability of the present as a weighted sum of consecutive frames (equation 15). The number of frames in consideration for the sum is four (three past frames plus the present frame); this is a conservative choice based on our observations on the test videos.

p_aug = I_0 · p(t) + I_1 · p(t − 1) + I_2 · p(t − 2) + I_3 · p(t − 3)    (15)

p_aug is the resultant augmented logits (figure 23), I_n are the weights of frame (t − n), and the probabilities p(t − n) are extracted from the baseline segmentation model (section 4.1.4) after applying the softmax transformation.

Weights estimation. In theory, it should be possible to find the optimal weight values that shape equation 15, resulting in the probabilities with the smallest discrepancy with respect to the true segmentation. This is an optimization problem that requires the definition of a loss function and a data-set with ground-truth annotations. In classification tasks, the most common loss function is the cross-entropy function or negative log likelihood [55] (equation 16):

L = − Σ_{c=1..M} y_{o,c} · log(p_{o,c})    (16)
where M is the number of classes; y_{o,c} is a binary indicator that the class label c is the correct classification for the observation o; and p_{o,c} is the predicted probability of observation o being of class c [56]. As opposed to the more common quadratic loss function, the cross-entropy loss function does not suffer from slow learning due to the computation of small gradients [55]. The gradients computed on the cross-entropy loss function increase when the prediction moves further from the target value [55].

Besides the definition of a loss function, it is also necessary to have a ground-truth annotation data-set. However, as mentioned in section 4.2.1, no public multi-label semantic video segmentation data-set was found and developing one was not feasible due to the lack of resources. As an alternative, in order to find the 'best' weight values, equation 15 was tested using different parameter combinations on some of the videos listed in section 4.2.1. The values of the final weights can be found in table 3.

I_0  I_1  I_2  I_3  T

Table 3: Weight parameters for equation 15

As a result, using positive weights (bigger than 1) increases the sensitivity of the per-pixel classification, which, added to the combination of successive frames, results in a more consistent temporal segmentation (section 7.1). However, the increase in sensitivity provokes a higher number of false positive classifications in the form of noise. The number of false positive labels can be alleviated by setting a threshold that pushes the augmented logits to zero if a minimum value is not reached (figure 23). Similar to the Image Buffer, this approach is only applied to a subset of target categories as a proof that it can be used to leverage the classification of any category labels.

Figure 23: Block diagram of the Attention Module approach (section 6). This block diagram depicts how the classification probabilities of 4 consecutive frames are combined into one.
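A sketch of the weighted combination of equation 15 followed by the thresholding step; the weight and threshold values here are hypothetical placeholders (the tuned values belong to table 3), and the softmax follows equation 14:

```python
import numpy as np

# Hypothetical weights I_0..I_3 and threshold T; the values tuned in the
# study are those of table 3 (not reproduced here).
WEIGHTS = [2.0, 1.5, 1.0, 0.5]       # present frame first, then t-1, t-2, t-3
THRESHOLD = 0.3

def softmax(logit_map):
    shifted = logit_map - logit_map.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_module(logit_maps):
    """Weighted sum of the probability-maps of 4 consecutive frames (eq. 15).

    logit_maps: [present, t-1, t-2, t-3], each a HxWxC array of raw logits.
    Entries of the augmented map below THRESHOLD are pushed to zero to
    limit false positive classifications before the final argmax.
    """
    augmented = sum(w * softmax(m) for w, m in zip(WEIGHTS, logit_maps))
    augmented[augmented < THRESHOLD] = 0.0
    return np.argmax(augmented, axis=-1)     # final label map

# One-pixel example (H=W=1, C=2): the present frame is ambiguous while the
# past frames confidently voted for category 1, so the augmented map keeps it.
m_flat = np.array([[[0.0, 0.0]]])            # ambiguous present frame
m_strong = np.array([[[0.0, 10.0]]])         # confident past frames
labels = attention_module([m_flat, m_strong, m_strong, m_strong])
```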
The experiments are divided into two parts: a quantitative evaluation that measures the performance of each method using different metrics, and a qualitative evaluation that is done by interpretation of the segmentation results.
This section collects and analyzes the results for all the experiments listed in section 4.2.1. It also critically analyzes the followed methodology, highlighting the best-case scenarios and revealing the possible failure modes.
This section evaluates the performance of both suggested methods (Image Buffer and Attention Module), comparing them with the baseline segmentation (using DeepLabv3). Since the suggested methods are designed to reduce the number of false negatives by leveraging the temporal information encoded in image sequences (videos), this section measures and plots the segmentation performance using the Area over time (AOT) and the IOU over time (IOU-OT). In addition, the frame-to-frame variation of both metrics is calculated and plotted side by side, expecting to find smaller variations in the approaches that exploit past-time information.

In order to quantify and compare the performance of each method, it is necessary to have ground-truth annotations. However, it was not possible to find a data-set that covered this need, and it was not feasible to create a custom one
given the time-line. For this reason, the DAVIS data-set [50] is used as an approximation, with the limitation that it can only evaluate one label at a time. The list of categories covered for this objective can be found in section 4.2.1.

Figures 24 and 25 give an example of the segmentation evaluation for the 'tennis' category of the DAVIS data-set [50]. Figure 24 shows an obvious result on how leveraging neighbouring frames benefits the final segmentation output. In this first figure it can be observed how the baseline segmentation is surpassed by both of the suggested methods, with the Attention Module achieving a smoother shape. The main difference between the two suggested methods is that the Image Buffer directly uses the segmentation produced by the baseline model, while the Attention Module, by modifying the probability map of the segmentation, is able to produce new segmentation labels that might benefit the final segmentation. In figure 24 it can also be observed how both of the suggested methods increase the number of false positive classifications; although it was intended to limit this number, the study of its reduction is left as future work.

Figure 25 contains the graphs that measure the AOT, the IOU-OT and their variations for the 'tennis' category of the DAVIS data-set. The AOT is calculated to compare the differences in label production of each method. As expected, it can be observed how the area detection for both of the suggested methods is higher than the baseline detection and the ground-truth annotations. The reason for this is the temporal combination and the increase in false positive classifications. The IOU-OT indicates the accuracy of each method over time, with values closer to 1 expected for the best-case scenarios.
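The metrics used in these plots can be computed from per-frame binary masks. A minimal sketch (the function names are ours, not from section 4.2.1; the per-frame variation is taken as the first difference of the metric, and its standard deviation is the consistency score used in tables 4 and 5):

```python
import numpy as np

def area_over_time(masks):
    """AOT: pixel count of the predicted label in each frame."""
    return np.array([m.sum() for m in masks])

def iou_over_time(pred_masks, gt_masks):
    """IOU-OT: per-frame intersection-over-union of prediction vs.
    ground truth (defined as 1.0 when both masks are empty)."""
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union else 1.0)
    return np.array(ious)

def variation_std(metric):
    """Standard deviation of the frame-to-frame differences of a
    metric: lower values mean a more temporally consistent result."""
    return np.diff(metric).std()
```

A perfectly consistent segmentation would give `variation_std(iou_over_time(...)) == 0`, matching the expectation stated for the IOU-OT variation.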
The results show how both the Image Buffer and the Attention Module generally yield a lower accuracy than the baseline method, which can be explained by the increase in false positive classifications. Lastly, the variation of each metric is computed to assess the reliability of each method (a reliable application is expected to give consistent results over time). For a perfect segmentation, the AOT variation should be similar to that of the ground-truth annotations and the IOU-OT variation equal to zero.

The dispersion of the variation of each metric is calculated by computing the standard deviation of the sequence; tables 4 and 5 gather these results. The variation measures the frame-to-frame consistency of the prediction: a segmentation that fully captures the dynamics of the sequence (i.e. that is consistent over time) would have an AOT Variation standard deviation equal to the ground truth's and an IOU-OT Variation standard deviation equal to zero.

AOT Variation STD
Groundtruth    DeepLabv3 (Cityscapes)    Image Buffer    Attention Module
Breakdance-flare
Car-turn
Lucia
Rollerblade
Tennis
Table 4: AOT Variation STD for some of the DAVIS data-set [50] categories, evaluated using different semantic segmentation approaches (section 4.2.1). Although the three approaches under study are far from achieving a variation close to the ground truth, the consistency of both the Image Buffer and the Attention Module outperforms the bare DeepLabv3 implementation.

Although the results shown in this section are a simplification (single-label classification) of what might be present in a real driving scenario, they can serve as a guideline on how the combination of frames affects the final segmentation. The next section presents the results for a series of multi-label classification experiments.
In this subsection, the testing is extended to cover multi-label detection. Due to the lack of ground-truth annotations, these experiments are evaluated by the visual quality of their results.

Figure 27 shows the segmentation performance of the baseline model (DeepLabv3) and its extension with the Image Buffer approach (the detected area is contoured in red or blue) and the Attention Module approach. In particular, the extensions only leverage a set of predefined categories over the rest: person, rider, bicycle, car and bus. This is done as a proof of concept as well as due to hardware limitations, but it can be adapted to hold any other target categories.
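Restricting the extensions to the predefined categories amounts to boosting only those channels of the probability map. A minimal sketch, assuming Cityscapes-style train IDs for the listed classes; the helper name and gain value are illustrative, not taken from the authors' code:

```python
import numpy as np

# Cityscapes-style train IDs for the target categories (assumed mapping).
TARGETS = {"person": 11, "rider": 12, "car": 13, "bus": 15, "bicycle": 18}

def boost_targets(prob_map, gain=1.5, targets=TARGETS):
    """Multiply only the target-class channels of a (C, H, W)
    probability map, leaving all other classes untouched."""
    boosted = prob_map.copy()
    for idx in targets.values():
        boosted[idx] *= gain
    return boosted
```

Because only a handful of channels are touched, the same mechanism extends to any other set of target categories by editing the mapping.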
IOU-OT Variation STD (%)
DeepLabv3 (Cityscapes)    Image Buffer    Attention Module
Breakdance-flare
Car-shadow
Car-turn
Rollerblade
Table 5: IOU-OT Variation STD for some of the DAVIS data-set [50] categories, evaluated using different semantic segmentation approaches (section 4.2.1). A robust segmentation translates into a low STD, meaning that the accuracy does not fluctuate over the sequence. It can be observed how the lowest values are obtained by the two approaches that exploit temporal information, the Image Buffer and the Attention Module.
This video was taken on a normal day between lectures at the University of Twente. It was chosen to evaluate the performance of these methods in an environment rich in features, with different targets present at the same time. Figure 27 shows how both of the extensions help complete the partial segmentation of some targets.

The next section elaborates on the limitations of each method and draws conclusions based on the results.
The results show how the Image Buffer and Attention Module approaches can support the segmentation of a set of predefined target categories; this is due to the combination of information from neighbouring frames into the present segmentation.

Both of the extensions combine the segmentation of 3 past frames with the present frame. This is the result of the study performed and can only be applied to use-cases where the target object location does not differ considerably across 4 consecutive frames. In other words, it only applies to cases with a high ratio of recording speed to target relative movement (w.r.t. the camera). A low ratio will result in a 'ghosting effect' on the segmentation and a diffuse boundary definition of the targets.

The segmentation model also affects the final performance of both of the suggested extensions, since both base their results on an initial baseline segmentation that is subsequently modified. In the case of the Image Buffer this limitation is not very severe, and the final accuracy directly depends on the efficiency of the baseline model. On the other hand, the Attention Module modifies the probability map of each frame by multiplying it with different weights (positive, greater than 1); this increases the sensitivity of the detection of the target categories but in turn may produce false positive classifications.

Another limitation of the Attention Module approach is that its parameters were calculated heuristically, so the segmentation result might be sub-optimal. This was inevitable due to the lack of a sequential ground-truth data-set that could be used to find the optimal values of the weights and the threshold (table 3).

The data-set chosen for training the baseline model also affects the final performance of the segmentation. It determines the final accuracy of the segmentation and sets limitations on the resolution of the images and the covered scenarios.
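The Image Buffer behaviour described here (the present segmentation combined with the 3 previous ones) can be sketched as the union of the last few binary masks of a target class. A hypothetical sketch, not the authors' code; the class and buffer length are configurable:

```python
from collections import deque
import numpy as np

class ImageBuffer:
    """Keep the last n binary masks of a target class and return their
    union, so a label missed in the current frame survives as long as
    it was detected in a recent frame."""

    def __init__(self, n_frames=4):
        self.masks = deque(maxlen=n_frames)  # oldest mask drops out

    def update(self, mask):
        """Add the present frame's mask and return the fused mask."""
        self.masks.append(mask.astype(bool))
        return np.any(np.stack(self.masks), axis=0)
```

This also makes the 'ghosting effect' concrete: if the target moves a lot within the buffer window, the union smears its footprint over all recent positions.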
The Cityscapes data-set covers driving scenarios in urban areas, which might be the reason why the baseline segmentation offers a low performance in the figures shown, due to the angle of the camera. Besides that, Cityscapes gathered samples during the spring and summer seasons, so it includes neither adverse weather conditions nor poorly illuminated images.

Nevertheless, both of the extensions have proved that leveraging sequential information for real-time predictions is beneficial for the consistency of the results. When applied to autonomous navigation, creating an image segmentation training data-set that covers all possible scenarios is very difficult, and holding the information of past frames adds a protection layer against failure modes.
The evaluation of the baseline performance showed that the output segmentation over a sequence is irregular. This is a problem when the application relies on the frame segmentation as a main source of information. In order to attenuate these variations, both of the extension approaches (Image Buffer and Attention Module) combine the information of neighbouring frames to make a more educated segmentation of the present frame. Tables 4 and 5 show that these extensions successfully reduce the fluctuations in the segmentation over time.

This study also showed some cases in which the baseline model was not able to detect a target in the field of view. Our new frameworks proved to reduce the number of false negative classifications. The Image Buffer approach sets a temporal lower bound on the segmentation, providing a safety margin against sudden faulty modes. The Attention Module, thanks to its modifications of the probability map, is able to increase the sensitivity of the detection of a set of predefined targets while at the same time providing the advantages of the Image Buffer approach; this can be very useful for the detection of obstacles in autonomous navigation.

Both of the extensions (Image Buffer and Attention Module) require storing an array of dimensions
M x H x W x N, where M is the number of target categories, H and W are the dimensions of the source image, and N is the number of frames to be combined. This can be computationally very costly to evaluate in a real-time implementation and requires well-selected hardware.

We can also suggest using this methodology during the training phase of the segmentation model. The final user can create artificial annotations for the categories that appear to be more challenging for the baseline model and use them as ground-truth annotations for further training. This can be expected to improve the segmentation of those target categories and to remove the computational constraint imposed by having to store an array for the segmentation augmentation.

As future work, we believe that different ways of using the suggested frameworks could be studied. Instead of overlapping the segmentation results of the baseline model or computing a weighted sum over consecutive frames, a more elaborate relation based on probabilistic reasoning could be defined.

The combination of sequential information and the enhancement of the probability map results in an increase of false positive classifications in the form of segmentation noise. There are different ways to alleviate this problem while at the same time raising the overall accuracy of the segmentation. Some of them are:

• Fine-tuning the parameters that define the Attention Module, in such a way that values that do not reach a minimum after the probability augmentation are pushed to 0 and therefore will not be assigned the target label.

• Using different temporal modules with different sensitivity values. After the computation of the augmented probabilities, the results would be compared and only the classifications with the highest agreement would be output.

• Limiting the work-space of the temporal modules. The augmented segmentation can be applied only to a certain region of interest, such as just the driving-space.
This can alleviate the production of false positive classifications and increase the performance of the algorithm.

• Combining this method with a 3D map estimation. After the calculation of the augmented segmentation map, it could be projected onto a 3D map of the environment, assigning labels to the 3D objects and discarding the labels that do not project onto any plane (or project very far away).

Another extension of this study could be, instead of basing the augmentation only on past frames, to calculate an estimated prediction of nearby future states (t+1, t+2, ...) and add it into the combination. At the same time, a fully annotated sequential data-set could provide the optimal set of parameters that define the relations between adjacent frames.

On top of that, these extensions could be applied to create an artificial sequential data-set. The results of the baseline model with the temporal extension could be used to generate short annotated sequences. These sequences can then be used to train deep learning architectures that embed the temporal behaviour in their structure. Deep learning models such as convLSTM [57], GRFP [13] or the multiclass semantic video segmentation of [15] could be live-trained with these temporal extensions, creating a self-supervised segmentation pipeline.

It could also be interesting to study the performance of semantic image segmentation models and their temporal extension from the inference-speed point of view. Although this study was performed on the state-of-the-art semantic image segmentation model (highest in accuracy), other models designed to perform at much higher speed rates can influence the type of solution needed for the temporal analysis.
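The M x H x W x N storage requirement discussed above can be met with a preallocated ring buffer, so that no reallocation happens per frame. This is a sketch under our own naming, with mean fusion standing in as a placeholder for whichever combination rule is used:

```python
import numpy as np

class ProbBuffer:
    """Ring buffer holding the last N probability maps for M target
    categories on an H x W image (array of shape M x H x W x N)."""

    def __init__(self, m, h, w, n):
        self.buf = np.zeros((m, h, w, n), dtype=np.float32)
        self.i = 0        # next write slot
        self.filled = 0   # number of frames pushed so far, capped at n

    def push(self, prob):
        """Store one (M, H, W) probability map, overwriting the oldest."""
        self.buf[..., self.i] = prob
        self.i = (self.i + 1) % self.buf.shape[-1]
        self.filled = min(self.filled + 1, self.buf.shape[-1])

    def mean(self):
        """Average over the frames pushed so far (placeholder fusion)."""
        return self.buf.sum(axis=-1) / max(self.filled, 1)
```

At float32 precision the buffer costs 4*M*H*W*N bytes, which makes the trade-off between the number of target categories, image resolution and window length explicit when sizing real-time hardware.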
Finally, new applications where this method can be applied should be explored. One of the highlights of this study is that slow environments benefit from sequential information, which helps to smooth and to complete partial segmentation results (figure 27). Of course, applications with other types of sequential data can also be analyzed with the same suggested approaches. This indicates the potential impact of our new sequential frameworks in a large field of video and image analysis.
References

[1] Dyfed Loesche. How americans commute to work. 2017.
[2] Niall McCarthy. Fewer americans are driving to work. 2018.
[3] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, 1999.
[4] Cityscapes-Team. Cityscapes dataset. 2019.
[5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. ArXiv, abs/1706.05587, 2017.
[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[7] Kevin P. Murphy. Machine learning - a probabilistic perspective. In Adaptive computation and machine learning series, 2012.
[8] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In NIPS, 2001.
[9] Julia Lasserre, Christopher M. Bishop, José M. Bernardo, M. Jesús Bayarri, James O. Berger, A. Philip Dawid, David Heckerman, Arlette Miller Smith, and M. A. West. Generative or discriminative? getting the best of both worlds. 2007.
[10] Pedro M. Domingos and Michael J. Pazzani. Beyond independence: Conditions for the optimality of the simple bayesian classifier. In ICML, 1996.
[11] Victor Powell. Markov chains, 2014. http://setosa.io/ev/markov-chains/
[12] D. Conklin. Music generation from statistical models. In AISB Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, pages 30–35, 2003.
[13] David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. pages 6819–6828, 2016.
[14] Abhijit Kundu, Vibhav Vineet, and Vladlen Koltun. Feature space optimization for semantic video segmentation. pages 3168–3175, 2016.
[15] Buyu Liu and Xuming He. Multiclass semantic video segmentation with object-level active inference. pages 4286–4294, 2015.
[16] Ondrej Miksik, Daniel Munoz, J. Andrew Bagnell, and Martial Hebert. Efficient temporal consistency for streaming video scene analysis. pages 133–139, 2013.
[17] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. pages 3645–3649, 2017.
[18] Alex Bewley, ZongYuan Ge, Lionel Ott, Fabio Tozeto Ramos, and Ben Upcroft. Simple online and realtime tracking. pages 3464–3468, 2016.
[19] Ronny Restrepo. What is semantic segmentation? 2017. http://ronny.rest/tutorials/module/seg_01/segmentation_01_intro/
[20] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. ArXiv, abs/1704.08545, 2017.
[21] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:834–848, 2016.
[22] Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. pages 6459–6468, 2017.
[23] Heng Fan, Xue Mei, Danil V. Prokhorov, and Haibin Ling. Multi-level contextual rnns with attention model for scene labeling. IEEE Transactions on Intelligent Transportation Systems, 19:3475–3485, 2016.
[24] Guosheng Lin, Chunhua Shen, Ian D. Reid, and Anton van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. pages 3194–3203, 2015.
[25] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. pages 3309–3318, 2016.
[26] Golnaz Ghiasi and Charless C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, 2016.
[27] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. Refinenet: Multipath refinement networks with identity mappings for high-resolution semantic segmentation. 2016.
[28] Xin Li, Zequn Jie, Wei Wang, Changsong Liu, Jimei Yang, Xiaohui Shen, Zhe L. Lin, Qiang Chen, Shuicheng Yan, and Jiashi Feng. Foveanet: Perspective-aware urban scene parsing. pages 784–792, 2017.
[29] Josip Krapac, Ivan Kreso, and Sinisa Segvic. Ladder-style densenets for semantic segmentation of large natural images. pages 238–245, 2017.
[30] Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe L. Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, and Shuicheng Yan. Video scene parsing with predictive feature learning. pages 5581–5589, 2016.
[31] Rui Zhang, Sheng Tang, Min Lin, Jintao Li, and Shuicheng Yan. Global-residual and local-boundary refinement networks for rectifying scene parsing predictions. In IJCAI, 2017.
[32] Rui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng Yan. Scale-adaptive convolutions for scene parsing. pages 2050–2058, 2017.
[33] Falong Shen, Rui Gan, Shuicheng Yan, and Gang Zeng. Semantic segmentation via structured patch prediction, context crf and guidance crf. pages 5178–5186, 2017.
[34] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison W. Cottrell. Understanding convolution for semantic segmentation. pages 1451–1460, 2017.
[35] Raghudeep Gadde, Varun Jampani, and Peter V. Gehler. Semantic video cnns through representation warping. pages 4463–4472, 2017.
[36] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. ArXiv, abs/1611.10080, 2016.
[37] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. pages 6230–6239, 2016.
[38] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. CoRR, abs/1412.7062, 2014.
[39] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. pages 3431–3440, 2014.
[40] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88:303–338, 2009.
[41] Dominik Scherer, Andreas C. Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In ICANN, 2010.
[42] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.
[43] Kristen Grauman and Trevor Darrell. The pyramid match kernel: discriminative classification with sets of image features. Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, 2:1458–1465, 2005.
[44] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. 2:2169–2178, 2006.
[45] Isma Hadji and Richard Wildes. What do we understand about convolutional networks? 03 2018.
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. pages 770–778, 2015.
[47] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV (1), pages 44–57, 2008.
[48] Timo Scharwächter. Daimler urban segmentation.
[49] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013.
[50] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.
[51] Kroonenberg. A drive downtown amsterdam. 2019.
[52] TexasHighDef. A drive downtown amsterdam. 2018.
[53] Dash Cam Tours. Night driving on california freeway. no music. 2017.
[55] Michael A. Nielsen. Neural networks and deep learning. http://neuralnetworksanddeeplearning.com/chap3.html
[56] ML-cheatsheet. Loss functions. 2018. https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
[57] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. ArXiv, abs/1506.04214, 2015.
Figure 24: Comparison of the segmentation of 3 consecutive frames and their corresponding ground-truth annotation. It can be observed how both the Image Buffer and the Attention Module achieve a better segmentation of the 'person' category, although there is also an increase in the number of false positive classifications. Sequence source: DAVIS dataset [50], 'tennis' category.

Figure 25: Graph evaluation of the 'tennis' category (DAVIS dataset [50]) using the metrics defined in section 4.2.1. The upper graphs show how the AOT of both suggested approaches is higher than that of the baseline model, although they tend to have a lower variation. The lower graphs show how both of the suggested approaches tend to be less accurate than the baseline segmentation, which can be explained by the increased production of false positive classifications. In addition, the IOU variation for both the Image Buffer and the Attention Module is lower than for the baseline.
Figure 26: Conceptual figure showing how combining the segmentation of a particular target (a person in this case) across neighbouring frames can benefit the segmentation at the present time. From top to bottom: 3 consecutive frames are extracted from the original video using a sliding window that covers the present frame and 2 consecutive frames in the past (the extraction can be done for more than 3 frames); the baseline segmentation model (DeepLabv3 on Cityscapes) is applied to each of these frames, with the person label highlighted in a different color (red, green, blue) per frame to improve the visualization; the hypothetical augmented segmentation output is constructed by overlapping the segmentation of the 3 consecutive frames in the sliding window (bottom figure). It can be observed how the augmented segmentation at frame (t) contains parts of the segmentation of each of the examined frames. The results show how the holes in the segmentation of the man on the left and the woman in the middle at time (t) are 'filled' by the segmentation of frames at times (t-2) and (t-1).