A Deep Convolutional Neural Network for Background Subtraction
M. Babaee, D. Dinh, G. Rigoll
Institute for Human-Machine Communication, Technical University of Munich, Germany
Abstract
In this work, we present a novel background subtraction system that uses a deep Convolutional Neural Network (CNN) to perform the segmentation. With this approach, feature engineering and parameter tuning become unnecessary since the network parameters can be learned from data by training a single CNN that can handle various video scenes. Additionally, we propose a new approach to estimate the background model from video. For the training of the CNN, we employed randomly selected video frames (about 5%) and their ground truth segmentations taken from the Change Detection challenge 2014 (CDnet 2014). We also utilized spatial-median filtering as the post-processing of the network outputs. Our method is evaluated with different data-sets, and the network outperforms the existing algorithms with respect to the average ranking over different evaluation metrics. Furthermore, due to the network architecture, our CNN is capable of real-time processing.
Keywords:
Background subtraction, change detection, deep learning
1. Introduction
With the tremendous amount of available video data, it is important to maintain the efficiency of video-based applications by processing only relevant information. Most video files contain redundant information such as background scenery, which costs a huge amount of storage and computing resources. Hence, it is necessary to extract the meaningful information, e.g. vehicles or pedestrians, to deploy those resources more efficiently. Background subtraction is a binary classification task that assigns each pixel in a video sequence a label, for either belonging to the background or the foreground scene [25, 21, 1]. Background subtraction, which is also called change detection, is applied in many advanced video applications as a pre-processing step to remove redundant data, for instance in tracking or automated video surveillance [17]. In addition, for real-time applications, like tracking, the algorithm should be capable of processing the video frames in real time.
One simple example of a background subtraction method is the pixel-wise subtraction of a video frame from its corresponding background image. Pixels whose difference exceeds a certain threshold value are labeled as foreground pixels, otherwise as background pixels. Unfortunately, this strategy will yield poor segmentation due to the dynamic nature of the background, which is induced by noise or illumination changes. For example, due to lighting changes, it is common that even pixels belonging to the background scene have intensities very different from the corresponding pixels in the background image, and they will consequently be falsely classified as foreground pixels. Thus, sophisticated background subtraction algorithms that assure robust background subtraction under various conditions must be employed.
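For illustration only, the naive pixel-wise strategy described above can be written in a few lines; the grayscale reduction and the threshold value of 30 are arbitrary assumptions, not part of the method proposed in this paper.

```python
import numpy as np

def naive_subtraction(frame: np.ndarray, background: np.ndarray, threshold: float = 30.0) -> np.ndarray:
    """Label a pixel as foreground (1) if its absolute difference to the
    background image exceeds a global threshold, otherwise background (0).

    frame, background: HxW (grayscale) or HxWx3 (RGB) arrays of equal shape.
    """
    diff = np.abs(frame.astype(np.float32) - background.astype(np.float32))
    if diff.ndim == 3:                      # reduce RGB differences to one value per pixel
        diff = diff.max(axis=2)
    return (diff > threshold).astype(np.uint8)
```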
In the following sections, the difficulties in this area, our proposed solution for background subtraction, and our contributions will be illustrated. The main difficulties that complicate the background subtraction process are:

Illumination changes:
When scene lighting changes gradually (e.g. moving clouds in the sky) or instantly (e.g. when the light in a room is switched on), the background model usually has an illumination different from the current video frame and therefore yields false classifications.

Dynamic background:
The background scene is rarely static due to movement in the background (e.g. waves, swaying tree leaves), especially in outdoor scenes. As a consequence, parts of the background in the video frame do not overlap with the corresponding parts in the background image; hence, the pixel-wise correspondence between image and background no longer exists.
Camera jitter:
In some cases, instead of being static, the camera itself is frequently in movement due to physical influence. Similar to the dynamic background case, the pixel locations between the video and background frame do not overlap anymore. The difference in this case is that it also applies to non-moving background regions.

Camouflage:
Most background subtraction algorithms work with pixel or color intensities. When foreground objects and the background scene have a similar color, the foreground objects are more likely to be (falsely) labeled as background.
Night Videos:
As most pixels have a similar color in a night scene, recognition of foreground objects and their contours is difficult, especially when color information is the only feature used for segmentation.

Ghosts / intermittent object motion:
Foreground objects that are embedded into the background scene and start moving after background initialization are the so-called ghosts. The exposed scene, which was covered by the ghost, should be considered as background. In contrast, foreground objects that stop moving for a certain amount of time fall into the category of intermittent object motion. Whether the object should be labeled as foreground or background is strongly application dependent. As an example, in the automated video surveillance case, abandoned objects should be labeled as foreground.
Hard shadows:
Dark, moving shadows that do not fall under the illumination change category should not be labeled as foreground.
In this work, we follow the trend of deep learning and apply its concepts to background subtraction by proposing a CNN to perform this task. We justify this approach with the fact that background subtraction can be performed without temporal information, given a sufficiently good background image. With such a background image, the task itself breaks down into a comparison of an image-background pair. Hence, the input samples can be independent of each other, enabling a CNN to perform the task with only spatial information. The CNN is responsible for extracting the relevant features from a given image-background pair and performs segmentation by feeding the extracted features into a classifier. In order to get a more spatially consistent segmentation, post-processing of the network output is done by spatial-median filtering and/or a fully connected CRF framework. Due to the use of a CNN, no parameter tuning or descriptor engineering is needed.
To train the network, a large amount of labeled data is required. Fortunately, because background subtraction is performed by comparing an image with its background, it is not necessary to use images of a full scene for training. It is also possible to train the network with subsets of a scene, i.e. with patches of image-background pairs, since the procedure also holds for image patches. As a consequence, we can extract enough training samples from a limited amount of labeled data.
To the best of our knowledge, background subtraction algorithms that use a CNN are scene-specific to this day, i.e. a CNN can only perform satisfying background subtraction on a single scene (that it was trained on with scene-specific data) and also lacks the ability to perform the segmentation in real time. Our proposed approach yields a universal network that can handle various scenes without having to retrain it every time the scene changes. As a consequence, one can train the network with data from multiple scenes and hence increase the amount of training data. Also, by using the proposed network architecture, it is possible to process video frames in real time with conventional computing resources. Therefore, our approach can be considered for real-time applications.
The outline of this paper is as follows: In Section 2, early and recent algorithms for background subtraction are presented. In Section 3, we explain our proposed approach for background subtraction. Here, we first describe our proposed approach to estimate the background image and next we illustrate our CNN for background subtraction. In Section 4, we describe the experimental evaluation of the algorithm, including the used datasets, the evaluation metrics and the obtained results, followed by a detailed discussion and analysis. Finally, in Section 5, we conclude our work and provide some ideas for future work.

2. Background and Related Work

The majority of background subtraction algorithms are composed of several processing modules, which are explained in the following sections (see Figure 1).
Background Model:
The background model is essential for the background subtraction algorithm. In general, the background model is used as a reference to compare with the incoming video frames. Furthermore, the initialization of the background model plays an important role since video sequences are normally not completely free of foreground objects during the bootstrapping phase. As a consequence, the model gets corrupted by including foreground objects into the background model, which leads to false classifications.
Background Model Maintenance:
In reality, the background is never completely static but changes over time. There are many strategies to adapt the background model to these changes by using the latest video frames and/or previous segmentation results. Trade-offs must be found in the adaption rate, which regulates how fast the background model is updated. A high adaption rate leads to noisy segmentation due to the sensitivity to small or temporary changes. A slow adaption rate, however, yields an outdated background model and therefore false segmentation. Selective update adapts the background model only with pixels that were classified as background. In that case, deadlock situations can occur by not incorporating misclassified pixels into the background model, i.e. once a background pixel is falsely classified as foreground, it would never be used to update the background and would always be considered a foreground pixel. On the other hand, by using all pixels as in the blind update strategy, such deadlock situations can be prevented, but the background model will also be distorted since foreground pixels are incorporated into it.
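As a generic illustration of the two maintenance policies discussed above (not an algorithm from this paper), the sketch below updates a running-average background model either selectively or blindly; the update rate is an arbitrary assumption.

```python
import numpy as np

def update_background(background: np.ndarray, frame: np.ndarray,
                      fg_mask: np.ndarray, rate: float = 0.05,
                      selective: bool = True) -> np.ndarray:
    """Running-average background maintenance.

    selective=True : only pixels labeled background (fg_mask == 0) are updated,
                     which risks deadlocks on misclassified pixels.
    selective=False: blind update of all pixels, which avoids deadlocks but
                     lets foreground objects bleed into the model.
    """
    blended = (1.0 - rate) * background + rate * frame.astype(np.float32)
    if not selective:
        return blended
    keep = fg_mask.astype(bool)            # foreground pixels keep the old model
    if background.ndim == 3:
        keep = keep[..., None]
    return np.where(keep, background, blended)
```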
Feature extraction:
In order to compare the background image with the video frames, adequate features that represent the relevant information must be selected. Most algorithms use gray-scale or RGB intensities as features. In some cases, pixel intensities are combined with other hand-engineered features (e.g. [11] or [2]). Also, the choice of the feature region is important. One can extract the features over pixels, blocks or patterns. Pixel-wise features often yield noisy segmentation results since they do not encode local correlation, while block-wise or pattern-wise features tend to be insensitive to slight changes.
Segmentation:
With the help of a background model, the respective video frames can be processed. Background segmentation is performed by extracting the features from corresponding pixels or pixel regions of both frames and using a distance measure, e.g. the Euclidean distance, to measure the similarity between those pixels. After being compared with the similarity threshold, each pixel is either labeled as background or foreground.
The combination of those building blocks forms an overall background subtraction system. The robustness of the system is always dependent on and limited by the performance of each individual block, i.e. it cannot be expected to perform well if one module delivers poor performance.
Figure 1: Block diagram of basic background subtraction algorithms. The dashed line is optional and not present in some algorithms.
Background subtraction is a well-studied field, and there exists a vast number of algorithms for this purpose (see Figure 2). Since most of the top performing methods at present are based on early proposed algorithms, some of these are outlined first. Subsequently, a few of the current methods for background subtraction are introduced.

Figure 2: Categorization of the outlined background subtraction algorithms into codebook-based, probabilistic, sample-based, subspace-based and deep-learning-based approaches.

2.1. Early Approaches
Stauffer and Grimson [22] proposed a method that models the background scene with a Mixture of Gaussians (MoG), also called a Gaussian Mixture Model (GMM). It is assumed that each pixel in the background is drawn from a Probability Distribution Function (PDF) which is modeled by a GMM; pixels are also assumed to be independent of their neighboring pixels. Incoming pixels from video frames are labeled as background if there exists a Gaussian in the GMM whose mean lies within a certain distance of the pixel value. For learning the parameters that maximize the likelihood, the authors proposed an online method that approximates the Expectation Maximization (EM) algorithm [6].
Elgammal et al. [7] introduced a probabilistic non-parametric method to model the background. Again, it is assumed that each background pixel is drawn from a PDF. The PDF for each pixel is estimated with Kernel Density Estimation (KDE).
In contrast, Barnich et al. [1] propose a sample-based method for background modeling. The background model consists of pixel samples from past video frames; for each pixel location, a fixed number N of past pixel samples is stored.
A codebook-based method was introduced by Kim et al. [15], where the background model for each pixel position is modeled by a codebook. During the bootstrapping phase, incoming pixels build up codewords that consist of intensity, color and temporal features. After initialization, those codewords form a codebook for subsequent segmentation. The intensity and color of incoming pixels are compared with those of the codewords in the codebook. After calculating the distances between the pixels and the codewords and comparing them with the threshold value, either a foreground label (if no match was found) or a background label is assigned. In the latter case, the matching codeword is updated with the respective background pixel.
Oliver et al. [16] use subspace learning to compress the background, the so-called eigenbackground. From N images, the mean and the covariance matrix are calculated. After a PCA of the covariance matrix, a projection matrix is set up with the M eigenvectors corresponding to the M largest eigenvalues. Incoming images are compared with their projection onto the eigenvectors. After calculating the distances between the image and the projection and comparing them with the corresponding threshold value, foreground labels are assigned to pixels with large distances.
Based on [22], Wang et al. [26] use a GMM to model the background scene and, in addition, employ single Gaussians for foreground modeling. By computing flux tensors [4], which depict variations of optical flow within a local 3D spatio-temporal volume, blob motion is detected. By combining the different information from blob motion, foreground models and background models, moving and static foreground objects can be spotted. Also, by applying edge matching [8], static foreground objects can be classified as ghosts or intermittent motions.
Varadarajan et al. [25] introduce a region-based MoG for background subtraction to tackle the sensitivity to dynamic background. MoGs are used to model square regions within the frame; instead of pixel-wise feature modeling, the features are extracted from sub-blocks within the square regions.
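To make the per-pixel MoG classification rule of [22] concrete, a possible sketch is shown below; the component parameters, the weight cutoff and the 2.5-sigma matching bound are illustrative assumptions, and the online update of the mixture parameters is omitted.

```python
import numpy as np

def is_background_mog(pixel: float, means: np.ndarray, stds: np.ndarray,
                      weights: np.ndarray, w_min: float = 0.1,
                      n_sigma: float = 2.5) -> bool:
    """Pixel-wise MoG test in the spirit of Stauffer and Grimson [22].

    means, stds, weights: parameters of the K Gaussians modeling this pixel.
    A pixel is background if it matches (within n_sigma standard deviations)
    any component whose weight marks it as a background component.
    """
    matched = np.abs(pixel - means) <= n_sigma * stds
    return bool(np.any(matched & (weights >= w_min)))

# usage sketch for one pixel modeled by K = 3 Gaussians
means = np.array([90.0, 140.0, 200.0])
stds = np.array([8.0, 10.0, 15.0])
weights = np.array([0.6, 0.3, 0.1])
print(is_background_mog(95.0, means, stds, weights))   # True: close to the dominant mode
```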
With the same purpose of tackling dynamic background movement, Chen et al. [5] propose a method that applies MoG in a local region. Models for foreground and background are learned with pixel-wise MoG. To label a given pixel, an N × N region around the pixel is searched for the MoG that shows the highest probability for the center pixel. If a foreground model depicts the highest probability, the center pixel is labeled as foreground, and vice versa.
To cope with sudden illumination changes, Sajid and Cheung [17] create multiple background models with single Gaussians and employ different color representations. Input images are pixel-wise clustered into K groups with the K-means algorithm. For each group and each pixel location, a single Gaussian model is built by calculating the mean and variance of the cluster. For incoming pixels, the matching models are selected by taking the models among the K models that show the highest normalized cross-correlation to the pixels. The segmentation is done for each color channel of the RGB and YCbCr color representations, which yields 6 segmentation masks. By combining all available segmentation masks, the background segmentation is then performed.
Hofmann et al. [13] developed an algorithm that improved the method of Barnich et al. [1] by replacing the global threshold value R with an adaptive threshold R(x), which depends on the pixel location and a metric of the background model. The metric was named "background dynamics" by the authors. With the additional information from the background dynamics, the thresholds R(x) and the model update rate are adapted by a feedback loop. High variance in the background yields a larger threshold value and vice versa; therefore, the method can cope with dynamic background and highly structured scenes. Tiefenbacher et al. [23] improved this method by controlling the updates of the pixel-wise thresholds with a PID controller instead of the fixed value that was used before.
St-Charles et al. [21] also improved the method by using Local Binary Similarity Patterns (LBSP) [2] as additional features to the pixel intensities, together with slight changes in the update heuristic of the thresholds and the background model. They also proposed another method [20], which is similar to [21] but works with codewords, or background words as they call them in their publication. Pixel intensities, LBSP features and temporal features are combined into a background word. To perform segmentation, the LBSP features of incoming pixels together with their intensities are compared with the corresponding background words. After a two-stage comparison, 1) color and structure threshold, 2) temporal feature threshold, the pixel is either labeled as foreground or background depending on the threshold results. All thresholds are dynamically updated in a feedback loop using a background dynamics measure.
A novel approach for background subtraction with the use of a CNN was proposed by Braham and Droogenbroeck [3]. They used a fixed background model, which was generated from a temporal median operation over N video frames. Afterwards, a scene-specific CNN is trained with corresponding image patches from the background image, video frames and ground truth pixels, or alternatively, segmentation results coming from other background subtraction algorithms. After extracting a patch around a pixel, feeding the patch through the network and comparing it with the score threshold, the pixel is assigned either a background or a foreground label. However, due to the over-fitting that is caused by using highly redundant data for training, the network is scene-specific, i.e. it can only process a certain scenery and needs to be retrained for other video scenes (with scene-specific data).
Figure 3: Background subtraction system overview: The input frame along with the corresponding background image are processed patch-wise. After merging the individual patches into a single output frame, the output frame is post-processed, yielding the final output segmentation.
3. Approach
In this section, we introduce our proposed method, which consists of 1) a novel algorithm for background model (image) generation, 2) a novel CNN for background subtraction and 3) post-processing of the network output using a median filter. The complete system is illustrated in Figure 3. We use a background image to perform background subtraction on the incoming frames. Matching pairs of image patches from the background image and the video frames are extracted and fed into a CNN. After reassembling the patches into the complete output frame, it is post-processed, yielding the final segmentation of the respective video frame.
In the following sections, we introduce our algorithm to obtain the background images from the video frames. Furthermore, we illustrate our CNN architecture and the training procedure. At last, we discuss our employed post-processing strategies to improve the network output.
We propose a new approach to generate the background model, which is illustrated in Figure 5. Here we combine the segmentation mask of the SuBSENSE algorithm [21] and the output of the Flux Tensor algorithm [26], which allows the parameters used in the background model to be changed dynamically based on the motion changes in the video frames. The block diagram of the robust background model algorithm is given in Figure 5. The details of each block are explained in the following sections.

Figure 4: Temporal-median filtering over 150 frames of a video sequence. Since each background pixel is visible for at least 75 frames, the temporal-median filtering yields the background image without foreground objects.
Figure 5: Block diagram of Robust Background Model Algorithm
Perhaps the simplest way to get a background image from a video sequence is to perform pixel-wise temporal median filtering. However, using this method, the background model will quickly be corrupted if there are many moving objects in some video frames, and hence the pixel values of foreground objects will eventually negatively affect the quality of the background model. This requires us to distinguish foreground pixels from background pixels, and to use only the background pixel values when computing the temporal median for the background model. To this end, we use the SuBSENSE algorithm. This method relies on spatio-temporal binary features as well as color information to detect changes. This allows camouflaged foreground objects to be detected more easily while most illumination variations are ignored. Besides, instead of using manually set, frame-wide constants to dictate model sensitivity and adaptation speed, it uses pixel-level feedback loops to dynamically adjust the method's internal parameters without user intervention. The adjustments are based on continuous monitoring of model fidelity and local segmentation noise levels. The output of the SuBSENSE algorithm is a binary image which contains the classification information of the current video frame. The foreground objects are marked as white pixels, and black pixels represent the pixels belonging to the background model.

Figure 6: (a) Video input frame; (b) segmentation output of SuBSENSE
Based on the foreground mask image from SuBSENSE, we build a background pixel library, BL, for each pixel location in the frame. The idea is that we only store pixel values from the current frame in the background pixel library when they are classified by the SuBSENSE algorithm as background pixels. An illustration of the background pixel library is shown in Figure 7.

Figure 7: Background pixel library for one pixel location

Here, we only keep the last 90 background pixel values from the video sequence. After the library is full, the oldest background pixel value is replaced with the newest one. For this purpose, we have an indicator, bi, which is a pointer to the oldest background pixel value in the library. Each time a pixel value is classified by SuBSENSE as background, it is stored in the background library at the location where bi points to, and then bi is moved to the next position in the library. To generate the background image, we calculate the average value over a certain memory length of the pixel values in the background pixel library. The memory length is defined by the variable bm. The background pixel at location (x, y) and color channel k is defined by BK(x, y, k), whose value is calculated using the following equation:

BK(x, y, k) = \frac{1}{b_m} \sum_{i = b_i - b_m}^{b_i} BL(x, y, i, k), \qquad k \in \{R, G, B\}.   (1)

But if we use a fixed memory length bm over the entire video sequence, we will get either a blurry or an outdated background model if the camera is moving or if there are intermittent objects in the video sequence. These two cases are illustrated in Figure 8. In the first row, the camera is constantly rotating and the calculated background image, shown on the right-hand side, becomes very blurry if we use a fixed average length bm. In the second row, the car on the left side stops at the same location for 80 percent of the frames in the video sequence, so the car will naturally be classified as background by SuBSENSE. If the car then starts to move, the background image calculated with a fixed memory length will not be updated because the pixel values of the car are still stored in the background library.

Figure 8: Poor background image model when using a fixed average length. On the first row, the first 3 images (a-a'') are the video input frames (Nr. 80, 120, 160) from the continuousPan sequence of the CDnet data-set. The first 3 images on the second row (b-b'') are the video input frames (Nr. 1800, 1950, 2100) from the winterDriveaway sequence. The last images on the first row (a''') and on the second row (b''') are the default background models from SuBSENSE with a fixed memory length.
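One possible implementation of the background pixel library and the averaging of Eq. (1) is sketched below; the library size of 90 follows the text, while the class layout, the vectorized indexing and the handling of a partially filled library are our own assumptions.

```python
import numpy as np

LIB_SIZE = 90   # last 90 background samples are kept per pixel, as stated above

class BackgroundLibrary:
    """Circular per-pixel library BL with write pointer bi and running average BK."""

    def __init__(self, height: int, width: int):
        self.BL = np.zeros((LIB_SIZE, height, width, 3), dtype=np.float32)  # samples
        self.bi = np.zeros((height, width), dtype=np.int32)                 # write pointer
        self.count = np.zeros((height, width), dtype=np.int32)              # valid samples

    def update(self, frame: np.ndarray, bg_mask: np.ndarray) -> None:
        """Store frame pixels that SuBSENSE classified as background (bg_mask == True)."""
        ys, xs = np.nonzero(bg_mask)
        self.BL[self.bi[ys, xs], ys, xs] = frame[ys, xs]
        self.bi[ys, xs] = (self.bi[ys, xs] + 1) % LIB_SIZE
        self.count[ys, xs] = np.minimum(self.count[ys, xs] + 1, LIB_SIZE)

    def background_image(self, bm: int) -> np.ndarray:
        """Eq. (1): average the bm most recent samples of every pixel."""
        bm = max(1, min(bm, LIB_SIZE))
        # indices of the bm samples written last, counted backwards from bi
        offsets = (self.bi[None, ...] - 1 - np.arange(bm)[:, None, None]) % LIB_SIZE
        recent = np.take_along_axis(self.BL, offsets[..., None], axis=0)
        return recent.mean(axis=0).astype(np.uint8)
```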
In order to have an adaptive memory length based on the motion of the camera and of the objects in the video frames, we need a motion detector. A commonly used motion detector is the flux tensor method. Compared with standard motion detection algorithms, the advantage of using the flux tensor is that the motion information can be computed directly without expensive eigenvalue decompositions. The flux tensor represents the temporal variation of the optical flow field within the local 3D spatio-temporal volume [26]; in expanded matrix form, the flux tensor is written as

J_F = \int_\Omega \begin{pmatrix}
\left(\frac{\partial^2 I}{\partial x \partial t}\right)^2 &
\frac{\partial^2 I}{\partial x \partial t}\frac{\partial^2 I}{\partial y \partial t} &
\frac{\partial^2 I}{\partial x \partial t}\frac{\partial^2 I}{\partial t^2} \\
\frac{\partial^2 I}{\partial y \partial t}\frac{\partial^2 I}{\partial x \partial t} &
\left(\frac{\partial^2 I}{\partial y \partial t}\right)^2 &
\frac{\partial^2 I}{\partial y \partial t}\frac{\partial^2 I}{\partial t^2} \\
\frac{\partial^2 I}{\partial t^2}\frac{\partial^2 I}{\partial x \partial t} &
\frac{\partial^2 I}{\partial t^2}\frac{\partial^2 I}{\partial y \partial t} &
\left(\frac{\partial^2 I}{\partial t^2}\right)^2
\end{pmatrix} dy.   (2)

The elements of the flux tensor incorporate information about temporal gradient changes, which leads to efficient discrimination between stationary and moving image features. The trace of the flux tensor matrix, which can be compactly written and computed as

\mathrm{trace}(J_F) = \int_\Omega \left\| \frac{\partial}{\partial t} \nabla I \right\|^2 dy,   (3)

can be used directly to classify moving and non-moving regions without eigenvalue decompositions. In our approach, the output of the Flux Tensor algorithm is a binary image which contains the motion information of the video frames. An example is shown in Figure 9. A white pixel in the binary image indicates that the pixel at this location is moving either temporally or spatially.

Figure 9: (a) input video; (b) the output of Flux Tensor
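Read discretely, Eq. (3) is the locally integrated squared temporal derivative of the spatio-temporal gradient. The sketch below approximates it with three consecutive grayscale frames, central differences and a box filter as the integration region Ω; the window size and the motion threshold are illustrative assumptions rather than values from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def flux_tensor_trace(prev: np.ndarray, curr: np.ndarray, nxt: np.ndarray,
                      window: int = 7) -> np.ndarray:
    """Discrete trace of the flux tensor, Eq. (3), for the middle frame.

    prev, curr, nxt: consecutive grayscale frames as float arrays.
    Returns the per-pixel trace integrated over a window x window neighborhood.
    """
    def grads(img):
        gy, gx = np.gradient(img)          # spatial derivatives I_y, I_x
        return gx, gy

    gx_p, gy_p = grads(prev)
    gx_n, gy_n = grads(nxt)
    ixt = 0.5 * (gx_n - gx_p)              # d/dt of I_x (central difference)
    iyt = 0.5 * (gy_n - gy_p)              # d/dt of I_y
    itt = nxt - 2.0 * curr + prev          # second temporal derivative of I
    energy = ixt ** 2 + iyt ** 2 + itt ** 2
    return uniform_filter(energy, size=window)   # integration over Omega

def motion_mask(prev, curr, nxt, threshold: float = 1.0) -> np.ndarray:
    """Binary motion image: white (1) where the trace exceeds the threshold."""
    return (flux_tensor_trace(prev, curr, nxt) > threshold).astype(np.uint8)
```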
Next, we define a new variable F_s as

F_s = \frac{N_w}{W \times H},   (4)

where N_w represents the number of white pixels in the flux tensor output, while W and H represent the width and height of the image, respectively. The variable F_s expresses what fraction of the pixels in the current video frame is moving. A large F_s means that either the camera is moving or there is a large object in the frame which starts to move; in this case, we need to decrease the memory length b_m. If F_s is relatively small, the background is steady and we can use a large memory length to suppress noise. Using F_s, we dynamically increase or decrease the value of b_m: it is set to a small constant b_{\min} when F_s exceeds an upper threshold \tau_{high}, to a large constant b_{\max} when F_s falls below a lower threshold \tau_{low}, and is interpolated linearly in between:

b_m = \begin{cases} b_{\min} & \text{if } F_s \geq \tau_{high} \\ b_{\min} + \frac{F_s - \tau_{high}}{\tau_{low} - \tau_{high}} \, (b_{\max} - b_{\min}) & \text{if } \tau_{low} < F_s < \tau_{high} \\ b_{\max} & \text{if } F_s \leq \tau_{low} \end{cases}   (5)

In order to avoid noise in F_s, and the resulting noise in b_m, a low-pass filter is applied to the value of F_s:

F_s(t) = \alpha \cdot F_s(t) + (1 - \alpha) \cdot F_s(t - 1),   (6)

where

\alpha = \begin{cases} \alpha_1 & \text{if } F_s(t) < F_s(t - 1) \\ \alpha_2 & \text{if } F_s(t) > F_s(t - 1) \end{cases}   (7)

Here, F_s(t) denotes the value of F_s at time t. Note that the value of \alpha depends on whether F_s is increasing or decreasing: after a dramatic decrease of F_s, we want F_s to increase more slowly again, in order to let the background library be updated with new background pixel values.
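The adaptive memory length can then be driven by the flux tensor output as sketched below; the numeric thresholds, memory-length limits and smoothing factors are placeholder assumptions rather than the constants used in Eqs. (5)-(7).

```python
import numpy as np

# Placeholder constants: illustrative assumptions, not the paper's values.
TAU_LOW, TAU_HIGH = 0.02, 0.2        # bounds on the moving-pixel fraction F_s
BM_MIN, BM_MAX = 10, 90              # memory-length limits (samples)
ALPHA_1, ALPHA_2 = 0.05, 0.5         # smoothing factors for falling / rising F_s

def moving_fraction(flux_mask: np.ndarray) -> float:
    """Eq. (4): fraction of pixels marked as moving by the flux tensor."""
    return float(flux_mask.sum()) / flux_mask.size

def smooth_fs(fs_now: float, fs_prev: float) -> float:
    """Eqs. (6)-(7): asymmetric low-pass filtering of F_s."""
    alpha = ALPHA_1 if fs_now < fs_prev else ALPHA_2
    return alpha * fs_now + (1.0 - alpha) * fs_prev

def memory_length(fs: float) -> int:
    """Eq. (5): map the smoothed F_s to an adaptive memory length b_m."""
    if fs >= TAU_HIGH:
        return BM_MIN
    if fs <= TAU_LOW:
        return BM_MAX
    frac = (fs - TAU_HIGH) / (TAU_LOW - TAU_HIGH)      # 0 at TAU_HIGH, 1 at TAU_LOW
    return int(round(BM_MIN + frac * (BM_MAX - BM_MIN)))
```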
Due to the low quality of some surveillance cameras, there are undesired pixel values around moving objects in the form of semi-transparency and motion blur, which are illustrated in Figure 10.

Figure 10: Semi-transparency and motion blur around moving objects

In these two cases, the pixels near the segmentation mask of SuBSENSE will probably be false negatives. This means that pixels near the segmentation mask may be classified as background although they are actually corrupted by foreground moving pixels through semi-transparency and motion blur. In this case, we should not add these background pixels to the background library. For this purpose, we perform a padding around the foreground mask with a size defined by the variable P_w, which means that if a pixel value is classified as background but the pixels around it within the radius of the patch size P_s are classified as foreground, then this background pixel is disregarded. P_s should also be dynamically adjusted using the output of the flux tensor. For instance, if F_s is large, we need to increase P_s, because with more moving pixels the phenomena of semi-transparency and motion blur also increase. Figure 11 shows the comparison between the background model from SuBSENSE and the robust background model obtained by the proposed approach.
We train the proposed CNN with background images obtained by the SuBSENSE algorithm [21]. Both networks are trained with pairs of RGB image patches from video and background frames and the respective ground truth segmentation patches. Before introducing the network architecture, we outline the data preparation step for the network training. Afterwards, we illustrate the architecture of our CNN and explain the training procedure in detail.
For the training of the CNN, we use random data samples (around 5 percent) from the CDnet 2014 data-set [27], which contains various challenging video scenes and their ground truth segmentations. We prepare one set of background images that is obtained by the proposed algorithm.
Figure 11: Comparison of background models obtained by SuBSENSE and the proposed approach. Each row represents the results of one sample. The first column shows the input video, the second column shows the background model from SuBSENSE, and the third column shows the background image from the proposed approach.
Since we want the network to learn representative features, we only utilize video scenes that do not challenge the background model, i.e. the background scene should not change significantly within a video. Therefore, we exclude all videos from certain categories for training (see Table 1). We work with RGB images for background and video frames. Before we extract the patches, all employed frames are resized to the dimension 240 × 320 and the RGB intensities are rescaled to [0, 1].
The training data consist of triplets of matching patches from video, ground truth and background frames of size 37 × 37 that are extracted with a stride of 10 from the employed training frames. An example mini-batch of training patches is shown in Figure 12. As widely recommended, we perform mean subtraction on the image patches before training, but we discard the division by the standard deviation, since we are working with RGB data and therefore each channel has the same scale.
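A possible implementation of this patch extraction and mean subtraction step is sketched below; stacking the frame and background patches into a 6-channel input, and the exact loading/resizing pipeline, are assumptions on our part.

```python
import numpy as np

PATCH, STRIDE = 37, 10   # patch size and stride, as described in the text

def extract_patch_triplets(frame: np.ndarray, background: np.ndarray,
                           gt: np.ndarray, mean_rgb: np.ndarray):
    """Cut matching 37x37 patches from a 240x320 frame/background/ground-truth triple.

    frame, background: float arrays in [0, 1], shape (240, 320, 3).
    gt: ground truth segmentation, shape (240, 320).
    mean_rgb: per-channel mean computed over the training set (mean subtraction).
    Returns stacked network inputs and the matching ground truth patches.
    """
    inputs, targets = [], []
    h, w = gt.shape
    for y in range(0, h - PATCH + 1, STRIDE):
        for x in range(0, w - PATCH + 1, STRIDE):
            f = frame[y:y + PATCH, x:x + PATCH] - mean_rgb
            b = background[y:y + PATCH, x:x + PATCH] - mean_rgb
            inputs.append(np.concatenate([f, b], axis=2))   # 6-channel network input
            targets.append(gt[y:y + PATCH, x:x + PATCH])
    return np.stack(inputs), np.stack(targets)
```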
The architecture of the proposed CNN is illustrated in Figure 13. The network contains 3 convolutional layers and a 2-layer Multi-Layer Perceptron (MLP). We use the Rectified Linear Unit (ReLU) as activation function after each convolutional layer and the Sigmoid function after the last fully connected layer. In addition, we insert batch normalization layers [14] before each activation layer. A batch normalization layer stores the running average of the mean and standard deviation of its inputs. The stored mean is subtracted from each input of the layer, and division by the standard deviation is performed. It has been shown that by applying batch normalization layers, over-fitting is decreased and higher learning rates for training are achieved.
CDnet 2014 [27] categories:
badWeather X
baseline X
cameraJitter X
dynamicBackground X
intermittentObjectMotion
lowFramerate
nightVideos X
PTZ
shadow X
thermal X
turbulence X
Table 1: Categories of CDnet 2014 [27]: Crosses indicate categories, including all their video sequences, that were considered for network training.

We train the networks with mini-batches of size 150 via RMSProp [12] and a fixed learning rate α. For the loss function, we choose the Binary Cross Entropy (BCE), which is defined as follows:

BCE(x, y) = -\left( x \cdot \log y + (1 - x) \cdot \log(1 - y) \right).   (8)

The BCE is calculated between the network outputs and the corresponding vectorized ground truth patches of size 37 × 37 = 1369.
The spatial-median filtering, which is a commonly used post-processing method in background subtraction, returns the median over a neighborhood of given size (the kernel size) for each pixel in an image. As a consequence, the operation removes outliers in the segmentation map and also performs blob smoothing (see Figure 14). After applying the spatial-median filter on the network output, we globally threshold the values for each pixel in order to map each pixel to {0, 1}. The threshold function is given by

g(z; R) = \begin{cases} 1 & \text{if } z > R \\ 0 & \text{otherwise} \end{cases}   (9)

where R is the threshold level.
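A compact PyTorch sketch of a network matching this description is given below. The 6-channel input (frame and background stacked), the filter counts (24, 48, 96) and the concrete learning rate are assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class BGSubtractionCNN(nn.Module):
    """3 conv layers (5x5, padding 3, batch norm, ReLU, 2x2 max pooling)
    followed by a 2-layer MLP that outputs a 37*37 = 1369 segmentation patch."""

    def __init__(self):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=5, padding=3),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2))
        self.features = nn.Sequential(block(6, 24), block(24, 48), block(48, 96))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(96 * 6 * 6, 2048),   # spatial size: 37 -> 19 -> 10 -> 6
            nn.BatchNorm1d(2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, 37 * 37),
            nn.Sigmoid())

    def forward(self, x):                  # x: (N, 6, 37, 37), frame + background stacked
        return self.classifier(self.features(x))

# one training step with BCE loss and RMSprop, mirroring the setup described above
model = BGSubtractionCNN()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)   # learning rate: assumption
criterion = nn.BCELoss()

patches = torch.rand(150, 6, 37, 37)       # dummy mini-batch of size 150
labels = torch.randint(0, 2, (150, 37 * 37)).float()
optimizer.zero_grad()
loss = criterion(model(patches), labels)
loss.backward()
optimizer.step()
```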
4. Experiments
In order to evaluate our approach, we conducted several experiments on various data-sets. At first, we introduce the utilized data-sets for performance testing. Afterwards, the evaluation metrics are presented, which measure the quality of the segmentation outputs. The results on the evaluation data are subsequently reported. Furthermore, we analyze the network behavior during training. Additionally, we visualize the convolutional filters and generated feature maps at the end.
Figure 12: Visualization of a mini-batch of size 48: (a) Background patches, generated with temporal-median filtering. (b) Corresponding input patches from video frames. (c) Ground truth patches; black pixels are background pixels, gray pixels are foreground pixels and white pixels are not of interest, i.e. they are not incorporated into the loss.
We employ multiple data-sets to perform our tests: the CDnet 2014 [27], the Wallflower [24] and the PETS 2009 [9] data-sets. The CDnet 2014 and the Wallflower data-set were specifically designed for the background subtraction task. These data-sets contain video sequences from different categories, which correspond to the main challenges in background subtraction, and hand-segmented ground truth images are also provided. The PETS 2009 data-set [9] was designed for other purposes, such as the evaluation of tracking algorithms, and therefore no ground truth segmentations are available for its video sequences. The employed data-sets are described in the following.
The CDnet 2014 [27], which was used for training our CNNs, is also used for performance evaluation. With additional video sequences under new challenge categories, it is an extension of the CDnet 2012 [10] data-set, its predecessor. For each video sequence in the data-set, corresponding ground truth segmentation images are available. For the newly added categories in CDnet 2014 [27] (see Table 2), only half of the ground truth segmentations were provided in order to avoid parameter tuning of background subtraction algorithms on the benchmark data-set. One has to refer to the online evaluation (http://changedetection.net) in order to get the results over all ground truth segmentations.
Figure 13: Illustration of the proposed CNN: The network contains 3 convolutional layers and a 2-layer MLP (conv1: 5×5, pad 3, pool 2×2; conv2: 5×5, pad 3, pool 2×2; conv3: 5×5, pad 3, pool 2×2; fully connected layers with 2048 and 1369 units).
Figure 14: Effect of spatial-median filtering: (a) Raw segmentation result of the GMM algorithm [22]. (b) Spatial-median filtered output with a 9 × 9 kernel.

Another data-set that we employ for evaluation purposes is the Wallflower data-set [24]. For each category in the data-set, there exists a single video sequence (see Table 3). Also, for each video sequence, a hand-segmented ground truth segmentation image is provided. Hence, when evaluating background subtraction algorithms on the Wallflower data-set [24], only a single ground truth segmentation is considered.
The PETS 2009 data-set [9] is a benchmark for the tracking of individual people within a crowd. It consists of multiple video sequences, recorded from static cameras, with different crowd activities. Since the data-set is not designed for background subtraction evaluation, there are no ground truth segmentation images for this purpose. Thus, only the qualitative segmentation results will be evaluated without calculating any evaluation metric. A sample image from each category is represented in Fig. 15.

Category | Video sequences | CDnet 2012 [10] | CDnet 2014 [27]
badWeather | 4 | | X
baseline | 4 | X | X
cameraJitter | 4 | X | X
dynamicBackground | 6 | X | X
intermittentObjectMotion | 6 | X | X
lowFramerate | 4 | | X
nightVideos | 6 | | X
PTZ | 4 | | X
shadow | 6 | X | X
thermal | 5 | X | X
turbulence | 4 | | X

Table 2: Comparison of CDnet 2012 [10] and CDnet 2014 [27]: Categories that are marked with a cross are contained in the respective data-set.
Video sequence / category | Number of frames | Ground truth frame number
Bootstrap | 3055 | 353
Camouflage | 294 | 252
ForegroundAperture | 2113 | 490
LightSwitch | 2715 | 1866
MovedObject | 1745 | 986
TimeOfDay | 5890 | 1850
WavingTrees | 287 | 248
Table 3: Overview of the Wallflower data-set [24]
In order to measure the quality of a background subtraction algorithm, we evaluate the performance by comparing the output segmentations with the ground truth segmentations to obtain the following statistics:
True Positive (TP): Foreground pixels in the output segmentation that are also foreground pixels in the ground truth segmentation.
False Positive (FP): Foreground pixels in the output segmentation that are not foreground pixels in the ground truth segmentation.
True Negative (TN): Background pixels in the output segmentation that are also background pixels in the ground truth segmentation.

Figure 15: Sample images of each video sequence. (a) baseline; (b) bad weather; (c) thermal; (d) camera jitter; (e) night videos; (f) shadow; (g) dynamic background; (h) Camouflage [24]; (i) WavingTrees [24]; (j) PETS 2009 [9]
False Negative (FN): Background pixels in the output segmentation that are not background pixels in the ground truth segmentation.
Using these statistics, different evaluation metrics are calculated, which are outlined in the following:
• Recall (Re): Re = \frac{TP}{TP + FN}   (10)
• Specificity (Sp): Sp = \frac{TN}{TN + FP}   (11)
• Precision (Pr): Pr = \frac{TP}{TP + FP}   (12)
• F-Measure (FM): FM = \frac{2 \cdot Pr \cdot Re}{Pr + Re}   (13)
• False Positive Rate (FPR): FPR = \frac{FP}{FP + TN}   (14)
• False Negative Rate (FNR): FNR = \frac{FN}{TP + FN}   (15)
• Percentage of Wrong Classifications (PWC): PWC = \frac{100 \cdot (FN + FP)}{TP + FN + FP + TN}   (16)
We are especially interested in the FM metric since most state-of-the-art algorithms in background subtraction typically exhibit higher FM values than worse performing background subtraction algorithms. This is due to the combination of multiple evaluation metrics in the calculation of the FM. Thus, the overall performance of a background subtraction algorithm is highly coupled with its FM performance.
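For reference, Eqs. (10)-(16) can be computed directly from a predicted mask and its ground truth as in the sketch below (degenerate zero-denominator cases are not handled).

```python
import numpy as np

def evaluation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute Re, Sp, Pr, FM, FPR, FNR and PWC (Eqs. 10-16).

    pred, gt: binary masks (1 = foreground, 0 = background) of equal shape.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    tn = np.sum(~pred & ~gt)
    fn = np.sum(~pred & gt)
    re = tp / (tp + fn)
    pr = tp / (tp + fp)
    return {
        "Re":  re,
        "Sp":  tn / (tn + fp),
        "Pr":  pr,
        "FM":  2 * pr * re / (pr + re),
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "PWC": 100.0 * (fn + fp) / (tp + fn + fp + tn),
    }
```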
In Section 3, we introduced our background subtraction system, consisting of a CNN, a background image retrieval module and a post-processing module. For the background image retrieval, the novel background image generation based on SuBSENSE [21] and the flux tensor was proposed, and for the post-processing, spatial-median filtering is considered. In order to find the best performing setup, we calculate the evaluation metrics presented in Section 4.2 over the CDnet 2014 [27] for each setup of our background subtraction system. Also, we compare the category-wise FM and the overall FM for those data-sets with the FMs of other background subtraction algorithms. Afterwards, we select the best performing setup for further comparison. For the Wallflower data-set [24], we calculate the FM for the best setup and again compare the values with those of other algorithms. Due to the missing ground truth images for the videos in the PETS 2009 data-set [9], we cannot derive numeric values for the FM and therefore only evaluate the segmentation outputs qualitatively. In order to compare our approach with other algorithms on multiple data-sets, we need to be able to generate the corresponding segmentations using different algorithms. For the CDnet 2014 [27], the evaluation metrics and segmentation results of all algorithms listed in the online ranking are available. For the other data-sets, namely the PETS 2009 [9] and the Wallflower data-set [24], we need to explicitly generate the segmentation images for those video sequences. For this purpose, we employed the BGSLibrary [19], as for the background image generation. As a consequence, we compare our method with algorithms that were evaluated online for the CDnet 2014 [27] as well as implemented in the BGSLibrary [19].
As already mentioned in Section 3, we train the network with mini-batches of size 150 and a fixed learning rate over 10 epochs. The training data comprise 150 images per video sequence of the categories listed in Table 1. The validation data contain 20 frames per video sequence. We train the network with RMSprop [12] using the BCE as the loss function. In Figure 16, the training plot is illustrated. Both networks yield a similar behavior and performance during training, mainly due to the identical network architecture and training setup.
Figure 16: Training and validation loss of the network using background images generated with the proposed background model.
In order to measure the quality of our background subtraction algorithm, we compute the different evaluation metrics for our background subtraction system. The metrics are reported for the CDnet 2014 [27]. Also, we compare our FMs with the FMs of other background subtraction algorithms over the complete evaluation data. For this purpose, we compare our algorithm with the GMM [22], PBAS [13] and SuBSENSE [21] algorithms. In addition, for both CDnet data-sets, we employ further background subtraction algorithms for the FM comparison. Also, we compare the segmentation outputs among the algorithms. For the comparison, we select further video sequences from the PETS 2009 data-set [9].
In the following, for each setup of our background subtraction system, the evaluation metrics for the CDnet 2014 are listed in Table 4. The outputs are post-processed with a median filter of size 9 × 9. The outputs of the CNN are compared with the threshold at R = 0.3. As we can see in Table 4, the CNN yields very good results. In addition, the CNN yields the highest FM among the compared algorithms in 6 out of 11 categories (see Table 5 and Table 6).
For the the Wallflower data-set [24], we compare the FMs among the considered algorithms.For each video in the data-set, only a single groundtruth image is given, therefore, the FM iscalculated only for this ground truth image.The di ff erent FMs are reported in Table 7. Please note that we do not consider the ”MovedOb-ject” video for comparison since the corresponding ground truth image contains no foregroundpixels and therefore, the FM can not be calculated.From the results we can see that the CNN yields the best overall FM among the considered back-ground subtraction algorithms. 21 ategory Re Sp FPR FNR PWC FM PrBaseline 0.9517 0.9991 0.0013 0.0483 0.2424 0.9580 0.9660Camera jitter 0.8788 0.9957 0.0043 0.1212 0.8994 0.8990 0.9313Dynamic Background 0.8543 0.9988 0.0012 0.1457 0.2067 0.8761 0.9083Intermittent Object Motion 0.5735 0.9949 0.0051 0.4265 4.1292 0.6098 0.8251Shadow 0.9584 0.9942 0.0058 0.0416 0.7403 0.9304 0.4844Thermal 0.6637 0.9956 0.0044 0.3363 3.5773 0.7583 0.9257Bad Weather 0.7517 0.9996 0.0004 0.2483 0.3784 0.8301 0.9494Low Framerate 0.5924 0.9975 0.0025 0.4076 1.3564 0.6002 0.9677Night Videos 0.5315 0.9959 0.0041 0.4685 2.5754 0.5835 0.8366PTZ 0.7549 0.9248 0.0752 0.2541 7.7228 0.3133 0.2855Turbulence 0.7979 0.9998 0.0002 0.2021 0.0838 0.8455 0.9082Overall 0.7545 0.9905 0.0095 0.2455 1.9920 0.7548 0.8332 Table 4: Evaluation of CNN on the CDnet 2014
Method | FM BSL | FM DBG | FM CJT | FM IOM | FM SHD | FM THM
CNN | 0.9580 | 0.8761 | 0.8990 | 0.6098 | 0.9304 | 0.7583
SuBSENSE [21] | 0.9503 | 0.8177 | 0.8152 | 0.6569 | 0.8986 | 0.8171
MBS [17] | 0.9287 | 0.7915 | 0.8367 | 0.7568 | 0.8262 | 0.8194
GMM [22] | 0.8245 | 0.6330 | 0.5969 | 0.5207 | 0.7156 | 0.6621
RMoG [25] | 0.7848 | 0.7352 | 0.7010 | 0.5431 | 0.7212 | 0.4788
Spectral-360 [18] | 0.9330 | 0.7872 | 0.7156 | 0.5656 | 0.8843 | 0.7764

Table 5: FM metric comparison of different background subtraction algorithms over 6 categories of the CDnet 2014, namely Baseline (BSL), Dynamic background (DBG), Camera jitter (CJT), Intermittent object motion (IOM), Shadow (SHD) and Thermal (THM).
Method | FM BDW | FM LFR | FM NVD | FM PTZ | FM TBL
CNN | 0.8301 | 0.6002 | 0.5835 | 0.3133 | 0.8455
PBAS [13] | 0.7673 | 0.5914 | 0.4387 | 0.1063 | 0.6349
PAWCS [20] | 0.8152 | | | |

Table 6: FM metric comparison of different background subtraction algorithms over the 5 categories of the CDnet 2014, namely Bad weather (BDW), Low framerate (LFR), Night videos (NVD), PTZ (PTZ) and Turbulence (TBL)
Table 7: F-Measures on the Wallflower dataset [24]: per-video FM (Bootstrap, Camouflage, ForegroundAperture, LightSwitch, TimeOfDay, WavingTrees) and overall FM for the CNN, SuBSENSE [21], PBAS [13] and GMM [22].
The output segmentations are already precise and not very prone to outliers (e.g. through dynamic background regions); the main problems are the camouflage regions within foreground objects. In the first 6 categories of CDnet 2014, where mainly the feature extraction and segmentation are challenging, the CNN gives the best results compared with the other algorithms in 6 out of 11 categories. For categories such as "PTZ (Pan-Tilt-Zoom)" or "low framerate", the CNN yields poor results since the background images provided by the proposed algorithm are insufficient for performing good background subtraction. Therefore, our method gives poor segmentation for these categories and, as a consequence, the average FM drops significantly. However, for the Wallflower data-set [24], where in most cases the background model is not challenged, our algorithm outperforms all other algorithms in terms of segmentation quality and overall FM.
To sum up, our system outperforms the existing algorithms when the challenge does not lie in the background modeling/maintenance; on the other hand, due to the corruption of the background images, our method performs poorly once there are large changes in the background scenery.
For further understanding and intuition, we visualize the convolutional filters and the feature maps when an image pair is fed into the network. In order to replace feature engineering, filters are employed which are learned during training. Since the learning is performed with ground truth data, highly application-specific features are extracted. Our network contains 3 convolutional layers which perform the feature extraction; especially the convolutional filters, the so-called kernels, are responsible for that task. Some of those are illustrated in the following. All convolutional filters in our network are of size 5 × 5. The filters that are directly applied to the RGB image pair of input and background image are illustrated in Figure 19. Due to the large number of filters, we only show the filter-sets that output the first feature map of the respective convolutional layer. Those filter-sets from the remaining convolutional layers are shown in Figure 20. By applying the filters to the input images or feature maps, a convolutional layer scans the input for certain patterns, defined by the filters, and generates new feature maps which capture task-specific characteristics.
By applying the learned filters to the network input and passing the intermediate values through the activation and subsampling functions, feature maps are generated. Those are in turn the input of the subsequent convolutional layer or of the classifier. In order to analyze the generated feature maps, we first need to feed an input into the network. An example input pair consisting of background and input frame is shown along with the CNN output in Figure 18. The output feature maps of all convolutional layers in our network are illustrated in Figures 19 and 20. At the end, the vectorized feature maps of the last convolutional layer in our CNN are fed into an MLP that performs the pixel-wise classification.
5. Conclusion
We have proposed a novel approach based on deep learning for the background subtraction task. The proposed approach includes three processing steps, namely background model generation, a CNN for feature learning and post-processing. It turned out that the CNN yielded the best performance among all compared background subtraction algorithms. In addition, we analyzed our network during training and visualized the convolutional filters and generated feature maps. Given that our method works only with RGB data, the potential of deep learning approaches and the available data, one could think of modeling the background with deep learning techniques, for example with an RNN. Furthermore, as we currently use a global threshold for our network outputs, one could use adaptive pixel-wise thresholds, as in the PBAS algorithm, employing a kind of "background dynamics" measure for the feedback loop. With this adaption, one could increase the sensitivity in static background areas and decrease it for areas with dynamic background. At last, when combining our method with an existing background subtraction algorithm, in our case with the SuBSENSE algorithm, one could use the information of both output segmentations and combine them to get a refined output, and employ this output for improving the updates of the background model.

Figure 17: Comparison of the segmentation outputs: The first column is the input image, the second column is its ground truth image, the third column represents the output of the CNN. The fourth column is the output of SuBSENSE [21], the fifth column represents the output of PBAS [13] and the last column shows the output of the GMM [22] segmentation. Gray pixels in the ground truth segmentation indicate pixels which are not of interest.
Figure 18: Network input and output: (a) Background image of the input pair. (b) Input frame. (c) Network output.
Figure 19: Visualization of the kernels in the first convolutional layer: (a) Kernels for processing the input image. (b) Kernels that process the background image.
Figure 20: Kernels of the second and third convolutional layers: (a) 24 kernels that generate the first output feature map in the second convolutional layer. (b) 48 kernels that output the first feature map in the third convolutional layer.

References

[1] O. Barnich and M. Van Droogenbroeck. ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing, 20(6):1709–1724, 2011.
[2] G.-A. Bilodeau, J.-P. Jodoin, and N. Saunier. Change detection in feature space using local binary similarity patterns. In Computer and Robot Vision (CRV), 2013 International Conference on, pages 106–112. IEEE, 2013.
[3] M. Braham and M. Van Droogenbroeck. Deep background subtraction with scene-specific convolutional neural networks. In International Conference on Systems, Signals and Image Processing, Bratislava, 23–25 May 2016. IEEE, 2016.
[4] F. Bunyak, K. Palaniappan, S. K. Nath, and G. Seetharaman. Flux tensor constrained geodesic active contours with sensor fusion for persistent object tracking. Journal of Multimedia, 2(4):20, 2007.
[5] Y. Chen, J. Wang, and H. Lu. Learning sharable models for robust background subtraction. Pages 1–6. IEEE, 2015.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
[7] A. Elgammal, D. Harwood, and L. Davis. Non-parametric model for background subtraction. In European Conference on Computer Vision, pages 751–767. Springer, 2000.
[8] R. H. Evangelio and T. Sikora. Complementary background models for the detection of static and moving objects in crowded environments. In Advanced Video and Signal-Based Surveillance (AVSS), 2011 8th IEEE International Conference on, pages 71–76. IEEE, 2011.
[9] J. Ferryman and A. Shahrokni. An overview of the PETS 2009 challenge. Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2009.
[10] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar. Changedetection.net: A new change detection benchmark dataset. Pages 1–8. IEEE, 2012.
[11] M. Heikkila and M. Pietikainen. A texture-based method for modeling the background and detecting moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):657–662, 2006.
[12] G. Hinton, N. Srivastava, and K. Swersky. Lecture 6a: Overview of mini-batch gradient descent, 2012.
[13] M. Hofmann, P. Tiefenbacher, and G. Rigoll. Background segmentation with feedback: The pixel-based adaptive segmenter. Pages 38–43. IEEE, 2012.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis. Real-time foreground–background segmentation using codebook model. Real-Time Imaging, 11(3):172–185, 2005.
[16] N. M. Oliver, B. Rosario, and A. P. Pentland. A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):831–843, 2000.
[17] H. Sajid and S.-C. S. Cheung. Background subtraction for static & moving camera. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 4530–4534. IEEE, 2015.
[18] M. Sedky, C. Chibelushi, and M. Moniri. Image processing: Object segmentation using full-spectrum matching of albedo derived from colour images, 2010.
[19] A. Sobral. BGSLibrary: An OpenCV C++ background subtraction library. In IX Workshop de Visão Computacional (WVC'2013), Rio de Janeiro, Brazil, June 2013.
[20] P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin. A self-adjusting approach to change detection based on background word consensus. Pages 990–997. IEEE, 2015.
[21] P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin. SuBSENSE: A universal change detection method with local adaptive sensitivity. IEEE Transactions on Image Processing, 24(1):359–373, 2015.
[22] C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, volume 2. IEEE, 1999.
[23] P. Tiefenbacher, M. Hofmann, D. Merget, and G. Rigoll. PID-based regulation of background dynamics for foreground segmentation. Pages 3282–3286. IEEE, 2014.
[24] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 1, pages 255–261. IEEE, 1999.
[25] S. Varadarajan, P. Miller, and H. Zhou. Region-based mixture of Gaussians modelling for foreground detection in dynamic scenes. Pattern Recognition, 48(11):3488–3503, 2015.
[26] R. Wang, F. Bunyak, G. Seetharaman, and K. Palaniappan. Static and moving object detection using flux tensor with split Gaussian models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2014.
[27] Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar. CDnet 2014: An expanded change detection benchmark dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 387–394, 2014.