Adaptive low rank and sparse decomposition of video using compressive sensing
Fei Yang, Hong Jiang, Zuowei Shen, Wei Deng, Dimitris Metaxas
Rutgers University, Bell Labs, National University of Singapore, Rice University
ABSTRACT
We address the problem of reconstructing and analyzing surveillance videos using compressive sensing. We develop a new method that performs video reconstruction by adaptive low rank and sparse decomposition, so that background subtraction becomes part of the reconstruction. Our method uses a background model in which the background is learned adaptively as the compressive measurements are processed. The adaptive method has low latency and is more robust than previous methods. We present experimental results to demonstrate the advantages of the proposed method.
Index Terms — Compressive sensing, low rank and sparse decomposition, background subtraction
1. INTRODUCTION
In video surveillance, video signals are captured by cameras and transmitted to a processing center, where the videos are monitored and analyzed. Given the large number of cameras installed in public places, an enormous amount of data is generated and needs to be transmitted over the network, raising a high risk of network congestion. Therefore, it is highly desirable to compress the video signals transmitted over the network.

The recently introduced compressive sensing theory establishes that if a signal has a sparse representation in some basis, then it can be reconstructed from a small set of linear measurements [1][2]. The number of measurements can be much smaller than that required by the Nyquist sampling rate. Since videos are known to have a sparse representation in some transform basis (e.g., total variation, wavelet or framelet), the compressive sensing theory can be applied to compress video at the cameras, for example by acquiring the video as compressive measurements, which can then be used to reconstruct the video [3][4].

In this paper, we develop a framework for processing surveillance video using compressive measurements. Our system is shown in Fig. 1. At the camera, the video captured by a surveillance camera is either acquired [5] as, or transformed to, low dimensional measurements by using random projections. At the processing center, the frames of the video are reconstructed and the moving objects are detected at the same time.
[Figure 1 block labels: Measurements, Transmission, Reconstruction & Detection]
Fig. 1. The framework of the compressive sensing surveillance system. The video is compressed by using random projections and then transmitted to the processing center. The frames are reconstructed and the moving objects are detected simultaneously.

Our method is based on three observations: 1) The background is nearly static over a short period, so the background images lie in a low dimensional subspace. 2) Natural images are sparse in a transform domain, such as that of a tight wavelet frame. 3) Moving objects generally occupy only a small portion of the field of view of a surveillance camera. Based on these observations, we use a low rank model for the background and a sparse model for the moving objects. The reconstruction of background and moving objects is performed by a low rank and sparse decomposition similar to [6][7].

In the low rank model of [7], a large number of video frames must be used in order to properly reconstruct the background, because the low rank and sparse decomposition computes background frames as a low rank basis of the space spanned by the incoming video frames. This results in a long latency in the reconstruction.

In this paper, we introduce an adaptive background model in which the low rank and sparse decomposition is performed with a small number of video frames. This significantly reduces latency. In this adaptive method, the video is reconstructed a few frames at a time. In each reconstruction, the compressive measurements from a small number of video frames are used to perform the low rank and sparse decomposition, which produces a set of background frames. The background frames are further processed and the results are used in the low rank and sparse decomposition for the next set of frames. Therefore, a large number of background frames effectively participate (although they are not explicitly used) in the computation of the low rank and sparse decomposition at each reconstruction, since the background frames from previous reconstructions are used. This makes it possible to accurately reconstruct background frames even with a small number of frames processed each time. The proposed method handles background changes very well because it is adaptive. Furthermore, the method reduces latency and computational complexity significantly.

In the remaining parts of the paper, we first review previous work related to our study. Then we introduce the framework of our video reconstruction method, followed by the background model and its adaptation algorithm. The experimental results are given at the end.
2. RELATED WORK
Background subtraction. There has been extensive study on background subtraction from original videos [8]. The earliest background subtraction methods use frame differences to detect the foreground [9]. Subsequent approaches aimed to model the variations and uncertainty in background appearance, such as mixtures of Gaussians [10] and non-parametric kernel density estimation [11]. Current state-of-the-art background subtraction methods are able to obtain satisfactory results for stationary cameras. However, these methods cannot be applied to compressive measurements.
Sparse reconstruction. Cevher et al. [12] cast background subtraction as a sparse approximation problem and solved it using convex optimization. Their method relies on a background model trained from pure background frames, which requires prior knowledge of the background. Jiang et al. [7] developed a low rank and sparse decomposition based approach to detect moving objects in a video. Their method solves for all the frames at the same time, which results in a long latency and a high computational cost. In contrast, the approach in this paper does not require a clean background for training, and it reconstructs the background adaptively, with a small number of video frames processed at a time. This reduces latency and complexity.
3. LOW RANK AND SPARSE DECOMPOSITION
3.1. Compressive measurements
We consider a video consisting of $m$ frames. Each frame has a total of $n$ pixels. Let $x_j \in \Re^n$ be a vector formed by concatenating all pixels in frame $j$. Let $X = [x_1, \ldots, x_m] \in \Re^{n \times m}$ be a matrix containing $m$ columns representing the $m$ frames of the video. Let $\Phi \in \Re^{r \times n}$ be a sensing matrix. The compressive measurements of $X$ are defined as

$$ y = \Phi \circ X \triangleq [\Phi x_1, \ldots, \Phi x_m], \tag{1} $$

where $y \in \Re^{r \times m}$ is a matrix of measurements, with a much smaller row dimension than $X$, i.e., $r \ll n$. Each column of $y$ contains $r$ measurements of a frame of video. In our work, $\Phi$ is composed of a set of $r$ randomly permuted rows of the Walsh-Hadamard matrix.

Given the measurements $y$, we want to reconstruct the original video $X$. $X$ can be decomposed into a background matrix $X_1$ and a foreground matrix $X_2$:

$$ X = X_1 + X_2. \tag{2} $$

Here, $X_1$ is a matrix each column of which is formed from the pixels of a background frame of the video. Similarly, $X_2$ is a matrix each column of which is formed from the pixels of a foreground frame of the video. Thus the objective is to solve for $X_1$ and $X_2$ satisfying Eqs. (1) and (2). Clearly, this is an ill-posed problem with infinitely many solutions. Therefore, we need some prior knowledge to find a proper solution.

Low rank background. We assume that the background images change relatively little over a short period, so the background matrix $X_1$ should have a low rank [6]. We use the nuclear norm as a convex surrogate for the rank of this matrix. It is defined as the sum of the singular values $\sigma_i$:

$$ \|X_1\|_* = \mathrm{trace}\left(\sqrt{X_1 X_1^T}\right) = \sum_i \sigma_i. \tag{3} $$

Sparsity in transformed domain. Previous work shows that natural images can be sparsely represented in a transformed space. We assume each background frame is sparse under a transform $W_1$, and each foreground frame is sparse under a transform $W_2$ [7]. We use the $\ell_1$-norm to measure the sparsity of the transformed background and foreground, $\|W_1 \circ X_1\|_1$ and $\|W_2 \circ X_2\|_1$, where the $\ell_1$-norm is defined as

$$ \|Z\|_1 \triangleq \sum_i \sum_j |z_{ij}|, \quad Z = [z_{ij}]. \tag{4} $$

Sparse foreground. We also assume the foreground only occupies a small portion of a frame, and therefore we can also use the $\ell_1$-norm defined in Eq. (4) to measure its sparsity: $\|X_2\|_1$.

Given these prior assumptions, $X_1$ and $X_2$ can be reconstructed by solving the following optimization problem:

$$ (X_1, X_2) = \arg\min_{X_1, X_2} \; \mu_1 \|X_1\|_* + \mu_2 \|W_1 \circ X_1\|_1 + \mu_3 \|W_2 \circ X_2\|_1 + \mu_4 \|X_2\|_1, \quad \text{such that } y = \Phi \circ (X_1 + X_2). \tag{5} $$

Here, $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ are nonnegative weights, and $W_1$ and $W_2$ are sparsifying operators. In our system, we set $W_1 = W_2 = W$ as the framelet transform [13][14][7].

Eq. (5) is a convex problem, so standard convex optimization algorithms such as the interior point method [15] can be applied to find a solution. However, these standard methods are computationally expensive. Instead, as shown in [16], singular value thresholding is more efficient for low rank decomposition. We apply the Augmented Lagrangian Alternating Direction (ALAD) algorithm introduced in Jiang et al. [7].

4. ADAPTIVE RECONSTRUCTION
To reconstruct the background and foreground by solving Eq. (5), a large number of frames (i.e., the number of columns of $X$) is required. This is because the solution to Eq. (5) captures the low rank basis in the space spanned by $X$. If the number of frames is small, a moving object may not change significantly and would therefore be captured as part of the background. Only when a large number of frames is used does the solution to (5) reconstruct the background as expected. This is the reason that a large number of frames (i.e., m > ) must be used in [7]. The requirement for a large number of frames leads to a high latency in the reconstruction. In addition, the computational complexity of singular value thresholding is $O(m)$, which makes the algorithm computationally expensive as $m$ becomes large.

In this section, we introduce an adaptive method to reduce both latency and complexity. In order to reduce latency, we want to process a small number of frames each time. However, in order to improve the accuracy of the reconstructed background, we still need a large number of columns to be present in the calculation of the nuclear norm $\|\cdot\|_*$. For this purpose, we augment $X_1$ by the previously calculated background frames. In other words, we replace $\|X_1\|_*$ in Eq. (5) by $\|[M_b, X_1]\|_*$, where $M_b$ is a matrix that models the previously calculated background frames; see equations (8) and (9) below.

The key idea of the paper is that $M_b$, a representation of previously calculated background frames, is low dimensional and is computed adaptively as more frames are processed. $M_b$ may initially be an inaccurate approximation of the background frames, but as the adaptation proceeds, $M_b$ becomes a progressively better representation of the background frames. Furthermore, as the background changes, $M_b$ changes accordingly. Therefore, this method not only reduces latency and complexity, but also allows the reconstructed background frames to adapt quickly to changes in the background of the video.

We assume that a set of $k$ background frames, $b_j$, are already computed in processing the previous frames.
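As an illustration of the nuclear norm in Eq. (3) and of the singular value thresholding step [16] referred to above, the following NumPy sketch may be helpful. It is not the ALAD solver of [7]; the function names, the toy data and the threshold `tau` are assumptions made only for this example.

```python
import numpy as np

def nuclear_norm(A):
    """Sum of the singular values of A, as in Eq. (3)."""
    return np.linalg.svd(A, compute_uv=False).sum()

def singular_value_threshold(A, tau):
    """Soft-threshold the singular values of A by tau.

    This is the proximal map of tau * (nuclear norm), i.e. the basic
    building block of singular value thresholding methods.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)      # shrink and clip at zero
    return (U * s_shrunk) @ Vt               # scale columns of U, then recombine

# Toy data (assumption): a static 64-pixel background repeated over 8 frames,
# plus a small amount of noise.
rng = np.random.default_rng(0)
frame = rng.random(64)
A = np.outer(frame, np.ones(8)) + 0.01 * rng.standard_normal((64, 8))

A_low = singular_value_threshold(A, tau=0.5)
rank_after = np.linalg.matrix_rank(A_low)
print("rank after thresholding:", rank_after)
print("nuclear norm before / after:", nuclear_norm(A), nuclear_norm(A_low))
```

In this toy case the thresholding removes the small singular values contributed by the noise and leaves a rank-one matrix, which is the behaviour that the low rank background model relies on.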
We put these frames in a background matrix defined as:

$$ X_b = [b_1, \ldots, b_k] \in \Re^{n \times k}. $$

The augmented background matrix $\hat{X}$ is formed by combining the previously computed background matrix $X_b$ with the to-be-computed background $X_1$ of $m$ new frames:

$$ \hat{X} = [X_b, X_1] \in \Re^{n \times (k+m)}. $$

The use of the augmented matrix makes it possible to reconstruct $X_1$ and $X_2$ even if $X_1$ has a very small number of columns. We now require $\hat{X}$, instead of $X_1$, to have a small rank. Therefore, the problem to solve is the same as Eq. (5) but with $\|X_1\|_*$ replaced by $\|\hat{X}\|_*$. By using $\hat{X}$, there is no need for $X_1$ to have a large number of columns. A theoretical justification is given in [17].

The computational complexity to optimize the low rank of $\hat{X}$ is $O(k + m)$, which grows quickly as frames are continuously being processed. Therefore, we need to find a lower dimensional background model $M_b \in \Re^{n \times p}$ from the computed background frames $X_b$, for a new augmented matrix $[M_b, X_1] \in \Re^{n \times (p+m)}$, where $p \ll k$. We need to find $M_b$ such that the nuclear norm of $[M_b, X_1]$ approximates the nuclear norm of $\hat{X}$, which leads to the following optimization problem:

$$ M_b = \arg\min_{M_b} \Big| \, \|\hat{X}\|_* - \|[M_b, X_1]\|_* \, \Big|. \tag{6} $$

We perform the singular value decomposition (SVD) of the background matrix $X_b$, and form $M_b$ as

$$ X_b = U D V^T, \tag{7} $$

$$ M_b = U_p D_p. \tag{8} $$

In Eqs. (7) and (8), $D$ is a diagonal matrix containing the singular values of $X_b$, and $U$, $V$ are orthogonal matrices. $D_p$ is a diagonal matrix formed by the $p$ largest singular values, and $U_p$ consists of the first $p$ columns of $U$.

Now, replacing $\|X_1\|_*$ by $\|[M_b, X_1]\|_*$ in Eq. (5), we have the low latency reconstruction given as:

$$ (X_1, X_2) = \arg\min_{X_1, X_2} \; \mu_1 \|[M_b, X_1]\|_* + \mu_2 \|W_1 \circ X_1\|_1 + \mu_3 \|W_2 \circ X_2\|_1 + \mu_4 \|X_2\|_1, \quad \text{such that } y = \Phi \circ (X_1 + X_2). \tag{9} $$

We now use the Augmented Lagrangian Alternating Direction (ALAD) algorithm to solve the problem in Eq. (9).
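The construction of the background model in Eqs. (7) and (8) can be sketched in a few lines of NumPy. The example below also checks numerically that, when the stored background frames lie exactly in a $p$-dimensional subspace, the nuclear norm of $[M_b, X_1]$ matches that of $[X_b, X_1]$, which is the quantity targeted by Eq. (6). The function name `background_model`, the matrix sizes and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def background_model(X_b, p):
    """Return M_b = U_p D_p from the SVD X_b = U D V^T (Eqs. (7)-(8))."""
    U, s, _ = np.linalg.svd(X_b, full_matrices=False)
    return U[:, :p] * s[:p]          # first p columns of U, scaled by D_p

rng = np.random.default_rng(1)
n, k, m, p = 100, 40, 4, 2           # pixels, stored frames, new frames, model size

# Hypothetical history: k background frames lying in a p-dimensional subspace.
X_b = rng.random((n, p)) @ rng.random((p, k))
# Stand-in for the background of the m new frames (slightly perturbed).
X_1 = X_b[:, -m:] + 0.01 * rng.standard_normal((n, m))

M_b = background_model(X_b, p)
norm_full = np.linalg.svd(np.hstack([X_b, X_1]), compute_uv=False).sum()
norm_model = np.linalg.svd(np.hstack([M_b, X_1]), compute_uv=False).sum()
print("columns in [X_b, X_1]:", k + m, " columns in [M_b, X_1]:", p + m)
print("nuclear norms:", norm_full, norm_model)
```

The agreement follows from $M_b M_b^T = U_p D_p^2 U_p^T$, which equals $X_b X_b^T$ when $X_b$ has rank at most $p$, so that $[M_b, X_1]$ and $[X_b, X_1]$ share the same nonzero singular values. When $X_b$ is only approximately of rank $p$, the two norms agree approximately.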
The main difficulty is that the nuclear norm term involves an augmented matrix having both known columns and unknown columns. However, this can be handled by replacing the augmented matrix with a new variable. In addition, we introduce splitting variables to make the objective function separable. We perform the variable substitution below:

$$ Z_1 = [M_b, X_1], \quad Z_2 = W_1 \circ X_1, \quad Z_3 = W_2 \circ X_2. \tag{10} $$

The ALAD optimization is shown in Alg. 1. More details about the optimization framework can be found in [7].

Algorithm 1 Reconstructing $X_1$ and $X_2$ using ALAD.
  Initialize $Z_i^{(0)}$, $\Lambda_i^{(0)}$,
  repeat
    Update $X_1$, $X_2$, while fixing $Z_i$ and $\Lambda_i$,
    Update $Z_i$, while fixing $X_1$, $X_2$ and $\Lambda_i$,
    Update $\Lambda_i$, while fixing $X_1$, $X_2$ and $Z_i$,
  until convergence

Fig. 2. Results of video reconstruction and background subtraction. Left: original frames; Middle: background and foreground reconstructed using the method of this paper; Right: foreground masks generated from the original video with GMM.

With the previously computed $M_b$, Eq. (9) can be used to compute the current background frames $X_1$ by Alg. 1. The question then is how to update $M_b$ with the current $X_1$ to obtain a new background model $M_b^{(new)}$, so that we can solve Eq. (9) to reconstruct the next set of frames. We use an approach to update $M_b$ similar to the incremental SVD [18]. Given the SVD decomposition $X_b \approx U_p D_p V_p^T$, the decomposition of the augmented matrix with the current background frames $X_1$ can be used to update $M_b$ as follows:

$$
\begin{aligned}
\left[ U^{(new)}, D^{(new)} \right] &= \mathrm{svd}\left( \left[ w_b X_b, \; w_a X_1 \right] \right) \\
&\approx \mathrm{svd}\left( \left[ w_b U_p D_p V_p^T, \; w_a X_1 \right] \right) \\
&= \mathrm{svd}\left( \left[ w_b U_p D_p, \; w_a X_1 \right] \right) \\
&= \mathrm{svd}\left( \left[ w_b M_b, \; w_a X_1 \right] \right), \\
M_b^{(new)} &= U_p^{(new)} D_p^{(new)}.
\end{aligned} \tag{11}
$$

In (11), $D_p^{(new)}$ is a diagonal matrix formed by the $p$ largest singular values, and $U_p^{(new)}$ consists of the first $p$ columns of $U^{(new)}$, similar to those in (8). $w_a$ and $w_b$ are weights controlling the updating rate. The third line of (11) holds because right-multiplying the first block by $V_p^T$, which has orthonormal rows, changes neither the left singular vectors nor the singular values of the augmented matrix.

It is important to point out that in the update (11), the large matrix $V$ of the SVD never needs to be computed, which represents a significant reduction in complexity.
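A minimal NumPy sketch of the update in Eq. (11) is given below. It streams several small batches of synthetic background frames through the update and compares the resulting model with the full history. The function name `update_background_model`, the zero initialization of $M_b$, the default weights $w_a = w_b = 1$ and the synthetic data are assumptions for this example; the values of $w_a$ and $w_b$ used in the experiments are not specified here.

```python
import numpy as np

def update_background_model(M_b, X_1, p, w_b=1.0, w_a=1.0):
    """One step of Eq. (11): SVD of [w_b * M_b, w_a * X_1], keep p leading terms.

    Only the left singular vectors and the singular values are used;
    the right factor V is discarded.
    """
    stacked = np.hstack([w_b * M_b, w_a * X_1])          # n x (p + m)
    U, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :p] * s[:p]                              # U_p^(new) D_p^(new)

rng = np.random.default_rng(2)
n, p, m = 50, 2, 3
basis = rng.random((n, p))           # hypothetical 2-dimensional background subspace
M_b = np.zeros((n, p))               # empty model before any frame is processed

batches = []
for _ in range(5):
    # Stand-in for the background frames X_1 that Alg. 1 would return.
    X_1 = basis @ rng.random((p, m))
    M_b = update_background_model(M_b, X_1, p)
    batches.append(X_1)

X_all = np.hstack(batches)           # full history, never formed by the update itself
print("model shape:", M_b.shape)
print("matches full history:", np.allclose(M_b @ M_b.T, X_all @ X_all.T))
```

With unit weights and a background that truly lies in a $p$-dimensional subspace, the streamed model carries the same information as the full history, in the sense that $M_b M_b^T$ equals $X_{all} X_{all}^T$ up to rounding, while each update only factorizes an $n \times (p+m)$ matrix. Choosing $w_b < 1$ would discount older frames in favor of recent ones.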
5. EXPERIMENTS
We perform experiments on three video clips from the PETS 2001 database. The results are shown in Fig. 2. The first column shows the original frames. The second and third columns show the backgrounds and foregrounds reconstructed by the method of this paper. We use 5% measurements for the first two examples, and 10% measurements for the last example, which needs more measurements because a large number of small moving vehicles are difficult to separate from the background. Median filters are used to post-process the results of our method to reduce noise. The last column shows the foregrounds generated by applying the Gaussian mixture model (GMM) [10].

Fig. 2 demonstrates that the results of our method are comparable to those of GMM. However, our method uses only 5%-10% of the original data, while GMM uses 100%.
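To make the measurement ratios above concrete, the following NumPy sketch forms a sensing matrix from randomly chosen rows of a Hadamard matrix and takes 5% measurements of a few synthetic frames according to Eq. (1). It is only an illustration under stated assumptions: the Hadamard matrix is built densely by the Sylvester construction (natural ordering), the frame size is a small power of two, and the names `hadamard` and `sensing_matrix` are hypothetical. A practical system would apply a fast transform rather than a dense matrix.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def sensing_matrix(n, ratio, rng):
    """Select r = round(ratio * n) distinct rows of the Hadamard matrix at random."""
    r = max(1, int(round(ratio * n)))
    rows = rng.choice(n, size=r, replace=False)
    return hadamard(n)[rows]

rng = np.random.default_rng(0)
n, m = 1024, 6                        # 32 x 32 pixel frames, 6 frames (toy sizes)
X = rng.random((n, m))                # each column is one vectorized frame

Phi = sensing_matrix(n, ratio=0.05, rng=rng)
y = Phi @ X                           # Eq. (1): column j holds the measurements of frame j
print("measurements per frame:", y.shape[0], "of", n, "pixels")
```

Because the rows of a Hadamard matrix are mutually orthogonal, the selected rows satisfy $\Phi \Phi^T = n I$, so the measurements of a frame carry no redundancy among themselves.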
6. CONCLUSION
In this paper, we address the problem of reconstructing and analyzing surveillance videos from compressive measurements. We propose a method that simultaneously performs reconstruction and background subtraction with low latency. Our method is built on a background model, which is continuously updated as new frames are reconstructed. The experiments demonstrate the effectiveness and efficiency of the proposed method.

7. REFERENCES
[1] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[2] D. L. Donoho, “Compressed sensing,” IEEE Trans. Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[3] H. Jiang, C. Li, R. Haimi-Cohen, P. Wilford, and Y. Zhang, “Scalable video coding using compressive sensing,” Bell Labs Technical Journal, vol. 16, no. 4, pp. 149–169, 2012.
[4] C. Li, H. Jiang, P. Wilford, Y. Zhang, and M. Scheutzow, “A new compressive video sensing framework for mobile broadcast,” IEEE Transactions on Broadcasting, vol. 59, no. 1, pp. 197–205.
[5] G. Huang, H. Jiang, K. Matthews, and P. Wilford, “Lensless imaging by compressive sensing,” accepted for presentation, Sept. 2013.
[6] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” Journal of ACM, vol. 58, no. 1, pp. 1–37, 2009.
[7] H. Jiang, W. Deng, and Z. Shen, “Surveillance video processing using compressive sensing,” Inverse Problems and Imaging, vol. 6, no. 2, pp. 201–214, 2012.
[8] S. Brutzer, B. Hoferlin, and G. Heidemann, “Evaluation of background subtraction techniques for video surveillance,” in Proc. CVPR, 2011.
[9] R. Jain and H. H. Nagel, “On the analysis of accumulative difference pictures from image sequences of real world scenes,” IEEE Trans. Pattern Analysis and Machine Intelligence, no. 2, pp. 206–214, 1979.
[10] C. Stauffer and W. E. L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747–757, 2000.
[11] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis, “Background and foreground modeling using nonparametric kernel density estimation for visual surveillance,” Proceedings of IEEE, vol. 90, no. 7, pp. 1151–1163, 2002.
[12] V. Cevher, A. Sankaranarayanan, M. Duarte, D. Reddy, R. Baraniuk, and R. Chellappa, “Compressive sensing for background subtraction,” in Proc. ECCV, 2008.
[13] A. Ron and Z. Shen, “Affine systems in L2(R^d): the analysis of the analysis operator,” Journal of Functional Analysis, vol. 148, pp. 408–447, 1997.
[14] I. Daubechies, B. Han, A. Ron, and Z. Shen, “Framelets: MRA-based constructions of wavelet frames,” Applied and Computational Harmonic Analysis, no. 14, pp. 1–46, 2003.
[15] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal, Numerical optimization: theoretical and practical aspects, Springer, 2006.
[16] J. F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[17] H. Jiang, S. Zhao, Z. Shen, D. Deng, P. Wilford, and R. Haimi-Cohen, “Surveillance video analysis using compressive sensing with low latency,” Bell Labs Technical Journal, to appear, 2013.
[18] M. Brand, “Fast low-rank modifications of the thin singular value decomposition,”