The STONE Transform: Multi-Resolution Image Enhancement and Real-Time Compressive Video
Tom Goldstein, Lina Xu, Kevin F. Kelly, and Richard Baraniuk
Abstract—Compressive sensing enables the reconstruction of high-resolution signals from under-sampled data. While compressive methods simplify data acquisition, they require the solution of difficult recovery problems to make use of the resulting measurements. This article presents a new sensing framework that combines the advantages of both conventional and compressive sensing. Using the proposed STOne transform, measurements can be reconstructed instantly at Nyquist rates at any power-of-two resolution. The same data can then be "enhanced" to higher resolutions using compressive methods that leverage sparsity to "beat" the Nyquist limit. The availability of a fast direct reconstruction enables compressive measurements to be processed on small embedded devices. We demonstrate this by constructing a real-time compressive video camera.
I. INTRODUCTION
In light of the recent data deluge, compressive sensing has emerged as a method for gathering high-resolution data using dramatically fewer measurements. This comes at a steep price. While compressive methods simplify data acquisition, they require the solution of difficult recovery problems to make use of the resulting measurements. Compressive imaging has replaced the data deluge with an algorithmic avalanche: conventional sensing saturates our ability to store information, and compressive sensing saturates our ability to recover it.

Spatial Multiplexing Cameras (SMCs) are an emerging technology allowing high-resolution images to be acquired using a single photo detector. Interest in SMCs has been motivated by applications where sensor construction is extremely costly, such as imaging in the Short-Wave Infrared (SWIR) spectrum. For such applications, SMCs allow for the low-cost development of cameras with high-resolution output. However, the processing of compressive data is much more difficult, making real-time reconstruction intractable using current methods.

The burden of reconstruction is a major roadblock for compressive methods in applications. For this reason it is common to reconstruct images offline, when sufficient time and computing resources become available. As a result, the need to extract real-time information from compressive data has led to methods for analyzing scenes in the compressive domain before reconstruction. Many of these methods work by applying image classifiers and other learning techniques directly on compressive data, sacrificing accuracy for computational tractability.

The reconstruction problem is particularly crushing in the case of video processing. Naively extending SMC methods to video results in cameras with poor temporal resolution and burdensome Size, Weight and Power (SWaP) characteristics. Many proposed video reconstruction schemes rely on costly optical flow maps, which track the movement of objects through time.
This calculation makes video reconstruction slow: a few seconds of video may take hours or days to reconstruct. When the number of frames becomes large, the reconstruction problem becomes orders of magnitude more costly than for still images, eclipsing the possibility of real-time video using conventional methods.

We present a framework that unifies conventional and compressive imaging, capturing the advantages of both. Using a new transform, we acquire compressive measurements that can be either reconstructed immediately at Nyquist rates, or "enhanced" using an offline compressive scheme that exploits sparsity to escape the Nyquist bound.

A cornerstone of our framework is a multi-scale sensing operator that enables reconstruction in two different ways. First, the proposed measurements can be reconstructed using high-resolution compressive schemes that beat the Nyquist limit using sparse representations. Alternately, the data can be reconstructed using a simple, direct, fast O(N log N) transform that produces "preview" images at standard Nyquist sampling rates. This direct reconstruction transforms measurements into the image domain for online scene analysis and object classification in real time. This way, image processing tasks can be performed without the accuracy sacrificed by working in the compressive domain. The fast transform also produces real-time image and video previews with only trivial computational resources. These two reconstruction methods are demonstrated on an under-sampled image in Figure 1.

In the case of high-resolution compressive video reconstruction, we also propose a numerical method that performs reconstruction using a sequence of efficient steps. The compressive video reconstruction relies on a new "3DTV" model, which recovers video without expensive pre-processing steps such as optical flow. In addition, the proposed reconstruction uses efficient primal-dual methods that do not require any expensive implicit sub-steps.
These numerical methods are very simple to implement and are suitable for real-time implementation using parallel architectures such as Field Programmable Gate Arrays (FPGAs).

The flexibility of this reconstruction framework enables a real-time compressive video camera. Using a single-pixel detector, this camera produces a data stream that is reconstructed in real time at Nyquist rates. After acquisition, the same video data can be enhanced to higher resolution using offline compressive reconstruction.
Fig. 1. The proposed sensing operators allow under-sampled data to be reconstructed using either iterative/compressive or direct reconstructions. (top) A direct "preview" reconstruction of a 1024 × 1024 image from sub-sampled data. (bottom) An iterative reconstruction leverages compressed sensing to achieve higher resolution using the same data.

A. Structure of this Paper
In Section 2, we present background information on compressive imaging and the challenges of compressive video. In Section 3, we introduce the STOne transform, which enables images to be reconstructed using both compressive and Nyquist methods at multiple resolutions. We analyze the statistical properties of the new sensing matrices in Section 4. Then, we discuss the 3DTV model for compressive reconstruction in Section 5. This model exploits compressive sensing to construct high-resolution videos without common computational burdens. Fast, simple numerical methods for compressive video reconstruction are introduced in Section 6, an assortment of applications are discussed in Section 7, and numerical results are presented in Section 8.

II. BACKGROUND
A. Single Pixel Cameras
Numerous compressive imaging platforms have been proposed, including the Single-Pixel Camera [1], flexible voxel camera [2], P2C2 [3], and coded aperture arrays [4]. To be concrete, we focus here on the spatially multiplexing Single Pixel Camera (SPC), as described in [1]. However, the measurement operators and fast numerical schemes are easily applicable to a wide variety of cameras, including temporal and spectral multiplexing cameras.

Rather than measuring individual pixels alone, SPCs measure coded linear combinations of pixels [1]. An SPC consists of a lens, a Digital Micro-mirror Device (DMD), and a photo detector. Each mirror on the DMD modulates an individual pixel by diverting light either towards or away from the detector. This results in a combination coefficient for that pixel of +1 or −1, respectively.

When the combination coefficients are chosen appropriately, the resulting measurements can be interpreted as transform coefficients (such as Hadamard coefficients) of the image. For this reason, it is often said that SPCs sense images in the transform domain.

The i-th measurement of the device is an inner product ⟨φ_i, v⟩, where v is the vectorized image and φ_i is a vector of ±1's encoding the orientation of the mirrors. Once M measurements have been collected, the observed information can be written

  Φv = b

where the rows of Φ contain the vectors {φ_i}, and b is the vector of measurements. If v contains N pixels, then Φ is an M × N matrix. Image reconstruction requires solving this system. If Φ is a fast binary transform (such as a Hadamard transform), the image reconstruction is simple and fast. However, it is often the case that M < N, and image reconstruction requires additional assumptions on v.

B. Compressed Sensing
Because SPCs acquire measurements in the transform domain (by measuring linear combinations of pixels), they can utilize Compressive Sensing [5], [6], which exploits the compressibility of images to keep resolution high while sampling below the Nyquist rate. Compressive imaging reduces the number of measurements needed for image reconstruction, and thus greatly accelerates imaging. The cost of such compressive methods is that the image reconstruction process is computationally intense. Because we have fewer measurements than pixels, additional assumptions need to be made on the image.

Compressed sensing assumes that images have sparse representations under some transform. This assumption leads to the following reconstruction problem:

  min |Sv|_0  subject to  Φv = b        (1)

where S is the sparsifying transform, and the ℓ0 norm, |·|_0, simply counts the number of non-zero entries in Sv. In plain words, we want to find the sparsest image that is still compatible with our measurements. It has been shown that when S and Φ are chosen appropriately, nearly exact reconstructions of v are possible using only O(|Sv|_0 log N) measurements, which is substantially fewer than the N required for conventional reconstruction. In practice, accurate recovery is possible when Φ consists of randomly sampled rows of an orthogonal matrix [7], [8].

In practice, it is difficult to solve (1) exactly. Rather, we solve the convex relaxation, which is denoted

  min E(u) = |Su|_1 + (µ/2) ‖Φu − b‖²        (2)

for some regularization parameter µ, where |·|_1 and ‖·‖ denote the ℓ1 and ℓ2 norms, respectively. It is known that when S and Φ are mutually incoherent, the solutions to (2) and (1) coincide with high probability.

Most often, the measurement matrix Φ is formed by sub-sampling the rows of an orthogonal matrix. In this case, we can write

  Φ = RT

where R is a diagonal "row selector" matrix, with R_{i,i} = 1 if row i has been measured, and R_{i,i} = 0 otherwise. The matrix T is generally an orthogonal matrix that can be computed quickly using a fast transform, such as the Hadamard transform.

C. The Challenges of Compressive Video
Much work has been done on compressive models for still-frame imaging, while relatively little is known about video reconstruction. Video reconstruction poses many new challenges that still-frame imaging does not. Object motion during data acquisition produces motion artifacts. Rather than appearing as a simple motion blur (as in conventional pixel-domain imaging), these motion artifacts get aliased during reconstruction and affect the entire image. In order to prevent high levels of motion aliasing, reconstruction must occur at a high frame rate, yielding a small number of measurements per frame.

The authors of [9] conduct a thorough investigation of motion aliasing in compressive measurements. A tradeoff exists between spatial resolution and temporal blur when objects are moving. When images are sampled at low resolutions (i.e., pixels are large), objects appear to move slower relative to the pixel size, and thus aliasing due to motion is less severe. When spatial resolution is high, motion aliasing is more severe. For this reason it is desirable to have the flexibility to interpret data at multiple resolutions. Higher resolutions can be used for slower moving objects, and lower resolutions can be used for fast moving objects. This is an issue that will be addressed by the new Sum-To-One (STOne) transform, introduced in Section III.
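Before turning to prior work, the measurement model of Section II-B can be made concrete with a small sketch. The code below is illustrative only, not the method of this paper: it forms Φ = RT from randomly selected rows of an orthonormal Hadamard matrix (built densely for clarity, rather than with a fast transform) and solves a small instance of the relaxation (2), with S taken to be the identity, using plain ISTA. The sizes, seed, and value of µ are arbitrary assumptions.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix with orthonormal rows (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
N, M, K = 256, 96, 8                  # signal length, measurements, sparsity (illustrative)
T = hadamard(N)                       # orthogonal transform (dense here for clarity)
picked = rng.choice(N, size=M, replace=False)
Phi = T[picked]                       # Phi = R T: row selector applied to the transform

v = np.zeros(N)
v[rng.choice(N, size=K, replace=False)] = rng.standard_normal(K)
b = Phi @ v                           # compressive measurements, M < N

# ISTA for the relaxation (2): min ||u||_1 + (mu/2)||Phi u - b||^2, with S = identity
mu = 50.0
step = 1.0 / mu                       # safe step since the rows of Phi are orthonormal
u = np.zeros(N)
for _ in range(500):
    g = u - step * mu * (Phi.T @ (Phi @ u - b))         # gradient step on smooth term
    u = np.sign(g) * np.maximum(np.abs(g) - step, 0.0)  # soft thresholding (l1 prox)

rel = np.linalg.norm(u - v) / np.linalg.norm(v)
print(f"relative recovery error: {rel:.3f}")
```

Because the signal is sparse in the pixel basis while the Hadamard rows are spread out, the pair is incoherent and the sparse signal is recovered accurately from far fewer than N measurements.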
D. Previous Work on Compressive Video
A video can be viewed as a sequence of still frames, each individually sparse under some spatial transform. However, adjacent video frames generally have high correlations between corresponding pixels. Consequently, a large amount of information can be obtained by exploiting these correlations. One way to exploit correlations between adjacent frames is to use "motion compensation." Park and Wakin [10] first proposed the use of previews to compute motion fields between images, which could be used to enhance the results of compressive reconstruction.

The first full-scale implementation of this concept is the Compressed Sensing Multi-scale Video (CS-MUVI) framework [11]. This method reconstructs video in three stages. First, low-resolution previews are constructed for each frame. Next, an optical flow model is used to match corresponding pixels in adjacent frames. Finally, a reconstruction of the type (2) is used with "optical flow constraints" added to enforce equality of corresponding pixels.

The CS-MUVI framework produces good video quality at high compression rates; however, it suffers from extreme computational costs. The CS-MUVI framework relies on special measurement operators that cannot be computed using fast transforms. A full-scale transform requires O(N²) operations, although the authors of [11] use a "partial" transform to partially mitigate this complexity. In addition, optical flow calculation is very expensive, and is a necessary ingredient for obtaining precise knowledge of the correspondence between pixels in adjacent images. Finally, optimization schemes that can handle relatively unstructured optical flow constraints are less efficient than solvers for more structured problems.

A similar approach was adopted by the authors of [12] in the context of Magnetic Resonance Imaging (MRI). Rather than use previews to generate motion maps, the authors propose an iterative process that alternates between computing compressive reconstructions and updating the estimated flow maps between adjacent frames. This method does not rely on special sensing operators to generate previews, and works well using the Fourier transform operators needed for MRI. However, this flexibility comes at a high computational cost, as the computation of flow maps must now be done iteratively (as opposed to just once using low-resolution previews). Also, the resulting flow constraints have no regular structure that can be exploited to speed up the numerics.

For these reasons, it is desirable to have more efficient sparsity models allowing for efficient computation. Such a model should (1) not require a separate pre-processing stage to build the sparsity model, (2) rely on simple, efficient numerical schemes, and (3) use only fast transform operators that can be computed quickly. The desire for more efficient reconstruction motivates the 3DTV framework proposed below.

III. MULTI-RESOLUTION MEASUREMENT MATRICES: THE STONE TRANSFORM
In this section, we discuss the construction of measurement operators that allow for fast low-resolution "previews." Previews are conventional (i.e., non-compressive) reconstructions that require one measurement per pixel. Because they rely on direct reconstructions, previews have the advantage that they do not require the solution of expensive numerical problems, and thus require less power and time to reconstruct. At the same time, the data used to generate previews is appropriate for high-resolution compressive reconstruction when the required resources are available.

Such measurement matrices have the special property that each row of the matrix has unit sum, and so we call this the Sum To One (STOne) transform.
A. The Image Preview Equation
Previews are direct reconstructions of an image without the need for full-scale iterative methods. Such an efficient reconstruction is only possible if the sensing matrix Φ is designed appropriately. We wish to identify the conditions on Φ that make previews possible.

The low-resolution preview must be constructed in such a way that it is compatible with measured data. In order to measure the quality of our preview using high-resolution data, we must convert the low-resolution n × n image into a high-resolution N × N image using the prolongation operator P_n^N. To be compatible with our measurements, the low-resolution preview u_n must satisfy

  Φ P u_n = b        (3)

where we have left the sub/super-scripts off P for notational simplicity. In words, when the preview is up-sampled to full resolution, the result must be compatible with the measurements we have taken.

The solution of the preview equation (3) is highly non-trivial if the sensing matrix is not carefully constructed. Even when Φ is well-behaved, it may be the case that ΦP is not. For example, if Φ is constructed from randomly selected rows of a Hadamard matrix, the matrix ΦP is poorly conditioned. As a result, the low-resolution preview is extremely noise-sensitive and inaccurate. Also, for many conventional sensing operators, there is no efficient algorithm for solving the preview equation. Even when Φ can be evaluated efficiently (using, e.g., the fast Hadamard or Fourier transform), solving the preview equations may require slow iterative methods.

Ideally, we would like the matrix ΦP to be unitary. In this case, the system is well conditioned, making the preview highly robust to noise. Also, the equations can be explicitly solved in the form u_n = P^T Φ^T b. These observations motivate the following list of properties that a good sensing operator should have:

1) The matrix Φ has good compressive sensing properties. It is known that compressive reconstruction is possible if the rows of Φ are sub-sampled from an orthogonal matrix [7], [8]. That is, we want to define Φ = RT for some row selector R and fast orthogonal transform T.

2) The preview matrix ΦP must be well conditioned and easily invertible. These conditions are ensured if ΦP is unitary.

3) The entries in Φ must be ±1. Our sensing matrix must be realizable using an SMC.

It is not clear that a sensing matrix possessing all these properties exists, and for this reason several authors have proposed sensing methods that sacrifice at least one of the above properties. CS-MUVI, for example, relies on Dual-Scale-Space (DSS) matrices that satisfy properties 2 and 3, but not 1. For this reason, the DSS matrices do not have a fast transform, making reconstruction slow.

Below, we describe the construction of sensing matrices that satisfy all of the above desired properties.

B. Embeddings of Images into Vectors
Suppose we have compressive measurements taken from an N × N image. We wish to acquire a low-resolution n × n preview for n < N. If n evenly divides N, then we can define the downsampling ratio δ = N/n.
Depending on the situation, we will need to represent images as either a 2-dimensional array of pixels or as a 1-dimensional column vector. The most obvious embedding of images into vectors uses the row/column major ordering, or equivalently performs the transform on the image in the row and column directions separately. This embedding does not allow low-resolution previews to be constructed using a simple transform.

Rather, we embed the image into a vector by evenly dividing the image into blocks of size δ × δ. There will be n² such blocks. The image is then vectorized block-by-block. The resulting vector has the form

  v = (v_1, v_2, . . . , v_{n²})^T        (4)

where v_i contains the pixel data from the i-th block. It is possible to embed the image so that the vector is in block form (4) for every choice of n = 2^k < N. When such an embedding is used, previews can be obtained at arbitrary power-of-two resolutions.

The new embedding is closely related to the so-called nested-dissection ordering of the image, which is well known in the numerical linear algebra literature [13], [14]. The proposed ordering is defined by a recursive function which breaks the square image into four symmetrical panels. Each of the four panels is addressed one at a time. Every pixel in the first panel is numbered, then the second panel, and then the third and fourth. The numbering assigned to the pixels in each panel is defined by applying the recursive algorithm. Pseudocode for this method is given in Algorithm 1.

Algorithm 1
Ordering of Pixels in an n × n Image
Inputs:
  Image: a 2d array of pixels
  L: the lowest index to be assigned to a pixel

function ASSIGN(Image, L)
  Let N be the side length of Image
  if N = 1 then
    Image ← L
    return
  end if
  Break Image into four panels, {I_1, I_2, I_3, I_4}
  ASSIGN(I_1, L)
  ASSIGN(I_2, L + N × N/4)
  ASSIGN(I_3, L + 2N × N/4)
  ASSIGN(I_4, L + 3N × N/4)
end function

This recursive process is depicted graphically in Figure 2. Note that at every level of the recursion, we have
assigned a contiguous block of indices to each sub-block of the image, and thus the corresponding embedding admits a decomposition of the form (4). Below, we present Theorem 2, which can be used to obtain previews for any n = 2^k with 0 ≤ k < K, where N = 2^K.

Fig. 2. The recursive algorithm for embedding an 8 × 8 image into a 1-dimensional signal. The ordering of the pixels is generated by recursively breaking the image into blocks and numbering each block individually. (Top Left) The first stage of the algorithm breaks the image into 4 blocks, and assigns a block of indices to each in clockwise order. (Bottom Left) The second level of recursion breaks each block into 4 sub-panels, which are each numbered in clockwise order. (Right) The third level of recursion breaks each sub-panel into 4 pixels, which are numbered in clockwise order. Note that the ordering of the panels is arbitrary, but we choose the clockwise ordering here for clarity.

C. Interpolation Operators
In order to study the relationship between the low- and high-resolution images, we will need prolongation/interpolation operators to convert between these two resolutions. The prolongation operator maps a small n × n image into a large N × N image. It replaces each pixel in the low-resolution image with a δ × δ block of pixels. If the image has been vectorized in the block-wise fashion described above, the prolongation operator can be written in block form as

  P_n^N = I_{n²} ⊗ 1_{δ²}

where 1_{δ²} denotes a δ² × 1 column vector of 1's, I_{n²} denotes the n² × n² identity matrix, and ⊗ denotes the Kronecker product.

The row-selector matrix and measurement vector can also be broken down into blocks. We can write

  R = diag(R_1, R_2, . . . , R_{n²}),    b = (b_1, b_2, . . . , b_{n²})^T

where R_i is a δ² × δ² row-selector sub-matrix, and b_i contains a block of δ² measurements.

D. A Sum-To-One Transform
We now introduce a fast orthogonal transform that will be the building block of our sensing matrices. This transform has the property that each row of the transform matrix Sums To One, and thus we call it the STOne transform. It will be shown later that this property is essential for the existence of fast preview reconstruction methods. Consider the following matrix stencil:

  S_1 = (1/2) [ −1  1  1  1 ;
                 1 −1  1  1 ;
                 1  1 −1  1 ;
                 1  1  1 −1 ]

It is clear by simple observation that this matrix is unitary (i.e., S_1^T S_1 = I), and its eigenvalues are ±1. Furthermore, unlike the stencil for the standard Hadamard matrix, this stencil has the property that the rows of the matrix sum to 1.

We will use the matrix S_1 as a stencil to construct a new set of transform matrices as follows:

  S_{k+1} = S_1 ⊗ S_k = (1/2) [ −S_k  S_k  S_k  S_k ;
                                  S_k −S_k  S_k  S_k ;
                                  S_k  S_k −S_k  S_k ;
                                  S_k  S_k  S_k −S_k ]

where ⊗ denotes the Kronecker product. We have the following result:

Theorem 1: For all k ≥ 1, each row of the matrix S_k sums to 1. Also, every matrix S_k is unitary.

Proof: The summation result follows immediately by inspection. To prove the unitary claim, we simply use the Kronecker product identity (A ⊗ B)(C ⊗ D) = AC ⊗ BD. It follows that

  S_{k+1}^T S_{k+1} = (S_1 ⊗ S_k)^T (S_1 ⊗ S_k) = (S_1^T ⊗ S_k^T)(S_1 ⊗ S_k) = S_1^T S_1 ⊗ S_k^T S_k = I ⊗ I = I.

The result follows by induction.

Note that because the Kronecker product is associative, we can form arbitrary decompositions of S_k of the form

  S_k = S_{k−ℓ} ⊗ S_ℓ

for any ℓ between 1 and k.

E. Reconstructing Low Resolution Previews
In this section, we show how the sum-to-one transform can be used to obtain low-resolution previews. The construction works at any power-of-two resolution. The low-resolution preview is possible if our data satisfies the following simple properties:

1) The sensing matrix is of the form Φ = RS_N, where R is a row-selector matrix and S_N is the fast sum-to-one transform.

2) Every block, R_i, of δ² diagonal entries of R contains at least one non-zero entry.

3) Every δ × δ patch of the image is mapped contiguously into the vector v. This will be true for any power-of-two downsampling when the pixels are ordered using Algorithm 1.

We show below that when the measurement operator has these properties, it is possible to efficiently recover low-resolution previews using a simple fast transform. As discussed above, the low-resolution preview requires us to solve the equation

  Φ P u_n = R S_N P u_n = b        (5)

When the measurements are taken using the sum-to-one transform, the solution to this equation is given explicitly by the following theorem.

Theorem 2:
Consider a row selector R, a measurement operator of the form Φ = RS_N, and a prolongation operator P as described above. Suppose that images are represented as vectors in the block form (4).

If each sub-matrix R_i contains exactly one non-zero entry, then the preview equation has a unique solution, which is given by

  u_n = S_n b̄

where b̄ is an n² × 1 vector containing the known entries of b.

If each sub-matrix R_i contains one or more non-zero entries, then the preview equation may be overdetermined, and the least-squares solution is given by

  u_n = S_n b̂,  where  b̂_i = ( Σ_{γ=1}^{δ²} b_{i,γ} ) / ( Σ_{γ=1}^{δ²} R_{i,γ} ).

In other words, b̂_i is the mean value of the known entries in b_i.

Proof:
The preview equation contains the product S_N P, which we can decompose using the Kronecker product definition of the sum-to-one transform:

  S_N P = (S_n ⊗ S_{N/n}) P = (S_n ⊗ S_δ) P.

The up-sampling operator works by replacing every pixel in u_n with a constant panel of size δ × δ, and has the form P = I_{n²} ⊗ 1_{δ²}. We can thus write

  S_N P = (S_n ⊗ S_δ)(I_{n²} ⊗ 1_{δ²}) = (S_n I_{n²}) ⊗ (S_δ 1_{δ²}) = S_n ⊗ (S_δ 1_{δ²}).

Now is when the sum-to-one property comes into play. The product S_δ 1_{δ²} simply computes the row sums of the matrix S_δ, and so S_δ 1_{δ²} = 1_{δ²}. This gives us

  S_N P = S_n ⊗ 1_{δ²}.

Using this reduction of the transform and prolongation operators, the low-resolution preview equation (5) reduces to

  Φ P u_n = R S_N P u_n = R (S_n ⊗ 1_{δ²}) u_n = b.

Note that the matrix S_n ⊗ 1_{δ²} is formed by taking the matrix S_n and copying each of its rows δ² times. If each of the R_i contains only a single non-zero entry, then the operator R simply selects out a single copy of each row. In this case, it follows that the preview equation reduces to

  S_n u_n = b̄.

The solution to this equation is u_n = S_n^{−1} b̄ = S_n b̄, since S_n is symmetric and unitary, and hence its own inverse.

If we allow the R_i to contain more than one non-zero element, then we must seek the least-squares solution. In this case, consider the sum-of-squares error function:

  E(u) = ‖R (S_n ⊗ 1_{δ²}) u − b‖² = Σ_{i=1}^{n²} Σ_{γ=1}^{δ²} R_{i,γ} ( S_{n,i} · u − b_{i,γ} )²        (6)

where S_{n,i} denotes the i-th row of S_n.
Observe now that

  Σ_{γ=1}^{δ²} R_{i,γ} ( S_{n,i} · u − b_{i,γ} )²
    = r_i ( S_{n,i} · u )² − 2 ( S_{n,i} · u ) Σ_{γ=1}^{δ²} R_{i,γ} b_{i,γ} + Σ_{γ=1}^{δ²} R_{i,γ} b_{i,γ}²
    = ( √r_i S_{n,i} · u − (1/√r_i) Σ_{γ=1}^{δ²} R_{i,γ} b_{i,γ} )² + C_i

where r_i = Σ_{γ} R_{i,γ} is the number of known entries in b_i, and C_i is a constant that depends only on b. It follows that the least-squares energy (6) can be written

  E(u) = Σ_{i=1}^{n²} ( √r_i S_{n,i} · u − √r_i b̂_i )² + Σ_{i=1}^{n²} C_i = ‖R̂^{1/2} S_n u − R̂^{1/2} b̂‖² + C

where C is a constant, R̂ is diagonal with R̂_{i,i} = r_i, and b̂_i = (1/r_i) Σ_{γ=1}^{δ²} R_{i,γ} b_{i,γ}. This energy is minimized when we choose u_n to satisfy the equation S_n u = b̂.

Theorem 2 shows that under-sampled high-resolution STOne coefficients can be re-binned into a complete set of low-resolution STOne transform coefficients. These low-resolution coefficients are then converted into an image using a single fast transform. This process is depicted in Figure 3.
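The Kronecker construction of S_k and the re-binning result of Theorem 2 can be verified numerically. The sketch below uses small dense matrices for clarity (the hypothetical helper name `stone` and the 8 × 8/4 × 4 sizes are our own choices, not from the paper; a practical implementation would use a fast O(N log N) transform). It checks Theorem 1, checks the key identity S_N P = S_n ⊗ 1, and recovers a patch-constant image exactly from one measurement per row group.

```python
import numpy as np

def stone(npix):
    """STOne matrix on npix pixels (npix a power of 4), via Kronecker powers of the stencil."""
    S1 = 0.5 * np.array([[-1, 1, 1, 1],
                         [ 1,-1, 1, 1],
                         [ 1, 1,-1, 1],
                         [ 1, 1, 1,-1]], dtype=float)
    S = S1
    while S.shape[0] < npix:
        S = np.kron(S1, S)
    return S

Nsq, nsq = 64, 16                 # full and preview pixel counts (8x8 image, 4x4 preview)
dsq = Nsq // nsq                  # delta^2 = pixels per patch
S_N, S_n = stone(Nsq), stone(nsq)

# Theorem 1: unit row sums and orthogonality
assert np.allclose(S_N.sum(axis=1), 1.0)
assert np.allclose(S_N.T @ S_N, np.eye(Nsq))

# Prolongation in block-ordered coordinates, and the identity behind Theorem 2
P = np.kron(np.eye(nsq), np.ones((dsq, 1)))          # P = I ⊗ 1
assert np.allclose(S_N @ P, np.kron(S_n, np.ones((dsq, 1))))

# Preview demo: image exactly constant on patches, one measured row per group
rng = np.random.default_rng(1)
w = rng.standard_normal(nsq)                         # low-resolution ground truth
v = P @ w                                            # block-ordered full-resolution image
picks = rng.integers(0, dsq, nsq) + dsq * np.arange(nsq)   # one row index per group
b_bar = (S_N @ v)[picks]                             # re-binned measurements
u = S_n @ b_bar                                      # direct preview: S_n is self-inverse
assert np.allclose(u, w)
print("preview matches the patch-constant image")
```

Each assertion mirrors a claim in the text: the first two are Theorem 1, the third is the reduction S_N P = S_n ⊗ 1 from the proof of Theorem 2, and the last shows the preview is exact when the image is constant on δ × δ patches.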
F. Design of the Measurement Sequence
In this section, we propose a "structured random" measurement sequence that allows for previews to be obtained at any resolution or time. It is clear from the discussion above that it is desirable to form the measurement operator by subsampling rows of S_N. However, the order in which these rows are sampled is of practical significance, and must be carefully chosen if we are to reconstruct previews at arbitrary resolutions. We demonstrate this by considering the following scenarios.

Suppose that we have a sequence of measurements {b_{k_i}} taken using row indices K = {k_i} of the matrix S_N. After measurement b_i is acquired, we wish to obtain an n × n preview using only the most recent n² measurements. If we break the set of rows of S_N into n² groups, then we know from Theorem 2 that the preview exists only if we have sampled a measurement from each group. It follows that each of the latest n² measurements must lie in a unique group.

After measurement b_{i+1} is acquired, the measurement window is shifted forward. Since we want to have previews available at any time, it must still be true that the most recent n² measurements lie in unique groups even after this shift.

Now, suppose that because of fast moving objects in the image, we decide we want a "faster" preview with a shorter acquisition window. We now need an n′ × n′ preview with n′ < n. When we sample the most recent (n′)² data and redistribute the row space into (n′)² groups, each of the data in this new window must lie in a unique group.

Clearly, the measurement sequence must be carefully constructed to allow for this high level of flexibility. The measurement sequence must have the property that, for any index i and resolution n, if we break the row space into n² groups, the measurements b_{i−n²+1} through b_i all lie in separate groups. Such an ordering is produced by the recursive process listed in Algorithm 2.

Algorithm 2
Sampling Order of Rows From the STOne Matrix
Inputs:
  K = {k_i}: a linearly ordered list of row indices

function ORDER(K)
  Let |K| be the length of the list K.
  if |K| = 1 then
    return K
  end if
  Form 4 sub-lists: {k_i}_{i=1}^{|K|/4}, {k_i}_{i=|K|/4+1}^{|K|/2}, {k_i}_{i=|K|/2+1}^{3|K|/4}, {k_i}_{i=3|K|/4+1}^{|K|}
  Randomly assign the names K_1, K_2, K_3, and K_4 to the sub-lists
  ORDER(K_1)
  ORDER(K_2)
  ORDER(K_3)
  ORDER(K_4)
  K ← {K_1(1) K_2(1) K_3(1) K_4(1) K_1(2) K_2(2) K_3(2) K_4(2) · · ·}
end function

The top-level input to the algorithm is a linearly ordered sequence of row indices, K = {k_i} = {1, 2, · · · , N²}. At each level of the recursion, the list of indices is broken evenly into four parts, and each group is randomly assigned a unique name from {K_1, K_2, K_3, K_4}. Each of the 4 groups is then reordered by recursively calling the reordering algorithm. Once each group has been reordered, they are recombined into a single sequence by interleaving the indices, i.e., the new sequence contains the first entry of each list, followed by the second entry of each list, etc.

Note that Algorithm 2 is non-deterministic because the index groups formed at each level are randomly permuted. Because of these random permutations, the resulting ordering K exhibits a high degree of randomness suitable for compressive sensing.

Compressive data acquisition proceeds by obtaining a sequence of measurements b_i = ⟨S_{N,k_i}, v⟩, where S_{N,k_i}
In this expression, S_{N,k_i} denotes row k_i of S_N, and the sequence K = {k_i} is generated by Algorithm 2. If the number of data acquired exceeds N², then we proceed by starting over with row k_1 and continuing as before. Any window of length n² will still contain exactly one element from each row group, even if the window contains the measurement where we looped back to k_1.

Fig. 3. The reconstruction of the low-resolution preview. The original image is measured in the transform domain, where the STOne coefficients are sub-sampled. The sub-sampled coefficients are then re-binned into a low-resolution STOne domain, b̄. Finally, the low-resolution transform is inverted to obtain the preview.

IV. STATISTICAL ANALYSIS OF THE STONE PREVIEW
Suppose we break an N × N image into an n × n array of patches. In the corresponding low-resolution preview, each pixel "represents" its corresponding patch in the full-resolution image. The question now arises: how accurately does the low-resolution preview represent the high-resolution image?

We can answer this question by looking at the statistics of the low-resolution previews. When the row selector matrix is generated at random, the pixels in the preview can be interpreted as random variables. In this case, we can show that the expected value of each pixel equals the mean of the patch that it represents. Furthermore, the variation of each pixel about the mean is no worse than if we had chosen a pixel at random as a representative from each patch. To prove this, we need the following simple lemma:

Lemma 1:
The i-th entry of S_N v can be written

$$S_N(i)\,v = \mu + e(i),$$

where µ = Mean(v), Mean(e) = 0, and Var(e) = Var(v).

Proof: We have S_N(i)v = Σ_j S_N(i,j) v(j), and so

$$\mathrm{Mean}(S_N v) = \frac{1}{N}\sum_i S_N(i)\,v = \frac{1}{N}\sum_i\sum_j S_N(i,j)\,v(j) = \frac{1}{N}\sum_j v(j)\sum_i S_N(i,j) = \frac{1}{N}\sum_j v(j) = \mu.$$

It follows that

$$\mathrm{Mean}(e) = \mathrm{Mean}(S_N v - \mu) = 0.$$

The variation about the mean is given by

$$\mathrm{Var}(e) = \mathrm{Var}(S_N v - \mu) = \mathrm{Var}(v - \mu) = \mathrm{Var}(v),$$

where we have used the unitary nature of S_N and the sum-to-one property, which together give the identity S_N µ = µ (with µ interpreted as the constant vector).

We now prove the statistical accuracy of the preview.

Theorem 3:
Suppose we create an n × n preview from an N × N image. Divide the rows of S_N into n² groups. Generate the row selector matrix by choosing an entry uniformly at random from each group. Then the expected value of each pixel in the low-resolution preview equals the mean of the corresponding image patch. The variance of each pixel about this mean equals the mean of the patch variances.

Proof:
The STOne transform has the following decomposition: S_K = S_{K−k} ⊗ S_k. Let δ = N²/n². Break the rows of S_K into n² groups, each of length δ. The r-th row block of S_K can then be written (using "Matlab notation")

$$S_K(r\delta : r\delta + \delta,\, :)\,v = S_{K-k}(r, :)\cdot\big(S_k\;\; S_k\;\; \cdots\;\; S_k\big)\,v,$$

where r = ⌊i/δ⌋ is the index of the row block of length δ from which row i is drawn, S_{K−k}(r,:) denotes the r-th row of S_{K−k}, and the transform S_k operates on a single δ × δ block of the image.

The measurement taken from the i-th block can be written

$$b(i) = S_{K-k}(i, :)\cdot\big(S_k(r_i', :)\;\; S_k(r_i', :)\;\; \cdots\;\; S_k(r_i', :)\big)\,v,$$

where r′_i = i − δr is the index of the obtained measurement relative to the start of the block. Now, note that by Lemma 1, S_k(r′_i,:) v_j = µ_j + e_j^{r′_i}, where v_j denotes the j-th block of v and µ_j its mean. We then have

$$b(i) = S_{K-k}(i,:)\,\mu + S_{K-k}(i,:)\,\big(e_1^{r_i'}\;\; e_2^{r_i'}\;\; \cdots\;\; e_{n^2}^{r_i'}\big) = S_{K-k}(i,:)\,\mu + \eta_i,$$

where η_i = S_{K−k}(i,:) (e_1^{r′_i}  e_2^{r′_i}  ⋯  e_{n²}^{r′_i}). We have

$$\mathrm{Var}(\eta_i) = \frac{1}{4^{K-k}}\sum_j \mathrm{Var}\big(e_j^{r_i'}\big) = \mathrm{Mean}_j\big(\mathrm{Var}(e_j^{r_i'})\big).$$

The reconstructed preview is then

$$u_n = S_{K-k}\,b = \mu + S_{K-k}\,\eta.$$

Because S_{K−k} is unitary and the entries of η are identically distributed, each entry of S_{K−k} η has the same variance as the entries of η, which is Mean_j(Var(e_j^{r′_i})).

V. THE 3DTV MODEL FOR RECONSTRUCTION
A. Motivation
First-generation compressive imaging exploits the compressibility of natural scenes in the spatial domain. However, it is well known that video is far more compressible than two-dimensional images. By exploiting this high level of compressibility, we can sample moving scenes at low rates without compromising reconstruction accuracy.

Rather than attempt to exploit a precise pixel-to-pixel matching between images, we propose a model that allows adjacent images to share information without the need for an exact mapping (such as that obtained by optical flow). The model, which we call 3DTV, assumes not only that images have small total variation in the spatial domain, but also that each pixel generates an intensity curve that is sparse in time.

The 3DTV model is motivated by the following observation: videos with sparse gradients in the spatial domain also have sparse gradients in time. TV-based image processing represents images with piecewise constant approximations. Assuming piecewise constant image features, the intensity of a pixel changes only when it is crossed by a boundary. As a result, applying TV in the spatial domain naturally leads to videos that have small TV in time. This is demonstrated in Figure 4.

The 3DTV model has several other advantages. First, stationary objects under stable illumination conditions produce constant pixel values, and hence are extremely sparse in the time domain. More importantly, the 3DTV model enforces "temporal consistency" of video frames; stationary objects appear the same in adjacent frames, and the flickering/distortion associated with frame-by-frame reconstruction is eliminated.
B. Video Reconstruction From Measurement Streams
In practice, multiplexing cameras acquire one compressive measurement at a time. Because data is generated continuously and the scene moves continuously in time, there is no "natural" way to break the acquired data into separate images. For this reason, it makes sense to interpret the data as a "stream": an infinite sequence of compressive measurements denoted by {d_i}, i = 1, 2, .... To reconstruct this stream, it must be artificially broken into sets, each of which forms the measurement data for a frame. Such a decomposition can be represented graphically as follows:

$$\underbrace{d_1, d_2, d_3, d_4}_{b^1},\quad \underbrace{d_5, d_6, d_7, d_8}_{b^2},\quad \underbrace{d_9, d_{10}, d_{11}, d_{12}}_{b^3},\quad \ldots$$

where b^f denotes the measurement data used to reconstruct the f-th frame. In practice, the data windows for each frame may overlap or have different widths.

Suppose we have collected enough data to reconstruct F frames, denoted {u^f}, f = 1, ..., F. The frames can then be simultaneously reconstructed using a variational problem of the form (2) with

$$u = \begin{pmatrix} u^1 \\ u^2 \\ \vdots \\ u^F \end{pmatrix}, \qquad \Phi = \begin{pmatrix} \Phi^1 & 0 & \cdots & 0 \\ 0 & \Phi^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi^F \end{pmatrix}, \tag{7}$$

where Φ^f represents the measurement operator for each individual frame.

C. Mathematical Formulation
The 3DTV operator measures the total variation of the video in both the space and time dimensions. The conventional total-variation semi-norm of a single frame u^f is the absolute sum of the gradients:

$$|\nabla u^f| = \sum_{i,j} \sqrt{\big(u^f_{i+1,j} - u^f_{i,j}\big)^2 + \big(u^f_{i,j+1} - u^f_{i,j}\big)^2}.$$

Fig. 4. Piecewise constant frames form videos that are piecewise constant in time. (top) A frame from our test video. (bottom) We choose a column of pixels from the center of our test video and plot this column over time. The vertical axis is the "y" dimension, and the horizontal axis is time. Note that the stationary background pixels are constant in time, while moving objects (such as the trucks) look similar in both the space and time domains, resulting in a video that has small derivatives in the time direction.
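The space-time extension of this semi-norm (Eq. (8) below) can be sketched numerically. The sketch below assumes forward differences with replicated borders, an isotropic spatial term plus an absolute temporal term, and a (frames, rows, cols) array layout; the function name is illustrative, not from the paper.

```python
import numpy as np

def tv3d(u):
    """3DTV semi-norm of a video u with shape (frames, rows, cols):
    the sum over all pixels of sqrt(dx^2 + dy^2) + |dt|, using forward
    differences.  The last row/column/frame is replicated before
    differencing so the boundary differences are zero."""
    dx = np.diff(u, axis=1, append=u[:, -1:, :])  # u[f, i+1, j] - u[f, i, j]
    dy = np.diff(u, axis=2, append=u[:, :, -1:])  # u[f, i, j+1] - u[f, i, j]
    dt = np.diff(u, axis=0, append=u[-1:, :, :])  # u[f+1, i, j] - u[f, i, j]
    return np.sum(np.sqrt(dx**2 + dy**2) + np.abs(dt))
```

A stationary scene contributes nothing to the temporal term, which is the sparsity the 3DTV model exploits: for a two-frame 3 × 3 video that jumps from all zeros to all ones, the penalty is exactly 9 (one unit temporal change per pixel).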
The 3DTV operator generalizes this to the space and time dimensions:

$$|\nabla u| = \sum_{i,j,f} \sqrt{\big(u^f_{i+1,j} - u^f_{i,j}\big)^2 + \big(u^f_{i,j+1} - u^f_{i,j}\big)^2} + \big|u^{f+1}_{i,j} - u^f_{i,j}\big|. \tag{8}$$

We can reconstruct an individual frame using the variational model

$$u^f = \arg\min_u\, |\nabla u| + \frac{\mu}{2}\,\|R^f S u - b^f\|^2, \tag{9}$$

where R^f and b^f denote the row selector and data for frame u^f.

The 3DTV model extends conventional TV-based compressed sensing using the operator (8). This model can be expressed in a form similar to (9) by stacking the frames into a single vector as in (7). We define combined row-selector and STOne transforms for all frames using the notation

$$R = \begin{pmatrix} R^1 & & & \\ & R^2 & & \\ & & \ddots & \\ & & & R^F \end{pmatrix}, \qquad S = \begin{pmatrix} S_N & & & \\ & S_N & & \\ & & \ddots & \\ & & & S_N \end{pmatrix}.$$

Using this notation, the 3DTV model can be expressed concisely as

$$\min_u\, |\nabla u| + \frac{\mu}{2}\,\|RSu - b\|^2. \tag{10}$$

Note that, just as in the single-frame case, S is an orthogonal matrix and R is diagonal.

VI. NUMERICAL METHODS FOR COMPRESSIVE RECOVERY
In this section, we discuss efficient numerical methods for image recovery. There are many splitting methods capable of solving (10); however, not all of them can exploit the unique mathematical structure of the problem. In particular, some methods require the solution of large linear systems using conjugate gradient sub-steps, which can be inefficient. We focus on the Primal-Dual Hybrid Gradient (PDHG) method. Because the STOne transform is self-adjoint, every step of the PDHG scheme can be written explicitly, making this type of solver efficient.
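To illustrate why self-adjointness makes every step explicit, here is a sketch of the quadratic data-fidelity step (it appears as Eq. (12) below), using a small symmetric orthogonal Hadamard matrix as a stand-in for the STOne transform. The function and variable names are ours, not the paper's.

```python
import numpy as np

def quad_step(S, r, b, u_hat, mu, tau):
    """Minimize (mu/2)||R S u - b||^2 + 1/(2 tau)||u - u_hat||^2 in
    closed form.  S is symmetric orthogonal (S @ S = I) and r is the
    0/1 diagonal of the row selector R, so the normal equations
    S (mu R + I/tau) S u = mu S R b + u_hat / tau reduce to a
    diagonal solve sandwiched between applications of S."""
    rhs = mu * (S @ (r * b)) + u_hat / tau   # mu * S * R * b + u_hat / tau
    diag = mu * r + 1.0 / tau                # diagonal of (mu R + I/tau)
    return S @ ((S @ rhs) / diag)

# Symmetric orthogonal 4x4 stand-in (normalized Hadamard matrix)
H = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]], dtype=float) / 2.0
```

With S replaced by the fast STOne transform (and S R b precomputed), each such step costs only fast transforms plus element-wise work, which is the efficiency claimed above.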
A. PDHG
Primal-Dual Hybrid Gradient (PDHG) is a scheme for solving minimax problems of the form

$$\max_y \min_x\, f(x) + \langle Ax, y\rangle - g(y),$$

where f and g are convex functions and A ∈ R^{m×n} is a matrix. The algorithm was first introduced in [15], and later in [16]. Rigorous convergence results are presented in [17]. Practical implementations of the method are discussed in [18].

The scheme treats the terms f and g separately, which allows the individual structure of each term to be exploited. The PDHG method in its simplest form is listed in Algorithm 3.

Algorithm 3
PDHG
Require: x⁰ ∈ R^n, y⁰ ∈ R^m, τ > 0, σ > 0
1: for k = 0, 1, 2, ... do
2:   x̂^{k+1} = x^k − τ Aᵀ y^k
3:   x^{k+1} = argmin_x f(x) + (1/2τ) ‖x − x̂^{k+1}‖²
4:   x̄^{k+1} = x^{k+1} + (x^{k+1} − x^k)
5:   ŷ^{k+1} = y^k + σ A x̄^{k+1}
6:   y^{k+1} = argmax_y −g(y) − (1/2σ) ‖y − ŷ^{k+1}‖²
7: end for

Algorithm 3 can be interpreted as alternately minimizing over x and then maximizing over y using a forward-backward technique. These minimization/maximization steps are controlled by two stepsize parameters, τ and σ. The method converges as long as the stepsizes satisfy τσ < 1/‖AᵀA‖. However, the choice of τ and σ greatly affects the convergence rate. For this reason, we use the adaptive variant of PDHG presented in [18], which automatically tunes these parameters to optimize convergence for each problem instance.

B. PDHG for Compressive Video
In this section, we customize PDHG to solve (10). We begin by noting that

$$|\nabla u| = \max_{p \in C}\, p \cdot \nabla u,$$

where C = {p : −1 ≤ p_i ≤ 1} denotes the ℓ∞ unit ball. Using this principle, we can write (10) as the saddle-point problem

$$\max_p \min_u\, p \cdot \nabla u + \frac{\mu}{2}\,\|RSu - b\|^2 - \chi_C(p), \tag{11}$$

where χ_C(p) denotes the characteristic function of the set C, which is infinite for values of p outside of C and zero otherwise.

Algorithm 4
PDHG Compressive Reconstruction
Require: u⁰ ∈ R^{N²F}, p⁰, τ > 0, σ > 0
1: for k = 0, 1, 2, ... do
2:   û^{k+1} = u^k − τ ∇ᵀ p^k
3:   u^{k+1} = argmin_u (µ/2) ‖RSu − b‖² + (1/2τ) ‖u − û^{k+1}‖²
4:   ū^{k+1} = u^{k+1} + (u^{k+1} − u^k)
5:   p̂^{k+1} = p^k + σ ∇ ū^{k+1}
6:   p^{k+1} = argmax_p −χ_C(p) − (1/2σ) ‖p − p̂^{k+1}‖²
7: end for

Note that only steps 3 and 6 of Algorithm 4 are implicit. The advantage of the PDHG approach (as opposed to, e.g., the Alternating Direction Method of Multipliers) is that all implicit steps have simple analytical solutions.

Step 6 of Algorithm 4 is simply the projection of p̂^{k+1} onto the ℓ∞ unit ball. This projection is given by

$$p^{k+1} = \max\{\min\{\hat p^{k+1}, 1\}, -1\},$$

where max{·} and min{·} denote the element-wise maximum and minimum operators.

Step 3 of Algorithm 4 is the quadratic minimization problem

$$u^{k+1} = \arg\min_u\, \frac{\mu}{2}\,\|RSu - b\|^2 + \frac{1}{2\tau}\,\|u - \hat u^{k+1}\|^2.$$

The optimality condition for this problem is

$$\mu S^T R^T (RSu^{k+1} - b) + \frac{1}{\tau}\,(u^{k+1} - \hat u^{k+1}) = 0,$$

which simplifies to

$$\Big(\mu S^T R^T R S + \frac{1}{\tau} I\Big)\, u^{k+1} = \mu S^T R^T b + \frac{1}{\tau}\,\hat u^{k+1}.$$

If we note that S and R are symmetric, that R² = R, and that I = S² (because S is symmetric and orthogonal), we get

$$S\Big(\mu R + \frac{1}{\tau} I\Big) S\, u^{k+1} = \mu S R b + \frac{1}{\tau}\,\hat u^{k+1}.$$

Since (µR + (1/τ)I) is an easily invertible diagonal matrix, we can now write the solution to the quadratic program explicitly as

$$u^{k+1} = S\Big(\mu R + \frac{1}{\tau} I\Big)^{-1} S\Big(\mu S R b + \frac{1}{\tau}\,\hat u^{k+1}\Big). \tag{12}$$

Note that, because SRb can be precomputed, (12) can be evaluated using only 2 fast STOne transforms per iteration.

VII. APPLICATIONS
A. Fast Object Recognition and the Smashed Filter
Numerous methods have been proposed for performing semantic image analysis in the compressive domain. Various semantic tasks have been considered, including object recognition [19], human pose detection [20], background subtraction [21], and activity recognition [22]. These methods ascertain the content of images without a full compressive reconstruction. Because these methods do not have access to an actual image representation, they frequently suffer from lower accuracy than classifiers applied directly to image data, and can themselves be computationally burdensome.

When sensing is done in an MSS basis, compressive data can be quickly transformed into the image domain for semantic analysis, rather than being processed in the compressive domain. This not only allows for high-accuracy classifiers, but also can significantly reduce the computational burden of analysis.

To demonstrate this, we consider the case of object recognition using the "smashed filter" [19]. This simple filter compares compressive measurements to a catalog of hypothesis images and attempts to find matches. Let {H_i} denote the set of hypothesis images, and H_i(x) denote an image translated (shifted in the horizontal and vertical directions) by x ∈ R². Suppose we observe a scene s by obtaining compressive data of the form b = Φs. The smashed filter analyzes a scene by evaluating

$$\min_{x, H_i}\, \|\Phi s - \Phi H_i(x)\|^2 = \|b - \Phi H_i(x)\|^2. \tag{13}$$

In plain words, the smashed filter compares the measurement vector against the transforms of every hypothesis image in every possible pose until a good ℓ² match is found.

The primary drawback of this simple filter is computational cost. The objective function (13) must be explicitly evaluated for every pose of each test image.
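As the derivation below shows, matching in the preview domain turns the translation search into an image-domain correlation, which FFTs can evaluate for every shift at once. A sketch, assuming cyclic (wrap-around) shifts; the function name is illustrative.

```python
import numpy as np

def match_all_shifts(s, h):
    """Return an array d with d[x] = ||s - shift(h, x)||^2 for every
    cyclic 2-D shift x, computed all at once via the expansion
    ||s - h(x)||^2 = ||s||^2 + ||h||^2 - 2 <s, h(x)>, where the inner
    products over all shifts form a circular cross-correlation."""
    cross = np.real(np.fft.ifft2(np.fft.fft2(s) * np.conj(np.fft.fft2(h))))
    return np.sum(s**2) + np.sum(h**2) - 2.0 * cross
```

For an n × n preview this costs O(n² log n) per hypothesis image, instead of one full image comparison per candidate shift.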
The authors of [19] suggest limiting this search to horizontal and vertical shifts of no more than 16 pixels in order to keep the runtime within reason.

The complexity and power of such a search can be dramatically improved using MSS measurements. Suppose we obtain measurements of the form b = R S_N s. We can then re-bin the measurements using Theorem 2 and write them as b̄ = S_n s̄, where b̄ is a dense vector of measurements and s̄ is a lower-resolution preview of the scene. Applying Theorem 2 to the matching function (13) yields

$$\min_{x, H_i}\, \|R S_N s - R S_N H_i(x)\|^2 = \|S_n \bar s - S_n \bar H_i(x)\|^2 = \|\bar s - \bar H_i(x)\|^2, \tag{14}$$

where H̄_i is the representation of image H_i at resolution n.

The expression on the right of (14) shows that applying the smashed filter in the compressive domain is equivalent to performing classification in the image domain using the preview s̄! Furthermore, the values of ‖s̄ − H̄_i(x)‖² for all possible translations can be computed in O(N log N) time using a Fourier-transform convolution algorithm, which is a substantial complexity reduction over the original method.

Fig. 5. Direct multiscale reconstruction from under-sampled transform coefficients. (top left) The original 1024 × 1024 image. (top right) The image contaminated with Gaussian noise (SNR = 20 dB). (center) Direct reconstruction of the noiseless measurements using under-sampled data at 256 × 256, 128 × 128, and 64 × 64 resolutions. (bottom) Direct reconstruction of the noisy data.

B. Enhanced Video Reconstruction Schemes
Several authors propose compressive video reconstruction schemes that benefit from preview reconstruction. The basic concept of these methods is to use an initial low-quality reconstruction to obtain motion estimates. These motion estimates are then used to build sparsity models that act in the time domain and are used for the final video reconstruction. Such methods include the results of Wakin [10] and Sankaranarayanan [11], which rely on optical flow mapping, as well as the iterative method of Asif [12].

The methods proposed in [10] and [11] use a dual-scale sensing basis that allows for the reconstruction of previews. However, unlike the MSS framework proposed here, these matrices do not admit a fast transform; the matrices proposed in [11], for example, require O(N²) operations to perform a complete transform. The proposed MSS matrices open the door for a variety of preview-based video reconstruction methods using fast O(N log N) transforms.

Fig. 6. Reconstruction of high-speed video from under-sampled data. A frame from the original full-resolution video is displayed on the left. A preview generated from a stream with 6.25% sampling is also shown. Compressive reconstructions using 1% and 6.25% sampling are shown on the right.

VIII. NUMERICAL EXPERIMENTS
A. Preview Generation
We demonstrate preview generation using a simple test image. The original image dimensions are 1024 × 1024, and the image is contaminated with Gaussian noise (SNR = 20 dB). Reconstructions are performed at 256 × 256, 128 × 128, and 64 × 64 resolutions. Results are shown in Figure 5.

This example demonstrates the flexibility of MSS sensing: the preview resolution can be adapted to the number of available measurements. The 64 × 64 reconstructions are obtained using only 4096 out of 1,048,576 measurements (less than 0.4% sampling).

B. Simulated Video Reconstruction
To test our new image reconstruction framework, we use a test video provided by Mitsubishi Electric Research Laboratories (MERL). The video was acquired using a high-speed camera operating at 200 frames per second at a resolution of 256 × 256 pixels. A measurement stream was generated by applying the STOne transform to each frame and then selecting coefficients from this transform. Coefficients were sampled from the video at a rate of 30 kHz, which is comparable to the operating rate of currently available DMDs. The coefficients were selected in the order generated by Algorithm 2, and pixels were mapped into a vector with nested-dissection ordering. Thus, previews are available at any power-of-two resolution and can be computed using a simple inverse STOne transform. At the same time, the data acquired are appropriate for compressive reconstruction.

The goal is to reconstruct 20 frames of video from under-sampled measurements. We measure the sampling rate as the percentage of measured coefficients per frame; at a sampling rate of p (expressed as a fraction), the total number of samples is 20 × 256 × 256 × p. Results of compressive reconstructions at two different sampling rates are shown in Figure 6.

Note that non-compressive reconstructions require a lot of data (one measurement per pixel), and are therefore subject to the motion-aliasing problems described in [9]. By using the compressive reconstruction, we obtain high-resolution images from a small amount of data, yielding much better temporal resolution and thus less aliasing. By avoiding the motion aliasing predicted by the classical analysis, the compressive reconstruction "beats" the Nyquist sampling limit.

Figure 7 considers 3 different reconstructions at the same sampling rate. First, a full-resolution reconstruction is created using a complete set of STOne samples. Because a long measurement interval is needed to acquire this much data, motion aliasing is severe. When a preview is constructed using a smaller number of samples and large pixels, motion aliasing is eliminated at the cost of resolution loss. When the compressive/iterative reconstruction is used, the same under-sampled measurements reveal high-resolution details.
C. Single-Pixel Video
To demonstrate the STOne transform using real data, we obtained measurements using a laboratory setup. The Rice single-pixel camera [1] is depicted in Figure 8. The image of the target scene is focused by a lens onto a DMD. STOne transform patterns were loaded one at a time onto the DMD in an order determined by Algorithm 2. The DMD removed pixels with STOne coefficient −1 and reflected pixels with coefficient +1 towards a mirror. A focusing lens then converged the selected pixels onto a single detector, which generates a measurement.

Fig. 7. Detailed views of reconstructions. On the left is a frame from the original video showing the part of the image we will be using for comparisons. A full-resolution reconstruction using a complete set of transform coefficients is shown beside it. Motion aliasing is present because of object motion over a long data-acquisition time. A preview from 5% sampling is also shown. Motion aliasing has been dramatically reduced by shortening the data-acquisition time at the cost of lower resolution. On the right is displayed a compressive reconstruction using the same 5% sampling as the preview. This reconstruction simultaneously achieves high resolution and fast acquisition.

Fig. 8. Schematic drawing and table-top photo of a single-pixel camera.

Data was generated from two scenes at two different resolutions. The "car" scene was acquired using 256 × 256 STOne patterns, and the "hand" scene was acquired at a different resolution. For both scenes, previews were reconstructed at 64 × 64 resolution. Full-resolution compressive reconstructions were also performed. Frames from the resulting reconstructions are displayed in Figure 9. Both the previews and compressive reconstructions were generated using the same measurements. Note the higher degree of detail visible in the compressive reconstructions.

IX. CONCLUSION
Compressed sensing creates dramatic tradeoffs between reconstruction accuracy and reconstruction time. While compressive schemes allow high-resolution reconstructions from under-sampled data, the computational burden of these methods prevents their use on portable embedded devices. The STOne transform enables immediate reconstruction of compressive data at Nyquist rates. The same data can then be "enhanced" using compressive schemes that leverage sparsity to "beat" the Nyquist limit.

The multi-resolution capabilities of the STOne transform are paramount for video applications, where data resources are bounded by time constraints. The limited sampling rate of compressive devices leads to "smearing" and motion aliasing when sampling high-resolution images at Nyquist rates. We are left with two options: either slash resolution to decrease data requirements, or use compressive methods that prohibit real-time reconstruction. The STOne transform offers the best of both worlds: immediate online reconstructions with high-resolution compressive enhancement.

ACKNOWLEDGMENTS

The authors would like to thank Yun Li for his help in the lab, and Christoph Studer for many useful discussions. This work was supported by the Intelligence Community (IC) Postdoctoral Research Fellowship Program.

REFERENCES
Fig. 9. Video reconstructions from the single-pixel camera. (A/C) High-resolution compressive reconstructions of the "car" and "hand" scenes. (B/D) The corresponding 64 × 64 previews.

[1] M. Duarte, M. Davenport, D. Takhar, J. Laska, T. Sun, K. Kelly, and R. Baraniuk, "Single-pixel imaging via compressive sampling: Building simpler, smaller, and less-expensive digital cameras," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 83–91, 2008.
[2] M. Gupta, A. Agrawal, A. Veeraraghavan, and S. G. Narasimhan, "Flexible voxels for motion-aware videography," in Proc. 11th European Conference on Computer Vision (ECCV '10), Part I, pp. 100–114, Springer-Verlag, 2010.
[3] in IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), pp. 329–336, 2011.
[4] R. F. Marcia, Z. T. Harmany, and R. M. Willett, "Compressive coded aperture imaging," in Proc. SPIE, p. 72460, 2009.
[5] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inform. Theory, vol. 52, pp. 489–509, 2006.
[6] D. Donoho, "Compressed sensing," IEEE Trans. Inform. Theory, vol. 52, pp. 1289–1306, 2006.
[7] E. Candès and J. Romberg, "Sparsity and incoherence in compressive sampling," Inverse Problems, vol. 23, no. 3, p. 969, 2007.
[8] W. Bajwa, A. Sayeed, and R. Nowak, "A restricted isometry property for structurally-subsampled unitary matrices," in Allerton Conference on Communication, Control, and Computing, pp. 1005–1012, 2009.
[9] M. Wakin, "A study of the temporal bandwidth of video and its implications in compressive sensing," Tech. Rep. 2012-08-15, Colorado School of Mines, 2012.
[10] J. Y. Park and M. B. Wakin, "Multiscale algorithm for reconstructing videos from streaming compressive measurements," Journal of Electronic Imaging, vol. 22, no. 2, 2013.
[11] A. Sankaranarayanan, C. Studer, and R. Baraniuk, "CS-MUVI: Video compressive sensing for spatial-multiplexing cameras," in IEEE International Conference on Computational Photography (ICCP), pp. 1–10, 2012.
[12] M. S. Asif, L. Hamilton, M. Brummer, and J. Romberg, "Motion-adaptive spatio-temporal regularization (MASTeR) for accelerated dynamic MRI," Magnetic Resonance in Medicine, 2013.
[13] Y. Saad, Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003.
[14] J. A. George, "Nested dissection of a regular finite element mesh," SIAM Journal on Numerical Analysis, vol. 10, no. 2, pp. 345–363, 1973.
[15] E. Esser, X. Zhang, and T. Chan, "A general framework for a class of first order primal-dual algorithms for TV minimization," UCLA CAM Report 09-67, 2009.
[16] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 1–49, 2010.
[17] B. He and X. Yuan, "Convergence analysis of primal-dual algorithms for a saddle-point problem: From contraction perspective," SIAM Journal on Imaging Sciences, vol. 5, pp. 119–149, Jan. 2012.
[18] T. Goldstein, E. Esser, and R. Baraniuk, "Adaptive primal-dual hybrid gradient methods for saddle-point problems," arXiv:1305.0546, 2013.
[19] M. Davenport, M. Duarte, M. Wakin, J. Laska, D. Takhar, K. Kelly, and R. Baraniuk, "The smashed filter for compressive classification and target recognition," in Proc. SPIE Symposium on Electronic Imaging: Computational Imaging, vol. 6498, 2007.
[20] K. Kulkarni and P. Turaga, "Recurrence textures for human activity recognition from compressive cameras," in IEEE International Conference on Image Processing (ICIP), pp. 1417–1420, 2012.
[21] V. Cevher, A. Sankaranarayanan, M. F. Duarte, D. Reddy, and R. G. Baraniuk, "Compressive sensing for background subtraction," in European Conference on Computer Vision (ECCV), pp. 155–168, 2008.
[22] O. Concha, R. Xu, and M. Piccardi, "Compressive sensing of time series for human action recognition."