An Algorithmic Theory of Dependent Regularizers, Part 1: Submodular Structure
Hoyt Koepke, Department of Statistics, University of Washington, Seattle, WA 98105 ([email protected])
Marina Meila, Department of Statistics, University of Washington, Seattle, WA 98105 ([email protected])
October 2, 2018
Abstract
We present an exploration of the rich theoretical connections between several classes of regularized models, network flows, and recent results in submodular function theory. This work unifies key aspects of these problems under a common theory, leading to novel methods for working with several important models of interest in statistics, machine learning and computer vision.

In Part 1, we review the concepts of network flows and submodular function optimization theory foundational to our results. We then examine the connections between network flows and the minimum-norm algorithm from submodular optimization, extending and improving several current results. This leads to a concise representation of the structure of a large class of pairwise regularized models important in machine learning, statistics and computer vision.

In Part 2, we describe the full regularization path of a class of penalized regression problems with dependent variables that includes the graph-guided LASSO and total variation constrained models. This description also motivates a practical algorithm, which allows us to efficiently find the regularization path of the discretized version of TV penalized models. Ultimately, our new algorithms scale up to high-dimensional problems with millions of variables.

1 Introduction
High-dimensional data is a central focus of modern research in statistics and machine learning. Recent technological advances in a variety of scientific fields for gathering and generating data, matched by rapidly increasing computing power for analysis, have attracted significant research into the statistical questions surrounding structured data. Numerous models and computational techniques have emerged recently to work with these data types.

Our primary contribution is a theoretical framework enabling the efficient optimization of models for high-dimensional data in which the predictors are believed to be dependent, correlated, or sparse. In particular, we propose a new approach for the estimation of certain types of dependent predictors, in which the model incorporates a prior belief that many of the predictors take similar or identical values. This has been an active topic of research in recent years, with applications in genomics, image analysis, graphical models, and several other areas. However, efficient estimation involving structured and dependent predictors has proven to be quite challenging. Our main contribution, which includes a number of related theoretical results, is a theoretical framework and algorithmic approach that unlocks a large and particularly thorny class of these models.

After laying out the context for our research in the next section, we outline our main results as well as the needed background in section 3. Our contributions build on recent results in optimization theory and combinatorial optimization, which we describe in detail. In section 4, we present a clear theoretical connection between submodular optimization theory, network flows, and the proximal operator in regularized models such as the graph-guided LASSO. Then, in section 5, we extend the underlying submodular optimization theory to allow a weighted version of the underlying submodular problem. In the context of regularized regression, this allows the use of additional unary terms in the regularizer.
To begin, consider a simple regression model in which we have N observations y = (y_1, y_2, ..., y_N) and n predictors u = (u_1, u_2, ..., u_n), with

    y_i = (Au)_i + ε_i,   i = 1, 2, ..., N,    (1)

where ε_1, ε_2, ..., ε_N are independent random noise terms with E[ε_i] = 0; typically, these are assumed to be i.i.d. Gaussian. In many high-dimensional contexts, the estimation of û may be problematic using classical methods. For example, n may be much larger than N, making the problem ill-posed as many possible values of u map to the same response. (n and N are used here instead of the more common p and n to be consistent with the optimization literature we connect this problem to.) Additionally, many predictors of interest may have a negligible or nearly identical effect on y; eliminating or grouping these predictors is then desirable. To handle this, a number of sophisticated approaches have been proposed to incorporate variable selection or aggregation into the statistical estimation problem.

One common and well-studied approach is to add a penalty term, or regularizer, to the log-likelihood that enforces some prior belief about the structure of u [Bickel et al., 2006, Hastie et al., 2009]. In this setup, estimating the predictor u involves finding the minimizer of a log-likelihood term or loss function L plus a regularization term Φ:

    û = argmin_{u ∈ R^n} L(u, y) + λ Φ(u),    (2)

where λ controls the strength of the regularization λ Φ(u).

Many well-studied and frequently used models fall into this context. For example, if L is the log-likelihood from a multivariate Gaussian distribution, one of the oldest regularization techniques is the L2-norm, which gives us the classic technique of ridge regression [Hoerl and Kennard, 1970], where

    û_ridge = argmin_{u ∈ R^n} ||y − Au||_2^2 + λ ||u||_2^2.    (3)

In this example, the regularization term is typically used as an effective way of dealing with an ill-conditioned inverse problem in the least squares context or as a simple way of preventing u from over-fitting the model [Bishop, 2006].

In the Bayesian context, (2) often corresponds directly to finding the maximum a posteriori estimate in the classic likelihood-prior formulation, i.e.

    p(u | y) ∝ p(y | u) p(u; λ, θ)    (4)
             ∝ e^{−L(u, y)} × e^{−λ Φ_θ(u)},    (5)

where λ and θ are hyperparameters controlling the behavior of the prior distribution. In this formulation, the prior captures the belief encoded by the regularization term. For example, the ridge regression problem of (3) corresponds to using a standard multivariate Gaussian likelihood but assumes a 0-mean spherical Gaussian prior distribution over u.

More recently, the Least Absolute Shrinkage and Selection Operator (LASSO) has been used frequently to promote sparsity in the resulting estimator [Tibshirani, 1996, Hastie et al., 2005, 2009]. This approach uses the L1-norm as the regularization term, i.e.

    û_lasso = argmin_{u ∈ R^n} ||y − Au||_2^2 + λ ||u||_1.    (6)

The LASSO problem has been analyzed in detail as a method for simultaneous estimation and variable selection, as the L1 penalty tends to set many of the predictors exactly to 0. The consistency and theoretical properties of this model, for both estimation and variable selection, are well studied [Meinshausen and Yu, 2009, Bunea et al., 2007, 2006, Van De Geer, 2008, van de Geer, 2007, Bickel et al., 2006]. Many variants of this problem have also been proposed and analyzed.
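To make the contrast between (3) and (6) concrete, consider the orthonormal-design case A^T A = I, in which both estimators act coordinate-wise on the least-squares solution: ridge rescales each coordinate, while the LASSO soft-thresholds it, setting small coordinates exactly to zero. The following sketch is ours, not code from the paper, and uses the common convention argmin_u (1/2)(u − z)^2 + t·penalty(u); the variable names are illustrative only.

```python
import numpy as np

# Coordinate-wise comparison of ridge and LASSO shrinkage in the orthonormal-design
# case (A^T A = I), where both estimators act separately on each least-squares
# coordinate z_i.

def ridge_shrink(z, t):
    # argmin_u (1/2)(u - z)^2 + t*u^2  =  z / (1 + 2t): proportional shrinkage.
    return z / (1.0 + 2.0 * t)

def soft_threshold(z, t):
    # argmin_u (1/2)(u - z)^2 + t*|u|  =  sign(z) * max(|z| - t, 0): exact zeros.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, 0.4, -0.2, -2.5, 0.05])   # least-squares coordinates
t = 0.5
print(ridge_shrink(z, t))     # every coordinate shrunk, none exactly zero
print(soft_threshold(z, t))   # small coordinates set exactly to zero
```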
These variants include the elastic net, in which Φ(u) = α ||u||_1 + (1 − α) ||u||_2^2 with α ∈ [0, 1], interpolating between the LASSO and ridge penalties.

While there are other possible regularization strategies on individual parameters, of most interest to us is the incorporation of correlation structures and dependencies among the predictors into the prior distribution or regularization term. In many cases, this translates into penalizing the differences of predictors. This has recently been attracting significant algorithmic and theoretical interest as a way to effectively handle dependency structures in the data.

In our particular context, we are interested in the MAP estimate of models with a Markov Random Field prior. The resulting log-linear models consist of a collection of unary and pairwise terms that capture the dependency structure of the problem. In particular, the priors we are interested in enforce similarity between neighboring variables, where "neighbor" is defined according to the graph structure of the MRF prior. The type of dependency we are looking at is determined by the form of these pairwise terms.

The simplest form of model dependency in the parameters is captured by the
Fused LASSO problem [Tibshirani et al., 2005], which penalizes pairwise differences between terms in an ordered problem:

    u*_fused = argmin_{u ∈ R^n} ||Au − y||_2^2 + λ [ w_1 ||u||_1 + w_2 Σ_{i=1}^{n−1} |u_{i+1} − u_i| ],    (7)

where w_1, w_2 ≥ 0 weight, respectively, the L1-norm controlling variable sparsity and the sum of ordered differences that effectively penalizes changepoints. This problem has gained some recent attention in the statistics community in the context of non-parametric regression [Dümbgen and Kovac, 2009, Cho and Fryzlewicz, 2011, Davies and Meise, 2008], group sparsity [Tibshirani et al., 2005, Bleakley and Vert, 2011], and change-point detection [Bleakley and Vert, 2011]. From the optimization perspective, several algorithms to find the solution to this problem have been proposed; we refer the reader to Liu et al. [2010], Ye and Xie [2011], Bach et al. [2012] or Friedman et al. [2007] for discussions of this particular problem.

The generalized problem, in which the pairwise terms are not required to be ordered, is of significant practical interest in both the statistics and machine learning communities. Here, the pairwise interactions are controlled with an arbitrary graph of weighted difference penalties:

    u*_graph = argmin_{u ∈ R^n} ||y − Au||_2^2 + λ [ w_0 ||u||_1 + Σ_{1 ≤ i < j ≤ n} w_ij |u_i − u_j| ].    (8)

[Figure 1: An example of Total Variation Minimization for noise removal on Leonardo da Vinci's Mona Lisa. Panels: (a) the original image; (b) and (c) TV-minimized images at two values of the regularization parameter λ'. Different values of the regularization parameter produce different results, with the higher value of λ' smoothing the image less but removing less of the noise.]

Total variation (TV) regularization extends these ideas to the continuous setting: we approximate an observed function f : R^d ⊃ Ω → R by a function u, under a constraint or penalty on the total norm of the gradient of u:

    u* = argmin_{u ∈ F(Ω)} ||u − f||_2^2 + λ TV(u)    (12)
       = argmin_{u ∈ F(Ω)} ∫_Ω (u(x) − f(x))^2 dx + λ TV(u).    (13)

This model – often called the Rudin-Osher-Fatemi (ROF) model after the authors – was originally proposed in Rudin et al. [1992] as a way of removing noise from images corrupted by Gaussian white noise; however, it has gained significant attention in a variety of other areas.

In the image analysis community, it has been used often as a model for noise removal, with the solution of (10) being the estimation of the noise-free image. In addition, the value of the total variation term in the optimal solution is often used as a method of edge detection, as the locations of the non-zero gradients tend to track regions of significant change in the image.

Beyond image analysis, a number of recent models have used total variation regularization as a general way to enforce similar regions of the solution to have a common value. In this way, it is a generalization of the behavior of the Fused LASSO in one dimension; it tends to set regions of the image to constant values. As such, it is used in MRI reconstruction, electron tomography [Goris et al., 2012, Gao et al., 2010], MRI modeling [Michel et al., 2011, Keeling et al., 2012], CT reconstruction [Tian et al., 2011], and general ill-posed problems [Bardsley and Luttman, 2009].

The total variation regularizer tends to have a smoothing effect around regions of transition, while also setting similar regions to a constant value.
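In the discretized setting used throughout this paper, the (anisotropic) total variation of an image is simply the sum of absolute differences between neighboring pixels, i.e. a special case of the pairwise penalty in (8) on a grid graph. A minimal sketch of evaluating the discrete ROF objective (13) for a candidate image follows; this is our illustration, not the paper's code.

```python
import numpy as np

def discrete_tv(u):
    # Anisotropic total variation on a 2D grid: sum of absolute differences
    # between vertically and horizontally adjacent pixels.
    return np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()

def rof_objective(u, f, lam):
    # Discretized Rudin-Osher-Fatemi objective: squared data fit plus lam * TV(u).
    return ((u - f) ** 2).sum() + lam * discrete_tv(u)

# Toy example: a noisy piecewise-constant image.
rng = np.random.default_rng(0)
f_clean = np.zeros((32, 32)); f_clean[:, 16:] = 1.0
f_noisy = f_clean + 0.1 * rng.standard_normal(f_clean.shape)

print(rof_objective(f_noisy, f_noisy, lam=0.5))  # zero fit term, large TV term
print(rof_objective(f_clean, f_noisy, lam=0.5))  # the clean image typically scores lower
```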
As such, it has proven to be quite effective forremoving noise from images and finding the boundaries of sufficiently distinct regions. The locationswhere the norm of the gradient TV( u ) is nonzero correspond to boundaries of the observed processin which the change is the greatest. A rigorous treatment of the theoretical aspects surrounding6sing this as a regularization term can be found in Bellettini, Caselles, and Novaga [2002], Ring[2000] and Chan and Esedoglu [2005]. A complete and treatment of this topic in practice can befound in a number of surveys, including Darbon and Sigelle [2006], Allard [2007, 2008, 2009]; inparticular, see Caselles, Chambolle, and Novaga [2011] and Chambolle, Caselles, Cremers, Novaga,and Pock [2010]. As our purpose in this work is to present a treatment of the underlying theoreticalstructures, we refer the reader to one of the above references for use practical image analysis. Still,several points are of statistical interest, which we discuss now.Several generalizations of the traditional Rudin-Osher-Fatimi model have been proposed for dif-ferent models of the noise and assumptions about the boundaries of the images. Most of these involvechanges to the loss or log-likelihood term but preserve the total variation term as the regularizer.In general, we can treat u ∗ as the estimation of u ∗ = argmin u ∈F (Ω) L ( u, f ) + λ TV( u ) (14)where L is a smooth convex loss function.One option is to use L loss for L , namely L = (cid:107) f − u (cid:107) , (15)as a way of making the noise more robust to outliers. This model is explored in Bect et al. [2004],Chan and Esedoglu [2005] and the survey papers above. These perform better under some types ofnoise and have also become popular [Goldfarb and Yin, 2009].Similarly, many models for image denoising – often when the image comes from noisy sensingprocesses in physics, medical imaging, or astronomy – involve using a Poisson likelihood for theobservations. Here, we wish to recover a density function where we observe Poisson counts in anumber of cells with rate proportional to the true density of the underlying process. Total variationregularization can be employed to deal with low count rates. Here, total variation regularization isused to recover the major structures in the data. In this case, the optimal loss function is given by L ( u, f ) = ˆ Ω ( u ( x ) − f ( x ) log u ( x ))d x , u ≥ . (16)Research on this, including optimization techniques, can be found in Bardsley and Luttman [2009],Bardsley [2008], Sawatzky et al. [2009].Also of interest in the statistics community, Wunderli [2013] replaces the squared-norm fit penaltyin (13) with the quantile regression penalty from Koenker and Bassett Jr [1978], where loss functionis replaced with L ( u, f ) = | f ( x ) − u ( x ) | × (cid:26) − β f ( x ) ≥ u ( x ) β f ( x ) < u ( x ) (cid:27) . (17)for which the author proves the existence, uniqueness, and stability of the solution.One of the variations of total variation minimization is the Mumford-Shah model. In this model,used primarily for segmenting images, regions of discontinuity are handled explicitly Mumford andShah [1989]. In this model, the energy which is minimized is E ( u, Γ) = (cid:107) f − u (cid:107) + ˆ Ω \ Γ (cid:107)∇ u (cid:107) d µ + v (cid:107) Γ (cid:107) (18)where Γ is a collection of curves and (cid:107) Γ (cid:107) is the total length of these curves. Thus the function isallowed to be discontinuous on Γ; these are then taken to be the segmentations of the image. 
The Mumford-Shah model has also attracted a lot of attention in the vision community as a way of finding boundaries in the image [Chan and Vese, 2000, Pock et al., 2009, El Zehiry et al., 2007]. The Bayesian interpretation is as a type of mixture model in which the mean function and noise variance differ between regions [Brox and Cremers, 2007]. Unlike general TV regularization, however, it is typically focused exclusively on finding curves for image segmentation, rather than on estimating the original image. While it shares a number of close parallels to our problem, particularly in the graph cut optimizations used [El Zehiry et al., 2007], our method does not appear to generalize to this problem.

The optimization of the total variation problem has garnered a lot of attention as well, and an enormous number of approaches have been proposed. These largely fall into the category of graph based algorithms, active contour methods, or techniques based around partial differential equations. These are discussed in more depth in Part 2 of this paper, where we mention them in introducing our approach to the problem.

The primary purpose of our work is to present a theory of the underlying structure connecting the optimization of the above models over dependent variables and recent results in combinatorial optimization, particularly submodular function minimization. Our theory connects and makes explicit a number of connections between known results in these fields, many of which we extend in practically relevant ways. The insights gained from our theory motivate a family of novel algorithms for working with these models. Furthermore, we are able to give a complete description of the structure of the regularization path; this result is also completely novel.

One primary practical contribution is the development of methods to efficiently optimize functions of the following form, which we denote as (ℜB):

    u*(λ) = argmin_{u ∈ R^n} ||u − a||_2^2 + λ Σ_i ξ_i(u_i) + Σ_{i,j} w_ij |u_i − u_j|,    (ℜB)

with a ∈ R^n, where λ is a non-negative regularization parameter, w_ij ≥ 0, and each ξ_i(u_i) is a convex piecewise-linear function, possibly the common L1 penalty ξ_i(u_i) = |u_i|.

We develop a novel algorithm for (ℜB) based on network flows and submodular optimization theory, and we also completely solve the regularization path over λ > 0. The crux of the idea is to construct a collection of graph-based binary partitioning problems indexed by a continuous parameter. With our construction, we show that the points at which each node changes partition membership exactly give the solution to (ℜB). We develop algorithms for this that scale easily to millions of variables, and show that the above construction also unlocks the nature of the regularization path.

These functions arise in two active areas of interest to the statistics and machine learning communities. The first is the optimization of penalized regression functions, where we wish to find

    u*(λ) = argmin_{u ∈ R^n} L(u, y) + λ Σ_i ξ_i(u_i) + Σ_{i,j} w_ij |u_i − u_j|,    (ℜL)

where y is a collection of observed responses and L(u, y) is a smooth, convex log-likelihood term. Throughout our work, we denote this problem as (ℜL). In the general case, we handle this problem through the use of proximal gradient methods, a variant of sub-gradient methods, described below.
The inner routine involves solving ( (cid:60) B ).8inally, ( (cid:60) B ) also extends to the continuous variational estimation problem of Total Variationminimization from section 2.2. where we wish to minimize u ∗ = argmin F (Ω) (cid:107) u − f (cid:107) + λ TV( u ) ( (cid:60) TV )where F (Ω) and TV( u ) are defined in (10) – (13). We show in the second part of our work that thisproblem can be solved using the methods developed for ( (cid:60) B ). Thus we not only present an efficientalgorithm for ( (cid:60) TV ), we also present the first algorithm for finding the full regularization path ofthis problem. This paper is laid out as followed. The rest of this section presents a survey of the background tothe current problem, describing the current state of the research in the relevant areas and laying outthe reasons it is of interest to the statistics and machine learning communities.Section 4 lays out the basic framework connecting network flows, submodular function mini-mization, and the optimization of ( (cid:60) B ). While some of our results were discovered independently,our theory pulls them together explicitly in a common framework, and provides more direct proofsof their correctness. Additionally, our approach motivates a new algorithm to exactly solve theresulting optimization problem.Section 5 presents entirely novel results. We extend the basic theory of section 4 using a novelextension lemma to allow for a general size-biasing measure on the different components of the input.The theoretical results are proved for general submodular functions, extending the state-of-the-artin this area. In the context of our problem, we use these results to extend the algorithmic resultsof section 4. In particular, this opens a way to handle the ξ i ( u i ) term in ( (cid:60) B ) above, significantlyextending the results of Mairal et al. [2011] related to section 4. Regularized models have gained significant attention in both the statistics and machine learningcommunities. Intuitively, they provide a way of imposing structure on the problem solution. Thisfacilitates the accurate and tractable estimation of high-dimensional variables or other features,such as change points, that are difficult to do in a standard regression context. Furthermore,these methods have been well studied both from the theoretical perspective – correct estimation isguaranteed under a number of reasonable assumptions – and the optimization perspective – efficientalgorithms exist to solve them. Not surprisingly, these approaches have gained significant popularityin both the statistics and machine learning communities. We will describe several relevant examplesbelow.In the general context, we consider finding the minimum of γ ( u ) = L ( u , y ) + λ Φ( u ) , (19)where L ( u , y ) is a smooth convex Lipshitz loss function, Φ( u ) is a regularization term, and λ ≥ L is typically given by the log-likelihood of the model. The regularization term Φ( u ) is required to be convex but is not requiredto be smooth; in practice it is is often a regularization penalty that enforces some sort of sparsenessor structure. A common regularization term is simply the L norm on u , i.e.Φ ( u ) = (cid:107) u (cid:107) (20)9hich gives the standard LASSO problem proposed in [Tibshirani, 1996] and described in section 2.A number of algorithms have been proposed for the optimization of this particular problemFriedman et al. [2010], Efron et al. [2004]. 
Most recently, a method using proximal operators has been proposed, called the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [Beck and Teboulle, 2009]; this achieves theoretically optimal performance, matching the O(1/k^2) lower bound on the iteration complexity of first-order methods for general smooth, convex functions [Nesterov, 2005]. The method is based on the proximal operators we discuss next, and is described in detail, along with theoretical guarantees, in Beck and Teboulle [2009]. This is the method that immediately fits with our theoretical results.

Proximal gradient methods [Nesterov, 2007, Beck and Teboulle, 2009] are a very general approach to optimizing (19). At a high level, they can be understood as a natural extension of gradient and sub-gradient based methods designed to deal efficiently with the non-smooth component of the optimization. The motivation behind proximal methods is the observation that the objective γ(u) is the composition of two convex functions, with the non-smooth part isolated in one of the terms. In this case, the proximal problem takes the form

    γ_P(u, û) = f(û) + (u − û)^T ∇f(û) + λ Φ(u) + (L/2) ||u − û||_2^2,    (21)

where we abbreviate f(u) = L(u, y), and L is the Lipschitz constant of ∇f, which we assume exists. Up to constants independent of u, we may rewrite this as

    γ'_P(u, û) = (1/2) || u − [ û − (1/L) ∇f(û) ] ||_2^2 + (λ/L) Φ(u).    (22)

The minimizing solution of (22) forms the update rule. Intuitively, at each iteration, proximal methods linearize the objective function around the current estimate of the solution, û, then update this with the solution of the proximal problem. As long as (21) can be optimized efficiently, proximal methods give excellent performance; a sketch of the generic iteration is given below.

While this technique has led to a number of algorithmic improvements for simpler regularizers, it has also opened the door to the practical use of much more complicated regularization structures. In particular, there has been significant interest in more advanced regularization structures, particularly ones involving dependencies between the variables. These types of regularization structures are the primary focus of our work.

Finally, and of most relevance to our work here, is a series of recent papers by Francis Bach and others, namely Bach, Jenatton, Mairal, and Obozinski [2012], Mairal, Jenatton, Obozinski, and Bach [2010], and Mairal, Jenatton, Obozinski, and Bach [2011], that independently discover some of our results. In these papers, combined with the more general work of Bach, Jenatton, Mairal, and Obozinski [2011], Bach [2010b] and Bach [2010a], several of the results we present here were independently proposed, although from a much different starting point. In particular, these works propose using parametric network flows to solve the proximal operator for Φ(u) = Σ_{i,j} w_ij |u_i − u_j|, and show that this corresponds to calculating the minimum norm vector of the associated submodular function. These results are discussed in more detail in section 4.

While our results were discovered independently from these, we extend them in a number of important ways. First, we propose a new algorithm to exactly solve the linear parametric flow problem; this is a core algorithm in these problems. Second, through the use of the weighting scheme proposed in section 5, we develop a way to include general convex piecewise-linear functions directly in the optimization procedure.
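Returning to the generic template (21)–(22), each iteration takes a gradient step on the smooth loss and then applies the proximal operator of the regularizer. The sketch below is our illustration (not the paper's algorithm), using a fixed step size 1/L and the simple L1 prox as a placeholder for the more structured operators studied in sections 4 and 5; FISTA adds a momentum term on top of the same prox call.

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t*||.||_1 (soft-thresholding); stands in for the
    # graph-structured proximal operators developed later in the paper.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, y, lam, n_iter=200):
    # Minimize (1/2)||y - A u||^2 + lam*||u||_1 via the update rule (22).
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    u = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ u - y)           # gradient of the smooth term at u
        u = prox_l1(u - grad / L, lam / L) # gradient step, then prox step
    return u

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
u_true = np.zeros(20); u_true[:3] = [2.0, -1.0, 0.5]
y = A @ u_true + 0.01 * rng.standard_normal(50)
print(np.round(proximal_gradient(A, y, lam=0.5), 2))   # sparse estimate close to u_true
```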
This expands the types of regularization terms that can be handledas part of the proximal operator scheme. The final piece of background work we wish to present forms the last pillar on which our theory isbuilt. Core to the theory and the algorithms is the simple problem of finding the minimum costpartitioning of a set of nodes V in which relationships between these nodes are defined by a graphconnecting them. Our results are built on a fundamental connection between a simple combinatorialproblem – finding a minimum cost cut on a graph – and the convex optimization problems describedin section 3.2.A fundamental problem in combinatorial optimization is finding the minimum cost cut on adirected graph. This problem is defined by ( V , E , c , s, t ), where V is a set of nodes and E is a setof directed edges connecting two nodes in V . We consistently use n to denote the number of nodesand, without loss of generality, assume V = { , , ..., n } , i.e. the nodes are refered to using the first n counting numbers. Similarly, the edges are denoted using pairs of vertices; thus E ⊆ { ( i, j ) : i, j ∈ V} .Here, ( i, j ) denotes an edge in the graph going from i to j . For an undirected graph , we simply assumethat ( i, j ) ∈ E if and only if ( j, i ) ∈ E , and that c ij = c ji for all i, j ∈ V . c associates a cost with eachof the edges in E . I.e. maps from a pair of edges to a non-negative cost, i.e. c ij ∈ R + for ( i, j ) ∈ E .The letters s and t here denote specific nodes not in the node set V . For reasons that willbecome clear in a moment, s is called the source node and t is called the sink node. We treat s and t as valid nodes in the graph and define the cost associated with an edge from s to i as c si ;analogously, c it denotes the cost of an edge from i to the sink t . These nodes are treated speciallyin the optimization, however; the minimum cut problem is the problem of finding a cut in the graphseparating s and t such that the total cost of the edges cut is minimal. Formally, we wish to find aset S ∗ ⊆ V satisfying S ∗ ∈ Argmin S ⊆V (cid:88) i ∈ Sj ∈ ( V\ S ) c ij + (cid:34)(cid:88) i ∈ S c it (cid:35) + (cid:88) i ∈V\ S c si (23)where Argmin with a capital A returns the set of minimizers, as there may be multiple partitions S ∗ achieving the minimum cost.The above lays out the basic definitions needed for our work with graph structures. This problemis noteworthy, however, as it can be solved easily by finding the maximal flow on the graph – one ofthe fundamental dualities in combinatorial optimization is the fact that the cut edges defining theminimum cost partition are the saturated edges in the maximum flow from s to t in the equivalentnetwork flow problem. It is also one of the simplest practical examples of a submodular function,another critical component of our theory. We now describe these concepts. The network flow problem is the problem of pushing as much “flow” as possible through the graphfrom s to t , where the capacity of each edge is given by the cost mapping c above. The significance11f this problem is the fact that it map directly to finding the minimum cost partition in a graph[Dantzig and Fulkerson, 1955, Cormen et al., 2001, Kolmogorov and Zabih, 2004]. The saturatingedges of a maximizing network flow – those edges limiting any more flow from being pushed throughthe graph – defines the optimal cut in the minimum cut partitioning. 
This result is one of the mostpractical results of combinatorial optimization, as many problems map to the partitioning problemabove, and one and simple algorithms exist for solving network flow problems.In the network flow problem, we wish to construct a mapping z that represents “flow” from s to t . A mapping z is a valid flow if, for all nodes in V , the flow going into the each node is the sameas the flow leaving that node. Specifically, Definition 3.1 ((Valid) Flow) . Consider a graph G = ( V , E , c , s, t ) as described above. Then z ij , i, j ∈ { s, t } ∪ V , is a flow on G if ≤ z ij ≤ c ij ∀ i, j ∈ V (24) (cid:88) i ∈V∪{ s } z ij = (cid:88) k ∈V∪{ t } z jk , (25) where for convenience, we assume that z ii = 0 and z ij = c ij = 0 if ( i, j ) / ∈ E .Furthermore, we say that an edge ( i, j ) is saturated if z ij = c ij , i.e. the flow on that edge cannotbe increased. Since we frequently talk about flows in a more informal sense, we use the term “valid flow” toreference this formal definition.It is easy to show that the amount of flow leaving s is the same as the amount of flow leaving t ,i.e. Total Flow = (cid:88) i ∈V z si = (cid:88) i ∈V z it . (26)this total flow is what we wish to maximize in the maximum flow problem, i.e. we wish to find a maximal flow z ∗ such that z ∗ ∈ Argmin z : z is a valid flow on G (cid:88) i ∈V z si . (27)The canonical min-cut-max-flow theorem [Dantzig and Fulkerson, 1955, Cormen et al., 2001] statesthat the set of edges saturated by all possible maximal flows defines a minimum cut in the senseof (23) above. The immediate practical consequence of this duality is that we are able to find theminimum cut partitioning quickly as fast algorithms exist for finding a maximal flow z ∗ . We outlineone of these below, but first we generalize the idea of a valid flow in two practically relevant ways. Two extensions of the idea of a flow z on a graph relaxes the equality in the definition of a flow.Relaxing the inequality is key to several algorithms, and forms some interesting connections to thetheory we outline. The equality constraint is effectively replaced with an excess( · ) function thatgives the excess flow at each node; if the total amount of flow into and out of a node i is equal, thenexcess( i ) = 0. Formally, 12 efinition 3.2 (Preflow) . Consider a graph G = ( V , E , c , s, t ) as in definition 3.1. Then z ij , i, j ∈ { s, t } ∪ V , is a flow on G if ≤ z ij ≤ c ij ∀ i, j ∈ V (28) (cid:88) i ∈V∪{ s } z ij ≥ (cid:88) k ∈V∪{ t } z jk . (29)With this definition, we define the excess( · ) function asexcess( i ) = (cid:88) i ∈V∪{ s } z ij − (cid:88) k ∈V∪{ t } z jk . (30)It is easy to see that if z is a preflow, excess( i ) ≥ i . Maintaining a valid preflow isone of the key invariants in the common push-relabel algorithm discussed below.A Pseudoflow is the weakest definition of a flow; it relaxes (29) completely, allowing there tobe both excesses and deficits on the nodes of the graph. However, it does add in an additionalconstraint, namely that all edges connected to the source and sink are saturated. Formally, Definition 3.3 (Pseudoflow) . Consider a graph G = ( V , E , c , s, t ) as in definition 3.1. Then z ij , i, j ∈ { s, t } ∪ V , is a flow on G if ≤ z ij ≤ c ij ∀ i, j ∈ V (31) z si = c si ∀ i ∈ V (32) z it = c it ∀ i ∈ V (33)Note that now the excess( i ) function can be either positive or negative. 
A pseudoflow hassome interesting geometric and algorithmic properties, It was described in Hochbaum [1998] andan efficient algorithm for solving network flows based on pseudoflows was presented in Hochbaum[2008]. We mention it here to preview our results in section 4: the pseudoflow matches up directlywith the base polytope, one of the fundamental structures in submodular function optimization. Finding the minimum cut or the maximum flow on a network is one of the oldest combinatorialoptimization problems, and, as such, it is also one of the most well studied. It would be impossibleto detail the numerous algorithms to solve the network flow problem here – Schrijver [2003] listsover 25 survey papers on different network flow algorithms. Many of these are tailored for differenttypes of graphs (sparse vs. dense) and other variations of the problem.Perhaps the most well-known algorithm for Network flow analysis is the push-relabel method.Variants of it achieve the best known performance guarantees for a number of problems of interest. Itis simple to implement; essentially, each node has a specific height associated with it that represents(informally) the number of edges in the shortest unsaturated path to the sink. Each node can havean excess amount of flow from the source. At each iteration, a node with excess either pushes flowalong an unsaturated edge to a neighbor with a lesser height, or increments its height. Terminationoccurs when the source node is incremented to |V| + 1, where |V| is the number of nodes – at thispoint, there is no possible path to the sink and the maximum cut can be found. For more informationon this, along with other information on this algorithm, see Cormen et al. [2001] or Schrijver [2003].This algorithm runs at the core of our network flow routines. This algorithm is guaranteed to run13n polynomial time; in the general case, it runs in O (cid:16) |V| |E| (cid:17) time, where V is the set of verticesand E is the set of edges. However, it can be improved to O (cid:0) |V| |E| log( (cid:12)(cid:12) V (cid:12)(cid:12) / |E| ) (cid:1) using dynamictree structures [Schrijver, 2003]. One variation of the regular network flow problem is the parametric flow problem [Gallo et al., 1989].The general parametric flow problem is a simple modification of the maximum flow problem givenabove, except that now c si is replaced with a monotonic non-decreasing function c si ( β ), z ∗ ( β ) ∈ Argmin z : z is a valid flow on G ( β );Capacity of edge ( s, i ) given by c si ( β ) (cid:88) i ∈V z si . (34)Gallo et al. [1989] showed that this problem could be solved for a fixed sequence β < β < · · · < β m in time proportional to the time of a single run of the network flow problem by exploiting a nestednessproperty of the solutions. Namely, as β increase, the minimum cut of the optimal solution movescloser on the graph to the source. More formally, if β < β , then S ∗ ( β ) ⊆ S ∗ ( β ) , (35)where S ∗ ( β ) is an optimal cut at β in the sense of (23).If c si ( β ) is a linear function with positive slope, we call this problem the the linear parametricflow problem . In sections 4 and 5, we develop an exact algorithm for this problem that does notrequire a predetermined sequence of β to solve. This method is one of the key routines in thealgorithms for general optimization. One of the recent applications of graph partitioning is in finding the minimum energy state of abinary Markov random field. 
In particular, if we can model the pairwise interactions over binary variables x_1, ..., x_n on the graph using unary and pairwise potential functions E_i(x_i) and E_ij(x_i, x_j), then the minimum energy solution can be found using the minimum cut on a specially formulated graph, provided that E_ij(0,0) + E_ij(1,1) ≤ E_ij(0,1) + E_ij(1,0) for all i, j. This application of network flow solvers has had substantial impact in the computer vision community, where pairwise potential functions satisfying this condition are quite useful.

We discuss the details of this application in section 4, where we start with the theory behind these connections as the first building block of our other results. Ultimately, we show that network flow algorithms can be used to solve not only this problem, but the much more general optimization problems described in section 3.2 as well. Through these connections, we hope to bring the power of network flow solvers into more common use in the statistical community.

The problem of finding a minimal partition in a graph is a special case of a much larger class of combinatorial optimization problems called submodular function minimization. Like the problem of finding a minimal-cost partitioning of a graph given in (23), this optimization problem involves finding the minimizing set over a submodular function f : 2^V → R, where 2^V denotes the collection of all subsets of a ground set V. Given this, f is submodular if, for all sets S, T ⊆ V,

    f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T).    (36)

This condition is seen as the discrete analogue of convexity. It is sufficient to guarantee that a minimizing set S* can be found in polynomial time.

An alternative definition of submodularity involves the idea of diminishing returns [Fujishige, 2005]. Specifically, f is submodular if and only if, for all S ⊆ T ⊂ V and for all i ∈ V\T,

    f(S ∪ {i}) − f(S) ≥ f(T ∪ {i}) − f(T).    (37)

This property is best illustrated with a simple example. Let A_1, A_2, ..., A_n ⊆ A be n subsets of a larger set A, and define

    f_coverage(S) = | ∪_{i ∈ S} A_i |,    (38)

so f_coverage(S) measures the coverage of the set ∪_{i ∈ S} A_i. In this context, it is easy to see that f_coverage(S) satisfies (37): adding a set A_i to a larger union can only contribute fewer new elements.

f_coverage(S) is an example of a monotone submodular function, as adding new elements to S can only increase the value of f_coverage(S). However, many practical examples do not fall into this category. In particular, the graph cut problem given in equation (23) above is non-monotone submodular:

    f_gc(S) = Σ_{i ∈ S, j ∈ V\S} c_ij + Σ_{i ∈ S} c_it + Σ_{i ∈ V\S} c_si.    (39)

We examine this particular example in more detail in section 4, where we prove f_gc is indeed submodular.

Submodular function optimization has gained significant attention lately in the machine learning community, as many other practical problems involving sets can be phrased as submodular optimization problems. It has been used for numerous applications in language modeling [Lin and Bilmes, 2010, 2012], clustering [Narasimhan, Jojic, and Bilmes, 2005, Narasimhan and Bilmes, 2007], computer vision [Jegelka and Bilmes, 2011, 2010], and many other domains.
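A quick numerical check of the diminishing-returns characterization (37) for a small coverage function (38) is sketched below; the sets A_i are arbitrary illustrative choices of ours.

```python
from itertools import chain, combinations

# Coverage function f(S) = |union of the sets A_i, i in S|, as in (38).
A = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a"}}
V = set(A)

def f_coverage(S):
    return len(set().union(*(A[i] for i in S))) if S else 0

def marginal(S, i):
    # Gain from adding element i to the set S: f(S + {i}) - f(S).
    return f_coverage(S | {i}) - f_coverage(S)

def subsets(X):
    return chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))

# Verify (37): for all S <= T and i outside T, the marginal gain of i at S
# is at least as large as the marginal gain of i at T.
ok = all(marginal(set(S), i) >= marginal(set(T), i)
         for T in subsets(V) for S in subsets(T)
         for i in V - set(T))
print(ok)   # True: f_coverage exhibits diminishing returns
```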
Thisfield is quite active, both in terms of theory and algorithms, and we contribute some novel resultsto both areas in section 5.The primary focus of our work has been developing a further connection between these graphproblems and submodular optimization theory. Our approach, however, is the reverse of much of theprevious work. Ultimately, we attempted to map the network flow problems back to the submodularoptimization problems. Surprisingly, this actually opened the door to several new theoretical resultsfor continuous optimization, and, in particular, to efficient solutions of the optimization problemsgiven in section 2. The theory surrounding submodular optimization, and combinatorial optimization in general, isquite deep. Many of the results underlying submodular optimization require a fairly substantialtour of the theory of the underlying structures; good coverage of these results is found in [Fujishige,2005] and [Schrijver, 2003]. Most of these results are not immediately relevant to our work, so weleave them to the interested reader. However, several additional results are needed for some of theproofs we use later.As with the discussion of problem, we assume that V = { , , , ..., n } ; our notation intentionallymatches that of the minimum cut problem defined above and is consistent throughout our work. Werefer to V as the ground set . Formally, f can be restricted to map from a collection of subsets of 2 V ,15hich we refer to consistently as D , with D ⊆ V . In the context of submodular functions, D mustbe closed under union and intersection and include V as an element[Fujishige, 2005]. In general, andexcept for some of our proofs in section 5, D can be thought of as 2 V . The first polynomial time algorithm for submodular function optimization was described in[Gr¨otschel et al., 1993]; it used the ellipsoid method from linear programming [Chvtal, 1983]. Whilesufficient to prove that the algorithm can be solved in polynomial time, it was impractical touse on any real problems. The first strongly polynomial time algorithms – polynomial in a sensethat does not depend on the values in the function – were proposed independently in Schrijver[2000] and Iwata, Fleischer, and Fujishige [2001]. The algorithm proposed by Schrijver runs in O (cid:0) n + γn (cid:1) , where γ refers to the complexity of evaluating the function. The latter algorithm is O (cid:0) γn log n (cid:1) , which may be better or worse depending on γ . The weakly polynomial version ofthis algorithm runs in O (cid:0) γn log M (cid:1) , where M is the difference between maximum and minimumfunction values. Research in this area, however, is ongoing – a strongly polynomial algorithm thatruns in O (cid:0) n + n γ (cid:1) has been proposed by Orlin [2009].In practice, the minimum norm algorithm – also called the Fujishige-Wolfe Algorithm – is gener-ally much faster, although it does not have a theoretical upper bound on the running time [Fujishige,2005]. We discuss this algorithm in detail, as it forms the basis of our work. However, there areclear cases that occur in practice where this algorithm does not seem to improve upon the morecomplicated deterministic ones – in Jegelka, Lin, and Bilmes [2011], a running time of O (cid:0) n (cid:1) wasreported. Thus the quest for practical algorithms for this problem continues; it is an active area ofresearch.Additionally, several other methods for practical optimization have been proposed for generalsubmodular optimization or for special cases that are common in practice. 
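For the small examples used to illustrate the theory in this paper, exhaustive enumeration gives a simple reference minimizer against which any of the faster algorithms above can be checked. The sketch below is ours and is exponential in |V|, so it is useful only as a correctness check on tiny ground sets; the numerical costs reuse the illustrative graph from the earlier minimum-cut sketch.

```python
from itertools import chain, combinations

def brute_force_minimize(f, V):
    # Reference minimizer: evaluate f on every subset of the ground set V.
    best_set, best_val = None, float("inf")
    for S in chain.from_iterable(combinations(sorted(V), r) for r in range(len(V) + 1)):
        val = f(set(S))
        if val < best_val:
            best_set, best_val = set(S), val
    return best_set, best_val

# Example: the graph-cut function f_gc of (39) on a tiny three-node graph.
c_pair = {(1, 2): 1.0, (2, 3): 1.0}       # symmetric pairwise costs
c_source = {1: 3.0, 2: 0.5, 3: 0.0}       # c_si
c_sink = {1: 0.0, 2: 0.5, 3: 2.0}         # c_it

def f_gc(S):
    cut = sum(c for (i, j), c in c_pair.items() if (i in S) != (j in S))
    return cut + sum(c_sink[i] for i in S) + sum(c_source[i] for i in c_source if i not in S)

print(brute_force_minimize(f_gc, {1, 2, 3}))   # same optimal value as the minimum cut
```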
Stobbe and Krause [2010]proposed an efficient method for submodular functions that can be represented as a decomposablesum of smaller submodular functions given by g i ( | U i ∩ S | ), where g i is convex. In this particular case,the function can be mapped to a form that permits the use of nice numerical optimization techniques;however, many submodular functions cannot be minimized using this technique, and it can also beslow [Jegelka et al., 2011]. Along with analyzing the deficiencies of existing methods, Jegelka,Lin, and Bilmes [2011] propose a powerful approach that relies on approximating the submodularfunctions with a sequence of graphs that permit efficient optimization. This method is quite efficienton a number of practical problems, likely because the graph naturally approximates the underlyingstructure present in many real-world problems.Most recently, in Iyer, Jegelka, and Bilmes [2013], another practical optimization method isproposed; at its core is a framework for both submodular minimization and maximization based ona notion of discrete sub-gradients and super-gradients of the function. While not having a theoreticalupper bound itself, it performs efficiently in practice. This method is also noteworthy in that it canconstrain the solution space in which other exact solvers operate, providing substantial speedups. A number of geometrical structures underpin the theory of submodular optimization. In particular,an associated polymatroid is defined as the set of points in |V| -dimensional Euclidean space withsums of sets of the dimensions constrained by the function value of the associated set.16or notational convenience, for a vector x ∈ R n and set S ⊆ V , define x ( S ) = (cid:88) i ∈ S x i . (40)In this way, x ( S ) forms a type of unnormalized set measure.Now the polymatroid associated with f is defined as P ( f ) = (cid:110) x ∈ R |V| : x ( S ) ≤ f ( S ) ∀ S ∈ D (cid:111) , (41) P ( f ) is fundamental to the theory behind submodular function optimization.In our theory, we work primarily with the base of the polymatroid P ( f ), denoted by B ( f ). Thisis the extreme ( |V| − P ( f ); it is defined as B ( f ) = (cid:110) x ∈ R |V| : x ∈ P ( f ) , x ( V ) = f ( V ) (cid:111) . (42)In the case where the domain D = 2 V , B ( f ) is a linear, convex compact set. Thus it is often referedto as the Base Polytope of f as it is compact.The base B ( f ) is particularly important for our work, as one of the key results from submodularfunction optimization is the minimum norm algorithm , which states effectively that sets S ∗ mini-mizing f are given by the sign of the point in B ( f ) closest to the origin. This surprising result,while simple to state, takes a fair amount of deep theory to prove for general f ; we state it here: Theorem 3.4 ([Fujishige, 2005], Lemma 7.4.) . For submodular function f defined on V , let y ∗ = argmin y ∈ B ( f ) (cid:107) y (cid:107) (43) and let S ∗ = { i : y ∗ i < } and S ∗ = { i : y ∗ i ≤ } . Then both S ∗ and S ∗ minimize f ( S ) over allsubsets S ⊆ V . Furthermore, for all S † ⊆ V such that f ( S † ) = min S ∈V f ( S ) , S ∗ ⊆ S † ⊆ S ∗ . A central aspect of our theory relies on this result. In general, it as one way of working thegeometric structure of the problem. It turns out that in the case of graph partitioning, B ( f ) takeson a particularly nice form. 
This allows us to very quickly solve the minimum norm problem here.We show the resulting minimum norm vector also gives us the full solution path over weightingby the cardinality of the minimizing set. While this basic result was independently discovered inMairal, Jenatton, Obozinski, and Bach [2011], our approach opens several doors for theoretical andalgorithmic improvements, which we outline in the subsequent sections. The foundational aspect of our work is the result established in this section, namely an exploration ofnetwork flow minimization problems in terms of their geometric representation on the base polytopeof a corresponding submodular problem. This representation is not new; it is explored in somedepth as an illustrative example in [Fujishige, 2005] and connected to the minimum norm problemsin Mairal et al. [2011]. Our contribution, however minor, is based on a simple transformation of theoriginal problem that yields a particularly intuitive geometric form. The fruit of this transformation,however, is a collection of novel results of theoretic and algorithmic interest; in particular, we areable to exactly find the optimal u ∗ over the problem u ∗ ( λ ) = argmin u ∈ R n (cid:107) u − a (cid:107) + λ (cid:88) i,j w ij | u i − u j | . (44)17ecall that this problem was discussed in depth in section 3.2; here, a ∈ R n is given, and λ ∈ R + and the weights w ij ∈ R + control the regularization.In this section, we lay the theoretical foundation of this work, denoting connections to otherrelated or previously known results. The contribution at the end is a strongly polynomial algorithmfor the solution of a particular parametric flow problem; this algorithm follows naturally from thisrepresentation. In the next section, we extend the theory to naturally allow a convex piecewise-linearfunction ξ i ( u i ) to be included in (44) to match ( (cid:60) B ).Before presenting the results for the real-valued optimization of (44), we must first present anumber of theoretical results using the optimization over sets V = { , , ..., n } . We begin with thesimplest version of this optimization – finding the minimizing partition of a graph – and extend thisresult to general submodular functions later. We proceed by defining and establishing equivalences between three basic forms of the minimumcut problem on a graph. Recall from section 3.3 that the minimum cut problem is the problem offinding a set S ∗ ⊆ V satisfying S ∗ ∈ Argmin S ⊆V (cid:88) i ∈ Sj ∈ ( V\ S ) c ij + (cid:34)(cid:88) i ∈ S c it (cid:35) + (cid:88) i ∈V\ S c si . (45)As we mentioned earlier, several other important problems can be reduced to this form; in partic-ular, finding the lowest energy state of a binary Markov random field, when the pairwise potentialfunctions satisfy the submodularity condition, is equivalent to this problem. Our task now is tomake this explicit.We here show equivalences between three versions of the problem. The first is P E , which givesthe standard energy minimization formulation, i.e. finding the MAP estimate of p ( x ) ∝ exp (cid:88) ( i,j ) ∈E E ij ( x i , x j ) + (cid:88) i ∈V E i ( x i ) , x ∈ { , } n (46)which is equivalent to finding x ∗ ∈ Argmin x ∈{ , } n (cid:88) ( i,j ) ∈E E ij ( x i , x j ) + (cid:88) i ∈V E i ( x i ) . (47) P Q gives the formulation as a quadratic binary minimization problem; here, the problem is tofind x ∈ { , } n that minimizes x T Qx for an n × n matrix Q . This version forms a convenientform that simplifies much of the notation in the later proofs. 
Finally, we show it is equivalent tothe classic network flow formulation, denoted P N . This states the original energy minimizationproblem as the minimum st -cut on a specially formulated graph structure. The equivalence ofthese representations is well known [Kolmogorov and Zabih, 2004] and widely used, particularly incomputer vision applications. We here present a different parametrization of the problem whichanticipates the rest of our results.The other unique aspect of our problem formulation is the use of a size biasing term ; specifically,we add a term β | S | to the optimization problem, where β ∈ R can be positive or negative. Thisterm acts similarly to a regularization term in how it influences the optimization, but to think of18t this way would lead to confusion as the true purpose of this formulation is revealed later in thissection – ultimately, we show equivalence between the values of β at which the set membership of anode flips and the optimal values of the continuous optimization problem of ( (cid:60) B ). Theorem 4.1. Let G = V , E ) be an undirected graph, where V is the set of vertices (assume V = { , ..., n } ) and E ⊆ { ( i, j ) : i, j ∈ V} is the set of edges. Without loss of generality, assumethat i < j ∀ ( i, j ) ∈ E . Define S ∗ E ( β ) , S ∗ Q ( β ) , and S ∗ N ( β ) as the sets of optimizing solutions to thefollowing three problems, respectively: Energy Minimization: P E ( β ) . Given an energy function E = ( E i ( x i ) : i ∈ V ) defined for eachvertex i ∈ V and a pairwise energy function E = ( E i,j ( x i , x j ) : ( i, j ) ∈ E ) defined for eachedge ( i, j ) ∈ E , with E ij (0 , 0) + E ij (1 , ≤ E ij (0 , 1) + E ij (1 , , let X ∗ ( β ) = Argmin x ∈{ , } n (cid:88) i ∈V ( E i ( x i ) − βx i ) + (cid:88) ( i,j ) ∈E i 1) + E ij (0 , − E ij (0 , − E ij (1 , i < j otherwise (50) q ii = ( E i (1) − E i (0)) + (cid:88) i (cid:48) i : ( i,j ) ∈E ( E ij (1 , − E ij (0 , . (51) Suppose q ij ≤ for i (cid:54) = j , and let X ∗ = Argmin x ∈{ , } n x T ( Q − β I ) x , (52) and let S ∗ Q ( β ) = (cid:8) { i : x ∗ i = 1 } : x ∗ ∈ X ∗ ( β ) (cid:9) . (53) Minimum Cut Formulation: P N ( β ) . Let G (cid:48) = ( V (cid:48) , E (cid:48) ) be an augmented undirected graph with V (cid:48) = V ∪ { s, t } , where s and t represent source and sink vertices, respectively, and E (cid:48) = E ∪ { ( s, i ) : i ∈ V} ∪ { ( i, t ) : i ∈ V} ∪ { ( j, i ) : ( i, j ) ∈ E} . Define capacities c ij , ( i, j ) ∈ E on theedges as: c si = c si ( β ) = [ a i ( β )] + c ji = c ij = − q ij , i < j c jt = c jt ( β ) = [ a j ( β )] − (54) where a i ( β ) = 12 (cid:88) i (cid:48) : i (cid:48)
A reformulation of known results [Kolmogorov and Zabih, 2004], but given in section A forconvenience.The primary consequence of this theorem is that solving the energy minimization problem canbe done efficiently and exactly due to several types of excellent network flow solvers that makesolving problems with millions of nodes routine [Boykov and Kolmogorov, 2004, Cormen et al.,2001, Schrijver, 2003]. Because of this, numerous applications for graphcuts have emerged in recentyears for computer vision and machine learning. Our purpose, in part, is to expand the types ofproblems that can be handled with network flow solvers, and problems of interest in statistics inparticular.In our work, we alternate frequently between the above representations. For construction, theenergy minimization problem is nicely behaved. In the theory, we typically find it easiest to workwith the quadratic binary problem formulation due to the algebraic simplicity of working in thatform. Again, however, each of these is equivalent; when it does not matter which form we use, werefer to the problem an solution set as S ∗ ( β ) and P ( β ) respectively. While theorem 4.1 lays out the equivalence between P E ( β ) and P Q ( β ) and a specific form of networkflow problem P N ( β ), for completeness we show that any network flow problem can be translatedinto the form P N ( β ) and thus P E ( β ) and P Q ( β ). The two distinguishing aspects of P N ( β ) arethe facts that each node is connected to either the source or the sink, and that all the edges aresymmetric, i.e. c ij = c ji for all i (cid:54) = j . It is thus sufficient to show that an arbitrary flow problemcan be translated to this form. For conciseness, assume that β = 0; the results can be adapted forother β easily. Theorem 4.2. Any minimum cut problem on an arbitrary, possibly directed graph can be formulatedas a quadratic binary problem of the form P Q ( β = 0) as follows: . For all edges ( i, j ) such that c ij > c ji , add a path s → j → i → t with capacity c ij − c ji . Thatedge is now undirected in the sense that both directions have the same capacity, and the edgesin the minimum cut are unchanged, as these paths will simply be eliminated by flow along thatpath. . Set q ij = c ij for i < j and q ij = 0 for i > j . . Given q ij , set q ii = ( c si − c it ) − (cid:104)(cid:80) i (cid:48) : i (cid:48)
The minimization problem P Q ( β ) can be expressed as minimization of a function f β : 2 V (cid:55)→ R , with f β ( S ) = (cid:88) i One immediate proof of the if part follows from the fact that P Q ( β ) can be expressed as thesum of submodular pairwise potential terms, and the sum of pairwise submodular functions is alsosubmodular [Fujishige, 2005]. The direct proof, including both directions, is a simple reformulationof known results see [Kolmogorov and Zabih, 2004], but given in section A on page 38 for convenience. We are now ready to present the theory that explicitly describes the geometry of P Q ( β ), whichextends to both P E ( β ) and P N ( β ), in the context of submodular function optimization. This theoryis not new; several abstract aspects of it have been thoroughly explored by Schrijver [2003] andFujishige [2005]. In our case, however, the exact form of the problem presented in theorem 4.1 wascarefully chosen to yield nice properties when this connection is made explicit. From these, a numberof desirable properties follow immediately.The rest of this section is arranged as follows. First, we show that the base polytope B ( f β )is a reduction of all pseudoflows (see definition 3.3) on the form of the cut problem P N ( β ) fromtheorem 4.1. B ( f β ), described in section 3.4, has special characteristics for our purposes, as theminimum norm algorithm described in section 3.4 provides a convenient theoretical tool to examinethe structure of P Q ( β ). Our key result is to show that the minimum norm vector – the L -projectionof the origin onto B ( f β ) – depends on β only through a simple, constant shift. Thus this vectorallows us to immediately compute the minimum cut directly for any β , and we are thus able tocompute the minimum cut solution as well for any β as well. Theorem 4.4 (Structure of B ( f β )) . Let f β be defined in theorem 4.3 (58) , and let A = (cid:26) α ∈ M n × n : (cid:26) | α ij | ≤ | q ij | i < jα ij = 0 otherwise (cid:27) . (59) Let r i ( α ) , i = 1 , ..., n , be defined as follows: r i ( α ) = q ii + 12 (cid:88) i (cid:48)
Proceeds with straightforward albeit tedious algebra. Proved in section A on page 37.In light of this, the minimum norm problem on the graph structure is as follows. The immediatecorollary to the min-norm theorem is that α ∗ ( β ) yields the optimum cut S ∗ ( β ) of the correspondingnetwork flow problem, P N ( β ): Theorem 4.5 (Minimum Norm Formulation of P N ( β )) . Let A and r ( α ) be given by equations (59) and (60) , respectively. Then the min-norm problem P N ( β ) associated with P Q ( β ) is defined by α ∗ ( β ) = argmin α ∈A (cid:107) r ( α ) − β (cid:107) (63) Then any optimal cut S ∗ ( β ) solving P G ( β ) , as given in theorem 4.1, satisfies: { i ∈ V : r i ( α ∗ ( β )) < β } ⊆ S ∗ ( β ) ⊆ { i ∈ V : r i ( α ∗ ( β )) ≤ β } . (64) Proof. Follows immediately from theorems 4.1 and 4.3 characterizing the cut problem as a sub-modular function optimization problem, theorem 4.4 describing the structure of this problem, andtheorem 3.4 to characterize the solution.The optimal α ∗ in the above formulation has some surprising consequences that motivate therest of our results. In particular, we show that the optimal α ∗ in equation (63) is independent of β ; this is the key observation that allows us to find the entire regularization path over β . Formally,this result is given in theorem 4.9. However, we need other results first. The representation in terms of α is significant partly as the values of α effectively form a pseudoflowin the sense of Hochbaum [2008] (see section 3.3.2). Recall that a pseudoflow extends the conceptof a flow by allowing all the nodes to have both excesses and deficits. In addition, a pseudoflowassumes that all edges from the source to nodes in the graph are saturated, possibly creating excessesat these nodes, and all edges connected to the sink are similarly saturated, possibly creating deficitsat these nodes. Theorem 4.6. Consider the problem P N ( β ) . For any α ij ∈ R , with ( i, j ) ∈ E , let α ij representthe flow on each edge c ij , with α ij > indicating flow from i to j , and α ij < indicating flowfrom j to i . Then α ∈ A defines a pseudoflow on the graph structure indexed by A . Furthermore, r i ( α ) − β = excess( i ) is the (possibly negative) excess at node i in the sense of (30) .Proof. The pseudoflow condition that all edges from the source node and to the sink node aresaturated is immediately implied by the fact that r i ( ) − βw i = ( c si − c it ) − β . The flow conditions,then follow from the edge capacity being c ij = c ji = − q ij and −| q ij | ≤ α ij ≤ | q ij | . Corollary 4.7. Every pseudoflow on the graph defined by theorem 4.1 maps to a point in the basepolytope B ( f β ) , and every point in B ( f β ) is given by at least one pseudoflow.Proof. Follows immediately from theorem 4.6. 22 .3 Structure of the Complete Solution The theorem above has a number of important consequences detailed in the next few sections. Themost immediate consequence comes when we consider the structure of the optimal solution of theminimum norm algorithm specialized to the network flow problem P N ( β ); this effectively allows usto derive a way of solving P N ( β ) for all β . Lemma 4.8 (Optimal solutions to P N ( β )) . Then α ∗ is an optimal solution to P N ( β ) if and onlyif for all i, j , i < j , the following condition holds: α ∗ ij = | q ij | ⇐⇒ r i ( α ∗ ) ≥ r j ( α ∗ ) −| q ij | ≤ α ∗ ij ≤ | q ij | ⇐⇒ r i ( α ∗ ) = r j ( α ∗ ) α ∗ ij = −| q ij | ⇐⇒ r i ( α ∗ ) ≤ r j ( α ∗ ) . 
(65) In particular, the optimum value of α ∗ in this case is independent of β .Proof. First, A is convex as each dimension α ij is bounded independently. Thus the objective of P N ( β ) is minimizing a convex function over a convex domain. Therefore, it suffices to prove thatequation (65) can be satisfied if and only if α ∗ is a local minimum of (cid:107) r ( α ) − β (cid:107) . As (cid:107) r ( α ) − β (cid:107) is differentiable w.r.t. α , this is equivalent to showing that either the gradient is 0 or α ∗ is on theboundary of A and all coordinate-wise derivatives point outside the domain A .First, define g ij ( α ) as the gradient of (cid:107) r ( α ) − β (cid:107) w.r.t. α ij : g ij ( α ) = ∂∂α ij (cid:107) r ( α ) − β (cid:107) = 2 (cid:88) k ( r k ( α ) − β ) ∂∂α ij r k ( α ) (66)Now ∂∂α ij r k ( α ) = k = i − k = j . (67)Thus, g ij ( α ) = 2 [( r i ( α ) − β ) − ( r j ( α ) − β )] (68)= 2 [ r i ( α ) − r j ( α )] , (69)and the following condition holds for all α : g ij ( α ) < ⇐⇒ r i ( α ) > r j ( α ) g ij ( α ) = 0 ⇐⇒ r i ( α ) = r j ( α ) g ij ( α ) > ⇐⇒ r i ( α ) < r j ( α ) . (70)Matching these conditions to those in (65) shows that α ∗ as given defines a local, and thus global,optimum of P N ( β ). In particular, note that this criterion is independent of β , completing theproof. β The invariance of the optimal α ∗ to β allows us to characterize the solution space of optimal cutsas the level sets of r ( α ∗ ). The core result, as well as our algorithm, is based on this intuition.23 heorem 4.9 () . I. α ∗ is the optimal solution to (63) if and only if for all β ∈ R , all optimalcuts S ∗ β ∈ S ∗ ( β ) for P ( β ) satisfy U ( β ) = { i ∈ V : r i ( α ∗ ) < β } ⊆ S ∗ β ⊆ { i ∈ V : r i ( α ∗ ) ≤ β } = U ( β ) (71) II. Furthermore, for all β , U ( β ) is the unique smallest minimizer of P ( β ) and U ( β ) is the uniquelargest minimizer.Proof. Part (I) follows as a direct consequence of theorem 4.5 and the invariance of the minimumnorm problem to changes in β as given in lemma 4.8. Part (II) is then an immediate consequenceof the minimum norm algorithm.This theorem is intuitively important as the key values that permit a connection to the continuousproblems are values of β at which the membership of the different nodes change. This theorem tellsus that these points are given by the values of the minimum norm vector, here given as r ( α ∗ ). One immediate corollary of theorem 4.9 is a monotonicity property on the optimal sets, used in thecontinuous optimization theory we present later: Corollary 4.10. Let β < β . Then for all S ∗ ∈ S ∗ ( β ) and S ∗ ∈ S ∗ ( β ) , S ∗ ⊂ S ∗ . (72) Proof. Follows immediately from the equivalence of the optimizing sets to the level sets of theminimum norm vector given in theorem 4.9. Theorem 4.9 above was discovered independently for the full case of general submodular functionsby Nagano et al. [2011]. There, the authors similarly showed that the level sets of the minimumnorm algorithm give the solutions for the f ( S ) − β | S | problem. While the approach those authorstake is different and more involved, we give a shorter, alternative proof. We use a simple argumentfollowing from the fact that B ( f ) constrains the minimum norm vector y to a constant total sum.The result is that the constant offset in the minimum norm objective drops out of the optimization.More formally: Theorem 4.11 (Invariance of General Submodular Functions to β ) . I. Let f be a general sub-modular function. 
Then y ∗ is the optimal solution to the minimum norm problem y ∗ = argmin y ∈ B ( f ) (cid:107) y (cid:107) (73) if and only if ∀ β ∈ R , the set of optimizing solutions S ∗ ( β ) = argmin S ∈D f ( S ) − β | S | (74) satisfies U ( β ) = { i ∈ V : y ∗ < β } ⊆ S ∗ ( β ) ⊆ { i ∈ V : y ∗ ≤ β } = U ( β ) . (75)24 I. Furthermore, for all β , U ( β ) is the unique minimal solution to (63) and U ( β ) is the uniquemaximal solution in S ∗ ( β ) .Proof. Consider the submodular function f β ( S ) = f ( S ) − β | S | . It is easy to show that B ( f β ) = { x − β : x ∈ B ( f ) } . (76)Denote by y ∗ ( β ) the minimum norm solution for f β . Then the submodular problem for y ∗ ( β ) isgiven by y ∗ ( β ) = argmin y ∈ B ( f β ) (cid:107) y (cid:107) (77)= (cid:34) argmin v ∈ B ( f ) (cid:107) v − β (cid:107) (cid:35) + β (78)= (cid:34) argmin v ∈ B ( f ) (cid:40)(cid:32)(cid:88) i ∈V v i (cid:33) − β (cid:32)(cid:88) i ∈V v i (cid:33)(cid:41) + |V| β (cid:35) + β (79)= (cid:34) argmin v ∈ B ( f ) (cid:40)(cid:32)(cid:88) i ∈V v i (cid:33)(cid:41) − βf ( V ) + |V| β (cid:35) + β (80)= (cid:34) argmin v ∈ B ( f ) (cid:107) v (cid:107) (cid:35) + β (81)where steps (78)–(79) follow by definition of B ( f ), causing the terms dependent on β to drop outby way as constants under the optimization.From this, we have that the optimal minimum norm vector for f β is just the minimum normvector for f shifted by β , immediately implying part (I). Similarly, part (II) follows immediatelyfrom the minimum norm theorem.This result is used in several other sections as well, and has a number of practical implicationsfor size-constrained optimizations and related problems. For a full treatment of related implications,see [Nagano et al., 2011]. Using the above theory, we now wish to present a viable algorithm to calculate the reduction, andhence all cuts, for each node. The idea is simple and follows immediately from the similarity ofthe minimum cut problem P N ( β ) to the structure of B ( f β ) as detailed in theorem 4.6. As eachlevel set of r ( α ∗ ) defines an optimum cut in the graph, we can adjust all of the unary potentials by r µ = mean i ∈ S , chosen to bisect the reduction values, and solve the resulting cut problem. By themax-flow min-cut theorem, all edges crossing the cut are saturated. Specifically, ∀ i, j, i < j, such that r i ( α ) ≤ r µ < r j ( α ) , α ∗ ij = − q ij , (82) ∀ i, j, i < j, such that r i ( α ) > r µ ≥ r j ( α ) , α ∗ ij = q ij (83)As these α ij are optimal in the final solution α ∗ , they can be fixed by permanently adding theirvalues to the corresponding r i ( α ) and removing them from consideration in the optimization. Thisthen bisects the nodes, allowing us to treat these two subsets separately when solving for the rest25 lgorithm 1 : AlphaReduction Input : Submodular Q . Output : α ∗ , the minimizer in A of (cid:107) r ( α ) (cid:107) . // Begin by calling BisectReductions below on the full set V to get α . return BisectReductions ( T = V , α = , Q ) // α [ T ] denotes α restricted to edges with both nodes in T . BisectReductions ( T , α , Q ) r µ ← mean i ∈ T r i ( α ), r min ← min i ∈ T r i ( α ) if r µ = r min then return α [ T ] // Done; We are on a single level set. E T ← { ( i, j ) : i, j ∈ T } S ∗ T ← Minimum cut on ( T, E T ), with capacities formed from ( Q [ T ] − diag( r µ )) by theorem4.1. // Fix the flow on edges in the cut by adjusting the source/sink capacities of each node,then removing those edges. 
for i ∈ S ∗ T , j ∈ T \ S ∗ T , i < j do α ij ← − q ij , q ii ← q ii − q ij , q ij ← for i ∈ T \ S ∗ T , j ∈ S ∗ T , i < j do α ij ← q ij , q jj ← q jj + q ij , q ij ← // Recursively solve on the two partitions to fix the other α ’s. α [ S ∗ T ] ← BisectReductions ( S ∗ T , α , Q ) α [ T \ S ∗ T ] ← BisectReductions ( T \ S ∗ T , α , Q ) return α [ T ] of the bisections. The validity of this bisection can also be seen by the optimality of the minimumnorm solution as described by theorem 4.9.Algorithm 1 can be summarized as follows. We first start by considering the entire set of nodes,setting the working set S = V . At each step, we recursively partition the working set S using aminimum cut as follows:1. If r ( α ) is constant in S , then return. We’re done.2. Otherwise, set up a network flow problem to bisect the nodes and find a minimum cut. Oncea minimum cut is found, set all the edges in the cut to their saturated values.3. Repeat on the two resulting subsets of nodes.Pseudocode for this algorithm is presented in Algorithm 1. Theorem 4.12 (Correctness of Algorithm 1.) . After the termination of Algorithm 1, all values of α are set such that (cid:107) r ( α ) (cid:107) is minimized over α ∈ A .Proof. Let i, j , i (cid:54) = j , be any pair of nodes such that q ij (cid:54) = 0, and let α † be the solution returned byAlgorithm 1.First, suppose that r i ( α ∗ ) = r j ( α ∗ ). Then trivially, the optimality criteria of Lemma 4.8 issatisfied.Next, suppose that r i ( α ) (cid:54) = r j ( α ), and first suppose that r i ( α ) > r j ( α ). Then, by the termina-tion condition of the recursion in BisectReductions , nodes i and j must have been separated by26 valid minimum cut for some r µ . However, as all edges crossing a minimum cut are saturated bythe max-flow-min-cut theorem, α ij = | q ij | . Thus condition (65) in Lemma 4.8 is satisfied. Similarly,if r i ( α ) < r j ( α ), then α ij = −| q ij | , indicating a flow of | q ij | from j to i ; again, this satisfies (65).As the above holds for any pairs of nodes i, j , the optimality criteria of Lemma 4.8 is satisfiedglobally, proving the correctness of the algorithm.This algorithm is quite efficient in practice, and it forms an core routine of the total variationminimization algorithm given in the second part of our work, where we present full experiments andsome comparisons with existing approaches. One of the intriguing consequences of the above theory, and one that opens new doors to efficientlyoptimizing several other classes of functions, comes as the result of being able to map other correlateddata to this framework. In general, interactions between terms can be very difficult to work within practice. However, the above theory allows us to exactly find the optimizer of a large class ofgeneral functions. These functions may not necessarily be smooth.Our approach connects closely to several recent results discovered independently by Mairal[Mairal et al., 2011] and Bach [Bach, 2010a], which connect some of these problems to an olderresult by Hochbaum [Hochbaum and Hong, 1995]. The last of these papers effectively establishes anequivalence between a class of quadratic objective functions and some types of network flow algo-rithms, although the equivalence to parametric flows isn’t really explored. 
However, this result wasused by Bach in Bach [2010a] to note that the minimum norm problem of the network flow problemcan be solved using classical methods for solving parametric flow problems [Gallo et al., 1989], andthe implications of this for structured sparse recovery are explored in Mairal et al. [2011]. Theend result, discovered independently, parallels the theorem we present below, albeit with a differentalgorithm.In contrast, while less general, the algorithm we presented in 1 gives the exact change pointsimmediately, and the theoretical framework presented surrounding this problem is more thoroughlyexplored here. However, the primary practical improvement we provide comes in the next chapterwhen we incorporate the use of piecewise-linear convex penalty terms as well. Theorem 4.13. Suppose γ : R n → R + can be expressed as γ ( u ) = (cid:107) u − a (cid:107) + λ (cid:88) i,j w ij | u i − u j | (84) where u ∈ R n is the variable we wish to optimize over and a ∈ R n , λ > , and w ij ≥ are given.Without loss of generality, assume that i < j . Then the minimizer u ∗ = argmin u ∈ R n γ ( u ) (85) can be found exactly using Algorithm 1 with q ii = a i (86) q ij = 12 w ij λ (87) Then u ∗ = r ( α ∗ ) .Proof. See appendix A, page 41. 27 Unary Regularizers and Non-Uniform Size Measures In many contexts, it is helpful to use regularization terms or weighting terms on the solution tocontrol the final behavior of the result. In the previous section, we discussed the simplest case in thesubmodular function context, namely weighting the problem against the cardinality of the solutionset. We proved this for the case of graph-based submodular problems and extended that argumentto the full submodular function context. In this section, we extend this result to the case where thesize biasing term β | S | term is replaced with a weighted size biasing term β w ( S ). Here, w ( S ) = (cid:88) i ∈ S w i ≥ S ∗ ( β ) = Argmin S ⊆V f ( S ) − β w ( S ) (89)for all β ∈ R and all positive weight measures w .The results of the rest of these papers are entirely novel. We prove that the optimal solution toan alternate construction of the minimum norm theorem yields the entire solution path for all β .Like the last section, we find a vector z ∗ on the base B ( f ) such that every solution is given by azero-crossing of [ z ∗ − β w ]. We then show that this allows us to include more detailed structures inthe continuous optimization problem; in particular, we are able to incorporate the piecewise-linearconvex penalty term ξ i ( u i ) in ( (cid:60) B ) directly into our optimization.Our result is based around a very simple technique to augment the original graph such thatauxiliary variables “attract” parts of the regularization influence of β and transfer it to associatedvariables in the original problem. We then show that it is possible to translate this result intoarbitrary weights by taking several well-controlled limits. The end result is an algorithm for solving(89) for arbitrary positive weights.As this result is novel and holds for general submodular functions, we prove it for the generalcase first, then extend it to the special case of the linear parametric flow problem. Analogously toalgorithm 1, the algorithm we develop here solves this problem exactly. The fact that the solutionis exact also allows us to solve ( (cid:60) B ). The primary tool used for introducing weights into the optimization of the level sets of the functionis to augment the original problem with additional variables. 
When $\beta = 0$, these variables do not contribute to the solution values of the base set of nodes. In the network flow interpretation, they have no connection to the source or sink, but they are subject to the influence of $\beta$ in the resulting solutions. With the proper construction, it is possible to guarantee that these augmented nodes always have the same reduction value as the nodes they augment; this allows us to construct a graph in which these values translate back into weights on the $\beta$ terms. The primary tool is the following lemma, which forms the basis of the rest of our results.

Lemma 5.1. Let $V = \{1, 2, \ldots, n\}$, and let $f(S)$ be a bounded submodular function defined on $\mathcal{D} \subseteq 2^V$. (Recall that $\mathcal{D}$ is closed under intersection and union.) Let $w \in \{1, 2, \ldots\}^{|V|}$ be a vector of positive integer weights, and set $W = \sum_i (w_i - 1)$. Denote $V_w = V \cup \{n+1, \ldots, n+W\}$. Fix $M_\beta \in \mathbb{R}_+$ and set $M \gg M_\beta$ sufficiently large. Then:

I. For all $\beta \in [-M_\beta, M_\beta]$,
$\operatorname{Argmin}_{S \in \mathcal{D}}\; f(S) - \beta\, w(S) \;=\; \big\{\, T^* \cap V : T^* \in \operatorname{Argmin}_{T \subseteq V_w : \, T \cap V \in \mathcal{D}}\; f_w(T) - \beta |T| \,\big\}$  (90)
where $\operatorname{Argmin}$ returns the set of minimizing sets, $f_w(T)$ is submodular and given by
$f_w(T) = f(T \cap V) + M \sum_{i \in V} \sum_{j \in K_i} \big[ \mathbf{1}\{i \in T\} + \mathbf{1}\{j \in T\} - 2\, \mathbf{1}\{\{i,j\} \subseteq T\} \big]$,  (91)
and $K_i$ is a block of indices of length $w_i - 1$, given by
$K_i = \big\{\, n + \sum_{k < i} (w_k - 1) + 1, \;\ldots,\; n + \sum_{k \le i} (w_k - 1) \,\big\}$ ...
See appendix A, page 42.The above lemma is noteworthy as it provides a way to theoretically augment the original problemin a way that alters the original problem such that the relative influence of the β scaling can bealtered. In particular, in the augmented problem, the size of the evaluation set | T | includes theseaugmented nodes – since they are included deterministically based on the values in the unaugmentedset V , the unaugmented node is effectively counted multiple times. This allows us to weight thenodes separately.In our context, when dealing with graph structures, this corresponds to adding a collection ofsingle nodes with no connections other than an effectively infinite capacity edge connecting each toone of the base nodes. As this edge ties the nodes together in any cut solution, the influence of theglobal weighting parameter β on this auxiliary node is simply transferred to the attached node. Thenext two theorems extend this result to the minimum norm vector y ∗ , and an approximation lemmaextends this to general weight vectors. The central result of this section is a weighted version of the minimum norm problem. Under thisconstruction, the solution to the original minimum norm problem is the same as problem a β | S | weighting term, but with additional nodes. However, the level sets of the resulting vector yield theoptimal minimizing sets f ( S ) − β w ( S ) for all values of the parameter β .29 efinition 6.1 (Weighted Minimum Norm Problem) . For a submodular function f defined on D ⊆ V , and positive weights w ∈ R n , w > , the weighted minimum norm problem is given by z ∗ = argmin z ∈ B ( f ) (cid:88) i ∈V z i w i . (96) and we call the solution vector z ∗ the Weighted Minimum Norm Vector. Furthermore, if the weights w are restricted to be positive integers, then this is called the Integer Weighted Minimum NormProblem. We use the solution to this problem in an analogous way to the use of the minimum norm vectorof section 4. The theorems in this section show that z ∗ gives the entire solution path to f ( S ) − β w ( S )over β . Theorem 6.2 (Structure) . Let f be a submodular function on D , and let w , V w , f w , K i , and D w be as defined in lemma 5.1 (in particular, the elements of w are integers). Let κ i = K i ∪ { i } , so x ( κ i ) = x i + (cid:80) j ∈ K i x j . Then I. The polymatroid associated with f w is given by P ( f w ) = (cid:40) x ∈ R |V w | : ∀ S ∈ D , (cid:88) i ∈ S x ( κ i ) ≤ f ( S ) (cid:41) (97) and the associated base polymatroid is given by B ( f w ) = (cid:40) x ∈ R |V w | : x ∈ P ( f w ) , x ( V w ) = (cid:88) i ∈V x ( κ i ) = f ( V ) (cid:41) , (98) II. Let y ∗ = argmin y ∈ B ( f w ) (cid:107) y (cid:107) . (99) Then y ∗ j = y ∗ i for all j ∈ K i (100) III. Furthermore, let z ∗ be the solution to the integer weighted minimum norm problem, i.e. z ∗ = argmin z ∈ B ( f ) (cid:88) i z i w i . (101) Then y ∗ i = z ∗ i w i for all i ∈ V . (102) Furthermore a vector z ∗ is the optimal solution to (101) if and only y ∗ , as given by (102) , isthe optimal solution to (99) .Proof. See appendix A, page 44.The important concept behind this theorem and its corollaries is that it demonstrates a directconnection between the minimum norm problem on the augmented problem f w and the originalproblem f . This connection allows us to build the theory of the weighted problem directly upon theoriginal theory, essentially using those results. 
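As a minimal illustration of definition 6.1, note that linear optimization over $B(f)$ is available through Edmonds' greedy algorithm, so the weighted minimum norm point can be approximated by a generic Frank-Wolfe loop. The sketch below is only a stand-in for the exact minimum norm machinery relied on here, not the algorithm developed in this work; the toy submodular function (a graph cut plus a modular term), the chosen weights, and the helper names cut_value, greedy_vertex, and weighted_min_norm_fw are illustrative assumptions.

\begin{verbatim}
import numpy as np

def cut_value(S, edges):
    # Toy submodular ingredient: total capacity crossing the boundary of S.
    S = set(S)
    return sum(c for (i, j, c) in edges if (i in S) != (j in S))

def greedy_vertex(order, f):
    # Edmonds' greedy algorithm: the extreme point of B(f) induced by `order`.
    v = np.zeros(len(order))
    prefix, prev = [], f([])
    for k in order:
        prefix.append(k)
        cur = f(prefix)
        v[k] = cur - prev          # marginal value of adding k to the prefix
        prev = cur
    return v

def weighted_min_norm_fw(f, n, w, iters=5000):
    # Frank-Wolfe sketch for z* = argmin_{z in B(f)} sum_i z_i^2 / w_i.
    z = greedy_vertex(list(range(n)), f)               # any greedy vertex is feasible
    for t in range(iters):
        grad = 2.0 * z / w                             # gradient of sum_i z_i^2 / w_i
        s = greedy_vertex(list(np.argsort(grad)), f)   # minimizes <grad, x> over B(f)
        gamma = 2.0 / (t + 2.0)
        z = (1.0 - gamma) * z + gamma * s
    return z

if __name__ == "__main__":
    edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0)]    # small chain graph (toy example)
    a = [1.0, -2.0, 0.5, 1.0]                          # modular (unary) part
    f = lambda S: cut_value(S, edges) + sum(a[i] for i in S)
    w = np.array([1.0, 2.0, 1.0, 3.0])
    print(np.round(weighted_min_norm_fw(f, 4, w), 3))
\end{verbatim}

Frank-Wolfe appears here only because its oracle is exactly the greedy step; any exact solver for (96) could be substituted, and the level sets of the returned vector then give the minimizers of $f(S) - \beta\, w(S)$, as formalized in the next subsection.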
30 .1 On the Use of General Positive Weights The goal of this section is to extend the above results on integer w to all positive real numbers. Thisallows us to do a number of interesting things, particularly in the case of network flow algorithms.It also extends the state of the known theory on general submodular function minimization outsideof the cases we are interested in. We here state the form of theorem 6.2 for general w , then discusssome of the implications for the case of network flows and the continuous optimization problemsintroduced earlier. In particular, this theorem allows us a way to include the piecewise linear ξ i ( u i )term in ( (cid:60) B ). Theorem 6.3. Let f be a submodular function defined on D ⊂ V , and let w ∈ R |V| be strictlypositive, finite weights. Then I. Let z ∗ be the optimal solution to the weighted minimum norm problem, i.e. z ∗ = argmin z ∈ B ( f ) (cid:88) i ∈V z i w i , (103) and, for all β ∈ R , let U ( β ) = { i ∈ V : z ∗ i − βw i < } (104) U ( β ) = { i ∈ V : z ∗ i − βw i ≤ } (105) and let S ∗ ( β, w ) = Argmin S ∈D f ( S ) − β w ( S ) . (106) Then z ∗ is the optimal solution to (103) if and only if, for all β ∈ R and all S ∗ ∈ S ∗ ( β, w ) , U ( β ) ⊆ S ∗ ⊆ U ( β ) (107) II. Furthermore, for all β , U ( β ) is the unique minimal solution to (106) and U ( β ) is the uniquemaximal solution in S ∗ ( β, w ) .Proof. This result is somewhat involved and quite technical. We present a proof of it, along withsupporting lemmas, in section A.1 on page 45.The above problem allows us to generalize the previous results of size-penalized submodularoptimization to general weighted penalties. This result may have several significant practical impli-cations; several of these we explore later in the context of the network flow analysis results.An interesting corollary to the above theorem is that the original formulation of the minimumnorm problem is still valid when the norm being optimized over is reweighted. It may be thatthis would open up an way to remove some of the numerical difficulties often encountered with theminimum norm problem [Jegelka et al., 2011]. More formally, Corollary 6.4 (Validity of Weighted Minimum Norm.) . Let z ∗ be the optimal value of the weightedminimum norm problem, with w ∈ R |V| , w > . Let U = { i : z ∗ i < } and U = { i : z ∗ i ≤ } , Thenboth U and U minimize f ( U ) over all subsets U ∈ D . Furthermore, for all U † ∈ D such that f ( U † ) = min U ∈V , U ⊆ U † ⊆ U . In other words, the weighted minimum norm vector z ∗ may besubstituted for the original minimum norm vector.Proof. Set β = 0 in theorem 6.3. 31 .1.1 Handling the Case of w i = 0One of the challenging aspects here is that we might be interested in the case of w i = 0. In theory,this can be easily handled by simply allowing w i to be so small that its effect on the problem isnegligibly different from w i = 0; in other words, we can see it as the limit w i (cid:38) 0. In practice, thisleads to numerical issues. Thus we propose here a numerically stable method to work with w i = 0by investigating the limiting behavior. Theorem 6.5. Let w ≥ and define Q = { i ∈ V : w i = 0 } . Then let z ∗ Q = min z ∈ B ( f ) (cid:88) i ∈ Q z i . (108) and let z ∗ = argmin z ∈ B ( f ) (cid:80) i ∈ Q (cid:107) z [ Q ] (cid:107) = (cid:107) z ∗ Q [ Q ] (cid:107) (cid:88) i/ ∈ Q z i w i . (109) where z [ Q ] denotes the vector of elements of z in Q . Then, for z ∗ ε = argmin z ∈ B ( f ) (cid:88) i ∈V z i max ( ε, w i ) , (110) we have that z ∗ ε → z ∗ as ε (cid:38) Proof. 
Consider the form of the optimization problem in (110). For ε sufficiently small, we havethat z ∗ ε = argmin z ∈ B ( f ) (cid:88) i/ ∈ Q z i w i + ε − (cid:88) i ∈ Q z i (112)As ε − (cid:37) ∞ , the optimum value of z is constrained to be on the simplex in which (cid:80) i ∈ Q z i isminimal. This value is given by C Q above, and this is the constraint that is enforced explicitly in(109). Since everything is continuous, it is valid to take the limit as ε (cid:38) 0. Thus the theorem isproved.It is outside the current realm of our investigation how to implement this in the inner workingsof the general minimum norm algorithm; however, we will revisit this issue later when proving thecorrectness of the network flow version of the weighted reduction algorithm. We now turn our attention to the specific case of network flows. The linear parametric flow problemis similar to the flow problem described earlier, except that now we allow the capacity functions –analogous to the unary energy terms – to be a non-decreasing linear function of the weighting term β . Previously, we treated this global weighting term as having equal influence on all nodes. Thissection considers the case where the influence of β has a different weight on each node. Specifically,we replace β with βw i in theorem 4.1. In this case, we are still able to compute the entire pathdirectly. This result, while interesting in its own right, also sets the stage for our later results fortotal variation minimization.To be specific, we extend the problems in theorem 4.1 as follows:32 nergy Minimization: P E ( β, w ) . Let E and E be defined as in theorem 4.1, and let X ∗ ( β, w ) = Argmin x ∈{ , } n (cid:88) i ∈V ( E i ( x i ) − βw i x i ) + (cid:88) ( i,j ) ∈E i Replace β with βw i or β w as appropriate in the proof of theorem 4.1.However, it is a much more complicated endeavor to show that this formulation can be solvedexactly in a similar manner to the previous result. In the end, we prove the following result:analogously to before, we find an optimal pseudoflow α ∗ such that the zero crossings of r ( α ∗ ) − β w ,with r ( α ) defined as in 4.4, give the level sets of the augmented problem. The purpose of the currentsection is to define these relationships explicitly.The above result works as well for general network flow solutions as well. In this case, we havethe following immediate corollary to theorem 6.3: Corollary 7.2. Let w be a collection of positive weights. Then α ∗ is the optimal pseudoflow solutionto the weighted minimum norm problem α ∗ = argmin α ∈A (cid:88) i ∈V r ( α ) w i (121)33 f and only if ∀ β ∈ R , the optimal cut S ∗ ( β, w ) for P G satisfies { i ∈ V : r i ( α ∗ ) < βw i } ⊆ S ∗ ( β, w ) ⊆ { i ∈ V : r i ( α ∗ ) ≤ βw i } . (122) Proof. Replace B ( f ) with its representation for network flow problems as given in theorem 4.4.An alternate version of this, which is handy for the proofs, is the following: Corollary 7.3. α † is optimal for a problem if and only if, for all β ∈ R , the following conditionholds for each ordered pair i, j ∈ V , i < j :if r i ( α † ) − βw i ≤ < r i ( α † ) − βw j , then α ij = −| q ij | (123) if r i ( α † ) − βw i > ≥ r i ( α † ) − βw j , then α ij = | q ij | (124) Proof. Follows immediately from noting that criteria (122) in corollary 7.2 specifies that if thereexists a β that separates two scaled reductions, then there is valid cut in the associated flow problemthat separates these nodes. This, however, is equivalent to the condition on the α terms given by(123) or (124). 
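Corollaries 7.2 and 7.3 translate directly into small sanity checks. The sketch below is an illustrative helper under our own naming, not part of the algorithms developed here, and it assumes the reductions $r_i(\alpha)$ have already been computed for a candidate pseudoflow: the first function verifies the edgewise saturation conditions (123) and (124), and the second enumerates the breakpoints and nested minimizers implied by corollary 7.2.

\begin{verbatim}
def satisfies_corollary_7_3(alpha, r, w, q, tol=1e-9):
    # Edgewise optimality check for a candidate pseudoflow (illustrative helper).
    # alpha[(i, j)]: flow on edge (i, j), i < j;  r[i]: reduction r_i(alpha);
    # w[i] > 0: node weight;  q[(i, j)]: pairwise term, so the capacity is |q[(i, j)]|.
    for (i, j), a_ij in alpha.items():
        cap = abs(q[(i, j)])
        # Some beta separates the scaled reductions iff r_i / w_i != r_j / w_j.
        if r[i] / w[i] < r[j] / w[j] and abs(a_ij + cap) > tol:
            return False      # condition (123): edge must be saturated at -|q_ij|
        if r[i] / w[i] > r[j] / w[j] and abs(a_ij - cap) > tol:
            return False      # condition (124): edge must be saturated at +|q_ij|
    return True

def cut_path_from_reductions(r, w):
    # Corollary 7.2: node i belongs to the largest minimizer once beta >= r_i / w_i,
    # so the breakpoints are the distinct scaled reductions and the cuts are nested.
    thresholds = {i: r[i] / w[i] for i in r}
    path = []
    for beta in sorted(set(thresholds.values())):
        smallest = {i for i, t in thresholds.items() if t < beta}    # strict level set
        largest = {i for i, t in thresholds.items() if t <= beta}    # closed level set
        path.append((beta, smallest, largest))
    return path
\end{verbatim}

For instance, with r = {0: -1.0, 1: 0.5} and w = {0: 1.0, 1: 2.0}, the breakpoints are -1.0 and 0.25, and the minimizers grow from the empty set to {0} to {0, 1} as $\beta$ increases, illustrating the nesting property.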
While the original interpretation of the weights is to vary the influence of each node on the regularization path, a natural extension in the network flow context is to use these weights to fix nodes at particular reduction values; informally, to make them less responsive to the influence of the external flow during the optimization. This is done by scaling both the base reduction value $q_{ii}$ and the $\beta$ term by $w_i$, rather than scaling only the $\beta$ term. In this context, we replace $r(\alpha)$ with $r^w(\alpha)$, defined as
$r^w_i(\alpha) = w_i q_{ii} + \tfrac{1}{2} \sum_{i'} \cdots$

Algorithm 2: WeightedBisectionCut
Input : Submodular Q , weights w , subset of nodes T . Output : Bisecting cut S ⊆ T or ∅ if all nodes in same partition. ρ ← (cid:80) i ∈ T w i = 0 then // In this case, the optimal cut that divides the reductions into the positive or negativecomponents is all that matters; w i = 0 means this result is true for all β . for i ∈ T do ρ i ← q ii , else µ ← (cid:80) i ∈ T q ii w i (cid:80) i ∈ T w i . // M here denotes the largest numerically stable number. for i ∈ T do ρ i ← (cid:26) q ii − w − i β µ ( T ) w i > M sign ( β µ ( T )) w i = 0 , if ρ [ T ] = 0 then return ∅ // These nodes are all at a common reduction level already. Q (cid:48) ← Q [ T ] with diagonal replaced by ρ [ T ] . S ∗ T ← Minimum cut on ( T, E [ T ] ), with capacities formed from Q (cid:48) by theorem 4.1. return S ∗ T Next, suppose that there exists a β such that r i ( α † ) − βw i ≤ < r j ( α † ) − βw j (131)Then, according to the termination criterion of WeightedBisectionCut – in which it returns S ∗ T = T or S ∗ T = ∅ – the edge between nodes i and j will be saturated in such a way that the reductionvalue of r i ( α † ) is maximized and the reduction value of r j ( α † ) is minimized, i.e. α ij = −| q ij | . Thuscondition (123) in corollary 7.3 is satisfied. Similarly, if the order in 131 is reversed, then condition(124) is satisfied.As the above holds for any pairs of nodes i, j , the optimality criteria of corollary 7.3 are satisfiedglobally, proving the correctness of the algorithm. This section improves upon the result of theorem 4.13 given at the end of section 4: using theweighting method above, it is possible to anchor weights at specific values in the optimization. Thisindicates that the reduction levels of these nodes do not vary in the optimization even though theyare treated the same way as the other nodes. This provides a simple way of incorporating an L fitpenalty into our optimization. Formally, the following corollary to theorem 4.13 lays this out: Theorem 8.1. Consider ( (cid:60) B ) : suppose γ : R n → R + can be expressed as γ ( u ) = const + (cid:107) u − a (cid:107) + λ (cid:88) i ξ i ( u i ) + (cid:88) i,j w ij | u i − u j | (132)36 lgorithm 3 : FindWeightedReductions Input : Submodular Q , vector of non-negative weights w . Output : α ∗ satisfying condition 121. // Begin by calling WeightedBisectReductions below on the full set V to get α . return WeightedBisectReductions ( T = V , α = , Q ) // Note: α [ T ] denotes α restricted to edges with both nodes in T . WeightedBisectReductions ( T , α , Q ) S ∗ T ← WeightedBisectionCut ( Q , w , T ) if S ∗ T = ∅ or S ∗ T = T then // We are done on this set. return α [ T ] // Fix the flow on edges in the cut by adjusting the source/sink capacities of each node,then removing those edges. for i ∈ S ∗ T , j ∈ T \ S ∗ T , i < j do α ij ← − q ij , q ii ← q ii − q ij , q ij ← for i ∈ T \ S ∗ T , j ∈ S ∗ T , i < j do α ij ← q ij , q jj ← q jj + q ij , q ij ← // Recursively solve on the two partitions to fix the other α ’s. α [ S ∗ T ] ← WeightedBisectReductions ( S ∗ T , α , Q ) α [ T \ S ∗ T ] ← WeightedBisectReductions ( T \ S ∗ T , α , Q ) return α [ T ] with a , b ∈ R n , λ ≥ , and w ij ≥ , and where ξ i is a convex piecewise-linear function. Then theminimizer u ∗ = argmin u ∈ R n γ ( u ) (133) can be found exactly using Algorithm 3.Proof. 
First, to accommodate the ξ i ( u i ) functions, suppose, without loss of generality, that eachfunction is defined by a sequence of m i inflection points − ∞ = b i < b i < b i < · · · < b im i = ∞ (134)and an associated line slope θ ij between each, so θ ij = ξ i ( b ij ) − ξ i ( b i,j − ) b ij − b i,j − . (135)Then, since ξ i ( x ) is linear, it is easy to show that at any point x , ξ i ( x ) can be expressed as theintegral over sums of step functions; if c − i and c + i are the values of these step functions, ξ i ( u i ) = const + ˆ u i − M (cid:88) j : b ij ≤ x c + ij + (cid:88) j : b ij >x (cid:48) c − ij d x. (136)37he exact values of c + ij and c − ij can be found easily by solving a simple linear system, and theycan satisfy c + ij + c − ij = 0 by adjusting the constant term in front of the integral; this ensures thesepairwise terms satisfy the submodularity condition. Thus ξ i ( · ), can be encoded on the graph bycreating m i auxiliary nodes u (cid:48) i,j fixed at reduction values b i , ..., b i,m i using the method described insection 7.0.2. Setting the overall weight of the edge using E ( x i , x (cid:48) ij ) = (cid:20) c + ij c − ij (cid:21) (137)ensures that the cost of a cut at any region between b ij and b i,j +1 incurs cost (cid:88) j (cid:48) ≤ j c + ij (cid:48) + (cid:88) j (cid:48) >j c − ij (cid:48) . (138)Thus, in the final integral, the reduction level u i incurs the penalty const + ξ ( u i ). The proof thenfollows identically to theorem 4.13 with some minor modifications – namely, for the auxiliary nodes,we are actually working with: { M · u n + i ≤ M · β } (139)but this easily converts to { u n + i ≤ β } . (140)As the L fit terms on the auxiliary nodes become constant in the limit, the rest of the proof oftheorem 4.13 goes through for this form as well, giving us the desired result.In this chapter, In this work, we outlined the theoretical connections between network flows, results in submodularoptimization, and regularized regression problems such as the graph-guided LASSO. We rigorouslyestablished the submodular structure underlying the optimization of this problem. This motivatedseveral novel algorithms, and extended some of the theory surrounding the minimum norm algorithm.We extended the existing theory of size-constrained submodular optimization first proposed by[Nagano et al., 2011] to the weighted case. This theoretical tool has several important consequences.In the case of network flows, it gives us the ability to make nodes more or less affected by theoptimization process. This opens the door to the general optimization problem given in theorem8.1.In part 2, we extend these results to develop a full treatment of the entire regularization pathover all λ . In the second part of this work, we will explore further implications of this theory, namelya technique to algorithmically recycle solutions for network flows through the use of the structurepresented here. A Proofs In this section, we give a number of the proofs needed for the previous theorems; they are given herefor readability. 38 roof of Theorem 4.1. To show the equivalence between P E ( β ) and P Q ( β ), it is sufficient to verify (cid:88) i ∈V ( E i ( x i ) − βx i ) + (cid:88) ( i,j ) ∈E i 0) (141)for arbitrary x ∈ { , } n . To simplify notation, assume that E ij ( x i , x j ) = 0 for all ( i, j ) / ∈ E . 
Then (cid:88) i,j x T ( Q − diag( β )) x + (cid:88) i E i (0) + (cid:88) i 0) (142)= (cid:88) i x i ( q ii − β ) + (cid:88) i 0) (143)= (cid:88) i (cid:2) x i ( E i (1) − β − E i (0)) + E i (0) (cid:3) + (cid:88) i 1) + E ij (0 , − E ij (0 , − E ij (1 , x i ( E ij (1 , − E ij (0 , x j ( E ij (0 , − E ij (0 , (cid:3) + (cid:88) i 0) (144)= (cid:88) i (cid:26) E i (1) − β x i = 1 E i (0) x i = 0 (cid:27) + (cid:88) i From [Fujishige, 2005], we know that f β is submodular if and only if, for all T ⊂ U and v ∈ S \ U , f β ( T ∪ { v } ) − f β ( T ) ≥ f β ( U ∪ { v } ) − f β ( U ) . (154)For convenience, assume that the β is absorbed into the diagonal elements of q ii . Now[ f β ( T ∪ { v } ) − f β ( T )] − [ f β ( U ∪ { v } ) − f β ( U )] (155)= (cid:88) i,j ∈ T q ij + (cid:88) i ∈ T ( q iv + q vi ) + q vv − (cid:88) i,j ∈ T q ij − (cid:88) i,j ∈ U q ij + (cid:88) i ∈ U ( q iv + q vi ) + q vv − (cid:88) i,j ∈ U q ij (156)= (cid:34)(cid:88) i ∈ T ( q iv + q vi ) (cid:35) − (cid:34)(cid:88) i ∈ U ( q iv + q vi ) (cid:35) (157)= − (cid:88) i ∈ U \ T ( q iv + q vi ) . (158)The theorem immediately follows; this is non-negative if q ij ≤ ∀ i < j , and T, U, v can be easilychosen to make it negative if ∃ i, j such that q ij > Proof of theorem 4.4. For ease of notation, assume that βw i is absorbed into q ii . For ease of algebra,we first prove a simpler result, then show it immediately implies the desired one.40efine W = (cid:26) w ∈ M n × n : (cid:26) q ij ≤ w ij ≤ i < jw ij = 0 otherwise (cid:27) . (159)(Recall that q ij ≤ P ( f β ) can be represented by P ( f β ) = y : ∃ w ∈ W s.t. ∀ i, y i ≤ q i + (cid:88) i (cid:48) : i (cid:48)
1, and let T ∗ ∈ Argmin T ⊆V m : T ∩V∈D f (cid:48) m,β ( T ) + h ( T ) (196)be any minimizer of f (cid:48) m,β .To show that the elements k m and n + m are tied in T ∗ , assume the opposite – suppose that either[ k m ∈ T ∗ and n + m / ∈ T ∗ ] or [ k / ∈ T ∗ and n + m ∈ T ∗ ]. This, however, contradicts the optimalityof T ∗ for sufficiently large M , as the value of the minimum can always be improved by M if element n + m is included or excluded so that its membership in T ∗ matches that of k m . Thus, for optimal T ∗ , k m ∈ T ∗ if and only if n + m ∈ T ∗ .We then have that for any minimizer T ∗ of f (cid:48) m,β + h , f (cid:48) m,β ( T ∗ ) = f (cid:48) m − ,β ( T ∗ ∩ V m − ) − β { k m ∈ T ∗ } , (197)as g k m ,n + m ( { k m , n + m } ) = g k m ,n + m ( ∅ ) = 0.Now let h (cid:48) ( S ) = h ( S ) − β { k m ∈ T ∗ } . ThenArgmin S ⊆D f m,β ( S ) + h ( S ) (198)= Argmin S ⊆D f m − ,β ( S ) + h (cid:48) ( S ) (199)= (cid:40) V ∩ T ∗ : T ∗ ∈ (cid:34) Argmin T ⊆V m − : T ∩V∈D f (cid:48) m − ,β ( T ) + h (cid:48) ( T ∩ V ) (cid:35)(cid:41) (200)= (cid:26) V ∩ T ∗ : T ∗ ∈ (cid:20) Argmin T ⊆V m : T ∩V∈D f (cid:48) m,β ( T ) + h ( T ∩ V ) (cid:21)(cid:27) (201)44here (199) - (200) follows by the inductive assumption. As (199) - (200) holds for any modularfunction h (cid:48) ( S ), we have that (198) - (201) is true as well for any h ( S ), proving (195) for m =1 , , ..., W .Now, it is easy to show that f (cid:48) W,β ( T ) = f ( T ∩ V ) + W (cid:88) i =1 g k i ,n + i ( T ) − β | T | (202)= f w ( T ) − β | T | (203)For all T ∈ D (cid:48) , and f W,β ( S ) = f ( S ) − β (cid:88) i ∈V w i { i ∈ S } (204)= f ( S ) − β w ( S ) (205)for all S ∈ D , proving the first part of the theorem.( (II) ) To show (II), note that the membership of each item in V w \V exactly matches the membershipof an item in V ; as D is a distributed lattice, D w is also a distributed lattice that is closed underintersection and union.It remains to show that (94) is equivalent to (90). This follows directly from noting that for M sufficiently large, D w can be written as D w = { T ⊆ V w : T ∩ V ∈ D , f w ( T ) ≤ M/ } . (206)As we have already argued that in any optimal solution T ∗ of f w ( T ), the M terms cancel out, soany minimizer of f w ( T ) must be in D w , proving (II).( (III) ) Finally, in this case, for all { j, k } such that ∃ S ∈ D w with { j, k } ⊆ S , g jk ( S ) = 0, proving(95) and completing the proof. Proof of theorem 6.2. ( (I) ) Recall that the definition of the polymatroid associated with the sub-modular function f w is defined as P ( f w ) = (cid:110) x ∈ R |V w | : ∀ T ∈ D w , x ( T ) ≤ f w ( T ) (cid:111) (207)Now, by equation (95) in lemma 5.1, f w ( T ) = f ( T ∩ V ) (208)for all T ∈ D w . Furthermore, for all j ∈ K i , i ∈ T if and only if j ∈ T . Thus the condition x ( T ) ≤ f w ( T ) is equivalent to (cid:88) i ∈ T ∩V x ( κ i ) ≤ f ( T ∩ V ) (209)and the condition x ( V w ) = f w ( V w ) is equivalent to x ( V w ) = (cid:88) i ∈V x ( κ i ) = f ( V ) . (210)This proves part (I) 45 (II)-(III) ) Now, to prove part (II), consider minimizing (cid:107) y (cid:107) over B ( f w ). Under the constraints,we can express this problem as( (cid:63) ) minimize (cid:88) i ∈V (cid:88) j ∈ κ i y j such that (cid:88) i ∈ S y ( κ i ) ≤ f ( S ) ∀ S ∈ D (211) (cid:88) i ∈V y ( κ i ) = f ( V )Now, let z i = y ( κ i ), and define h i ( y , t ) = min (cid:88) j ∈ κ i y j : (cid:88) j ∈ κ i y j = t (212)The objective ( (cid:63) ) in (211) above can then be expressed as( (cid:63) ) = minimize (cid:88) i ∈V h i ( y , y ( κ i )) . 
(213)This, however, has an analytic solution, as the L -norm above is minimized when y is equal on theset κ i , i.e. y j = t | κ i | ∀ j ∈ κ i . (214)Thus h i ( y , t ) = h i ( t ) = | κ i | (cid:18) t | κ i | (cid:19) = t | κ i | = t w i . (215)This proves equation (100), and allows us to rewrite (211) asminimize (cid:88) i ∈V z i /w i such that z ( S ) ≤ f ( S ) ∀ S ∈ D (216) z ( V ) = f ( V )which is exactly identical to the optimization in equation (101). Given that y is constant on each κ i , we have already proved equation (102) by setting t = z i in equation (214). A.1 Proof of Correctness for the Weighted Minimum Norm Problem Before proving theorem 6.3, we must establish a lemma that shows convergence of the mappingin (103) under various perturbations. The reason that this lemma is needed is that the crux ofthe proof of theorem 6.3 involves showing that all positive w ∈ R n can be seen as the limit of asequence of problems with integer w ∈ Z n . The crux of this lemma is found in Wets [2003], wherea convenient theorem shows that mappings of the type (103) are locally Lipshitz-continuous withrespect to perturbations of the objective function. This continuity is sufficient to ensure that thelimiting argument employed in the proof of theorem 6.3 is valid.46 emma A.1. Let D be a convex set, and let g ∗ ( δ , w ) = argmin z ∈ D n (cid:88) i =1 ( z i + δ i ) w i (217) for δ , w ∈ R n , with w ≥ η > bounded away from . Then g ∗ ( δ , w ) is a locally Lipshitz-continuousfunction of δ and w . In other words, for every point ( δ , w ) ∈ R n × [ η, ∞ ) n , there exists a neighbor-hood U ⊂ ( δ , w ) , ( δ , w ) ∈ U , such that g ∗ is Lipshitz-continuous on U .Proof. Define h : R n × R n × R n (cid:55)→ R ∩ {∞} as an extended version of the function g : h ( δ , w , z ) = (cid:40) (cid:80) ni =1 ( z i + δ ) w i z ∈ B ( f ) ∞ otherwise . (218)The lemma follows from showing that h satisfies the properties of Theorem 3.4 of Wets [2003],which states that the inf -mapping of (217) is locally Lipshitz continuous if mild conditions on h aresatisfied. These conditions follow immediately as h is Lipshitz continuous in w , δ , and z , and thedomain of z is not affected by the w and δ . This gives us the desired result. Proof of theorem 6.3. With this lemma in place, we are now ready to provide the proof of theorem6.3. The proof proceeds in two stages. First, we show that it is valid for positive rational w ∈ Q n ++ ,where Q ++ = { x ∈ Q : x > } . Then, we use a carefully constructed limiting argument extends thisto all real weights. Step 1: First, assume that the weights are positive rationals. Thus we may write w i = N i M i , (219)where N i ∈ Z ++ and M i ∈ Z ++ . Then, let M = (cid:89) i ∈V M i , (220)so M w i is an integer for all i ∈ V . Now note that we can immediately map the original problem tothis form by simply setting w (cid:48) i = M w i (221) β (cid:48) = β/M (222)Then βw i = β (cid:48) w (cid:48) i , with w (cid:48) i being an integer.Let V w (cid:48) , D w (cid:48) , and f w (cid:48) be as defined in theorem 6.2, and let z ∗ = argmin z ∈ B ( f ) (cid:88) i ∈V z i w i = argmin z ∈ B ( f ) M (cid:88) i ∈V z i w i = argmin z ∈ B ( f ) (cid:88) i ∈V z i w (cid:48) i (223)and let y ∗ = z ∗ w (cid:48) . (224)47y theorem 6.2, we know that z ∗ is the optimal solution to (223) if and only if y ∗ is the optimalsolution to y ∗ = argmin y ∈ B ( f w (cid:48) ) (cid:107) y (cid:107) . 
(225)Now, by theorem 6.2-(III) and theorem 4.11-(I), we have that y ∗ is optimal for (225) if and only if ∀ β (cid:48) ∈ R , all optimal solutions T ∗ ( β (cid:48) ) = argmin T ∈D w (cid:48) ( f w (cid:48) ( T ) − β (cid:48) ) (226)satisfy U (cid:48) ( β (cid:48) ) = { i ∈ V : y ∗ i − β (cid:48) < } ⊆ T ∗ ( β (cid:48) ) ⊆ { i ∈ V : y ∗ i − β (cid:48) ≤ } = U (cid:48) ( β (cid:48) ) . (227)However, substituting y ∗ i = z ∗ i /w (cid:48) i into equation (227) immediately gives us the equivalence { i ∈ V : z ∗ i − β (cid:48) w (cid:48) i < } ⊆ T ∗ ( β (cid:48) ) ⊆ { i ∈ V : z ∗ i − β (cid:48) w (cid:48) i ≤ } . (228)Letting T † ( β ) = T ∗ ( β/M ), and recalling that β (cid:48) w (cid:48) i = βw i , we have that U ( β ) = { i ∈ V : z ∗ i − βw i < } ⊆ T † ( β ) ⊆ { i ∈ V : z ∗ i − βw i ≤ } = U ( β ) . (229)Thus U (cid:48) ( β (cid:48) ) = U ( β ) and U (cid:48) ( β (cid:48) ) = U ( β ), and we have proved (I) for w ∈ Q n ++ .Part (II) follows similarly. y ∗ in (225) is the minimum norm vector for the extended problem f w (cid:48) , soby theorem 4.11-(II) U (cid:48) ( β (cid:48) ) and U (cid:48) ( β (cid:48) ) are the smallest and largest minimizers of f ( S ) − β (cid:48) w (cid:48) ( S ) = f ( S ) − β w ( S ). However, U (cid:48) ( β (cid:48) ) = U ( β ) and U (cid:48) ( β (cid:48) ) = U ( β ), so U ( β ) and U ( β ) satisfy part (II).Thus we have proved the theorem for rational w . Step 2: Now, it remains to show that this result extends to all real weights as well. This is moredifficult than it immediately seems, as z ∗ depends on w through an optimization over B ( f ), and,unless D = 2 V , B ( f ) is not necessarily bounded. Thus it takes some care to show that the setsgenerated by the inequalities in (104) and (105) are the limit points corresponding to a sequence ofrational w , and that they are the smallest and largest minimizers of f ( S ) − β w ( S ) for all β .First, for convenience, define f ˜ w ,β : D (cid:55)→ R as f ˜ w ,β = f ( S ) − β ˜ w ( S ) (230)and recall that this is a submodular function for any ˜ w ∈ R n ++ . Let y ∗ ( ˜ w , β ) be the correspondingminimum norm vector: y ∗ ( ˜ w , β ) = argmin y ∈ B ( f ˜ w ,β ) (cid:107) y (cid:107) . (231)Similarly, denote z ∗ as a function of ˜ w as well, i.e. z ∗ ( ˜ w ) = argmin z ∈ B ( f ) (cid:88) i ∈V z i ˜ w i . (232)Now suppose that w / ∈ Q n , so the above proof for rational w does not apply. As Q is dense on R and w > 0, there exists a sequence of positive rationals ω , ω , ... , ω i ∈ Q n such thatlim i → ∞ ω i = w . (233)48y the minimum norm theorem and the fact that we have proved the theorem in question for allrational ω m , we know that, for all i and m , U ( β, ω m ) = { i : z ∗ i ( ω m ) − βω m,i < } = { i : y ∗ i ( ω m , β ) < } (234) U ( β, ω m ) = { i : z ∗ i ( ω m ) − βω m,i ≤ } = { i : y ∗ i ( ω m , β ) ≤ } . (235)where we have made the dependence of the minimal and maximal sets U ( β ) and U ( β ) in (104)and (105) on ω m explicit since we are working with a sequence of problems given by ω m . Thus, bytheorem 4.11, for all ˜ w ∈ { w , ω , ω , ... } , U ( β, ˜ w ) = { i : y ∗ i ( ˜ w , β ) < } (236)and U ( β, ˜ w ) = { i : y ∗ i ( ˜ w , β ) ≤ } (237)are the unique smallest and largest minimizing sets of f ˜ w ,β , respectively. 
Thus we are done if wecan show that, for all β , lim m → ∞ z ∗ ( ω m ) − β ω m = z ∗ ( w ) − β w (238)and lim m → ∞ y ∗ ( ω m , β ) = y ∗ ( w , β ) , (239)as this immediately implies that (234) and (235) hold in the limit as well.First, let D be a convex domain, and consider the function g ∗ D : R n × (0 , ∞ ) (cid:55)→ D , where g ∗ D ( δ , w ) = argmin z ∈ D n (cid:88) i =1 ( z i + δ i ) w i . (240)As w i > g ∗ D ( δ , w ) is locally Lipshitz-continuous in w by lemma A.1. Thus,for all fixed D and sequences (˜ δ m , ˜ w m ) such that(˜ δ m , ˜ w m ) → ( δ , w ) as m −→ ∞ , (241)we have that g ∗ D (˜ δ m , ˜ w m ) → g ∗ D ( δ m , w m ) as m −→ ∞ . (242)Now, we may assume w.l.o.g. that ω m > 0. Then we immediately have that z ∗ ( ω m ) = g ∗ B ( f ) ( , ω m ) −→ g ∗ B ( f ) ( , w ) = z ∗ ( w ) as m −→ ∞ , (243)proving (238). 49o show (239), let δ m = ω m − w , and note that B ( f ω m ,β ) = (cid:110) x ∈ R n : x ( S ) ≤ f ( S ) − β ω m ( S ) ∀ S ∈ D , x ( V ) = f ( V ) − β ω m ( V ) (cid:111) (244)= (cid:110) x + β [ w − ω m ] : x ( S ) ≤ f ( S ) − β ω m ( S ) − β [ w ( S ) − ω m ( S )] ∀ S ∈ D , x ( V ) = f ( V ) − β ω m ( V ) − β [ w ( V ) − ω m ( V )] (cid:111) (245)= (cid:110) x + β [ w − ω m ] : x ( S ) ≤ f ( S ) − β w ( S ) ∀ S ∈ D , x ( V ) = f ( V ) − β w ( V ) (cid:111) (246)= { x − β δ m : x ∈ B ( f w ,β ) } . (247)Thus y ∗ ( ω m , β ) = argmin y ∈ B ( f ω m,β ) (cid:107) y (cid:107) (248)= argmin y (cid:48) ∈ B ( f w ,β ) (cid:107) y (cid:48) − β δ m (cid:107) (249)= g ∗ B ( f w ,β ) ( β δ m , ) . (250)Thus, since δ m → , y ∗ ( ω m , β ) = g ∗ B ( f w ,β ) ( β δ m , ) −→ g ∗ B ( f w ,β ) ( , ) = y ∗ ( w , β ) as m −→ ∞ , (251)proving (239). The theorem is proved. 50 eferences William K Allard. Total variation regularization for image denoising, i. geometric theory. SIAMJournal on Mathematical Analysis , 39(4):1150–1190, 2007.William K Allard. Total variation regularization for image denoising, ii. examples. SIAM Journalon Imaging Sciences , 1(4):400–417, 2008.William K Allard. Total variation regularization for image denoising, iii. examples. SIAM Journalon Imaging Sciences , 2(2):532–568, 2009.Luigi Ambrosio and Simone Di Marino. Equivalent definitions of bv space and of total variation onmetric measure spaces. preprint , 2012.Francis Bach. Convex analysis and optimization with submodular functions: a tutorial. arXivpreprint arXiv:1010.4207 , 2010a.Francis Bach. Shaping level sets with submodular functions. arXiv preprint arXiv:1012.1501 , 2010b.Francis Bach. Learning with submodular functions: A convex optimization perspective. arXivpreprint arXiv:1111.6453 , 2011.Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization withsparsity-inducing penalties. arXiv preprint arXiv:1108.0775 , 2011.Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsitythrough convex optimization. Statistical Science , 27(4):450–468, 2012.Johnathan M Bardsley. An efficient computational method for total variation-penalized poissonlikelihood estimation. Inverse Problems and Imaging , 2(2):167–185, 2008.Johnathan M Bardsley and Aaron Luttman. Total variation-penalized poisson likelihood estimationfor ill-posed problems. Advances in Computational Mathematics , 31(1-3):35–59, 2009.Amir Beck and Marc Teboulle. Fast gradient-based algorithms for constrained total variation imagedenoising and deblurring problems. Image Processing, IEEE Transactions on , 18(11):2419–2434,2009.Julien Bect, Laure Blanc-F´eraud, Gilles Aubert, and Antonin Chambolle. 
An (cid:96) -unified variationalframework for image restoration. In Computer Vision-ECCV 2004 , pages 1–13. Springer, 2004.Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning onlarge graphs. In Learning theory , pages 624–638. Springer, 2004.Giovanni Bellettini, Vicent Caselles, and Matteo Novaga. The total variation flow in R n . Journalof Differential Equations , 184(2):475–525, 2002.Julian Besag, Peter Green, David Higdon, and Kerrie Mengersen. Bayesian computation and stochas-tic systems. Statistical Science , pages 3–41, 1995.P.J. Bickel, B. Li, A.B. Tsybakov, S.A. van de Geer, B. Yu, T. Vald´es, C. Rivero, J. Fan, andA. van der Vaart. Regularization in statistics. Test , 15(2):271–344, 2006.C.M. Bishop. Pattern recognition and machine learning . Springer, 2006.51evin Bleakley and Jean-Philippe Vert. The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199 , 2011.Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms forenergy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence ,pages 1124–1137, 2004.Thomas Brox and Daniel Cremers. On the statistical interpretation of the piecewise smoothmumford-shah functional. In Scale Space and Variational Methods in Computer Vision , pages203–213. Springer, 2007.F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation and Sparsity Via 1 Penalized Least Squares. Learning theory , pages 379–391, 2006.F. Bunea, A. Tsybakov, M.H. Wegkamp, et al. Sparsity oracle inequalities for the Lasso. ElectronicJournal of Statistics , 1:169–194, 2007.Vicent Caselles, Antonin Chambolle, and Matteo Novaga. Total variation in imaging. Handbook ofMathematical Methods in Imaging , pages 1016–1057, 2011.Antonin Chambolle, Vicent Caselles, Daniel Cremers, Matteo Novaga, and Thomas Pock. An intro-duction to total variation for image analysis. Theoretical foundations and numerical methods forsparse recovery , 9:263–340, 2010.Tony F Chan and Selim Esedoglu. Aspects of total variation regularized l function approximation. SIAM Journal on Applied Mathematics , 65(5):1817–1837, 2005.Tony F Chan and Luminita A Vese. Image segmentation using level sets and the piecewise-constantmumford-shah model. In Tech. Rep. 0014, Computational Applied Math Group . Citeseer, 2000.Xi Chen, Seyoung Kim, Qihang Lin, Jaime G Carbonell, and Eric P Xing. Graph-structuredmulti-task regression and an efficient optimization method for general fused lasso. arXiv preprintarXiv:1005.3579 , 2010.Xi Chen, Qihang Lin, Seyoung Kim, Jaime G Carbonell, and Eric P Xing. Smoothing proximalgradient method for general structured sparse regression. The Annals of Applied Statistics , 6(2):719–752, 2012.Haeran Cho and Piotr Fryzlewicz. Multiscale interpretation of taut string estimation and its con-nection to unbalanced haar wavelets. Statistics and computing , 21(4):671–681, 2011.Vaek Chvtal. Linear Programming . W. H. Freeman and Company, New York, 1983.T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to algorithms . The MIT press,2001.George Bernard Dantzig and Delbert R Fulkerson. On the max flow min cut theorem of networks.1955.J´erˆome Darbon and Marc Sigelle. Image restoration with discrete constrained total variation part i:Fast and exact optimization. Journal of Mathematical Imaging and Vision , 26(3):261–276, 2006.PL Davies and Monika Meise. Approximating data with weighted smoothing splines. 
Journal of Nonparametric Statistics, 20(3):207–228, 2008.
Lutz Dümbgen and Arne Kovac. Extensions of smoothing via taut strings. Electronic Journal of Statistics, 3:41–75, 2009.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–451, 2004.
N. El Zehiry, S. Xu, P. Sahoo, and A. Elmaghraby. Graph cut optimization for the Mumford-Shah model. In The Seventh IASTED International Conference on Visualization, Imaging and Image Processing, pages 182–187. ACTA Press, 2007.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Department of Statistics, Stanford University, Tech. Rep., 2008a.
Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008b.
Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
S. Fujishige. Submodular Functions and Optimization, volume 58. Elsevier Science, 2005.
Giorgio Gallo, Michael D. Grigoriadis, and Robert E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1):30–55, 1989.
Hao Gao, Hongkai Zhao, et al. Multilevel bioluminescence tomography based on radiative transfer equation part 2: total variation and l1 data fidelity. Opt. Express, 18(3):2894–2912, 2010.
E. Giusti. Minimal Surfaces and Functions of Bounded Variation, volume 80. Birkhäuser, 1984.
Donald Goldfarb and Wotao Yin. Parametric maximum flow algorithms for fast total variation minimization. SIAM Journal on Scientific Computing, 31(5):3712–3743, 2009.
Bart Goris, Wouter Van den Broek, K. J. Batenburg, Hamed Heidari Mezerji, and Sara Bals. Electron tomography based on a total variation minimization reconstruction technique. Ultramicroscopy, 113:120–130, 2012.
Martin Grötschel, László Lovász, and Lex Schrijver. Geometric algorithms and combinatorial optimization. Algorithms and Combinatorics, 2:1–362, 1993.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Linear Methods for Regression. Springer, 2009.
Dorit S. Hochbaum. The pseudoflow algorithm and the pseudoflow-based simplex for the maximum flow problem. In Integer Programming and Combinatorial Optimization, pages 325–337. Springer, 1998.
Dorit S. Hochbaum. The pseudoflow algorithm: A new algorithm for the maximum-flow problem. Operations Research, 56(4):992–1009, 2008.
Dorit S. Hochbaum and Sung-Pil Hong. About strongly polynomial time algorithms for quadratic optimization over submodular constraints. Mathematical Programming, 69(1-3):269–309, 1995.
Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
Satoru Iwata, Lisa Fleischer, and Satoru Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM (JACM), 48(4):761–777, 2001.
Rishabh Iyer, Stefanie Jegelka, and Jeff A. Bilmes. Fast semidifferential-based submodular function optimization. In International Conference on Machine Learning (ICML), Atlanta, Georgia, 2013.
Stefanie Jegelka and Jeff Bilmes. Cooperative cuts for image segmentation. Technical Report UWEETR-2010-0003, University of Washington, Seattle, 2010.
Stefanie Jegelka and Jeff A. Bilmes. Multi-label cooperative cuts. In CVPR 2011 Workshop on Inference in Graphical Models with Structured Potentials, Colorado Springs, CO, June 2011. URL http://users.cecs.anu.edu.au/~julianm/cvpr2011.html.
Stefanie Jegelka, Hui Lin, and Jeff A. Bilmes. Fast approximate submodular minimization. In Neural Information Processing Society (NIPS), Granada, Spain, December 2011.
Stephen L. Keeling, Christian Clason, Michael Hintermüller, Florian Knoll, Antoine Laurain, and Gregory Von Winckel. An image space approach to Cartesian based parallel MR imaging with total variation regularization. Medical Image Analysis, 16(1):189–200, 2012.
Seyoung Kim, Kyung-Ah Sohn, and Eric P. Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, pages 33–50, 1978.
V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
Arne Kovac and Andrew D. A. C. Smith. Nonparametric regression on a graph. Journal of Computational and Graphical Statistics, 20(2):432–447, 2011.
Hui Lin and Jeff Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In North American Chapter of the Association for Computational Linguistics/Human Language Technology Conference (NAACL/HLT-2010), Los Angeles, CA, June 2010.
Hui Lin and Jeff Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, July 2012. AUAI.
Jun Liu, Lei Yuan, and Jieping Ye. An efficient algorithm for a class of fused lasso problems. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 323–332. ACM, 2010.
Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Network flow algorithms for structured sparsity. arXiv preprint arXiv:1008.5209, 2010.
Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Convex and network flow optimization for structured sparsity. The Journal of Machine Learning Research, 12:2681–2720, 2011.
N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2009.
Vincent Michel, Alexandre Gramfort, Gaël Varoquaux, Evelyn Eger, and Bertrand Thirion. Total variation regularization for fMRI-based prediction of behavior. IEEE Transactions on Medical Imaging, 30(7):1328–1340, 2011.
David Mumford and Jayant Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, 42(5):577–685, 1989.
Kiyohito Nagano, Yoshinobu Kawahara, and Kazuyuki Aihara. Size-constrained submodular minimization through minimum norm base. In Proc. ICML, volume 23, 2011.
Mukund Narasimhan and Jeff Bilmes. Local search for balanced submodular clusterings. In Twentieth International Joint Conference on Artificial Intelligence (IJCAI07), Hyderabad, India, January 2007.
Mukund Narasimhan, Nebojsa Jojic, and Jeff Bilmes. Q-clustering. In Neural Information Processing Society (NIPS), Vancouver, Canada, December 2005.
Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical report, 2007.
James B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, 2009.
Thomas Pock, Daniel Cremers, Horst Bischof, and Antonin Chambolle. An algorithm for minimizing the Mumford-Shah functional. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1133–1140. IEEE, 2009.
Wolfgang Ring. Structural properties of solutions to total variation regularization problems. ESAIM: Mathematical Modelling and Numerical Analysis, 34(04):799–810, 2000.
Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.
Alex Sawatzky, Christoph Brune, Jahn Müller, and Martin Burger. Total variation processing of images with Poisson statistics. In Computer Analysis of Images and Patterns, pages 533–540. Springer, 2009.
A. Schrijver. Combinatorial Optimization, volume 24. Springer, 2003.
Alexander Schrijver. A combinatorial algorithm minimizing submodular functions in strongly polynomial time. Journal of Combinatorial Theory, Series B, 80(2):346–355, 2000.
Dhruv B. Sharma, Howard D. Bondell, and Hao Helen Zhang. Consistent group identification and variable selection in regression with correlated predictors. Journal of Computational and Graphical Statistics, 22(2):319–340, 2013.
P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. arXiv preprint arXiv:1010.5511, 2010.
Zhen Tian, Xun Jia, Kehong Yuan, Tinsu Pan, and Steve B. Jiang. Low-dose CT reconstruction via edge-preserving total variation regularization. Physics in Medicine and Biology, 56(18):5949, 2011.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
S. A. van de Geer. On non-asymptotic bounds for estimation in generalized linear models with highly correlated design. Lecture Notes-Monograph Series, pages 121–134, 2007.
S. A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36(2):614, 2008.
Roger J-B Wets. Lipschitz continuity of inf-projections. Computational Optimization and Applications, 25(1-3):269–282, 2003.
Thomas Wunderli. Total variation time flow with quantile regression for image restoration. Journal of Mathematical Analysis and Applications, 2013.
Gui-Bo Ye and Xiaohui Xie. Split Bregman method for large scale fused lasso. Computational Statistics & Data Analysis, 55(4):1552–1569, 2011.
Yao-Liang Yu. Better approximation and faster algorithm using the proximal average. In Advances in Neural Information Processing Systems 26, pages 458–466. 2013. URL http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/295.pdf.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Hui Zou and Hao Helen Zhang. On the adaptive elastic-net with a diverging number of parameters.