arXiv [cs.IT]

Regularly random duality

MIHAILO STOJNIC
School of Industrial Engineering, Purdue University, West Lafayette, IN 47907
e-mail: [email protected]
Abstract
In this paper we look at a class of random optimization problems and discuss ways to determine the typical behavior of their solutions. When the dimensions of the optimization problems are large, such information can often be obtained without actually solving the original problems. Moreover, we also discover that fairly often one can determine many quantities of interest (such as, for example, the typical optimal values of the objective functions) completely analytically. We present a few general ideas and emphasize that the range of applications is enormous.
Index Terms: Linear constraints; duality.

1 Introduction

We start by looking at a class of very simple optimization problems, namely linearly constrained optimization problems. Such problems can be formulated in the following fairly general way:

    min_x  f(x)
    subject to  A x = 0
                B x ≤ 0,                                            (1)

where for concreteness we assume that A is an m₁ × n matrix from R^{m₁×n} and B is an m₂ × n matrix from R^{m₂×n}. Also, it is rather clear, but we still mention, that f(x): R^n → R is what we will call the objective function. Looking at the problem given in (1), the first thing that comes to mind is that it is a linearly constrained optimization problem (see, e.g., [1]). So there is really nothing specific about it beyond the fact that, depending on the type of function f(x), its objective value could be either bounded or unbounded, and the problem can be either feasible or not. To make the exposition easier, we will assume that whenever something in our exposition could make the objective unbounded or even nonexistent, such a scenario is not the subject of our discussion in this paper. In other words, we will assume that we look only at scenarios where the objective values can be computed and are properly bounded. An alternative would be, if the problem above is unbounded, to simply add constraints that ensure boundedness of the objective value; or, on the other hand, if the problem above is infeasible, to simply remove some of the constraints until it becomes feasible. We will occasionally throughout the paper look at a few scenarios where we need to force boundedness. However, since there will be scenarios where we ignore it, we simply state this caveat here before we proceed with the presentation.

Now, going back to the optimization problem given in (1): determining the solution of this problem and the optimal value of its objective function is of course the ultimate goal.
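Before moving on, a tiny brute-force illustration of the shape of (1). Everything here is hand-picked for illustration (a hypothetical quadratic objective and 1 × 2 matrices A and B), and grid search stands in for a real solver:

```python
# Toy instance of (1): n = 2, one equality and one inequality constraint,
# with a hypothetical smooth objective f(x) = (x1 - 1)^2 + (x2 - 1)^2.
A = [[1.0, -1.0]]            # A x = 0  forces x1 = x2
B = [[0.0, 1.0]]             # B x <= 0 forces x2 <= 0

def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2

best, best_x = float("inf"), None
steps = 400
for i in range(-steps, steps + 1):
    for j in range(-steps, steps + 1):
        x = [i / 100.0, j / 100.0]
        if abs(A[0][0] * x[0] + A[0][1] * x[1]) < 1e-9 \
                and B[0][0] * x[0] + B[0][1] * x[1] <= 0:
            v = f(x)
            if v < best:
                best, best_x = v, x

# the constrained optimum sits at the origin here
print(best_x, best)
```

The feasible set is the ray x1 = x2 ≤ 0, so the closest feasible point to (1, 1) is the origin, with objective value 2.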
The type of function f(x) is typically what determines whether this problem can be solved in polynomial time or not. For a moment let us assume that f(x) is such that (1) can be solved in polynomial time (in this paper, whenever we say polynomial time we mean it roughly speaking, i.e. without all the details related to what complexity theory calls strongly polynomial and all the other subtleties that come with such considerations). From an algorithmic point of view the above problem is then typically considered solvable. Our interest in this paper will be slightly different from this classical approach. We will look at a class of these problems and discuss whether or not it is possible to analytically determine the optimal value of the objective function. Of course, if the dimension of the problem, n, is small (say n = 2 or n = 3), it is highly likely (no matter how complicated f(x) may be) that (1) can be solved analytically. As one may guess, our interest will not be in such small-dimensional scenarios either. Instead, we will typically look at large values of n and of all other dimensions. Moreover, to facilitate writing, we will typically assume the so-called linear regime, i.e. we will assume that all dimensions in this paper are large but linearly proportional to n. For example, in (1) we will assume that m₁ = α₁ n and m₂ = α₂ n, where both α₁ and α₂ are constants independent of n.

Now, if the dimensions in (1) are large and our goal is to solve it analytically, how exactly do we plan to go about it? Well, there is really not much we can say right away, for two reasons: 1) we have not specified f(x), and dealing with an unspecified f(x) can be unpredictable and in fact quite often impossible; 2) the dimension of the problem is large, which means that the number of constraints is large as well; moreover, in the general setup that we assume, they all act on all components of x, i.e. on all of {x₁, x₂, ..., x_n}.
While we will not change much in our specification of f(x), we will look for a glimmer of hope in a particular type of constraints. In other words, we leave the first of the above reasons aside and try to deal with the second one, hoping that this alone introduces enough simplification that eventually even the first reason is not much of a problem. There are many ways one can deal with the sets A x = 0, B x ≤ 0. Our approach will be a random one. More specifically, we will assume that the set of constraints is drawn from a probability distribution. Since the matrices A and B essentially determine the constraints, we will assume that they are the random objects. Moreover, to make the presentation easier and to introduce a bit more concreteness, we will also assume that all components of both A and B are i.i.d. standard normal random variables. This effectively establishes the problem in (1) as a random optimization problem, and that is the class of optimization problems that will be the subject of our study in this paper. Fairly often in the theory of random optimization problems one looks at objective functions that are themselves random functions of the unknown x. Our entire exposition can easily be adapted to encompass such a scenario as well; however, we find it easier from the presentation point of view to assume that f(x) is a deterministic function.

While there is quite a large literature on the algorithmic aspects of random optimization problems, we stop short of reviewing it here. The main reason is that we are not interested in a specific instance of a certain optimization problem but rather in a large class of optimization problems, and it would be fairly hard to cover all the relevant work without missing some specific portions of it. We do mention, however, that the problems we study here are very generic, and the literature on any one of their particular instances would itself warrant a solid review.
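As a quick illustration of the random ensemble just introduced: for any fixed direction x, the entries of Bx are i.i.d. symmetric random variables, so a fixed x satisfies B x ≤ 0 with probability 2^{−m₂}. A minimal Monte Carlo sketch (toy dimensions, not the asymptotic regime of the paper):

```python
import random

def gaussian_matrix(rows, cols, rng):
    # i.i.d. standard normal entries, as assumed for A and B
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

rng = random.Random(0)
n = 10                       # ambient dimension
m2 = 3                       # linear regime would be m2 = alpha2 * n

# fixed direction x; each component of Bx is symmetric around 0,
# and the rows of B are independent, so P(Bx <= 0) = 2^(-m2)
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
trials = 20000
hits = 0
for _ in range(trials):
    Bmat = gaussian_matrix(m2, n, rng)
    if all(v <= 0.0 for v in matvec(Bmat, x)):
        hits += 1
print(hits / trials, 2.0 ** (-m2))   # empirical frequency vs. 1/8
```

This is why the randomness of the constraints, rather than the specifics of f(x), drives the analysis below.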
We also mention that our exposition does not rely on any of the results known for any specific instance. In that sense the reader is not really required to have much background in optimization theory beyond a few classical concepts that will be rather obvious from our presentation. Moreover, any such concepts will be fairly general and not tailored in any way to the classes of problems we study here.

Now that we have the setup of the introductory problem, we briefly describe what we will present in the rest of the paper and how the paper is organized. In Section 2 we look at problem (1) in the above-mentioned random context and present several observations that can be useful in analytically studying the typical probabilistic behavior of the solutions of such problems. In Section 3 we then study two more general versions of the original problem (1), namely nonhomogeneous linear constraints and additional functional constraints. In Section 4 we then look at several particular objective functions and present in detail how the mechanisms of Section 3 work. In Section 5 we give a brief discussion and present several conclusions related to the presented results.

2 Linearly constrained programs with random constraints

In this section we look at problem (1) in a statistical scenario. As mentioned above, for concreteness we assume that in

    ξ(f, A, B) = min_x  f(x)
                 subject to  A x = 0
                             B x ≤ 0,                               (2)

all components of the matrices A and B are i.i.d. standard normals. Since the assumed scenario is random, we also need to revisit one of the assumptions we made right after (1). Namely, we stated that we will ignore all situations where the objective function is unbounded. Given the statistical scenario, we slightly modify that statement: we will assume that the objective in (2) is bounded with overwhelming probability (by overwhelming probability we in this paper mean a probability that is no more than a quantity exponentially decaying in n away from 1).
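The transformation used next attaches Lagrange multipliers to the constraints of (2): for any fixed x, maximizing ν^T A x + λ^T B x over a free ν and λ ≥ 0 gives 0 when the constraints hold and +∞ otherwise, so the inner maximization re-encodes the constraints. A small numeric sketch with hand-picked values (the sup is approximated by a large cap on the multipliers; A and B are tiny toys, not the random matrices of the paper):

```python
# For fixed x, sup over free nu and lambda >= 0 of nu^T(Ax) + lambda^T(Bx):
# the capped surrogate below takes nu_i = cap * sign((Ax)_i) and
# lambda_i = cap whenever (Bx)_i > 0 (and 0 otherwise).
def sup_penalty(Ax, Bx, cap):
    return cap * (sum(abs(v) for v in Ax) + sum(v for v in Bx if v > 0))

feasible_Ax, feasible_Bx = [0.0, 0.0], [-0.7, -0.1]    # Ax = 0, Bx <= 0
violating_Ax, violating_Bx = [0.2, 0.0], [-0.7, 0.3]   # one of each violated

print(sup_penalty(feasible_Ax, feasible_Bx, 1e6))      # 0: constraints encoded
print(sup_penalty(violating_Ax, violating_Bx, 1e6))    # grows with the cap
```

Adding this penalty to f(x) and minimizing over x therefore reproduces the constrained optimum, which is exactly the content of the transformation below.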
Under such an assumption we then proceed with the following transformation of (2):

    ξ(f, A, B) = min_x max_{ν,λ}  f(x) + ν^T A x + λ^T B x
                 subject to  λ_i ≥ 0, i = 1, 2, ..., m₂.            (3)

In the rest of this section we will present a strategy that can be helpful in obtaining a probabilistic view of the quantity ξ(f, A, B). We will split the presentation in two parts. In the first part we present a lower-bound type of strategy, whereas in the second part we present an upper-bound type of strategy.

2.1 Lower-bounding strategy

We will invoke the results of the following lemma, which is a slightly modified version of Lemma 3.1 from [3] (which is a direct consequence of Theorem B from [3]).
Lemma 1.
Let A be an m₁ × n matrix with i.i.d. standard normal components and let B be an m₂ × n matrix with i.i.d. standard normal components. Let g and h be n × 1 and (m₁ + m₂) × 1 vectors, respectively, with i.i.d. standard normal components. Also, let g be a standard normal random variable. Then

    P( min_x max_{λ≥0,ν} ( ν^T A x + λ^T B x + ‖[ν^T λ^T]^T‖₂ ‖x‖₂ g − ζ^{(l)}_{x,ν,λ} ) ≥ 0 )
      ≥ P( min_x max_{λ≥0,ν} ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(l)}_{x,ν,λ} ) ≥ 0 ).    (4)

Proof.
The proof follows from Theorem B from [3] after a fairly obvious modification of the proof of Lemma 3.1 given in [3].

Let ζ^{(l)}_{x,ν,λ} = ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ − f(x) + ξ_D^{(l)}(f), with ε_5^{(g)} > 0 an arbitrarily small constant independent of n and ξ_D^{(l)}(f) a fixed number that we will discuss later in great detail. Also, let h = [h_A^T h_B^T]^T, where h_A consists of the first m₁ components of h and h_B of the last m₂ components of h. We will first look at the right-hand side of the inequality in (4). The following is then the probability of interest:

    P( min_x max_{λ≥0,ν} ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(l)}_{x,ν,λ} ) ≥ 0 ).    (5)

Before looking further at this probability, we examine in a bit more detail the optimization problem inside it. We first denote

    L = min_x max_{λ≥0,ν} ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(l)}_{x,ν,λ} ).    (6)

Replacing the value of ζ^{(l)}_{x,ν,λ} we further have

    L = min_x max_{λ≥0,ν} ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T
          − ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ + f(x) − ξ_D^{(l)}(f) ).    (7)

One can now do the inner maximization for a fixed x and a fixed ‖[ν^T λ^T]^T‖₂. We then get

    L = min_x max_{λ≥0,ν} ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ ‖[ν^T λ^T]^T‖₂ √(‖h_A‖₂² + ‖h_B+‖₂²)
          − ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ + f(x) − ξ_D^{(l)}(f) ),    (8)

where h_B+ is the vector comprised only of the non-negative components of h_B. To make sure that L remains bounded we further have

    L = min_x  f(x) − ξ_D^{(l)}(f)
        subject to  g^T x + ‖x‖₂ ( √(‖h_A‖₂² + ‖h_B+‖₂²) − ε_5^{(g)} √n ) ≤ 0.    (9)
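The inner maximization that produces (8), and the concentration estimate used for its norm term right after, can both be checked numerically. A minimal pure-Python sketch (toy dimensions and a hand-picked seed; not the paper's asymptotic regime):

```python
import random, math

rng = random.Random(1)
m1, m2 = 5, 7
h_A = [rng.gauss(0, 1) for _ in range(m1)]
h_B = [rng.gauss(0, 1) for _ in range(m2)]
h_Bplus = [max(v, 0.0) for v in h_B]

# Step (8): for fixed x and fixed r = ||[nu^T lambda^T]^T||_2, the maximum of
# h_A^T nu + h_B^T lambda over free nu, lambda >= 0 equals
# r * sqrt(||h_A||^2 + ||h_B+||^2), attained at nu ~ h_A, lambda ~ h_B+.
r = 2.5
norm = math.sqrt(sum(v * v for v in h_A) + sum(v * v for v in h_Bplus))
closed = r * norm
nu_star = [r * v / norm for v in h_A]
lam_star = [r * v / norm for v in h_Bplus]
attained = (sum(a * b for a, b in zip(h_A, nu_star))
            + sum(a * b for a, b in zip(h_B, lam_star)))

best_random = -float("inf")            # random feasible candidates never win
for _ in range(2000):
    nu = [rng.gauss(0, 1) for _ in range(m1)]
    lam = [abs(rng.gauss(0, 1)) for _ in range(m2)]        # lambda >= 0
    s = math.sqrt(sum(v * v for v in nu) + sum(v * v for v in lam))
    val = (sum(a * b * r / s for a, b in zip(h_A, nu))
           + sum(a * b * r / s for a, b in zip(h_B, lam)))
    best_random = max(best_random, val)
print(attained, closed, best_random <= closed + 1e-9)

# Concentration: sqrt(||h_A||^2 + ||h_B+||^2) concentrates around
# sqrt(M1 + M2/2), since E h^2 = 1 while E[h^2; h > 0] = 1/2.
M1, M2 = 50, 100
vals = []
for _ in range(200):
    hA = [rng.gauss(0, 1) for _ in range(M1)]
    hB = [rng.gauss(0, 1) for _ in range(M2)]
    vals.append(math.sqrt(sum(v * v for v in hA)
                          + sum(v * v for v in hB if v > 0)))
avg = sum(vals) / len(vals)
print(avg, math.sqrt(M1 + M2 / 2))     # both close to 10
```

The second half is exactly the heuristic behind replacing the random norm in (9) by the deterministic quantity √(m₁ + m₂/2) up to arbitrarily small ε's.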
Combining (5), (6), and (9) one then has for the right-hand side of (4)

    P( min_x max_{λ≥0,ν} ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(l)}_{x,ν,λ} ) ≥ 0 ) = P( L ≥ 0 ),    (10)

with L given in (9). Since h_A is a vector of m₁ i.i.d. standard normal variables and h_B is a vector of m₂ i.i.d. standard normal variables, it is rather trivial that

    P( √(‖h_A‖₂² + ‖h_B+‖₂²) > (1 − ε_1^{(m)}) √(m₁ + m₂/2) ) ≥ 1 − e^{−ε_2^{(m)} (m₁ + m₂/2)},

where ε_1^{(m)} > 0 is an arbitrarily small constant and ε_2^{(m)} is a constant dependent on ε_1^{(m)} but independent of n. Then one can modify (9) and (10) in the following way:

    P( min_x max_{λ≥0,ν} ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(l)}_{x,ν,λ} ) ≥ 0 )
      = P( L ≥ 0 ) ≥ (1 − e^{−ε_2^{(m)} (m₁ + m₂/2)}) P( L^{(1)} ≥ 0 ),    (11)

where L^{(1)} is

    L^{(1)} = min_x  f(x) − ξ_D^{(l)}(f)
              subject to  g^T x + ‖x‖₂ ( (1 − ε_1^{(m)}) √(m₁ + m₂/2) − ε_5^{(g)} √n ) ≤ 0.    (12)

We now look at the left-hand side of the inequality in (4):

    P( min_x max_{λ≥0,ν} ( ν^T A x + λ^T B x + ‖[ν^T λ^T]^T‖₂ ‖x‖₂ g − ζ^{(l)}_{x,ν,λ} ) ≥ 0 )
      = P( min_x max_{λ≥0,ν} ( ν^T A x + λ^T B x + f(x) − ξ_D^{(l)}(f) + ‖[ν^T λ^T]^T‖₂ ‖x‖₂ ( g − ε_5^{(g)} √n ) ) ≥ 0 ).    (13)

Since P( g ≤ ε_5^{(g)} √n ) > 1 − e^{−ε_6^{(g)} n} (where ε_6^{(g)} is, as all other ε's in this paper are, independent of n), from (13) we have

    P( min_x max_{λ≥0,ν} ( ν^T A x + λ^T B x + ‖[ν^T λ^T]^T‖₂ ‖x‖₂ g − ζ^{(l)}_{x,ν,λ} ) ≥ 0 )
      ≤ (1 − e^{−ε_6^{(g)} n}) P( min_x max_{λ≥0,ν} ( ν^T A x + λ^T B x + f(x) − ξ_D^{(l)}(f) ) ≥ 0 ) + e^{−ε_6^{(g)} n}.    (14)

Connecting (4), (10), (11), and (14) we obtain

    P( min_x max_{λ≥0,ν} ( ν^T A x + λ^T B x + f(x) − ξ_D^{(l)}(f) ) ≥ 0 )
      ≥ (1 − e^{−ε_2^{(m)} (m₁ + m₂/2)}) (1 − e^{−ε_6^{(g)} n}) P( L^{(1)} ≥ 0 ) − e^{−ε_6^{(g)} n} / (1 − e^{−ε_6^{(g)} n}),    (15)

where L^{(1)} is as given in (12). A further combination of (3) and (15) gives

    P( ξ(f, A, B) − ξ_D^{(l)}(f) ≥ 0 )
      ≥ (1 − e^{−ε_2^{(m)} (m₁ + m₂/2)}) (1 − e^{−ε_6^{(g)} n}) P( L^{(1)} ≥ 0 ) − e^{−ε_6^{(g)} n} / (1 − e^{−ε_6^{(g)} n}).    (16)

We are now in position to state the following lemma, the first of the results that we present relating to the optimal value of the objective of (2).

Lemma 2. (Lower bound) Let A be an m₁ × n matrix with i.i.d. standard normal components and let B be an m₂ × n matrix with i.i.d. standard normal components. Assume that n is large and that m₁ = α₁ n and m₂ = α₂ n, where α₁ and α₂ are constants independent of n. Let f(x): R^n → R be a given function and let ξ(f, A, B) be the optimal value of the objective of the optimization problem in (2). Assume that f(x) is such that |ξ(f, A, B)| < ∞ with overwhelming probability. Further, let g be an n × 1 vector with i.i.d. standard normal components. Let the ε's in (12) be arbitrarily small constants and let ξ_D^{(l)}(f) be the largest scalar such that L^{(1)} defined in (12) is non-negative with overwhelming probability. Then

    lim_{n→∞} P( ξ(f, A, B) > ξ_D^{(l)}(f) ) = 1.

Proof.
Follows from the previous discussion.

While the above lemma may sound a bit dry, it is often a fairly powerful tool for dealing with random linearly constrained programs. Its power essentially lies in the potential simplicity of the auxiliary optimization program (12). It is relatively easy to see that the optimization problem in (12) is substantially simpler than the original one given in (2). Still, there is no guarantee that (12) is always solvable; that certainly depends on the structure of the function f(x). Also, not only does (12) need to be solvable, one should also be able to show that its solution behaves "nicely", i.e. one should be able to find a quantity ξ_D^{(l)} that is almost certain to be smaller than the optimal value of the objective of (12). Towards the end of the paper we will demonstrate how the results of this lemma can be used in practice on a small example. The key in such an example (as in any example where the above lemma is to be of any use) will be the ability to probabilistically handle the much simpler program (12). Before proceeding with further generic considerations of (2), we present in the next subsection a corresponding upper-bounding strategy for a probabilistic characterization of the objective in (2).

2.2 Upper-bounding strategy

Before we proceed with the detailed presentation we should make the following point more explicit. What we presented in the previous subsection is a concept that is, mathematically speaking, always correct, i.e. correct as long as the problem in (2) is deterministically solvable and its solution probabilistically bounded. Now, while the concept is correct, it is just a lower-bound type of approach, and it relies on the potential simplicity of (12). So it will be useful if (12) can be handled.
However, no matter whether (12) can be handled or not, the entire concept remains valid under very minimal assumptions on f(x) (in fact, the assumption that f(x) is such that (2) is bounded seems pretty much unavoidable if solving (2) is to have any reasonable practical sense). On the other hand, the strategy that we present below will not work generically: it will require additional assumptions on f(x) beyond those mentioned in the previous subsection. Since these assumptions may or may not hold, we preface our presentation by saying that the upper-bounding strategy we show below is, in a way, a virtual strategy. Of course, we should add that the strategy is not purely virtual. Quite the contrary, it fairly often works; in fact, roughly speaking, it works almost exactly as complexity theory would suggest, i.e. it is fairly similar to the following paradigm: "as long as (2) is computationally doable in a reasonable amount of time, the strategy will work well". Of course, this is a fairly informal statement without any mathematically rigorous language. Establishing the above statement on a more mathematically rigorous level requires a presentation that goes way beyond the scope of this paper, and we will pursue it elsewhere. We do mention that such a presentation contains almost no further conceptual insight, i.e. the core of the ideas is already here; however, it does require an enormous amount of mathematical detailing, which we choose to skip to avoid ruining the elegance of the presentation we attempt to achieve here.

Going back to (2), in this subsection we essentially attempt to mimic the presentation of the previous subsection. To that end we start by recalling that our object of interest is the following linearly constrained optimization problem:

    ξ(f, A, B) = min_x  f(x)
                 subject to  A x = 0
                             B x ≤ 0,                               (17)

which after a bit of juggling becomes

    ξ(f, A, B) = min_x max_{λ≥0,ν}  f(x) + ν^T A x + λ^T B x.       (18)
Now, we recall that A and B are random matrices, and the above optimization problems are therefore random as well. Given their randomness, they can sometimes be solvable and sometimes not: they may be unsolvable because they are infeasible, or because they are feasible but the value of the objective function is unbounded. However, as we did in the previous subsections, we leave all these unfavorable scenarios aside and preface our presentation by assuming that |ξ(f, A, B)| < ∞ with overwhelming probability.

At this point we attempt to transform (18), assuming that f(x) is such that the transformation is mathematically possible. Namely, let f(x) be such that

    ξ(f, A, B) = min_x max_{λ≥0,ν}  f(x) + ν^T A x + λ^T B x
               = max_{λ≥0,ν} min_x  f(x) + ν^T A x + λ^T B x.       (19)

Assuming that (19) holds, one can then further write

    ξ(f, A, B) = max_{λ≥0,ν} min_x  f(x) + ν^T A x + λ^T B x
               = − min_{λ≥0,ν} max_x  −f(x) − ν^T A x − λ^T B x     (20)

and

    −ξ(f, A, B) = min_{λ≥0,ν} max_x  −f(x) − ν^T A x − λ^T B x.     (21)

Similarly to what was done in the previous subsection, we will utilize the results of the following lemma, which is a slightly modified version of Lemma 3.1 from [3] and an upper-bounding analogue of the lower-bounding Lemma 1.

Lemma 3.
Let A be an m₁ × n matrix with i.i.d. standard normal components and let B be an m₂ × n matrix with i.i.d. standard normal components. Let g and h be n × 1 and (m₁ + m₂) × 1 vectors, respectively, with i.i.d. standard normal components. Also, let g be a standard normal random variable. Then

    P( min_{λ≥0,ν} max_x ( −ν^T A x − λ^T B x + ‖[ν^T λ^T]^T‖₂ ‖x‖₂ g − ζ^{(u)}_{x,ν,λ} ) ≥ 0 )
      ≥ P( min_{λ≥0,ν} max_x ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(u)}_{x,ν,λ} ) ≥ 0 ).    (22)

Proof.
The proof follows from Theorem B from [3] after a fairly obvious modification of the proof of Lemma 3.1 given in [3].

Let ζ^{(u)}_{x,ν,λ} = ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ + f(x) − ξ_D^{(u)}(f), with ε_5^{(g)} > 0 an arbitrarily small constant independent of n and ξ_D^{(u)}(f) a fixed number that we will discuss later in great detail. As in the previous subsection, let h = [h_A^T h_B^T]^T, where h_A consists of the first m₁ components of h and h_B of the last m₂ components of h. As in the previous subsection, we will first look at the right-hand side of the inequality in (22). The following is then the probability of interest:

    P( min_{λ≥0,ν} max_x ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(u)}_{x,ν,λ} ) ≥ 0 ).    (23)

Before looking further at this probability, we examine in a bit more detail the optimization problem inside it. We first denote

    U = min_{λ≥0,ν} max_x ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(u)}_{x,ν,λ} ).    (24)

Replacing the value of ζ^{(u)}_{x,ν,λ} we further have

    U = min_{λ≥0,ν} max_x ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T
          − ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ − f(x) + ξ_D^{(u)}(f) ).    (25)

From (25) one then has

    U = min_{λ≥0,ν} max_x ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ − f(x) + ξ_D^{(u)}(f) )
      ≥ max_x min_{λ≥0,ν} ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ − f(x) + ξ_D^{(u)}(f) )
      = − min_x max_{λ≥0,ν} ( −‖[ν^T λ^T]^T‖₂ g^T x − ‖x‖₂ h^T [ν^T λ^T]^T + ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ + f(x) − ξ_D^{(u)}(f) ).    (26)
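Two facts are at work here: the exchange step in (26) only uses the generic inequality min max ≥ max min, while (19) asks for the stronger equality, which holds for nice (e.g. convex) objectives. A small grid check of both, with hypothetical toy functions (not the paper's objectives):

```python
import math

# (i) Generic exchange inequality: min_y max_x phi >= max_x min_y phi,
# for an arbitrary function with no saddle-point structure.
def phi(x, y):
    return math.sin(3.0 * x) * y + x * x - 0.5 * y * y

xs = [i / 20.0 for i in range(-20, 21)]
ys = [j / 20.0 for j in range(-20, 21)]
min_max = min(max(phi(x, y) for x in xs) for y in ys)
max_min = max(min(phi(x, y) for y in ys) for x in xs)
print(min_max >= max_min)           # always True

# (ii) For a convex f with a linear multiplier term, the two orders
# coincide: f(x) = (x - 1)^2 with constraint x <= 0, Lagrangian below.
def lagr(x, lam):
    return (x - 1.0) ** 2 + lam * x

xg = [i / 100.0 for i in range(-300, 301)]
lams = [j / 10.0 for j in range(0, 401)]     # lambda >= 0 on a grid
mm = min(max(lagr(x, lam) for lam in lams) for x in xg)
ww = max(min(lagr(x, lam) for x in xg) for lam in lams)
print(mm, ww)                       # both equal the constrained optimum 1
```

Part (ii) is the kind of behavior assumption (19) demands of f(x); part (i) is all that the bound in (26) needs.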
One can now do the inner maximization for a fixed x and a fixed ‖[ν^T λ^T]^T‖₂ to get

    U ≥ − min_x max_{λ≥0,ν} ( −‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ ‖[ν^T λ^T]^T‖₂ √(‖h_A‖₂² + ‖h_B+‖₂²)
            + ε_5^{(g)} √n ‖[ν^T λ^T]^T‖₂ ‖x‖₂ + f(x) − ξ_D^{(u)}(f) ),    (27)

where, as in the previous subsection, h_B+ is the vector comprised only of the non-negative components of h_B. To make sure that the quantity on the right-hand side remains bounded we further have

    U ≥ − ( min_x  f(x) − ξ_D^{(u)}(f)
            subject to  −g^T x + ‖x‖₂ ( √(‖h_A‖₂² + ‖h_B+‖₂²) + ε_5^{(g)} √n ) ≤ 0 ).    (28)

Let

    U^{(0)} = min_x  f(x) − ξ_D^{(u)}(f)
              subject to  −g^T x + ‖x‖₂ ( √(‖h_A‖₂² + ‖h_B+‖₂²) + ε_5^{(g)} √n ) ≤ 0.    (29)

Combining (23), (24), and (28) one then has for the right-hand side of (22)

    P( min_{λ≥0,ν} max_x ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(u)}_{x,ν,λ} ) ≥ 0 )
      = P( U ≥ 0 ) ≥ P( −U^{(0)} ≥ 0 ) = P( U^{(0)} ≤ 0 ),    (30)

with U^{(0)} as given in (29). Since h_A is a vector of m₁ i.i.d. standard normal variables and h_B is a vector of m₂ i.i.d. standard normal variables, it is rather trivial that

    P( √(‖h_A‖₂² + ‖h_B+‖₂²) < (1 + ε_1^{(m)}) √(m₁ + m₂/2) ) ≥ 1 − e^{−ε_2^{(m)} (m₁ + m₂/2)},

where we recall that, as in the previous subsection, ε_1^{(m)} > 0 is an arbitrarily small constant and ε_2^{(m)} is a constant dependent on ε_1^{(m)} but independent of n. Then one can modify (29) and (30) in the following way:

    P( min_{λ≥0,ν} max_x ( ‖[ν^T λ^T]^T‖₂ g^T x + ‖x‖₂ h^T [ν^T λ^T]^T − ζ^{(u)}_{x,ν,λ} ) ≥ 0 )
      ≥ P( U^{(0)} ≤ 0 ) ≥ (1 − e^{−ε_2^{(m)} (m₁ + m₂/2)}) P( U^{(1)} ≤ 0 ),    (31)

where U^{(1)} is

    U^{(1)} = min_x  f(x) − ξ_D^{(u)}(f)
              subject to  −g^T x + ‖x‖₂ ( (1 + ε_1^{(m)}) √(m₁ + m₂/2) + ε_5^{(g)} √n ) ≤ 0.    (32)

We now look at the left-hand side of the inequality in (22). Essentially we just need to repeat the corresponding arguments from the previous subsection, with a few notational modifications. We start with

    P( min_{λ≥0,ν} max_x ( −ν^T A x − λ^T B x + ‖[ν^T λ^T]^T‖₂ ‖x‖₂ g − ζ^{(u)}_{x,ν,λ} ) ≥ 0 )
      = P( min_{λ≥0,ν} max_x ( −ν^T A x − λ^T B x − f(x) + ξ_D^{(u)}(f) + ‖[ν^T λ^T]^T‖₂ ‖x‖₂ ( g − ε_5^{(g)} √n ) ) ≥ 0 ).    (33)

Since P( g ≤ ε_5^{(g)} √n ) > 1 − e^{−ε_6^{(g)} n} (where ε_6^{(g)} is, as all other ε's in this paper are, independent of n), from (33) we have

    P( min_{λ≥0,ν} max_x ( −ν^T A x − λ^T B x + ‖[ν^T λ^T]^T‖₂ ‖x‖₂ g − ζ^{(u)}_{x,ν,λ} ) ≥ 0 )
      ≤ (1 − e^{−ε_6^{(g)} n}) P( min_{λ≥0,ν} max_x ( −ν^T A x − λ^T B x − f(x) + ξ_D^{(u)}(f) ) ≥ 0 ) + e^{−ε_6^{(g)} n}.    (34)

Connecting (22), (30), (31), and (34) we obtain

    P( min_{λ≥0,ν} max_x ( −ν^T A x − λ^T B x − f(x) + ξ_D^{(u)}(f) ) ≥ 0 )
      ≥ (1 − e^{−ε_2^{(m)} (m₁ + m₂/2)}) (1 − e^{−ε_6^{(g)} n}) P( U^{(1)} ≤ 0 ) − e^{−ε_6^{(g)} n} / (1 − e^{−ε_6^{(g)} n}),    (35)

where U^{(1)} is as given in (32). A further combination of (18) and (35) gives

    P( −ξ(f, A, B) + ξ_D^{(u)}(f) > 0 )
      ≥ (1 − e^{−ε_2^{(m)} (m₁ + m₂/2)}) (1 − e^{−ε_6^{(g)} n}) P( U^{(1)} ≤ 0 ) − e^{−ε_6^{(g)} n} / (1 − e^{−ε_6^{(g)} n}).    (36)

We are now in position to state the following lemma, which helps create an upper bound on the optimal value of the objective of (2).

Lemma 4. (Virtual upper bound) Let A be an m₁ × n matrix with i.i.d. standard normal components and let B be an m₂ × n matrix with i.i.d. standard normal components. Assume that n is large and that m₁ = α₁ n and m₂ = α₂ n, where α₁ and α₂ are constants independent of n. Let f(x): R^n → R be a given function and let ξ(f, A, B) be the optimal value of the objective of the optimization problem in (2). Assume that f(x) is such that |ξ(f, A, B)| < ∞ with overwhelming probability and that (19) holds. Further, let g be an n × 1 vector with i.i.d. standard normal components. Let the ε's in (32) be arbitrarily small constants and let ξ_D^{(u)}(f) be the smallest scalar such that U^{(1)} defined in (32) is non-positive with overwhelming probability. Then

    lim_{n→∞} P( ξ(f, A, B) < ξ_D^{(u)}(f) ) = 1.

Proof.
Follows from the previous discussion.

As was the case with Lemma 2, Lemma 4 may also sound a bit dry. However, as we mentioned right after Lemma 2, it often turns out to be a fairly powerful tool for dealing with random linearly constrained programs. Its major power essentially lies in the potential simplicity of the auxiliary optimization program (32) (excluding a couple of technical details, this program is for all practical purposes the same as the one given in (12)). On the other hand, one should keep in mind that Lemma 4 is a bit more restrictive in that it also requires f(x) to be such that (19) holds. If one for a moment leaves this restriction aside, then the power of the above lemma pretty much relies on one's ability to determine a quantity ξ_D^{(u)} that is almost certain to be larger than the optimal value of f(x) in (32). Of course, the smaller ξ_D^{(u)}, the better the bound. In more informal language: if the duality in (19) holds and everything else (probabilistically speaking) behaves "nicely", the success of the above mechanism relies on one's ability to provide a precise probabilistic analysis of (12) or (32). That is typically highly likely to be possible, given that the optimization program (12) (or (32)) has only one random linear constraint.

3 Generalizations

What we presented in the previous section is an often very powerful mechanism for handling linearly constrained optimization programs. One may then naturally wonder whether there is a way to extend the above results to more general classes of optimization problems. The answer is yes, but in our experience such extensions are typically problem specific. That is of course one of the reasons why we presented the main concepts on a very simple optimization problem.
Instead of listing various other types of problems where the mechanism presented here can be used equally successfully, we choose below to discuss a few small modifications which will hopefully provide a hint as to how relatively easily the whole framework can be massaged to fit various other scenarios. All these modifications could have been included in our original setup already; however, we thought that they would make the original problem unnecessarily cumbersome, and in order to preserve the lightness of the exposition we chose to start with the simplest possible example and build from there.
3.1 Non-homogeneous linear constraints

Looking back at problem (1), one can notice that we started with a set of constraints that is basically homogeneous, i.e. pretty much scaling invariant: for any x that is feasible in (1), cx is feasible as well for any c ≥ 0. Typically, linear constraints are not necessarily homogeneous, and if they are not, one has the following more general version of (1):

    min_x  f(x)
    subject to  A x = a
                B x ≤ b,                                            (37)

where a is an m₁ × 1 vector from R^{m₁} and, analogously, b is an m₂ × 1 vector from R^{m₂}. Now, given that in this paper we are dealing with random programs, it is natural to wonder whether a and/or b are random or deterministic (fixed). Below we just sketch how our results easily adapt when a and b are deterministic. Essentially, one can pretty much repeat the entire derivation from the previous section. Namely, one can start by defining the optimal value of the objective in (37) as

    ξ_nh(f, A, B) = min_x  f(x)
                    subject to  A x = a
                                B x ≤ b,                            (38)

and write an analogue to (3):

    ξ_nh(f, A, B) = min_x max_{λ≥0,ν}  f(x) + ν^T (A x − a) + λ^T (B x − b).    (39)

One can then repeat the entire derivation from the previous section with very minimal and fairly obvious modifications. We skip such an exercise and mention only the critical differences and the final results. The only difference in the entire derivation will be the form of the auxiliary programs (12) (or (32)). Since (12) (and (32)) are a more refined version of (9) (and (29)), what actually changes is the structure of these programs. So instead of them one would have

    L_nh^{(0)} = min_x  f(x) − ξ_D^{(l)}(f)
                 subject to  g^T x + √( ‖ ‖x‖₂ h_A − a ‖₂² + ‖ ( ‖x‖₂ h_B − b )_+ ‖₂² ) − ‖x‖₂ ε_5^{(g)} √n ≤ 0,    (40)

where ( ‖x‖₂ h_B − b )_+ is the vector comprised of the non-negative components of the vector ‖x‖₂ h_B − b. On the other hand, one would have for a corresponding replacement of (29)

    U_nh^{(0)} = min_x  f(x) − ξ_D^{(u)}(f)
                 subject to  −g^T x + √( ‖ ‖x‖₂ h_A − a ‖₂² + ‖ ( ‖x‖₂ h_B − b )_+ ‖₂² ) + ‖x‖₂ ε_5^{(g)} √n ≤ 0.    (41)
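As a quick consistency check on the non-homogeneous auxiliary constraint: setting a = 0 and b = 0 should collapse the term involving h_A, a and h_B, b back to the homogeneous form ‖x‖₂ √(‖h_A‖₂² + ‖h_B+‖₂²) of (9). A small sketch (toy dimensions and a hand-picked seed):

```python
import random, math

rng = random.Random(3)
m1, m2 = 4, 6
h_A = [rng.gauss(0, 1) for _ in range(m1)]
h_B = [rng.gauss(0, 1) for _ in range(m2)]
x_norm = 1.7                                   # any fixed value of ||x||_2

def nh_term(a, b):
    # the square-root term of the non-homogeneous auxiliary program
    eq = sum((x_norm * h - ai) ** 2 for h, ai in zip(h_A, a))
    ineq = sum(max(x_norm * h - bi, 0.0) ** 2 for h, bi in zip(h_B, b))
    return math.sqrt(eq + ineq)

zero_a, zero_b = [0.0] * m1, [0.0] * m2
homog = x_norm * math.sqrt(sum(h * h for h in h_A)
                           + sum(h * h for h in h_B if h > 0))
print(nh_term(zero_a, zero_b), homog)          # identical values
```

Since x_norm > 0 can be pulled out of both the squares and the positive parts when a = b = 0, the two expressions agree exactly, confirming that the homogeneous setup is the special case a = 0, b = 0.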
Of course, for all practical purposes programs (40) and (41) are basically equivalent. The statement of Lemma 2 would then remain in place, with the only difference being that L^{(1)} should be replaced by L_nh^{(0)}. Similarly, Lemma 4 would remain correct with U^{(1)} replaced by U_nh^{(0)} and with f(x) such that the following modified version of (19) holds:

    ξ_nh(f, A, B) = min_x max_{λ≥0,ν}  f(x) + ν^T (A x − a) + λ^T (B x − b)
                  = max_{λ≥0,ν} min_x  f(x) + ν^T (A x − a) + λ^T (B x − b).    (42)

What we presented above is a generic scenario that works for any given a and b. Even when a and b are generic, one can of course massage things further and remove the randomness of h, as in the definitions of L^{(1)} and U^{(1)} (when a and b are random this is even easier). We skip these easy exercises.

3.2 Additional functional constraints

What we discussed above is an upgrade of the existing set of constraints. Instead, one may wonder how the mechanism would fare if the linear structure of the constraints were extended to include more general constraints. For example, instead of (1) one may look at the following more general version:

    min_x  f(x)
    subject to  A x = 0
                B x ≤ 0
                f_i(x) ≤ 0, i = 1, 2, ..., l,                       (43)

where each f_i(x): R^n → R is a not necessarily linear function of x (of course, there is really no need to restrict to scalar functions; all the major steps that we present below can be repeated/extended to pretty much any kind of function). Similarly to what we discussed in the previous subsection, these functions can be random or deterministic. To make writing easier we will assume that they are generic, i.e. deterministic. One can then again proceed as above by introducing

    ξ_afc(f, f_1, f_2, ..., f_l, A, B) = min_x  f(x)
                                         subject to  A x = 0
                                                     B x ≤ 0
                                                     f_i(x) ≤ 0, i = 1, 2, ..., l,    (44)

and writing an analogue to (3):

    ξ_afc(f, f_1, f_2, ..., f_l, A, B) = min_x max_{λ≥0, γ_i≥0, ν}  f(x) + ν^T A x + λ^T B x + Σ_{i=1}^{l} γ_i f_i(x).    (45)
(45)One can again then repeat the entire derivation from the previous section with very minimal modifications.As in the previous subsection, we skip such an exercise and only mention the critical differences and finalresults. As was the case above when we discussed the non-homogeneous linear constraints, the only differ-ence in the repeated derivation will be the form of the auxiliary programs (12) (or (32)). So instead of themone would have L (1) afc = min x f ( x ) − ξ ( l ) D ( f ) subject to g T x + k x k ((1 − ǫ ( m )1 ) p m + m / − ǫ ( g )5 √ n ) ≤ f i ( x ) ≤ , i = 1 , , . . . , l. (46)On the other hand one would have for a corresponding replacement of (29) U (1) afc = min x f ( x ) − ξ ( u ) D ( f ) subject to − g T x + k x k ((1 + ǫ ( m )1 ) p m + m / ǫ ( g )5 √ n ) ≤ f i ( x ) ≤ , i = 1 , , . . . , l. (47)Of course, for all practical purposes programs (46) and (47) are basically equivalent. Statement of Lemma 2would then remain in place with the only difference being that L (1) should be replaced by L (1) afc . Similarly,Lemma 4 would remain correct with U (1) being replaced by U (1) afc and with an f ( x ) being such that the12ollowing modified version of (19) holds ξ afc ( f, A, B ) = min x max λ ≥ ,γ i ≥ ,ν f ( x ) + ν T A x + λ T B x + l X i =1 γ i f i ( x )= max λ ≥ ,γ i ≥ ,ν min x f ( x ) + ν T A x + λ T B x + l X i =1 γ i f i ( x ) . (48)What we presented above is a generic scenario where all functions f i ( x ) are assumed to be deterministic.Of course some of the additional constraints (sometimes even all of them) can be random functions as well.Then they typically can be massaged further, either when handling (46) (or (47)) or in the derivation processfrom Section 2. 
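The exchange of the min and max used in (42) (and again in (48)) can be sanity-checked numerically on a tiny convex instance. The following is a minimal sketch, assuming, purely for illustration, a quadratic objective $f(x)=\|x\|_2^2$ (so that the inner minimization over $x$ is available in closed form) and hypothetical small dimensions; it compares the primal optimal value of (38) with the value of the corresponding dual maximization.

```python
import numpy as np
from scipy.optimize import minimize

# tiny instance of (38): min ||x||_2^2 subject to Ax = a, Bx <= b
rng = np.random.default_rng(0)
n, m1, m2 = 6, 2, 3
A = rng.standard_normal((m1, n))
B = rng.standard_normal((m2, n))
x0 = A.T @ np.linalg.solve(A @ A.T, rng.standard_normal(m1))  # least-norm point with Ax0 = a
a = A @ x0
b = B @ x0 + 1.0  # unit slack keeps x0 strictly feasible for Bx <= b

# primal side of (39): solve the constrained program directly
primal = minimize(lambda x: x @ x, np.zeros(n), method='SLSQP',
                  constraints=[{'type': 'eq',   'fun': lambda x: A @ x - a},
                               {'type': 'ineq', 'fun': lambda x: b - B @ x}]).fun

# dual side of (39): for f(x) = ||x||^2 the inner minimization over x gives
# x* = -(A^T nu + B^T lam)/2 and the dual function
# g(nu, lam) = -||A^T nu + B^T lam||^2/4 - nu^T a - lam^T b, with lam >= 0
def neg_dual(z):
    nu, lam = z[:m1], z[m1:]
    v = A.T @ nu + B.T @ lam
    return v @ v / 4.0 + nu @ a + lam @ b

dual = -minimize(neg_dual, np.zeros(m1 + m2), method='L-BFGS-B',
                 bounds=[(None, None)] * m1 + [(0.0, None)] * m2).fun
print(primal, dual)  # the two values essentially coincide
```

Since this instance is convex and strictly feasible, the exchange of min and max is exact, which is what the matching primal and dual values reflect.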
However, the way to handle them is typically problem specific; we typically treat them on an individual-case basis and choose to present such discussions elsewhere.

It is of course relatively easy to see that the non-homogeneous case from the previous subsection and the case of additional functional constraints considered in this subsection can easily be merged. We of course skip rewriting this easy exercise. Instead, in the following section we provide a specific example to demonstrate how the entire mechanism can be applied. Moreover, the example is selected so that the mechanism works in its full capacity, i.e. with all assumptions satisfied and with both Lemma 2 and Lemma 4 being useful and essentially providing matching lower and upper bounds on the optimal value of the objective function.

In this section we demonstrate how the mechanism from the previous sections can be applied on a particular optimization problem. We start by assuming a specific type of the objective function. We will assume that $f(x)$ is a homogeneous function. Namely, let $f_h(x)$ be such that
$$f_h(ax)=a^d f_h(x), \tag{49}$$
for any $a>0$ and a fixed $d>0$. We then say that the function $f_h(x)$ is positive homogeneous of degree $d$. For such a function the optimization problem (1) is for all practical purposes useless. Basically, if there is a feasible $x$ such that $f_h(x)<0$, one can keep multiplying such an $x$ by a sequence of arbitrarily large increasing constants $a$, and no matter how small $d$ is, the value of $f_h(x)$ will eventually converge to $-\infty$. To help make problem (1) bounded we add an origin-encapsulating closed set to act as an additional bounding constraint. There is really no restriction as to what this constraint needs to be. However, to facilitate concrete computations we assume the most typical spherical constraint. One then has a reformulated version of (1):
$$\xi_h(f_h,A,B)=\xi_{afc}(f_h,f_1,A,B)=\min_{x} f_h(x) \quad\text{subject to}\quad Ax=0,\quad Bx\le 0,\quad f_1(x)=\|x\|_2-1\le 0. \tag{50}$$
Now the mechanism of Section 2.1 can be used. A way to provide a lower bound based on such a mechanism is to determine a quantity $\xi^{(l)}_D(f_h)$ such that $L^{(1)}_h$ below is non-negative with overwhelming probability:
$$L^{(1)}_h=\min_{x} f_h(x)-\xi^{(l)}_D(f_h) \quad\text{subject to}\quad g^Tx+\|x\|_2\left((1-\epsilon^{(m)}_1)\sqrt{m_1+m_2/2}-\epsilon^{(g)}_5\sqrt{n}\right)\le 0,\quad \|x\|_2\le 1. \tag{51}$$
Of course, the larger $\xi^{(l)}_D(f_h)$ is, the harder it is for $L^{(1)}_h$ to stay non-negative. So, roughly speaking, the best $\xi^{(l)}_D(f_h)$ would be the one that makes $L^{(1)}_h$ equal to zero (or, to be more precise, the one that makes $L^{(1)}_h$ stay just above zero). When $\xi_h(f_h,A,B)<0$ the optimization problem in (51) can be simplified a bit:
$$L^{(1)}_h=\min_{x} f_h(x)-\xi^{(l)}_D(f_h) \quad\text{subject to}\quad g^Tx+\left((1-\epsilon^{(m)}_1)\sqrt{m_1+m_2/2}-\epsilon^{(g)}_5\sqrt{n}\right)\le 0,\quad \|x\|_2\le 1. \tag{52}$$
(Throughout the presentation in the rest of this section we pretty much ignore the scenario where there is no $x$ such that $\xi_h(f_h,A,B)<0$, since in that case one trivially has $\xi_h(f_h,A,B)=0$.) On the other hand, if one sets $f_1(x)=\|x\|_2-1$ and $f_h(x)$ is such that (48) holds, then one can also utilize the mechanism of Section 2.2. A way to provide an upper bound on $\xi_h(f_h,A,B)$ based on such a mechanism is to determine a quantity $\xi^{(u)}_D(f_h)$ such that $U^{(1)}_h$ below is non-positive with overwhelming probability:
$$U^{(1)}_h=\min_{x} f_h(x)-\xi^{(u)}_D(f_h) \quad\text{subject to}\quad -g^Tx+\|x\|_2\left((1+\epsilon^{(m)}_1)\sqrt{m_1+m_2/2}+\epsilon^{(g)}_5\sqrt{n}\right)\le 0,\quad \|x\|_2\le 1. \tag{53}$$
Of course, the smaller $\xi^{(u)}_D(f_h)$ is, the harder it is for $U^{(1)}_h$ to stay non-positive. Again, roughly speaking, the best $\xi^{(u)}_D(f_h)$ would be the one that makes $U^{(1)}_h$ equal to zero (or, to be more precise, the one that makes $U^{(1)}_h$ stay just below zero). The optimization problem in (53) can be simplified a bit:
$$U^{(1)}_h=\min_{x} f_h(x)-\xi^{(u)}_D(f_h) \quad\text{subject to}\quad -g^Tx+\left((1+\epsilon^{(m)}_1)\sqrt{m_1+m_2/2}+\epsilon^{(g)}_5\sqrt{n}\right)\le 0,\quad \|x\|_2\le 1. \tag{54}$$
Of course, roughly speaking (basically ignoring all $\epsilon$'s), for all practical purposes programs (52) and (54) are equivalent. This essentially means that if $f_h(x)$ is such that (48) holds, then not only will $\xi^{(l)}_D(f_h)$ be a lower bound on $\xi_h(f_h,A,B)$ with probability $1$ as $n\to\infty$, but also its small variation $\xi^{(u)}_D(f_h)$ will be an upper bound on $\xi_h(f_h,A,B)$ with probability $1$ as $n\to\infty$. Or, in other words, the probability that $\xi_h(f_h,A,B)$ substantially deviates away from $\xi^{(l)}_D(f_h)$ goes to zero as $n\to\infty$. Now, to demonstrate how one would proceed further, we will look at a couple of particular examples of homogeneous functions.

4.1 Purely linear f(x)

We will first look at quite likely the simplest possible example for $f(x)$, namely a purely linear function. So, we will set
$$f_{lp}(x)=\sum_{i=1}^{n} x_i. \tag{55}$$
Then (52) becomes
$$L^{(1)}_{lp}=\min_{x} \sum_{i=1}^{n} x_i-\xi^{(l)}_D(f_{lp}) \quad\text{subject to}\quad g^Tx+\left((1-\epsilon^{(m)}_1)\sqrt{m_1+m_2/2}-\epsilon^{(g)}_5\sqrt{n}\right)\le 0,\quad \|x\|_2\le 1. \tag{56}$$
Also, to make writing easier we set
$$\sqrt{D^{(l)}}=(1-\epsilon^{(m)}_1)\sqrt{\alpha_1+\alpha_2/2}-\epsilon^{(g)}_5. \tag{57}$$
Now we rewrite (56) in the following more convenient way:
$$L^{(1)}_{lp}=\min_{x}\max_{\lambda\ge 0} \sum_{i=1}^{n} x_i+\lambda g^Tx+\lambda\sqrt{D^{(l)}}\sqrt{n}-\xi^{(l)}_D(f_{lp}) \quad\text{subject to}\quad \|x\|_2\le 1. \tag{58}$$
Since the duality easily holds, one then further has
$$L^{(1)}_{lp}=\max_{\lambda\ge 0}\min_{x} \sum_{i=1}^{n} x_i+\lambda g^Tx+\lambda\sqrt{D^{(l)}}\sqrt{n}-\xi^{(l)}_D(f_{lp}) \quad\text{subject to}\quad \|x\|_2\le 1. \tag{59}$$
After solving the inner minimization we finally have
$$L^{(1)}_{lp}=\max_{\lambda\ge 0}\left(-\|\mathbf{1}+\lambda g\|_2+\lambda\sqrt{D^{(l)}}\sqrt{n}\right)-\xi^{(l)}_D(f_{lp}), \tag{60}$$
where $\mathbf{1}$ is the $n$-dimensional column vector of all ones. Now, clearly, $L^{(1)}_{lp}$ is a random quantity. To completely understand its random behavior one would need to study it in full detail. However, since this paper is mostly concerned with a conceptual approach rather than with the details of particular calculations, we skip all unnecessary portions and focus only on the main results.
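The $\lambda$-maximization in (60) is straightforward to evaluate numerically for individual draws of $g$, and its values (normalized by $\sqrt{n}$) indeed cluster tightly. A small Monte Carlo sketch (the $\alpha$'s are illustrative and the $\epsilon$'s from (57) are set to zero for simplicity):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# evaluate (60) for several draws of g; with the epsilon's set to 0,
# sqrt(D_l) = sqrt(alpha1 + alpha2/2) as in (57) (the alpha's are illustrative)
rng = np.random.default_rng(0)
n, alpha1, alpha2 = 4000, 0.3, 0.4
D_l = alpha1 + alpha2 / 2
ones = np.ones(n)

vals = []
for _ in range(10):
    g = rng.standard_normal(n)
    # maximize -||1 + lam*g||_2 + lam*sqrt(D_l)*sqrt(n) over lam >= 0
    res = minimize_scalar(
        lambda lam: np.linalg.norm(ones + lam * g) - lam * np.sqrt(D_l * n),
        bounds=(0.0, 10.0), method='bounded')
    vals.append(-res.fun / np.sqrt(n))

print(np.mean(vals), -np.sqrt(1 - D_l))  # concentration near -sqrt(1 - D_l), cf. (62)
```

The tight clustering of the per-draw values around a deterministic number is exactly the concentration phenomenon the asymptotic computation below relies on.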
To that end we just mention, without proving, that $L^{(1)}_{lp}$ concentrates around its mean with overwhelming probability (the proof of this fact is not hard; however, we feel that going into such details would sidetrack our exposition; instead we mention that a great deal of the details needed for proofs of this type can be found in e.g. [5, 6] as well as in many general probability references). Given all of this, it is clear that to apply the results of Lemma 2 it is enough to compute $EL^{(1)}_{lp}$ and then choose $\xi^{(l)}_D(f_{lp})$ such that $EL^{(1)}_{lp}\ge 0$. When $n$ is large one then has
$$\lim_{n\to\infty}\frac{EL^{(1)}_{lp}}{\sqrt{n}}=\max_{\lambda\ge 0}\left(-\sqrt{1+\lambda^2}+\lambda\sqrt{D^{(l)}}\right)-\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{lp})}{\sqrt{n}}, \tag{61}$$
which after solving over $\lambda$ gives
$$\lim_{n\to\infty}\frac{EL^{(1)}_{lp}}{\sqrt{n}}=\begin{cases}-\sqrt{1-D^{(l)}}-\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{lp})}{\sqrt{n}}, & \text{if } D^{(l)}\le 1\\[2pt] -\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{lp})}{\sqrt{n}}, & \text{otherwise}.\end{cases} \tag{62}$$
Now, if we recall the definition of $D^{(l)}$ from (57) and set
$$\xi^{(l)}_D(f_{lp})=\begin{cases}-\sqrt{1-\left((1-\epsilon^{(m)}_1)\sqrt{\alpha_1+\alpha_2/2}-\epsilon^{(g)}_5\right)^2}\,\sqrt{n}, & \text{if }\left((1-\epsilon^{(m)}_1)\sqrt{\alpha_1+\alpha_2/2}-\epsilon^{(g)}_5\right)^2\le 1\\[2pt] 0, & \text{otherwise},\end{cases} \tag{63}$$
we then, based on Lemma 2 and the previous discussion, have
$$\lim_{n\to\infty}P\left(\xi_h(f_{lp},A,B)>\xi^{(l)}_D(f_{lp})\right)=1, \tag{64}$$
where $\xi^{(l)}_D(f_{lp})$ is as in (63). Since (48) holds for $f_{lp}(x)$ from (55) and $f_1(x)=\|x\|_2-1$, one can now, analogously to (57), set
$$\sqrt{D^{(u)}}=(1+\epsilon^{(m)}_1)\sqrt{\alpha_1+\alpha_2/2}+\epsilon^{(g)}_5, \tag{65}$$
and write the following analogue to (58):
$$U^{(1)}_{lp}=\min_{x}\max_{\lambda\ge 0} \sum_{i=1}^{n} x_i-\lambda g^Tx+\lambda\sqrt{D^{(u)}}\sqrt{n}-\xi^{(u)}_D(f_{lp}) \quad\text{subject to}\quad \|x\|_2\le 1. \tag{66}$$
After repeating the previous arguments and relying on Lemma 4, one then arrives at
$$\lim_{n\to\infty}P\left(\xi_h(f_{lp},A,B)<\xi^{(u)}_D(f_{lp})\right)=1, \tag{67}$$
where $\xi^{(u)}_D(f_{lp})$ would, analogously to (63), be
$$\xi^{(u)}_D(f_{lp})=\begin{cases}-\sqrt{1-\left((1+\epsilon^{(m)}_1)\sqrt{\alpha_1+\alpha_2/2}+\epsilon^{(g)}_5\right)^2}\,\sqrt{n}, & \text{if }\left((1+\epsilon^{(m)}_1)\sqrt{\alpha_1+\alpha_2/2}+\epsilon^{(g)}_5\right)^2\le 1\\[2pt] 0, & \text{otherwise}.\end{cases} \tag{68}$$
We summarize the above presentation in the following convenient lemma.

Lemma 5.
Consider the optimization problem in (50). Let $f_h(x)=f_{lp}(x)=\sum_{i=1}^{n}x_i$. Let $A$ be an $m_1\times n$ matrix with i.i.d. standard normal components and let $B$ be an $m_2\times n$ matrix with i.i.d. standard normal components. Assume that $n$ is large and that $m_1=\alpha_1 n$ and $m_2=\alpha_2 n$, where $\alpha_1$ and $\alpha_2$ are constants independent of $n$. Let $\xi^{(l)}_D(f_{lp})$ and $\xi^{(u)}_D(f_{lp})$ be as in (63) and (68), respectively. Let the $\epsilon$'s in (63) and (68) be arbitrarily small constants independent of $n$. Then
$$\lim_{n\to\infty}P\left(\xi^{(l)}_D(f_{lp})<\xi_h(f_{lp},A,B)<\xi^{(u)}_D(f_{lp})\right)=1.$$
Proof.
Follows from previous discussion.

[Table 1: Experimental results for (50), run with $n=200$; $E(L^{(1)}_{lp}+\xi^{(l)}_D)/\sqrt{n}$, simulated vs. theoretical values.]

More informally, assume the setup of Lemma 5. If $1-\alpha_1-\alpha_2/2\ge 0$, one then has that with very low probability the optimal value of the objective function in (50), $\xi_h(f_{lp},A,B)$, deviates from $-\sqrt{1-\alpha_1-\alpha_2/2}\,\sqrt{n}$. On the other hand, if $1-\alpha_1-\alpha_2/2<0$, then with very high probability the optimal value of the objective function in (50), $\xi_h(f_{lp},A,B)$, is zero.

To give a bit more flavor as to how practically useful the results from the previous subsection are, we conducted a limited set of numerical experiments. Namely, we solved problem (50) with $A$ and $B$ randomly generated as i.i.d. Gaussian matrices and $f_h(x)=f_{lp}(x)=\sum_{i=1}^{n}x_i$. We repeated our experiment a number of times with different (but of course random) $A$ and $B$. The results we obtained are summarized in Table 1. The second row contains the numerical values obtained through the simulations and the third row contains the numerical values that the above theory predicts. As can be seen from Table 1, even for a fairly small value of $n$ one has a solid agreement between what the above theory predicts and the results obtained through numerical experiments. The results presented in Table 1 are given for the expected values, whereas Lemma 5 gives a probabilistic type of behavior. However, as we mentioned earlier, all important quantities do concentrate, and they concentrate around their mean values.

We will now extend the results from the previous subsection a bit. Namely, instead of looking at a purely linear function $f(x)$ we will look at general linear functions. So, we will set
$$f_{gl}(x)=\sum_{i=1}^{n} c_i x_i=c^Tx, \tag{69}$$
where $c$ is a deterministic (fixed) $n\times 1$ vector from $R^n$.
For concreteness we will also set $C_{gl}=\frac{\|c\|_2}{\sqrt{n}}$ and assume $C_{gl}<\infty$ as $n\to\infty$. As in the previous subsection one can then consider
$$L^{(1)}_{gl}=\min_{x} \sum_{i=1}^{n} c_i x_i-\xi^{(l)}_D(f_{gl}) \quad\text{subject to}\quad g^Tx+\left((1-\epsilon^{(m)}_1)\sqrt{m_1+m_2/2}-\epsilon^{(g)}_5\sqrt{n}\right)\le 0,\quad \|x\|_2\le 1. \tag{70}$$
After repeating all the steps from the previous subsection one then arrives at
$$\lim_{n\to\infty}\frac{EL^{(1)}_{gl}}{\sqrt{n}}=\max_{\lambda\ge 0}\left(-\sqrt{C_{gl}^2+\lambda^2}+\lambda\sqrt{D^{(l)}}\right)-\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{gl})}{\sqrt{n}}, \tag{71}$$
which after solving over $\lambda$ gives
$$\lim_{n\to\infty}\frac{EL^{(1)}_{gl}}{\sqrt{n}}=\begin{cases}-C_{gl}\sqrt{1-D^{(l)}}-\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{gl})}{\sqrt{n}}, & \text{if } D^{(l)}\le 1\\[2pt] -\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{gl})}{\sqrt{n}}, & \text{otherwise}.\end{cases} \tag{72}$$
One can then repeat all the remaining arguments from the previous subsection to arrive at the following (more general) analogue of Lemma 5.

Lemma 6.
Consider the optimization problem in (50). Let $f_h(x)=f_{gl}(x)=\sum_{i=1}^{n}c_i x_i$, where $c$ is a deterministic (fixed) $n\times 1$ vector from $R^n$. Set $C_{gl}=\frac{\|c\|_2}{\sqrt{n}}$ and assume $C_{gl}<\infty$ as $n\to\infty$. Let $A$ be an $m_1\times n$ matrix with i.i.d. standard normal components and let $B$ be an $m_2\times n$ matrix with i.i.d. standard normal components. Assume that $n$ is large and that $m_1=\alpha_1 n$ and $m_2=\alpha_2 n$, where $\alpha_1$ and $\alpha_2$ are constants independent of $n$. Let $\xi^{(l)}_D(f_{gl})=C_{gl}\,\xi^{(l)}_D(f_{lp})$ and $\xi^{(u)}_D(f_{gl})=C_{gl}\,\xi^{(u)}_D(f_{lp})$, where $\xi^{(l)}_D(f_{lp})$ and $\xi^{(u)}_D(f_{lp})$ are as in (63) and (68), respectively. Let the $\epsilon$'s in (63) and (68) be arbitrarily small constants independent of $n$. Then
$$\lim_{n\to\infty}P\left(\xi^{(l)}_D(f_{gl})<\xi_h(f_{gl},A,B)<\xi^{(u)}_D(f_{gl})\right)=1.$$
Proof.
Follows from previous discussion.
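Lemma 6 can also be illustrated numerically by solving (50) directly with a general-purpose solver. The sketch below does this for a random fixed $c$ (dimensions and $\alpha$'s are illustrative, and the $\epsilon$'s are ignored), comparing the solved objective value per $\sqrt{n}$ with the prediction $C_{gl}\,\xi^{(l)}_D(f_{lp})/\sqrt{n}\approx -C_{gl}\sqrt{1-\alpha_1-\alpha_2/2}$; at such a small $n$ only rough agreement should be expected.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 60
alpha1, alpha2 = 0.3, 0.3                 # illustrative ratios m1/n, m2/n
m1, m2 = int(alpha1 * n), int(alpha2 * n)
c = rng.standard_normal(n)
C_gl = np.linalg.norm(c) / np.sqrt(n)
pred = -C_gl * np.sqrt(1 - alpha1 - alpha2 / 2)  # predicted xi_h(f_gl)/sqrt(n)

vals = []
for _ in range(5):
    A = rng.standard_normal((m1, n))
    B = rng.standard_normal((m2, n))
    cons = [{'type': 'eq',   'fun': lambda x, A=A: A @ x},        # Ax = 0
            {'type': 'ineq', 'fun': lambda x, B=B: -(B @ x)},     # Bx <= 0
            {'type': 'ineq', 'fun': lambda x: 1.0 - x @ x}]       # ||x||_2 <= 1
    res = minimize(lambda x: c @ x, np.zeros(n), method='SLSQP',
                   constraints=cons, options={'maxiter': 1000, 'ftol': 1e-10})
    vals.append(res.fun / np.sqrt(n))

print(np.mean(vals), pred)
```

With $c$ replaced by the all-ones vector this reduces to the experiment of Subsection 4.1, which is exactly the content of the rotational argument in the remark below.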
Remark:
Knowing the results of Lemma 5, one can deduce Lemma 6 even faster. For example, one can observe that $f_{gl}(x)=c^Tx=C_{gl}\mathbf{1}^TQ_c x$, where $Q_c$ is an $n\times n$ matrix such that $Q_c^TQ_c=I$. Then (50) with $f_h(x)=f_{gl}(x)$ becomes
$$\xi_h(f_h,A,B)=\xi_{afc}(f_h,f_1,A,B)=\min_{x} C_{gl}\mathbf{1}^TQ_c x \quad\text{subject to}\quad AQ_c^TQ_c x=0,\quad BQ_c^TQ_c x\le 0,\quad f_1(x)=\|Q_c x\|_2-1\le 0. \tag{73}$$
After a change of variables $A_{rot}=AQ_c^T$, $B_{rot}=BQ_c^T$, and $x_{rot}=Q_c x$, one further has
$$\xi_h(f_h,A,B)=\xi_{afc}(f_h,f_1,A,B)=\min_{x_{rot}} C_{gl}\mathbf{1}^Tx_{rot} \quad\text{subject to}\quad A_{rot}x_{rot}=0,\quad B_{rot}x_{rot}\le 0,\quad f_1(x)=\|x_{rot}\|_2-1\le 0. \tag{74}$$
Now, observing that due to the rotational invariance of the Gaussian distribution the matrices $A_{rot}$ and $B_{rot}$ are again comprised of i.i.d. standard normals, one effectively has the same optimization problem as in the previous subsection. The only difference is that the objective function is multiplied by $C_{gl}$, which is exactly what Lemma 6 states should be the case.

In this subsection we will look at a more general homogeneous function $f_h(x)$. Namely, we will set
$$f_h(x)=f_{bp}(x)=\sum_{i=1}^{n-k}|x_i|+\sum_{i=n-k+1}^{n}x_i, \tag{75}$$
where $k=\beta n$ and $\beta\le 1$ is a constant independent of $n$. This function is an interesting choice for at least three reasons. First, it appears as a very important object in studying sparse solutions of random under-determined linear systems of equations. Second, it is a function for which (48) holds. And third, it has a nice structure that allows one to actually analytically compute $\xi^{(l)}_D$ (and, since (48) holds, $\xi^{(u)}_D$ as well). Below we closely follow the presentation of Section 4.1. To that end we start with (52), which in the case of interest here simplifies to (we are again mostly concerned with the scenario where $\xi_h(f_h,A,B)=\xi_h(f_{bp},A,B)<0$)
$$L^{(1)}_{bp}=\min_{x} \sum_{i=1}^{n-k}|x_i|+\sum_{i=n-k+1}^{n}x_i-\xi^{(l)}_D(f_{bp}) \quad\text{subject to}\quad g^Tx+\left((1-\epsilon^{(m)}_1)\sqrt{m_1+m_2/2}-\epsilon^{(g)}_5\sqrt{n}\right)\le 0,\quad \|x\|_2\le 1. \tag{76}$$
Also, to make writing easier we will, as in Section 4.1, use $\sqrt{D^{(l)}}$ from (57). Now we rewrite (76) in the following more convenient way:
$$L^{(1)}_{bp}=\min_{x}\max_{\lambda\ge 0} \sum_{i=1}^{n-k}|x_i|+\sum_{i=n-k+1}^{n}x_i+\lambda g^Tx+\lambda\sqrt{D^{(l)}}\sqrt{n}-\xi^{(l)}_D(f_{bp}) \quad\text{subject to}\quad \|x\|_2\le 1. \tag{77}$$
Since the duality easily holds, one then further has
$$L^{(1)}_{bp}=\max_{\lambda\ge 0}\min_{x} \sum_{i=1}^{n-k}|x_i|+\sum_{i=n-k+1}^{n}x_i+\lambda g^Tx+\lambda\sqrt{D^{(l)}}\sqrt{n}-\xi^{(l)}_D(f_{bp}) \quad\text{subject to}\quad \|x\|_2\le 1. \tag{78}$$
After solving the inner minimization we finally have
$$L^{(1)}_{bp}=\max_{\lambda\ge 0}\left(-\sqrt{\|(\mathbf{1}_{n-k}-\lambda|g_{1:n-k}|)_-\|_2^2+\|\mathbf{1}_{k}+\lambda g_{n-k+1:n}\|_2^2}+\lambda\sqrt{D^{(l)}}\sqrt{n}\right)-\xi^{(l)}_D(f_{bp})
=\max_{\theta>0}\frac{1}{\theta}\left(-\sqrt{\|(\theta\mathbf{1}_{n-k}-|g_{1:n-k}|)_-\|_2^2+\|\theta\mathbf{1}_{k}+g_{n-k+1:n}\|_2^2}+\sqrt{D^{(l)}}\sqrt{n}\right)-\xi^{(l)}_D(f_{bp}), \tag{79}$$
where $\mathbf{1}_{n-k}$ and $\mathbf{1}_{k}$ are the $(n-k)$- and $k$-dimensional column vectors of all ones, respectively. Also, $g_{1:n-k}$ and $g_{n-k+1:n}$ are the vectors comprised of the first $n-k$ and the last $k$ components of $g$, respectively. The vector $(\mathbf{1}_{n-k}-\lambda|g_{1:n-k}|)_-$ is comprised only of the negative components of the vector $\mathbf{1}_{n-k}-\lambda|g_{1:n-k}|$, and analogously the vector $(\theta\mathbf{1}_{n-k}-|g_{1:n-k}|)_-$ is comprised only of the negative components of the vector $\theta\mathbf{1}_{n-k}-|g_{1:n-k}|$. Now, clearly, $L^{(1)}_{bp}$ is a random quantity. To completely understand its random behavior one would need to study it in full detail. However, since this paper is mostly concerned with a conceptual approach rather than with the details of particular calculations, we will, as in Section 4.1, skip all unnecessary portions and focus only on the main results. To that end we just mention, without proving, that $L^{(1)}_{bp}$ concentrates around its mean with overwhelming probability (the proof of this fact needs some work but is conceptually easy; a majority of the details needed for the proof can be found in e.g. [5, 6]).
Given all of this, it is clear that to apply the results of Lemma 2 it is enough to compute $EL^{(1)}_{bp}$ and then choose $\xi^{(l)}_D(f_{bp})$ such that $EL^{(1)}_{bp}\ge 0$. When $n$ is large one then has
$$\lim_{n\to\infty}\frac{EL^{(1)}_{bp}}{\sqrt{n}}=\max_{\theta>0}\frac{1}{\theta}\left(-\sqrt{(1-\beta)\frac{2}{\sqrt{2\pi}}\int_{-\infty}^{-\theta}(\theta+z)^2 e^{-z^2/2}\,dz+\beta(1+\theta^2)}+\sqrt{D^{(l)}}\right)-\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{bp})}{\sqrt{n}}. \tag{80}$$
After solving the integral one further has
$$\lim_{n\to\infty}\frac{EL^{(1)}_{bp}}{\sqrt{n}}=\max_{\theta>0}\frac{1}{\theta}\left(-\sqrt{(1-\beta)\left((\theta^2+1)\,\mathrm{erfc}\!\left(\frac{\theta}{\sqrt{2}}\right)-\sqrt{\frac{2}{\pi}}\,\theta e^{-\theta^2/2}\right)+\beta(1+\theta^2)}+\sqrt{D^{(l)}}\right)-\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{bp})}{\sqrt{n}}. \tag{81}$$
Let
$$\phi^{(l)}(\theta)=\frac{1}{\theta}\left(-\sqrt{(1-\beta)\left((\theta^2+1)\,\mathrm{erfc}\!\left(\frac{\theta}{\sqrt{2}}\right)-\sqrt{\frac{2}{\pi}}\,\theta e^{-\theta^2/2}\right)+\beta(1+\theta^2)}+\sqrt{D^{(l)}}\right), \tag{82}$$
and
$$\hat{\theta}^{(l)}=\arg\max_{\theta>0}\phi^{(l)}(\theta). \tag{83}$$
Then one has
$$\lim_{n\to\infty}\frac{EL^{(1)}_{bp}}{\sqrt{n}}=\begin{cases}\phi^{(l)}(\hat{\theta}^{(l)})-\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{bp})}{\sqrt{n}}, & \text{if }\max_{\theta>0}\phi^{(l)}(\theta)<0\\[2pt] -\lim_{n\to\infty}\frac{\xi^{(l)}_D(f_{bp})}{\sqrt{n}}, & \text{otherwise}.\end{cases} \tag{84}$$
Now we set
$$\xi^{(l)}_D(f_{bp})=\begin{cases}\phi^{(l)}(\hat{\theta}^{(l)})\sqrt{n}, & \text{if }\max_{\theta>0}\phi^{(l)}(\theta)<0\\[2pt] 0, & \text{otherwise}.\end{cases} \tag{85}$$
Based on Lemma 2 and the previous discussion we then have
$$\lim_{n\to\infty}P\left(\xi_h(f_{bp},A,B)>\xi^{(l)}_D(f_{bp})\right)=1, \tag{86}$$
where obviously $\xi^{(l)}_D(f_{bp})$ is as in (85). Since (48) holds for $f_{bp}(x)$ from (75) and $f_1(x)=\|x\|_2-1$, one can now make use of Lemma 4 to in a way upper-bound $\xi_h(f_{bp},A,B)$. One starts by writing the following analogue to (58):
$$U^{(1)}_{bp}=\min_{x}\max_{\lambda\ge 0} \sum_{i=1}^{n-k}|x_i|+\sum_{i=n-k+1}^{n}x_i+\lambda g^Tx+\lambda\sqrt{D^{(u)}}\sqrt{n}-\xi^{(u)}_D(f_{bp}) \quad\text{subject to}\quad \|x\|_2\le 1. \tag{87}$$
After repeating the previous arguments and relying on Lemma 4, one then arrives at
$$\lim_{n\to\infty}P\left(\xi_h(f_{bp},A,B)<\xi^{(u)}_D(f_{bp})\right)=1, \tag{88}$$
where $\xi^{(u)}_D(f_{bp})$ would, analogously to (85), be
$$\xi^{(u)}_D(f_{bp})=\begin{cases}\phi^{(u)}(\hat{\theta}^{(u)})\sqrt{n}, & \text{if }\max_{\theta>0}\phi^{(u)}(\theta)<0\\[2pt] 0, & \text{otherwise},\end{cases} \tag{89}$$
$D^{(u)}$ would be as in (65), and $\phi^{(u)}(\theta)$ and $\hat{\theta}^{(u)}$ would, analogously to (82) and (83), be
$$\phi^{(u)}(\theta)=\frac{1}{\theta}\left(-\sqrt{(1-\beta)\left((\theta^2+1)\,\mathrm{erfc}\!\left(\frac{\theta}{\sqrt{2}}\right)-\sqrt{\frac{2}{\pi}}\,\theta e^{-\theta^2/2}\right)+\beta(1+\theta^2)}+\sqrt{D^{(u)}}\right), \tag{90}$$
and
$$\hat{\theta}^{(u)}=\arg\max_{\theta>0}\phi^{(u)}(\theta). \tag{91}$$
We summarize the above presentation in the following convenient lemma.

Lemma 7.
Consider the optimization problem in (50). Let $f_h(x)=f_{bp}(x)=\sum_{i=1}^{n-k}|x_i|+\sum_{i=n-k+1}^{n}x_i$. Let $A$ be an $m_1\times n$ matrix with i.i.d. standard normal components and let $B$ be an $m_2\times n$ matrix with i.i.d. standard normal components. Assume that $n$ is large and that $m_1=\alpha_1 n$ and $m_2=\alpha_2 n$, where $\alpha_1$ and $\alpha_2$ are constants independent of $n$. Let $D^{(l)}$, $D^{(u)}$, $\phi^{(l)}(\theta)$, $\hat{\theta}^{(l)}$, $\phi^{(u)}(\theta)$, and $\hat{\theta}^{(u)}$ be as in (57), (65), (82), (83), (90), and (91), respectively. Further, let $\xi^{(l)}_D(f_{bp})$ and $\xi^{(u)}_D(f_{bp})$ be as in (85) and (89), respectively. Let the $\epsilon$'s in (57) and (65) be arbitrarily small constants independent of $n$. Then
$$\lim_{n\to\infty}P\left(\xi^{(l)}_D(f_{bp})<\xi_h(f_{bp},A,B)<\xi^{(u)}_D(f_{bp})\right)=1.$$
Proof.
Follows from previous discussion.
Remark:
Taking the functional equation $\phi^{(l)}(\hat{\theta}^{(l)})=0$ (or $\phi^{(u)}(\hat{\theta}^{(u)})=0$) gives the critical dependence between $\beta$, $\alpha_1$, and $\alpha_2$ such that (50) has a negative optimal value of the objective function with probability that goes to $1$ as $n\to\infty$. In fact, this (with $\alpha_2\to 0$) is precisely what was done in [5, 6] to obtain the critical threshold for success of $\ell_1$ optimization in recovering sparse solutions of random under-determined linear systems of equations (of course, in [5, 6] we were strictly interested in characterizing the critical threshold and the properties of (76) in as explicit a way as possible and conducted a substantial further massaging of (90) and (91), which we clearly skip here).

As in Subsection 4.1, to give a bit more flavor as to how practically useful the results from the previous subsection are, we conducted a limited set of numerical experiments. Namely, we solved problem (50) with $A$ and $B$ randomly generated as i.i.d. Gaussian matrices and $f_h(x)=f_{bp}(x)=\sum_{i=1}^{n-k}|x_i|+\sum_{i=n-k+1}^{n}x_i$. We again repeated our experiment a number of times with different (but of course random) $A$ and $B$. The results we obtained are summarized in Table 2. The second row contains the numerical values obtained through the simulations and the third row contains the numerical values that the above theory predicts. As can be seen from Table 2, even for a fairly small value of $n$, one, as in Subsection 4.1, has a solid agreement between what the above theory predicts and the results obtained through numerical experiments. The results presented in Table 2 are given for the expected values, whereas Lemma 7 gives a probabilistic type of behavior. However, as we mentioned earlier, all important quantities do concentrate, and they concentrate around their mean values.

In this paper we looked at classic linearly constrained optimization problems. We viewed them in a statistical context. We provided a general way of characterizing their optimal values.
[Table 2: Experimental results for (50), run with $n=200$; $E(L^{(1)}_{bp}+\xi^{(l)}_D)/\sqrt{n}$ for several values of $\beta$, simulated vs. theoretical values.]

More specifically, we provided a generic strategy that can help create a lower bound on the optimal value of the objective function. The strategy is based on transforming the original problem into a simpler probabilistic alternate. On the other hand, for a specific type of objective function we were then able to create an analogous strategy that can help create an upper bound on the optimal value of the objective function. Moreover, probabilistically speaking, the two bounds match, which essentially means that the lower-bounding strategy (which works for any objective function) is in certain scenarios actually good enough to optimally characterize the entire problem.

We then mentioned that the presented framework is fairly powerful and presented ways in which one can modify it to cover various other optimization problems. Still, the modifications that we presented are fairly simple, and we chose to present them just to give an idea of how relatively easy it is to use the presented strategies. Of course, a whole lot more can be done, i.e. the class of optimization problems where the strategies presented here work is much wider than the few examples that we presented. However, since this is an introductory paper in which we intended just to present the core concepts of a much bigger theory, we skipped a detailed discussion as to what the limits of our propositions are. Also, many of the further modifications/extensions are typically problem specific, and we thought that it is better to cover them separately and present such a coverage elsewhere.

What is also important to stress is that we viewed optimization problems in a statistical context. To be more precise, we assumed a typical Gaussian scenario where all random quantities in any of our problems are assumed to be i.i.d. standard normals. These assumptions substantially simplified the exposition but are not really necessary.
In fact, all results presented here actually hold for a fairly large class of random distributions. Proving that is not that hard. In fact, there are many ways in which it can be done, but they typically boil down to a repetitive use of the central limit theorem. For example, a particularly simple and elegant approach would be the one of Lindeberg [4]. Adapting our exposition to fit into the framework of the Lindeberg principle is relatively easy and, in fact, if one uses the elegant approach of [2], pretty much routine. Since we did not create these techniques we chose not to do these routine generalizations. However, to make sure that the interested reader has a full grasp of the generality of the results presented here, we do emphasize again that pretty much any distribution that can be pushed through the Lindeberg principle would work in place of the Gaussian one that we used.

Since the theory that we presented above in a way establishes a random duality, we decided to call it that way. Along the lines of the above-mentioned probabilistic generality of our theory, we then coined the term regularly random duality, where under regularly random we essentially view any randomness that eventually, in large dimensional settings, boils down to Gaussian. It is quite possible that there are other classes of randomness for which similar theories can be built. While they may not be as powerful as the Gaussian one, it would certainly (at least from a mathematical point of view) be interesting to see what their shapes and forms are.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003.
[2] S. Chatterjee. A generalization of the Lindeberg principle. The Annals of Probability, 34(6):2061–2076, 2006.
[3] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. Geometric Aspects of Functional Analysis, Isr. Semin. 1986–87, Lect. Notes Math., 1317, 1988.
[4] J. W. Lindeberg. Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung. Math. Z., 15:211–225, 1922.
[5] M. Stojnic. Upper-bounding ℓ1-optimization weak thresholds. Available at arXiv.
[6] M. Stojnic. Various thresholds for ℓ1-optimization in compressed sensing. Submitted to IEEE Trans. on Information Theory.