Shape-Constrained Density Estimation via Optimal Transport
Ryan Cumings-Menon∗

November 26, 2018
Abstract
Constraining the maximum likelihood density estimator to satisfy a sufficiently strong constraint, log-concavity being a common example, has the effect of restoring consistency without requiring additional parameters. Since many results in economics require densities to satisfy a regularity condition, these estimators are also attractive for the structural estimation of economic models. In all of the examples of regularity conditions provided by Bagnoli and Bergstrom (2005) and Ewerhart (2013), log-concavity is sufficient to ensure that the density satisfies the required conditions. However, in many cases log-concavity is far from necessary, and it has the unfortunate side effect of ruling out sub-exponential tail behavior.

In this paper, we use optimal transport to formulate a shape-constrained density estimator. We initially describe the estimator using a ρ-concavity constraint. In this setting we provide results on consistency, the asymptotic distribution, and convexity of the optimization problem defining the estimator, and we formulate a test for the null hypothesis that the population density satisfies a shape constraint. Afterward, we provide sufficient conditions for these results to hold using an arbitrary shape constraint. This generalization is used to explore whether the California Department of Transportation's decision to award construction contracts with the use of a first price auction is cost minimizing. We estimate the marginal costs of construction firms subject to Myerson's (1981) regularity condition, which is a requirement for the first price reverse auction to be cost minimizing. The proposed test fails to reject that the regularity condition is satisfied.

JEL Classification:
C14
Keywords:
Nonparametric density estimation, Kernel density estimation, Optimal transport, Log-concavity, ρ-concavity/s-concavity

∗ Department of Economics, University of Illinois at Urbana-Champaign,
Address:
214 David Kin-ley Hall, 1407 West Gregory Drive, Urbana, IL 61801, USA.
E-mail address: [email protected].

Introduction
Nonparametric density estimation has the advantage over its parametric counterparts of not requiring the underlying population density to belong to a specific family. In the case of distribution functions, Kiefer and Wolfowitz (1956) showed that the empirical distribution function is a maximum likelihood estimator; however, attempting to use this distribution function to directly define a nonparametric density estimate results in a series of point masses located at each of the datapoints. Grenander (1956) provided the first example of a shape constrained density estimator as a way to extricate the maximum likelihood estimator from this "Dirac catastrophe." Specifically, he showed that maximizing the likelihood function subject to a monotonicity constraint on the estimator results in density estimates without point masses.

A great deal of progress was made in subsequent decades by adding penalty terms to the maximum likelihood objective function to restore the parsimony of the density estimator; for example, see Silverman (1986). Parzen (1962) also showed that kernel density estimators are consistent and derived their rates of convergence. Unlike Grenander's (1956) approach, the performance of maximum penalized likelihood estimators and kernel density estimators is highly dependent on the specification of penalty terms and bandwidths, respectively, which can be difficult to choose. Partly for this reason, there has recently been a renewed interest in ensuring parsimony of the maximum likelihood density estimator through conditioning on the information provided by the shape of the underlying density.
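As a concrete reference point, the Gaussian kernel density estimator discussed above can be written in a few lines; the sample, bandwidth, and mesh below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=500)           # N i.i.d. draws from the population
sigma = 0.3                        # bandwidth (illustrative)
a = np.linspace(-4.0, 4.0, 200)    # evaluation mesh

# Gaussian kernel density estimate: the average of N(z_i, sigma^2) densities
mu = np.exp(-(a[None, :] - z[:, None]) ** 2 / (2 * sigma ** 2)).mean(axis=0)
mu /= np.sqrt(2 * np.pi * sigma ** 2)
print(abs(mu.sum() * (a[1] - a[0]) - 1.0) < 1e-2)   # integrates to roughly 1
```

The estimate is smooth for any sigma > 0, but its performance hinges entirely on how that bandwidth is chosen, which is the difficulty motivating shape constraints.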
In particular, significant progress has been made on the maximum likelihood density estimator subject to the constraint that the logarithm of the density is a concave function, which defines a log-concave density (Dümbgen and Rufibach, 2009; Cule, Samworth, and Stewart, 2010; Kim and Samworth, 2016).

One early pioneer on the advantages of log-concavity for both statistical testing as well as estimation was Karlin (1968). Suppose the distribution function F : R → [0, 1] has a density function denoted by f : R → R_+. Some examples of log-concavity's many implications include that the density f(x − θ) has a monotonic likelihood ratio if and only if f(·) is log-concave, that products and convolutions of log-concave densities are log-concave, and that the hazard function of the log-concave density f(x), defined by f(x)/(1 − F(x)), is increasing. Bagnoli and Bergstrom (2005) also provide a survey of economic models in which log-concavity of a density is a sufficient condition for the existence or uniqueness of an equilibrium. Chen and Samworth (2013) as well as Dümbgen, Samworth, and Schuhmacher (2011) provide tests for a population density satisfying log-concavity, and Carroll, Delaigle, and Hall (2011) provide a test for a population density satisfying a more general set of shape constraints.

A wide variety of the random variables in the economics literature, such as annual income or changes in stock prices, are thought to exhibit sub-exponential tail behavior, so a log-concavity constraint would not result in a consistent estimator in these cases. Koenker and Mizera (2010) generalized the log-concave maximum likelihood estimator by maximizing Rényi entropy of order ρ ∈ R subject to the ρ-concavity constraint,

f(αx_0 + (1 − α)x_1) ≥ (αf(x_0)^ρ + (1 − α)f(x_1)^ρ)^{1/ρ},  for all α ∈ [0, 1].

This estimator converges to the maximum likelihood estimator subject to a log-concavity constraint in the limit as ρ → 0.
Decreasing ρ corresponds to a relaxation of this shape constraint, so if f(x) satisfies the constraint for some ρ, then it also satisfies the constraint for all ρ′ < ρ. Also, this constraint is equivalent to concavity when ρ is equal to one, and the cases of log-concavity and quasi-concavity can be derived in the limits as ρ → 0 and ρ → −∞ respectively. Koenker and Mizera (2010) place particular emphasis on the case in which ρ = −1/2, partly because most standard densities are −1/2-concave. For example, all Student t_v densities with v ≥ 1 satisfy this constraint.

ρ-concavity constraints provide a considerable relaxation of log-concavity constraints, while restricting the set of feasible densities sufficiently to ensure parsimony of the density estimator. These constraints are also sufficient conditions for many results in economics, including the uniqueness or existence of equilibria in a variety of models; see, for example, Ewerhart (2013) and Bagnoli and Bergstrom (2005). However, in many cases the necessary and sufficient conditions for these results are considerably weaker, so inference and estimation based on these stronger conditions can provide misleading results. For example, inferring whether a population density satisfies these weaker conditions based on tests for their more restrictive counterparts is generally not possible. Things are less straightforward in the case of estimation because shape constraints are generally the source of the density estimator's parsimony.
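A grid-based spot-check makes the ρ-concavity hierarchy concrete; the sketch below (with SciPy's Student t density and an illustrative grid of our own choosing) verifies that the Cauchy density, i.e. Student t with v = 1, is −1/2-concave but not log-concave:

```python
import numpy as np
from scipy import stats

def is_rho_concave(pdf, x, rho, tol=1e-9):
    """Spot-check rho-concavity on a uniform grid x: a density f is
    rho-concave iff sgn(rho) * f**rho is concave, i.e. its second
    differences on a uniform grid are non-positive."""
    g = np.sign(rho) * pdf(x) ** rho
    return bool(np.all(np.diff(g, 2) <= tol))

x = np.linspace(-8.0, 8.0, 2001)
cauchy = stats.t(df=1)                                 # Student t with v = 1
print(is_rho_concave(cauchy.pdf, x, rho=-0.5))         # True: -1/2-concave
# ...but its log-density is not concave, so log-concavity fails:
print(bool(np.all(np.diff(np.log(cauchy.pdf(x)), 2) <= 1e-9)))   # False
```

This is only a necessary check on a finite grid, not a proof, but it illustrates how much the −1/2-concavity class relaxes log-concavity: the heavy-tailed Cauchy density is admitted.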
However, since using a shape constraint that is not satisfied by the population density would not result in a consistent estimator, it is prudent to err toward the weakest constraint that theory predicts a population density would satisfy, or, when using the estimate in a structural model, the weakest constraint that a model requires a population density to satisfy.

For a concrete example in the economics literature, given a density of private valuations of risk neutral agents, Myerson (1981) defined the virtual valuations function as x − (1 − F(x))/f(x), and showed that the first price auction is revenue maximizing if this function is strictly increasing. Sufficient conditions for Myerson's (1981) regularity condition, ordered from strongest to weakest, are log-concavity, a monotonic hazard rate, and ρ-concavity for ρ > −1/2 (Ewerhart, 2013). While the first two sufficient conditions are commonly cited in mechanism design, they both rule out sub-exponential tail behavior. Although the log-normal density has sub-exponential tails, it also satisfies this condition when σ < 1, which holds in the structural model provided by Laffont, Ossard, and Vuong (1995).

(Recall that maximum likelihood is equivalent to maximizing Shannon entropy, and that Rényi entropy of order ρ converges to Shannon entropy as ρ → 0.)

This paper provides a framework for estimating and performing inference with shape constrained densities using regularized optimal transport (Cuturi, 2013). This objective function has the advantage of having an unconstrained global optimum that is a consistent density estimator, which ameliorates the requirement that the shape constraint is the only source of parsimony. At first we motivate the method using a ρ-concavity constraint, but one of the advantages of the method is that the estimator is consistent when this constraint is replaced by a wide variety of alternative shape constraints.
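Myerson's regularity condition is straightforward to check numerically for a given parametric density; the sketch below uses an illustrative log-normal with σ = 0.5 (the function name and evaluation grid are our own choices):

```python
import numpy as np
from scipy import stats

def virtual_valuation(dist, x):
    """Myerson's virtual valuation psi(x) = x - (1 - F(x)) / f(x);
    the first price auction is revenue maximizing when psi is increasing."""
    return x - (1.0 - dist.cdf(x)) / dist.pdf(x)

# log-normal valuations with sigma = 0.5 (illustrative parameter choice)
x = np.linspace(0.1, 10.0, 2000)
psi = virtual_valuation(stats.lognorm(s=0.5), x)
print(np.all(np.diff(psi) > 0))   # True: psi is increasing on this grid
```

Despite having sub-exponential tails (so that both log-concavity and a monotone hazard rate fail as diagnostics here), the virtual valuation of this density is increasing on the grid, consistent with the weaker sufficient condition.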
We also provide a consistent test for whether or not a population density satisfies a shape constraint based on comparing the objective function at the unconstrained optimum to the constrained optimum.

After introducing density estimation with this more general class of shape constraints, we use the proposed estimator to explore whether or not the California Department of Transportation's decision to use a reverse first price auction to award construction contracts is cost minimizing. To do this, we use the method provided by Guerre, Perrigne, and Vuong (2000) to calculate the firms' marginal costs using data on their bids. We find that a kernel density estimator of these costs does not satisfy Myerson's (1981) regularity condition everywhere; however, the proposed density estimate, subject to the constraint that Myerson's (1981) regularity condition is satisfied, appears to follow the data closely. Our test also fails to reject that the population density satisfies Myerson's (1981) regularity condition.

In addition to the flexibility offered by the proposed framework, there are three other advantages of the proposed method. First, the notion of fidelity to the data that we optimize is independent of the constraint, including the choice of ρ in the case of ρ-concavity constraints. Note that the objective function used in Koenker and Mizera's (2010) approach, Rényi entropy of order ρ, is dependent on this constraint parameter. Also, adding a ρ-concavity constraint to the maximum likelihood estimator would not provide a convincing way to achieve this goal, since this would not provide a convex optimization problem for values of ρ < 1, and the estimator does not exist when ρ < −1 (Doss and Wellner, 2016).

Second, the shape constraint of the estimator only binds in regions in which it would not otherwise be satisfied. This is advantageous when using shape constraints that are not sufficiently restrictive to ensure parsimony by themselves.
Moreover, the existence of the unconstrained minimizer ensures that the density estimator exists, regardless of the strength of the shape constraint. As discussed in the preceding paragraph, this is not the case for the maximum likelihood estimator. Although it is not our primary focus, we also provide an option for relying on the shape constraint for parsimony.

Third, the proposed algorithm solves an optimization problem over a set of variables that grows sub-linearly in the sample size, so the time complexity of the proposed algorithm compares favorably to other shape constrained density estimators.

The next section outlines the aspects of optimal transport that are required to formulate our estimator. Galichon (2016) provides a more comprehensive overview of the optimal transport literature, including its many applications in economics. The third section defines the estimator and provides its rate of convergence and the asymptotic distribution of the estimator. This section also provides a test for the null hypothesis that the population density satisfies the shape constraint. The fourth section proves that the optimization problem defining the estimator is convex and provides an algorithm to calculate the estimator. Note that initializing this algorithm at a reasonable approximation of the density estimator provides a gain in computational efficiency, and an algorithm for finding an approximation is provided in the appendix. The sixth section generalizes the framework presented here to allow for density estimation and inference subject to a much larger class of shape constraints, with a focus on shape constraints that arise in the economics literature. Specifically, this section provides sufficient conditions for each of the results in the paper to hold under an arbitrary set of shape constraints.
This generalization is used in the seventh section to provide evidence that the firms bidding on the California Department of Transportation's construction contracts have marginal cost distributions that satisfy Myerson's (1981) regularity condition.

A few notational conventions will be useful in the subsequent sections. For x ∈ R^m, we will denote the vector with an ith element defined by exp(x_i) as exp(x), and a similar convention will be used for log(x) and x^ρ. Also, a diagonal matrix with a diagonal equal to the vector x will be denoted by D_x, an m × 1 vector of ones by 1_m, the identity matrix by I, element-wise division of the two vectors x and y by x ⊘ y, element-wise multiplication by x ⊗ y, the Moore-Penrose pseudoinverse of the matrix A by A⁺, the convolution between f : R^d → R and g : R^d → R by g(x) ∗ f(x), the derivative of f(x) with respect to x by ∇_x f(x), and a Dirac delta function centered at z by δ_z(x). Also, sgn(x) will be used to denote a function that is 1 when x ≥ 0 and −1 when x < 0. Since the proposed method requires discretizing densities, say µ : A → R for A ⊂ R^d, we will denote the points in the mesh as {a_i}_{i=1}^m, where a_i ∈ R^d. We will also continue to include parentheses after functions, as in µ(x) or µ(·), and exclude parentheses when denoting µ ∈ R^m with elements µ_i = µ(a_i).

Optimal Transport
Gaspard Monge formulated the theory of optimal transport in the 18th century in order to derive the optimal method of moving a pile of sand to a nearby hole of the same volume. Specifically, suppose that both the pile of sand and the hole are defined on A ⊂ R^d, and we use the measures M_0 : A → R_+ and M_1 : A → R_+ to define the volume of the pile and the hole respectively. Monge sought to find a transportation plan, T : A → A, that minimizes transportation costs while ensuring that the hole is completely filled.

Kantorovitch (1958) generalized this problem by describing the transportation plan by the absolutely continuous measure ψ : A × A → R_+. For example, given a_0 ∈ A and a_1 ∈ A, we can view the Radon-Nikodym derivative of ψ(·), dψ(a_0, a_1), as the amount of mass moved from a_0 to a_1 under the transportation plan, or coupling, ψ(·). Feasibility of ψ(·) simply requires ψ(a_0, A) = M_0(a_0) and ψ(A, a_1) = M_1(a_1) for all a_0, a_1 ∈ A. When M_i(·) is absolutely continuous, there exists µ_i(a) such that M_i(A′) = ∫_{A′} µ_i(a) da for each A′ ⊂ A by the Radon-Nikodym theorem. When M_i(·) also satisfies M_i(A) = 1, then µ_i(a) is also a probability density function. Optimal transport can be described without assuming M_0(·) and M_1(·) satisfy these conditions; however, since our goal is density estimation, we will generally restrict our attention to these cases for the rest of the paper. In addition we will define (or constrain) all density functions to be continuous, with the exception of δ_z(·). We will use this notation to define Ψ(µ_0(·), µ_1(·)) as the set of feasible couplings.

The most common cost function in optimal transport is simply squared Euclidean distance. In this case the cost of moving one unit of earth from a_0 ∈ A to a_1 ∈ A is proportional to ‖a_0 − a_1‖².
The resulting minimization problem is then given by

W(µ_0(·), µ_1(·)) := min_{ψ ∈ Ψ(µ_0(·), µ_1(·))} ∫_{A×A} ‖a_0 − a_1‖² dψ(a_0, a_1),  (1)

which we will refer to as the squared Wasserstein distance (Mallows, 1972). W(µ_0(·), µ_1(·)) has many desirable properties, one being that √(W(µ_0(·), µ_1(·))) satisfies all of the usual properties of a distance metric. W(µ_0(·), µ_1(·)) also metrizes weak convergence and convergence in the first two moments. In other words, given a sequence of densities, {µ_0(·), µ_1(·), ..., µ_n(·)}, we have lim_{i→∞} W(µ_0(·), µ_i(·)) = 0 if and only if M_i converges weakly to M_0 and the first two moments of µ_i converge to the first two moments of µ_0.

The Wasserstein distance between two distributions can be viewed as a measure of distance over the domain of the densities rather than in the direction of their range. For example, the Fréchet mean of two Dirac delta functions, centered at a and b, in the space of densities equipped with an L² norm is given by arg min_{ν(x)} ‖δ_a(x) − ν(x)‖² + ‖δ_b(x) − ν(x)‖² = δ_a(x)/2 + δ_b(x)/2, while a similar notion of average in the space of densities equipped with the Wasserstein distance is arg min_{ν(x)} W(δ_a(x), ν(x)) + W(δ_b(x), ν(x)) = δ_{a/2+b/2}(x). To make this intuition more explicit, when
A ⊂ R, one can show that W(µ_0(·), µ_1(·)) can also be expressed as ∫_0^1 (Q_0(τ) − Q_1(τ))² dτ, where Q_0(τ) and Q_1(τ) are the quantile functions corresponding to µ_0(·) and µ_1(·) respectively. Thus, (Q_0(τ) − Q_1(τ))² represents a squared distance between two points in A (Villani, 2003).

In practice, augmenting the Wasserstein distance with a regularization term ameliorates some numerical difficulties, which will be described below in more detail. The regularized squared Wasserstein distance is a generalization of W(µ_0(·), µ_1(·)), and is defined by

W_γ(µ_0(·), µ_1(·)) := min_{ψ ∈ Ψ(µ_0(·), µ_1(·))} ∫_{A×A} ‖a_0 − a_1‖² dψ(a_0, a_1) − γH(ψ(·)),  (2)

where γ ≥ 0 and H(ψ(·)) := −∫_{A×A} log ψ(a_0, a_1) dψ(a_0, a_1) is the Shannon entropy of ψ(·) (Cuturi, 2013; Cuturi and Doucet, 2014). In the shape constrained density estimation setting, the addition of this entropy term is advantageous for several reasons. First, the objective function is strictly convex when γ > 0, so the optimal coupling will always be unique. Second, in practice A must be discretized before finding the unregularized Wasserstein distance, and the computational cost of solving for the optimal coupling scales at least cubically in the number of points in the mesh. Third, after discretizing, the minimizer of W_γ(µ_0, µ_1) with respect to µ_0 is often a more accurate representation of the minimizer of W(µ_0(·), µ_1(·)) when γ is set to a reasonably small value. Fourth, using W_γ(µ_0(·), µ_1(·)) allows us to avoid assumptions in the next section regarding the existence of the second moments of µ_0 and µ_1. Lastly, we can find the minimizer of (2) with a very computationally efficient algorithm after discretizing, which we will describe next.

To introduce the discretized counterparts of dψ(·), µ_0(·), and µ_1(·), recall that our uniform mesh over A contains the vertices {a_i}_{i=1}^m, and let µ_0, µ_1 have elements defined by µ_0(a_i), µ_1(a_i) respectively.
Also, let M ∈ R^{m×m} be such that M_ij := ‖a_i − a_j‖², and ψ ∈ R^{m×m} be such that ψ_ij := dψ(a_i, a_j). After discretizing, (2) can be written as

W_γ(µ_0, µ_1) := min_ψ Σ_{i,j} ψ_ij M_ij + γ ψ_ij log(ψ_ij)  subject to:  (3)

Σ_{j=1}^m ψ_ij = µ_{0,i}  ∀ i ∈ {1, 2, ..., m}  (4)

Σ_{i=1}^m ψ_ij = µ_{1,j}  ∀ j ∈ {1, 2, ..., m}.  (5)

(Minimizing W(µ_0, µ_1) with respect to µ_0 generally results in a minimizing density with many large discrete changes. For more detail, see Figures 3.1, 3.2, and the accompanying explanation in Cuturi and Peyré (2016).)

The corresponding Lagrangian is given by

L = (Σ_{i,j} γ ψ_ij log(ψ_ij) + ψ_ij M_ij) − λ_0^T (Σ_{j=1}^m ψ_{·j} − µ_0) − λ_1^T (Σ_{i=1}^m ψ_{i·} − µ_1),  (6)

and the first order conditions imply ψ_ij = exp(λ_{0,i}/γ − 1/2) exp(−M_ij/γ) exp(λ_{1,j}/γ − 1/2). In other words, there exist v, w ∈ R^m_+ such that the optimal coupling has elements ψ_ij = K_ij w_i v_j, where K_ij := exp(−‖a_i − a_j‖²/γ), a symmetric m × m matrix. This can also be written as

ψ = D_w K D_v,  (7)

so adding the entropy term to the objective function reduces the dimensionality of the optimization problem from m² to 2m. Sinkhorn (1967) shows that ψ is unique. The iterative proportional fitting procedure (IPFP) is an efficient method of computing these vectors; see Krupp (1979). This method iteratively redefines w so that D_w K v = µ_0, and subsequently redefines v so that D_v K w = µ_1, as summarized in Algorithm 1. Note that after combining these equalities we have D_w K (µ_1 ⊘ (K w)) = µ_0, which will be used in subsequent sections.

Algorithm 1
The iterative proportional fitting procedure.

function IPFP(K, µ_0, µ_1)
    w ← 1_m
    until convergence:
        v ← µ_1 ⊘ (K w)
        w ← µ_0 ⊘ (K v)
    return w, v

In the rest of the paper we will make substantial use of the dual of (3)-(5). Cuturi and Doucet (2014) show that the dual is given by the unconstrained optimization problem,

W_γ(µ_0, µ_1) = max_{(x,y) ∈ R^{2m}} x^T µ_0 + y^T µ_1 − γ Σ_{i,j} exp((x_i + y_j − M_ij)/γ).  (8)

Note that K = exp(−M/γ), so the first order conditions of (8) can be written as,

µ_{0,i} / (Σ_j exp(y_j/γ) K_ij) = exp(x_i/γ)  (9)

µ_{1,j} / (Σ_i exp(x_i/γ) K_ij) = exp(y_j/γ).  (10)

Note that after replacing exp(x/γ) with w and exp(y/γ) with v, these formulas are equivalent to the updates of w and v given in Algorithm 1. Also, given x and y that satisfy (9)-(10), consider the vectors x̃ := x − c and ỹ := y + c, where c ∈ R. Since µ_0 and µ_1 have the same sum, the objective function of (8) evaluated at x̃ and ỹ must equal the objective function evaluated at x and y. Since exp(ỹ_j/γ) = exp(y_j/γ) exp(c/γ) and exp(x̃_i/γ) = exp(x_i/γ) exp(−c/γ), x̃ and ỹ must also satisfy (9)-(10). In other words, while v and w are unique up to a multiplicative constant on w and one over this constant on v, y and x are unique up to the additive constant c.

A few comments regarding the effect of γ on the optimal coupling will also be useful in subsequent sections. Higher values of γ correspond to placing a higher penalty on the negative entropy of the coupling, so the optimal coupling becomes more dispersed as this parameter is increased. Also, in the limit γ → 0, W_γ(µ_0, µ_1) converges to W(µ_0, µ_1) at the rate O(exp(−1/γ)) and ψ converges to the optimal unregularized coupling at this same rate; see Benamou et al. (2015) and Cuturi (2013). In the next section we will use W_γ(·) as an objective function to define the proposed estimator and show how γ impacts this estimator in more detail.
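A minimal implementation of Algorithm 1 takes only a few lines; the mesh, regularization parameter γ, and marginals below are illustrative choices rather than values from the paper:

```python
import numpy as np

def ipfp(K, mu0, mu1, n_iter=5000):
    """Algorithm 1 (Sinkhorn scaling): returns w, v such that the
    coupling diag(w) @ K @ diag(v) has marginals mu0 and mu1."""
    w = np.ones_like(mu0)
    for _ in range(n_iter):
        v = mu1 / (K @ w)        # v-update: D_v K w = mu1 at convergence
        w = mu0 / (K @ v)        # w-update: D_w K v = mu0
    return w, v

# illustrative uniform 1-D mesh and two Gaussian-shaped marginals
m, gamma = 50, 0.1
a = np.linspace(0.0, 1.0, m)
K = np.exp(-(a[:, None] - a[None, :]) ** 2 / gamma)  # K_ij = exp(-||a_i - a_j||^2 / gamma)
mu0 = np.exp(-(a - 0.3) ** 2 / 0.02); mu0 /= mu0.sum()
mu1 = np.exp(-(a - 0.7) ** 2 / 0.02); mu1 /= mu1.sum()

w, v = ipfp(K, mu0, mu1)
psi = np.diag(w) @ K @ np.diag(v)
print(np.allclose(psi.sum(axis=1), mu0), np.allclose(psi.sum(axis=0), mu1))
```

In practice the loop is run until ‖D_v K w − µ_1‖ falls below a tolerance rather than for a fixed number of iterations, and very small γ slows the convergence of the updates, as discussed above.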
The input of the density estimator proposed in this paper is a kernel density estimator, µ, based on N i.i.d. datapoints, {z_i}_{i=1}^N, drawn from a uniformly continuous population density, µ* : R^d → R, with a bandwidth of σ ≥ 0. Our results also hold when γ and σ are redefined to be functions of the form c_0({z_i}_i)γ and c_1({z_i}_i)σ, where c_j({z_i}_i) →_p c_j for j ∈ {0, 1} at any polynomial rate. In the interest of the brevity of notation, we will only write the parameters in this way when their randomness would have a non-trivial impact on the result. γ, σ and m will also be dependent on N, but we will suppress this input throughout the paper and discuss our recommendations for defining them after Theorem 2.

With this notation in mind we can define the shape-constrained density estimator as,

min_f W_γ(f, µ)  subject to:  sgn(ρ) f^ρ ∈ K,  (11)

where K is the cone of concave functions. Although K is a convex set, the set of ρ-concave densities is generally not convex. To ameliorate this problem, we will use a similar formulation as Koenker and Mizera (2010) and solve for g := f^ρ. Note that no generality is lost in doing so, as there is a one-to-one correspondence between g and f. For the clarity of the derivations in the next section, we will also define g to be a vector of length m − 1 and refer to the element of the vector f, of length m, that corresponds to this omitted value as f_k. We will set f_k so that the density sums to m. In other words, the elements of the density f that do not correspond to this kth element, denoted by f_{−k}, will be set equal to g^{1/ρ}, and f_k will be set equal to m − Σ_i g_i^{1/ρ}. Unlike optimizing over g rather than f, this can be seen as a slight modification of our initial optimization problem, since we will also exclude the shape constraints that depend on the kth element of f.
However, as discussed in the next section in more detail, one can choose k to correspond to an element on the boundary of the mesh so that the estimator satisfies the shape constraint everywhere on the interior of its domain.

In a slight abuse of notation, we will also denote the objective function as W_γ(g^{1/ρ}, µ). In summary, we will define W_γ(g^{1/ρ}, µ) by,

max_{(x,y) ∈ R^{2m}} x_{−k}^T g^{1/ρ} + x_k (m − Σ_i g_i^{1/ρ}) + y^T µ − γ Σ_{i,j} exp((x_i + y_j − M_ij)/γ),  (12)

and the final form of our optimization problem is,

min_g W_γ(g^{1/ρ}, µ)  subject to:  (13)

g_i = α_i + β_i^T a_i,  sgn(ρ)(α_i − α_j + (β_i − β_j)^T a_i) ≤ 0  ∀ i, j ∈ {1, ..., m − 1},  (14)

where α_i ∈ R, β_i ∈ R^d, and d is the dimension of the support of µ and f. These inequality constraints are used by Afriat (1972) to estimate production functions with concavity constraints. They tend to provide a gain in numerical accuracy relative to local concavity constraints.

The following lemma provides the limiting distribution of the estimator after removing the shape constraints. This is achieved by showing that the resulting density can be viewed as a kernel density estimator with a bandwidth of √(σ² + γ/2) away from the edges of the mesh over A. To avoid these boundary value effects, we recommend enlarging the domain of µ and f̂ to include regions within approximately √(σ² + γ/2) of the datapoints. Alternatively, one could replace the matrix K with the direct application of a Gaussian filter. This approach has the added benefit of reducing the computational complexity of Algorithm 1 to O(m log m), so this is our recommended approach when d > 1; see Solomon et al. (2015) for more details on this method.

Lemma 1:
Suppose µ is a kernel density estimate generated with a Gaussian kernel and a bandwidth σ, and µ*(·) is uniformly continuous. Also, suppose σ, γ and m are chosen so that √(σ² + γ/2) →_p 0, N(σ² + γ/2)^{d/2} →_p ∞, and min_{i≠j} ‖a_i − a_j‖/√γ → 0 as N → ∞. Then there exists c ∈ R so that the limiting density of f_unc := arg min_f W_γ(f, µ) is given by,

√(N(σ² + γ/2)^{d/2}) (f_{unc,i} − µ*_i + c(σ² + γ/2)) →_d N(0, µ*_i/(2√π)^d).

Proof:
To find f_unc, consider the optimization problem,

min_ψ Σ_{i,j} ψ_ij M_ij + γ ψ_ij log(ψ_ij)  subject to:  (15)

Σ_{i=1}^m ψ_ij = µ_j  ∀ j ∈ {1, 2, ..., m}.  (16)

The corresponding Lagrangian is L = (Σ_{i,j} γ ψ_ij log(ψ_ij) + ψ_ij M_ij) + λ^T (Σ_i ψ_{i·} − µ), and the first order conditions imply that there exists v ∈ R^m such that

ψ = K D_v.  (17)

Note that convexity of negative entropy implies that the optimal coupling will correspond to a minimizer. After combining this equality with the constraints, we have

Σ_{i=1}^m ψ_ij = v_j Σ_{i=1}^m K_ij = µ_j,  (18)

which implies v_j = µ_j/(Σ_{i=1}^m K_ij). Let κ := K 1_m. Now we can find f_unc by finding the other marginal of ψ,

f_unc := K(µ ⊘ κ).

(When µ*(·) is differentiable at a_i, c = Δµ*(x)/2 evaluated at x = a_i, where Δµ*(x) denotes the Laplacian, Σ_{j=1}^d ∇_{x_j,x_j} µ*(x). However, c can be defined without assuming µ*(·) is differentiable; Karunamuni and Mehra (1991) provide more details on this approach.)

Let φ_η : R^d → R be a Gaussian density with variance ηI_d and mean 0_d, and suppose ν : R^d → R is a continuous function. Then Kν(a_j) = Σ_{i=1}^m exp(−‖a_i − a_j‖²/γ) ν(a_i) = √(πγ)^d Σ_{i=1}^m φ_{γ/2}(a_i − a_j) ν(a_i) by the definition of K, so K can be viewed as a linear operator that discretizes the convolution ν(·) ∗ φ_{γ/2}(·), up to a multiplicative constant. Since the distance between adjacent points decreases at a rate that is faster than √(γ/2), and the convex hull of the data is a strict subset of the convex hull of {a_i}_i, the error of this approximation converges to zero on the convex hull of the data.
Thus, as N diverges, κ_i converges to the same constant for all i in the convex hull of the data.

Since µ is a kernel density estimator, it can be expressed as µ = (Σ_i δ_{z_i}(·) ∗ φ_{σ²}/N)(a), which implies that f_unc converges to (Σ_i δ_{z_i}(·) ∗ φ_{σ²} ∗ φ_{γ/2}/N)(a). The convolution of two Gaussian densities satisfies φ_{σ²} ∗ φ_{γ/2} = φ_{σ²+γ/2}, so f_unc = (Σ_i δ_{z_i}(·)) ∗ φ_{σ²+γ/2}(a)/N, which defines a kernel density estimator with a bandwidth of √(σ² + γ/2), and Parzen (1962) provides the limiting density of the kernel density estimator when √(σ² + γ/2) → 0 and N(σ² + γ/2)^{d/2} → ∞. □

The following theorem provides the limiting density of the estimator with the primary additional assumption that (µ*(x))^ρ is strictly concave. Afterward, we will move on to results that relax this assumption.

Theorem 2:
Suppose the assumptions from Lemma 1 hold. If µ* is in the interior of the feasible set, N√(c_1({z_i}_i)σ² + c_0({z_i}_i)γ/2)/log(N) → ∞, and, for j ∈ {0, 1}, c_j({z_i}_i) converges in probability to a constant, then there exists c ∈ R so that,

√(N(σ² + γ/2)^{d/2}) (f̂_i − µ*_i + c(σ² + γ/2)) →_d N(0, µ*_i/(2√π)^d).

Proof:
Since W_γ(·) is differentiable, we can apply the envelope theorem to the dual problem (8) to show that ∇_f W_γ(f, µ) = x. This implies that f̃ ∈ R^m is a critical point of f ↦ W_γ(f, µ) if and only if it has a corresponding dual variable in (8) of x = 0, so f_unc is the unique minimizer of W_γ(·) by the proof of Lemma 1 and the definition w := exp(x/γ). (Uniqueness of f_unc can also be shown using strict convexity of W_γ(f, µ) in f, which will be shown in Theorem 6.) This implies f̂ = f_unc when f_unc is feasible. Einmahl and Mason (2005) show that f_unc →_{a.s.} µ* under the additional assumptions of the theorem when using data dependent bandwidths, as in the statement of the theorem. The result follows from the fact that µ* is in the interior of the feasible set. □

Remark 1:
Algorithm 1 can be slow to converge when γ is chosen to be too small, but our assumption that γ → 0 as N → ∞ is not problematic when the assumptions of the theorem hold. To see this, note that f_unc can be written as K(µ ⊘ (K 1_m)). Evaluating Algorithm 1 at the input densities µ and f_unc would result in the algorithm first initializing w as 1_m and then defining v to be µ ⊘ (K 1_m). The w-update would redefine w to be f_unc ⊘ K(µ ⊘ (K 1_m)), but since this is also equal to 1_m, the algorithm has already converged. For input densities f and µ, it is also generally the case that Algorithm 1 tends to converge at a faster rate for densities that are closer to f_unc. □

Remark 2:
Our more general convergence result (Theorem 4) provides the same rate of convergence as Theorem 2 without requiring that µ⋆ is an interior point of the feasible set, so we can compare this rate with the corresponding rate of convergence of shape-constrained maximum likelihood estimators. Seregin and Wellner (2010) show that the pointwise minimax absolute error loss of the log-concave constrained maximum likelihood estimator at x ∈ A is N^{−2/(d+4)} when µ⋆(x) is twice differentiable, x is in the interior of the domain of µ⋆(·), and the Hessian has full rank. Theorem 2 implies that our estimator also obtains these bounds when √(σ²+γ/2) is chosen so that it converges to zero at the optimal rate, in the sense of minimizing mean squared error, of O_p(N^{−1/(d+4)}). □

In practice, √(σ²+γ/2) has a more noticeable impact on the resulting density estimator than the individual values of σ and γ, so we recommend fixing γ/σ to be a constant and focusing on the choice of √(σ²+γ/2). Since increasing γ/σ tends to result in an increase in the rate of convergence and numerical stability of Algorithm 1, our recommendation is γ/σ = 8, which was used in all of the applications and examples given below.

From a numerical standpoint, it is possible to set √(σ²+γ/2) so that the smoothing provided by γ and σ is negligible and parsimony is almost entirely ensured by the shape constraints. Since the stability of Algorithm 1 is the only limiting factor on how small we can make √(σ²+γ/2), we can use this fact to estimate a lower bound on the value of √(σ²+γ/2).
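The stability-based lower bound just described can be sketched numerically: run the Sinkhorn-style scaling updates of Algorithm 1 with a fixed iteration budget and bisect on γ. The sketch below is illustrative rather than the paper's exact procedure; the helper names, the marginal-error tolerance, and the squared-distance cost matrix are our assumptions.

```python
import numpy as np

def sinkhorn_iters_to_converge(f, mu, K, tol=1e-8, max_iters=100):
    """Iterations of the scaling updates (v <- mu ./ (K w), w <- f ./ (K v))
    needed for the mu-marginal to match within tol; max_iters + 1 otherwise."""
    w = np.ones_like(f)
    with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
        for it in range(1, max_iters + 1):
            v = mu / (K @ w)
            w = f / (K @ v)
            err = np.max(np.abs(v * (K @ w) - mu))
            if np.isfinite(err) and err < tol:
                return it
    return max_iters + 1

def smallest_stable_gamma(f, mu, a, lo=1e-5, hi=1.0, steps=30):
    """Bisect (on a log scale) for the smallest gamma at which the scaling
    iterations converge within the fixed budget of 100 iterations."""
    C = (a[:, None] - a[None, :]) ** 2          # squared-distance cost matrix
    for _ in range(steps):
        mid = np.sqrt(lo * hi)
        if sinkhorn_iters_to_converge(f, mu, np.exp(-C / mid)) <= 100:
            hi = mid                            # stable: try a smaller gamma
        else:
            lo = mid                            # unstable: increase gamma
    return hi

# Example: two discretized Gaussian densities on a uniform mesh over [0, 1].
a = np.linspace(0.0, 1.0, 50)
f = np.exp(-((a - 0.3) ** 2) / 0.02); f /= f.sum()
mu = np.exp(-((a - 0.6) ** 2) / 0.02); mu /= mu.sum()
gamma_min = smallest_stable_gamma(f, mu, a)
```

When the Gibbs kernel exp(−C/γ) underflows for very small γ, the marginal error becomes non-finite and the helper reports non-convergence, which is exactly the instability the bisection is probing for.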
Specifically, we can estimate the lowest possible value of √(σ²+γ/2) that still results in convergence of Algorithm 1 to within a given tolerance in a fixed number of iterations, say 100, which can be achieved using a root-finding algorithm. This requires an approximation of the density estimate, and Appendix B provides a method for finding an approximation in a computationally efficient manner. When running the main optimization algorithm described in the next section, we recommend increasing γ after this root is found by approximately one fourth and increasing the number of iterations used in Algorithm 1 by a factor of approximately ten to ensure the accuracy of the Hessian.

In our tests this approach resulted in density estimates that were surprisingly similar to estimates using the method described by Koenker and Mizera (2010). However, it is likely that the condition N(σ²+γ/2)^{d/2} → ∞ would not hold in this case, which is a requirement of Theorem 2. If N(σ²+γ/2)^{d/2} → 0 and we redefine the locations of the vertices in the mesh, a, to be equal to the data points, then asymptotically µ_i and f_unc,i would only be influenced by a single data point. The following theorem shows that we can at least ensure f̂ →p µ⋆ in the case γ = σ = 0.

Theorem 3:
Suppose the locations of the vertices in the mesh a are defined to be the data points, µ⋆(z) > 0 on its bounded domain A, and µ_i = 1/N for i ∈ {1, 2, ..., N}. If there exists ε > 0 such that ∫_A |x|^{2+ε} dµ⋆ < ∞, µ⋆ is ρ-concave, and γ = σ = 0, then f̂ →p µ⋆.

Proof:
The Wasserstein distance between density functions, denoted by W(ν₁(z), ν₂(z)), is commonly approximated by discretizing, using ν_{j,i} = ν_j(z_i) for j ∈ {1, 2} and z_i ∈ {a_i}_i, and solving a linear programming problem, as discussed in the previous section. As long as min{r | A ⊂ ∪_{i=1}^m B_r(a_i)} goes to zero asymptotically, W(ν₁, ν₂) converges to its non-discretized counterpart, W(ν₁(z), ν₂(z)); see, for example, Cuturi and Peyré (2016). Since µ⋆(z) > 0 for all z ∈ A and {a_i}_i = {z_i}_i, this condition holds asymptotically, so showing W(f̂(z), µ⋆(z)) →p 0, where f̂(z) is the linear interpolation of {(a_i, f̂_i)}_{i=1}^N, will imply the result.

Let µ(z) = Σ_i δ_{z_i}(z)/N and let f(z) be an arbitrary proper density with ∫_A |x|^{2+ε} df < ∞. After using the triangle inequality twice, we can bound W(f(·), µ(·)) by W(f(·), µ⋆(·)) ± W(µ(·), µ⋆(·)) for all N > 0. Note that Wasserstein distances metrize weak convergence, in the sense that W(ν(z), ν_N(z)) →p 0 if and only if the distribution corresponding to ν_N(z) converges weakly to that of ν(z) and the first two moments of ν_N(z) converge to the first two moments of ν(z); see, for example, Theorem 7.12 of Villani (2003). Thus, W(µ(·), µ⋆(·)) →p 0 by the result of Kiefer and Wolfowitz (1956), so our bounds on W(f(·), µ(·)) imply that W(f(·), µ(·)) →p W(f(·), µ⋆(·)) for all such f(·).

Note that W(f(·), µ⋆(·)) obtains its minimum value with respect to f(·) at densities with distribution functions that are arbitrarily close to the distribution function of µ⋆(·). Since µ⋆ is ρ-concave by definition, and f̂(·) is as well by our constraint, this distribution has a well-defined and continuous density function by Aleksandrov's theorem.
Thus, f̂(·) →p µ⋆(·) by the standard argument for consistency of an M-estimator; see, for example, Theorem 5.7 of van der Vaart (2000). □

For the five reasons discussed in the previous section, our primary focus is on cases in which µ, and f_unc, are also consistent estimators, so we will move on to our recommended approach for setting √(σ²+γ/2) and m. Like kernel density estimators, the mean squared error of f_unc is minimized when √(σ²+γ/2) is O(N^{−1/(d+4)}). Also, any of the standard techniques for setting the bandwidth of kernel density estimators are reasonable methods of setting √(σ²+γ/2), including cross validation or a rule-of-thumb bandwidth estimator; for examples of rule-of-thumb bandwidth estimators see Silverman (1986) and Scott (1992). Since the shape constraint also helps to ensure the parsimony of the density estimate, these rules should generally be regarded as upper bounds on √(σ²+γ/2). In practice, using Scott's (1992) rule of thumb multiplied by 1/2 works well. After dividing each dimension of the dataset, {z_{i,j}}_i for j ∈ {1, 2, ..., d}, by min(IQR({z_{i,j}}_i)/1.349, σ̂({z_{i,j}}_i)), combining this with γ/σ = 8 results in σ and γ that are each proportional to N^{−1/(d+4)}.

The choice of m is likely less critical than that of √(σ²+γ/2). The effect of m on the estimator can be viewed as analogous to the effect of the size of a Gaussian filter on its output. In practice, these filters are often constructed by discretizing a Gaussian density over a mesh with a resolution equal to the standard deviation. Using this as a lower bound in our setting yields m√(γ/2)/d ≥ 1. The gain in accuracy from increasing m beyond the point m√(γ/2)/d = 1 generally diminishes fairly quickly, so we recommend choosing m so that m = ⌈N^β d/√(γ/2)⌉, where ⌈·⌉ is the ceiling function, for any β > 0.
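The bandwidth rule above can be sketched in a few lines of Python. The scaling by min(IQR/1.349, σ̂) and the constant ratio between γ and σ follow the text, but the one-half factor on Scott's rule, treating the combined bandwidth as √(σ² + γ/2), and the closed-form split into (σ, γ) are our reading, so the helper is illustrative rather than the paper's exact implementation.

```python
import numpy as np

def bandwidth_split(z, ratio=8.0):
    """Combined bandwidth via one half of Scott's rule of thumb, then split
    into (sigma, gamma) subject to gamma / sigma = ratio and, per dimension,
    sigma^2 + gamma / 2 = h^2."""
    n, d = z.shape
    # Robust per-dimension scale, as in the text: min(IQR / 1.349, sample std).
    q75, q25 = np.percentile(z, [75, 25], axis=0)
    scale = np.minimum((q75 - q25) / 1.349, z.std(axis=0, ddof=1))
    h = 0.5 * n ** (-1.0 / (d + 4)) * scale      # combined bandwidth per dim
    # Solve sigma^2 + (ratio / 2) * sigma = h^2 for the nonnegative root.
    sigma = -ratio / 4.0 + np.sqrt(ratio ** 2 / 16.0 + h ** 2)
    return sigma, ratio * sigma

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 2))
sigma, gamma = bandwidth_split(z)
```

The quadratic split is exact by construction: substituting γ = 8σ back into σ² + γ/2 recovers h² for every dimension.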
In the interest of specificity, we recommend using m = ⌈N^{1/3} d/√(γ/2)⌉.

If the set of constraints that bind asymptotically is known, it is straightforward to find the asymptotic distribution of f̂ when µ⋆ is not ρ-concave; however, this is rarely the case in practice. For this reason, finding confidence intervals for estimators subject to shape constraints is an active area of research in nonparametric econometrics. The difficulties in these cases are due to the estimators, when viewed as functions of the data, not being sufficiently smooth (either differentiable or Hadamard differentiable) at points at which an inequality constraint goes from slack to active, and these notions of smoothness are requirements of many of the commonly used methods for deriving limiting distributions (Andrews, 2000).

The following theorem establishes the limiting distribution of the estimator conditional on the set of constraints that bind asymptotically. Note that the set of active constraints converges when √(σ²+γ/2) converges to zero at its optimal rate. Thus, the limiting-distribution part of the result might be useful with a rather large sample size. However, since the set of constraints that binds in a finite sample may not be equal to the set of constraints that bind asymptotically, in many cases the primary value of this result is that it provides the limiting value of the estimator without assuming that (µ⋆)^ρ is strictly concave.

There are several possible methods to avoid conditioning on the set of active constraints, which is an area of ongoing research. One interesting possibility would be to adapt the method of Horowitz and Lee (2017) to the present setting. Their technique involves explicitly defining the set of active constraints as those constraints that bind or that would "nearly bind." Asymptotically correct coverage follows from consistency of this conservative estimate of the active set.

Theorem 4:
Suppose the assumptions from Lemma 1 hold. In addition, suppose σN → ∞. Then, conditional on the set of constraints that are active asymptotically, say Ω, there exists G, as defined in (23)-(25), and c ∈ ℝ^m such that

√(N(σ²+γ/2)^{d/2}) ( f̂ − µ̃ + Dc(σ²+γ/2) ) →d N(0, Gᵀ D_{µ⋆} G/(2√π)^d),

where µ̃ is the Wasserstein projection of µ⋆ onto the set of feasible densities. Specifically, if g̃ is the minimizer of min_g W_γ(g^{1/ρ}, µ⋆) subject to sgn(ρ)g ∈ K, then µ̃_{−k} := g̃^{1/ρ} and µ̃_k = m − Σ_i g̃_i^{1/ρ} in the case of a ρ-concavity constraint.

Proof: We only consider the case in which d = 1 and k = m for the sake of clarity, but generalizing the proof to higher dimensions is straightforward. Since d = 1, we will assume all sets containing the numerical labels of vertices in the mesh are ordered sequentially. For example, Ω_i will be viewed as the i-th largest vertex label at which the constraint binds. Let Ω̇ := {i | i ∉ Ω ∧ i ± 1 ∈ Ω}, the boundary of the complement of Ω. Since √(σ²+γ/2) → 0, we have σ → 0. Since this is the case, and since σN → ∞ and µ⋆ is continuous, µ converges uniformly to µ⋆ (Parzen, 1962). We can view f̂ as a continuous function of µ, so the continuous mapping theorem implies f̂ converges to µ̃. Let the matrix A ∈ ℝ^{|Ω|×(|Ω|+|Ω̇|)} be defined so that Ag_{Ω∪Ω̇} is the second difference of g_{Ω∪Ω̇}, so the binding constraints can be denoted by Ag_{Ω∪Ω̇} ≤ 0. Since we assumed that Ω is known, ĝ, the vector of Lagrange multipliers λ, and x are defined by the following system of equations:

(x_{Ω∪Ω̇} − x_k) ⊗ ĝ_{Ω∪Ω̇}^{1/ρ−1}/ρ = −Aᵀλ,   (19)

(x_{−Ω∩−Ω̇} − x_k) ⊗ ĝ_{−Ω∩−Ω̇}^{1/ρ−1}/ρ = 0,   (20)

In the interest of the simplicity of the exposition, we are defining A using local concavity constraints rather than Afriat's (1972) formulation of concavity constraints.
The nonzero elements of each row of A are given by (A_{i,j−1}, A_{i,j}, A_{i,j+1}) = sgn(ρ)(1, −2, 1). Note that j > 1, and, since the sets Ω and Ω∪Ω̇ are ordered by the vertex labels, this matrix is in row echelon form by construction. Thus, A has full rank, so AAᵀ is nonsingular.

Aĝ_{Ω∪Ω̇} = 0,   (21)

[ (ĝ^{1/ρ})ᵀ, m − Σ_i ĝ_i^{1/ρ} ]ᵀ = exp(x/γ) ⊗ (K(µ ⊘ (K exp(x/γ)))),   (22)

where the final equality results from combining the first-order conditions of the optimization problem defining W_γ(g^{1/ρ}, µ), which are given by (9) and (10).

Theorem 6 will establish that W_γ(g^{1/ρ}, µ) is strictly convex in g. Since this is the case, the solution of this system is unique. The implicit function theorem, applied to (19)-(22), implies that we can view ĝ, and thus also f̂, as a differentiable function of µ. After simplifying this system, we will find this limiting density using the delta method.

We will begin by deriving ∇_µ x using (22). Recall w := exp(x/γ), and let h(x, µ) be defined as h(x, µ) := f̂ − D_w K(µ ⊘ (Kw)), so that (22) can be written as h(x, µ) = 0. Recall the following equalities from the previous section: ψ = D_w K D_v, f = D_w Kv, and µ = D_v Kw.
These imply

∇_µ h(x, µ) = −D_w K D_{1⊘(Kw)} = −D_w K D_{v⊘µ} = −ψ D_{1⊘µ},

and

∇_w h(x, µ) = D_w K D_{µ⊘(Kw)²} K − D_{K(µ⊘(Kw))} = D_w K D_v D_{1⊘µ} D_v K − D_{f̂⊘w} = (ψ D_{1⊘µ} ψᵀ − D_{f̂}) D_{1⊘w}.

The implicit function theorem implies

∇_µ w = D_w (ψ D_{1⊘µ} ψᵀ − D_{f̂})⁺ ψ D_{1⊘µ}.

Since x = γ log(w), we have

∇_µ x = γ (ψ D_{1⊘µ} ψᵀ − D_{f̂})⁺ ψ D_{1⊘µ}.

(21) implies that each element in ĝ_Ω can be expressed as a mean of its neighbors, so we can express all of the elements of ĝ_Ω as a weighted mean of ĝ_Ω̇. In other words, there exists C such that ĝ_{Ω∪Ω̇} = C ĝ_Ω̇. (19) implies that, given x and g_Ω̇, λ is given by

(AAᵀ)⁻¹ A D_{x_{Ω∪Ω̇} − x_k} (C ĝ_Ω̇)^{1/ρ−1}/ρ = λ.

We will use the additional |Ω̇| equations in (19) to define f̂_Ω̇. After replacing each instance of ĝ_Ω̇ with f̂^ρ_Ω̇ and using this definition of λ, we can write (19) as h(x, f̂_Ω̇) = 0, where h(x, f̂_Ω̇) is defined as

(x_Ω̇ − x_k) ⊗ f̂^{1−ρ}_Ω̇ + Ã( (x_{Ω∪Ω̇} − x_k) ⊗ (C f̂^ρ_Ω̇)^{1/ρ−1} ),

and Ã := A_{·,Ω̇}ᵀ(AAᵀ)⁻¹A. This implies

∇_{f̂_Ω̇} h(·) = (1−ρ)( D_{(x_Ω̇ − x_k) ⊗ f̂^{−ρ}_Ω̇} + Ã D_{x_{Ω∪Ω̇} − x_k} D_{(C f̂^ρ_Ω̇)^{1/ρ−2}} C D_{f̂^{ρ−1}_Ω̇} ),

and ∇_{x_{Ω∪Ω̇} − x_k} h(·) = B + Ã D_{(C f̂^ρ_Ω̇)^{1/ρ−1}}, where B ∈ ℝ^{|Ω̇|×|Ω∪Ω̇|} is defined so that B_{i,j} is equal to f̂^{1−ρ}_{Ω̇_i} when i, j satisfy Ω̇_i = {Ω∪Ω̇}_j and zero otherwise.
The implicit function theorem implies

∇_{x_{Ω∪Ω̇} − x_k} f̂_Ω̇ = −( D_{(x_Ω̇ − x_k) ⊗ f̂^{−ρ}_Ω̇} + Ã D_{x_{Ω∪Ω̇} − x_k} D_{(C f̂^ρ_Ω̇)^{1/ρ−2}} C D_{f̂^{ρ−1}_Ω̇} )⁻¹ ( B + Ã D_{(C f̂^ρ_Ω̇)^{1/ρ−1}} ) / (1−ρ),

so

G_{Ω̇,·} := ∇_µ f̂_Ω̇ = ∇_{x_{Ω∪Ω̇} − x_k} f̂_Ω̇ · P · ∇_µ x
= −( D_{(x_Ω̇ − x_k) ⊗ f̂^{−ρ}_Ω̇} + Ã D_{x_{Ω∪Ω̇} − x_k} D_{(C f̂^ρ_Ω̇)^{1/ρ−2}} C D_{f̂^{ρ−1}_Ω̇} )⁻¹ ( B + Ã D_{(C f̂^ρ_Ω̇)^{1/ρ−1}} ) P (ψ D_{1⊘µ} ψᵀ − D_{f̂})⁺ ψ D_{1⊘µ} γ/(1−ρ),   (23)

where P ∈ ℝ^{|Ω∪Ω̇|×m}, P := ∇_x(x_{Ω∪Ω̇} − x_k), is defined so that P_{i,j} is equal to 1 when i, j satisfy {Ω∪Ω̇}_i = j, equal to −1 when j = m, and zero otherwise.

Note that (20) implies x_i = x_k for all i ∈ {j | j ∉ Ω ∧ j ∉ Ω̇}. The proof of Theorem 2 shows that this condition defines f_unc,i, so we have f_unc,i = f̂_i for all such i. This implies

G_{−Ω∩−Ω̇,·} := ∇_µ f̂_{−Ω∩−Ω̇} = K_{−Ω∩−Ω̇,·}.   (24)

Lastly, writing the equations defining f̂_Ω in terms of f̂_Ω̇ yields f̂_Ω = (C f̂^ρ_Ω̇)^{1/ρ}, so we have

G_{Ω,·} := ∇_µ f̂_Ω = D_{(C f̂^ρ_Ω̇)^{1/ρ−1}} C D_{f̂^{ρ−1}_Ω̇} G_{Ω̇,·}.   (25)

There are also a few options for testing whether a population density satisfies a shape constraint. Hypothetically, one could consistently test the null hypothesis that a population density is ρ-concave using any consistent shape-constrained density estimator. This can be done by estimating the population distribution subject to the shape constraint and then using one of the classic tests for whether the empirical distribution of the data is equal to this estimate; see, for example, Smirnov (1948) and Anderson and Darling (1952). In these cases, choosing a test with a statistic that is closely related to the fidelity criterion used for estimation allows for a more straightforward interpretation of the result.
For example, if the test statistic is equal to the fidelity criterion that the estimator optimizes, we would reject the null if and only if we would also reject the null for every density that satisfies the shape constraint.

The following theorem provides the distribution of W_γ(f_unc, µ) and a consistent test, based on this distribution, for the null hypothesis that µ⋆(x) satisfies the shape constraint. The method also has the straightforward interpretation from the preceding paragraph, so, if we denote the set of ρ-concave densities by K_ρ, the null is rejected when min_{f∈K_ρ} W_γ(f, µ) − W_γ(f_unc, µ) is statistically significant. Since W_γ(f, µ) is differentiable in f, this can be achieved without conditioning on the set of active constraints.

Theorem 5:
Suppose the assumptions from Lemma 1 hold and that σN → ∞. Let T be defined as N(σ²+γ/2)^{d/2}(W_γ(µ⋆, µ) − W_γ(f_unc, µ)), ψ as the optimal coupling between µ and f_unc, ψ⋆ as the optimal coupling between µ⋆ and itself, B⋆ as γ(D_{µ⋆} − ψ⋆ D_{1⊘µ⋆} ψ⋆ᵀ)⁺/2, and B as γ(D_{f_unc} − ψ D_{1⊘µ} ψᵀ)⁺/2. Then, T →d Zᵀ B⋆ Z, where the iid elements of Z ∈ ℝ^m are distributed

Z_i √(N(σ²+γ/2)^{d/2}) ∼ N( Δµ⋆_i/2, µ⋆_i/(2√π)^d ).

Also, the hypothesis that µ⋆ satisfies the shape constraint can be consistently tested at a significance level of α by rejecting the null when

N(σ²+γ/2)^{d/2}( W_γ(f̂, µ) − W_γ(f_unc, µ) ) ≥ c_α,

where c_α satisfies P(XᵀBX ≥ c_α) = α and the iid elements of X ∈ ℝ^m are distributed X_i √(N(σ²+γ/2)^{d/2}) ∼ N(0, µ_i/(2√π)^d).

Proof:
The gradient and Hessian of W_γ(f, µ) with respect to f are ∇_f W_γ(f, µ) = x_{µ,f} and ∇²_f W_γ(f, µ) = ∇_f x_{µ,f} = γ(D_f − ψ_{µ,f} D_{1⊘µ} ψ_{µ,f}ᵀ)⁺, which are derived in the proof of the next theorem. Thus, the second-order Taylor series expansion about µ⋆ = f_unc is given by

W_γ(µ⋆, µ) = W_γ(f_unc, µ) + x_{µ,f_unc}ᵀ(µ⋆ − f_unc) + (f_unc − µ⋆)ᵀ B (f_unc − µ⋆) + O(‖f_unc(µ) − µ⋆‖³).

Since f_unc := arg min_f W_γ(f, µ) and ∇_f W_γ(f, µ) = x, we have x_{µ,f_unc} = 0. Given general discretized densities µ₁, µ₂ ∈ ℝ^m, the next theorem will also show that the matrix D_{µ₁} − ψ D_{1⊘µ₂} ψᵀ has one eigenvalue that is zero and m − 1 eigenvalues that are strictly positive. Note that the pseudoinverse is a continuous function when its domain is restricted to the set of matrices with the same rank (Stewart, 1969). Since µ and f_unc converge in probability to µ⋆, Slutsky's theorem implies B →p B⋆. Lemma 1 implies that N(σ²+γ/2)^{d/2}‖f_unc − µ⋆‖³ converges to zero in probability at a rate of O_p(N^{−1/2}(σ²+γ/2)^{−d/4}). Combining these results with the limiting distribution of f_unc given in Lemma 1 implies that the Taylor series expansion given above can be written as

T := N(σ²+γ/2)^{d/2}( W_γ(µ⋆, µ) − W_γ(f_unc, µ) ) →d Zᵀ B⋆ Z.

Since µ is a consistent estimator for µ⋆, we also have X →d Z, so we also have T →d Xᵀ B⋆ X. To show the test is consistent, suppose µ⋆(x) is not ρ-concave. Recall that W_γ(f̂, µ) converges to the metric W(f̂(x), µ(x)) asymptotically (Benamou et al., 2015) and that the Wasserstein distance metrizes weak convergence of distributions (and convergence in the first two moments).
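The critical value c_α defined by P(XᵀBX ≥ c_α) = α has no simple closed form for a general B, but it can be approximated by simulating the quadratic form. The sketch below is illustrative (the function name and the Monte Carlo approach are ours; exact methods for Gaussian quadratic forms, such as Imhof's, could be substituted), assuming independent Gaussian draws with user-supplied variances.

```python
import numpy as np

def quadratic_form_critical_value(B, var, alpha=0.05, draws=200_000, seed=0):
    """Monte Carlo estimate of c_alpha with P(X^T B X >= c_alpha) = alpha,
    where X has independent N(0, var_i) entries."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=np.sqrt(var), size=(draws, len(var)))
    stats = np.einsum("ni,ij,nj->n", X, B, X)   # one quadratic form per draw
    return np.quantile(stats, 1.0 - alpha)

# Sanity check: B = I with unit variances gives a chi-squared(m) statistic,
# so the 5% critical value should be close to the chi-squared(m) quantile.
m = 5
c_alpha = quadratic_form_critical_value(np.eye(m), np.ones(m))
```

With 200,000 draws the simulated 95% quantile for the identity case lands close to the chi-squared(5) value of about 11.07, which is a convenient check before plugging in the estimated matrix B.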
Continuity of the functions f̂(x), µ(x), and µ⋆(x) implies that the distributions corresponding to these densities converge weakly if and only if the densities themselves converge to one another in probability. Since f̂ is in the feasible set and µ⋆ is not, f̂(x) cannot converge to µ⋆(x) in probability, so W(µ⋆, f̂) does not converge to zero. Also, the triangle inequality implies that asymptotically W(f̂, µ) − W(µ⋆, µ) ≥ W(µ⋆, f̂) − 2W(µ⋆, µ), which remains bounded away from zero, so N(σ²+γ/2)^{d/2}( W_γ(f̂, µ) − W_γ(f_unc, µ) ) diverges under the alternative hypothesis. □

In this section we will derive a trust-region algorithm to find the global minimum of (13) and (14). To do this, we will require the gradient and the Hessian of W_γ(µ, g^{1/ρ}). The following theorem provides these values and shows that the optimization problem is convex for cases in which ρ ≠ 0. Appendix A contains these derivations for the log-concave case, which corresponds to the limit as ρ → 0. For notational convenience, the Hessian given below corresponds to the case in which the index k is set equal to m, although this is not a requirement of the theorem.

Theorem 6:
The gradient of W_γ(µ, g^{1/ρ}) is

r := ∇_g W_γ(µ, g^{1/ρ}) = D_{g^{1/ρ−1}/ρ}(x_{−k} − x_k),   (26)

and the Hessian is

H := ∇²_g W_γ(µ, g^{1/ρ}) = ABAᵀ + C,   (27)

where A := [ D_{g^{1/ρ−1}/ρ} , −g^{1/ρ−1}/ρ ], B := γ(D_{g^{1/ρ}} − ψ D_{1⊘µ} ψᵀ)⁺, and C := ((1−ρ)/ρ²) D_{g^{1/ρ−2}} D_{x_{−k}−x_k}. In addition, W_γ(f, µ) is strictly convex in f when k is chosen to be arg min_i x_i and γ > 0, and the optimization problem given in (13)-(14) is convex.

Proof: Since W_γ(µ, g^{1/ρ}) is differentiable, the envelope theorem implies that the gradient of the objective function in (12) is equal to the gradient of the function given in (13), so ∇_g W_γ(µ, g^{1/ρ}) = D_{g^{1/ρ−1}/ρ}(x_{−k} − x_k). The derivative of (26) yields the sum of two matrices. Specifically,

H = ∇²_g W_γ(µ, f(g)) = D_{g^{1/ρ−1}/ρ} ∇_g(x_{−k}(g) − x_k(g)) + (∇_g g^{1/ρ−1}/ρ) D_{x_{−k}−x_k}.   (28)

We will begin by deriving the first term, which requires several intermediate derivatives. First, since f = (g^{1/ρ}, m − 1ᵀg^{1/ρ}), we have

∇_g f(g) = [ D_{g^{1/ρ−1}/ρ} ; −(g^{1/ρ−1})ᵀ/ρ ].

Second, we will view w as a function of f in order to find ∇_f w(f). (9) and (10) can be combined to yield the equality f − D_w K(µ ⊘ (Kw)) = 0, and implicit differentiation of this equality implies

∇_f w(f) = (D_{f⊘w} − ψ D_{1⊘µ} ψᵀ D_{1⊘w})⁺.

Third, the definition w := exp(x/γ) implies x = γ log(w), so ∇_w x(w) = γ D_{1⊘w}. Lastly, let x̃(x) := x_{−k} − x_k, so ∇_x x̃(x) = [ I , −1 ] ∈ ℝ^{(m−1)×m}.
After combining all four derivatives, we have

D_{g^{1/ρ−1}/ρ} ∇_g x̃(x(w(f(g)))) = D_{g^{1/ρ−1}/ρ} ∇_x x̃(x) ∇_w x(w) ∇_f w(f) ∇_g f(g)
= γ D_{g^{1/ρ−1}/ρ} [ I , −1 ] D_{1⊘w} (D_{f⊘w} − ψ D_{1⊘µ} ψᵀ D_{1⊘w})⁺ [ D_{g^{1/ρ−1}/ρ} ; −(g^{1/ρ−1})ᵀ/ρ ]
= γ [ D_{g^{1/ρ−1}/ρ} , −g^{1/ρ−1}/ρ ] (D_f − ψ D_{1⊘µ} ψᵀ)⁺ [ D_{g^{1/ρ−1}/ρ} ; −(g^{1/ρ−1})ᵀ/ρ ] = ABAᵀ.

Since ∇_g g^{1/ρ−1}/ρ = ((1−ρ)/ρ²) D_{g^{1/ρ−2}}, the second matrix in (28) is given by C.

Convexity requires that this Hessian is positive semidefinite. If k is chosen to be arg min_i x_i, then x_{−k} − x_k ≥ 0. Since this is the case, C is a diagonal matrix with nonnegative diagonal elements, so C is positive semidefinite.

Next we will establish that ABAᵀ is positive definite in several steps. First, note that ABAᵀ is symmetric if B is symmetric. Since D_f and ψ D_{1⊘µ} ψᵀ are symmetric, B⁺ is also symmetric, and since the Moore-Penrose pseudoinverse preserves symmetry, B is also symmetric. We will proceed by establishing a few intermediary results on the components of ABAᵀ.

Since D_{1⊘µ} is positive semidefinite, ψ D_{1⊘µ} ψᵀ is as well. Since ψ is a coupling of the densities µ and f, we have D_{1⊘f} ψ D_{1⊘µ} ψᵀ 1 = D_{1⊘f} ψ 1 = 1. In other words, 1 is an eigenvector of D_{1⊘f} ψ D_{1⊘µ} ψᵀ with a corresponding eigenvalue of 1. The Perron-Frobenius theorem states that an m × m matrix with all positive elements and columns that sum to one has a unique eigenvalue that is equal to one and m − 1 eigenvalues that are strictly less than one.
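This eigenvalue structure is easy to verify numerically. The sketch below builds an arbitrary strictly positive coupling ψ, forms M = D_{1⊘f} ψ D_{1⊘µ} ψᵀ from its marginals, and checks that the spectrum is real and nonnegative with a unique eigenvalue equal to one and the remaining m − 1 strictly smaller. The random coupling is purely illustrative; in the paper ψ is the optimal entropic coupling.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20
psi = rng.random((m, m))
psi /= psi.sum()                 # a strictly positive coupling
f = psi.sum(axis=1)              # its marginals: psi @ 1 = f, psi.T @ 1 = mu
mu = psi.sum(axis=0)

M = np.diag(1.0 / f) @ psi @ np.diag(1.0 / mu) @ psi.T
eigvals = np.sort(np.linalg.eigvals(M).real)

assert np.isclose(eigvals[-1], 1.0)       # unique eigenvalue equal to one
assert np.all(eigvals[:-1] < 1.0 - 1e-9)  # the other m - 1 are strictly smaller
assert np.all(eigvals > -1e-9)            # nonnegative, so I - M is PSD
```

Because M is similar to the symmetric positive semidefinite matrix D_f^{−1/2} ψ D_{1⊘µ} ψᵀ D_f^{−1/2}, its eigenvalues are real and nonnegative, which is why the spectrum check above only needs the real parts.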
Note that each eigenvalue, say λ, and corresponding eigenvector, say p, of D_{1⊘f} ψ D_{1⊘µ} ψᵀ satisfies

(D_{1⊘f} ψ D_{1⊘µ} ψᵀ − λI)p = 0 ⟹ (I − D_{1⊘f} ψ D_{1⊘µ} ψᵀ − λ̃I)p = 0,

so, if p is an eigenvector of D_{1⊘f} ψ D_{1⊘µ} ψᵀ, then p is an eigenvector of I − D_{1⊘f} ψ D_{1⊘µ} ψᵀ with a corresponding eigenvalue λ̃ := 1 − λ. This implies that I − D_{1⊘f} ψ D_{1⊘µ} ψᵀ is a positive semidefinite matrix with rank m − 1. Since multiplication by a positive definite matrix and applying the pseudoinverse preserve both the rank and the signs of the eigenvalues, B = γ(D_f(I − D_{1⊘f} ψ D_{1⊘µ} ψᵀ))⁺ is also a positive semidefinite matrix with rank m − 1. Observation 7.1.8 in Horn and Johnson (1990) implies that the nullspace of
ABAᵀ is the same as the nullspace of BAᵀ. Since the eigenvector of B that corresponds to the eigenvalue of zero is 1, ABAᵀ is positive definite if there does not exist v ∈ ℝ^{m−1} such that Aᵀv = 1_m. This system of equations is equivalent to g_i^{1/ρ−1}v_i = ρ for all i ∈ {1, ..., m−1} and Σ_i g_i^{1/ρ−1}v_i = −ρ, which does not have a solution, so ABAᵀ is positive definite.

This, along with the fact that the constraints in (14) are equivalent to constraining sgn(ρ)g to be in the set of concave functions, which is a convex cone, implies that the optimization problem is convex. □

Remark:
Choosing k to be arg min_i x_i is a sufficient, but not necessary, condition to guarantee convexity. Choosing k to correspond to the element on the boundary of the mesh over A with the lowest corresponding value of w_i ensures that the density estimate satisfies the shape constraints everywhere on the interior of its domain, and it rarely results in the objective function being nonconvex along the convergence path. As described below, we initialize our algorithm at a very good approximation of f̂. Note that this method does not require the specification of k. When the mesh is enlarged beyond the convex hull of the data, so that µ_i and f̂_i are lowest when a_i corresponds to a point on the boundary of the domain, w and v generally obtain their minimum values at the boundary points. Thus, initializing the algorithm at a density that is near f̂ almost always results in w falling to a sufficiently low value at points corresponding to the boundary of A to ensure strict positive definiteness of H along the entire convergence path. □

Having already derived the gradient and Hessian of W_γ(µ, g^{1/ρ}), it is straightforward to create a trust-region algorithm. The algorithm takes an initial density estimate, f^(0), as input, and in iteration i the algorithm solves

∆ ← arg min_d { dᵀHd/2 + dᵀr | (d + g^{(i−1)}_{−k}) sgn(ρ) ∈ K, ‖d‖ ≤ c }.

If the value of the objective function evaluated at g^{(i−1)}_{−k} + ∆ results in an improvement over its value at g^{(i−1)}_{−k}, then g^{(i)}_{−k} is defined to be g^{(i−1)}_{−k} + ∆. If the improvement was significant, then the radius of the trust region, c, is increased; otherwise it is decreased and g^{(i)}_{−k} is defined to be g^{(i−1)}_{−k}. This is described in Algorithm 2.

It may be possible to modify the algorithm we use to compute the approximation, as described in Appendix B, to provide a convergence guarantee.
This would allow for the formulation of the estimator using g ∈ ℝ^m rather than g ∈ ℝ^{m−1}. This is an ongoing area of research.

Algorithm 2: The parameter values for this method were set using the recommendations of Fan and Yuan (2001).
WassersteinDistance(·) uses Algorithm 1 to find w, and then computes W_γ(µ, f), the gradient, the Hessian, and the index k (as defined in Theorem 6).

function TrustRegion(µ, ρ, f^(0), K):
  (W^(0)_γ, H^(0), r^(0), k^(0)) ← WassersteinDistance(µ, ρ, K, f^(0))
  g^(0)_{−k} ← (f^(0)_{−k})^ρ
  W̃^(0)_γ ← W^(0)_γ
  η ← 1, c ← 1
  for i = 1, 2, ...:
    (W^(i)_γ, H^(i), r^(i), k^(i)) ← WassersteinDistance(µ, ρ, K, f^(i−1))
    if W^(i)_γ > W^(i−1)_γ:
      W^(i)_γ ← W^(i−1)_γ, H^(i) ← H^(i−1), r^(i) ← r^(i−1), k^(i) ← k^(i−1), g ← (f^(i−1))^ρ
    if (W^(i)_γ − W^(i−1)_γ)/W̃^(i−1)_γ is below the acceptance threshold:
      shrink c (scaled by η)
    else:
      expand c
    ∆ ← arg min_d dᵀHd/2 + dᵀr s.t. d + g ∈ K, ‖d‖ ≤ c
    W̃^(i)_γ ← ∆ᵀH∆/2 + ∆ᵀr
    η ← ‖∆‖
    g ← g + ∆
    g_k ← (m − Σ_i g_i^{1/ρ})^ρ
    f^(i) ← g^{1/ρ}
  return f

Figures 1 and 2 illustrate two examples of the output of Algorithm 2. Figure 1 provides density estimates of the rotational velocity of stars that are constrained to be ρ-concave for two different negative values of ρ. Figure 2 provides a plot of a two-dimensional density; to illustrate the tail behavior of the density more clearly, the logarithm of the density is shown. This estimator uses a dataset containing the height and left middle finger length of 3,000 British criminals that was analyzed by Macdonell (1902) and Student (1908).

Figure 1: The red density in each plot is a kernel density estimator of the rotational velocity of stars from Hoffleit and Warren (1991). The blue density is the algorithm's output. ρ was set to a different negative value in each of panels (a) and (b).
γ was set using the first method described above, so the constraints are binding almost everywhere.

The data used to generate both figures are also used by Koenker and Mizera (2010) to compare log-concave density estimates with −1/2-concave density estimates. In the case of the dataset for the rotational velocity of stars, they show that the former provides a monotonic density in the region in which the speed of rotation is strictly positive, while the latter density has a peak near the mode of the kernel density estimator shown in Figure 1. This peak is also present in both of the shape-constrained densities shown in Figure 1.

For the dataset used in Figure 2, Koenker and Mizera (2010) show that the logarithm of the maximum likelihood density estimator subject to a log-concavity constraint is extremely negative near the observation at the very top of Figure 2, so observations this far from the rest of the data would be fairly unlikely to occur if the density were in fact log-concave. The logarithm of the −1/2-concave density given below is substantially larger near this observation.

Figure 2: The data points (shown in red) consist of the height and finger length of 3,000 criminals from Macdonell (1902). The points on the convex hull of the data are illustrated with blue asterisks. The contour plot depicts the logarithm of the −1/2-concave density estimate to illustrate the tail behavior of the density.

To compute these estimates, as well as the estimates given in the application section, each iteration of Algorithm 2 used MOSEK, a highly optimized quadratic program solver, to find ∆ efficiently; however, this step is the limiting factor in the time complexity of the algorithm. The computational efficiency of the algorithm can be improved by initializing f^(0) at a good approximation of the final density estimate. Also, using the best available approximation of f̂ to define f^(0) ensures that w is as close as possible to 1, for the reasons given in Remark 1 of Theorem 2.
This results in an increase in the numerical accuracy of the gradient and Hessian in Algorithm 2, and thus increased stability of the algorithm. The appendix provides a method for finding a particularly accurate and computationally efficient approximation. In many cases, initializing Algorithm 2 at this input density results in convergence within two to four iterations, bringing the total computation time down to approximately 10 seconds in our MATLAB implementation and 35 seconds in our R implementation when m = 200. (We have not finished optimizing these implementations. We believe there is an opportunity for an improvement in this computation time by at least an order of magnitude, particularly in the version written in R.)

The next section provides results on a generalization of the optimization problem ... H and r exist. Alternatively, one could use gradient descent to solve this more general optimization problem, which would only require the user to provide the first derivative of the transformation of g.

Although log-concavity, and ρ-concavity more generally, are the most commonly studied shape constraints, the results provided in the previous sections are also applicable to a large class of new shape constraints. Specifically, this section considers solutions to the general optimization problem given by

min_g W_γ(µ, α(g))   (29)

subject to: ∩_i {β_i(g) ≤ 0},   (30)

where α_i : ℝ^{m−1} → ℝ_+ and β_i : ℝ^{m−1} → ℝ. Analogous to the estimator described in the preceding sections, we will denote the minimizer of (29)-(30) by g̃ and the generalized density estimator by f̃ := f(g̃), where f(g) := [α(g)ᵀ, m − 1ᵀα(g)]ᵀ. The following theorem provides sufficient conditions for the results provided in Theorems 2, 4, and 5 to hold in this more general setting.
The statements of these theorems were purposefully ambiguous in regard to the constraint, so we simply provide additional requirements on the functions α(·) and β(·) for these generalizations. We also provide sufficient conditions for Theorem 6, but this requires restating the result in its entirety using the new notation. Lastly, point (4) of the theorem provides a new result for strict convexity of W_γ(µ, α(g)) in the neighborhood of its global minimum, without requiring the assumption that α(·) is convex or concave.

Theorem 7:
Let i(x) be defined as argmin_i ‖x − a_i‖, Θ as the set of proper and uniformly continuous density functions, Ω_N as {g | ∩_i β_i(g) ≤ 0 ∧ α(g) ∈ R^{m−1}_+ ∧ Σ_i α_i(g) ≤ m}, and the set Λ so that f(x) ∈ Λ if and only if there exists a sequence {g^(N)}_{N=1}^∞ such that g^(N) ∈ Ω_N for all N sufficiently large and f(x) = lim_{N→∞} [α(g^(N))^T, m − 1_{m−1}^T α(g^(N))]_{i(x)}.

If µ⋆(x) ∈ Θ, both of the sets Θ ∩ Λ and Ω_N are nonempty, and the function α(·) is invertible, then the following additional conditions are sufficient for the applicability of Theorems 2, 4, 5, and 6 to f̃.

(1) If the codomain of the function g ↦ [α(g)^T, m − 1_{m−1}^T α(g)] contains f_unc, then Theorems 2 and 5 hold for f̃.
(2) If each of the functions β_i(g), as well as α(g), is differentiable in the neighborhood of g̃ asymptotically, g̃ converges to a point on the interior of the domains of these functions, and ∇_g β_i(g)|_{g=g̃} ≠ 0 for each i and ∇_g α(g)|_{g=g̃} ≠ 0, then Theorem 4 holds for f̃.

(3) Suppose {g | β_i(g) ≤ 0} is convex for all i, and α_j(g) is convex (respectively, concave) for all j. If k := argmin_i x_i (k := argmax_i x_i), then (29)-(30) is a convex optimization problem. Also, the gradient and Hessian of W_γ(µ, α(g)) exist almost everywhere, and at these points they are given by

∇_g W_γ(µ, α(g)) = (x_{−k} − x_k) ∇_g α(g),
∇²_g W_γ(µ, α(g)) = ABA^T + C,

where A := [∇_g α(g), −∇_g α(g) 1_{m−1}], B := γ(D_{α(g)} − ψ D_{1⊘µ} ψ^T)^+, and C := Σ_j (x_j − x_k) ∇²_g α_j(g).

(4) Let d : R^m × R^m → R_+ denote an arbitrary distance measure. If α(g) ∈ C² and ∇_g α(g) has full rank, then there exists δ > 0 such that W_γ(µ, α(g)) is convex in g when d([α(g)^T, m − 1_{m−1}^T α(g)]^T, f_unc) ≤ δ.

Proof: (1):
The proofs of Theorems 2 and 5 do not use the ρ-concavity constraint, so they still hold as long as there exists g_unc such that f_unc = f(g_unc). Since f_unc is the global minimum of f ↦ W_γ(f, µ), f̃ will be equal to f_unc whenever g_unc is feasible.

(2): The proof given for Theorem 4 uses the first-order delta method, which requires the conditions given in the theorem. These conditions are also sufficient for the application of the continuous mapping theorem, which was used to show convergence in probability of the estimator.

(3):
Since each α_j(g) is convex, Aleksandrov's theorem implies that ∇_g α_j(g) and ∇²_g α_j(g) exist almost everywhere. We will begin by deriving the gradient and Hessian at these points. Let ω(x, y, g) be defined as

x_{−k}^T α(g_{−k}) + x_k (m − 1_{m−1}^T α(g_{−k})) + y^T µ − γ Σ_{i,j} exp((x_i + y_j − M_{ij})/γ),

so that we can write W_γ(µ, α(g)) as max_{x,y} ω(x, y, g). Since ω(x, y, g) is differentiable in all of its arguments, the envelope theorem implies

∇_g W_γ(µ, α(g)) = ∇_g ω(x, y, g) = (x_{−k} − x_k) ∇_g α(g).

By the same logic used in the proof of Theorem 4, we can write the Hessian as

∇²_g W_γ(µ, α(g)) = (∇_g f(g)) (∇_f x(w(f))) (∇_g f(g))^T + Σ_{l≠k} (x_l − x_k) ∇²_g α_l(g) = ABA^T + C.

ABA^T is positive definite by the same argument used in Theorem 6, which is also expanded on further in the proof of statement (4). When each α_j(g) is convex and ∇²_g α_j(g) exists, then ∇²_g α_j(g) is a positive semidefinite matrix. In this case, k := argmin_i x_i, so each element of (x_{−k} − x_k) is nonnegative. This implies each term in the sum defining C is a positive semidefinite matrix, so C is positive semidefinite. Also, when each α_j(g) is concave and ∇²_g α_j(g) exists, then ∇²_g α_j(g) is negative semidefinite. Choosing k := argmax_i x_i implies (x_j − x_k) ∇²_g α_j(g) is a positive semidefinite matrix, so C is also positive semidefinite in this case.

Since {g | β_i(g) ≤ 0} is convex for all i, the intersection of these sets is also convex. This, combined with the strict convexity of W_γ(µ, α(g)) in g, implies convexity of the optimization problem given in (29)-(30).

(4): The argument given in Remark 1 following Theorem 2 implies that each of the elements of x are the same when f(g) = f_unc.
When this occurs, we have x_{−k} − x_k = 0, so C = 0_{(m−1)×(m−1)} when f(g) = f_unc. The argument in the second to last paragraph of Theorem 6 implies that
ABA^T is positive definite when ∇_g α(g) has full rank and there does not exist v ∈ R^{m−1} such that A^T v = 1_m. This system of equations requires (∇_g α(g)) v = 1_{m−1} and −1_{m−1}^T ∇_g α(g) v = 1. However, if v satisfies (∇_g α(g)) v = 1_{m−1}, then −1_{m−1}^T ∇_g α(g) v = 1 − m, so there is not a solution to this system of equations. This, combined with the fact that ∇_g α(g) has full rank, implies ABA^T is positive definite.

Since α(g) ∈ C², the eigenvalues of ∇²_g W_γ(µ, α(g)) are continuous in g. Since we have shown that these eigenvalues are strictly positive when f(g) = f_unc, continuity implies that there exists δ > 0 such that all eigenvalues are nonnegative when d(f(g), f_unc) ≤ δ. □

Remark 1:
Some of the assumptions given above were made to simplify the exposition rather than out of necessity. For example, we can replace assumptions regarding differentiability and rank for all g with similar assumptions in the neighborhood of g̃. The assumption regarding the existence of the inverse of α(·) is worth mentioning in particular. Cases in which either α(·) or α^{−1}(·) cannot be expressed in closed form appear to be fairly common, but closed-form solutions are not a requirement of the theorem since they can be replaced with their numerical counterparts. The application provided in the next section is one example of this case. □

Mechanism design appears to be a particularly fruitful source of applications for this generalized shape-constrained density estimator. For example, consider a private values auction model in which bidders have valuations that are drawn from the density f(x). Myerson (1981) defines the virtual valuations function as J_f(x) := x − (1 − F(x))/f(x), and shows that if J_f(x) is monotone increasing, then an auction that awards the item to the highest bidder is optimal in the sense of maximizing expected revenue. It is common in mechanism design to make the stronger assumption that the hazard function, defined by f(x)/(1 − F(x)), is increasing, or the even stronger assumption that f(x) is log-concave. McAfee and McMillan (1987) show that monotonicity of x − (1 − F(x))/f(x) is equivalent to convexity of the function g(x) = 1/(1 − F(x)), which is used in the next section to formulate a density estimate subject to the constraint that f(x) satisfies Myerson's (1981) regularity condition. In addition, Myerson and Satterthwaite (1983) show that bilateral bargaining between a buyer and a seller will only result in trade when the virtual valuation of the buyer and the virtual cost of the seller, defined by x + F(x)/f(x), are both increasing functions.
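To make the regularity condition concrete, the virtual valuations function and its monotonicity check can be sketched on a discrete grid. This is a minimal NumPy illustration only; the trapezoidal CDF, the grid, and the exponential example are our own choices, not part of the paper's estimator:

```python
import numpy as np

def virtual_valuation(x, f):
    """Discretized virtual valuations J_f(x) = x - (1 - F(x)) / f(x).
    x is an increasing grid; f holds (possibly unnormalized) density values."""
    inc = np.diff(x) * 0.5 * (f[1:] + f[:-1])       # trapezoid areas
    f = f / inc.sum()                               # normalize to a density
    F = np.concatenate(([0.0], np.cumsum(inc) / inc.sum()))  # trapezoidal CDF
    return x - (1.0 - F) / f

# Myerson regularity holds when J_f is monotone increasing; equivalently,
# McAfee and McMillan's (1987) condition is convexity of 1 / (1 - F(x)).
x = np.linspace(0.0, 5.0, 400)
J = virtual_valuation(x, np.exp(-x))   # exponential density: J(x) = x - 1
print(np.all(np.diff(J) > 0))
```

For the exponential density the check returns True, since J(x) = x − 1 is increasing; a density with a sufficiently heavy right tail can fail it.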
Note that we can define H(x) := 1 − F(x) and h(x) := H′(x) = −f(x) to show that this last condition is equivalent to monotonicity of x − (1 − H(x))/h(x). A reformulation of McAfee and McMillan's (1987) condition for monotonicity of J_f(x) shows that this is equivalent to convexity of g(x) := 1/(1 − H(x)). This allows for the formulation of this shape constraint in a manner analogous to the method used to formulate the shape constraint in the next section.

It would also be interesting to explore constraining a density to have an increasing hazard function. Wellner and Laha (2017) show that this is equivalent to constraining g(x) = −log(1 − F(x)) to be convex. In all three of the examples given above, guaranteeing that C is positive definite requires the density estimate to satisfy a set of inequalities that do not appear to have an obvious interpretation. Regardless, statement (4) in Theorem 7 implies that W_γ(µ, α(g)) is still convex as long as f(g) is sufficiently close to f_unc. Initializing the density near this unconstrained minimizer and then checking for convexity in each iteration often results in local convexity of W_γ(µ, α(g)) along the path of convergence.

Since the eigenvalues of the positive definite matrix ABA^T are increasing in γ, √(γ/2 + σ²) can also be increased to ensure that W_γ(µ, α(g)) is convex over a larger set. This has the added benefit of increasing the dispersion of f_unc, which results in f_unc moving closer to the feasible set in the case of most shape constraints, including all the examples discussed so far. In some cases, ensuring convexity by increasing √(γ/2 + σ²) may result in the density being too dispersed. If this occurs, it would be best to compare the resulting density estimate with an estimate subject to a stronger constraint that allows for the formulation of a convex optimization problem, and check which density estimate fits the data more closely.
For example, Ewerhart (2013) shows that a sufficient condition for a density to satisfy Myerson's (1981) regularity condition is ρ-concavity for ρ > −1/2, and a log-concavity constraint can be used (This example also demonstrates that α(·) and β(·) do not need to be unique, since we could constrain the discretized counterparts of 1/(1 − F(x)) to be convex, ∇_x 1/(1 − F(x)) to be monotonic, or ∇_{x,x} 1/(1 − F(x)) to be positive.)
to ensure that the hazard function is monotonic. Many other examples of constraints that are commonly imposed on densities in economics are given by Ewerhart (2013). Even though the examples cited in this paper all constrain g(x) to be concave or convex, this is by no means a requirement of Theorem 7. For example, one could define a density estimate of the form given by (29)-(30) to estimate densities subject to any of the examples of shape constraints that are given by Ewerhart (2013).

The California Department of Transportation (Caltrans) uses first price auctions to allocate construction contracts. In this section we use data on the bids submitted to Caltrans in 1999 and 2000 to explore whether or not this choice of auction format minimizes the costs to the state of California. As discussed in the previous section, if f(x) is the density of private valuations for the bidders, with a distribution function denoted by F(x), and if the bidders are risk neutral, Myerson (1981) shows that auctions that award the item to the person with the highest bid are optimal when the virtual valuations of the bidders are monotonically increasing.

To examine whether this condition is plausible, we need to estimate the valuations (or, in this case, marginal costs) of the construction firms. Guerre, Perrigne, and Vuong (2000) used the fact that the best response function of bidders in a first price sealed bid auction is an increasing function of the bidders' valuations to show that the valuation of bidder i can be estimated by

b_i + F̂_b(b_i)/((l − 1) f̂_b(b_i)),  (31)

where l is the number of bidders participating in the auction, b_i is i's bid, f̂_b(·) is a consistent estimate of the density of bids, and F̂_b(·) is its corresponding distribution function. To control for the size of each project, we normalize each bid by Caltrans's engineers' estimates of the cost of each project before estimating f̂_b(·) and F̂_b(·) for each auction size.

Bajari, Houghton, and Tadelis (2006) use the same dataset to estimate the costs of each firm. We follow a similar estimation strategy but make some modifications because our focus is on the costs to the state of California. Specifically, we did not subtract transportation costs from the cost estimates or treat bids from small firms differently than larger firms. Each bid consists of a unit cost bid on each item that the contract requires, and the total bid is equal to the dot product of the number of items required and the unit bid of each item. If small modifications are made to the contract after it is awarded, the final payment to the firm is equal to the dot product of the modified quantities and the unit costs in the original bid. Bajari, Houghton, and Tadelis (2006) found evidence that firms are able to make accurate forecasts of these final quantities, so we follow their recommendation and replace the first term in (31) with the final amount that is paid to the firm (normalized by the Caltrans engineers' estimate of the project cost). We also exclude all auctions in which these modifications resulted in a change in the payment received by the firm of more than 3%. After excluding these auctions we were left with 1,393 bids.
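A bare-bones version of the estimator in (31) can be sketched as follows. This is an illustrative NumPy sketch only: the Gaussian kernel, the empirical distribution function for F̂_b, the bandwidth, and the simulated bids are our own simplifications, and the boundary correction of Hickman and Hubbard (2015) is omitted:

```python
import numpy as np

def gpv_pseudo_values(bids, n_bidders, bandwidth):
    """Pseudo-values v_i = b_i + F_hat(b_i) / ((l - 1) f_hat(b_i)),
    following Guerre, Perrigne, and Vuong (2000)."""
    b = np.asarray(bids, dtype=float)
    z = (b[:, None] - b[None, :]) / bandwidth
    # Gaussian kernel density estimate of the bid density, evaluated at each bid
    f_hat = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(b) * bandwidth * np.sqrt(2 * np.pi))
    F_hat = (b[:, None] >= b[None, :]).mean(axis=1)  # empirical CDF at each bid
    return b + F_hat / ((n_bidders - 1) * f_hat)

rng = np.random.default_rng(0)
bids = rng.uniform(0.5, 1.0, size=500)      # bids normalized by cost estimates
values = gpv_pseudo_values(bids, n_bidders=4, bandwidth=0.05)
print(values[:3])
```

Since F̂_b/f̂_b is nonnegative, every pseudo-value lies weakly above its bid, matching the markup interpretation of (31).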
Lastly, Hickman and Hubbard (2015) showed that the accuracy of the valuation estimates can be improved by applying a boundary correction to f̂_b(·), which we also employed in our estimation procedure.

After we estimated the valuations for each firm, we estimated f_unc by setting √(γ/2 + σ²) using Scott's (1992) rule of thumb; however, the resulting virtual valuations function was not monotonic. This could be an innocuous idiosyncrasy of the data, or it could be evidence that Caltrans's choice of auction format is suboptimal.

To investigate which possibility is more plausible, we find the proposed density estimate subject to Myerson's (1981) regularity condition. To define α(·), we solved for F(x) in the equation introduced in the previous section, g(x) = 1/(1 − F(x)). The corresponding derivative is f(x) = g′(x)/g(x)², and after discretizing, we defined α_j(g) as (g_{j+1} − g_j)/(g_j g_{j+1}). The convexity of the objective function was maintained along the entire path of convergence, without requiring that we increase √(γ/2 + σ²) above the recommendation given in the third section. The input density and the output of the algorithm are shown in Figure 3.

We also performed the test described in Theorem 5. We failed to reject the null hypothesis that the objective function, evaluated at f̂, was equal to the objective function evaluated at its unconstrained counterpart, f_unc, with a p-value of 0.93. This is similar to the result of the Kolmogorov-Smirnov (1948) test and the Anderson-Darling (1952) test for the null hypothesis that the sample was drawn from the distribution function of f̂; these tests also failed to reject the null, with p-values of 0.98 and 0.94, respectively. In this case the constraints are inactive at all but 24 points in the right tail in a mesh of 300 points.
Since the density already appears parsimonious, there is little reason to decrease √(γ/2 + σ²); however, in the interest of comparing these three tests further, we also estimated the density using Scott's (1992) rule of thumb multiplied by 1/2 rather than 2/3. In this case, the p-value of our test decreased to 0.32, while the p-values of the other two tests both increased to 0.99. This divergence in p-values underlines the difference between these two approaches. Specifically, as we decrease the smoothing, the distribution function converges to the empirical distribution function over the vast majority of the domain, so tests based on comparisons between a distribution and its empirical counterpart are less likely to reject. In contrast, our statistic measures the distance between the unconstrained minimizer of f ↦ W_γ(µ, f) and the set of feasible densities. (Myerson's (1981) regularity condition can also be expressed as 2f(x)² + f′(x)(1 − F(x)) ≥ 0, so it is always satisfied when the density is increasing. In this case, f_unc decreases rapidly to the right of the mode, as shown in Figure 3, so it is not in the set of feasible densities.) The test is most reliable when f_unc is a reasonable estimate of µ⋆, so we do not recommend setting √(γ/2 + σ²) to a value that under-smooths f_unc in this way.

Figure 3: The input density and the output density, subject to Myerson's (1981) regularity condition, are shown in black and blue, respectively. (a) shows the estimate when √(γ/2 + σ²) is set using Scott's (1992) rule of thumb multiplied by 2/3, while in (b) a factor of 1/2 is used instead. The absolute value of the valuations can be viewed as the cost per dollar of the Caltrans engineers' estimates.

This paper proposes a density estimator that is defined as the density that minimizes a regularized Wasserstein distance from the input kernel density estimator subject to ρ-concavity constraints.
This framework provides the advantages of convexity and consistency, and it allows for a generalization that is capable of estimating densities subject to a large class of alternative shape constraints. In addition, it allows for a test of the impact of the shape constraints on the fidelity criterion.

The framework presented here can also be extended to allow γ and σ to take different values at each column of the matrix K, which would be appealing in two situations. First, when one would like f⋆ to be as close as possible to µ, γ and σ can be decreased below what would otherwise have been possible in regions where the shape-constrained density estimator is closer to µ, without interfering with convergence of Algorithm 1. Secondly, this would allow for the development of methods that set √(γ/2 + σ²) using an adaptive approach similar to the one described by Sheather and Jones (1991) for kernel density estimators.

Another promising area for future research that has not already been mentioned would be to extend this framework to allow for the estimation of a regression and the density of its residuals simultaneously. Dümbgen, Samworth, and Schuhmacher (2011) showed that this does not result in a convex optimization problem in the maximum likelihood setting, so verifying convexity of the objective function in this case is an active area of research. Note that extending the framework presented here to estimating the mode of a data generating process conditional on covariates, or a modal regression, is straightforward. For example, this could be done by imposing a ρ-concavity constraint on the conditional density of the dependent variable; using a relatively low (negative) value of ρ could be viewed as similar to a quasi-concavity constraint. (Convexity of the optimization problem in this case follows from Theorem 7. Note that quasi-concavity is equivalent to uni-modality if and only if d = 1.)
Appendix A: Log-Concavity Constraints

A different approach is necessary for ρ → 0, which corresponds to log-concavity. In this case g = log(f) is constrained to be concave, and W_γ(µ, exp(g)) is defined as

max_{(x,y) ∈ R^{2m}} x_{−k}^T exp(g_{−k}) + x_k (m − Σ_i exp(g_{−k,i})) + y^T µ − γ Σ_{i,j} exp((x_i + y_j − M_{ij})/γ).  (32)

The index k can be chosen in the same way as described above to ensure the objective function is convex.

The gradient in this case is equal to

r_i := ∂W_γ(µ, exp(g))/∂g_{−k,i} = (x_i − x_k) exp(g_{−k,i}),  (33)

and the Hessian is

H := ∇² W_γ(µ, exp(g)) = ABA^T + C,  (34)

where A := [D_{exp(g)}, −exp(g)], B := γ(D_{exp(g)} − ψ D_{1⊘µ} ψ^T)^+, and C := D_{exp(g)} D_{x_{−k} − x_k}.

Appendix B: An Approximation Based on Alternating Bregman Projections
Algorithm 1 can be derived using the method of alternating Bregman projections (MABP), which is also the basis for the algorithm proposed in this section (Bregman, 1967). Bregman explores a class of divergence measures defined by

D_ϕ(x | y) := ϕ(x) − ϕ(y) − (x − y)^T ∇ϕ(y),

where ϕ : R^m → R is a convex function. In other words, if ϕ̂_y(x) is the first-order Taylor series expansion of ϕ(·) at y ∈ R^m, then D_ϕ(x | y) = ϕ(x) − ϕ̂_y(x). Many distance measures and divergences can be viewed as Bregman divergences. For example, squared Euclidean distance can be derived using ϕ(x) := ‖x‖², and ϕ(x) := Σ_i x_i log(x_i) results in Kullback-Leibler divergence.

Bregman (1967) described a way to minimize ϕ(·) subject to multiple sets of affine constraints using this divergence measure. As an example, consider

min_x ϕ(x) subject to: A_1 x = b_1, A_2 x = b_2.

For l ∈ {1, 2}, let the Bregman projection of y onto the constraint A_l x = b_l be denoted by

P_l(y) := argmin_{A_l x = b_l} D_ϕ(x | y).

MABP begins by initializing x^(0) at argmin_x ϕ(x), and the i-th iteration takes x^(i−1) as input and defines x^(i) as

x^(i) ← P_1(P_2(x^(i−1))).

This is a very efficient way to solve an optimization problem when there is an analytic solution for one or both of the projections. For example, the two updates found in Algorithm 1 can be viewed as Bregman (1967) projections of the coupling ψ onto the constraints ψ^T 1_m = µ and ψ 1_m = f.

To approximate f⋆, we need to solve

min_ψ Σ_{i,j} ψ_{ij} M_{ij} + γ ψ_{ij} log(ψ_{ij})  (35)
subject to: ψ^T 1_m = µ,  (36)
(ψ 1_m)_i = (α_i + β_i a_i)^{1/ρ}, and (α_i + β_i a_i)^{1/ρ} ≤ (α_j + β_j a_i)^{1/ρ} ∀ i, j ∈ {1, ..., m}.  (37)

We can only guarantee that MABP converges to the global minimum if the constraints are affine. The constraint in (37) is not convex, so theory cannot provide us with a guarantee that MABP converges.
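The correspondence between Bregman projections onto marginal constraints and diagonal scaling can be illustrated with a small entropic transport problem. This is a self-contained NumPy sketch; the Gaussian marginals, quadratic cost, and parameter values are our own choices for illustration:

```python
import numpy as np

def alternating_bregman(mu, f, M, gamma, iters=500):
    """Alternating KL (Bregman) projections of the coupling psi onto the
    marginal constraints psi^T 1 = mu and psi 1 = f. Each projection
    reduces to a diagonal rescaling of K = exp(-M / gamma)."""
    K = np.exp(-M / gamma)
    w = np.ones(len(f))
    for _ in range(iters):
        v = mu / (K.T @ w)   # projection onto the column-marginal constraint
        w = f / (K @ v)      # projection onto the row-marginal constraint
    return w[:, None] * K * v[None, :]   # psi = D_w K D_v

m = 50
a = np.linspace(0.0, 1.0, m)
M = (a[:, None] - a[None, :]) ** 2
mu = np.exp(-((a - 0.3) ** 2) / 0.02); mu /= mu.sum()
f = np.exp(-((a - 0.7) ** 2) / 0.02); f /= f.sum()
psi = alternating_bregman(mu, f, M, gamma=0.5)
print(np.abs(psi.sum(axis=0) - mu).max())   # both marginals are matched
```

The last projection applied is the one satisfied exactly; the other marginal converges as the iterations alternate, which is the same behavior exploited by Algorithm 1.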
Regardless, MABP is often employed with reasonable success in nonconvex cases; for examples, see Bauschke, Borwein, and Combettes (2003) and the references therein. MABP also performs well in our setting, and since the output of the algorithm will only be used to initialize Algorithm 2, inaccuracies in the output will not impact our final density estimates.

The Bregman divergence corresponding to the objective function in (3) is

D_{W_γ(·)}(ψ | ψ̄) = γ Σ_{i,j} [ψ_{ij} log(ψ_{ij}/(e ψ̄_{ij})) + ψ̄_{ij}].

(The equality constraints could be replaced with inequalities. However, the constraints must have a nonempty intersection, be closed, and be affine. Bauschke and Lewis (2000) prove that a similar algorithm, which replaces the requirement that the constraints are affine with a convexity assumption, also converges to the global minimum.)
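This closed form can be checked directly against the definition D_ϕ(x | y) = ϕ(x) − ϕ(y) − (x − y)^T ∇ϕ(y). The sketch below is a numerical sanity check with arbitrary positive matrices of our own choosing; note that the linear cost term cancels in the divergence:

```python
import numpy as np

def phi(psi, M, gamma):
    """Entropic transport objective: <psi, M> + gamma * sum psi log psi."""
    return np.sum(psi * M + gamma * psi * np.log(psi))

def bregman_divergence(x, y, M, gamma):
    """D_phi(x | y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    grad_y = M + gamma * (np.log(y) + 1.0)
    return phi(x, M, gamma) - phi(y, M, gamma) - np.sum((x - y) * grad_y)

rng = np.random.default_rng(1)
M = rng.random((4, 4))
gamma = 0.1
x = rng.random((4, 4)) + 0.1   # arbitrary positive "couplings"
y = rng.random((4, 4)) + 0.1
closed_form = gamma * np.sum(x * np.log(x / (np.e * y)) + y)
print(np.isclose(bregman_divergence(x, y, M, gamma), closed_form))  # True
```

Because the cost term is linear in ψ, it drops out of the divergence, which is why the projections below only involve the entropy term.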
As previously mentioned, the Bregman projection onto the constraint in (36) is v ← µ ⊘ (K^T w). The constraints given by (37) can be combined to define the projection

P_1(ψ̄) := argmin_{ψ,α,β} Σ_{i,j} ψ_{ij} log(ψ_{ij}/(e ψ̄_{ij}))
subject to: (ψ 1_m)_i = (α_i + β_i a_i)^{1/ρ}, and (α_i + β_i a_i)^{1/ρ} ≤ (α_j + β_j a_i)^{1/ρ} ∀ i, j ∈ {1, ..., m}.  (38)

Rather than attempting to solve (38) numerically, we can use the change of variable f = g^{1/ρ}, as in Section 3. The following problem has Kuhn-Tucker conditions that are equivalent to those of (38) but provides a reduction in dimensionality:

argmin_{g,α,β} Σ_i g_i^{1/ρ} log(g_i^{1/ρ}/(e v̄_i))
subject to: g_i = α_i + β_i a_i and α_i + β_i a_i ≤ α_j + β_j a_i ∀ i, j ∈ {1, ..., m},

where v̄_i := Σ_j ψ̄_{ij}. To ensure this optimization problem is convex, we need to have

(ρ − 1) log(g_i^{1/ρ}/v̄_i) ≤ 1

for every i ∈ {1, 2, ..., m}, which follows from differentiating g_i ↦ g_i^{1/ρ} log(g_i^{1/ρ}/(e v̄_i)) twice. As discussed in Section 2, v and w are only unique up to a multiplicative constant, so this can easily be achieved by dividing v by c ∈ R and multiplying w by c in iterations in which this inequality may fail to hold. The pseudocode for this method is given in Algorithm 3. In this implementation, we renormalize v and w whenever max_i (ρ − 1) log(g_i^{1/ρ}/v̄_i) > 1/2, and define c as sgn(ρ). Generally five to thirty iterations are sufficient to provide a good initialization for Algorithm 2.

(Note that after v is defined using v ← µ ⊘ (Kw), the equality ψ = D_w K D_v from the second section implies that solving for the optimal density is equivalent to solving for the optimal value of w such that the density D_w K v satisfies the shape constraint. The variable v̄ is Kv, the component of f that is already fixed by the v-update.)

Algorithm 3: Produces an approximate shape-constrained density estimate using MABP. Note that no constraint regarding the mass of f was made. The mass of f will be correct when the algorithm converges due to the assignment v ← µ ⊘ (Kw), but renormalization at the end of the algorithm is necessary in the absence of convergence.

function MABP(µ, K, ρ):
    w ← 1_m
    f ← µ
    c ← sgn(ρ)
    for i = 1, 2, ...:
        v ← µ ⊘ (Kw)
        v̄ ← Kv
        if max_i (ρ − 1) log(f_i / v̄_i) > 1/2:
            v ← v/c, w ← cw
        g ← argmin_g Σ_i g_i^{1/ρ} log(g_i^{1/ρ}/(e v̄_i)) s.t. g ∈ K
        f ← g^{1/ρ}
        w ← f ⊘ (Kv)
    f ← m f/(1_m^T f)
    return f

References

Afriat, S. N. (1972). Efficiency estimation of production functions.
International Economic Review, 568-598.
Anderson, T. W., & Darling, D. A. (1952). Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. The Annals of Mathematical Statistics, 193-212.
Andrews, D. W. (2000). Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space. Econometrica, 68(2), 399-405.
Bagnoli, M., & Bergstrom, T. (2005). Log-concave probability and its applications. Economic Theory, 26(2), 445-469.
Bajari, P., Houghton, S., & Tadelis, S. (2006). Bidding for incomplete contracts: An empirical analysis. National Bureau of Economic Research.
Benamou, J. D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2015). Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), A1111-A1138.
Bauschke, H. H., Borwein, J. M., & Combettes, P. L. (2003). Bregman monotone optimization algorithms. SIAM Journal on Control and Optimization, 42(2), 596-636.
Bauschke, H. H., & Lewis, A. S. (2000). Dykstra's algorithm with Bregman projections: A convergence proof. Optimization, 48(4), 409-427.
Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200-217.
Carroll, R. J., Delaigle, A., & Hall, P. (2011). Testing and estimating shape-constrained nonparametric density and regression in the presence of measurement error. Journal of the American Statistical Association, 106(493), 191-202.
Chen, Y., & Samworth, R. J. (2013). Smoothed log-concave maximum likelihood estimation with applications. Statistica Sinica, 1373-1398.
Cule, M., Samworth, R., & Stewart, M. (2010). Maximum likelihood estimation of a multi-dimensional log-concave density. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(5), 545-607.
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 2292-2300.
Cuturi, M., & Doucet, A. (2014). Fast computation of Wasserstein barycenters. International Conference on Machine Learning (pp. 685-693).
Cuturi, M., & Peyré, G. (2016). A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1), 320-343.
Doss, C. R., & Wellner, J. A. (2016). Global rates of convergence of the MLEs of log-concave and s-concave densities. The Annals of Statistics, 44(3), 954-981.
Dümbgen, L., Samworth, R., & Schuhmacher, D. (2011). Approximation by log-concave distributions, with applications to regression. The Annals of Statistics, 39(2), 702-730.
Dümbgen, L., & Rufibach, K. (2009). Maximum likelihood estimation of a log-concave density and its distribution function: Basic properties and uniform consistency. Bernoulli, 15(1), 40-68.
Einmahl, U., & Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics, 33(3), 1380-1403.
Ewerhart, C. (2013). Regular type distributions in mechanism design and ρ-concavity. Economic Theory, 53(3), 591-603.
Fan, J. Y., & Yuan, Y. X. (2001). A new trust region algorithm with trust region radius converging to zero. Proceedings of the 5th International Conference on Optimization: Techniques and Applications (pp. 786-794). ICOTA, Hong Kong.
Galichon, A. (2016). Optimal Transport Methods in Economics. Princeton University Press.
Grenander, U. (1956). On the theory of mortality measurement. Scandinavian Actuarial Journal, 1956(2), 125-153.
Guerre, E., Perrigne, I., & Vuong, Q. (2000). Optimal nonparametric estimation of first-price auctions. Econometrica, 68(3), 525-574.
Hickman, B. R., & Hubbard, T. P. (2015). Replacing sample trimming with boundary correction in nonparametric estimation of first-price auctions. Journal of Applied Econometrics, 30(5), 739-762.
Hoffleit, E. D., & Warren Jr., W. H. (1991). Yale Bright Star Catalog. New Haven: Yale University Observatory.
Horn, R. A., & Johnson, C. R. (1990). Matrix Analysis. Cambridge University Press.
Horowitz, J. L., & Lee, S. (2017). Nonparametric estimation and inference under shape restrictions. Journal of Econometrics, 201(1), 108-126.
Kantorovitch, L. (1958). On the translocation of masses. Management Science, 5(1), 1-4.
Karlin, S. (1968). Total Positivity. Stanford: Stanford University Press.
Karunamuni, R. J., & Mehra, K. L. (1991). Optimal convergence properties of kernel density estimators without differentiability conditions. Annals of the Institute of Statistical Mathematics, 43(2), 327-346.
Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 887-906.
Kim, A. K., & Samworth, R. J. (2016). Global rates of convergence in log-concave density estimation. The Annals of Statistics, 44(6), 2756-2779.
Koenker, R., & Mizera, I. (2010). Quasi-concave density estimation. The Annals of Statistics, 2998-3027.
Laffont, J. J., Ossard, H., & Vuong, Q. (1995). Econometrics of first-price auctions. Econometrica, 953-980.
Krupp, R. S. (1979). Properties of Kruithof's projection method. Bell Labs Technical Journal, 58(2), 517-538.
Mallows, C. L. (1972). A note on asymptotic joint normality. The Annals of Mathematical Statistics, 508-515.
Macdonell, W. R. (1902). On criminal anthropometry and the identification of criminals. Biometrika, 1(2), 177-227.
McAfee, R. P., & McMillan, J. (1987). Auctions and bidding. Journal of Economic Literature, 25(2), 699-738.
Myerson, R. B. (1981). Optimal auction design. Mathematics of Operations Research, 6(1), 58-73.
Myerson, R. B., & Satterthwaite, M. A. (1983). Efficient mechanisms for bilateral trading. Journal of Economic Theory, 29(2), 265-281.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3), 1065-1076.
Seregin, A., & Wellner, J. A. (2010). Nonparametric estimation of multivariate convex-transformed densities. Annals of Statistics, 38(6), 3751.
Scott, D. W. (1992). Multivariate Density Estimation. Wiley.
Sheather, S. J., & Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. CRC Press.
Sinkhorn, R. (1967). Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4), 402-405.
Smirnov, N. (1948). Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2), 279-281.
Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., & Guibas, L. (2015). Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4), 66.
Student. (1908). The probable error of a mean. Biometrika.
van der Vaart, A. W. (1998). Asymptotic Statistics (Vol. 3). Cambridge University Press.
Villani, C. (2003). Topics in Optimal Transportation (No. 58). American Mathematical Society.
Wellner, J. A., & Laha, N. (2017). Bi-s-concave distributions. arXiv preprint.