Fast Multivariate Log-Concave Density Estimation
Fabian Rathke∗, Christoph Schnörr
Image & Pattern Analysis Group (IPA), Heidelberg University, Im Neuenheimer Feld 205, 69120 Heidelberg, Germany
∗ Corresponding author (Mail: [email protected]/[email protected], Tel.: +49 6221 5414858)
Abstract
A novel computational approach to log-concave density estimation is proposed. Previous approaches utilize the piecewise-affine parametrization of the density induced by the given sample set. The number of parameters as well as non-smooth, subgradient-based convex optimization for determining the maximum likelihood density estimate cause long runtimes for dimensions d ≥ 2. Our approach combines a smooth approximation of the objective function with a sparse, adaptive piecewise-affine density parametrization. Established memory-efficient numerical optimization techniques enable us to process larger data sets for dimensions d ≥ 2. While there is no guarantee that the algorithm returns the maximum likelihood estimate for every problem instance, we provide comprehensive numerical evidence that it does yield near-optimal results after significantly shorter runtimes. For example, 10000 samples in ℝ² are processed in two seconds, rather than in ≈ 14 hours required by the previous approach to terminate. For higher dimensions, density estimation becomes tractable as well: processing 10000 samples in ℝ⁶ requires 35 minutes. The software is publicly available as the CRAN R package fmlogcondens.

Keywords: log-concavity, maximum likelihood estimation, nonparametric density estimation, adaptive piecewise-affine parametrization
1. Introduction
Log-concave density estimation has been an active area of research. Quoting Chen and Samworth (2013), the "allure is the prospect of obtaining fully automatic nonparametric estimators, with no tuning parameters to choose", as a flexible alternative to parametric models, like the Gaussian, that are often adopted by practitioners in an ad-hoc way. The mathematical analysis as well as the design of algorithms benefit from the convexity properties of the class of log-concave densities. We refer to Samworth (2017) for a recent survey.

The general form of a log-concave density reads

  f(x) = exp(−ϕ(x)),  ϕ ∈ F(ℝ^d),   (1.1)

where F(ℝ^d) denotes the class of convex lower-semicontinuous proper functions ϕ: ℝ^d → (−∞, ∞] such that ∫_{ℝ^d} f = 1. Given i.i.d. samples

  X_n = {x_1, . . . , x_n} ⊂ ℝ^d   (1.2)

of a random vector X ∼ f, with n ≥ d + 1, the task is to determine an estimate

  f̂_n = exp(−ϕ̂_n(x))   (1.3)

of f. This estimate, determined as the maximizer of the log-likelihood, exists and is unique with probability 1 (Cule et al., 2010, Thm. 1).
Moreover, the corresponding convex function ϕ̂_n ∈ F is supported on the convex hull C_n = conv X_n of the given data and is piecewise linear, in the sense of Rockafellar and Wets (2009, Def. 2.47): C_n can be represented as a union of finitely many polyhedral sets

  C_n = ⋃_{i=1}^{N_{n,d}} C_{n,i},   (1.4)

relative to each of which ϕ̂_n admits the affine representation

  ϕ̂_n(x)|_{C_{n,i}} =: ϕ̂_{i,n}(x) = ⟨a_i, x⟩ + b_i,  a_i ∈ ℝ^d, b_i ∈ ℝ,  i = 1, . . . , N_{n,d}.   (1.5)

This is equivalent to the fact that the epigraph of ϕ̂_n, denoted by

  epi ϕ̂_n = {(x, α) ∈ ℝ^d × ℝ : α ≥ ϕ̂_n(x)},   (1.6)

is polyhedral and ϕ̂_n admits the representation (Rockafellar and Wets, 2009, Thm. 2.49)

  ϕ̂_n(x) = max{ϕ̂_{1,n}(x), . . . , ϕ̂_{N_{n,d},n}(x)} for x ∈ C_n,  and  ϕ̂_n(x) = ∞ for x ∉ C_n.   (1.7)

We denote the class of piecewise linear proper convex functions over C_n by

  Φ_n := {ϕ_n ∈ F(ℝ^d) : ϕ_n has the form (1.5) and (1.7)}.   (1.8)

Figure 1.1 displays a function ϕ_n in the planar case d = 2. Given the function values

  y_ϕ = (y_{ϕ,1}, . . . , y_{ϕ,n}) := (ϕ_n(x_1), . . . , ϕ_n(x_n)),   (1.9)

ϕ_n is uniquely determined as the lower convex envelope, that is, the largest convex function majorized at the given sample points x_i by y_{ϕ,i},

  ϕ_n(x_i) ≤ y_{ϕ,i},  i = 1, . . . , n.   (1.10)

Due to Cule et al. (2010, Thm. 2), a natural and admissible variational approach for determining the maximum likelihood estimate f̂_n in terms of ϕ̂_n and ŷ_ϕ, respectively, is given by

  ŷ_ϕ = arg min_{y_ϕ} J(y_ϕ),  J(y_ϕ) = (1/n) ∑_{i=1}^n y_{ϕ,i} + ∫_{C_n} exp(−ϕ_n(x)) dx,   (1.11)

where the latter integral acts like a Lagrangian multiplier term enforcing the constraint ∫_{C_n} f_n = 1 (Silverman, 1982, Thm. 3.1). In fact, it was shown that solving problem (1.11) amounts to effectively minimizing over Φ_n (1.8) to obtain ϕ̂_n and in turn the ML-estimate (1.3).

An algorithm for computing ŷ_ϕ was worked out by Cule et al. (2010) based on the convexity of J. While this algorithm is guaranteed to return the global optimum, its runtime complexity suffers from two facts:

(i) The objective function J is convex but non-smooth due to the polyhedral class of functions (1.8) in which the algorithm searches for ϕ̂_n. As a consequence, the iterative scheme is based on subgradients, which are known to converge rather slowly.

(ii) The integral in (1.11) has to be evaluated in every iterative step for each subset C_{n,i} of the polyhedral decomposition (1.4), where the subsets C_{n,i} are required to be simplices. While this can be conveniently done in closed form (Cule et al., 2010, App. B), it is the increasing number of these subsets for larger dimensions d and sample sizes n that slows down the computation.

Figure 1.1: A piecewise affine concave function −ϕ_n(x) (1.5) whose parameters determine the density estimate (1.3). The function values −ϕ(x_1), . . . , −ϕ(x_n) at given data points X_n = {x_1, . . . , x_n} ⊂ ℝ^d induce a polyhedral decomposition C_n of the convex hull conv(X_n), here shown for the case of bivariate data. Determining the parameters through iterative numerical optimization may change this decomposition depending on which data point defines a vertex of the hypograph (green point) or not (yellow and blue points). This increases the complexity considerably, in particular for larger n and dimension d. In this paper, we work with a smooth approximation that enables more efficient log-concave density estimation.
The number N_{n,d} of components of the decomposition (1.4) is known to depend linearly on n, N_{n,d} = O(n), for n points uniformly distributed in ℝ^d (Dwyer, 1991), whereas the worst case bound for 'pathologically' distributed n points is N_{n,d} = O(n^⌈d/2⌉), i.e. it grows exponentially with the dimension d (McMullen, 1970). For n points sampled from log-concave distributions, which are unimodal and in this sense simply shaped, it is plausible to assume that the lower complexity bound holds approximately, i.e. a linear dependency N_{n,d} = O(n). This means, in particular, that the number of parameters of the affine functions forming ϕ̂_n due to (1.7) depends linearly on n as well. While these bounds take into account the entire data set X_n, it was shown for d = 1 that under sufficient smoothness and other conditions, not all x_i need to participate in the decomposition C_n (Dümbgen and Rufibach, 2009). No proofs exist for d > 1, but results presented in this paper indicate that this property of the ML estimator carries over to the multivariate case.
Therefore the actual dependency of N_{n,d} on n may be lower than O(n).

On the other hand, concerning the ultimate objective of accurately estimating a multivariate log-concave density f, it was recently shown by Diakonikolas et al. (2017) that in order to achieve an estimation error ε in total variation distance with high probability, a function f̂_n suffices that is defined by O((1/ε)^((d+1)/2)) hyperplanes. In the univariate case d = 1, an algorithm that matches this complexity bound was published recently (Acharya et al., 2017). In the multivariate case d > 1, on the other hand, the design of a computationally efficient algorithm was considered a "challenging and important open question" by Diakonikolas et al. (2017).
Quite recently, two approaches were published (Axelrod and Valiant, 2018; Diakonikolas et al., 2018) which solve the log-concave MLE (1.11) with high probability up to an estimation error ε in time polynomial in n, d and 1/ε. Both approaches are stochastic and rely on the work of Lovász and Vempala (2007) to sample from ϕ_n over the convex body C_n. Regarding the computational efficiency of the latter approach, Lovász and Deák (2012) noted that they "could not experiment with other convex bodies than cubes, because the oracle describing the convex bodies took too long to run". Since neither Axelrod and Valiant (2018) nor Diakonikolas et al. (2018) provide an implementation of their novel approaches, a fair and competitive evaluation has to be left for future work.

Figure 1.2 [panels: (a) n-gon with n = 25; (b) Cule et al. (2010), N_{n,d} = 23; (c) our estimate, N_{n,d} = 1]:
Example due to D. Schuhmacher from the discussion in the appendix of Cule et al. (2010) regarding computational efficiency: (a) n = 25 data points X_n form an n-gon. Due to the symmetry of X_n, the density estimate f̂_n is the uniform density (not explicitly shown here; both the approach of Cule et al. (2010) and our approach return this estimate, as the equal log-likelihood values l(θ) (3.1) demonstrate). It is clear that this uniform density can be represented by a single hyperplane, N_{n,d} = 1, which our approach correctly finds (panel (c)). In contrast, the approach of Cule et al. (b) relies on a triangulation of C_n, which leads to a more involved density parameter estimation problem: N_{n,d} = 23 affine function parameters. This gap of complexity increases considerably with larger numbers n of data points and dimension d of the data space.

The preceding discussion motivated us to address the two shortcomings (i), (ii) raised above as follows.

(1) We consider the representation (1.8) of ϕ̂_n and adopt a smooth approximation of the non-smooth max-operation. While the resulting modification of (1.11) is no longer convex, numerical methods can be applied that are orders of magnitude more efficient than the subgradient-based iterative schemes of Cule et al. (2010). Furthermore, we exploit the fact that the smoothness of the approximation can be controlled by a single parameter γ: while we utilize strong smoothing to obtain an initial parameter vector, the subsequent optimization is carried out with minimal smoothing.

(2) Rather than optimizing all parameters of (1.7), we apply a threshold criterion in order to drop 'inactive' hyperplanes, since the optimal estimate ϕ̂_n can be expected to be defined by a small subset of them, as discussed above. This measure speeds up the computations, too, without essentially compromising the accuracy of the resulting density estimator f̂_n. Moreover, unlike the approach of Cule et al. (2010), we do not restrict the polyhedral subsets C_{n,i} to simplices. Figure 1.2 shows a somewhat extreme academic example in order to illustrate these points.

Due to the non-convexity of our objective function, we cannot guarantee that our approach determines the maximum-likelihood density estimate for every problem instance, as does the approach of Cule et al. (2010). The comprehensive series of numerical experiments that we report below indicates, however, that this was the case in practice. In particular, log-concave density estimation for large sample sets and for higher dimensions becomes computationally tractable.

Our paper is organized as follows. We present our approach in Section 2 and discuss details of the algorithm and its implementation. In Section 3, we report extensive numerical results up to dimension d = 6 using sample sizes in the range n ∈ [10², 10⁴]. In the univariate case d = 1, our method is on par with the active set approach of Dümbgen et al. (2007) regarding both runtime and accuracy. This method is not applicable to higher dimensions, however. In such cases, d ∈ {2, . . . , 6}, our method is as accurate as the algorithm of Cule et al. (2010) but orders of magnitude more efficient. For example, for d = 2 and n = 10000 samples, the algorithm of Cule et al. takes about 14 hours, whereas our approach terminates after about two seconds. For d = 6 and n = 1000 samples, the algorithm of Cule et al. takes about 10 hours, whereas our algorithm terminates after 5 minutes.

An implementation of our approach is publicly available as the R package fmlogcondens (Rathke and Schnörr, 2018) on CRAN.
2. Approach
We define in Section 2.1 the objective function as a smooth approximation of the negative log-likelihood. The subsequent sections discuss how the parameter values of the log-concave density estimate are determined by numerical optimization. The overall structure of the approach is summarized as Algorithm 1.
2.1. Objective Function

We rewrite the negative log-likelihood functional (1.11) in the form

  L(θ) := (1/n) ∑_{i=1}^n ϕ_n(x_i) + ∫_{C_n} exp(−ϕ_n(x)) dx,   (2.1a)
  θ := {(a_1, b_1), . . . , (a_{N_{n,d}}, b_{N_{n,d}})},   (2.1b)

where ϕ_n and all ϕ_{i,n} have the form (1.7) and (1.5), respectively, and θ collects all parameters that determine ϕ_n. We define the log-concave density estimate (1.3) in terms of the function

  ϕ̂_n = ϕ_n|_{θ=θ̂},  where θ̂ locally minimizes L(θ).   (2.2)

Our next step is to smoothly approximate the representation (1.7) of ϕ_n. Using the convex log-exponential function

  logexp: ℝ^d → ℝ,  x ↦ logexp(x) := log(∑_{i=1}^d e^{x_i}),   (2.3)

we introduce a smoothing parameter γ > 0 and define

  logexp_γ: ℝ^d → ℝ,  x ↦ logexp_γ(x) := γ logexp(x/γ) = γ log(∑_{i=1}^d exp(x_i/γ)),   (2.4)

which uniformly approximates the non-smooth max-operation (Rockafellar and Wets, 2009, Example 1.30) in the following sense:

  logexp_γ(x) − γ log d ≤ max{x_1, . . . , x_d} ≤ logexp_γ(x),  ∀x ∈ ℝ^d.   (2.5)

Utilizing this function, we define in view of (1.7) the smooth approximation

  ϕ_{n,γ}(x) := logexp_γ(ϕ_{1,n}(x), . . . , ϕ_{N_{n,d},n}(x)) for x ∈ C_n,  and  ϕ_{n,γ}(x) := ∞ for x ∉ C_n,   (2.6)

and in turn the smooth approximation of the objective function (2.1),

  L_γ(θ) := (1/n) ∑_{i=1}^n ϕ_{n,γ}(x_i) + ∫_{C_n} exp(−ϕ_{n,γ}(x)) dx.   (2.7)

We point out that, by virtue of (2.5), we have for all x ∈ C_n

  0 ≤ ϕ_{n,γ}(x) − ϕ_n(x) ≤ γ log N_{n,d} → 0 for γ → 0,   (2.8)

and consequently

  L_γ(θ) → L(θ) for γ → 0.   (2.9)
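Since the scaled log-sum-exp (2.4) is the only non-trivial ingredient of the smoothed parametrization (2.6), a short numerical sketch may be helpful. The following Python/NumPy fragment (an illustration, not the fmlogcondens implementation) evaluates logexp_γ with the usual max-shift — the same device used later in (A.3) — and checks the sandwich bound (2.5); the test vector is an arbitrary assumption.

```python
import numpy as np

def logexp_gamma(v, gamma):
    """Smooth maximum (2.4): gamma * log(sum_i exp(v_i / gamma)),
    evaluated with a max-shift to avoid overflow (cf. (A.3))."""
    v = np.asarray(v, dtype=float) / gamma
    m = v.max()
    return gamma * (m + np.log(np.exp(v - m).sum()))

# Sandwich bound (2.5): max(v) <= logexp_gamma(v) <= max(v) + gamma*log(d)
v = np.array([-1.3, 0.2, 0.17])
for gamma in (1.0, 1e-1, 1e-3):
    approx = logexp_gamma(v, gamma)
    assert v.max() - 1e-12 <= approx <= v.max() + gamma * np.log(v.size) + 1e-12
    print(gamma, approx - v.max())   # gap shrinks proportionally to gamma
```

For small γ the gap to the true maximum shrinks like γ log d, which is why the final optimization can be run with minimal smoothing.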
2.2. Numerical Optimization

We apply an established, memory-efficient quasi-Newton method, known as L-BFGS in the literature (Nocedal and Wright, 2006), to compute a sequence

  (θ^(k))_{k≥0}   (2.10)

of parameter values that converges to a local minimum θ̂ of the objective function (2.7). A key aspect of this iterative procedure is to maintain at each step k an approximation H^(k) of the inverse Hessian (∇²L_γ(θ^(k)))⁻¹ of the objective function (2.7), in terms of a few gradients ∇L_γ(θ^(k')) evaluated and stored at preceding iterative steps k' < k. This avoids handling directly the Hessian of size (dim θ)² = ((d+1) N_{n,d})² and hence enables us to cope with much larger problem sizes.

The basic update steps with search direction p^(k) and step size λ_k read

  θ^(k+1) = θ^(k) + λ_k p^(k),  p^(k) = −H^(k) ∇L_γ(θ^(k)).   (2.11)

The step size λ_k is determined by backtracking line search. More specifically, we select the largest λ_k in the set {1, p, p², . . . }, with a fixed backtracking factor 0 < p < 1, such that the condition

  L_γ(θ^(k) + λ_k p^(k)) − L_γ(θ^(k)) ≤ σ λ_k (p^(k))ᵀ ∇L_γ(θ^(k))   (2.12)

holds. We chose σ = 10⁻², meaning we accept a decrease in L_γ of 1% of the prediction based on the linear extrapolation.

Now, instead of computing H^(k) anew in every iteration, it is merely updated to account for the curvature measured in the most recent step. Given a new iterate θ^(k+1), the update for H^(k) in the BFGS approach is (Nocedal and Wright, 2006, Chap. 6.1)

  H^(k+1) = (V^(k))ᵀ H^(k) V^(k) + ρ^(k) s^(k) (s^(k))ᵀ,   (2.13)

where

  ρ^(k) = 1 / ((y^(k))ᵀ s^(k)),  V^(k) = I − ρ^(k) y^(k) (s^(k))ᵀ,
  s^(k) = θ^(k+1) − θ^(k),  y^(k) = ∇L_γ(θ^(k+1)) − ∇L_γ(θ^(k)).

This update has the property that if H^(k) is positive definite and the curvature condition

  (y^(k))ᵀ s^(k) > 0   (2.14)

holds, then H^(k+1) is also positive definite, which in turn guarantees that the step p^(k+1) (2.11) is a descent direction. While (2.14) automatically holds in the convex case, this property has to be enforced explicitly for non-convex objective functions. Li and Fukushima (2001) therefore proposed the following modification of y^(k):

  ỹ^(k) = y^(k) + t_k s^(k),  t_k = ‖∇L_γ(θ^(k))‖ + max{ −(y^(k))ᵀ s^(k) / ‖s^(k)‖², 0 },   (2.15)

which fulfills (2.14) since (ỹ^(k))ᵀ s^(k) ≥ ‖∇L_γ(θ^(k))‖ ‖s^(k)‖² > 0. Thus using ỹ^(k) in (2.13) guarantees the positive definiteness of H^(k+1). See Li and Fukushima (2001, Thm. 5.1) for a proof of convergence.
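The damping (2.15) is straightforward to implement. The following sketch (illustrative Python/NumPy, with a toy vector pair as an assumption) computes the modified difference vector and checks that the curvature condition (2.14) is restored even when the raw curvature is negative.

```python
import numpy as np

def damped_difference(y, s, grad_norm):
    """Modification (2.15) of y = grad_{k+1} - grad_k that enforces the
    curvature condition (2.14) for non-convex objectives (Li & Fukushima, 2001)."""
    t = grad_norm + max(-(y @ s) / (s @ s), 0.0)
    return y + t * s

# toy example: the raw pair violates (2.14), the modified pair satisfies it
s = np.array([1.0, -0.5])
y = np.array([-2.0, 1.5])        # y.T s = -2.75 < 0
y_t = damped_difference(y, s, grad_norm=0.3)   # 0.3 plays the role of ||grad L_gamma||
assert y_t @ s > 0               # curvature condition (2.14) now holds
```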
Storing H^(k) in memory quickly becomes prohibitive with growing dim(θ). This is addressed by limited-memory BFGS (L-BFGS), which only stores the m most recent vector pairs (y^(k), s^(k)) and represents H^(k) implicitly. At every iteration, p^(k) (2.11) is calculated directly by recursively applying formula (2.13); see Nocedal and Wright (2006, Ch. 7.2). As a result, this approximation of H^(k) only requires the curvature information of the last m steps. We set m = 40 to obtain a reasonably accurate approximation.

Figure 2.1: (a–b) Two integration schemes for the dataset from Figure 1.1 illustrate difficulties that arise for convex integration areas: integration schemes (a, Clenshaw-Curtis quadrature) designed for cubic integration areas lose accuracy when truncated outside the convex integration domain (gray dots); schemes (b) suited for simplicial integration domains degrade when the simplices are not well aligned with the polyhedral subdivision of C_n induced by ϕ_n(x) (1.5). Higher dimensions d aggravate these effects. (c–d) Integration errors for d = 2 and d = 4 with n = 1000: simple Riemann sums using the midpoint rule and uniform weights performed best in our experiments, as they do not assume any specific integration area; simplex-based schemes only worked well for small n and d.

2.3. Numerical Integration

The numerical optimization steps of Section 2.2 require the accurate integration of smooth functions over C_n, due to (2.7). We examined various numerical integration schemes. Sparse grid approaches (Bungartz and Griebel, 2004) utilize truncated tensor products of one-dimensional quadrature rules and scale well with the dimension d. But they are not practical for integrating over non-quadrilateral domains (Bungartz and Griebel, 2004, Sec. 5.3), an observation confirmed by our experiments. Another family of approaches are Monte-Carlo methods based on random walks (Lovász and Vempala, 2006; Rudolf, 2013), which specifically address the problem of integrating a log-concave density f over a convex body. Nevertheless, experimental results (Lovász and Deák, 2012) raised doubts about their efficiency, and we did not further pursue this type of approach.

In addition, we examined various dense integration schemes for the hypercube (simple Newton-Cotes schemes and Clenshaw-Curtis quadrature) as well as schemes tailored to simplicial integration domains, e.g. Grundmann and Möller (1978). Again, regarding the quadrature rules designed for the hypercube, the need to truncate them outside convex subsets (illustrated in Figure 2.1 (a)) had a negative impact on integration accuracy. On the other hand, the simplex-based schemes only worked well if the randomly chosen simplicial decomposition of C_n used for the integration resembled the decomposition (1.4) of ϕ̂_n. This was only the case for small d and n, however. Figure 2.1 (b) illustrates a typical case of misalignment.

Overall, simple Riemann sums with uniform integration weights performed best in our experiments, because the influence of the shape of the integration domain is minor (Figure 2.1 (c–d)). For future reference, we denote the integration grid by

  Z_m = {z_1, . . . , z_m} ⊂ ℝ^d,   (2.16)

and the uniform integration weight by Δ.
Accordingly, the numerical approximation of the integral in the objective (2.7) reads

  ∫_{C_n} exp(−ϕ_{n,γ}(x)) dx ≈ Δ ∑_{i=1}^m exp(−ϕ_{n,γ}(z_i)).   (2.17)

While naively evaluating (2.17) for the combination of all hyperplanes and grid points would quickly become intractable, we point out that the impact of each hyperplane (a_i, b_i) is close to zero for most grid points. This fact, combined with further plausible measures, renders the integration task tractable even for larger dimensions. See Appendix A for details and discussion.
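A minimal sketch of the Riemann-sum approximation (2.17), assuming the grid points inside C_n and the uniform weight Δ (the grid-cell volume) are already given; A and b collect the hyperplane normals a_i and intercepts b_i (Python/NumPy, illustrative only).

```python
import numpy as np

def smooth_phi(Z, A, b, gamma):
    """phi_{n,gamma}(z) = gamma * logsumexp((A z + b)/gamma) for each row z of Z, cf. (2.6)."""
    V = (Z @ A.T + b) / gamma                 # (m, N) scaled hyperplane values
    M = V.max(axis=1, keepdims=True)
    return gamma * (M[:, 0] + np.log(np.exp(V - M).sum(axis=1)))

def integral_riemann(Z, delta, A, b, gamma):
    """Midpoint-rule approximation (2.17) of the normalising integral over C_n."""
    return delta * np.exp(-smooth_phi(Z, A, b, gamma)).sum()
```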
Figure 2.2: Effect of the grid density (number of grid points of Z_m) on (a) the quality of the solution, measured as l(θ*) − l(θ̂), (b) the runtime and (c) the complexity of the solution, for a sample of n = 5000 points in ℝ⁴. The log-likelihood converges rapidly, along with the number of hyperplanes, while the runtime increases linearly with the number of grid points. The red square marks the grid density used in our implementation, a trade-off between accuracy and runtime.

Regarding the density of the integration grid, we traded off accuracy against computational complexity. Figure 2.2 (a) illustrates, for a sample of 5000 points in ℝ⁴, how the accuracy improves with increasing grid density and finally converges to a solution close to the optimum. As expected, the runtime grows linearly with the number of grid points (Figure 2.2 (b)). In general, we found the ratio of grid points to the number of hyperplanes (Figure 2.2 (c)) to be a good performance indicator. Keeping this ratio above 3 yielded good results, and we set a minimal number of grid points based on the expected number of hyperplanes for each dimension d (except for 6-D, where we chose a lower ratio for performance reasons). The red square in Figure 2.2 indicates our choice for 4-D.

2.4. Initialization

Initialization of θ plays a crucial role due to the non-convexity of L_γ(θ). We examined two different approaches. The first approach is based on a kernel density estimate f_kernel(x) as in Cule et al. (2010), using a multivariate normal kernel with a diagonal bandwidth matrix M with entries

  M_jj = σ_j n^(−1/(d+4)),  j = 1, . . . , d,

where σ_j is the standard deviation of X_n in dimension j. Setting y_i = log f_kernel(x_i) for i = 1, . . . , n, we compute a simplicial decomposition of C_n induced by the upper convex hull of (X, y), using the popular quickhull algorithm (Barber et al., 1996). The simplicial decomposition combined with y then yields an initial set of hyperplane parameters θ^(0), one for each simplex C_{n,i}.

As for the second approach, we randomly initialize a small number of hyperplanes and optimize L_γ(θ) with γ = 1. The rationale behind this is that, since γ governs the degree of non-convexity and smoothness of L_γ(θ), its optimization is less involved than for smaller γ. Having found the optimal log-concave density for γ = 1, we evaluate y_i for all x_i and proceed as described above (first approach) to obtain θ^(0). Regarding the specific choice of γ, experiments showed that initializations with γ = 1 yielded superior results compared to other initial values of γ, thus offering the "best" trade-off between smoothness of the objective and initial accuracy of the max approximation.

Except for small datasets, in general the second initialization performs better. In practice, we calculate both and select the one with smaller L_γ(θ^(0)).
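As a sketch of the first initialization, the kernel values log f_kernel(x_i) can be computed as below (Python/NumPy, illustrative only; the subsequent upper-convex-hull step via qhull is omitted, and treating M_jj as a per-dimension standard deviation is an assumption about the bandwidth convention).

```python
import numpy as np

def kernel_log_density(X):
    """log f_kernel(x_i) at the sample points, Gaussian kernel with diagonal
    bandwidth M_jj = sigma_j * n**(-1/(d+4)) (Section 2.4), treated as std. deviations."""
    n, d = X.shape
    h = X.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))            # per-dimension bandwidths
    diff = (X[:, None, :] - X[None, :, :]) / h                    # (n, n, d)
    log_k = -0.5 * (diff ** 2).sum(-1) - 0.5 * d * np.log(2 * np.pi) - np.log(h).sum()
    m = log_k.max(axis=1, keepdims=True)                          # stable log of the kernel average
    return m[:, 0] + np.log(np.exp(log_k - m).mean(axis=1))
```

These values y_i, together with the quickhull triangulation of (X, y), yield one initial hyperplane per simplex.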
2.5. Dropping Inactive Hyperplanes

Both initializations produce a very large set of hyperplanes based on a simplicial decomposition of C_n, with one hyperplane per simplex. During the optimization, hyperplanes may become inactive. Inactivity of some hyperplane (a_j, b_j), in light of (1.5), means that there exists no x ∈ C_n for which ϕ̂_n(x) = ⟨a_j, x⟩ + b_j. In terms of our smooth approximation ϕ_{n,γ}(x) (2.6), every hyperplane contributes, due to exp(x) > 0 for all x ∈ ℝ, albeit the contribution may be very close to 0. We therefore resort to the following definition of inactivity using our integration grid:

  ∑_{i=1}^m exp(γ⁻¹(⟨a_j, z_i⟩ + b_j)) / ∑_{k=1}^{N_{n,d}} exp(γ⁻¹(⟨a_k, z_i⟩ + b_k)) ≤ ϑ.   (2.18)

After each update of θ^(k) we remove all hyperplanes that satisfy (2.18), i.e. whose total contribution amounts to less than ϑ grid points. We chose this criterion after observing that hyperplanes usually remained inactive once they lost support on the integration grid, due to their very small contribution to the objective function gradient.

Figure 2.3 visualizes several intermediate steps of the optimization process for d = 1, together with the shrinking set of hyperplanes. Plots (d)–(f) show the effectiveness of this scheme, since we arrive at the same parametrization as the approach of Dümbgen et al. (2007), which – provided d = 1 – finds the minimal representation of ϕ̂_n(x) in ℝ.

Figure 2.3: (a)–(d) Several steps of the optimization process for a sample of 100 points in ℝ drawn from a gamma distribution Gamma(α = 5, β = 1); hyperplanes to be dropped in the next step are drawn in red. (a) The initial density θ^(0) is represented by N_{n,d} = 99 hyperplanes, one for each simplex of the simplicial decomposition of C_n, based on the smooth max-approximation ϕ_{n,γ=1}(x) (see Section 2.4). Plots (b) θ^(5), N_{n,d} = 45, and (c) θ^(9), N_{n,d} = 26, show the transition towards the final solution (d) θ^(final) composed of N_{n,d} = 5 hyperplanes. For comparison, (e) shows the optimal solution with a minimal representation, N_{n,d} = 5, by Dümbgen et al. (2007) (only available in 1-D), which is identical to the solution returned by our approach, and (f) the non-sparse optimal solution of Cule et al. (2010) with N_{n,d} = 20.
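The pruning criterion (2.18) amounts to accumulating, per hyperplane, its soft-max weights over the integration grid. A hedged Python/NumPy sketch (variable names are illustrative assumptions):

```python
import numpy as np

def active_hyperplanes(Z, A, b, gamma, thresh):
    """Criterion (2.18): keep hyperplane j only if its accumulated soft-max weight
    over all grid points exceeds the threshold."""
    V = (Z @ A.T + b) / gamma                 # (m, N) scaled hyperplane values
    V -= V.max(axis=1, keepdims=True)         # stabilise the soft-max
    W = np.exp(V)
    W /= W.sum(axis=1, keepdims=True)         # weights w_{ij}, rows sum to one
    return W.sum(axis=0) > thresh             # boolean mask of hyperplanes to keep

# after each L-BFGS update: mask = active_hyperplanes(Z, A, b, gamma, thresh); A, b = A[mask], b[mask]
```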
2.6. Termination

To determine convergence, at each iteration we check whether the following inequalities are satisfied:

  |1 − Δ ∑_{i=1}^m exp(−ϕ̂_{n,γ}(z_i))| ≤ ε,   (2.19a)
  |L_γ(θ^(k+1)) − L_γ(θ^(k))| ≤ δ,   (2.19b)

where ε and δ are small fixed tolerances. The first criterion asserts that the current density estimate f̂_n = exp(−ϕ̂_{n,γ}) satisfies ∫ f̂_n dx ≥ 1 − ε. The second condition detects when the decrease of the objective function becomes negligible. We denote the final parameter vector by θ^(final).

2.7. Normalization

As a final step, after convergence of the optimization algorithm, we normalize the estimated density using exact integration and the non-smoothed representation (1.7) of ϕ̂_n, which may be seen as setting γ = 0 in (2.6). Setting y_i = ϕ̂_{n,γ}(x_i) for all x_i ∈ X_n, we again use qhull (Barber et al., 1996) to obtain a triangulation of C_n and calculate a hyperplane ϕ̂_{i,n} for every simplex C_{n,i}. We then split the integral over C_n into separate integrals over the simplices C_{n,i} and denote the result by λ:

  ∫_{C_n} exp(−ϕ̂_n(x)) dx = ∑_{i=1}^{N_{n,d}} ∫_{C_{n,i}} exp(−ϕ̂_{i,n}(x)) dx =: λ.   (2.20)

We make use of Lemma Appendix B.1 to evaluate (2.20) exactly. The value of λ is close to 1 but not equal to 1. We therefore add the same offset parameter δ to every hyperplane ϕ̂_{i,n}, to obtain

  ϕ̃_{i,n}(x) := ϕ̂_{i,n}(x) + δ = ⟨x, a_i⟩ + b_i + δ,  i = 1, . . . , N_{n,d}.   (2.21)

Inserting ϕ̃_{i,n} into (B.1) shows that the integral for the modified hyperplanes (2.21) changes to exp(−δ)λ. Therefore, after setting δ = log(λ), the final density estimate is f̂_n = exp(−ϕ̃_n) with ϕ̃_n|_{C_{n,i}} = ϕ̃_{i,n} given by (2.21). We denote the corresponding dense parameter vector by

  θ̂.   (2.22)

The normalization process eliminates the sparsity of the parametrization used for optimization, since it relies on a simplicial decomposition of C_n, as does the approach of Cule et al. (2010). Nevertheless, the results of our approach are shown using the sparse parametrization θ^(final), which is the essential characteristic of our approach causing the significant speed ups reported in Section 3, whereas the reported log-likelihood values are for θ̂.
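The offset correction (2.21) is a one-liner once λ is known; a small sketch, assuming λ has already been computed exactly via the simplex decomposition (2.20):

```python
import numpy as np

def normalize_intercepts(b, lam):
    """Normalisation (2.21): adding delta = log(lam) to every intercept rescales the
    density by exp(-delta), so that it integrates exactly to one."""
    return b + np.log(lam)

# if  integral of exp(-phi_hat) over C_n equals lam, the shifted hyperplanes yield
# exp(-log(lam)) * lam = 1, i.e. a properly normalised density estimate.
```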
3. Experiments
This section provides empirical evidence that our approach finds sparse solutions that are very close to the optimum while requiring substantially less CPU runtime. Section 3.1 contrasts qualitative properties of our approach with the state of the art, using experiments in dimensions d ∈ {1, 2} along with illustrations. Quantitative results up to dimension d = 6 are reported in Section 3.2. Finally, we extend our approach to the estimation of mixtures of log-concave densities in Section 3.3.

For all our experiments we used a small value of γ for the smooth approximation ϕ_{n,γ} (2.6) of the max function. This choice is a compromise between less accurate approximations (that is, larger values of γ) and more accurate but numerically unstable ones. As noted in the previous section, qualitative results show θ^(final), i.e. the sparse parametrization found during the optimization (Sections 2.2–2.6), whereas quantitative results are reported in terms of θ̂, the dense parametrization obtained by performing the final analytical normalization (Section 2.7), and the non-smoothed representation (1.7) of ϕ_n.

Algorithm 1: Fast Log-Concave Density Estimation
  Input: X_n; parameters γ, ϑ, ε, δ (Sections 2.1, 2.5, 2.6)
  Output: log-concave density estimate f̂_n parametrized by θ̂ (2.22)
  Find an initial θ^(0) (Section 2.4);
  for k = 1, 2, . . . do
    Delete inactive hyperplanes from θ^(k) based on criterion (2.18);
    Compute the gradient ∇L_γ(θ^(k)) of the objective (2.7) using numerical integration;
    Find the descent direction p^(k) from the previous m gradient vectors and the step size λ_k (2.11), and update θ^(k+1);
    if the termination criterion (2.19) holds then
      denote the final parameter vector by θ^(final) and quit the for-loop;
    end
  end
  Switch from ϕ̂_{n,γ} to ϕ̂_n and perform the exact normalization θ^(final) → θ̂ (Section 2.7);
  return θ̂ (2.22)

The quality of our solutions is measured in terms of the total log-likelihood

  l(θ̂) = ∑_{i=1}^n −ϕ_n(x_i).   (3.1)

3.1. Qualitative Comparison

In order to illustrate the sparse structure of solutions determined by our approach, we investigated examples in one and two dimensions. We compared the results with Cule et al. (2010), whose approach (implemented in the R package
LogConcDEAD) finds optimal solutions in terms of L(θ), but cannot take advantage of any sparsity of the solution, since its representation is based on the fixed triangulation induced by X_n (recall Figure 1.2). For d = 1, we additionally compared to the approach of Dümbgen et al. (2007) using their R package logcondens, which can only be applied to univariate data, but finds optimal solutions in terms of L(θ) and utilizes the minimal representation of the solution in terms of the number N_{n,d} of hyperplanes. We sampled 500 data points from three different distributions in 1-D (normal, gamma and exponential) and 500 samples from a normal distribution N(0, I) in 2-D.

Figure 3.1 depicts the results of all three approaches for univariate data in terms of −ϕ̂_n(x) = log f̂_n(x). While all solutions are almost identical in terms of the estimated density, their parametrizations differ. The solution of Dümbgen et al. (2007) is guaranteed both to have the optimal sparse structure in terms of the number of hyperplanes and to yield the optimal value L(θ̂). Comparing their solution to ours, we see that they are almost identical, with only those hyperplanes missing that have very small support and negligible impact on L(θ̂) and l(θ̂), respectively. The solution of Cule et al. (2010), on the other hand, is densely parametrized. The runtimes reflect these different parametrizations.

We made similar observations for two-dimensional data, d = 2. Our approach found a density estimate f̂_n that is almost identical to the optimal solution but required only about 10% of the parameters. Panels (a) and (b) of Figure 3.2 depict the density f̂_n(x) estimated by the approach of Cule et al. (2010) and our approach, respectively, whereas panels (c) and (d) show the respective decompositions of ϕ̂_n(x) into its affine representation ϕ̂_{i,n} (1.5). While the solution of Cule et al. (2010) is based on a fixed hyperplane arrangement and simplicial supports C_{n,i} induced by the given data, our decomposition of C_n (1.4) is adaptive and can utilize more general convex polytopes C_{n,i}. Comparing panels (c) and (d) of Figure 3.2 clearly shows the beneficial effect of adaptivity, which is an 'automatic' byproduct of minimizing L(θ) using our approach, together with pruning inactive hyperplanes.
Figure 3.1 [panels: (a) Dümbgen et al. (0.23 s), N_{n,d} = 9; (b) Cule et al. (3.26 s), N_{n,d} = 45; (c) our approach (0.21 s), N_{n,d} = 8; (d) Dümbgen et al. (0.19 s), N_{n,d} = 6; (e) Cule et al. (3.54 s), N_{n,d} = 30; (f) our approach (0.27 s), N_{n,d} = 6; (g) Dümbgen et al. (0.05 s), N_{n,d} = 2; (h) Cule et al. (3.79 s), N_{n,d} = 8; (i) our approach (0.09 s), N_{n,d} = 1]:
Estimates −ϕ̂_n(x) and their piecewise linear representation for 500 samples drawn from (a–c) a normal distribution N(0, 1), (d–f) a gamma distribution with shape parameter 2, and (g–i) an exponential distribution with λ = 1. (a,d,g) The approach of Dümbgen et al. (2007) returns ground truth for univariate data, that is, the maximal log-likelihood l(θ̂) and the correct piecewise linear representation of −ϕ̂_n. (b,e,h) The approach of Cule et al. (2010) also returns the optimal value l(θ̂) but generally works with a redundant piecewise linear representation that significantly increases runtime. (c,f,i) Our approach is as efficient as the former and only misses components of the piecewise linear representation that have very small support and hence a negligible impact on l(θ̂).
Figure 3.2 [panels: (a) f̂_n(x), Cule et al. (14.35 s); (b) f̂_n(x), our approach (0.16 s); (c) decomposition C_n, N_{n,d} = 610; (d) decomposition C_n, N_{n,d} = 66]:
Top row: contour lines displaying estimates f̂_n(x) = exp(−ϕ̂_n(x)) for a sample of size n = 500 drawn from N(0, I). The density plots (a) and (b) as well as the log-likelihood values l(θ̂) demonstrate that both solutions are very close to each other. Bottom row: decomposition of C_n induced by the piecewise-linear concave functions −ϕ̂_n(x). While the approach of Cule et al. induces a triangulation of C_n, our approach works with more general polytopes. Our approach adapts this representation to the given data and thus avoids overfragmented representations as depicted in (c), which significantly increase runtime.

Figure 3.3 [panels: (a) isotropic (degree O); (b) anisotropic, degree I; (c) anisotropic, degree II]:
The three levels of anisotropy used in our experiments: none (a), mild (b) and strong (c).

3.2. Quantitative Evaluation

Besides the introductory experiments used for illustration, we conducted a comprehensive numerical evaluation using more challenging examples. The authors of Cule et al. (2010) reported evaluations of up to n = 2000 data points and dimension d = 4, drawn from N(0, I_d). In order to demonstrate the ability of our approach to process large data sets efficiently, we evaluated sample sets of size up to n = 10000 points and dimension d = 6, drawn from the same distribution.

We extended their benchmarks in terms of sample size and dimension, and additionally introduced anisotropy to the covariance matrices: besides the identity matrix (degree O), we defined two levels of anisotropy (I, II) based on the following metric for symmetric positive definite matrices S_1 and S_2 (Bhatia, 2006):

  d(S_1, S_2) = sqrt( ∑_{i=1}^d (log λ_i(S_1⁻¹ S_2))² ).   (3.2)

Setting S_1 = I_d, we see that the metric reduces to the Euclidean norm of the logarithms of the eigenvalues of S_2. We define our two levels of anisotropy as

  I: d(I_d, S) ≈ 2,  II: d(I_d, S) ≈ 5.   (3.3)

Let S be

  S = D E Dᵀ,  E = diag(e),   (3.4)

with D a random orthogonal matrix. Since the eigenvalues of S are solely determined by the vector e, we fix the last entry of e to 1 and set the remaining entries to be equidistant in log-space, such that d(I_d, S) = ‖log e‖ attains the prescribed level. For dimension d = 2 and n = 500, Figure 3.3 depicts examples for each degree of anisotropy.

For each level of anisotropy, we drew samples of sizes n ∈ {100, 250, 500, 1000, 2500, 5000, 10000} in dimensions d ∈ {2, . . . , 6} and repeated each experiment five times. We ran the R package LogConcDEAD (Version 1.6-1) by Cule et al. (2010) in all dimensions but stopped increasing n when the runtime exceeded 24 hours. Table 3.1 reports the results for all sample sizes, the speed ups as well as the quality of the solution achieved by our approach in comparison to the optimal solution returned by the approach of Cule et al. (2010), measured as the difference of the total log-likelihoods l(θ̂).

Overall we see that our approach delivers very accurate results, with very small differences given the respective number of samples. Regarding the influence of anisotropy on the accuracy, we notice only a slight increase in the difference of l(θ̂). Figure 3.4 (d) provides a different perspective on the quality of the estimated densities by showing the very small squared Hellinger distance between our estimates and those of C for degree II samples.
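The construction (3.2)–(3.4) is reproduced by the following NumPy sketch, which builds a covariance matrix whose Bhatia distance to the identity equals a prescribed target (function name and RNG seed are illustrative assumptions).

```python
import numpy as np

def anisotropic_covariance(d, target, rng=np.random.default_rng(0)):
    """S = D diag(e) D^T with log-eigenvalues equidistant, last eigenvalue fixed to 1,
    scaled so that d(I_d, S) = ||log e||_2 equals the prescribed level, cf. (3.2)-(3.4)."""
    log_e = np.arange(d - 1, -1, -1, dtype=float)        # d-1, ..., 1, 0 (equidistant in log-space)
    log_e *= target / np.linalg.norm(log_e)              # scale to the target metric value
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random orthogonal matrix D
    return Q @ np.diag(np.exp(log_e)) @ Q.T

S = anisotropic_covariance(d=3, target=5.0)
w = np.linalg.eigvalsh(S)
print(np.sqrt((np.log(w) ** 2).sum()))                   # ~5.0 = Bhatia (2006) distance to I_d
```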
Table 3.1 (excerpt): Runtimes for Cule's approach (mark: C) and our approach (mark: X) and the resulting speedups. For d = 1 we additionally report results for the approach of Dümbgen et al. (mark: D). Quality is the difference in total log-likelihood l(θ̂) to the solution of C. For the multidimensional datasets, accuracy is reported separately for all three levels of anisotropy (O, I, II). While accuracy slightly decreases for examples with strong anisotropy (degree II), it is still comparatively small given the number of samples. For some samples we achieved a better log-likelihood value, because the implementation of Cule et al. terminates after a hard-coded number of 15000 iterations, which is no issue with our approach (for 2-D, less than 500 iterations are required). Entries marked '–' correspond to runs aborted after exceeding 24 hours.

  n                    100      250      500      1000     2500        5000         10000
  1-D  Runtime   D     0.08 s
                 X     0.02 s   0.02 s   0.03 s   0.03 s   0.06 s
       Speedup         10x      43x      121x     780x     17695x      65513x       159534x
  4-D  Runtime   X     11 s     14 s     23 s     32 s     1 min       1 min        1 min
                 C     13 s     1 min    5 min    24 min   2 h 57 min  13 h 16 min  –
  5-D  Runtime   X     14 s     24 s     44 s     1 min    2 min       3 min        4 min
  6-D  Runtime   X     46 s     1 min    2 min    6 min    13 min      19 min       35 min
       Speedup         12x      41x      75x      93x      –           –            –

Figure 3.4 [panels: (a) number of hyperplanes versus number of samples, Cule et al.; (b) number of hyperplanes versus number of samples, our approach; (c) squared Hellinger distance versus runtime in seconds, X (solid) and C (dashed) vs. the true density; (d) squared Hellinger distance versus number of samples, X vs. C for degree II examples]:
Top row: complexity of the density f̂_n in terms of the number N_{n,d} of linear functions forming −ϕ̂_n(x) = log f̂_n(x). While N_{n,d} scales linearly with the number of samples n for the approach of Cule et al., our approach shows only sublinear growth that starts to flatten out for larger n. Bottom row: squared Hellinger distance h²(f̂_n, f) between (c) the estimate f̂_n and the underlying density f versus runtime, evaluated for the test suite introduced in Section 3.2 with additional results for n = 25 and n = 50. For a given runtime, our approach (solid lines) obtains estimates closer to the underlying density f while being able to process significantly more data points n. Panel (d) depicts the average squared Hellinger distance between our estimates and those of Cule et al. for degree II samples (cf. Figure 3.3). While these constitute the most difficult samples, the differences are very small and decrease with sample size.

While our approach estimated almost optimal densities, it achieved speed ups of up to a factor of 30000 (even more for d = 1, though Cule et al. (2010) point out that their approach is not designed with the one-dimensional case in mind). These factors increase with dimension and, in particular, with the number of data points.

We empirically observed and estimated how the runtime t depends on the sample size n. Regarding the approach of Cule et al. (2010), we observed a linear dependency of the number of iterations on n, as is also mentioned in Cule et al. (2010). Moreover, we found that the complexity of their density estimate f̂_n, expressed by the number of hyperplanes N_{n,d}, also grew linearly with n (see Figure 3.4 (a)), which is in accordance with our discussion in Section 1. Altogether, runtime grew quadratically with the number of samples: O(n²).

Examining the results for our approach revealed that the number of iterations depends much less on n. For example, for dimension d = 3, the number of iterations for n ∈ {100, 1000, 10000} was about the same, while for d = 4 it doubled from n = 100 to n = 1000 but did not change when increasing n to 10000. This observation relates to the next paragraph below. Concerning the complexity of ϕ_n(x) depending on the sample size n, we found that the number N_{n,d} of hyperplanes grew sublinearly and started to flatten out for large n. Altogether, we found the dependency of t on n to be roughly O(√n) for d < 6; for d = 6 the assessment is less reliable due to the computational constraints discussed next. The above observations partially relate to computational constraints for dimensions d ≥ 5, where the grid size for numerical integration may limit the number N_{n,d} of hyperplanes.
For example, increasing the number of grid points by a factor of 5 left the number of hyperplanes unchanged for n = 5000 and d = 2, increased it slightly by 20% for d = 4, but by about 50% for d = 6. Apparently, this had no impact on the quality of the corresponding density estimates, but we cannot predict N_{n,d} for higher dimensions.

Using the squared Hellinger distance

  h²(f̂_n, f) = 1 − ∫ sqrt( f̂_n(x) f(x) ) dx   (3.5)

to measure the similarity between the estimate f̂_n and the true density f, Figure 3.4 (c) depicts h²(f̂_n, f) for the experiments performed in this section, plotted against the required runtime. Additional results for n = 25 and n = 50 are included. The plot demonstrates that our approach obtains much more accurate density estimates within a given runtime. This enables taking more data points into account than the approach of Cule et al. (2010), due to the adaptive sparse parametrization.
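Since f is known in these experiments, (3.5) can be estimated by Monte-Carlo sampling from f, using E_f[√(f̂(X)/f(X))] = ∫√(f̂ f). A sketch with a univariate toy example (the densities and sample count are assumptions for illustration):

```python
import numpy as np

def squared_hellinger(f_hat, f_true, sample_from_f, n_mc=100_000, rng=np.random.default_rng(1)):
    """Monte-Carlo estimate of (3.5): h^2 = 1 - E_f[ sqrt(f_hat(X)/f(X)) ] with X ~ f."""
    X = sample_from_f(n_mc, rng)
    return 1.0 - np.sqrt(f_hat(X) / f_true(X)).mean()

def npdf(x, mu, sigma):                                  # univariate normal density
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

h2 = squared_hellinger(lambda x: npdf(x, 0.5, 1.0),      # "estimate"
                       lambda x: npdf(x, 0.0, 1.0),      # true density
                       lambda m, rng: rng.normal(0.0, 1.0, m))
print(h2)   # Monte-Carlo estimate; the exact value here is 1 - exp(-1/32) ≈ 0.031
```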
3.3. Mixtures of Log-Concave Densities

We extended our approach to the estimation of mixtures of log-concave densities. Let

  Π = {π_1, . . . , π_K},  ∑_{k=1}^K π_k = 1,   (3.6)

be the mixing coefficients for classes 1 to K and Θ = {θ_1, . . . , θ_K} the class-specific hyperplane parameters. Then the log-concave mixture distribution is

  f_n(x | Θ, Π) = ∑_{k=1}^K π_k f_n(x | θ_k),   (3.7)

where f_n(x | θ_k) is a log-concave density as defined in Section 1, parametrized by θ_k. Given i.i.d. samples X_n = {x_1, . . . , x_n}, maximizing the log-likelihood function

  L(Θ, Π) := ∑_{i=1}^n log ∑_{k=1}^K π_k f_n(x_i | θ_k)   (3.8)

is challenging due to the summation over k inside the logarithm. A common technique to locally maximize L(Θ, Π) is the EM algorithm (Dempster et al., 1977). Here one introduces, as latent parameters, the assignment probabilities

  γ_{i,j} = π_j f_n(x_i | θ_j) / ∑_k π_k f_n(x_i | θ_k),  i = 1, . . . , n,  j = 1, . . . , K,   (3.9)

that point x_i belongs to class j; these are iteratively estimated along with the mixture coefficients and the parameters of the mixture components. More specifically, the EM algorithm iterates the following two steps until convergence. The E-step computes (3.9). The M-step updates the mixing coefficients

  π_k = (1/n) ∑_i γ_{i,k},  k = 1, . . . , K,   (3.10)

and refines the parameters θ_k, k = 1, . . . , K, by minimizing the modified negative log-likelihood function

  L_γ(θ_k) := (1/n_k) ∑_{i=1}^n γ_{i,k} ϕ_{n,γ}(x_i) + ∫_{C_n} exp(−ϕ_{n,γ}(x)) dx,  n_k = ∑_i γ_{i,k},  k = 1, . . . , K.   (3.11)
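A sketch of one EM iteration for (3.7)–(3.11); log_dens and fit_component are hypothetical stand-ins (not part of any package API) for evaluating a fitted log-concave component and for minimizing the weighted objective (3.11), respectively.

```python
import numpy as np

def em_step(log_dens, log_pi, X, fit_component):
    """One EM iteration for the log-concave mixture (3.7)."""
    n, K = X.shape[0], log_pi.size
    # E-step (3.9): responsibilities gamma_{ik}, computed in log-space for stability
    L = np.stack([log_dens(X, k) for k in range(K)], axis=1) + log_pi   # (n, K)
    L -= L.max(axis=1, keepdims=True)
    G = np.exp(L)
    G /= G.sum(axis=1, keepdims=True)
    # M-step: mixing coefficients (3.10) and class-wise density parameters (3.11)
    log_pi_new = np.log(G.mean(axis=0))
    thetas = [fit_component(X, weights=G[:, k]) for k in range(K)]      # hypothetical weighted fit
    return log_pi_new, thetas, G
```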
Table 3.2: Performance parameters for the estimation of log-concave mixture densities using the approach of Cule et al. (2010) (mark: C) and our approach (mark: X). Estimates based on our approach are very close to the solutions of C in terms of the log-likelihood (Quality) as well as the number of misclassified points, in one case even performing significantly better. Moreover, runtime is significantly reduced. Gaussian mixture models (GMMs) are clearly outperformed by log-concave mixtures for both datasets.
                       Wisconsin            USPS
                       2-D      3-D         2-D         3-D
  Runtime        X     21 s     3 min       38 s        20 min
                 C     19 min   1 h 3 min   2 h 21 min  12 h 50 min
  Speedup              54x      20x         222x        38x
  Misclassified  X     47       47          200         78
                 C     45       45          201         109
                 GMM   58       64          239         215

Since the objective function (3.8) is non-convex, good initialization is essential. To obtain initial values for all γ_{i,j}, we follow Cule et al. (2010) and use hierarchical clustering (Fraley et al., 2012). We terminated the EM approach if the difference between the L(Θ, Π) values of three subsequent iterations dropped below a small fixed tolerance. We tested our approach on two datasets:
• The Wisconsin breast cancer dataset, consisting of 569 samples with 30 features each, with 357 benign and 212 malignant instances.
• The well-known USPS dataset, containing 11000 images of dimension 16 × 16 (i.e. 256 features) of handwritten digits from zero to nine. We selected all samples of the two classes 'five' and 'six', 1100 each.

We reduced the dimension of both datasets to d ∈ {2, 3} using PCA. Figure 3.5 (a) depicts the USPS dataset projected onto the first two PCA eigenmodes. One can see the skewness of class 'six', which the Gaussian distribution is not able to capture. We compared our approach to Cule et al. (2010) as well as to the standard Gaussian mixture model (GMM). Performance was measured in terms of the achieved log-likelihood and the number of misclassified samples, where each sample was assigned to the class argmax_k π_k f_n(x | θ_k).

First, we compare Gaussian mixtures and log-concave mixtures. Table 3.2 demonstrates that the log-concave mixture better reflects the structure of both datasets and clearly outperforms GMMs with respect to both performance measures. Naturally, using more information by increasing the dimension of the PCA subspace may lead to better estimates of the class-wise probabilities, as can be seen for the USPS dataset. Comparing the results of the log-concave mixture density estimates of both approaches, we again see very similar results in terms of the log-likelihood as well as the number of misclassified samples, the only exception being the USPS dataset for d = 3. Again our approach achieves significant speedups.

Figure 3.5 (b)–(d) visualizes some mixture distributions for the USPS dataset for d = 2 and d = 3. While the GMM fails to properly model class 'five' as expected, log-concave mixtures succeed, especially for d = 3.
4. Conclusion
This work presented a new computational approach to the estimation of multivariate log-concave densities f̂_n. Its crucial feature is the iterative search for a sparse representation of f̂_n in terms of the number of hyperplanes N_{n,d} comprising the piecewise-linear concave function −ϕ̂_n(x) = log f̂_n(x). In addition, its parametrization involves hyperplanes supported on general polytopes C_{n,i}, whereas the approach of Cule et al. (2010) is restricted to simplices.
Figure 3.5: (a) The USPS dataset (classes 'five' and 'six') projected onto the first two eigenmodes. (b) GMM estimate for d = 2, 239 points misclassified, and (c) log-concave mixture estimate for d = 2, 200 points misclassified; misclassified points are drawn in red. (d) Log-concave mixture estimate for d = 3, 78 points misclassified, with the third PCA dimension marginalized out by numerical integration for visualization.
By smoothing the original non-smooth objective, efficient established numerical methods can be applied. Overall, this led to significant speed ups compared to the state-of-the-art approach, in particular for larger data sets and increasing dimension. Although our approach minimizes a non-convex objective, we showed that this does not compromise the quality of the solution. On the contrary, almost optimal results are returned after runtimes that are shorter by factors of up to 30000.

Empirical evidence suggests the following dependency of the runtime t on the sample size n: we observed that t grows like O(n²) for the approach of Cule et al. (2010), resulting from a linear growth of both N_{n,d} and the number of iterations with n. The observed linear dependency of N_{n,d} on n supports our reasoning in Section 1 that the lower bound O(n) for N_{n,d} (cf. Dwyer (1991)) also applies to data sampled from log-concave distributions. Regarding our approach, we observed an O(√n) dependency of t, due to a sublinear growth of both N_{n,d} and the number of iterations with n.
We pointed out that, at least for d ≤ 4, there is strong empirical evidence that the sparse parametrization of ϕ̂_n(x) reflects structural properties of the maximum-likelihood estimator of log-concave densities, which the approach of Cule et al. (2010) does not exploit: our approach successfully identifies maximal polytopes on which log f̂_n is linear. Since, to our knowledge, no theoretical results are available regarding the sparse parametrization of ϕ̂_n(x) for the case d ≥ 2, our empirical results may stimulate corresponding research.
We published our code as the R package fmlogcondens in the CRAN repository (Rathke and Schnörr, 2018). A Matlab implementation is also available at https://github.com/FabianRathke/FastLogConvDens .

Acknowledgements
This work has been supported by the German Research Foundation (DFG), grant GRK 1653, as part of the research training group on "Probabilistic Graphical Models and Applications in Image Analysis", http://graphmod.iwr.uni-heidelberg.de .

Appendix A. Calculating ∇L_γ(θ^(k))

A large amount of computation time is spent on computing the gradient ∇L_γ(θ^(k)) of the objective (2.7). The gradient component with respect to a hyperplane normal a_j reads

  ∇_{a_j} L_γ(θ) = (1/n) ∑_{i=1}^n w_{ij} x_i − Δ ∑_{l=1}^m w_{lj} z_l exp(−ϕ_{n,γ}(z_l)),
  w_{ij} := exp(γ⁻¹(⟨a_j, x_i⟩ + b_j)) / ∑_{k=1}^{N_{n,d}} exp(γ⁻¹(⟨a_k, x_i⟩ + b_k)).   (A.1)

Gradient components for the intercepts b_j are similar.
Since γ⁻¹(⟨a_j, x_i⟩ + b_j) → ∞ for γ → 0, a robust evaluation of the terms w_{ij} that prevents numerical overflow of the exp-function is given by

  w_{ij} = exp(ν_j) / ∑_{k=1}^{N_{n,d}} exp(ν_k) = exp(ν_j − max ν) / ∑_{k=1}^{N_{n,d}} exp(ν_k − max ν),
  ν_k(x_i) := γ⁻¹(⟨a_k, x_i⟩ + b_k),  ν = (ν_1, . . . , ν_{N_{n,d}}).   (A.2)

Similarly, the smooth max-approximation ϕ_{n,γ}(x) (2.6) is numerically evaluated as

  ϕ_{n,γ}(x) = γ log ∑_{k=1}^{N_{n,d}} exp(ν_k) = γ max ν + γ log ∑_{k=1}^{N_{n,d}} exp(ν_k − max ν).   (A.3)

Calculating ∇L_γ(θ^(k)) for all combinations of hyperplanes and grid points is by far the most expensive step in our approach. The problem is inherently sparse, however, since for most grid and data points only a few hyperplanes are relevant and most terms w_{ij} are negligibly small. We exploit this property in several ways. Computing the exponential function on a CPU is relatively expensive (about 50 times more than addition/multiplication (Ueberhuber, 2012)). We therefore set exp(ν_j − max ν) = 0 whenever ν_j − max ν ≤ −25.
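A vectorized sketch of (A.1)–(A.3) (Python/NumPy, illustrative only, not the optimized C implementation of the package): the soft-max weights are computed with the max-shift of (A.2), and the gradient with respect to all normals is assembled in one pass over the data points X and the grid points Z.

```python
import numpy as np

def softmax_weights(P, A, b, gamma):
    """w_{ij} from (A.2): soft-max over hyperplanes, evaluated at the points in P."""
    V = (P @ A.T + b) / gamma
    V -= V.max(axis=1, keepdims=True)
    W = np.exp(V)
    return W / W.sum(axis=1, keepdims=True)

def grad_normals(X, Z, delta, A, b, gamma):
    """Gradient (A.1) of L_gamma with respect to the normals a_j (one row per hyperplane)."""
    Wx = softmax_weights(X, A, b, gamma)                        # (n, N) data-term weights
    Vz = (Z @ A.T + b) / gamma
    Mz = Vz.max(axis=1, keepdims=True)
    Ez = np.exp(Vz - Mz)
    Wz = Ez / Ez.sum(axis=1, keepdims=True)                     # (m, N) grid-term weights
    phi_z = gamma * (Mz[:, 0] + np.log(Ez.sum(axis=1)))         # (A.3)
    fz = np.exp(-phi_z)                                         # exp(-phi_{n,gamma}(z_l))
    data_term = Wx.T @ X / X.shape[0]                           # (N, d)
    grid_term = delta * (Wz * fz[:, None]).T @ Z                # (N, d)
    return data_term - grid_term
```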
A second strategy attempts to reduce the number of hyperplane evaluations ν_k. It utilizes two integration grids of varying density: a sparse grid to filter inactive hyperplanes and a dense grid to evaluate the integral of f(x) and its gradient for all active hyperplanes. The sparse grid is divided into boxes B_i consisting of the 2^d adjacent grid points {v_{i_1}, . . . , v_{i_{2^d}}}, e.g. 4 boxes in Figure 2.1 (a). For each box B_i we perform the following steps:

1. Pick the point z ∈ B_i that is closest to the mean of X_n, evaluate all ν_k(z), k = 1, . . . , N_{n,d}, and set k̄ = arg max_k ν_k.

2. For each k = 1, . . . , N_{n,d} find Δ_k, an upper bound on the increase of hyperplane k relative to hyperplane k̄ within B_i. Let w_j be the width of box B_i in dimension j and

  ζ_j = { 1, if z_j = min_l v_{i_l, j};  −1, otherwise }.   (A.4)

Then

  Δ_k = ∑_{j=1}^d max( (a_{k,j} − a_{k̄,j}) w_j ζ_j, 0 ).   (A.5)

3. If ν_k(z) + Δ_k − ν_{k̄}(z) ≤ −25, exclude hyperplane k from the integration over B_i.
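A sketch of steps 2–3 for a single box (Python/NumPy, illustrative only; the γ-scaling of the bound Δ_k, which the notation above leaves implicit, is made explicit here, and all variable names are assumptions).

```python
import numpy as np

def exclude_in_box(A, b, z, box_lo, box_hi, k_bar, gamma, cutoff=-25.0):
    """Mark hyperplanes that can be dropped from the integration over one sparse-grid box:
    bound each hyperplane's maximal increase over the box relative to the dominating
    hyperplane k_bar at the probe point z (steps 2-3, (A.4)-(A.5))."""
    w = box_hi - box_lo                                    # box widths per dimension
    zeta = np.where(np.isclose(z, box_lo), 1.0, -1.0)      # (A.4): sign of the admissible move
    diff = A - A[k_bar]                                    # a_k - a_{k_bar}, shape (N, d)
    delta_k = np.maximum(diff * (w * zeta), 0.0).sum(axis=1)   # (A.5), un-scaled units
    nu = (A @ z + b) / gamma
    return nu + delta_k / gamma - nu[k_bar] <= cutoff      # True -> exclude hyperplane k
```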
For medium sized problems, this scheme reduces the number of evaluations of w_{ij} by about 90%.

Using a numerical integration scheme based on a regular grid facilitates parallelization. We automatically distribute the numerical integration (and other expensive for-loops) across all available CPU cores, using the OpenMP API (Dagum and Menon, 1998). In addition, we utilize Advanced Vector Extensions (AVX), a technique that vectorizes code by performing certain operations like addition or multiplication simultaneously for 8 single-precision or 4 double-precision values on a single CPU core. AVX is supported by all CPUs released since 2010. Both techniques, within-core and across-core parallelization, led to speed ups by a factor of more than 10 on a standard four-core machine. Due to the local character of most computations, transferring the code to the GPU promises even larger speed-ups.

Appendix B. Exact integration
Cule et al. (2010) provided an explicit formula to evaluate exactly the integral in the objective (1.11) based on a simplicial decomposition of C_n: Lemma Appendix B.1.