$L_p$-NESTED SYMMETRIC DISTRIBUTIONS

FABIAN SINZ AND MATTHIAS BETHGE
Abstract.
Tractable generalizations of the Gaussian distribution play an important role in the analysis of high-dimensional data. One very general super-class of Normal distributions is the class of $\nu$-spherical distributions, whose random variables can be represented as the product $x = r \cdot u$ of a uniformly distributed random variable $u$ on the $1$-level set of a positively homogeneous function $\nu$ and an arbitrary positive radial random variable $r$. Prominent subclasses of $\nu$-spherical distributions are spherically symmetric distributions ($\nu(x) = \|x\|_2$), which have been further generalized to the class of $L_p$-spherically symmetric distributions ($\nu(x) = \|x\|_p$). Both of these classes contain the Gaussian as a special case. In general, however, $\nu$-spherical distributions are computationally intractable since, for instance, the normalization constant or fast sampling algorithms are unknown for an arbitrary $\nu$. In this paper we introduce a new subclass of $\nu$-spherical distributions by choosing $\nu$ to be a nested cascade of $L_p$-norms. This class, which we consequently call $L_p$-nested symmetric distributions, is still computationally tractable, but includes all the aforementioned subclasses as special cases. We derive a general expression for $L_p$-nested symmetric distributions as well as the uniform distribution on the $L_p$-nested unit sphere, including an explicit expression for the normalization constant. We state several general properties of $L_p$-nested symmetric distributions, investigate their marginals and maximum likelihood fitting, and discuss their tight links to well-known machine learning methods such as Independent Component Analysis (ICA), Independent Subspace Analysis (ISA) and mixed norm regularizers. Finally, we derive a fast and exact sampling algorithm for arbitrary $L_p$-nested symmetric distributions, and introduce the Nested Radial Factorization algorithm (NRF), which is a form of non-linear ICA that transforms any linearly mixed, non-factorial $L_p$-nested source into statistically independent signals.

Key words: parametric density model, symmetric distribution, $\nu$-spherically symmetric distributions, non-linear independent component analysis, independent subspace analysis, robust Bayesian inference, mixed norm density model, uniform distributions on mixed norm spheres, nested radial factorization

1. Introduction
High-dimensional data analysis virtually always starts with the measurement of first and second-order moments that are sufficient to fit a multivariate Gaussian distribution, the maximum entropy distribution under these constraints. Natural data, however, often exhibit significant deviations from a Gaussian distribution. In order to model these higher-order correlations, it is necessary to have more flexible distributions available. Therefore, it is an important challenge to find generalizations of the Gaussian distribution which are, on the one hand, more flexible but, on the other hand, still exhibit enough structure to be computationally and analytically tractable. In particular, probability models with an explicit normalization constant are desirable because they make direct model comparison possible by comparing the likelihood of held-out test samples for different models. Additionally, such models often allow for a direct optimization of the likelihood.

One way of imposing structure on probability distributions is to fix the general form of the iso-density contour lines. This approach was taken by Fernandez et al. [1995]. They modeled the contour lines by the level sets of a positively homogeneous function of degree one, i.e. functions $\nu$ that fulfill $\nu(a \cdot x) = a \cdot \nu(x)$ for $x \in \mathbb{R}^n$ and $a \in \mathbb{R}^+_0$. The resulting class of $\nu$-spherical distributions has the general form $p(x) = \rho(\nu(x))$ for an appropriate $\rho$ which causes $p(x)$ to integrate to one. Since the only access of $\rho$ to $x$ is via $\nu$, one can show that, for a fixed $\nu$, those distributions are generated by a univariate radial distribution. In other words, $\nu$-spherically distributed random variables can be represented as a product of two independent random variables: one positive radial variable and another variable which is uniform on the $1$-level set of $\nu$. This property makes this class of distributions easy to fit to data since the maximum likelihood procedure can be carried out on the univariate radial distribution instead of the joint density. Unfortunately, deriving the normalization constant for the joint distribution in the general case is intractable because it depends on the surface area of those level sets, which can usually not be computed analytically.

Known tractable subclasses of $\nu$-spherical distributions are the Gaussian, elliptically contoured, and $L_p$-spherical distributions. The Gaussian is a special case of elliptically contoured distributions. After centering and whitening $x := C^{-1/2}(s - \mathbb{E}[s])$, a Gaussian distribution is spherically symmetric and the squared $L_2$-norm $\|x\|_2^2 = x_1^2 + \dots + x_n^2$ of the samples follows a $\chi^2$-distribution (i.e. the radial distribution is a $\chi$-distribution). Elliptically contoured distributions other than the Gaussian are obtained by using a radial distribution different from the $\chi$-distribution [Fang et al., 1990; Kelker, 1970].

The extension from $L_2$- to $L_p$-spherically symmetric distributions is based on replacing the $L_2$-norm by the $L_p$-norm
$$\nu(x) = \|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{\frac{1}{p}}, \quad p > 0.$$
$L_p$-spherical distributions can always be written in the form $p(x) = \rho(\|x\|_p)$. Those distributions have been studied by Osiewalski and Steel [1993] and Gupta and Song [1997].
We will adopt the naming convention of Gupta and Song [1997] and call $\|x\|_p$ an $L_p$-norm even though the triangle inequality only holds for $p \geq 1$. $L_p$-spherically symmetric distributions with $p \neq 2$ are no longer invariant with respect to rotations (transformations from $SO(n)$). Instead, they are only invariant under permutations of the coordinate axes. In some cases, it may not be too restrictive to assume permutation or even rotational symmetry for the data. In other cases, such symmetry assumptions might not be justified and let the model miss important regularities.

Here, we present a generalization of the class of $L_p$-spherically symmetric distributions within the class of $\nu$-spherical distributions that makes weaker assumptions about the symmetries in the data but is still analytically tractable. Instead of using a single $L_p$-norm to define the contour of the density, we use a nested cascade of $L_p$-norms where an $L_p$-norm is computed over groups of $L_p$-norms over groups of $L_p$-norms ..., each of which may have a different $p$. Due to this nested structure we call this new class of distributions $L_p$-nested symmetric distributions. The nested combination of $L_p$-norms preserves positive homogeneity but does not require permutation invariance anymore. While $L_p$-nested distributions are still invariant under reflections of the coordinate axes, permutation symmetry only holds within the subspaces of the $L_p$-norms at the bottom of the cascade. As demonstrated in Sinz et al. [2009b], one possible application domain of $L_p$-nested symmetric distributions are patches of natural images. In the current paper, we present a formal treatment of this class of distributions; we refer readers interested in the application of these distributions to natural images to Sinz et al. [2009b].

We demonstrate below that the construction of the nested $L_p$-norm cascade still bears enough structure to compute the Jacobian of polar-like coordinates similar to those of Song and Gupta [1997] and Gupta and Song [1997]. With this Jacobian at hand it is possible to compute the univariate radial distribution for an arbitrary $L_p$-nested density and to define the uniform distribution on the $L_p$-nested unit sphere $L_\nu = \{x \in \mathbb{R}^n \mid \nu(x) = 1\}$. Furthermore, we compute the surface area of the $L_p$-nested unit sphere and, therefore, the general normalization constant for $L_p$-nested distributions. By deriving these general relations for the class of $L_p$-nested distributions we have determined a new class of tractable $\nu$-spherical distributions which is so far the only one containing the Gaussian, elliptically contoured, and $L_p$-spherical distributions as special cases.

$L_p$-spherically symmetric distributions have been used in various contexts in statistics and machine learning. Many results carry over to $L_p$-nested symmetric distributions, which allows a wider application range. Osiewalski and Steel [1993] showed that the posterior on the location of an $L_p$-spherically symmetric distribution together with an improper Jeffrey's prior on the scale does not depend on the particular type of $L_p$-spherically symmetric distribution used. Below, we show that this result carries over to $L_p$-nested symmetric distributions.
This means that we can robustly determine the location parameter by Bayesian inference for a very large class of distributions.

A large class of machine learning algorithms can be written as an optimization problem on the sum of a regularizer and a loss function. For certain regularizers and loss functions, like the sparse $L_1$ regularizer and the mean squared loss, the optimization problem can be seen as the Maximum A Posteriori (MAP) estimate of a stochastic model in which the prior and the likelihood are the negative exponentiated regularizer and loss terms. Since $p(x) \propto \exp(-\|x\|_p^p)$ is an $L_p$-spherically symmetric model, regularizers which can be written in terms of a norm have a tight link to $L_p$-spherically symmetric distributions. In an analogous way, $L_p$-nested distributions exhibit a tight link to mixed-norm regularizers, which have recently gained increasing interest in the machine learning community [see e.g. Kowalski et al., 2008; Yuan and Lin, 2006; Zhao et al., 2008]. $L_p$-nested symmetric distributions can be used for a Bayesian treatment of mixed-norm regularized algorithms. Furthermore, they can be used to understand the prior assumptions made by such regularizers. Below we discuss an implicit dependence assumption between the regularized variables that follows from the theory of $L_p$-nested symmetric distributions.

Finally, the only factorial $L_p$-spherically symmetric distribution [Sinz et al., 2009a], the $p$-generalized Normal distribution, has been used as an ICA model in which the marginals follow an exponential power distribution. This class of ICA models is particularly suited for natural signals like images and sounds [Lee and Lewicki, 2000; Lewicki, 2002; Zhang et al., 2004]. Interestingly, $L_p$-spherically symmetric distributions other than the $p$-generalized Normal give rise to a non-linear ICA algorithm called Radial Gaussianization for $p = 2$ [Lyu and Simoncelli, 2009] or Radial Factorization for arbitrary $p$ [Sinz and Bethge, 2009]. As discussed below, $L_p$-nested distributions are a natural extension of the linear $L_p$-spherically symmetric ICA algorithm to ISA, and give rise to a more general non-linear ICA algorithm in the spirit of Radial Factorization.

The remaining part of the paper is structured as follows: in Section 2 we define polar-like coordinates for $L_p$-nested symmetrically distributed random variables and present an analytical expression for the determinant of the Jacobian for this coordinate transformation. Using this expression, we define the uniform distribution on the $L_p$-nested unit sphere and the class of $L_p$-nested symmetric distributions for an arbitrary $L_p$-nested function in Section 3. In Section 4 we derive an analytical form of $L_p$-nested symmetric distributions when marginalizing out lower levels of the $L_p$-nested cascade and demonstrate that marginals of $L_p$-nested symmetric distributions are not necessarily $L_p$-nested. Additionally, we demonstrate that the only factorial $L_p$-nested symmetric distribution is necessarily $L_p$-spherical and discuss the implications of this result for mixed norm regularizers. In Section 5 we propose an algorithm for fitting arbitrary $L_p$-nested models and derive a sampling scheme for arbitrary $L_p$-nested symmetric distributions. In Section 6 we generalize a result by Osiewalski and Steel [1993] on robust Bayesian inference of the location parameter to $L_p$-nested symmetric distributions.
In Section 7 we discuss the relationship of $L_p$-nested symmetric distributions to ICA and ISA, and their possible role as priors on hidden variables in over-complete linear models. Finally, we derive a non-linear ICA algorithm for linearly mixed non-factorial $L_p$-nested sources in Section 8, which we call Nested Radial Factorization (NRF).

2. $L_p$-nested Functions, Coordinate Transformation and Jacobian

Consider the function
$$f(x) = \left(|x_1|^{p_\emptyset} + \left(|x_2|^{p_1} + |x_3|^{p_1}\right)^{\frac{p_\emptyset}{p_1}}\right)^{\frac{1}{p_\emptyset}} \tag{1}$$
with $p_\emptyset, p_1 \in \mathbb{R}^+$. This function is obviously a cascade of two $L_p$-norms and is thus positively homogeneous of degree one. Figure 1(a) shows this function visualized as a tree. Naturally, any tree like the ones in Figure 1 corresponds to a function of the kind of equation (1). In general, the $n$ leaves of the tree correspond to the $n$ coefficients of the vector $x \in \mathbb{R}^n$ and each inner node computes the $L_p$-norm of its children using its specific $p$. We call the class of functions which is generated in that way $L_p$-nested, and the corresponding distributions that are symmetric or invariant with respect to such a function $L_p$-nested symmetric distributions.
Figure 1. Equation (1) visualized as a tree with two different naming conventions. Panel (a) shows the tree where the nodes are labeled with the coefficients of $x \in \mathbb{R}^n$. Panel (b) shows the same tree in multi-index notation, where the multi-index of a node describes the path from the root node to that node in the tree. The leaves $v_1$, $v_{2,1}$ and $v_{2,2}$ still correspond to $x_1$, $x_2$ and $x_3$, respectively, but have been renamed to the multi-index notation used in this article.

Table 1. Summary of the notation used for $L_p$-nested functions in this article.

  $f(\cdot) = f_\emptyset(\cdot)$ : the $L_p$-nested function
  $I = i_1, \dots, i_m$ : multi-index denoting a node in the tree; the single indices describe the path from the root node to the respective node $I$
  $x_I$ : all entries of $x$ that correspond to the leaves in the subtree under the node $I$
  $x_{\widehat{I}}$ : all entries of $x$ that are not leaves in the subtree under the node $I$
  $f_I(\cdot)$ : $L_p$-nested function corresponding to the subtree under the node $I$
  $v_\emptyset$ : function value at the root node
  $v_I$ : function value at an arbitrary node with multi-index $I$
  $\ell_I$ : the number of direct children of a node $I$
  $n_I$ : the number of leaves in the subtree under the node $I$
  $v_{I,1:\ell_I}$ : vector with the function values at the direct children of a node $I$
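Since every $L_p$-nested function is fully specified by its tree, it can be evaluated by a simple recursion. The following minimal sketch is our own illustration (not part of the original paper); the nested-tuple tree encoding and the helper name lp_nested are assumptions made here for demonstration purposes only:

```python
import numpy as np

def lp_nested(tree, x):
    """Evaluate an Lp-nested function given as a nested tuple.

    A node is either an int (the index of a leaf coefficient of x) or a
    pair (p, children); each inner node computes the Lp-norm of the
    values of its children with its own p.
    """
    if isinstance(tree, int):          # leaf: contributes |x_i|
        return abs(x[tree])
    p, children = tree                 # inner node with its own p
    vals = np.array([lp_nested(c, x) for c in children])
    return np.sum(vals ** p) ** (1.0 / p)

# Equation (1): f(x) = (|x1|^p0 + (|x2|^p1 + |x3|^p1)^(p0/p1))^(1/p0)
p0, p1 = 2.0, 0.5
tree = (p0, [0, (p1, [1, 2])])
x = np.array([0.3, -1.2, 0.7])
print(lp_nested(tree, x))              # f(x)
print(lp_nested(tree, 2 * x))          # = 2 f(x): positive homogeneity
```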
$L_p$-nested functions are much more flexible in creating different shapes of level sets than single $L_p$-norms. Those level sets become the iso-density contours in the family of $L_p$-nested symmetric distributions. Figure 2 shows a variety of contours generated by the simplest non-trivial $L_p$-nested function shown in equation (1). The shapes show the unit spheres for all possible combinations of $p_\emptyset, p_1 \in \{0.5, 1, 2, 10\}$. On the diagonal, $p_\emptyset$ and $p_1$ are equal and therefore constitute $L_p$-norms. The corresponding distributions are members of the $L_p$-spherically symmetric class.

Figure 2. Variety of contours created by the $L_p$-nested function of equation (1) for all combinations of $p_\emptyset, p_1 \in \{0.5, 1, 2, 10\}$.

In order to make general statements about arbitrary $L_p$-nested functions, we introduce a notation that is suitable for the tree structure of $L_p$-nested functions. As we will heavily use that notation in the remainder of the paper, we would like to emphasize the importance of the following paragraphs. We will illustrate the notation with an example below. Additionally, Figure 1 and Table 1 can be used for reference.

We use multi-indices to denote the different nodes of the tree corresponding to an $L_p$-nested function $f$. The function $f = f_\emptyset$ itself computes the value $v_\emptyset$ at the root node (see Figure 1). Those values are denoted by variables $v$. The functions corresponding to its children are denoted
by $f_1, \dots, f_{\ell_\emptyset}$, i.e. $f(\cdot) = f_\emptyset(\cdot) = \|(f_1(\cdot), \dots, f_{\ell_\emptyset}(\cdot))\|_{p_\emptyset}$. We always use the letter "$\ell$" indexed by the node's multi-index to denote the total number of direct children of that node. The functions of the children of the $i$-th child of the root node are denoted by $f_{i,1}, \dots, f_{i,\ell_i}$ and so on. In this manner, an index is added for denoting the children of a particular node in the tree, and each multi-index denotes the path to the respective node in the tree. For the sake of compact notation, we use upper-case letters to denote a single multi-index $I = i_1, \dots, i_m$. The range of the single indices and the length of the multi-index should be clear from the context. A concatenation $I,k$ of a multi-index $I$ with a single index $k$ corresponds to adding $k$ to the index tuple, i.e. $I,k = i_1, \dots, i_m, k$. We use the convention that $I,\emptyset = I$. Those coefficients of the vector $x$ that correspond to leaves of the subtree under a node with the index $I$ are denoted by $x_I$. The complement of those coefficients, i.e. the ones that are not in the subtree under the node $I$, are denoted by $x_{\widehat{I}}$. The number of leaves in a subtree under a node $I$ is denoted by $n_I$. If $I$ denotes a leaf then $n_I = 1$.
The $L_p$-nested function associated with the subtree under a node $I$ is denoted by
$$f_I(x_I) = \|(f_{I,1}(x_{I,1}), \dots, f_{I,\ell_I}(x_{I,\ell_I}))^\top\|_{p_I}.$$
Just like for the root node, we use the variable $v_I$ to denote the function value $v_I = f_I(x_I)$ of a subtree $I$. A vector with the function values of the children of $I$ is denoted with bold font $v_{I,1:\ell_I}$, where the colon indicates that we mean the vector of the function values of the $\ell_I$ children of node $I$:
$$f_I(x_I) = \|(f_{I,1}(x_{I,1}), \dots, f_{I,\ell_I}(x_{I,\ell_I}))^\top\|_{p_I} = \|(v_{I,1}, \dots, v_{I,\ell_I})^\top\|_{p_I} = \|v_{I,1:\ell_I}\|_{p_I}.$$
Note that we can assign an arbitrary $p$ to leaf nodes since a $p$ applied to a single variable always cancels. For that reason we can choose an arbitrary $p$ for convenience and fix its value to $p = 1$ for leaves. Figure 1(b) shows the multi-index notation for our example of equation (1).

To illustrate the notation: Let $I = i_1, \dots, i_d$ be the multi-index of a node in the tree. $i_1, \dots, i_d$ describes the path to that node, i.e. the respective node is the $i_d$-th child of the $i_{d-1}$-th child of the $i_{d-2}$-th child of the ... of the $i_1$-th child of the root node. Assume that the leaves in the subtree below the node $I$ cover the vector entries $x_2, \dots, x_{10}$. Then $x_I = (x_2, \dots, x_{10})$, $x_{\widehat{I}} = (x_1, x_{11}, x_{12}, \dots)$, and $n_I = 9$. Assume further that node $I$ has $\ell_I = 2$ children. Those would be denoted by $I,1$ and $I,2$. The $L_p$-nested function of the subtree would be denoted by $f_I$ and only acts on $x_I$. The value of the function would be $f_I(x_I) = v_I$, and the vector containing the values of the children of $I$ would be $v_{I,1:2} = (v_{I,1}, v_{I,2})^\top = (f_{I,1}(x_{I,1}), f_{I,2}(x_{I,2}))^\top$.

We now introduce a coordinate representation that is especially tailored to $L_p$-nested symmetrically distributed variables: One of the most important consequences of the positive homogeneity of $f$ is that it can be used to "normalize" vectors and, by that property, create a polar-like coordinate representation of a vector $x$. Such polar-like coordinates generalize the coordinate representation for $L_p$-norms by Gupta and Song [1997].

Definition 1 (Polar-like Coordinates). We define the following polar-like coordinates for a vector $x \in \mathbb{R}^n$:
$$u_i = \frac{x_i}{f(x)} \quad \text{for } i = 1, \dots, n-1, \qquad r = f(x).$$
The inverse coordinate transformation is given by
$$x_i = r u_i \quad \text{for } i = 1, \dots, n-1, \qquad x_n = r \Delta_n u_n,$$
where $\Delta_n = \operatorname{sgn} x_n$ and $u_n = \frac{|x_n|}{f(x)}$.

Note that $u_n$ is not part of the coordinate representation since normalization with $1/f(x)$ decreases the degrees of freedom of $u$ by one, i.e. $u_n$ can always be computed from all other $u_i$ by solving $f(u) = f(x/f(x)) = 1$ for $u_n$. We only use the term $u_n$ for notational simplicity. With a slight abuse of notation, we will use $u$ to denote the normalized vector $x/f(x)$ or only its first $n-1$ components; the meaning should be clear from the context.

These coordinates generalize the polar-like coordinates for $L_p$-spherically symmetric distributions, where the $L_p$-norm is now replaced by an $L_p$-nested function. Just as in the case of $L_p$-spherical coordinates, it will turn out that the determinant of the Jacobian of the coordinate transformation does not depend on the value of $\Delta_n$ and can be computed analytically. The determinant is essential for deriving the uniform distribution on the unit $L_p$-nested sphere $L_f$, i.e. the $1$-level set of $f$. Apart from that, it can be used to compute the radial distribution for a given $L_p$-nested distribution. We start by stating the general form of the determinant in terms of the partial derivatives $\frac{\partial u_n}{\partial u_k}$, $u_k$ and $r$.
Afterwards we demonstrate that those partial derivatives have a special form and that most of them cancel in Laplace's expansion of the determinant.

Lemma 1 (Determinant of the Jacobian). Let $r$ and $u$ be defined as in Definition 1. The general form of the determinant of the Jacobian $\mathcal{J} = \left(\frac{\partial x_i}{\partial y_j}\right)_{ij}$ of the inverse coordinate transformation for $y_1 = r$ and $y_i = u_{i-1}$ for $i = 2, \dots, n$ is given by
$$|\det \mathcal{J}| = r^{n-1}\left(-\sum_{k=1}^{n-1} \frac{\partial u_n}{\partial u_k} \cdot u_k + u_n\right). \tag{2}$$
Proof. The proof can be found in Appendix A. □
The problematic parts in equation (2) are the terms $\frac{\partial u_n}{\partial u_k}$, which obviously involve extensive usage of the chain rule. Fortunately, most of them cancel when inserting them back into equation (2), leaving a comparably simple formula. The remaining part of this section is devoted to computing those terms and demonstrating how they vanish in the formula for the determinant. Before we state the general case we would like to demonstrate the basic mechanism through a simple example. We urge the reader to follow the next example as it illustrates all important ideas about the coordinate transformation and its Jacobian.

Example 1. Consider an $L_p$-nested function very similar to our introductory example of equation (1):
$$f(x) = \left(\left(|x_1|^{p_1} + |x_2|^{p_1}\right)^{\frac{p_\emptyset}{p_1}} + |x_3|^{p_\emptyset}\right)^{\frac{1}{p_\emptyset}}.$$
Setting $u = \frac{x}{f(x)}$ and solving for $u_3$ yields
$$f(u) = 1 \iff u_3 = \left(1 - \left(|u_1|^{p_1} + |u_2|^{p_1}\right)^{\frac{p_\emptyset}{p_1}}\right)^{\frac{1}{p_\emptyset}}. \tag{3}$$
We would like to emphasize again that $u_3$ is actually not part of the coordinate representation and only used for notational simplicity. By construction, $u_3$ is always positive. This is no restriction since the determinant of the Jacobian does not depend on its sign (see Lemma 1). However, when computing the volume and the surface area of the $L_p$-nested unit sphere it will become important, since it introduces a factor of $2$ to account for the fact that $u_3$ (or $u_n$ in general) can in principle also attain negative values.

Now, consider
$$G_3(u_{\widehat{3}}) = g_3(u_{\widehat{3}})^{1-p_\emptyset} = \left(1 - \left(|u_1|^{p_1} + |u_2|^{p_1}\right)^{\frac{p_\emptyset}{p_1}}\right)^{\frac{1-p_\emptyset}{p_\emptyset}},$$
$$F_1(u_1) = f_1(u_1)^{p_\emptyset - p_1} = \left(|u_1|^{p_1} + |u_2|^{p_1}\right)^{\frac{p_\emptyset - p_1}{p_1}},$$
where the subindices of $u_3$, $f_1$, $g_3$, $G_3$ and $F_1$ have to be read as multi-indices. The function $g_I$ computes the value of the node $I$ from all other leaves that are not part of the subtree under $I$ by fixing the value of the root node to one.

$G_3(u_{\widehat{3}})$ and $F_1(u_1)$ are terms that arise from applying the chain rule when computing the partial derivatives $\frac{\partial u_3}{\partial u_k}$. Taking those partial derivatives can be thought of as peeling off equation (3) layer by layer via the chain rule. By doing so, we "move" on a path between $u_3$ and $u_k$. Each application of the chain rule corresponds to one step up or down in the tree. First, we move upwards in the tree, starting from $u_3$. This produces the $G$-terms. In this example, there is only one step upwards, but in general, there can be several, depending on the depth of $u_n$ in the tree. Each step up will produce one $G$-term. At some point, we will move downwards in the tree to reach $u_k$. This will produce the $F$-terms. While there are as many $G$-terms as upward steps, there is one term less when moving downwards. Therefore, in this example, there is one term $G_3(u_{\widehat{3}})$ which originates from using the chain rule upwards in the tree and one term $F_1(u_1)$ from using it downwards. The indices correspond to the multi-indices of the respective nodes.

Computing the derivative yields
$$\frac{\partial u_3}{\partial u_k} = -G_3(u_{\widehat{3}}) F_1(u_1) \Delta_k |u_k|^{p_1 - 1} \quad \text{for } k = 1, 2.$$
By inserting these results into equation (2) we obtain
$$\frac{1}{r^2}|\det \mathcal{J}| = \sum_{k=1}^{2} G_3(u_{\widehat{3}}) F_1(u_1) |u_k|^{p_1} + u_3$$
$$= G_3(u_{\widehat{3}})\left(F_1(u_1)\sum_{k=1}^{2}|u_k|^{p_1} + 1 - \left(|u_1|^{p_1} + |u_2|^{p_1}\right)^{\frac{p_\emptyset}{p_1}}\right)$$
$$= G_3(u_{\widehat{3}})\left(F_1(u_1)\sum_{k=1}^{2}|u_k|^{p_1} + 1 - F_1(u_1)\sum_{k=1}^{2}|u_k|^{p_1}\right)$$
$$= G_3(u_{\widehat{3}}),$$
where the second equality uses $u_3 = G_3(u_{\widehat{3}})\left(1 - (|u_1|^{p_1} + |u_2|^{p_1})^{p_\emptyset/p_1}\right)$, which follows from equation (3) and the definition of $G_3$, and the third equality uses $(|u_1|^{p_1} + |u_2|^{p_1})^{p_\emptyset/p_1} = F_1(u_1)\sum_{k=1}^{2}|u_k|^{p_1}$, which follows from the definition of $F_1$.

The example suggests that the terms from using the chain rule downwards in the tree cancel, while the terms from using the chain rule upwards remain.
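The cancellation can also be checked numerically. The following sketch is our own illustration (not from the paper): it compares a finite-difference Jacobian of the inverse transformation of Example 1 against the closed form $|\det \mathcal{J}| = r^2 G_3(u_{\widehat{3}})$, for arbitrarily chosen values of $p_\emptyset$ and $p_1$:

```python
import numpy as np

p0, p1 = 3.0, 1.5  # p_root and p_1, chosen arbitrarily

def inv_coords(r, u1, u2):
    """Inverse polar-like map (r, u1, u2) -> x for Example 1 (positive branch)."""
    u3 = (1.0 - (abs(u1)**p1 + abs(u2)**p1)**(p0 / p1))**(1.0 / p0)
    return np.array([r * u1, r * u2, r * u3])

def num_jacobian(g, y, eps=1e-6):
    """Central finite-difference Jacobian of g at y."""
    J = np.zeros((len(y), len(y)))
    for j in range(len(y)):
        e = np.zeros(len(y)); e[j] = eps
        J[:, j] = (g(*(y + e)) - g(*(y - e))) / (2 * eps)
    return J

r, u1, u2 = 1.7, 0.4, 0.3
J = num_jacobian(inv_coords, np.array([r, u1, u2]))
G3 = (1.0 - (u1**p1 + u2**p1)**(p0 / p1))**((1.0 - p0) / p0)
print(abs(np.linalg.det(J)), r**2 * G3)   # the two numbers should agree
```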
The following proposition states that this is true in general.

Proposition 1 (Determinant of the Jacobian). Let $\mathcal{L}$ be the set of multi-indices of the nodes on the path from the leaf $u_n$ to the root node (excluding the root node) and let the terms $G_{I,\ell_I}(u_{\widehat{I,\ell_I}})$ be recursively defined as
$$G_{I,\ell_I}(u_{\widehat{I,\ell_I}}) = g_{I,\ell_I}(u_{\widehat{I,\ell_I}})^{p_{I,\ell_I} - p_I} = \left(g_I(u_{\widehat{I}})^{p_I} - \sum_{j=1}^{\ell_I - 1} f_{I,j}(u_{I,j})^{p_I}\right)^{\frac{p_{I,\ell_I} - p_I}{p_I}}, \tag{4}$$
where each of the functions $g_{I,\ell_I}$ computes the value of the $\ell_I$-th child of a node $I$ as a function of its neighbors $(I,1), \dots, (I,\ell_I - 1)$ and its parent $I$, while fixing the value of the root node to one. This is equivalent to computing the value of the node $I,\ell_I$ from all coefficients $u_{\widehat{I,\ell_I}}$ that are not leaves in the subtree under $I,\ell_I$. Then, the determinant of the Jacobian for an $L_p$-nested function is given by
$$|\det \mathcal{J}| = r^{n-1} \prod_{L \in \mathcal{L}} G_L(u_{\widehat{L}}).$$
Proof. The proof can be found in Appendix A. □
Let us illustrate the determinant with two examples:
Example 2. Consider an ordinary $L_p$-norm
$$f(x) = \left(\sum_{i=1}^n |x_i|^p\right)^{\frac{1}{p}},$$
which is obviously also an $L_p$-nested function. Resolving the equation $f(u) = 1$ for the last coordinate of the normalized vector $u$ yields $g_n(u_{\widehat{n}}) = u_n = \left(1 - \sum_{i=1}^{n-1} |u_i|^p\right)^{\frac{1}{p}}$. Thus, the term $G_n(u_{\widehat{n}})$ is given by $\left(1 - \sum_{i=1}^{n-1} |u_i|^p\right)^{\frac{1-p}{p}}$, which yields a determinant of
$$|\det \mathcal{J}| = r^{n-1}\left(1 - \sum_{i=1}^{n-1} |u_i|^p\right)^{\frac{1-p}{p}}.$$
This is exactly the one derived by Gupta and Song [1997].

Example 3. Consider the introductory example
$$f(x) = \left(|x_1|^{p_\emptyset} + \left(|x_2|^{p_1} + |x_3|^{p_1}\right)^{\frac{p_\emptyset}{p_1}}\right)^{\frac{1}{p_\emptyset}}.$$
Normalizing and resolving for the last coordinate yields
$$u_3 = \left(\left(1 - |u_1|^{p_\emptyset}\right)^{\frac{p_1}{p_\emptyset}} - |u_2|^{p_1}\right)^{\frac{1}{p_1}},$$
and the terms $G_2(u_{\widehat{2}})$ and $G_{2,2}(u_{\widehat{2,2}})$ of the determinant $|\det \mathcal{J}| = r^2 G_2(u_{\widehat{2}}) G_{2,2}(u_{\widehat{2,2}})$ are given by
$$G_2(u_{\widehat{2}}) = \left(1 - |u_1|^{p_\emptyset}\right)^{\frac{p_1 - p_\emptyset}{p_\emptyset}}, \qquad G_{2,2}(u_{\widehat{2,2}}) = \left(\left(1 - |u_1|^{p_\emptyset}\right)^{\frac{p_1}{p_\emptyset}} - |u_2|^{p_1}\right)^{\frac{1 - p_1}{p_1}}.$$
Note the difference to Example 1, where $x_3$ was at depth one in the tree, while $x_3$ is at depth two in the current case. For that reason, the determinant of the Jacobian in Example 1 only involved one $G$-term, while it has two $G$-terms here.

3. $L_p$-nested Symmetric and $L_p$-nested Uniform Distribution

In this section, we define the $L_p$-nested symmetric and the $L_p$-nested uniform distribution and derive their partition functions. In particular, we derive the surface area of an arbitrary $L_p$-nested unit sphere $L_f = \{x \in \mathbb{R}^n \mid f(x) = 1\}$ corresponding to an $L_p$-nested function $f$. By equation (5) of Fernandez et al. [1995], every $\nu$-spherically symmetric and hence any $L_p$-nested density has the form
$$\rho(x) = \frac{\varrho(f(x))}{f(x)^{n-1} S_f(1)}, \tag{5}$$
where $S_f$ is the surface area of $L_f$ and $\varrho$ is a density on $\mathbb{R}^+$. Thus, we need to compute the surface area of an arbitrary $L_p$-nested unit sphere to obtain the partition function of equation (5).

Proposition 2 (Volume and Surface of the $L_p$-nested Sphere). Let $f$ be an $L_p$-nested function and let $\mathcal{I}$ be the set of all multi-indices denoting the inner nodes of the tree structure associated with $f$. The volume $V_f(R)$ and the surface $S_f(R)$ of the $L_p$-nested sphere with radius $R$ are given by
$$V_f(R) = \frac{R^n 2^n}{n} \prod_{I \in \mathcal{I}} \frac{1}{p_I^{\ell_I - 1}} \prod_{k=1}^{\ell_I - 1} B\left[\frac{\sum_{i=1}^k n_{I,i}}{p_I}, \frac{n_{I,k+1}}{p_I}\right] \tag{6}$$
$$= \frac{R^n 2^n}{n} \prod_{I \in \mathcal{I}} \frac{\prod_{k=1}^{\ell_I} \Gamma\left[\frac{n_{I,k}}{p_I}\right]}{p_I^{\ell_I - 1} \Gamma\left[\frac{n_I}{p_I}\right]} \tag{7}$$
$$S_f(R) = R^{n-1} 2^n \prod_{I \in \mathcal{I}} \frac{1}{p_I^{\ell_I - 1}} \prod_{k=1}^{\ell_I - 1} B\left[\frac{\sum_{i=1}^k n_{I,i}}{p_I}, \frac{n_{I,k+1}}{p_I}\right] \tag{8}$$
$$= R^{n-1} 2^n \prod_{I \in \mathcal{I}} \frac{\prod_{k=1}^{\ell_I} \Gamma\left[\frac{n_{I,k}}{p_I}\right]}{p_I^{\ell_I - 1} \Gamma\left[\frac{n_I}{p_I}\right]}, \tag{9}$$
where $B[a,b] = \frac{\Gamma[a]\Gamma[b]}{\Gamma[a+b]}$ denotes the $\beta$-function.

Proof. The proof can be found in Appendix B. □
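Because the normalization constant is just a product over inner nodes, it can be evaluated in a few lines. The sketch below is our own (reusing the hypothetical nested-tuple tree encoding and helper names from the earlier sketch; log-space arithmetic is used for numerical stability); it checks equation (9) against the closed form for a plain $L_p$-norm:

```python
import numpy as np
from scipy.special import gammaln

def log_surface(tree):
    """log S_f(1) of the Lp-nested unit sphere via equation (9)."""
    def walk(node):
        # returns (n_I, sum of log-factors over inner nodes of this subtree)
        if isinstance(node, int):
            return 1, 0.0
        p, children = node
        ns, logs = zip(*(walk(c) for c in children))
        n_I = sum(ns)
        log_I = (sum(gammaln(nk / p) for nk in ns)
                 - (len(children) - 1) * np.log(p) - gammaln(n_I / p))
        return n_I, log_I + sum(logs)
    n, log_prod = walk(tree)
    return n * np.log(2.0) + log_prod

def log_volume(tree):
    """log V_f(1) = log S_f(1) - log n, by Proposition 2."""
    def n_leaves(node):
        return 1 if isinstance(node, int) else sum(n_leaves(c) for c in node[1])
    return log_surface(tree) - np.log(n_leaves(tree))

# sanity check against the Lp-norm: S = 2^n Gamma(1/p)^n / (p^(n-1) Gamma(n/p))
n, p = 3, 1.3
flat = (p, list(range(n)))
closed = (n * np.log(2) + n * gammaln(1 / p)
          - (n - 1) * np.log(p) - gammaln(n / p))
print(np.exp(log_surface(flat)), np.exp(closed))   # should agree
```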
Inserting the surface area into equation (5), we obtain the general form of an $L_p$-nested symmetric distribution for any given radial density $\varrho$.

Corollary 1 ($L_p$-nested Symmetric Distribution). Let $f$ be an $L_p$-nested function and $\varrho$ a density on $\mathbb{R}^+$. The corresponding $L_p$-nested symmetric distribution is given by
$$\rho(x) = \frac{\varrho(f(x))}{f(x)^{n-1} S_f(1)} = \frac{\varrho(f(x))}{2^n f(x)^{n-1}} \prod_{I \in \mathcal{I}} p_I^{\ell_I - 1} \left(\prod_{k=1}^{\ell_I - 1} B\left[\frac{\sum_{i=1}^k n_{I,i}}{p_I}, \frac{n_{I,k+1}}{p_I}\right]\right)^{-1}. \tag{10}$$

The results of Fernandez et al. [1995] imply that for any $\nu$-spherically symmetric distribution, the radial part is independent of the directional part, i.e. $r$ is independent of $u$. The distribution of $u$ is entirely determined by the choice of $\nu$, or by the $L_p$-nested function $f$ in our case. The distribution of $r$ is determined by the radial density $\varrho$. Together, an $L_p$-nested symmetric distribution is determined by both the $L_p$-nested function $f$ and the choice of $\varrho$. From equation (10), we can see that its density function must be the inverse of the surface area of $L_f$ times the radial density when transforming (5) into the coordinates of Definition 1 and separating $r$ and $u$ (the factor $f(x)^{n-1} = r^{n-1}$ cancels due to the determinant of the Jacobian). For that reason we call the distribution of $u$ uniform on the $L_p$-nested sphere $L_f$, in analogy to Song and Gupta [1997]. Next, we state its form in terms of the coordinates $u$.

Proposition 3 ($L_p$-nested Uniform Distribution). Let $f$ be an $L_p$-nested function. Let $\mathcal{L}$ be the set of multi-indices on the path from the root node to the leaf corresponding to $x_n$. The uniform distribution on the $L_p$-nested unit sphere, i.e. the set $L_f = \{x \in \mathbb{R}^n \mid f(x) = 1\}$, is given by the following density over $u_1, \dots, u_{n-1}$:
$$\rho(u_1, \dots, u_{n-1}) = \frac{\prod_{L \in \mathcal{L}} G_L(u_{\widehat{L}})}{2^{n-1}} \prod_{I \in \mathcal{I}} p_I^{\ell_I - 1} \left(\prod_{k=1}^{\ell_I - 1} B\left[\frac{\sum_{i=1}^k n_{I,i}}{p_I}, \frac{n_{I,k+1}}{p_I}\right]\right)^{-1}.$$
Proof. Since the $L_p$-nested sphere is a measurable and compact set, the density of the uniform distribution is simply one over the surface area of the unit $L_p$-nested sphere. The surface $S_f(1)$ is given by Proposition 2. Transforming $S_f(1)$ into the coordinates of Definition 1 introduces the determinant of the Jacobian from Proposition 1 and an additional factor of $2$, since the $(u_1, \dots, u_{n-1}) \in \mathbb{R}^{n-1}$ have to account for both half-shells of the $L_p$-nested unit sphere, i.e. to account for the fact that $u_n$ could have been positive or negative. This yields the expression above. □

Example 4. Let us again demonstrate the proposition for the special case where $f$ is an $L_p$-norm $f(x) = \|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$. Using Proposition 2, the surface area is given by
$$S_{\|\cdot\|_p}(1) = \frac{2^n}{p^{\ell_\emptyset - 1}} \prod_{k=1}^{\ell_\emptyset - 1} B\left[\frac{\sum_{i=1}^k n_i}{p}, \frac{n_{k+1}}{p}\right] = \frac{2^n \Gamma^n\left[\frac{1}{p}\right]}{p^{n-1} \Gamma\left[\frac{n}{p}\right]}.$$
The factor $G_n(u_{\widehat{n}})$ is given by $\left(1 - \sum_{i=1}^{n-1} |u_i|^p\right)^{\frac{1-p}{p}}$ (see Example 2), which, after including the factor $2$, yields the uniform distribution on the $L_p$-sphere as defined in Song and Gupta [1997]:
$$p(u) = \frac{p^{n-1} \Gamma\left[\frac{n}{p}\right]}{2^{n-1} \Gamma^n\left[\frac{1}{p}\right]} \left(1 - \sum_{i=1}^{n-1} |u_i|^p\right)^{\frac{1-p}{p}}.$$

Example 5. As a second illustrative example, we consider the uniform density on the $L_p$-nested unit ball, i.e. the set $\{x \in \mathbb{R}^n \mid f(x) \leq 1\}$, and derive its radial distribution $\varrho$. The density of the uniform distribution on the unit $L_p$-nested ball does not depend on $x$ and is given by $\rho(x) = 1/V_f(1)$. Transforming the density into the polar-like coordinates with the determinant from Proposition 1 (and accounting for the sign of $x_n$) yields the joint density
$$\rho(r, u) = n r^{n-1} \cdot \frac{\prod_{L \in \mathcal{L}} G_L(u_{\widehat{L}})}{2^{n-1}} \prod_{I \in \mathcal{I}} p_I^{\ell_I - 1} \left(\prod_{k=1}^{\ell_I - 1} B\left[\frac{\sum_{i=1}^k n_{I,i}}{p_I}, \frac{n_{I,k+1}}{p_I}\right]\right)^{-1}.$$
After separating out the uniform distribution on the $L_p$-nested unit sphere, we obtain the radial distribution
$$\varrho(r) = n r^{n-1} \quad \text{for } 0 < r \leq 1,$$
which is a $\beta$-distribution with parameters $n$ and $1$.

The radial distribution from the preceding example is of great importance for our sampling scheme derived in Section 5.2. The idea behind it is the following: First, a sample from a "simple" $L_p$-nested distribution is drawn. Since the radial and the uniform component on the $L_p$-nested unit sphere are statistically independent, we can get a sample from the uniform distribution on the $L_p$-nested unit sphere by simply normalizing the sample from the simple distribution. Afterwards we can multiply it with a radius drawn from the radial distribution of the $L_p$-nested distribution that we actually want to sample from. The role of the simple distribution will be played by the uniform distribution within the $L_p$-nested unit ball. Sampling from it is basically done by applying the steps in the proof of Proposition 2 backwards. We lay out the sampling scheme in more detail in Section 5.2.
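The radial law of Example 5 is easy to verify by simulation. The following sketch is our own illustration; rejection sampling is applicable here because, for the example function used below, $f(x) \leq 1$ implies that every coordinate lies in $[-1, 1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
p0, p1 = 2.0, 0.5
def f(X):
    return (np.abs(X[:, 0])**p0
            + (np.abs(X[:, 1])**p1 + np.abs(X[:, 2])**p1)**(p0 / p1))**(1 / p0)

# rejection-sample uniform points inside the unit Lp-nested ball
X = rng.uniform(-1, 1, size=(200000, 3))
r = f(X)
r = r[r <= 1.0]
# for Beta(n=3, 1) the CDF is r^3, so the q-quantile is q^(1/3)
q = np.array([0.25, 0.5, 0.75])
print(np.quantile(r, q))   # empirical quantiles of the radius
print(q ** (1 / 3))        # should be close to the line above
```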
4. Marginals

In this section we discuss two types of marginals: First, we demonstrate that, in contrast to $L_p$-spherically symmetric distributions, marginals of $L_p$-nested distributions are not necessarily $L_p$-nested again. The second type of marginals we discuss is obtained by collapsing all leaves of a subtree into the value of the subtree's root node. For that case we derive an analytical expression and show that the values of the root node's children follow a special kind of Dirichlet distribution.

Gupta and Song [1997] show that marginals of $L_p$-spherically symmetric distributions are again $L_p$-spherically symmetric. This does not hold, however, for $L_p$-nested symmetric distributions, as can be shown by a simple counterexample. Consider the $L_p$-nested function
$$f(x) = \left(\left(|x_1|^{p_1} + |x_2|^{p_1}\right)^{\frac{p_\emptyset}{p_1}} + |x_3|^{p_\emptyset}\right)^{\frac{1}{p_\emptyset}}.$$
The uniform distribution inside the $L_p$-nested ball corresponding to $f$ is given by
$$\rho(x) = \frac{3\, p_\emptyset p_1 \Gamma\left[\frac{3}{p_\emptyset}\right] \Gamma\left[\frac{2}{p_1}\right]}{2^3\, \Gamma\left[\frac{2}{p_\emptyset}\right] \Gamma\left[\frac{1}{p_\emptyset}\right] \Gamma^2\left[\frac{1}{p_1}\right]}.$$
Integrating out $x_2$ yields the marginal
$$\rho(x_1, x_3) = \frac{3\, p_\emptyset p_1 \Gamma\left[\frac{3}{p_\emptyset}\right] \Gamma\left[\frac{2}{p_1}\right]}{2^2\, \Gamma\left[\frac{2}{p_\emptyset}\right] \Gamma\left[\frac{1}{p_\emptyset}\right] \Gamma^2\left[\frac{1}{p_1}\right]} \left(\left(1 - |x_3|^{p_\emptyset}\right)^{\frac{p_1}{p_\emptyset}} - |x_1|^{p_1}\right)^{\frac{1}{p_1}}.$$
This marginal is not $L_p$-spherically symmetric. Since any $L_p$-nested distribution in two dimensions must be $L_p$-spherically symmetric, it cannot be $L_p$-nested symmetric either. Figure 3 shows a scatter plot of the marginal distribution.

Figure 3. Marginals of $L_p$-nested symmetric distributions are not necessarily $L_p$-nested symmetric: (a) shows a scatter plot of the $(x_1, x_3)$-marginal of the counterexample in the text with $p_\emptyset = 2$; (d) displays the corresponding $L_p$-nested sphere; (b-c) show the univariate marginals for the scatter plot. Since any two-dimensional $L_p$-nested distribution must be $L_p$-spherical, the univariate marginals should be identical. This is clearly not the case. Thus, (a) is not $L_p$-nested symmetric.

Besides the fact that the marginals are not contained in the family of $L_p$-nested distributions, it is also hard to derive a general form for them. This is not surprising given that the general form of marginals for $L_p$-spherically symmetric distributions involves an integral that cannot be solved analytically in general and is therefore not very useful in practice [Gupta and Song, 1997]. For that reason we cannot expect marginals of $L_p$-nested symmetric distributions to have a simple form.

In contrast to single marginals, it is possible to specify the joint distribution of leaves and inner nodes of an $L_p$-nested tree if all descendants of the inner nodes in question have been integrated out. For the function above (the same that has been used in Example 1), the joint distribution of $x_3$ and $v_1 = \|(x_1, x_2)^\top\|_{p_1}$ would be an example of such a marginal. Since marginalization affects the $L_p$-nested tree vertically, we call this type of marginals layer marginals. In the following, we present their general form.

From the form of a general $L_p$-nested function and the corresponding symmetric distribution, one might think that the layer marginals are $L_p$-nested again. However, this is not the case, since the distribution over the $L_p$-nested unit sphere would deviate from the uniform distribution in most cases if the distribution of its children was $L_p$-spherically symmetric.

Proposition 4.
Let $f$ be an $L_p$-nested function. Suppose we integrate out complete subtrees from the tree associated with $f$, that is, we transform subtrees into radial times uniform variables and integrate out the latter. Let $\mathcal{J}$ be the set of multi-indices of those nodes that have become new
leaves, i.e. whose subtrees have been removed, and let $n_J$ be the number of leaves (in the original tree) in the subtree under the node $J$. Let $x_{\widehat{\mathcal{J}}} \in \mathbb{R}^m$ denote those coefficients of $x$ that are still part of the smaller tree, and let $v_{\mathcal{J}}$ denote the vector of inner nodes that became new leaves. The joint distribution of $x_{\widehat{\mathcal{J}}}$ and $v_{\mathcal{J}}$ is given by
$$\rho(x_{\widehat{\mathcal{J}}}, v_{\mathcal{J}}) = \frac{\varrho\left(f(x_{\widehat{\mathcal{J}}}, v_{\mathcal{J}})\right)}{S_f\left(f(x_{\widehat{\mathcal{J}}}, v_{\mathcal{J}})\right)} \prod_{J \in \mathcal{J}} v_J^{n_J - 1}. \tag{11}$$
Proof. The proof can be found in Appendix C. □

Equation (11) has an interesting special case when considering the joint distribution of the root node's children.

Corollary 2. The children of the root node $v_{1:\ell_\emptyset} = (v_1, \dots, v_{\ell_\emptyset})^\top$ follow the distribution
$$\rho(v_{1:\ell_\emptyset}) = \frac{p_\emptyset^{\ell_\emptyset - 1}\, \Gamma\left[\frac{n}{p_\emptyset}\right]}{2^m f(v_1, \dots, v_{\ell_\emptyset})^{n-1} \prod_{k=1}^{\ell_\emptyset} \Gamma\left[\frac{n_k}{p_\emptyset}\right]}\, \varrho\left(f(v_1, \dots, v_{\ell_\emptyset})\right) \prod_{i=1}^{\ell_\emptyset} v_i^{n_i - 1},$$
where $m \leq \ell_\emptyset$ is the number of leaves directly attached to the root node. In particular, $v_{1:\ell_\emptyset}$ can be written as the product $RU$, where $R$ is the $L_p$-nested radius and the single $|U_i|^{p_\emptyset}$ are Dirichlet distributed, i.e.
$$\left(|U_1|^{p_\emptyset}, \dots, |U_{\ell_\emptyset}|^{p_\emptyset}\right) \sim \operatorname{Dir}\left[\frac{n_1}{p_\emptyset}, \dots, \frac{n_{\ell_\emptyset}}{p_\emptyset}\right].$$

Proof. The joint distribution is simply an application of Proposition 4. Note that $f(v_1, \dots, v_{\ell_\emptyset}) = \|v_{1:\ell_\emptyset}\|_{p_\emptyset}$. Applying the pointwise transformation $s_i = |u_i|^{p_\emptyset}$ yields
$$\left(|U_1|^{p_\emptyset}, \dots, |U_{\ell_\emptyset - 1}|^{p_\emptyset}\right) \sim \operatorname{Dir}\left[\frac{n_1}{p_\emptyset}, \dots, \frac{n_{\ell_\emptyset}}{p_\emptyset}\right]. \qquad \square$$

The Corollary shows that the values $f_I(x_I)$ at inner nodes $I$, in particular the ones directly below the root node, deviate considerably from $L_p$-spherical symmetry. If they were $L_p$-spherically symmetric, the $|U_i|^{p_\emptyset}$ would follow a Dirichlet distribution with parameters $\alpha_i = 1/p_\emptyset$, as has already been shown by Song and Gupta [1997]. The Corollary is a generalization of their result.
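The Dirichlet property can again be checked by simulation. The following sketch is our own, reusing the rejection-sampler idea from the previous sketch for the tree of Example 1; there $n_1 = 2$ and $n_2 = 1$, so the Corollary predicts $|U_1|^{p_\emptyset} \sim \operatorname{Beta}[2/p_\emptyset,\, 1/p_\emptyset]$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p0, p1 = 2.0, 0.5

X = rng.uniform(-1, 1, size=(400000, 3))
v1 = (np.abs(X[:, 0])**p1 + np.abs(X[:, 1])**p1)**(1 / p1)   # inner node value
r = (v1**p0 + np.abs(X[:, 2])**p0)**(1 / p0)                  # root node value
v1, r = v1[r <= 1], r[r <= 1]                                 # keep ball samples

u1_p0 = (v1 / r)**p0
# compare against Beta(n_1/p_0, n_2/p_0) = Beta(1, 0.5); statistic should be small
print(stats.kstest(u1_p0, stats.beta(2 / p0, 1 / p0).cdf).statistic)
```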
We can use the Corollary to prove an interesting fact about $L_p$-nested symmetric distributions: the only factorial $L_p$-nested symmetric distribution must be $L_p$-spherically symmetric.

Proposition 5. Let $x$ be $L_p$-nested symmetric distributed with independent marginals. Then $x$ is $L_p$-spherically symmetric distributed. In particular, $x$ follows a $p$-generalized Normal distribution.

Proof. The proof can be found in Appendix D. □
One immediate implication of Proposition 5 is that there is no factorial probability model corresponding to mixed norm regularizers, which are of the form $\sum_{k=1}^K \|x_{I_k}\|_p^q$ where the index sets $I_k$ form a partition of the dimensions $1, \dots, n$ [see e.g. Kowalski et al., 2008; Yuan and Lin, 2006; Zhao et al., 2008]. Many machine learning algorithms are equivalent to minimizing the sum of a regularizer $R(w)$ and a loss function $L(w, x_1, \dots, x_m)$ over the coefficient vector $w$. If $\exp(-R(w))$ and $\exp(-L(w, x_1, \dots, x_m))$ correspond to normalizable density models, the minimizing solution of the objective function can be seen as the Maximum A Posteriori (MAP) estimate of the posterior $p(w \mid x_1, \dots, x_m) \propto p(w) \cdot p(x_1, \dots, x_m \mid w) = \exp(-R(w)) \cdot \exp(-L(w, x_1, \dots, x_m))$. In that sense, the regularizer naturally corresponds to the prior and the loss function corresponds to the likelihood. Very often, regularizers are specified as a norm over the coefficient vector $w$, which in turn corresponds to a certain prior. For example, in Ridge regression [Hoerl, 1962] the coefficients are regularized via $\|w\|_2^2$, which corresponds to a factorial zero-mean Gaussian prior on $w$. The $L_1$-norm $\|w\|_1$ in the LASSO estimator [Tibshirani, 1996], again, is equivalent to a factorial Laplacian prior on $w$. Like in these two examples, regularizers often correspond to a factorial prior.

Mixed norm regularizers naturally correspond to $L_p$-nested distributions. Proposition 5 shows that there is no factorial prior that corresponds to such a regularizer. In particular, it implies that the prior cannot be factorial between groups and coefficients at the same time. This means that those regularizers implicitly assume statistical dependencies between the coefficient variables. Interestingly, for $q = 1$ and $p = 2$ the intuition behind these regularizers is exactly that whole groups $I_k$ get switched on at once, but the groups are sparse. The Proposition shows that this might not only be due to sparseness but also due to statistical dependencies between the coefficients within one group. The $L_p$-nested distribution which implements independence between groups will be further discussed below as a generalization of the $p$-generalized Normal (see Section 7). Note that the marginals can be independent if the regularizer is of the form $\sum_{k=1}^K \|x_{I_k}\|_p^p$. However, in this case $p = q$ and the $L_p$-nested function collapses to a simple $L_p$-norm, which means that the regularizer is not a mixed norm.

5. Estimation of and Sampling from $L_p$-nested Symmetric Distributions

5.1. Maximum Likelihood Estimation.
In this section, we describe procedures for maximum likelihood fitting of $L_p$-nested symmetric distributions on data. We provide a toolbox online for fitting $L_p$-spherically symmetric and $L_p$-nested symmetric distributions to data. Depending on which parameters are to be estimated, the complexity of fitting an $L_p$-nested symmetric distribution varies. We start with the simplest case and later continue with more complex ones. Throughout this subsection, we assume that the model has the form $p(x) = \rho(Wx) \cdot |\det W| = \frac{\varrho(f(Wx))}{f(Wx)^{n-1} S_f(1)} \cdot |\det W|$, where $W \in \mathbb{R}^{n \times n}$ is a complete whitening matrix. This means that, given any whitening matrix $W_0$, the remaining freedom in fitting $W$ is to estimate an orthonormal matrix $Q \in SO(n)$ such that $W = QW_0$. This is analogous to the case of elliptically contoured distributions, where the distributions can be endowed with 2nd-order correlations via $W$. In the following, we ignore the determinant of $W$ since the data points can always be rescaled such that $\det W = 1$.

The simplest case is to fit the parameters of the radial distribution when the tree structure, the values of the $p_I$ and $W$ are fixed. Due to the special form of $L_p$-nested symmetric distributions (5), it then suffices to carry out maximum likelihood estimation on the radial component only, which renders maximum likelihood estimation efficient and robust. This is because the only remaining parameters are the parameters $\vartheta$ of the radial distribution and, therefore,
$$\operatorname{argmax}_\vartheta \log \rho(Wx \mid \vartheta) = \operatorname{argmax}_\vartheta \left(-\log S_f(f(Wx)) + \log \varrho(f(Wx) \mid \vartheta)\right) = \operatorname{argmax}_\vartheta \log \varrho(f(Wx) \mid \vartheta).$$

In a slightly more complex case, when only the tree structure and $W$ are fixed, the values of the $p_I$, $I \in \mathcal{I}$, and $\vartheta$ can be jointly estimated via gradient ascent on the log-likelihood. The gradient for a single data point $x$ with respect to the vector $p$ that holds all $p_I$ for all $I \in \mathcal{I}$ is given by
$$\nabla_p \log \rho(Wx) = \frac{d}{dr} \log \varrho(f(Wx)) \cdot \nabla_p f(Wx) - \frac{n-1}{f(Wx)} \nabla_p f(Wx) - \nabla_p \log S_f(1).$$
For i.i.d. data points $x_i$ the joint gradient is given by the sum over the gradients for the single data points. Each of them involves the gradient of $f$ as well as the gradient of the log-surface area of $L_f$ with respect to $p$, which can be computed via the recursive equations
$$\frac{\partial}{\partial p_J} v_I = \begin{cases} 0 & \text{if } I \text{ is not a prefix of } J, \\ v_I^{1-p_I}\, v_{I,k}^{p_I - 1} \cdot \frac{\partial}{\partial p_J} v_{I,k} & \text{if } I \text{ is a proper prefix of } J \text{ and } I,k \text{ lies on the path to } J, \\ \frac{v_J}{p_J}\left(v_J^{-p_J} \sum_{k=1}^{\ell_J} v_{J,k}^{p_J} \cdot \log v_{J,k} - \log v_J\right) & \text{if } J = I, \end{cases}$$
and
$$\frac{\partial}{\partial p_J} \log S_f(1) = -\frac{\ell_J - 1}{p_J} + \sum_{k=1}^{\ell_J - 1} \Psi\left[\frac{\sum_{i=1}^{k+1} n_{J,i}}{p_J}\right] \frac{\sum_{i=1}^{k+1} n_{J,i}}{p_J^2} - \sum_{k=1}^{\ell_J - 1} \Psi\left[\frac{\sum_{i=1}^{k} n_{J,i}}{p_J}\right] \frac{\sum_{i=1}^{k} n_{J,i}}{p_J^2} - \sum_{k=1}^{\ell_J - 1} \Psi\left[\frac{n_{J,k+1}}{p_J}\right] \frac{n_{J,k+1}}{p_J^2},$$
where $\Psi[t] = \frac{d}{dt} \log \Gamma[t]$ denotes the digamma function. When performing the gradient ascent, one needs to enforce a positive lower bound on the $p_I$. Note that, in general, this optimization might be a highly non-convex problem.

On the next level of complexity, only the tree structure is fixed, and $W$ can be estimated along with the other parameters by joint optimization of the log-likelihood with respect to $p$, $\vartheta$ and $W$. Certainly, this optimization problem is also not convex in general.
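All of these variants optimize the same objective. The sketch below (our own illustration; it reuses the hypothetical lp_nested and log_surface helpers from the earlier sketches and assumes $\det W = 1$) shows how the log-likelihood of equation (5) is evaluated:

```python
import numpy as np

def neg_log_likelihood(tree, W, X, log_radial):
    """-log-likelihood of an Lp-nested symmetric model, equation (5).

    X: (m, n) array of data points; log_radial: r -> log varrho(r).
    """
    n = X.shape[1]
    r = np.array([lp_nested(tree, W @ x) for x in X])
    # log rho(x) = log varrho(r) - (n - 1) log r - log S_f(1)
    return -(np.sum(log_radial(r) - (n - 1) * np.log(r))
             - len(X) * log_surface(tree))

# example radial component: a log-Normal density with parameters mu, sigma
def lognormal_logpdf(r, mu=0.0, sigma=1.0):
    return (-np.log(r * sigma * np.sqrt(2 * np.pi))
            - (np.log(r) - mu)**2 / (2 * sigma**2))
```

In the simplest case above, fitting only $\vartheta$ amounts to a standard univariate fit of the radial density to the radii $r_i = f(Wx_i)$, exactly as in the argmax identity.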
Usually, it is numerically more robust to whiten the data first with some whitening matrix $W_0$ and to perform a gradient ascent on the special orthogonal group $SO(n)$ with respect to $Q$ for optimizing $W = QW_0$. Given the gradient $\nabla_W \log \rho(Wx)$ of the log-likelihood, the optimization can be carried out by performing line searches along geodesics as proposed by Edelman et al. [1999] (see also Absil et al. [2007]), or by projecting $\nabla_W \log \rho(Wx)$ onto the tangent space $T_W SO(n)$ and performing a line search along $SO(n)$ in that direction as proposed by Manton [2002].

The general form of the gradient to be used in such an optimization scheme is given by
$$\nabla_W \log \rho(Wx) = \nabla_W \left(-(n-1) \cdot \log f(Wx) + \log \varrho(f(Wx))\right) = -\frac{n-1}{f(Wx)} \cdot \nabla_y f(Wx) \cdot x^\top + \frac{d \log \varrho(r)}{dr}(f(Wx)) \cdot \nabla_y f(Wx) \cdot x^\top,$$
where the derivatives of $f$ with respect to $y = Wx$ are defined by the recursive equations
$$\frac{\partial}{\partial y_i} v_I = \begin{cases} 0 & \text{if } i \notin I, \\ \operatorname{sgn} y_i & \text{if } v_I = |y_i|, \\ v_I^{1-p_I} \cdot v_{I,k}^{p_I - 1} \cdot \frac{\partial}{\partial y_i} v_{I,k} & \text{for } i \in I,k. \end{cases}$$
Note that $f$ might not be differentiable at $y = 0$. However, we can always define a sub-derivative at zero, which is zero for $p_I \neq 1$ and $[-1, 1]$ for $p_I = 1$. Again, the gradient for i.i.d. data points $x_i$ is given by the sum over the single gradients.

Finally, the question arises whether it is possible to estimate the tree structure from data as well. So far, we were not able to come up with an efficient algorithm to do so. A simple heuristic would be to start with a very large tree, e.g. a full binary tree, and to prune out inner nodes for which the parents and the children have sufficiently similar values for their $p_I$. The intuition behind this is that if they were exactly equal, they would cancel in the $L_p$-nested function. This heuristic is certainly sub-optimal. Firstly, the optimization will be time consuming since there can be about as many $p_I$ as there are leaves in the $L_p$-nested tree (a full binary tree over $n$ dimensions has $n - 1$ inner nodes). Secondly, the heuristic cannot generate all possible trees with $n$ leaves: for example, if two leaves are separated by the root node in the original full binary tree, there is no way to prune out inner nodes such that the path between those two leaves will not contain the root node anymore.

The computational complexity of the estimation of all other parameters besides the tree structure is difficult to assess in general, because it depends, for example, on the particular radial distribution used. While the maximum likelihood estimation of a simple log-Normal distribution only involves the computation of a mean and a variance, which are in $O(m)$ for $m$ data points, a mixture of log-Normal distributions already requires an EM algorithm which is computationally more expensive. Additionally, the time it takes to optimize the likelihood depends on the starting point as well as the convergence rate, and we neither have results about the convergence rate nor is it possible to make problem-independent statements about a good initialization of the parameters. For this reason we only state the computational complexity of single steps involved in the optimization.

Computation of the gradient $\nabla_p \log \rho(Wx)$ involves the derivative of the radial distribution and the computation of the gradients $\nabla_p f(Wx)$ and $\nabla_p \log S_f(1)$. Assuming that the derivative of the radial distribution can be computed in $O(1)$ for each single data point, the costly steps are the other two gradients. Computing $\nabla_p f(Wx)$ basically involves visiting each node of the tree once and performing a constant number of operations for the local derivatives. Since every inner node in an $L_p$-nested tree must have at least two children, the worst case is a full binary tree, which has $2n - 1$ nodes in total. Therefore, this gradient can be computed in $O(nm)$ for $m$ data points. For similar reasons, $f(Wx)$, $\nabla_p \log S_f(1)$ and the evaluation of the likelihood can also be computed in $O(nm)$. This means that each step in the optimization of $p$ can be done in $O(nm)$, plus the computational costs for the line search in the gradient ascent. When optimizing for $W = QW_0$ as well, the computational costs per step increase to $O(n^3 + n^2 m)$, since the $m$ data points have to be multiplied with $W$ at each iteration (requiring $O(n^2 m)$ steps) and the line search involves projecting $Q$ back onto $SO(n)$, which requires an inverse matrix square root or a similar computation in $O(n^3)$.

For comparison, each step of fast ICA [Hyvärinen and Oja, 1997] for a complete demixing matrix takes $O(n^2 m)$ when using hierarchical orthogonalization and $O(n^2 m + n^3)$ for symmetric orthogonalization. The same applies to fitting an ISA model [Hyvärinen and Hoyer, 2000; Hyvärinen and Köster, 2006, 2007].
A Gaussian Scale Mixture (GSM) model does not need to estimate another orthogonal rotation $Q$ because it belongs to the class of spherically symmetric distributions and is, therefore, invariant under transformations from $SO(n)$ [Wainwright and Simoncelli, 2000]. Therefore, fitting a GSM corresponds to estimating the parameters of the scale distribution, which is in $O(nm)$ in the best case but might be costlier depending on the choice of the scale distribution.
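To make the recursions of this subsection concrete, here is a small sketch (our own) of the recursion for $\nabla_y f$ used in the $W$-gradient above; it reuses the hypothetical nested-tuple tree encoding from Section 2 and assumes all $x_i \neq 0$:

```python
import numpy as np

def lp_nested_grad(tree, x):
    """Gradient of an Lp-nested function f at x via the dv_I/dy_i recursion."""
    grad = np.zeros_like(x, dtype=float)
    def walk(node):
        # returns (v_I, [(leaf index i, dv_I/dy_i), ...]) for the subtree
        if isinstance(node, int):
            return abs(x[node]), [(node, np.sign(x[node]))]
        p, children = node
        vals = [walk(c) for c in children]
        v_I = sum(v**p for v, _ in vals)**(1.0 / p)
        # dv_I/dy_i = v_I^{1-p} * v_{I,k}^{p-1} * dv_{I,k}/dy_i
        derivs = [(i, v_I**(1 - p) * v**(p - 1) * d)
                  for v, ds in vals for i, d in ds]
        return v_I, derivs
    _, derivs = walk(tree)
    for i, d in derivs:
        grad[i] = d
    return grad
```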
5.2. Sampling.

In this section, we derive a sampling scheme for arbitrary $L_p$-nested symmetric distributions, which can for example be used for solving integrals when using $L_p$-nested symmetric distributions for Bayesian learning. Exact sampling from an arbitrary $L_p$-nested symmetric distribution is in fact straightforward due to the following observation: Since the radial and the uniform component are independent, normalizing a sample from any $L_p$-nested distribution to $f$-length one yields samples from the uniform distribution on the $L_p$-nested unit sphere. By multiplying those uniform samples with new samples from another radial distribution, one obtains samples from another $L_p$-nested distribution. Therefore, for each $L_p$-nested function $f$, a single $L_p$-nested distribution which can be easily sampled from is enough. Sampling from all other $L_p$-nested distributions with respect to $f$ is then straightforward due to the method we just described.

Gupta and Song [1997] sample from the $p$-generalized Normal distribution since it has independent marginals, which makes sampling straightforward. Due to Proposition 5, no such factorial $L_p$-nested distribution exists. Therefore, a sampling scheme like the one for $L_p$-spherically symmetric distributions is not applicable. Instead we choose to sample from the uniform distribution inside the $L_p$-nested unit ball, for which we already computed the radial distribution in Example 5. The distribution has the form $\rho(x) = \frac{1}{V_f(1)}$. In order to sample from that distribution, we will first only consider the uniform distribution in the positive quadrant of the unit $L_p$-nested ball, which has the form $\rho(x) = \frac{2^n}{V_f(1)}$. Samples from the uniform distribution inside the whole ball can be obtained by multiplying each coordinate of a sample with independent samples from the uniform distribution over $\{-1, 1\}$.

The idea of the sampling scheme for the uniform distribution inside the $L_p$-nested unit ball is based on the computation of the volume of the $L_p$-nested unit ball in Proposition 2. The basic mechanism underlying the sampling scheme below is to apply the steps of the proof backwards, which is based on the following idea: The volume of the $L_p$-nested unit ball can be computed by computing its volume on the positive quadrant only and multiplying the result with $2^n$ afterwards. The key is now to not transform the whole integral into radial and uniform coordinates at once, but successively upwards in the tree. We will demonstrate this through a little example, which also should make the sampling scheme below more intuitive. Consider the $L_p$-nested function
$$f(x) = \left(|x_1|^{p_\emptyset} + \left(|x_2|^{p_1} + |x_3|^{p_1}\right)^{\frac{p_\emptyset}{p_1}}\right)^{\frac{1}{p_\emptyset}}.$$
In order to solve the integral
$$\int_{\{x : f(x) \leq 1,\, x \in \mathbb{R}^n_+\}} dx,$$
we first transform $x_2$ and $x_3$ into radial and uniform coordinates only. According to Proposition 1, the determinant of the mapping $(x_2, x_3) \mapsto (v_2, \tilde{u}) = \left(\|(x_2, x_3)\|_{p_1},\, x_2 / \|(x_2, x_3)\|_{p_1}\right)$ is given by $v_2 (1 - \tilde{u}^{p_1})^{\frac{1-p_1}{p_1}}$. Therefore the integral transforms into
$$\int_{\{x : f(x) \leq 1,\, x \in \mathbb{R}^n_+\}} dx = \int_{\{v_2, x_1 : f(x_1, v_2) \leq 1,\; x_1, v_2 \in \mathbb{R}_+\}} \int v_2 \left(1 - \tilde{u}^{p_1}\right)^{\frac{1-p_1}{p_1}} dx_1\, dv_2\, d\tilde{u}.$$
Now we can separate the integrals over $x_1$ and $v_2$ from the integral over $\tilde{u}$, since the boundary of the outer integral only depends on $v_2$ and not on $\tilde{u}$:
$$\int_{\{x : f(x) \leq 1,\, x \in \mathbb{R}^n_+\}} dx = \int_0^1 \left(1 - \tilde{u}^{p_1}\right)^{\frac{1-p_1}{p_1}} d\tilde{u} \cdot \int_{\{v_2, x_1 : f(x_1, v_2) \leq 1,\; x_1, v_2 \in \mathbb{R}_+\}} v_2\, dx_1\, dv_2.$$
The value of the first integral is known explicitly, since the integrand equals the unnormalized uniform distribution on the positive quadrant of the $\|\cdot\|_{p_1}$-unit sphere. Therefore the value of the integral must be its normalization constant, which we can get using Proposition 2:
$$\int_0^1 \left(1 - \tilde{u}^{p_1}\right)^{\frac{1-p_1}{p_1}} d\tilde{u} = \frac{\Gamma^2\left[\frac{1}{p_1}\right]}{p_1 \Gamma\left[\frac{2}{p_1}\right]}.$$
An alternative way to arrive at this result is to use the transformation $s = \tilde{u}^{p_1}$ and to notice that the integrand is a Dirichlet distribution with parameters $\alpha_i = 1/p_1$. The normalization constant of the Dirichlet distribution and the constants from the determinant of the Jacobian of the transformation yield the same result.

In order to compute the remaining integral, the same method can be applied again, yielding the volume of the $L_p$-nested unit ball. The important part for the sampling scheme, however, is not the volume itself, but the fact that the intermediate results in this integration process equal certain distributions. As shown in Example 5, the radial distribution of the uniform distribution on the unit ball is $\beta[n, 1]$; the same structure holds at every level of general $L_p$-nested unit balls, although the parameters of the Dirichlet distributions can be different. Reversing the steps leads us to the following sampling scheme: First, we sample from the $\beta$-distribution, which gives us the radius $v_\emptyset$ at the root node. Then we sample from the appropriate Dirichlet distribution and exponentiate the samples by $1/p_\emptyset$, which transforms them into the analogs of the variable $\tilde{u}$ from above. Scaling the result with the sample $v_\emptyset$ yields the values of the root node's children, i.e. the analogs of $x_1$ and $v_2$. Those are the new radii for the levels below them, where we simply repeat this procedure with the appropriate Dirichlet distributions and exponents. The single steps are summarized in Algorithm 5.1.

Algorithm 5.1. Exact sampling algorithm for $L_p$-nested distributions.
Input: The radial distribution $\varrho$ of an $L_p$-nested distribution $\rho$ for the $L_p$-nested function $f$.
Output: Sample $x$ from $\rho$.
(1) Sample $v_\emptyset$ from a beta distribution $\beta[n, 1]$.
(2) For each inner node $I$ of the tree associated with $f$, sample the auxiliary variable $s_I$ from a Dirichlet distribution $\operatorname{Dir}\left[\frac{n_{I,1}}{p_I}, \dots, \frac{n_{I,\ell_I}}{p_I}\right]$, where $n_{I,k}$ is the number of leaves in the subtree under node $I,k$. Obtain coordinates on the $L_p$-nested sphere within the positive orthant by $s_I \mapsto s_I^{1/p_I} = \tilde{u}_I$ (the exponentiation is taken component-wise).
(3) Transform these samples to Cartesian coordinates by $v_I \cdot \tilde{u}_I = v_{I,1:\ell_I}$ for each inner node, starting from the root node and descending to lower layers. The components of $v_{I,1:\ell_I}$ constitute the radii for the layer directly below them. If $I = \emptyset$, the radius has been sampled in step (1).
(4) Once the two previous steps have been repeated until no inner node is left, we have a sample $x$ from the uniform distribution in the positive quadrant. Normalize $x$ to get a uniform sample from the sphere: $u = \frac{x}{f(x)}$.
(5) Sample a new radius $\tilde{v}_\emptyset$ from the target radial distribution $\varrho$ and obtain the sample via $\tilde{x} = \tilde{v}_\emptyset \cdot u$.
(6) Multiply each entry $x_i$ of $\tilde{x}$ by an independent sample $z_i$ from the uniform distribution over $\{-1, 1\}$.

The computational complexity of the sampling scheme is $O(n)$. Since the sampling procedure is like expanding the tree node by node starting with the root, the number of inner nodes and leaves bounds the total number of samples that have to be drawn from Dirichlet distributions. Every inner node in an $L_p$-nested tree must have at least two children, so the maximal number of inner nodes and leaves is $2n - 1$. Since each Dirichlet sample can be drawn in time proportional to its dimensionality, and these dimensionalities sum to the number of edges of the tree, the total computational complexity for one sample is in $O(n)$.
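A direct transcription of Algorithm 5.1 reads as follows. This is our own sketch, again using the hypothetical nested-tuple tree encoding and the lp_nested helper from the earlier sketches; the radial sampler passed in the example is an arbitrary placeholder:

```python
import numpy as np

def sample_lp_nested(tree, n_samples, radial_sampler, seed=0):
    """Exact sampling from an Lp-nested symmetric distribution (Algorithm 5.1)."""
    rng = np.random.default_rng(seed)
    def leaves(node):
        return [node] if isinstance(node, int) else \
               [l for c in node[1] for l in leaves(c)]
    n = len(leaves(tree))
    X = np.empty((n_samples, n))
    for s in range(n_samples):
        x = np.empty(n)
        def descend(node, radius):          # steps (2)-(3), recursively
            if isinstance(node, int):
                x[node] = radius
                return
            p, children = node
            ns = np.array([len(leaves(c)) for c in children], float)
            s_I = rng.dirichlet(ns / p)     # step (2): Dirichlet auxiliary
            for child, u in zip(children, s_I ** (1.0 / p)):
                descend(child, radius * u)  # step (3): scale into Cartesian
        descend(tree, rng.beta(n, 1.0))     # step (1): radius ~ beta(n, 1)
        x *= radial_sampler(rng) / lp_nested(tree, x)   # steps (4)-(5)
        X[s] = x * rng.choice([-1.0, 1.0], size=n)      # step (6): signs
    return X

# example: tree of equation (1) with a placeholder radial sampler
samples = sample_lp_nested((2.0, [0, (0.5, [1, 2])]), 5,
                           lambda rng: abs(rng.normal()) + 1e-12)
print(samples)
```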
6. Robust Bayesian Inference of the Location

For $L_p$-spherically symmetric distributions with a location and a scale parameter, $p(x|\mu,\tau) = \tau^n \rho(\|\tau(x-\mu)\|_p)$, Osiewalski and Steel [1993] derived the posterior in closed form using a prior $p(\mu,\tau) = p(\mu)\cdot c\,\tau^{-1}$, and showed that $p(x,\mu)$ does not depend on the radial distribution $\varrho$, i.e. on the particular type of $L_p$-spherically symmetric distribution used for a fixed $p$. The prior on $\tau$ corresponds to an improper Jeffreys prior, which is used to represent lack of prior knowledge about the scale. The main implication of their result is that Bayesian inference of the location $\mu$ under that prior on the scale does not depend on the particular type of $L_p$-spherically symmetric distribution used for inference. This means that, under the assumption of an $L_p$-spherically symmetric distributed variable with fixed $p$, one does not have to know the exact form of the distribution in order to compute the location parameter.

It is straightforward to generalize their result to $L_p$-nested symmetric distributions and, hence, to make it applicable to a larger class of distributions. Note that when using any $L_p$-nested symmetric distribution, introducing a scale and a location via the transformation $x \mapsto \tau(x-\mu)$ introduces a factor of $\tau^n$ in front of the distribution.

Proposition 6.
For fixed values $p_\emptyset, p_1, \ldots$ and two independent priors $p(\mu,\tau) = p(\mu)\cdot c\,\tau^{-1}$ of the location $\mu$ and the scale $\tau$, where the prior on $\tau$ is an improper Jeffreys prior, the joint distribution $p(x,\mu)$ is given by

\[ p(x,\mu) = f(x-\mu)^{-n}\cdot \frac{c\, p(\mu)}{Z}, \]

where $Z$ denotes the normalization constant of the $L_p$-nested uniform distribution.

Proof. Given any $L_p$-nested symmetric distribution $\rho(f(x))$, the transformation into the polar-like coordinates yields the following relation:

\[ 1 = \int \rho(f(x))\, dx = \int\!\!\int \prod_{L\in\mathcal L} G_L(\hat u_L)\, r^{n-1}\rho(r)\, dr\, du = \int \prod_{L\in\mathcal L} G_L(\hat u_L)\, du \cdot \int_0^\infty r^{n-1}\rho(r)\, dr. \]

Since $\prod_{L\in\mathcal L} G_L(\hat u_L)$ is the unnormalized uniform distribution on the $L_p$-nested unit sphere, the integral must equal the normalization constant, which we denote by $Z$ for brevity (see Proposition 3 for an explicit expression). This implies that $\rho$ has to fulfill

\[ \frac{1}{Z} = \int_0^\infty r^{n-1}\rho(r)\, dr. \]

Writing down the joint distribution of $x$, $\mu$ and $\tau$, and using the substitution $s = \tau f(x-\mu)$, we obtain

\[ p(x,\mu) = \int_0^\infty \tau^n \rho(f(\tau(x-\mu)))\cdot c\,\tau^{-1}\cdot p(\mu)\, d\tau = \int_0^\infty s^{n-1}\rho(s)\cdot c\cdot p(\mu)\, f(x-\mu)^{-n}\, ds = f(x-\mu)^{-n}\cdot\frac{c\, p(\mu)}{Z}. \qquad\Box \]

Note that this result could easily be extended to $\nu$-spherical distributions. However, in that case the normalization constant $Z$ cannot be computed for most cases and, therefore, the posterior would not be known explicitly.
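Proposition 6 is easy to verify numerically: integrating out $\tau$ under the Jeffreys prior gives the same joint density, namely $f(x-\mu)^{-n}$ up to constants, no matter which radial density is plugged in. In the sketch below, the $L_p$-nested function and the two radial families are arbitrary choices made only for the sake of the check.

```python
import numpy as np
from scipy import stats, integrate

n = 3

def f(x):
    """Example L_p-nested function (p_root = 2, inner p = 1.5)."""
    inner = (abs(x[1]) ** 1.5 + abs(x[2]) ** 1.5) ** (2 / 1.5)
    return (abs(x[0]) ** 2 + inner) ** 0.5

def joint(x, mu, radial_pdf):
    """p(x, mu) ~ int_0^inf tau^n rho(f(tau*(x-mu))) tau^(-1) dtau, where
    rho(r) is proportional to radial_pdf(r) / r^(n-1); constants are dropped."""
    r = f(x - mu)
    # tau^(n-1) / (tau*r)^(n-1) simplifies to r^(1-n)
    integrand = lambda tau: radial_pdf(tau * r) / r ** (n - 1)
    return integrate.quad(integrand, 0, np.inf)[0]

x, mu = np.array([1.0, -0.5, 2.0]), np.zeros(n)
for radial_pdf in (stats.gamma(a=2.0).pdf, stats.lognorm(s=0.7).pdf):
    print(joint(x, mu, radial_pdf))   # identical for both radial choices
print(f(x - mu) ** (-n))              # = f(x - mu)^(-n), as in Proposition 6
```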
7. Relations to ICA, ISA and Over-Complete Linear Models

In this section, we explain the relations among $L_p$-spherically symmetric, $L_p$-nested symmetric, ICA and ISA models. For a general overview see Figure 4.

The density model underlying ICA describes the joint distribution of the signal $x$ as a linear superposition of statistically independent hidden sources, $Ay = x$ or $y = Wx$. If the marginals of the hidden sources belong to the exponential power family, we obtain the $p$-generalized Normal, which is a subset of the $L_p$-spherically symmetric class. The $p$-generalized Normal distribution $p(y) \propto \exp(-\tau\|y\|_p^p)$ is a density model that is often used in ICA algorithms for kurtotic natural signals like images and sound, by optimizing a demixing matrix $W$ w.r.t. the model $p(y) \propto \exp(-\tau\|Wx\|_p^p)$ [Lee and Lewicki, 2000; Lewicki, 2002; Zhang et al., 2004]. It can be shown that the $p$-generalized Normal is the only factorial model in the class of $L_p$-spherically symmetric models [Sinz et al., 2009a] and, by Proposition 5, also the only factorial $L_p$-nested symmetric distribution.
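Because the $p$-generalized Normal is factorial, it can be sampled coordinate-wise. A minimal sketch, using the standard fact that $|y_i|^p$ is Gamma-distributed under an exponential power marginal; the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pgen_normal(n, p, tau, size):
    """Factorial p-generalized Normal, p(y) ~ exp(-tau * ||y||_p^p): if y_i has
    marginal ~ exp(-tau |y_i|^p), then |y_i|^p is Gamma(1/p, rate=tau)."""
    g = rng.gamma(shape=1.0 / p, scale=1.0 / tau, size=(size, n))
    return g ** (1.0 / p) * rng.choice([-1.0, 1.0], size=(size, n))

Y = sample_pgen_normal(n=5, p=1.3, tau=1.0, size=100000)
# consistency check: ||y||_p^p is Gamma(n/p, rate=tau) with mean n/(p*tau)
print(np.sum(np.abs(Y) ** 1.3, axis=1).mean(), 5 / 1.3)
```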
Figure 4. Relations between the different classes of distributions: For arbitrary distributions on the subspaces, ISA (blue) is a superclass of ICA (green). Obviously, $L_p$-nested symmetric distributions (red) are a superclass of $L_p$-spherically symmetric distributions (yellow). $L_p$-nested ISA models live in the intersection of $L_p$-nested symmetric distributions and ISA models (intersection of red and blue). Those $L_p$-nested ISA models that are $L_p$-spherically symmetric are also ICA models (intersection of green and yellow); this is the class of $p$-generalized Normal distributions. If $p$ is fixed to two, one obtains the spherically symmetric distributions (pink). The only class of distributions in the intersection between spherically symmetric distributions and ICA models is the Gaussian (intersection of green, yellow and pink).

An important generalization of ICA is the Independent Subspace Analysis (ISA) proposed by Hyvärinen and Hoyer [2000] and by Hyvärinen and Köster [2007], who used $L_p$-spherically symmetric distributions to model the single subspaces. Like ICA, ISA models the hidden sources of the signal as a product of multivariate distributions:

\[ p(y) = \prod_{k=1}^K p(y_{I_k}). \]

Here, $y = Wx$, and the $I_k$ are index sets selecting the different subspaces from the responses of $W$ to $x$. The collection of index sets $I_k$ forms a partition of $1, \ldots, n$. ICA is the special case of ISA in which $I_k = \{k\}$, such that all subspaces are one-dimensional. For the ISA models used by Hyvärinen et al., the distribution on the subspaces was chosen to be either spherically or $L_p$-spherically symmetric.

ICA and ISA have been used to infer features from natural signals, in particular from natural images. However, as mentioned by several authors [Simoncelli, 1997; Wainwright and Simoncelli, 2000; Zetzsche et al., 1993] and demonstrated quantitatively by Bethge [2006] and Eichhorn et al. [2008], the assumptions underlying linear ICA are not well matched by the statistics of natural images. Although the marginals can be well described by an exponential power family, the joint distribution cannot be factorized with linear filters $W$.

A reliable parametric way to assess how well the independence assumption is met by a signal at hand is to fit a more general class of distributions that contains factorial as well as non-factorial distributions, both of which can equally well reproduce the marginals. By comparing the likelihood on held-out test data between the best-fitting non-factorial and the best-fitting factorial case, one can assess how well the sources can be described by a factorial distribution. For natural images, for example, one can use an arbitrary $L_p$-spherically symmetric distribution $\rho(\|Wx\|_p)$, fit it to the whitened data, and compare its likelihood on held-out test data to that of the $p$-generalized Normal [Sinz and Bethge, 2009]. Since any choice of radial distribution $\varrho$ determines a particular $L_p$-spherically symmetric distribution, the idea is to explore the space between factorial and non-factorial models by using a very flexible density $\varrho$ on the radius. Note that having an explicit expression for the normalization constant allows for particularly reliable model comparisons via the likelihood. For many graphical models, for instance, such an explicit and computable expression is often not available.
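As a sketch of such a comparison, the average log-likelihood of an $L_p$-spherically symmetric model can be computed directly from the radial density and the surface area of the $L_p$ unit sphere, $S_p = 2^n\Gamma(1/p)^n/(p^{n-1}\Gamma(n/p))$. The Gamma radial family below is merely a stand-in for the flexible density $\varrho$, and the Gaussian toy data is only there to make the snippet runnable:

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def lp_spherical_avg_loglik(X, p, radial_logpdf):
    """Average log-likelihood under p(x) = rho(||x||_p) / (||x||_p^(n-1) * S_p),
    where S_p = 2^n Gamma(1/p)^n / (p^(n-1) Gamma(n/p))."""
    n = X.shape[1]
    r = np.sum(np.abs(X) ** p, axis=1) ** (1.0 / p)
    log_Sp = n * np.log(2) + n * gammaln(1 / p) - (n - 1) * np.log(p) - gammaln(n / p)
    return np.mean(radial_logpdf(r) - (n - 1) * np.log(r) - log_Sp)

rng = np.random.default_rng(2)
X_train, X_test = rng.standard_normal((5000, 8)), rng.standard_normal((5000, 8))
p = 1.5
r_train = np.sum(np.abs(X_train) ** p, axis=1) ** (1.0 / p)
flexible = stats.gamma(*stats.gamma.fit(r_train, floc=0))  # stand-in for a flexible radial fit
print(lp_spherical_avg_loglik(X_test, p, flexible.logpdf))
```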
Figure 5. Tree corresponding to an $L_p$-nested ISA model.

The same type of dependency analysis can be carried out for ISA using $L_p$-nested symmetric distributions [Sinz et al., 2009b]. Figure 5 shows the $L_p$-nested tree corresponding to an ISA with four subspaces. For such trees, each inner node except the root node corresponds to a single subspace. When using the radial distribution

\[ \varrho_\emptyset(v_\emptyset) = \frac{p_\emptyset\, v_\emptyset^{n-1}}{\Gamma\!\left[\frac{n}{p_\emptyset}\right] s^{\frac{n}{p_\emptyset}}}\exp\!\left(-\frac{v_\emptyset^{p_\emptyset}}{s}\right), \tag{12} \]

the subspaces $v_1, \ldots, v_{\ell_\emptyset}$ become independent and one obtains an ISA model of the form

\[ \rho(y) = \frac{1}{Z}\exp\!\left(-\frac{f(y)^{p_\emptyset}}{s}\right) = \frac{1}{Z}\exp\!\left(-\frac{\sum_{k=1}^{\ell_\emptyset}\|y_{I_k}\|_{p_k}^{p_\emptyset}}{s}\right), \]

where the normalization constant $Z$ is explicit and factorizes over the subspaces, so that the model has $L_p$-spherically symmetric distributions on each subspace. Note that this radial distribution is equivalent to a Gamma distribution whose variables have been raised to the power of $\frac{1}{p_\emptyset}$. In the following we will denote distributions of this type by $\gamma_p(u, s)$, where $u$ and $s$ are the shape and scale parameters of the Gamma distribution, respectively. The particular $\gamma_p$ distribution that results in independent subspaces has arbitrary scale but shape parameter $u = \frac{n}{p_\emptyset}$. When using any other radial distribution, the different subspaces do not factorize, and the distribution is also not an ISA model. In that sense, $L_p$-nested symmetric distributions are a generalization of ISA. Note, however, that not every ISA model is also $L_p$-nested symmetric, since not every product of arbitrary distributions on the subspaces, even if they are $L_p$-spherically symmetric, must also be $L_p$-nested.

It is natural to ask whether $L_p$-nested symmetric distributions can serve as a prior distribution $p(y|\vartheta)$ over hidden factors in over-complete linear models of the form

\[ p(x|W,\sigma,\vartheta) = \int p(x|Wy,\sigma)\, p(y|\vartheta)\, dy, \]

where $p(x|Wy,\sigma)$ represents the likelihood of the observed data point $x$ given the hidden factors $y$ and the over-complete matrix $W$. For example, $p(x|Wy,\sigma) = \mathcal N(Wy, \sigma^2\cdot I)$ could be a Gaussian as in Olshausen and Field [1996]. Unfortunately, such a model would suffer from the same problems as all over-complete linear models: while sampling from the prior is straightforward, sampling from the posterior $p(y|x,W,\vartheta,\sigma)$ is difficult because a whole subspace of $y$ leads to the same $x$. Since parameter estimation involves either solving the high-dimensional integral above or sampling from the posterior, learning is computationally demanding in such models. Various methods have been proposed to learn $W$, ranging from sampling the posterior only at its maximum [Olshausen and Field, 1996], over approximating the posterior with a Gaussian via the Laplace approximation [Lewicki and Olshausen, 1999], to Expectation Propagation [Seeger, 2008]. In particular, all of the above studies either do not fit hyper-parameters $\vartheta$ for the prior [Lewicki and Olshausen, 1999; Olshausen and Field, 1996] or rely on its factorial structure [Seeger, 2008]. Since $L_p$-nested distributions do not provide such a factorial prior, Expectation Propagation is not directly applicable.
An approximation like the one in Lewicki and Olshausen [1999] might be possible, but additionally estimating the parameters $\vartheta$ of the $L_p$-nested symmetric distribution adds another level of complexity to the estimation procedure. Exploring such over-complete linear models with a non-factorial prior may be an interesting direction to investigate, but it will need a significant amount of additional numerical and algorithmic work to find an efficient and robust estimation procedure.

8. Nested Radial Factorization with $L_p$-Nested Symmetric Distributions

$L_p$-nested symmetric distributions also give rise to a non-linear ICA algorithm for linearly mixed, non-factorial $L_p$-nested hidden sources $y$. The idea is similar to the Radial Factorization algorithms proposed by Lyu and Simoncelli [2009] and Sinz and Bethge [2009]. For this reason, we call it Nested Radial Factorization (NRF). For a one-layer $L_p$-nested tree, NRF is equivalent to Radial Factorization as described in Sinz and Bethge [2009]. If additionally $p$ is set to $p = 2$, one obtains the Radial Gaussianization of Lyu and Simoncelli [2009]. Therefore, NRF is a generalization of Radial Factorization. It has been demonstrated that Radial Factorization algorithms outperform linear ICA on natural image patches [Lyu and Simoncelli, 2009; Sinz and Bethge, 2009]. Since $L_p$-nested symmetric distributions achieve a slightly better likelihood on natural image patches [Sinz et al., 2009b], and since the difference in the average log-likelihood directly corresponds to the reduction in dependencies between the single variables [Sinz and Bethge, 2009], NRF will slightly outperform Radial Factorization on natural images. For other types of data, the performance will depend on how well the hidden sources can be modeled by a linear superposition of, possibly non-independent, $L_p$-nested symmetrically distributed sources. Here we state the algorithm as a possible application of $L_p$-nested symmetric distributions to unsupervised learning.

The idea is based on the observation that the choice of the radial distribution $\varrho$ already determines the type of $L_p$-nested symmetric distribution. This also means that, by changing the radial distribution through a remapping of the data, the distribution can possibly be turned into a factorial one. Radial Factorization algorithms fit an $L_p$-spherically symmetric distribution with a very flexible radial distribution to the data and map this radial distribution $\varrho_s$ ($s$ for source) into the one of a $p$-generalized Normal distribution by the mapping

\[ y \mapsto \frac{(F_{\perp\!\!\!\perp}^{-1}\circ F_s)(\|y\|_p)}{\|y\|_p}\cdot y, \tag{13} \]

where $F_{\perp\!\!\!\perp}$ and $F_s$ are the cumulative distribution functions of the two radial distributions involved. The mapping basically normalizes the demixed source $y$ and rescales it with a new radius that has the correct distribution.

Exactly the same method cannot work for $L_p$-nested symmetric distributions, since Proposition 5 states that there is no factorial distribution we could map the data to by merely changing the radial distribution. Instead, we have to remap the data in an iterative fashion, beginning with changing the radial distribution at the root node into the radial distribution of the $L_p$-nested ISA model shown in equation (12). Once the nodes are independent, we repeat this procedure for each of the child nodes independently, then for their child nodes, and so on, until only leaves are left.
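The mapping (13) is plain histogram matching on the radial component. A sketch, in which the log-normal source law stands in for the fitted flexible radial distribution and the target is the radial law of the $p$-generalized Normal ($r^p$ Gamma-distributed); the data and parameters are placeholders:

```python
import numpy as np
from scipy import stats

def radial_remap(Y, p, F_source, F_target_inv):
    """Equation (13): rescale each row so that its radius r = ||y||_p, with CDF
    F_source under the fitted model, acquires the target radial distribution."""
    r = np.sum(np.abs(Y) ** p, axis=1) ** (1.0 / p)
    return Y * (F_target_inv(F_source(r)) / r)[:, None]

rng = np.random.default_rng(3)
Y, n, p = rng.standard_normal((1000, 4)), 4, 2.0
src = stats.lognorm(s=0.5)                                  # placeholder source radial law
tgt_inv = lambda q: stats.gamma(a=n / p).ppf(q) ** (1.0 / p)  # p-gen. Normal: r = G^(1/p)
Y_factorial = radial_remap(Y, p, src.cdf, tgt_inv)
```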
The rescaling of the radii is a non-linear mapping, since the transform in equation (13) is non-linear. Therefore, NRF is a non-linear ICA algorithm.

Figure 6. $L_p$-nested non-linear ICA for the tree of Example 6: For an arbitrary $L_p$-nested symmetric distribution, using equation (13), the radial distribution can be remapped such that the children of the root node become independent. This is indicated in the plot via dotted lines. Once the data have been rescaled with that mapping, the children of the root node can be separated. The remaining subtrees are again $L_p$-nested symmetric and have a particular radial distribution that can be remapped into the one that makes their root nodes' children independent. This procedure is repeated until only leaves are left.

We demonstrate this with a simple example.

Example 6. Consider the function

\[ f(y) = \left(|y_1|^{p_\emptyset} + \left(|y_2|^{p_{\emptyset,2}} + \left(|y_3|^{p_{\emptyset,2,2}} + |y_4|^{p_{\emptyset,2,2}}\right)^{\frac{p_{\emptyset,2}}{p_{\emptyset,2,2}}}\right)^{\frac{p_\emptyset}{p_{\emptyset,2}}}\right)^{\frac{1}{p_\emptyset}} \]

for $y = Wx$, where $W$ has been estimated by fitting an $L_p$-nested symmetric distribution with a flexible radial distribution to $Wx$ as described in Section 5.1. Assume that the data has already been transformed once with the mapping of equation (13). This means that the current radial distribution is given by (12), where we chose $s = 1$ for convenience. This yields a distribution of the form

\[ \rho(y) = \frac{p_\emptyset}{\Gamma\!\left[\frac{n}{p_\emptyset}\right]}\exp\!\left(-|y_1|^{p_\emptyset} - \left(|y_2|^{p_{\emptyset,2}} + \left(|y_3|^{p_{\emptyset,2,2}} + |y_4|^{p_{\emptyset,2,2}}\right)^{\frac{p_{\emptyset,2}}{p_{\emptyset,2,2}}}\right)^{\frac{p_\emptyset}{p_{\emptyset,2}}}\right)\times\frac{1}{2^n}\prod_{I\in\mathcal I}\frac{p_I^{\ell_I-1}\,\Gamma\!\left[\frac{n_I}{p_I}\right]}{\prod_{k=1}^{\ell_I}\Gamma\!\left[\frac{n_{I,k}}{p_I}\right]}. \]

Now we can separate the distribution of $y_1$ from the distribution over $y_2, y_3, y_4$. The distribution of $y_1$ is a $p$-generalized Normal,

\[ p(y_1) = \frac{p_\emptyset}{2\,\Gamma\!\left[\frac{1}{p_\emptyset}\right]}\exp\left(-|y_1|^{p_\emptyset}\right). \]

Thus, the distribution of $y_2, y_3, y_4$ is given by

\[ \rho(y_2,y_3,y_4) = \frac{p_\emptyset}{\Gamma\!\left[\frac{n_{\emptyset,2}}{p_\emptyset}\right]}\exp\!\left(-\left(|y_2|^{p_{\emptyset,2}} + \left(|y_3|^{p_{\emptyset,2,2}} + |y_4|^{p_{\emptyset,2,2}}\right)^{\frac{p_{\emptyset,2}}{p_{\emptyset,2,2}}}\right)^{\frac{p_\emptyset}{p_{\emptyset,2}}}\right)\times\frac{1}{2^{n-1}}\prod_{I\in\mathcal I\setminus\emptyset}\frac{p_I^{\ell_I-1}\,\Gamma\!\left[\frac{n_I}{p_I}\right]}{\prod_{k=1}^{\ell_I}\Gamma\!\left[\frac{n_{I,k}}{p_I}\right]}. \]

By using equation (10) we can identify the new radial distribution to be

\[ \varrho(v_{\emptyset,2}) = \frac{p_\emptyset\, v_{\emptyset,2}^{\,n_{\emptyset,2}-1}}{\Gamma\!\left[\frac{n_{\emptyset,2}}{p_\emptyset}\right]}\exp\left(-v_{\emptyset,2}^{\,p_\emptyset}\right). \]

Replacing this distribution by the one for the $p$-generalized Normal (for data we would use the mapping in equation (13)), we obtain

\[ \rho(y_2,y_3,y_4) = \frac{p_{\emptyset,2}}{\Gamma\!\left[\frac{n_{\emptyset,2}}{p_{\emptyset,2}}\right]}\exp\!\left(-|y_2|^{p_{\emptyset,2}} - \left(|y_3|^{p_{\emptyset,2,2}} + |y_4|^{p_{\emptyset,2,2}}\right)^{\frac{p_{\emptyset,2}}{p_{\emptyset,2,2}}}\right)\times\frac{1}{2^{n-1}}\prod_{I\in\mathcal I\setminus\emptyset}\frac{p_I^{\ell_I-1}\,\Gamma\!\left[\frac{n_I}{p_I}\right]}{\prod_{k=1}^{\ell_I}\Gamma\!\left[\frac{n_{I,k}}{p_I}\right]}. \]

Now we can separate out the distribution of $y_2$, which is again $p$-generalized Normal. This leaves us with the distribution for $y_3$ and $y_4$,

\[ \rho(y_3,y_4) = \frac{p_{\emptyset,2}}{\Gamma\!\left[\frac{n_{\emptyset,2,2}}{p_{\emptyset,2}}\right]}\exp\!\left(-\left(|y_3|^{p_{\emptyset,2,2}} + |y_4|^{p_{\emptyset,2,2}}\right)^{\frac{p_{\emptyset,2}}{p_{\emptyset,2,2}}}\right)\times\frac{1}{2^{n-2}}\prod_{I\in\mathcal I\setminus\{\emptyset,(\emptyset,2)\}}\frac{p_I^{\ell_I-1}\,\Gamma\!\left[\frac{n_I}{p_I}\right]}{\prod_{k=1}^{\ell_I}\Gamma\!\left[\frac{n_{I,k}}{p_I}\right]}. \]

For this distribution we can repeat the same procedure, which will also yield $p$-generalized Normal distributions for $y_3$ and $y_4$.

Algorithm 8.1
Recursion NRF($y$, $f$, $\varrho_s$)

Input: data point $y$, $L_p$-nested function $f$, current radial distribution $\varrho_s$
Output: non-linearly transformed data point $y$

Algorithm
(1) Set the target radial distribution to be $\varrho_{\perp\!\!\!\perp} \leftarrow \gamma_{p_\emptyset}\!\left[\frac{n_\emptyset}{p_\emptyset},\ \left(\frac{\Gamma[\frac{1}{p_\emptyset}]}{\Gamma[\frac{3}{p_\emptyset}]}\right)^{\frac{p_\emptyset}{2}}\right]$.
(2) Set $y \leftarrow \frac{F_{\perp\!\!\!\perp}^{-1}(F_s(f(y)))}{f(y)}\cdot y$, where $F$ denotes the cumulative distribution function of the respective $\varrho$.
(3) For all children $i$ of the root node that are not leaves:
(a) Set $\varrho_s \leftarrow \gamma_{p_\emptyset}\!\left[\frac{n_{\emptyset,i}}{p_\emptyset},\ \left(\frac{\Gamma[\frac{1}{p_\emptyset}]}{\Gamma[\frac{3}{p_\emptyset}]}\right)^{\frac{p_\emptyset}{2}}\right]$.
(b) Set $y_{\emptyset,i} \leftarrow \mathrm{NRF}(y_{\emptyset,i}, f_{\emptyset,i}, \varrho_s)$. Note that in the recursion, $\emptyset,i$ becomes the new $\emptyset$.
(4) Return $y$.

This non-linear procedure naturally carries over to arbitrary $L_p$-nested trees and distributions, thus yielding a general non-linear ICA algorithm for linearly mixed, non-factorial $L_p$-nested sources. For generalizing Example 6, note the particular form of the radial distributions involved. As already noted above, the distribution (12) on the root node's values that makes its children statistically independent is that of a Gamma-distributed variable with shape parameter $\frac{n_\emptyset}{p_\emptyset}$ and scale parameter $s$, which has been raised to the power of $\frac{1}{p_\emptyset}$. In Section 7 we denoted this class of distributions by $\gamma_p[u, s]$, where $u$ and $s$ are the shape and the scale parameter, respectively. Interestingly, the radial distributions of the root node's children are also $\gamma_p$, except that the shape parameter is $\frac{n_{\emptyset,i}}{p_\emptyset}$. The goal of the radial remapping of the children's values is hence just changing the shape parameter from $\frac{n_{\emptyset,i}}{p_\emptyset}$ to $\frac{n_{\emptyset,i}}{p_{\emptyset,i}}$. Of course, it is also possible to change the scale parameter of the single distributions during the radial remappings. This will not affect the statistical independence of the resulting variables. In the general algorithm, we choose $s$ such that the transformed data is white.

The algorithm starts with fitting a general $L_p$-nested model of the form $\rho(Wx)$ as described in Section 5.1. Once this is done, the linear demixing matrix $W$ is fixed and the hidden non-factorial sources are recovered via $y = Wx$. Afterwards, the sources $y$ are non-linearly made independent by calling the recursion specified in Algorithm 8.1 with the parameters $Wx$, $f$ and $\varrho$, where $\varrho$ is the radial distribution of the estimated model.

The computational complexity for transforming a single data point is $O(n^2)$ because of the matrix multiplication $Wx$. In the non-linear transformation, each single data dimension is rescaled no more than $n$ times, which means that the rescaling is certainly also in $O(n^2)$.

An important aspect of NRF is that it yields a probabilistic model for the transformed data. This model is simply a product of $n$ independent exponential power marginals. Since the radial remappings do not change the likelihood, the likelihood of the non-linearly separated data is the same as the likelihood of the data under the $L_p$-nested symmetric distribution that was fitted to it in the first place. However, in some cases, one might like to fit a different distribution to the outcome of Algorithm 8.1. In that case, the determinant of the transformation is necessary to determine the likelihood of the input data, and not the transformed one, under the model.
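Before turning to the Jacobian, here is a compact sketch of the recursion in Algorithm 8.1 for a single data point. It reuses the nested-tuple tree encoding of the sampling sketch in Section 5; the encoding and all helper names are assumptions of this sketch, not of the algorithm itself.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as G

def f(node, x):                       # L_p-nested function, tree as nested tuples
    if isinstance(node, int):
        return abs(x[node])
    p, children = node
    return sum(f(c, x) ** p for c in children) ** (1.0 / p)

def leaves(node):
    return [node] if isinstance(node, int) else [i for c in node[1] for i in leaves(c)]

def gamma_p_cdf(r, u, s, p):          # gamma_p[u, s]: law of G^(1/p), G ~ Gamma(u, scale=s)
    return stats.gamma.cdf(r ** p, a=u, scale=s)

def gamma_p_icdf(q, u, s, p):
    return stats.gamma.ppf(q, a=u, scale=s) ** (1.0 / p)

def nrf(y, node, F_s):
    """Recursion of Algorithm 8.1; F_s is the CDF of the current radial law."""
    p, children = node
    idx = leaves(node)
    s = (G(1 / p) / G(3 / p)) ** (p / 2)        # scale chosen to whiten the output
    r = f(node, y)
    y = y.copy()
    y[idx] *= gamma_p_icdf(F_s(r), len(idx) / p, s, p) / r   # step 2: radial remapping
    for child in children:                       # step 3: recurse into inner children
        if not isinstance(child, int):
            u_c = len(leaves(child)) / p         # child's radial law: gamma_p[n_child/p, s]
            y = nrf(y, child, lambda rr, u=u_c, s=s, p=p: gamma_p_cdf(rr, u, s, p))
    return y

# usage: z = nrf(W @ x, tree, F_fit), with F_fit the CDF of the fitted radial distribution
```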
The following lemma provides the determinant of the Jacobian for the non-linear rescaling.

Lemma 2 (Determinant of the Jacobian). Let $z = \mathrm{NRF}(Wx, f, \varrho_s)$ as described above. Let $t_I$ denote the values of $Wx$ below the inner node $I$ which have been transformed with Algorithm 8.1 up to node $I$. Let $g_I(r) = (F_{\varrho_{\perp\!\!\!\perp}}^{-1}\circ F_{\varrho_s})(r)$ denote the radial transform at node $I$ in Algorithm 8.1. Furthermore, let $\mathcal I$ denote the set of all inner nodes, excluding the leaves. Then, the determinant of the Jacobian $\left(\frac{\partial z_i}{\partial x_j}\right)_{ij}$ is given by

\[ \left|\det\frac{\partial z_i}{\partial x_j}\right| = |\det W|\cdot\prod_{I\in\mathcal I}\left|\frac{g_I(f_I(t_I))^{n_I-1}}{f_I(t_I)^{n_I-1}}\cdot\frac{\varrho_s(f_I(t_I))}{\varrho_{\perp\!\!\!\perp}(g_I(f_I(t_I)))}\right|. \tag{14} \]
Proof. The proof can be found in Appendix E. $\Box$
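In practice, equation (14) is most convenient to accumulate in the log-domain. A sketch, where all per-node inputs are hypothetical bookkeeping collected during a run of Algorithm 8.1:

```python
import numpy as np

def nrf_logdet(W, radii, remapped, n_leaves, logpdf_s, logpdf_t):
    """log |det J| of the NRF transform via equation (14). Hypothetical inputs:
    radii[I] = f_I(t_I), remapped[I] = g_I(f_I(t_I)), n_leaves[I] = n_I, and
    logpdf_s[I]/logpdf_t[I] evaluate the source/target radial log-densities."""
    logdet = np.linalg.slogdet(W)[1]
    for I in radii:
        r, g, nI = radii[I], remapped[I], n_leaves[I]
        logdet += (nI - 1) * (np.log(g) - np.log(r))   # the (g_I / f_I)^(n_I - 1) factor
        logdet += logpdf_s[I](r) - logpdf_t[I](g)      # the rho_s / rho_perp factor
    return logdet
```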
9. Conclusion

In this article we presented a formal treatment of the first tractable subclass of $\nu$-spherical distributions which generalizes the important family of $L_p$-spherically symmetric distributions. We derived an analytical expression for the normalization constant, introduced a coordinate system particularly tailored to $L_p$-nested functions, and computed the determinant of the Jacobian for the corresponding coordinate transformation. Using these results, we introduced the uniform distribution on the $L_p$-nested unit sphere and the general form of an $L_p$-nested symmetric distribution for arbitrary $L_p$-nested functions and radial distributions. We also derived an expression for the joint distribution of inner nodes of an $L_p$-nested tree and a sampling scheme for an arbitrary $L_p$-nested symmetric distribution.

$L_p$-nested symmetric distributions naturally provide the class of probability distributions corresponding to mixed-norm priors, allowing full Bayesian inference in the corresponding probabilistic models. We showed that a robustness result for Bayesian inference of the location parameter known for $L_p$-spherically symmetric distributions carries over to the $L_p$-nested symmetric class. We discussed the relations of $L_p$-nested symmetric distributions to Independent Component Analysis (ICA) and Independent Subspace Analysis (ISA), and discussed their applicability as a prior distribution in over-complete linear models. Finally, we showed how $L_p$-nested symmetric distributions can be used to construct a non-linear ICA algorithm called Nested Radial Factorization (NRF).

The application of $L_p$-nested symmetric distributions to natural images has been presented in a previous conference paper [Sinz et al., 2009b]. Code for training this class of distributions is provided online.

Acknowledgements
We would like to thank Eero Simoncelli for mentioning the idea of replacing $L_p$-norms by $L_p$-nested functions to us. Furthermore, we want to thank Sebastian Gerwinn, Suvrit Sra and Reshad Hosseini for fruitful discussions and feedback on the manuscript. Finally, we would like to thank the anonymous reviewers for their comments that helped to improve the manuscript.

This work is supported by the German Ministry of Education, Science, Research and Technology through the Bernstein prize to MB (BMBF; FKZ: 01GQ0601), a scholarship to FS by the German National Academic Foundation, and the Max Planck Society.

Appendix A. Determinant of the Jacobian
Lemma 1.
The proof is very similar to the one in Song and Gupta [1997]. To derive equation (2), one needs to expand the Jacobian of the inverse coordinate transformation with respect to the last column using the Laplace expansion of the determinant. The term $\Delta_n$ can be factored out of the determinant and cancels due to the absolute value around it. Therefore, the determinant of the coordinate transformation does not depend on $\Delta_n$.

The partial derivatives of the inverse coordinate transformation are given by:

\begin{align*}
\frac{\partial}{\partial u_k}x_i &= \delta_{ik}\, r &&\text{for } 1\le i,k\le n-1,\\
\frac{\partial}{\partial u_k}x_n &= \Delta_n\, r\,\frac{\partial u_n}{\partial u_k} &&\text{for } 1\le k\le n-1,\\
\frac{\partial}{\partial r}x_i &= u_i &&\text{for } 1\le i\le n-1,\\
\frac{\partial}{\partial r}x_n &= \Delta_n\, u_n.
\end{align*}

Therefore, the structure of the Jacobian is given by

\[ J = \begin{pmatrix} r & \cdots & 0 & u_1\\ \vdots & \ddots & \vdots & \vdots\\ 0 & \cdots & r & u_{n-1}\\ \Delta_n r\frac{\partial u_n}{\partial u_1} & \cdots & \Delta_n r\frac{\partial u_n}{\partial u_{n-1}} & \Delta_n u_n \end{pmatrix}. \]

Since we are only interested in the absolute value of the determinant, and since $\Delta_n\in\{-1,1\}$, we can factor out $\Delta_n$ and drop it. Furthermore, we can factor out $r$ from the first $n-1$ columns:

\[ |\det J| = r^{n-1}\left|\det\begin{pmatrix} 1 & \cdots & 0 & u_1\\ \vdots & \ddots & \vdots & \vdots\\ 0 & \cdots & 1 & u_{n-1}\\ \frac{\partial u_n}{\partial u_1} & \cdots & \frac{\partial u_n}{\partial u_{n-1}} & u_n \end{pmatrix}\right|. \]

Now we can use the Laplace expansion of the determinant with respect to the last column. For that purpose, let $J_i$ denote the matrix which is obtained by deleting the last column and the $i$-th row from $J$; it consists of the identity matrix with the $i$-th row removed and the row of partial derivatives $\frac{\partial u_n}{\partial u_1},\ldots,\frac{\partial u_n}{\partial u_{n-1}}$ appended at the bottom. We can transform $J_i$ into a lower triangular matrix by moving the column whose only non-zero entry is the bottom element $\frac{\partial u_n}{\partial u_i}$ to the rightmost column of $J_i$. Each swapping of two columns introduces a factor of $-1$. In the end, we can compute the value of $\det J_i$ by simply taking the product of the diagonal entries and obtain $\det J_i = (-1)^{n-1-i}\frac{\partial u_n}{\partial u_i}$. This yields

\begin{align*}
|\det J| &= r^{n-1}\left|\sum_{k=1}^{n}(-1)^{n+k}u_k\det J_k\right| = r^{n-1}\left|\sum_{k=1}^{n-1}(-1)^{n+k}u_k(-1)^{n-1-k}\frac{\partial u_n}{\partial u_k} + u_n\right|\\
&= r^{n-1}\left|-\sum_{k=1}^{n-1}u_k\frac{\partial u_n}{\partial u_k} + u_n\right|. \qquad\Box
\end{align*}

Before proving Proposition 1, which states that the determinant only depends on the terms $G_I(\hat u_I)$ produced by the chain rule when used upwards in the tree, let us quickly outline the essential mechanism when taking the chain rule for $\frac{\partial u_n}{\partial u_q}$. Consider the tree corresponding to $f$. By definition, $u_n$ is the rightmost leaf of the tree. Let $L,\ell_L$ be the multi-index of $u_n$. As in the example, the chain rule starts at the leaf $u_n$ and ascends the tree until it reaches the lowest node whose subtree contains both $u_n$ and $u_q$. At this point, it starts descending the tree until it reaches the leaf $u_q$. Depending on whether the chain rule ascends or descends, two different forms of derivatives occur: while ascending, the chain rule produces $G_I(\hat u_I)$-terms like the one in the example above; while descending, it produces $F_I(u_I)$-terms. The general definitions of the $G_I(\hat u_I)$- and $F_I(u_I)$-terms are given by the recursive formulae

\[ G_{I,\ell_I}(\hat u_{I,\ell_I}) = g_{I,\ell_I}(\hat u_{I,\ell_I})^{p_{I,\ell_I}-p_I} = \left(g_I(\hat u_I)^{p_I} - \sum_{j=1}^{\ell_I-1} f_{I,j}(u_{I,j})^{p_I}\right)^{\frac{p_{I,\ell_I}-p_I}{p_I}} \tag{15} \]

and

\[ F_{I,i_r}(u_{I,i_r}) = f_{I,i_r}(u_{I,i_r})^{p_I-p_{I,i_r}} = \left(\sum_{k=1}^{\ell_{I,i_r}} f_{I,i_r,k}(u_{I,i_r,k})^{p_{I,i_r}}\right)^{\frac{p_I-p_{I,i_r}}{p_{I,i_r}}}. \]

The next two lemmata are required for the proof of Proposition 1. We use the somewhat sloppy notation $k\in I,i_r$ if the variable $u_k$ is a leaf in the subtree below $I,i_r$. The same notation is used for $\hat I$.

Lemma 3.
Let $I = i_1,\ldots,i_{r-1}$ and $I,i_r$ be any node of the tree associated with an $L_p$-nested function $f$. Then the following recursions hold for the derivatives of $g_{I,i_r}(\hat u_{I,i_r})^{p_{I,i_r}}$ and $f_{I,i_r}(u_{I,i_r})^{p_I}$ w.r.t. $u_q$: If $u_q$ is not in the subtree under the node $I,i_r$, i.e. $q\notin I,i_r$, then $\frac{\partial}{\partial u_q}f_{I,i_r}(u_{I,i_r})^{p_I} = 0$ and

\[ \frac{\partial}{\partial u_q} g_{I,i_r}(\hat u_{I,i_r})^{p_{I,i_r}} = \frac{p_{I,i_r}}{p_I}\, G_{I,i_r}(\hat u_{I,i_r})\cdot\begin{cases} \frac{\partial}{\partial u_q} g_I(\hat u_I)^{p_I} & \text{if } q\in\hat I,\\[2pt] -\frac{\partial}{\partial u_q} f_{I,j}(u_{I,j})^{p_I} & \text{if } q\in I,j, \end{cases} \]

for $q\in I,j$ and $q\notin I,k$ for $k\ne j$. Otherwise, $\frac{\partial}{\partial u_q} g_{I,i_r}(\hat u_{I,i_r})^{p_{I,i_r}} = 0$ and

\[ \frac{\partial}{\partial u_q} f_{I,i_r}(u_{I,i_r})^{p_I} = \frac{p_I}{p_{I,i_r}}\, F_{I,i_r}(u_{I,i_r})\,\frac{\partial}{\partial u_q} f_{I,i_r,s}(u_{I,i_r,s})^{p_{I,i_r}} \]

for $q\in I,i_r,s$ and $q\notin I,i_r,k$ for $k\ne s$.

Proof. Both of the first equations are obvious, since only those nodes whose subtree actually depends on $u_q$ have a non-zero derivative. The second equations can be seen by direct computation:

\begin{align*}
\frac{\partial}{\partial u_q} g_{I,i_r}(\hat u_{I,i_r})^{p_{I,i_r}} &= p_{I,i_r}\, g_{I,i_r}(\hat u_{I,i_r})^{p_{I,i_r}-1}\frac{\partial}{\partial u_q} g_{I,i_r}(\hat u_{I,i_r})\\
&= p_{I,i_r}\, g_{I,i_r}(\hat u_{I,i_r})^{p_{I,i_r}-1}\frac{\partial}{\partial u_q}\left(g_I(\hat u_I)^{p_I} - \sum_{j=1}^{\ell_I-1} f_{I,j}(u_{I,j})^{p_I}\right)^{\frac{1}{p_I}}\\
&= \frac{p_{I,i_r}}{p_I}\, g_{I,i_r}(\hat u_{I,i_r})^{p_{I,i_r}-1}\, g_{I,i_r}(\hat u_{I,i_r})^{1-p_I}\frac{\partial}{\partial u_q}\left(g_I(\hat u_I)^{p_I} - \sum_{j=1}^{\ell_I-1} f_{I,j}(u_{I,j})^{p_I}\right)\\
&= \frac{p_{I,i_r}}{p_I}\, G_{I,i_r}(\hat u_{I,i_r})\cdot\begin{cases}\frac{\partial}{\partial u_q} g_I(\hat u_I)^{p_I} & \text{if } q\in\hat I,\\[2pt] -\frac{\partial}{\partial u_q} f_{I,j}(u_{I,j})^{p_I} & \text{if } q\in I,j.\end{cases}
\end{align*}

Similarly,

\begin{align*}
\frac{\partial}{\partial u_q} f_{I,i_r}(u_{I,i_r})^{p_I} &= p_I\, f_{I,i_r}(u_{I,i_r})^{p_I-1}\frac{\partial}{\partial u_q} f_{I,i_r}(u_{I,i_r})\\
&= p_I\, f_{I,i_r}(u_{I,i_r})^{p_I-1}\frac{\partial}{\partial u_q}\left(\sum_{k=1}^{\ell_{I,i_r}} f_{I,i_r,k}(u_{I,i_r,k})^{p_{I,i_r}}\right)^{\frac{1}{p_{I,i_r}}}\\
&= \frac{p_I}{p_{I,i_r}}\, f_{I,i_r}(u_{I,i_r})^{p_I-1}\, f_{I,i_r}(u_{I,i_r})^{1-p_{I,i_r}}\frac{\partial}{\partial u_q} f_{I,i_r,s}(u_{I,i_r,s})^{p_{I,i_r}}\\
&= \frac{p_I}{p_{I,i_r}}\, F_{I,i_r}(u_{I,i_r})\,\frac{\partial}{\partial u_q} f_{I,i_r,s}(u_{I,i_r,s})^{p_{I,i_r}}
\end{align*}

for $q\in I,i_r,s$. $\Box$

The next lemma states the form of the whole derivative $\frac{\partial u_n}{\partial u_q}$ in terms of the $G_I(\hat u_I)$- and $F_I(u_I)$-terms.

Lemma 4.
Let $|u_q| = v_{\ell_1,\ldots,\ell_m,i_1,\ldots,i_t}$ and $|u_n| = v_{\ell_1,\ldots,\ell_d}$ with $m < d$. The derivative of $u_n$ w.r.t. $u_q$ is given by

\begin{align*}
\frac{\partial}{\partial u_q} u_n = -\,&G_{\ell_1,\ldots,\ell_d}(\hat u_{\ell_1,\ldots,\ell_d})\cdots G_{\ell_1,\ldots,\ell_{m+1}}(\hat u_{\ell_1,\ldots,\ell_{m+1}})\\
\times\,&F_{\ell_1,\ldots,\ell_m,i_1}(u_{\ell_1,\ldots,\ell_m,i_1})\cdots F_{\ell_1,\ldots,\ell_m,i_1,\ldots,i_{t-1}}(u_{\ell_1,\ldots,\ell_m,i_1,\ldots,i_{t-1}})\cdot\Delta_q\,|u_q|^{p_{\ell_1,\ldots,\ell_m,i_1,\ldots,i_{t-1}}-1}
\end{align*}

with $\Delta_q = \mathrm{sgn}\, u_q$ and $|u_q|^p = (\Delta_q u_q)^p$. In particular,

\begin{align*}
u_q\frac{\partial}{\partial u_q} u_n = -\,&G_{\ell_1,\ldots,\ell_d}(\hat u_{\ell_1,\ldots,\ell_d})\cdots G_{\ell_1,\ldots,\ell_{m+1}}(\hat u_{\ell_1,\ldots,\ell_{m+1}})\\
\times\,&F_{\ell_1,\ldots,\ell_m,i_1}(u_{\ell_1,\ldots,\ell_m,i_1})\cdots F_{\ell_1,\ldots,\ell_m,i_1,\ldots,i_{t-1}}(u_{\ell_1,\ldots,\ell_m,i_1,\ldots,i_{t-1}})\cdot|u_q|^{p_{\ell_1,\ldots,\ell_m,i_1,\ldots,i_{t-1}}}.
\end{align*}

Proof.
Successive application of Lemma 3. $\Box$
Proposition 1.
Before we begin with the proof, note that $F_I(u_I)$ and $G_I(\hat u_I)$ fulfill the following equalities:

\begin{align}
G_{I,i_m}(\hat u_{I,i_m})^{-1}\, g_{I,i_m}(\hat u_{I,i_m})^{p_{I,i_m}} &= g_{I,i_m}(\hat u_{I,i_m})^{p_I} \tag{16}\\
&= g_I(\hat u_I)^{p_I} - \sum_{k=1}^{\ell_I-1} F_{I,k}(u_{I,k})\, f_{I,k}(u_{I,k})^{p_{I,k}} \tag{17}
\end{align}

and

\[ f_{I,i_m}(u_{I,i_m})^{p_{I,i_m}} = \sum_{k=1}^{\ell_{I,i_m}} F_{I,i_m,k}(u_{I,i_m,k})\, f_{I,i_m,k}(u_{I,i_m,k})^{p_{I,i_m,k}}. \tag{18} \]

Now let $L = \ell_1,\ldots,\ell_{d-1}$ be the multi-index of the parent of $u_n$. We compute $\frac{1}{r^{n-1}}|\det J|$ and obtain the result by solving for $|\det J|$. As shown in Lemma 1, $\frac{1}{r^{n-1}}|\det J|$ has the form

\[ \frac{1}{r^{n-1}}|\det J| = \left|-\sum_{k=1}^{n-1}\frac{\partial u_n}{\partial u_k}\cdot u_k + u_n\right|. \]

By definition, $u_n = g_{L,\ell_d}(\hat u_{L,\ell_d})$, and by equation (16), $G_{L,\ell_d}(\hat u_{L,\ell_d})^{-1}\, u_n = g_{L,\ell_d}(\hat u_{L,\ell_d})^{p_L}$. Now assume that $u_m,\ldots,u_{n-1}$ are children of $L$, i.e. $u_k = v_{L,I,i_t}$ for some $I,i_t = i_1,\ldots,i_t$ and $m\le k < n$. Remember that, by Lemma 4, the terms $u_q\frac{\partial}{\partial u_q}u_n$ for $m\le q < n$ have the form

\[ u_q\frac{\partial}{\partial u_q} u_n = -G_{L,\ell_d}(\hat u_{L,\ell_d})\cdot F_{L,i_1}(u_{L,i_1})\cdots F_{L,I}(u_{L,I})\cdot|u_q|^{p_{L,i_1,\ldots,i_{t-1}}}. \]

Using equation (17), we can expand the determinant as follows:

\begin{align*}
&-\sum_{k=1}^{n-1}\frac{\partial u_n}{\partial u_k}\cdot u_k + u_n = -\sum_{k=1}^{m-1}\frac{\partial u_n}{\partial u_k}\cdot u_k - \sum_{k=m}^{n-1}\frac{\partial u_n}{\partial u_k}\cdot u_k + u_n\\
&\quad= -\sum_{k=1}^{m-1}\frac{\partial u_n}{\partial u_k}\cdot u_k + G_{L,\ell_d}(\hat u_{L,\ell_d})\left(-\sum_{k=m}^{n-1}G_{L,\ell_d}(\hat u_{L,\ell_d})^{-1}\frac{\partial u_n}{\partial u_k}\cdot u_k + g_L(\hat u_L)^{p_L} - \sum_{k=1}^{\ell_d-1}F_{L,k}(u_{L,k})\, f_{L,k}(u_{L,k})^{p_{L,k}}\right).
\end{align*}

Note that all terms $G_{L,\ell_d}(\hat u_{L,\ell_d})^{-1}\frac{\partial u_n}{\partial u_k}\cdot u_k$ for $m\le k < n$ now have the form

\[ G_{L,\ell_d}(\hat u_{L,\ell_d})^{-1}\, u_k\frac{\partial}{\partial u_k}u_n = -F_{L,i_1}(u_{L,i_1})\cdots F_{L,I}(u_{L,I})\cdot|u_k|^{p_{L,i_1,\ldots,i_{t-1}}}, \]

since we constructed them to be neighbors of $u_n$. However, with equation (18), we can further expand the sum $\sum_{k=1}^{\ell_d-1}F_{L,k}(u_{L,k})f_{L,k}(u_{L,k})^{p_{L,k}}$ down to the leaves $u_m,\ldots,u_{n-1}$. In doing so, we end up with the same factors $F_{L,i_1}(u_{L,i_1})\cdots F_{L,I}(u_{L,I})\cdot|u_k|^{p_{L,i_1,\ldots,i_{t-1}}}$ as in the derivatives $G_{L,\ell_d}(\hat u_{L,\ell_d})^{-1}u_k\frac{\partial}{\partial u_k}u_n$. This means exactly that

\[ -\sum_{k=m}^{n-1}G_{L,\ell_d}(\hat u_{L,\ell_d})^{-1}\frac{\partial u_n}{\partial u_k}\cdot u_k = \sum_{k=1}^{\ell_d-1}F_{L,k}(u_{L,k})\, f_{L,k}(u_{L,k})^{p_{L,k}} \]

and, therefore,

\begin{align*}
&-\sum_{k=1}^{m-1}\frac{\partial u_n}{\partial u_k}\cdot u_k + G_{L,\ell_d}(\hat u_{L,\ell_d})\left(\sum_{k=1}^{\ell_d-1}F_{L,k}(u_{L,k})f_{L,k}(u_{L,k})^{p_{L,k}} + g_L(\hat u_L)^{p_L} - \sum_{k=1}^{\ell_d-1}F_{L,k}(u_{L,k})f_{L,k}(u_{L,k})^{p_{L,k}}\right)\\
&\quad= -\sum_{k=1}^{m-1}\frac{\partial u_n}{\partial u_k}\cdot u_k + G_{L,\ell_d}(\hat u_{L,\ell_d})\, g_L(\hat u_L)^{p_L}.
\end{align*}

By factoring out $G_{L,\ell_d}(\hat u_{L,\ell_d})$ from the equation, the terms $\frac{\partial u_n}{\partial u_k}\cdot u_k$ lose the $G_{L,\ell_d}$ in front, and we get basically the same equation as before, only that the new leaf (the new "$u_n$") is $g_L(\hat u_L)^{p_L}$ and we have got rid of all the children of $L$. By repeating this procedure up to the root node, we successively factor out all $G_{L'}(\hat u_{L'})$ for $L'\in\mathcal L$ until all terms of the sum vanish and we are only left with $v_\emptyset = 1$. Therefore, the determinant is

\[ \frac{1}{r^{n-1}}|\det J| = \prod_{L\in\mathcal L} G_L(\hat u_L), \]

which completes the proof. $\Box$

Appendix B. Volume and Surface of the $L_p$-Nested Unit Sphere

Proposition 2.
We obtain the volume by computing the integral $\int_{f(x)\le R} dx$. Differentiation with respect to $R$ yields the surface area. For symmetry reasons we can compute the volume on the positive orthant $\mathbb R^n_+$ only and multiply the result with $2^n$ later to obtain the full volume and surface area. The strategy for computing the volume is as follows. We start off with inner nodes $I$ that are parents of leaves only. The value $v_I$ of such a node is simply the $L_{p_I}$-norm of its children. Therefore, we can convert the integral over the children of $I$ with the transformation of Gupta and Song [1997]. This maps the leaves $v_{I,1},\ldots,v_{I,\ell_I}$ into $v_I$ and "angular" variables $\tilde u$. Since the integration borders of the original integral depend only on the value of $v_I$ and not on $\tilde u$, we can separate the variables $\tilde u$ from the radial variable $v_I$ and integrate the variables $\tilde u$ separately. The integration over $\tilde u$ yields a certain factor, while the variable $v_I$ effectively becomes a new leaf.

Now suppose $I$ is a parent of leaves only. Without loss of generality, let the $\ell_I$ leaves correspond to the last $\ell_I$ coefficients of $x$. Let $x\in\mathbb R^n_+$. Carrying out the first transformation and integration yields

\begin{align*}
\int_{f(x)\le R} dx &= \int_{f(x_{1:n-\ell_I},\, v_I)\le R}\int_{\tilde u\in V_+^{\ell_I-1}} v_I^{\,\ell_I-1}\left(1-\sum_{i=1}^{\ell_I-1}\tilde u_i^{p_I}\right)^{\frac{1-p_I}{p_I}} dv_I\, d\tilde u\, dx_{1:n-\ell_I}\\
&= \int_{f(x_{1:n-\ell_I},\, v_I)\le R} v_I^{\,n_I-1}\, dv_I\, dx_{1:n-\ell_I}\times\int_{\tilde u\in V_+^{\ell_I-1}}\left(1-\sum_{i=1}^{\ell_I-1}\tilde u_i^{p_I}\right)^{\frac{n_{I,\ell_I}-p_I}{p_I}} d\tilde u.
\end{align*}

For solving the second integral we make the pointwise transformation $s_i = \tilde u_i^{p_I}$ and obtain

\begin{align*}
\int_{\tilde u\in V_+^{\ell_I-1}}\left(1-\sum_{i=1}^{\ell_I-1}\tilde u_i^{p_I}\right)^{\frac{n_{I,\ell_I}-p_I}{p_I}} d\tilde u &= \frac{1}{p_I^{\ell_I-1}}\int_{\sum s_i\le 1}\left(1-\sum_{i=1}^{\ell_I-1}s_i\right)^{\frac{n_{I,\ell_I}}{p_I}-1}\prod_{i=1}^{\ell_I-1}s_i^{\frac{1}{p_I}-1}\, ds\\
&= \frac{1}{p_I^{\ell_I-1}}\prod_{k=1}^{\ell_I-1}B\!\left[\frac{\sum_{i=1}^k n_{I,i}}{p_I},\frac{n_{I,k+1}}{p_I}\right] = \frac{1}{p_I^{\ell_I-1}}\prod_{k=1}^{\ell_I-1}B\!\left[\frac{k}{p_I},\frac{1}{p_I}\right],
\end{align*}

by using the fact that the transformed integral has the form of an unnormalized Dirichlet distribution and, therefore, the value of the integral must equal its normalization constant.

Now we solve the integral

\[ \int_{f(x_{1:n-\ell_I},\, v_I)\le R} v_I^{\,n_I-1}\, dv_I\, dx_{1:n-\ell_I}. \tag{19} \]

We carry this out in exactly the same manner as we solved the previous integral. We only need to make sure that we only contract nodes that have only leaves as children (remember that the radii of contracted nodes become leaves), and we need a formula for how the factors $v_I^{\,n_I-1}$ propagate through the tree.

For the latter, we first state the formula and then prove it via induction. For notational convenience, let $\mathcal J$ denote the set of multi-indices corresponding to the contracted leaves, $x_{\hat{\mathcal J}}$ the remaining coefficients of $x$, and $v_{\mathcal J}$ the vector of leaves resulting from contraction. The integral which is left to solve after integrating over all $\tilde u$ is given by (remember that $n_J$ denotes the number of real leaves, i.e. the ones corresponding to coefficients of $x$):

\[ \int_{f(x_{\hat{\mathcal J}},\, v_{\mathcal J})\le R}\ \prod_{J\in\mathcal J} v_J^{\,n_J-1}\, dv_{\mathcal J}\, dx_{\hat{\mathcal J}}. \]

We already proved the first induction step by computing equation (19). For computing the general induction step, suppose $I$ is an inner node whose children are leaves or contracted leaves. Let $\mathcal J'$ be the set of contracted leaves under $I$ and $\mathcal K = \mathcal J\setminus\mathcal J'$. Transforming the children of $I$ into radial coordinates by Gupta and Song [1997] yields

\begin{align*}
&\int_{f(x_{\hat{\mathcal J}},\, v_{\mathcal J})\le R}\prod_{J\in\mathcal J} v_J^{\,n_J-1}\, dv_{\mathcal J}\, dx_{\hat{\mathcal J}} = \int_{f(x_{\hat{\mathcal J}},\, v_{\mathcal J})\le R}\left(\prod_{K\in\mathcal K}v_K^{\,n_K-1}\right)\left(\prod_{J'\in\mathcal J'}v_{J'}^{\,n_{J'}-1}\right)dv_{\mathcal J}\, dx_{\hat{\mathcal J}}\\
&= \int_{f(x_{\hat{\mathcal K}},\, v_{\mathcal K},\, v_I)\le R}\int_{\tilde u_{\ell_I-1}\in V_+^{\ell_I-1}}\left(1-\sum_{i=1}^{\ell_I-1}\tilde u_i^{p_I}\right)^{\frac{1-p_I}{p_I}} v_I^{\,\ell_I-1}\left(\prod_{K\in\mathcal K}v_K^{\,n_K-1}\right)\\
&\qquad\times\left(v_I\left(1-\sum_{i=1}^{\ell_I-1}\tilde u_i^{p_I}\right)^{\frac{1}{p_I}}\right)^{n_{\ell_I}-1}\prod_{k=1}^{\ell_I-1}(v_I\tilde u_k)^{n_k-1}\, dx_{\hat{\mathcal K}}\, dv_{\mathcal K}\, dv_I\, d\tilde u_{\ell_I-1}\\
&= \int_{f(x_{\hat{\mathcal K}},\, v_{\mathcal K},\, v_I)\le R}\left(\prod_{K\in\mathcal K}v_K^{\,n_K-1}\right)v_I^{\,\ell_I-1+\sum_{i=1}^{\ell_I}(n_i-1)}\, dx_{\hat{\mathcal K}}\, dv_{\mathcal K}\, dv_I\\
&\qquad\times\int_{\tilde u_{\ell_I-1}\in V_+^{\ell_I-1}}\left(1-\sum_{i=1}^{\ell_I-1}\tilde u_i^{p_I}\right)^{\frac{n_{\ell_I}-p_I}{p_I}}\prod_{k=1}^{\ell_I-1}\tilde u_k^{\,n_k-1}\, d\tilde u_{\ell_I-1},
\end{align*}

where $\ell_I-1+\sum_{i=1}^{\ell_I}(n_i-1) = n_I-1$. Again, by transforming it into a Dirichlet distribution, the latter integral has the solution

\[ \int_{\tilde u_{\ell_I-1}\in V_+^{\ell_I-1}}\left(1-\sum_{i=1}^{\ell_I-1}\tilde u_i^{p_I}\right)^{\frac{n_{\ell_I}-p_I}{p_I}}\prod_{k=1}^{\ell_I-1}\tilde u_k^{\,n_k-1}\, d\tilde u_{\ell_I-1} = \frac{1}{p_I^{\ell_I-1}}\prod_{k=1}^{\ell_I-1}B\!\left[\frac{\sum_{i=1}^k n_{I,i}}{p_I},\frac{n_{I,k+1}}{p_I}\right], \]

while the remaining former integral has the form

\[ \int_{f(x_{\hat{\mathcal K}},\, v_{\mathcal K},\, v_I)\le R}\left(\prod_{K\in\mathcal K}v_K^{\,n_K-1}\right)v_I^{\,n_I-1}\, dx_{\hat{\mathcal K}}\, dv_{\mathcal K}\, dv_I = \int_{f(x_{\hat{\mathcal J}},\, v_{\mathcal J})\le R}\prod_{J\in\mathcal J}v_J^{\,n_J-1}\, dv_{\mathcal J}\, dx_{\hat{\mathcal J}}, \]

as claimed. By carrying out the integration up to the root node, the remaining integral becomes

\[ \int_{v_\emptyset\le R} v_\emptyset^{\,n-1}\, dv_\emptyset = \int_0^R v_\emptyset^{\,n-1}\, dv_\emptyset = \frac{R^n}{n}. \]

Collecting the factors from the integration over the $\tilde u$ proves equations (6) and (8). Using $B[a,b] = \frac{\Gamma[a]\Gamma[b]}{\Gamma[a+b]}$ yields equations (7) and (9). $\Box$
Appendix C. Layer Marginals

Proposition 4. Transforming the leaves of a subtree $I$ into radial coordinates yields

\[ \rho(x) = \frac{\varrho(f(x))}{S_f(f(x))} = \frac{\varrho(f(x_{1:n-\ell_I},\, v_I))}{S_f(f(x_{1:n-\ell_I},\, v_I))}\cdot v_I^{\,\ell_I-1}\left(1-\sum_{i=1}^{\ell_I-1}|\tilde u_i|^{p_I}\right)^{\frac{1-p_I}{p_I}}, \]

where $\Delta_n = \mathrm{sign}(x_n)$. Note that $f$ is invariant to the actual value of $\Delta_n$; when integrating it out, it yields a factor of 2. Integrating out $\tilde u_{1:\ell_I-1}$ and $\Delta_n$ now yields

\begin{align*}
\rho(x_{1:n-\ell_I},\, v_I) &= \frac{\varrho(f(x_{1:n-\ell_I},\, v_I))}{S_f(f(x_{1:n-\ell_I},\, v_I))}\cdot v_I^{\,\ell_I-1}\cdot\frac{2^{\ell_I}\,\Gamma^{\ell_I}\!\left[\frac{1}{p_I}\right]}{p_I^{\ell_I-1}\,\Gamma\!\left[\frac{\ell_I}{p_I}\right]}\\
&= \frac{\varrho(f(x_{1:n-\ell_I},\, v_I))}{S_{\tilde f}(f(x_{1:n-\ell_I},\, v_I))}\cdot v_I^{\,\ell_I-1},
\end{align*}

where the constants are absorbed into the surface-area term of the contracted function. Now we can go on and integrate out more subtrees. For that purpose, let $x_{\hat{\mathcal J}}$ denote the remaining coefficients of $x$, $v_{\mathcal J}$ the vector of leaves resulting from the kind of contraction just shown for $v_I$, and $\mathcal J$ the set of multi-indices corresponding to the "new leaves", i.e. the nodes after contraction. We obtain the following equation:

\[ \rho(x_{\hat{\mathcal J}},\, v_{\mathcal J}) = \frac{\varrho(f(x_{\hat{\mathcal J}},\, v_{\mathcal J}))}{S_f(f(x_{\hat{\mathcal J}},\, v_{\mathcal J}))}\prod_{J\in\mathcal J} v_J^{\,n_J-1}, \]

where $n_J$ denotes the number of leaves in the subtree under the node $J$. The calculations for the proof are basically the same as the ones for Proposition 2. $\Box$

Appendix D. Factorial $L_p$-Nested Distributions

Proposition 5.
Since the single $x_i$ are independent, $f_1(x_1),\ldots,f_{\ell_\emptyset}(x_{\ell_\emptyset})$ and, therefore, $v_1,\ldots,v_{\ell_\emptyset}$ must be independent as well ($x_i$ are the elements of $x$ in the subtree below the $i$-th child of the root node). Using Corollary 2, we can write the density of $v_1,\ldots,v_{\ell_\emptyset}$ as (the function name $g$ is unrelated to the usage of the function $g$ above)

\[ \rho(v_1,\ldots,v_{\ell_\emptyset}) = \prod_{i=1}^{\ell_\emptyset} h_i(v_i) = g\!\left(\|v_{1:\ell_\emptyset}\|_{p_\emptyset}\right)\prod_{i=1}^{\ell_\emptyset} v_i^{\,n_i-1} \]

with

\[ g\!\left(\|v_{1:\ell_\emptyset}\|_{p_\emptyset}\right) = \frac{p_\emptyset^{\ell_\emptyset-1}\,\Gamma\!\left[\frac{n}{p_\emptyset}\right]}{f(v_1,\ldots,v_{\ell_\emptyset})^{n-\ell_\emptyset}\prod_{k=1}^{\ell_\emptyset}\Gamma\!\left[\frac{n_k}{p_\emptyset}\right]}\,\varrho\!\left(\|v_{1:\ell_\emptyset}\|_{p_\emptyset}\right). \]

Since the integral over $g$ is finite, it follows from Sinz et al. [2009a] that $g$ has the form $g(\|v_{1:\ell_\emptyset}\|_{p_\emptyset}) = \exp(a_\emptyset\|v_{1:\ell_\emptyset}\|_{p_\emptyset}^{p_\emptyset} + b_\emptyset)$ for appropriate constants $a_\emptyset$ and $b_\emptyset$. Therefore, the marginals have the form

\[ h_i(v_i) = \exp\left(a_\emptyset v_i^{\,p_\emptyset} + c_\emptyset\right) v_i^{\,n_i-1}. \tag{20} \]

On the other hand, the particular form of $g$ implies that the radial density has the form $\varrho(f(x)) \propto f(x)^{n-\ell_\emptyset}\exp(a_\emptyset f(x)^{p_\emptyset} + b_\emptyset)$. In particular, this implies that the root node's children $f_i(x_i)$ ($i = 1,\ldots,\ell_\emptyset$) are independent and $L_p$-nested symmetric again. With the same argument as above, it follows that their children $v_{i,1},\ldots,v_{i,\ell_i}$ follow the distribution

\[ \rho(v_{i,1},\ldots,v_{i,\ell_i}) = \exp\left(a_i\|v_{i,1:\ell_i}\|_{p_i}^{p_i} + b_i\right)\prod_{j=1}^{\ell_i} v_{i,j}^{\,n_{i,j}-1}. \]

Transforming that distribution to $L_p$-spherically symmetric polar coordinates $v_i = \|v_{i,1:\ell_i}\|_{p_i}$ and $\tilde u = v_{i,1:\ell_i-1}/\|v_{i,1:\ell_i}\|_{p_i}$ as in Gupta and Song [1997], we obtain the form

\begin{align*}
\rho(v_i,\tilde u) &= \exp\left(a_i v_i^{\,p_i} + b_i\right) v_i^{\,\ell_i-1}\left(1-\sum_{j=1}^{\ell_i-1}|\tilde u_j|^{p_i}\right)^{\frac{1-p_i}{p_i}}\left(v_i\left(1-\sum_{j=1}^{\ell_i-1}|\tilde u_j|^{p_i}\right)^{\frac{1}{p_i}}\right)^{n_{i,\ell_i}-1}\prod_{j=1}^{\ell_i-1}(\tilde u_j v_i)^{n_{i,j}-1}\\
&= \exp\left(a_i v_i^{\,p_i} + b_i\right) v_i^{\,n_i-1}\left(1-\sum_{j=1}^{\ell_i-1}|\tilde u_j|^{p_i}\right)^{\frac{n_{i,\ell_i}-p_i}{p_i}}\prod_{j=1}^{\ell_i-1}\tilde u_j^{\,n_{i,j}-1},
\end{align*}

where the second equation follows the same calculations as in the proof of Proposition 2. After integrating out $\tilde u$ (assuming that the $x_i$ are statistically independent), we obtain the density of $v_i$, which is equal to (20) if and only if $p_i = p_\emptyset$. However, if $p_\emptyset$ and $p_i$ are equal, the hierarchy of the $L_p$-nested function shrinks by one layer, since $p_i$ and $p_\emptyset$ cancel each other. Repeated application of the above argument collapses the complete $L_p$-nested tree until one effectively obtains an $L_p$-spherical function. Since the only factorial $L_p$-spherically symmetric distribution is the $p$-generalized Normal [Sinz et al., 2009a], the claim follows. $\Box$

Appendix E. Determinant of the Jacobian for NRF
Lemma 2.
The proof is a generalization of the proof of Lyu and Simoncelli [2009]. Due to the chain rule, the Jacobian of the entire transformation is the product of the Jacobians of each single step, i.e. the rescaling of a subset of the dimensions for one single inner node. The Jacobian for the other dimensions is simply the identity matrix. Therefore, the determinant of the Jacobian for each single step is the determinant of the radial transformation on the respective dimensions. We show how to compute the determinant for a single step.

Assume that we reached a particular node $I$ in Algorithm 8.1. The leaves, which have been rescaled by the preceding steps, are called $t_I$. Let $\xi_I = \frac{g_I(f_I(t_I))}{f_I(t_I)}\cdot t_I$ with $g_I(r) = (F_{\perp\!\!\!\perp}^{-1}\circ F_s)(r)$. The general form of a single Jacobian is

\[ \frac{\partial\xi_I}{\partial t_I} = t_I\cdot\frac{\partial}{\partial t_I}\left(\frac{g_I(f_I(t_I))}{f_I(t_I)}\right)^{\!\top} + \frac{g_I(f_I(t_I))}{f_I(t_I)}\, I_{n_I}, \]

where

\[ \frac{\partial}{\partial t_I}\left(\frac{g_I(f_I(t_I))}{f_I(t_I)}\right) = \left(\frac{g_I'(f_I(t_I))}{f_I(t_I)} - \frac{g_I(f_I(t_I))}{f_I(t_I)^2}\right)\frac{\partial}{\partial t_I}f_I(t_I). \]

Let $y_i$ be a leaf in the subtree under $I$ and let $I, J_1, \ldots, J_k$ be the path of inner nodes from $I$ to $y_i$; then

\[ \frac{\partial}{\partial y_i}f_I(t_I) = v_I^{1-p_I}\, v_{J_1}^{p_I-p_{J_1}}\cdots v_{J_k}^{p_{J_{k-1}}-p_{J_k}}\,|y_i|^{p_{J_k}-1}\cdot\mathrm{sgn}\, y_i. \]

If we denote $r = f_I(t_I)$ and $\zeta_i = v_{J_1}^{p_I-p_{J_1}}\cdots v_{J_k}^{p_{J_{k-1}}-p_{J_k}}|y_i|^{p_{J_k}-1}\,\mathrm{sgn}\, y_i$ for the respective $J_k$, we obtain

\[ \det\left(t_I\cdot\frac{\partial}{\partial t_I}\left(\frac{g_I(f_I(t_I))}{f_I(t_I)}\right)^{\!\top} + \frac{g_I(f_I(t_I))}{f_I(t_I)}I_{n_I}\right) = \det\left(\left(\frac{g_I'(r)}{r}-\frac{g_I(r)}{r^2}\right)r^{1-p_I}\, t_I\zeta^\top + \frac{g_I(r)}{r}I_{n_I}\right). \]

Now we can use Sylvester's determinant formula $\det(I_n + b\, t_I\zeta^\top) = \det(1 + b\, t_I^\top\zeta) = 1 + b\, t_I^\top\zeta$, or equivalently

\[ \det(a I_n + b\, t_I\zeta^\top) = \det\left(a\left(I_n + \frac{b}{a}t_I\zeta^\top\right)\right) = a^n\det\left(I_n + \frac{b}{a}t_I\zeta^\top\right) = a^{n-1}(a + b\, t_I^\top\zeta), \]

as well as $t_I^\top\zeta = f_I(t_I)^{p_I} = r^{p_I}$, to see that

\begin{align*}
\det\left(\left(\frac{g_I'(r)}{r}-\frac{g_I(r)}{r^2}\right)r^{1-p_I}\, t_I\zeta^\top + \frac{g_I(r)}{r}I_{n_I}\right) &= \frac{g_I(r)^{n_I-1}}{r^{n_I-1}}\left(\frac{g_I(r)}{r} + \left(\frac{g_I'(r)}{r}-\frac{g_I(r)}{r^2}\right)r^{1-p_I}\, r^{p_I}\right)\\
&= \frac{g_I(r)^{n_I-1}}{r^{n_I-1}}\,\frac{d}{dr}g_I(r).
\end{align*}

The derivative $\frac{d}{dr}g_I(r)$ is readily computed via

\[ \frac{d}{dr}g_I(r) = \frac{d}{dr}(F_{\perp\!\!\!\perp}^{-1}\circ F_s)(r) = \frac{\varrho_s(r)}{\varrho_{\perp\!\!\!\perp}(g_I(r))}. \]

Multiplying the single determinants, along with $\det W$ for the final step of the chain rule, completes the proof. $\Box$

References
[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Dec 2007. ISBN 0691132984.
[2] M. Bethge. Factorial coding of natural images: How effective are linear models in removing higher-order dependencies? J. Opt. Soc. Am. A, 23(6):1253-1268, June 2006.
[3] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2):303-353, 1999. ISSN 0895-4798.
[4] J. Eichhorn, F. Sinz, and M. Bethge. Simple cell coding of natural images in V1: How much use is orientation selectivity? (in preparation), 2008.
[5] K. T. Fang, S. Kotz, and K. W. Ng. Symmetric Multivariate and Related Distributions. Chapman and Hall, New York, 1990.
[6] C. Fernandez, J. Osiewalski, and M. Steel. Modeling and inference with ν-spherical distributions. Journal of the American Statistical Association, 90(432):1331-1340, Dec 1995.
[7] A. Gupta and D. Song. L_p-norm spherical distribution. Journal of Statistical Planning and Inference, 60:241-260, 1997.
[8] A. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58(3):54-59, 1962.
[9] A. Hyvärinen and P. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Comput., 12(7):1705-1720, 2000.
[10] A. Hyvärinen and U. Köster. FastISA: A fast fixed-point algorithm for independent subspace analysis, pages 371-376, 2006.
[11] A. Hyvärinen and U. Köster. Complex cell pooling and the statistics of natural images. Network: Computation in Neural Systems, 18(2):81-100, 2007.
[12] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492, Oct 1997. doi: 10.1162/neco.1997.9.7.1483.
[13] D. Kelker. Distribution theory of spherical distributions and a location-scale parameter generalization. Sankhya: The Indian Journal of Statistics, Series A, 32(4):419-430, Dec 1970. doi: 10.2307/25049690.
[14] M. Kowalski, E. Vincent, and R. Gribonval. Under-determined source separation via mixed-norm regularized minimization. 2008.
[15] T. Lee and M. Lewicki. The generalized Gaussian mixture model using ICA. In P. Pajunen and J. Karhunen, editors, ICA '00, pages 239-244, Helsinki, Finland, June 2000.
[16] M. Lewicki and B. Olshausen. Probabilistic framework for the adaptation and comparison of image codes. J. Opt. Soc. Am. A, 16:1587-1601, 1999.
[17] M. S. Lewicki. Efficient coding of natural sounds. Nat Neurosci, 5(4):356-363, Apr 2002. doi: 10.1038/nn831.
[18] S. Lyu and E. P. Simoncelli. Nonlinear extraction of independent components of natural images using radial gaussianization. Neural Computation, 21(6):1485-1519, Jun 2009. doi: 10.1162/neco.2009.04-08-773.
[19] J. H. Manton. Optimization algorithms exploiting unitary constraints. IEEE Transactions on Signal Processing, 50:635-650, 2002.
[20] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:560-561, 1996.
[21] J. Osiewalski and M. F. J. Steel. Robust Bayesian inference in l_q-spherical models. Biometrika, 80(2):456-460, Jun 1993.
[22] M. W. Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759-813, April 2008.
[23] E. Simoncelli. Statistical models for images: compression, restoration and synthesis. In Signals, Systems & Computers, 1997. Conference Record of the Thirty-First Asilomar Conference on, volume 1, pages 673-678, 1997. doi: 10.1109/ACSSC.1997.680530.
[24] F. Sinz and M. Bethge. The conjoint effect of divisive normalization and orientation selectivity on redundancy reduction. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Twenty-Second Annual Conference on Neural Information Processing Systems, pages 1521-1528, Red Hook, NY, USA, 2009. Curran. URL http://nips.cc/Conferences/2008/.
[25] F. Sinz, S. Gerwinn, and M. Bethge. Characterization of the p-generalized normal distribution. Journal of Multivariate Analysis, 100(5):817-820, May 2009a. doi: 10.1016/j.jmva.2008.07.006.
[26] F. Sinz, E. P. Simoncelli, and M. Bethge. Hierarchical modeling of local image features through L_p-nested symmetric distributions. In Twenty-Third Annual Conference on Neural Information Processing Systems, pages 1-9, Dec 2009b. URL http://nips.cc/Conferences/2009/.
[27] D. Song and A. Gupta. L_p-norm uniform distribution. Proceedings of the American Mathematical Society, 125:595-601, 1997.
[28] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996. ISSN 0035-9246.
[29] M. Wainwright and E. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. Solla, T. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems (NIPS*99), volume 12, pages 855-861, Cambridge, MA, May 2000. MIT Press.
[30] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49-67, 2006.
[31] C. Zetzsche, B. Wegmann, and E. Barth. Nonlinear aspects of primary vision: entropy reduction beyond decorrelation. In Int'l Symposium, Soc. for Information Display, volume XXIV, pages 933-936, 1993.
[32] L. Zhang, A. Cichocki, and S. Amari. Self-adaptive blind source separation based on activation functions adaptation. Neural Networks, IEEE Transactions on, 15:233-244, 2004.
[33] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 2008.
E-mail address: [email protected]

Max Planck Institute for Biological Cybernetics, Spemannstraße 41, 72076 Tübingen, Germany
E-mail address: [email protected]