Modeling with a Large Class of Unimodal Multivariate Distributions
MModeling with a Large Class of UnimodalMultivariate Distributions
Marina Silva Paez and Stephen G. Walker Instituto de Matem´atica, Universidade Federal do Rio de Janeiro,Brazil Department of Statistics and Data Sciences, University of Texas atAustin, U.S.A
Abstract
In this paper we introduce a new class of multivariate unimodal distributions,motivated by Khintchine’s representation. We start by proposing a univariatemodel, whose support covers all the unimodal distributions on the real line. Theproposed class of unimodal distributions can be naturally extended to higher di-mensions, by using the multivariate Gaussian copula. Under both univariate andmultivariate settings, we provide MCMC algorithms to perform inference aboutthe model parameters and predictive densities. The methodology is illustrated withunivariate and bivariate examples, and with variables taken from a real data-set.
Keywords : unimodal distribution, multivariate unimodality, mixture models, nonpara-metric Bayesian inference.
The clustering of data into groups is one of the current major topics under research. Itis fundamental in many statistical problems; most notably in machine learning, see, for1 a r X i v : . [ s t a t . M E ] J un xample, [29].The traditional approach is to model the data via a mixture model, see [31] for fi-nite mixture distributions, and, more recently, [13]. The idea, which carries through toinfinite mixture models, is that each component in the mixture is represented by a para-metric density function, typically from the same family, which is usually the normaldistribution. This assumption, often made for convenience, supposes each cluster orgroup can be adequately modeled via a normal distribution. For if a group requires twosuch normals, then it is deemed there are in fact two groups. Yet two normals could beneeded even if there is one group and it happens to be skewed, for example.On the other hand, if we start by thinking about what a cluster could be representedby in terms of probability density functions, then a unimodal density is the most obvi-ous choice. Quite simply, with lack of further knowledge, i.e. only observing the data,a bimodal density would indicate two clusters. This was the motivation behind [25],which relies heavily on the unimodal density models being defined on the real line, forwhich there are no obvious extensions to the multivariate setting.There are representations of unimodal density functions on the real line (e.g. [18]and [10]) and it is not difficult to model such a density adequately. See, for example,[3], [23] and [25]. Clearly, the aim is to provide large classes of unimodal densities andhence infinite dimensional models are often used.However, while there is an abundance of literature on unimodal density functionmodeling on the real line, there is a noticeable lack of work in two and higher dimen-sions. The reasons are quite straightforward; first there is a lack of a representation forunimodal distributions outside of the real line and, second, the current approaches tomodeling unimodal densities on the real line do not naturally extend to higher dimen-sions.The aim in this paper is to introduce and demonstrate a class of unimodal densitiesfor a multivariate setting. While marginal density functions will be modeled using the[18] representation on the real line, the dependence structure will be developed usinga Gaussian copula model. To elaborate, using an alternative form of representation of2nimodal densities on the real line, we can write a unimodal density as f ( y ) = (cid:90) f ( y | x ) x. . (1)This then naturally extends to a bivariate setting with marginals of the form (1) via theuse of a copula density function c ( x , x ) ; f ( y , y ) = (cid:90) (cid:90) f ( y | x ) f ( y | x ) c ( x , x ) x. x. . Using a Gaussian copula, which has the ability to model pairwise dependence, we canthen easily proceed upwards to the general multivariate unimodal density.The layout of the paper is as follows: In section 2 we provide the key representationof unimodal densities on the real line which we adapt in order to define continuous den-sities. Otherwise there is an issue of continuity at the mode. With a full representationof unimodal densities, we need a novel MCMC algorithm, particularly, and perhapsstrangely, to include a non-zero mode. Section 3 then deals with the multivariate set-ting. There are peculiarities to the higher dimension case with the MCMC and again itis the location of the mode parameter which needs special attention. Section 4 providesa real data analysis.
Our aim in this paper is to start with the representation of [18] for unimodal densitieson the real line and extend it to higher dimensions in a natural way, using the mul-tivariate Gaussian copula. The representation of [18] is given by Y = X Z , where X is uniform on [0 , and Z , independent of X , has any distribution. Then Y has aunimodal distribution with mode at . To see this let us consider the situation of y > .So P ( Y ≤ y ) = (cid:90) P ( Z ≤ y/x ) x.and hence f ( y ) = (cid:90) x − g ( y/x ) x. , g is used to represent the density function of Z . To see more clearly the uni-modality at , let us use the transform s = y/x , so f ( y ) = (cid:90) ∞ y s − g ( s ) s. , (2)which is maximized at y = 0 .This representation has been widely used and is usually presented as a mixture ofuniform distribution; i.e. f ( y ) = (cid:90) s − (0 < y < s ) G. ( s ) . The prior for G , in a Bayesian nonparametric setting, is usually taken as a Dirichletprocess ([11]). To denote this we write G ∼ D ( M, G ) , where M > is a scaleparameter, and G a distribution function. In particular, E ( G ) = G .Now [3] and [23], among others, have worked with this specification and extendedto the real line with arbitrary mode by using U ( y | κ − s, κ + s ) for some κ ≥ . Thisspecification leads to a symmetric density.Furthermore, [25] develop the model to allow for asymmetry by adopting the ideaof [12]; incorporating an asymmetry parameter λ in the uniform kernel, f ( y | λ, κ, G ) = (cid:90) U (cid:0) y | κ − se − λ , κ + se − λ (cid:1) G. ( s ) . (3)So (3) defines a unimodal density determined by ( λ, κ, G ) , where κ is the locationparameter, λ defines the asymmetry, and the distribution G defines characteristics suchas variance, kurtosis, tails and higher moments. With this representation, the supportof the model proposed by [25] includes all symmetric unimodal densities and a largeclass of asymmetric ones.However, when extending to the real line, and using a natural density for g such asthe normal, we encounter the situation of f (0) = ∞ . In fact, on the real line, g needsto be somewhat unusual to ensure that f (0) (cid:54) = ∞ . See the form of density in (2).To resolve this problem, we instead work with Y = X/Z ; so again looking at y > , we have P ( Y ≤ y ) = (cid:90) P ( Z ≥ x/y ) x. . f ( y ) = (cid:90) ( x/y ) g ( x/y ) x. (4)and using the transform s = x/y we obtain f ( y ) = (cid:90) /y s g ( s ) s. . (5)Thus we see the mode is again at y = 0 and f (0) will be finite subject to the morereasonable assumption that sg ( s ) is integrable at , rather than the more unreasonableassumption that s − g ( s ) is integrable at in the former setting. Here we further investigate the choice of Z being a normal distribution with the repre-sentation Y = X/Z . We can easily extend (5) to the whole real line; resulting in f ( y ) = ( y > (cid:90) /y s g ( s ) s. + ( y < (cid:90) /y | s | g ( s ) s. . (6)For a mode at κ (cid:54) = 0 we simply exchange y for y − κ . If the mode is at , then for lim y ↓ f ( y ) = lim y ↑ f ( y ) , we need (cid:90) ∞ sg ( s ) s. = (cid:90) ∞ | s | g ( s ) s. . Our aim is to take g as a large class of density functions on the real line and the fullclass is given by mixtures of normal distributions. Thus, it is expedient to compute (6)for g ( s ) normal with mean µ and variance σ . That is, for ξ > f ( ξ ) = (cid:90) ξ s N ( s. | µ, σ ) , which, after some algebra, results in f ( ξ ) = µ (cid:20) Φ (cid:18) ξ − µσ (cid:19) − Φ (cid:18) − µσ (cid:19)(cid:21) + σ (cid:20) φ (cid:18) − µσ (cid:19) − φ (cid:18) ξ − µσ (cid:19)(cid:21) , where φ and Φ are the pdf and cdf, respectively, for the standard normal distribution.With ξ < we simply need to rearrange the signs, so f ( ξ ) = (cid:12)(cid:12)(cid:12)(cid:12) µ (cid:20) Φ (cid:18) ξ − µσ (cid:19) − Φ (cid:18) − µσ (cid:19)(cid:21) + σ (cid:20) φ (cid:18) − µσ (cid:19) − φ (cid:18) ξ − µσ (cid:19)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) . (7)5ence, this f ( ξ ) holds for all ξ ∈ ( −∞ , + ∞ ) .To move the mode to κ , rather than , write f ( y | µ, σ, κ ) = f (1 / ( y − κ )) with f ( · ) given by (7). We could also maintain the representation (4) in which case we can write f ( y | µ, σ, κ ) = 1 σ √ π (cid:90) { x/ ( y − κ ) } exp (cid:40) − (cid:18) x/ ( y − κ ) − µσ (cid:19) (cid:41) x. . The usefulness of this is that it provides a latent joint density function which will behelpful when it comes to model estimation, namely, f ( y, x | µ, σ, κ ) ∝ { x/ ( y − κ ) } exp (cid:40) − (cid:18) x/ ( y − κ ) − µσ (cid:19) (cid:41) . (8)For the forthcoming mixture model, where we consider mixtures of normals for Z , wefind it convenient for identifiability concerns, to use the parametrization ( µ, c ) , where c = ( µ/σ ) . The model proposed for the class of unimodal density functions on IR is a mixture ofthe unimodal densities with mode at κ proposed in section 2.1, i.e. (8). This model canwe written as f ( y ) = ∞ (cid:88) j =1 w j f ( y | µ j , c, κ ) . (9)The mode is therefore at κ and we allow parameter c , which is a transformation ofthe coefficient of variation, to remain fixed across the normal components. This isanalogous to keeping the variance terms fixed, for identifiability concerns, in a mixtureof normal model. The support of the model is not diminished by doing this.Therefore, we can write f ( y ) = (cid:90) f ( y | µ, c, κ ) G. ( µ ) (10)where G is a discrete distribution with mass w j at the point µ j . Hence, in a Bayesiancontext, the parameters to be estimated and first assigned prior distributions are (( µ j , w j ) , c, κ ) , with the constraints being that the weights sum to 1, c > , and the6 µ j ) are distinct. As an illustration, Figure 1 shows the density function of a mixtureof two components with κ = 10 , c = 1 , µ = ( − , and w = (0 . , . .The claim is that model 10 can be arbitrarily close to any unimodal density on thereal line. The key is writing y = X/Z rather than y = XZ , and with the former wecan employ the full support abilities of a mixture of normals without the problem of adiscontinuity at the mode. To complete the model described in the previous section, we now set the prior distribu-tions for the unknown model parameters. Specific values are assigned in the illustrationsection. • The prior distribution for the parameters ( µ j ) are normal with zero mean andvariance σ µ ; while the prior for κ will also be normal with zero mean and vari-ance σ κ . The prior for c will be gamma with parameters ( α c , β c ) . • The prior distribution for the weights ( w j ) has a stick-breaking construction([26]), given by w = v and w j = v j (cid:89) l
The concept of unimodality is well established for univariate distributions, and, fol-lowing Khintchine’s representation ([18]), it can be defined in different but equivalentways. It is not straightforward, however, to extend the notion of unimodality to higherdimensional distributions, as different definitions of unimodality are not equivalentwhen extended.The first attempts to define multivariate unimodality were made for symmetricdistributions only, by [1] and [28]. Again under symmetry, [7] introduced the no-tion of convex unimodality and monotone unimodality . The authors also defined lin-ear unimodality , which can be applied to asymmetric distributions. A random vector ( Y , . . . , Y n ) is linear unimodal about if every linear combination (cid:80) ni =1 a i Y i has aunivariate unimodal distribution about . This, however, is not a desired definition, aspointed out by [7] themselves, as the density of such may not be a maximum at themode of univariate unimodality.Further, [21] define α − unimodality about for a random variable Y in IR n whenfor all real, bounded, nonnegative Borel function g on IR n the function t α E ( g ( tY )) isnon-decreasing as t increases in [0 , ∞ ) , where E denotes the expectation with respectto the density of Y . If X has a density function f with respect to the Lebesgue measure µ n on IR n , then X is α − unimodal if and only if t n − α f ( ty ) is decreasing in t ≥ .The distribution of a random vector ( Y , . . . , Y n ) is said to be star unimodal about ,according to [7], if it is n -unimodal in the sense of [21].According to [7], a linear unimodal distribution need not be star unimodal , andvice versa. Thus there is no implied relationship between star and linear unimodality .Another useful definition of unimodality is given by [5]: A multivariate density f on IR d is orthounimodal with mode at m = ( m , . . . , m d ) if for each j , f ( y , . . . , y d ) is a decreasing function of y j as y j → ∞ for y j ≥ m j , and as x j → −∞ for y j ≤ m j ,when all other components are held fixed.13 rthounimodal densities are widely applicable. They form a robust class in thesense that all lower-dimensional marginals of orthounimodal densities are also or-thounimodal . Also, they are star unimodal , according to [5], and most bivariate densi-ties are either orthounimodal , or orthounimodal after a linear transformation. Narrowernotions than orthounimodality can also be explored, and they are discussed in [6].For the two dimensional case only, [27] generalizes Khintchine’s stating that thedistribution function of ( Y , Y ) is unimodal if and only if it can be written as ( Y , Y ) = ( X Z , X Z ) , (11)where ( X , X ) and ( Z , Z ) are independent and ( X , X ) is uniformly distributedin [0 , . For higher dimensions, Khintchine’s representation was generalized by [17]for the symmetric case. He defines symmetric multivariate unimodal distributions on IR d as generalized mixtures (in the sense of integrating on a probability measure space)of uniform distributions on symmetric, compact and convex sets in IR d . In this paperwe will use a form of (11) while guaranteeing orthounimodality.Comprehensive reviews on multivariate unimodality can be found in [4] and [7].Also [19] reviews the literature and introduces and discusses a new class of nonpara-metric prior distributions for multivariate multimodal distributions. To extend the univariate unimodality proposed in section 2.1, we start by de-fining marginal variables Y , Y , . . . , Y d via Y l = X l /Z l , where X l ∼ U (0 , and Z l ∼ N ( µ l , µ l /c l ) , for l = 1 , , . . . , d . The dependence between variables Y =( Y , Y , . . . , Y d ) can be obtained imposing dependence to either Z = ( Z , . . . , Z d ) or X = ( X , . . . , X d ) , or both.Our first attempt was to allow dependence only in Z through a multivariate nor-mal distribution with a general variance-covariance matrix. This way our constructionwould be following the generalization of Khintchine’s representation made by [27] forthe bivariate case. The dependence obtained between Y under that construction, how-ever, for a bivariate setting, is limited. Better results were obtained when imposingdependence in X instead, through a Gaussian copula.14o demonstrate: Figure 6 shows the approximate 95% confidence intervals of thecorrelations between Y when varying the correlations between Z (top) and X (bottom)respectively, through the interval {− , − . , − . , . . . , . , } . These confidence in-tervals were found by the simulation of samples of size . The figure was con-structed considering µ = (10 , and µ = ( − , . Similar results to the ones with µ = (10 , were found for µ = ( − , − . In the same way, µ = (10 , − and µ = ( − , present similar results. The magnitude of µ did not seem to changethis output significantly. It can be seen that when imposing dependence through Z , thecorrelation between Y do not vary much beyond the interval [ − . , . , while the cor-relation between the ( Y , . . . , Y d ) and the correlation between the ( X , . . . , X d ) wasfound to be similar.Due to the results just described we opted to work with the second construction:i.e. Y l = X l /Z l , Z l ind ∼ N ( µ lj , µ lj /c l ) , l = 1 , . . . , d , and X = ( X , X , . . . , X d ) aremodeled through the Gaussian copula with correlation matrix Σ . Once again, the modeat κ = (0 , . . . , can be exchanged, substituting Y for ( Y − κ ) where κ , as well as Y ,is now a d − dimensional vector.Given x , µ j = ( µ j , . . . , µ dj ) , c = ( c , . . . , c d ) and κ = ( κ , . . . , κ d ) , the obser-vations y = ( y , y , . . . , y d ) are independent, and f ( y | x, µ j , c, κ ) is given by f ( y | x, µ j , c, κ ) = d (cid:89) l =1 ( x l /y l ) g l ( x l /y l | µ lj , c l , κ l ) . The marginal model, f ( y | µ j , c, κ ) , can be obtained through the integral f ( y | µ j , c, κ ) = (cid:90) [0 , d d (cid:89) l =1 ( x l /y l ) g l ( x l /y l | µ lj , c l , κ l ) c. ( x , . . . , x d | Σ) , (12)where c ( x , . . . , x d | Σ) represents the multivariate Gaussian copula with correlationmatrix Σ : c ( x , . . . , x d | Σ) = 1 (cid:112) | Σ | exp − Φ − ( x ) ... Φ − ( x d ) (Σ − − I ) Φ − ( x ) ... Φ − ( x d ) . Note, however, that the dependence created between variables Y is defined notonly by the correlation matrix Σ , but also by the sign of the components of µ . As15n illustration, we sampled four bivariate data-sets, and examined the dispersion plotsbetween y and y . Figure 7 shows the dispersion plot between data sampled with µ =3 , and either µ = 10 or µ = − , and with ρ = Σ[1 ,
2] = 0 . , ρ = 0 . or ρ = − . .It can be seen that positive dependence is created when ( µ × µ × ρ ) is positive, andnegative dependence is created otherwise. To help the model identification, we assumethat ρ ∈ [0 , . This restriction does not take away the model flexibility, and negativecorrelations between the variables can be captured by opposite signs in µ .Let us now define the mixture of these densities, which is the unimodal multivariatemodel of interest. For observation Y = ( Y , . . . , Y d ) the model is: f ( y ) = ∞ (cid:88) j =1 w j f ( y | µ j , c, κ ) , where f ( y | µ j , c, κ ) is in ( ). Note that, by construction, the proposed distribution ismultivariate orthounimodal , which is, therefore, also star unimodal . Unlike the uni-variate case, however, we cannot solve the integral in (12) and obtain f ( y | µ j , c, κ ) analytically. Therefore, we must come up with a different algorithm to solve the sam-pling of κ . As a “bridge” to the multivariate case, this algorithm is first developed forthe univariate case, and then extended to higher dimensions. The univariate “bridge”algorithm is presented in subsection 3.3. To mimic the situation in which the latent variable x cannot be integrated out, wepresent in this section an algorithm which samples from x = ( x , . . . , x n ) , and samplesfrom the other parameters given x . We consider the same prior distributions specifiedin section 2.2. For x we assume a uniform prior at [0 , n . The algorithm goes asfollows:1. Sample u i and obtain N = max { ξ − ( d i ) } as previously in section 2.2;2. Sample M as previously in section 2.2;3. Sample ( µ j ) for j = 1 , . . . , N as before, but using f ( y i | x i , µ j , c, κ ) (propor-tional to 8) instead of f ( y i | µ j , c, κ ) in the Metropolis ratio. This way, the M-H16cceptance criterion is given by: q = (cid:81) ni =1 f ( y i | x i , µ d i , c ∗ , κ ) π ( c ∗ ) (cid:81) ni =1 f ( y i | x i , µ d i , c, κ ) π ( c ) .
4. The full conditional distribution of c in this case has a closed form, and it isupdated via Gibbs sampling. Given y , x , µ and κ , we have: c | x, µ, κ, y ∼ ga (cid:32) n α c , n (cid:88) i =1 (cid:18) x i y i − κ − µ d i (cid:19) + β c (cid:33) .
5. Sample ( x, d, κ | µ, c, y ) in two stages: first sample from f ( d, κ | µ, c, y ) and thenfrom f ( x | d, κ, µ, c, y ) . d and κ are sampled as before, and ( x i ) for i = 1 , . . . , n ,is sampled via rejection sampling, as follows: • sample a proposal ˜ x i ∼ U [0 , ; • compute f (˜ x i | c, µ d i , y i , κ ) ∝ ˜ x i exp (cid:26) − c µ di (cid:16) ˜ x i y i − κ − µ i (cid:17)(cid:27) • compute the value ˆ x which maximizes the function: ˆ x = min (cid:26) , µ d i ( y i − κ )2 + (cid:12)(cid:12)(cid:12) ( y i − κ ) (cid:113) µ d i /c + 0 . µ d i (cid:12)(cid:12)(cid:12)(cid:27) . • accept ˜ x i with probability min (cid:110) , f ( ˜ x i | c,µ di ,y i ,κ ) f ( ˆ x i | c,µ di ,y i ,κ ) (cid:111) , or go back to the firststep.The results obtained under this algorithm were similar to those obtained from theprevious one, despite being considerably slower. The new algorithm, however, can beeasily extended to the multivariate case. In this paper we will discuss the algorithm andresults obtained for the bivariate case, and leave higher dimensionality for future work. In this section we present the algorithm developed to handle bivariate observations( p = 2 ). Under this scenario, the unknown quantities are: ( µ j = ( µ j , µ j ) , v j ) , j =1 , , . . . ; ( d i , u i ) , i = 1 , . . . , n ; c = ( c , c ) ; M ; κ = ( κ , κ ) , x li , l = 1 , , i =1 , . . . , n , and also the correlation parameter in the bivariate Gaussian Copula ( ρ ). As17ointed out in Section 3.2, without compromising the model flexibility, we can assumethat ρ ∈ [0 , . Therefore we assigned a U [0 , prior for this parameter.The algorithm proposed for the bivariate case is given below:1. Sample u i and obtain N = max { ξ − ( d i ) } as previously;2. Sample M as previously;3. Sample ( µ j ) and ( µ j ) , independently for j = 1 , . . . , N , following the sameidea as in (3.3);4. Sample c and c via Gibbs sampling. Given y , x , µ , and κ , we have: c l | x, µ, κ ∼ ga (cid:32) n α c , n (cid:88) i =1 (cid:18) x li y li − κ − µ l,d i (cid:19) + β c (cid:33) , l = 1 , .
5. Sample κ and κ independently, using a similar algorithm as previously. Notethat every time κ and κ are updated, some elements of d might also be changed.6. Sample d i directly from its probability mass function, P ( d i = j ) = w j ξ j f ( y i | µ j , c, κ ) f ( y i | µ j , c, κ ) for j = 1 , . . . , N i , where N i = max { j : u i < ξ j } .7. Sample ( x , x ) via rejection sampling. Here we consider two possible algo-rithms:7.1. • Sample from the copula C ( x ∗ i , x ∗ i ) : (Φ − ( x ∗ ) , Φ − ( x ∗ )) T ∼ N , ρρ ; • compute f ( x ∗ i , x ∗ i | c, µ d i , y i , µ d i , y i , κ ) ∝ (cid:89) j =1 x ∗ ji exp (cid:40) − c µ d ji (cid:18) ˜ x ji y ji − κ − µ ji (cid:19)(cid:41) ; compute the values ˆ x and ˆ x which maximize the function above: ˆ x j = min (cid:26) , µ d ji ( y ji − κ )2 + (cid:12)(cid:12)(cid:12) ( y ji − κ ) (cid:113) µ d ji /c + 0 . µ d ji (cid:12)(cid:12)(cid:12)(cid:27) , j = 1 , • accept ( x ∗ i , x ∗ i ) with probability min (cid:26) , f ( x ∗ i , x ∗ i | c, µ d i , y i , µ d i , y i , κ ) f (ˆ x i , ˆ x i | c, µ d i , y i , µ d i , y i , κ ) (cid:27) , or go back to the first step.7.2. • Sample x ∗ ji from the function f ( x ∗ i , x ∗ i | c, µ d i , y i , µ d i , y i , κ ) ∝ x ∗ ji exp (cid:40) − c µ d ji (cid:18) x ∗ ji y ji − κ − µ ji (cid:19)(cid:41) , through adaptive rejection sampling ([14]); • compute c ( x ∗ i , x ∗ i ) ; • compute the values ˆ x and ˆ x which maximize the copula: ˆ x = ρx and ˆ x = ρx ; • accept ( x ∗ i , x ∗ i ) with probability min (cid:26) , c ( x ∗ i , x ∗ i ) c (ˆ x i , ˆ x i ) (cid:27) , or go back to the first step.The first algorithm proposed to sample from x leads to faster convergence. How-ever, it can be very slow at times, requiring a large amount of simulations untilacceptance. We combined the two algorithms in the following way: Sample from7.1 until it either accepts the proposal or reaches a set limited number of trials.If the latter occurs, sample from x through algorithm 7.2.8. Sample ρ via Metropolis-Hastings. The proposal ρ ∗ is obtained from: log (cid:18) ρ ∗ − ρ ∗ (cid:19) = log (cid:18) ρ − ρ (cid:19) + N (0 , h ρ ) , σ ρ . The M-H acceptance criterion is given by: q = f ( ρ ∗ | x , x ) Q ( ρ ∗ ; ρ ) f ( ρ | x , x ) Q ( ρ ; ρ ∗ ) , where Q ( ρ ∗ ; ρ ) = 1 / ( ρ ∗ (1 − ρ ∗ )) . We accept the proposal with probability min { , q } . In this section we present the results obtained for a simulated bivariate data-set of size n = 100 , from a bivariate Normal distribution: y ∼ N ( κ, Ω) with κ = (30 , , and Ω = 10 ρ y ρ y , with ρ y = 0 . .The algorithm proposed in section 3.4 was applied to sample from the posteriordistribution of the model parameters and obtain samples of the predictive distributions.The algorithm was run for T = 75000 iterations, with a burn in of . Due to auto-correlation in the MCMC chains, samples were recorded at each iterations, leavinga sample of size to perform inference.Figure 8 shows the histograms of the posterior densities of the components of pa-rameter κ . It can be seen that this parameter was reasonably well estimated, with thetrue value being close to the posterior mode. Figure 9 presents a comparison betweenthe histograms of the simulated samples of y and y , and the ones predicted throughthe proposed MCMC algorithm, showing that the predictions seem close to what wasexpected.We also wish to verify if the dependence between y and y was preserved in thepredictions. To do this analysis, we computed the correlation at every samples ofthe predicted ( y , y ) . This time we used the full T = 70000 sampled values, endingup with a sample of size correlations. Figure 10 compares the histogram of thesecorrelations, which can be seen as a proxy of the posterior distribution of ρ y , to the realvalues (in red), showing good predictions.20 Boston Housing Data
As an application, we work with a part of the Boston Housing data-set created by [15]and taken from [20]. This database comprises cases of variables concerninghousing values in the suburbs of Boston, and it has been used in many applications,especially in regression and machine learning, such as [2] and [22]. We chose to workwith two variables of the database: nitric oxides concentration (parts per 10 million)at the household (NOX) and weighted distances to five Boston employment centers(DIS). Exploratory analysis points towards the unimodality of the joint density of NOXand DIS, as can be seen by the contour plot displayed in Figure 11. This contourplot also shows a negative, unusually shaped, dependence between these variables.Histograms of observations NOX and DIS are also presented in Figure 11, showingthat both variables have a non-normal unimodal shape.Our objective in this application is to illustrate the flexibility of our approach incapturing the form of this bivariate distribution and the oddly shaped dependence be-tween NOX and DIS. We worked with a sub-set of n = 100 observations and appliedthe proposed bivariate model with the same priors used for the toy examples. Again,the algorithm was run for T = 75000 iterations, with a burn in of , and recordedat every iteration, totalizing a sample of size for inference.Figure 12 shows the histograms of the predictive density of NOX and DIS, and thecontour plot made with points of the predictive distribution. We observe a highsimilarity between the real and the predicted histograms and dispersion plots, and canconclude that the proposed methodology was able to capture well the behavior of thedata for this example.Note that our approach also allows for predictions of one variable given the otherafter samples of both had been previously observed. As a second exercise, we analyzethe predictive capacity of our model, compared to a more usual linear alternative. Ifthe objective is to predict nitric oxides concentration based on weighted distances toemployment centers, a regression model would probably be considered. As can be seenin Figure 11, however, the data needs transformation for the normal linear regressionmodel assumptions to hold adequately. After exploratory analysis, we propose the21ollowing regression model:NOX − i = α + β log ( DIS i ) + e i , i = 1 , . . . , n,e i ∼ N (0 , σ e ) , with vague Normal priors for α and β , and a vague inverse Wishart prior for σ e .Figure 13 shows the box-plot of the response variable NOX − and the dispersionplot between NOX − and log(DIS), showing a linear dependence between both vari-ables. The simple regression was fit using software OpenBugs 3.2.3 ([30]).A comparison is then performed between predictions of values of NOX givenDIS, after observing n = 100 cases of both DIS and NOX, under both models. Fig-ure 14 presents the 95% credible intervals and posterior medians under the proposednonparametric model and the simple regression model, being compared with the realvalues. Figure 15 presents the histograms of the posterior distribution of the first threepredicted values. As is typical, nonparametric methods produce larger credible inter-vals than the parametric counterparts, yet are typically centered in the true values. Theparametric versions however can simply be wrong. In this paper we have proposed a new class of unimodal densities whose support includeall the unimodal densities on the real line. Our motivation for this is that a mixture ofthe proposed univariate model can be adequate to cluster data, with each cluster beingrepresented by a unimodal density.One of our objectives was also to be able to create a class of unimodal densities thatcould be naturally extended to the multivariate case. Our models allow this through theuse of the multivariate Gaussian copula. This way, modeling multivariate clusters canalso benefit from the methodology developed in this paper.The proposed models, however, cannot be dealt with analytically. We have alsoproposed an MCMC algorithm to obtain samples of the posterior distribution of themodel parameters and to obtain predictive densities. The methodology was illustrated22ith univariate and bivariate examples, and with two variables taken from the BostonHousing Data ([15]).Modeling multivariate unimodal densities with dimension higher than two can beeasily done with a slight modification to the code presented in section 3.4. In thatcase, instead of sampling from a single correlation parameter ρ , we must sample froma correlation matrix, which can be done following [32]. Preliminary results showedthis to be effective for a three-variate model. For future work we must test the codeextensively for three or more dimensions, and finally do clustering for an arbitrarydimension d , through the mixture of the densities proposed in this paper. The first author acknowledges the support of a research grant from CNPq-Brazil.
References [1] Anderson, T. W. The integral of a symmetric unimodal function. Proc. Amer. Math.Soc. , 170-176 (1955).[2] Belsley, D.A., Kuh, E. and Welsch, R.E. Regression Diagnostics: Identifying In-fluential Data and Sources of Collinearity. Wiley. Pages 244-261. (1980).[3] Brunner, J.L. and Lo, A.Y. Bayes methods for symmetric unimodal density and itsmode. Ann. Stat. , 1550-1566 (1989).[4] Dai, T. Master Dissertation. On multivariate unimodal distributions. The Universityof British Columbia. Canada. (1982).[5] Devroye, L. Random Variate Generation for Multivariate Unimodal Densities.ACM Trans. Model. Comp. Simul. , 447477 (1997).[6] Dharmadhikari, S. and Joag-dev, K. Unimodality, Convexity, and Applications.Academic Press, New York. (1988). 237] Dharmadhikari, S. and Jogdeo, K. Multivariate Unimodality. The Annals of Statis-tics, , 607-613 (1976).[8] Doornik, A. Object-Oriented Matrix Programming Using Ox. London, England:Press and Oxford. (2008).[9] Escobar, M.D. and West, M. Bayesian density estimation and inference using mix-tures. J. Amer. Stat. Assoc. , 577-588 (1995).[10] Feller, W. An Introduction to Probability Theory and its Applications (2nd Edi-tion). Wiley (1971).[11] Ferguson, T.S. A Bayesian analysis of some nonparametric problems. Ann.Statist. , 209-230 (1973).[12] Fernandez, C., Steel, M.F.J. On Bayesian modeling of fat tails and skewness. J.Amer. Stat. Assoc. , 359-371 (1998).[13] Frauhwirth-Schnatter, S. Finite Mixture and Markov Switching Models. SpringerSeries in Statistics. New York: Springer-Verlag (2006).[14] Gilks, W. R., Wild, P. Adaptive Rejection Sampling for Gibbs Sampling. Appl.Statist. , 337-348 (1991).[15] Harrison, D. and Rubinfeld, D.L. Hedonic prices and the demand for clean air. J.Environ. Economics & Management. , 81-102 (1978).[16] Kalli, M., Griffin, J.E. and Walker, S.G. Slice sampling mixture models. Stat.Comput. , 93-105 (2011).[17] Kanter, M.. Unimodality and dominance for symmetric random vectors. Trans.Amer. Math. Soc., , 6585 (1977).[18] Khintchine, A.Y. On unimodal distributions. Izvestiya Nauchnolssledova-tel’skoyo Instituta Matematiki i Mekka . 1-7 (1938).2419] Kouvaras, G., Kokolakis, G.: Random multivariate multimodal distributions. In:Skiadas, Ch. (ed.) Recent Advances in Stochastic Modelling and Data Analysis, pp.6875. World Scientific Publishing Co. (2008)[20] Lichman, M. UCI Machine Learning Repository. University of California, Irvine,School of Information and Computer Sciences. http://archive.ics.uci.edu/ml. (2013).[21] Olshen, R. A. and Savage, L. J. Generalized unimodality. J. Appl. Probability ,21-34 (1970).[22] Quinlan, R. Combining Instance-Based and Model-Based Learning. In Proceed-ings on the Tenth International Conference of Machine Learning, 236-243, Univer-sity of Massachusetts, Amherst. Morgan Kaufmann. (1993).[23] Quintana, F.A. and Steel, M.F.J. and Ferreira, J.T.A.S. Flexible univariate contin-uous distributions. Bayesian Analysis. , 497-522 (2009).[24] R Core team. R: A Language and Environment for Statistical Computing. Vienna,Austria: R Foundation for Statistical Computing. (2013).[25] Rodr´ıguez, C.E. and Walker, S.G. Univariate Bayesian nonparametric mixturemodeling with unimodal kernels. Stat. Comput. , 35-49 (2014).[26] Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica ,639-650 (1994).[27] Shepp, L. A. Symmetric Random Walk. Transactions of the American Mathemat-ical Society , 144-153 (1962).[28] Sherman, S. A theorem on convex sets with applications. Ann. Math. Statist., ,763-766. (1955).[29] Teh, Y.W. and Jordan, M.I. Hierarchical Bayesian nonparametric models with ap-plications. In Bayesian Nonparametrics, Chapter 5 (Hjort et al. Eds.). Wiley (2010).[30] Thomas, A. and O’Hara, B. and Ligges, U. and Sturtz, S. Making BUGS Open.R News. , 12-17. (2006). 2531] D. M., Titterington, A. F. M., Smith, U. E., Markov. Statistical Analysis of FiniteMixture Distributions. New York: Wiley (1985).[32] Wu, J., Wang, X. and Walker, S.G. Bayesian Nonparametric Inference for a Mul-tivariate Copula Function. Methodology and Computing in Applied Probability. ,747-763 (2014). 26igure 1: Unimodal density function (8) with a mixture of two components, κ = 10 , µ = ( − , , w = (0 . , . .Figure 2: Toy example of initial value for κ and possible values of s in the MCMCalgorithm.Figure 3: Histograms based on n = 100 points sampled from models A and B. Theblack line represents their respective densities.27igure 4: Histograms based on samples from the posterior of parameter κ underobservations from models A and B . Red lines indicate the true values of the parameter.Figure 5: Histograms based on samples from the predictive distribution underobservations from models A and B . Black line indicates the true density.28igure 6: 95% confidence intervals of the correlation obtained between ( Y , Y ) whenvarying the correlation between ( Z , Z ) (top) and ( X , X ) (bottom) respectively,through the interval {− , − . , − . , . . . , . , } .Figure 7: Dispersion Plot of samples y vs y of size from the simulated bivariateexample. 29igure 8: Histograms of the posterior of κ under simulated bivariate model. Red linesindicate true values.Figure 9: Histograms comparing the original simulated samples (size 100) to the pre-dicted samples under the bivariate model. Black lines represent the true density.30igure 10: Histogram of the estimated correlation between samples of size ofpredicted ( y , y ) . The red line indicates the true correlation.Figure 11: Histograms of variables DIS and NOX, and contour plot of DIS vs NOXfrom the Boston Housing database.Figure 12: Predicted histograms of DIS and NOX, and predicted contour plot of DISvs NOX under the proposed bivariate mixture model.31igure 13: Histogram of NOX − and dispersion plot of log(DIS) vs NOX −1