Horseshoe shrinkage methods for Bayesian fusion estimation
Sayantan Banerjee∗
Operations Management and Quantitative Techniques Area, Indian Institute of Management Indore, Indore, M.P., India
Abstract
We consider the problem of estimation and structure learning in a high-dimensional normal sequence model, where the underlying parameter vector is piecewise constant, or has a block structure. We develop a Bayesian fusion estimation method by using the Horseshoe prior to induce a strong shrinkage effect on successive differences in the mean parameters, while simultaneously imposing sufficient prior concentration for non-zero values of the same. The proposed method thus facilitates consistent estimation and structure recovery of the signal pieces. We provide theoretical justification of our approach by deriving posterior convergence rates and establishing selection consistency under suitable assumptions. We demonstrate the superior performance of the Horseshoe-based Bayesian fusion estimation method through extensive simulations and two real-life examples.
Keywords: Bayesian shrinkage; fusion estimation; Horseshoe prior; posterior convergence rate; piecewise constant function.

∗ Corresponding author. Address: IIM Indore, Rau-Pithampur Road, Indore, M.P. - 453 556, India. e-mail: [email protected]

Introduction
With modern technological advances and massive data storage capabilities, large datasets are becoming increasingly common, with applications in finance, econometrics, bioinformatics, engineering, and signal processing, among others. Flexible modeling of such datasets often requires parameters whose dimension exceeds the available sample size. Plausible inference is possible by finding a lower-dimensional embedding of the high-dimensional parameter.

Sparsity plays an important role in statistical learning in a high-dimensional scenario, where the underlying parameters are 'nearly black' (Donoho et al., 1992), implying that only a small subset of the same is non-zero. Several regularization-based methods that induce sparsity in the models have been proposed in the literature. From a frequentist perspective, such methods include penalization-based methods like ridge regression (Hoerl and Kennard, 1970), lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), SCAD (Fan and Li, 2001), fused lasso (Tibshirani et al., 2005; Rinaldo, 2009), and graphical lasso (Friedman et al., 2008), among others; see Bühlmann and Van De Geer (2011) for frequentist methods in high dimensions. Bayesian methods in high dimensions have been developed more recently, where sparsity is induced via suitable prior distributions on the parameters. These include spike-and-slab priors (Mitchell and Beauchamp, 1988; Ishwaran and Rao, 2005), Bayesian lasso (Park and Casella, 2008), Bayesian graphical lasso (Wang, 2012), and other shrinkage priors like the spike-and-slab lasso (Ročková and George, 2018), Normal Exponential Gamma (Griffin and Brown, 2010), Horseshoe (Carvalho et al., 2009, 2010) and its variants Horseshoe+ (Bhadra et al., 2017) and Horseshoe-like (Bhadra et al., 2019), Dirichlet-Laplace (Bhattacharya et al., 2015), and R2-D2 (Zhang et al., 2016), among others. Theoretical guarantees of such Bayesian procedures have been developed recently as well. We refer the readers to Banerjee et al.
(2021) for a comprehensive overview of Bayesian methods in a high-dimensional framework.

In this paper, we consider the normal sequence model
$$y_i = \theta_i + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, $\sigma^2$ being the error variance, and $\theta = (\theta_1, \ldots, \theta_n)^T \in \mathbb{R}^n$ is the vector of mean parameters. We further assume that the true parameter $\theta_0 = (\theta_{0,1}, \ldots, \theta_{0,n})^T \in \mathbb{R}^n$ is piecewise constant, or has an underlying block structure, in the sense that the transformed parameters $\eta_{0,j} = \theta_{0,j} - \theta_{0,j-1}$, $2 \le j \le n$, belong to a nearly-black class
$$l_0[s] = \left\{\eta \in \mathbb{R}^{n-1} : \#\{j : \eta_j \ne 0, \ 2 \le j \le n\} \le s\right\}, \quad 0 \le s = s(n) \le n.$$
Here $s := \#\{j : \eta_{0,j} \ne 0, \ 2 \le j \le n\}$. Our fundamental goals are two-fold: (i) estimating the mean parameter $\theta$, and (ii) recovering the underlying piecewise constant structure. Piecewise constant signal estimation and structure recovery is an important and widely studied problem. Such signals occur in many applications, including bioinformatics, remote sensing, digital image processing, finance, and geophysics. We refer the readers to Little and Jones (2011) for a review of generalized methods for noise removal in piecewise constant signals, along with potential applications.

Frequentist methods addressing the above problem include penalization methods like the fused lasso method (Tibshirani et al., 2005) and the $\ell_1$-fusion method (Rinaldo, 2009). The Bayesian equivalent of the fused lasso penalty is a Laplace shrinkage prior on the successive differences $\eta$ (Kyung et al., 2010). However, the Laplace prior leads to posterior inconsistency (Castillo et al., 2015). To overcome this problem, Song and Liang (2017) used a heavy-tailed shrinkage prior for the coefficients in a linear regression framework. Motivated by this, Song and Cheng (2020) proposed to use a $t$-shrinkage prior for Bayesian fusion estimation. Shimamura et al.
(2019) use a Normal Exponential Gamma (NEG) prior in this context, and make inference based on the posterior mode.

Motivated by strong theoretical guarantees regarding estimation and structure learning induced by the Horseshoe prior in a regression model (Datta and Ghosh, 2013; Van Der Pas et al., 2014; van der Pas et al., 2017), we propose to investigate its performance in our fusion estimation framework. The Horseshoe prior induces sparsity via an infinite spike at zero, and also possesses a heavy tail to ensure consistent selection of the underlying pieces or blocks. Furthermore, the global scale parameter that controls the level of sparsity in the model automatically adapts to the actual level of sparsity in the true model, as opposed to choosing the scale parameter in the $t$-shrinkage prior on the basis of the underlying dimension $n$. The conjugate structure of the posterior distribution arising out of modeling via a Horseshoe prior allows fast computation while working with the full posterior distribution, so that the samples obtained via MCMC can further be used for uncertainty quantification.

The paper is organized as follows. In the next section, we specify the Bayesian model along with the prior specifications, followed by posterior computations in Section 3. In Section 4, we provide theoretical guarantees of our proposed method by determining posterior convergence rates and establishing posterior selection consistency of the signal pieces. We demonstrate the numerical performance of our method along with other competing methods via simulations in Section 5, followed by real-life applications in two different areas: DNA copy number analysis and solar X-ray flux data analysis. We conclude our paper with a brief discussion of our proposed fusion estimation method along with future directions for research.
Additional lemmas, proofs of main results, and a practical solution for block structure recovery along with numerical results for the same are provided in the Appendix.

The notations used in the paper are as follows. For real-valued sequences $\{a_n\}$ and $\{b_n\}$, $a_n = O(b_n)$ implies that $a_n/b_n$ is bounded, and $a_n = o(b_n)$ implies that $a_n/b_n \to 0$ as $n \to \infty$. By $a_n \lesssim b_n$, we mean that $a_n = O(b_n)$, while $a_n \asymp b_n$ means that both $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold. $a_n \prec b_n$ means $a_n = o(b_n)$. For a real vector $x = (x_1, \ldots, x_n)^T$, the $L_r$-norm of $x$ for $r > 0$ is defined as $\|x\|_r = \left(\sum_{i=1}^n |x_i|^r\right)^{1/r}$. We denote the cardinality of a finite set $S$ by $\#S$. The indicator function is denoted by $\mathbf{1}$.

We consider the normal sequence model (1) and assume that the successive differences of the means belong to a nearly-black class $l_0[s]$. As discussed earlier, frequentist procedures induce sparsity in the model via suitable regularization-based procedures like penalization of the underlying parameters, whereas Bayesian methods usually induce sparsity by imposing suitable prior distributions on the same. For example, for learning a high-dimensional parameter $\theta \in \mathbb{R}^n$, a general version of a penalized optimization procedure can be written as $\arg\min_{\theta \in \mathbb{R}^n} \{l(\theta; y) + \pi(\theta)\}$, where $l(\theta; y)$ is the empirical risk and $\pi(\theta)$ is the penalty function. If $l(\theta; y)$ is defined as the negative log-likelihood (up to a constant) of the observations $y$, the above optimization problem becomes equivalent to finding the mode of the posterior distribution $p(\theta \mid y)$ corresponding to the prior density $p(\theta) \propto \exp(-\pi(\theta))$. In our context, Tibshirani et al.
(2005) proposed the fused lasso estimator $\hat\theta^{FL}$ that induces sparsity on both $\theta$ and $\eta$, defined as
$$\hat\theta^{FL} = \arg\min_{\theta \in \mathbb{R}^n} \left\{ \tfrac{1}{2}\|y - \theta\|_2^2 + \lambda_1 \|\theta\|_1 + \lambda_2 \|\eta\|_1 \right\}, \qquad (2)$$
for suitable penalty parameters $\lambda_1$ and $\lambda_2$. Rinaldo (2009) considered the $\ell_1$-fusion estimator $\hat\theta^{F}$ with penalization of the successive differences only, given by
$$\hat\theta^{F} = \arg\min_{\theta \in \mathbb{R}^n} \left\{ \tfrac{1}{2}\|y - \theta\|_2^2 + \lambda \|\eta\|_1 \right\}, \qquad (3)$$
where $\lambda$ is the corresponding penalty (tuning) parameter. A Bayesian equivalent of the fusion estimator (3) can be obtained by putting a Laplace($\lambda/\sigma$) prior on the successive differences $\eta$ and finding the corresponding posterior mode. As in a normal regression model with the Bayesian lasso (Park and Casella, 2008), the Bayesian fused lasso estimator given by the posterior mode will converge to the true $\eta$ at a nearly optimal rate. However, the induced posterior distribution has a sub-optimal contraction rate (Castillo et al., 2015), owing to insufficient prior concentration near zero.

The posterior inconsistency of the Bayesian fused lasso method motivates us to explore other approaches that would mitigate the problems leading to the undesirable behavior of the posterior distribution. Shrinkage priors qualify as naturally good choices for our problem, as they can address the dual issue of shrinking true zero parameters to zero, and retrieving the 'boundaries' of the blocks effectively by placing sufficient mass on the non-zero values of successive differences of the normal means. In the normal sequence model, optimal posterior concentration has been achieved by using shrinkage prior distributions with polynomially decaying tails (Song and Liang, 2017). This is in contrast to the Laplace prior, which has exponentially light tails.

The Horseshoe prior is a widely acclaimed choice as a shrinkage prior owing to its infinite spike at 0 and simultaneously possessing a thick tail.
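As a quick numerical illustration of this spike-and-tail behavior, draws from the Horseshoe prior can be simulated through its normal scale-mixture representation. This sketch (our own illustration, not part of the paper's analysis) shows that a large fraction of the mass sits very close to zero while occasional draws land far in the tails:

```python
import numpy as np

rng = np.random.default_rng(0)

def rhorseshoe(n, tau=1.0):
    """Draw n samples from the Horseshoe prior via its scale-mixture form:
    eta | lam ~ N(0, lam^2 * tau^2),  lam ~ C+(0, 1)."""
    lam = np.abs(rng.standard_cauchy(n))   # half-Cauchy local scales
    return rng.normal(0.0, lam * tau)

draws = rhorseshoe(100_000)
# Spike: much more mass near 0 than a standard normal would place there;
# tail: far larger extreme draws, inherited from the Cauchy mixing density.
print(np.mean(np.abs(draws) < 0.1))
print(np.max(np.abs(draws)))
```

The same representation underlies the Gibbs sampling steps developed later in the paper.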
The tails decay like a second-order polynomial, and hence the corresponding penalty function behaves like a logarithmic penalty, and is non-convex (see Carvalho et al. (2010); Bhadra et al. (2019) for more details). The Horseshoe prior can be expressed as a scale mixture of normals with a half-Cauchy prior, thus acting as a global-local shrinkage prior. We put a Horseshoe prior on the pairwise differences in the parameters $\eta_i = \theta_i - \theta_{i-1}$, $i = 2, \ldots, n$. We also need to put suitable priors on the mean parameter $\theta_1$ and the error variance $\sigma^2$. The full prior specification is given by
$$\theta_1 \mid \lambda_0, \sigma^2 \sim N(0, \lambda_0^2 \sigma^2), \quad \eta_i \mid \lambda_i, \tau, \sigma^2 \overset{ind}{\sim} N(0, \lambda_i^2 \tau^2 \sigma^2), \ 2 \le i \le n,$$
$$\lambda_i \overset{iid}{\sim} C^+(0, 1), \ 2 \le i \le n, \quad \tau \sim C^+(0, 1), \quad \sigma^2 \sim IG(a_\sigma, b_\sigma). \qquad (4)$$
Here $C^+(\cdot, \cdot)$ and $IG(\cdot, \cdot)$ respectively denote the half-Cauchy density and the Inverse Gamma density. The level of sparsity induced in the model is controlled by the global scale parameter $\tau$, and choosing the same is a non-trivial problem. Using a plug-in estimate for $\tau$ based on the empirical Bayes method runs the risk of producing a degenerate solution, resulting in a heavily sparse model. There are several works (Carvalho et al., 2010; Piironen and Vehtari, 2017b,a) that suggest effective methods regarding the choice of $\tau$. In this paper, we take a fully Bayesian approach as suggested in Carvalho et al. (2010) and Piironen and Vehtari (2017b) and use a half-Cauchy prior.

The half-Cauchy distribution can further be written as a scale mixture of Inverse-Gamma distributions. For a random variable $X \sim C^+(0, \psi)$, we can write
$$X^2 \mid \phi \sim IG(1/2, 1/\phi), \quad \phi \sim IG(1/2, 1/\psi^2).$$
Thus, the full hierarchical prior specification for our model is given by
$$\theta_1 \mid \lambda_0, \sigma^2 \sim N(0, \lambda_0^2 \sigma^2), \quad \eta_i \mid \lambda_i, \tau, \sigma^2 \overset{ind}{\sim} N(0, \lambda_i^2 \tau^2 \sigma^2), \ 2 \le i \le n,$$
$$\lambda_i^2 \mid \nu_i \overset{ind}{\sim} IG(1/2, 1/\nu_i), \ 2 \le i \le n, \quad \tau^2 \mid \xi \sim IG(1/2, 1/\xi),$$
$$\nu_2, \ldots, \nu_n, \xi \overset{iid}{\sim} IG(1/2, 1), \quad \sigma^2 \sim IG(a_\sigma, b_\sigma). \qquad (5)$$
The hyperparameters $a_\sigma$ and $b_\sigma$ may be chosen in such a way that the corresponding prior becomes non-informative. The local scale parameter $\lambda_0$ is considered to be fixed as well.

The conditional posterior distributions of the underlying parameters can be explicitly derived by exploiting the conjugate structure. Hence, the posterior computations can be accomplished easily via Gibbs sampling. We present the conditional posterior distributions of the parameters below.

The normal means have the conditional posterior distribution
$$\theta_i \mid \cdots \sim N(\mu_i, \zeta_i^2), \quad 1 \le i \le n, \qquad (6)$$
where $\mu_i$ and $\zeta_i^2$ are given by
$$\zeta_1^{-2} = \frac{1}{\sigma^2}\left(1 + \frac{1}{\lambda_2^2 \tau^2} + \frac{1}{\lambda_0^2}\right), \quad \mu_1 = \frac{\zeta_1^2}{\sigma^2}\left(y_1 + \frac{\theta_2}{\lambda_2^2 \tau^2}\right),$$
$$\zeta_i^{-2} = \frac{1}{\sigma^2}\left(1 + \frac{1}{\lambda_{i+1}^2 \tau^2} + \frac{1}{\lambda_i^2 \tau^2}\right), \quad \mu_i = \frac{\zeta_i^2}{\sigma^2}\left(y_i + \frac{\theta_{i+1}}{\lambda_{i+1}^2 \tau^2} + \frac{\theta_{i-1}}{\lambda_i^2 \tau^2}\right), \quad 2 \le i \le n.$$
Here $\lambda_{n+1}$ is considered to be infinity. The conditional posteriors for the rest of the parameters are given by
$$\lambda_i^2 \mid \cdots \sim IG\left(1, \ \frac{1}{\nu_i} + \frac{(\theta_i - \theta_{i-1})^2}{2\tau^2\sigma^2}\right), \ 2 \le i \le n,$$
$$\sigma^2 \mid \cdots \sim IG\left(n + a_\sigma, \ b_\sigma + \frac{1}{2}\left[\sum_{i=1}^n (y_i - \theta_i)^2 + \frac{1}{\tau^2}\sum_{i=2}^n \frac{(\theta_i - \theta_{i-1})^2}{\lambda_i^2} + \frac{\theta_1^2}{\lambda_0^2}\right]\right),$$
$$\tau^2 \mid \cdots \sim IG\left(\frac{n}{2}, \ \frac{1}{\xi} + \frac{1}{2\sigma^2}\sum_{i=2}^n \frac{(\theta_i - \theta_{i-1})^2}{\lambda_i^2}\right),$$
$$\nu_i \mid \cdots \sim IG\left(1, \ 1 + \frac{1}{\lambda_i^2}\right), \ 2 \le i \le n, \quad \xi \mid \cdots \sim IG\left(1, \ 1 + \frac{1}{\tau^2}\right). \qquad (7)$$

In this section, we present the theoretical validation of using a Horseshoe prior for fusion estimation. We first discuss the result on the posterior convergence rate of the mean parameter $\theta$ under certain assumptions, and then proceed to the result on posterior selection of the underlying true block structure of the mean parameter.

Assumption 1.
The number of true blocks in the model satisfies $s \prec n/\log n$.

Assumption 2. The true mean parameter vector $\theta_0 = (\theta_{0,1}, \ldots, \theta_{0,n})^T$ and the true error variance $\sigma_0^2$ satisfy the following conditions:
(i) Define $\eta_{0,j} = \theta_{0,j} - \theta_{0,j-1}$, $2 \le j \le n$. Then $\max_j |\eta_{0,j}/\sigma_0| < L$, where $\log L = O(\log n)$.
(ii) $\theta_{0,1}^2/(\lambda_0^2 \sigma_0^2) + \log \lambda_0 = O(\log n)$, where $\lambda_0$ is the prior hyperparameter appearing in the prior for $\theta_1$ in (4).

Assumption 3.
The global scale parameter $\tau$ in the prior specification (4) satisfies $\tau \asymp n^{-(3+u)}$ for some $u > 0$.

Theorem 4.
Consider the Gaussian means model (1) with prior specification as in (4), and suppose that Assumptions 1, 2 and 3 hold. Then the posterior distribution of $\theta$, given by $\Pi_n(\cdot \mid y)$, satisfies
$$\Pi_n\left(\|\theta - \theta_0\|_2/\sqrt{n} \ \ge \ M \sigma_0 \epsilon_n \mid y\right) \to 0, \quad \text{as } n \to \infty, \qquad (8)$$
in probability or in $L_1$ with respect to the probability measure of $y$, for $\epsilon_n \asymp \sqrt{s \log n / n}$ and a constant $M > 0$.

The above result implies that the posterior convergence rate for $\|\theta - \theta_0\|_2/\sqrt{n}$ is of the order $\sigma_0 \sqrt{s \log n / n}$. When the exact piecewise constant structure is known, the proposed Bayesian fusion estimation method achieves the optimal convergence rate $O(\sigma_0 \sqrt{s/n})$ up to a logarithmic term in $n$. The posterior convergence rate also adapts to the (unknown) size of the pieces. The rate directly compares with that of the Bayesian fusion estimation method proposed in Song and Cheng (2020).

The Horseshoe prior (and other global-local shrinkage priors) are continuous shrinkage priors, and hence exact block structure recovery is not possible. However, we can consider discretization of the posterior samples via the posterior projection of the samples $(\theta, \sigma)$ to a discrete set $S(\theta, \sigma) = \{2 \le j \le n : |\theta_j - \theta_{j-1}|/\sigma < \epsilon_n/n\}$. The number of false positives resulting from such a discretization can be expressed as the cardinality of the set $A(\theta, \sigma) = S^c(\theta, \sigma) - \{2 \le j \le n : \theta_{0,j} - \theta_{0,j-1} \ne 0\}$. The induced posterior distribution of $S(\theta, \sigma)$ (and hence that of $A(\theta, \sigma)$) can be shown to be 'selection consistent', in the sense that the number of false positives as defined above is bounded in probability. We formally present this result below.

Theorem 5.
Under the assumptions of Theorem 4, the posterior distribution of $A(\theta, \sigma)$ satisfies
$$\Pi_n\left(\#A(\theta, \sigma) > K s \mid y\right) \to 0, \qquad (9)$$
in probability or in $L_1$ with respect to the measure of $y$, for some fixed constant $K > 0$.

Note that the thresholding rule depends on the posterior contraction rate, which involves the number ($s$) of true blocks. However, in practical scenarios, knowledge of $s$ may not be readily available. To tackle such situations, we propose to use a threshold analogous to the concept of shrinkage weights in the sparse normal sequence model. We outline the details in the Appendix.

Simulation Study
To evaluate the performance of our method and compare with other competing approaches, we carry out simulation studies for varying signal and noise levels. Three different types of piecewise constant functions of length $n = 100$ are considered: (i) 10 evenly spaced signal pieces, with each of the pieces having length 10, (ii) 10 unevenly spaced signal pieces, with the shorter pieces each having length 5, and (iii) 10 very unevenly spaced signal pieces, with the shorter pieces each having length 2. The response variable $y$ is generated from an $n$-dimensional Gaussian distribution with mean $\theta_0$ and variance $\sigma_0^2$, where $\theta_0$ is the true signal vector, and $\sigma_0$ takes one of three noise levels, one for each of Figures 1, 2 and 3.

We estimate the true signal using our Horseshoe prior based fusion estimation approach, and compare with three other approaches: (a) the Bayesian $t$-fusion method, as proposed in Song and Cheng (2020), (b) a Bayesian fusion approach based on a Laplace prior, and (c) the $\ell_1$-fusion method. For the Bayesian methods, we consider 5000 MCMC samples, with the initial 500 samples as burn-in. The MCMC details, including the Gibbs sampler updates for the $t$-fusion and Laplace fusion approaches, and the choice of the corresponding scale parameters for the above priors are taken as suggested in Song and Cheng (2020). The hyperparameter values for the prior on the error variance are taken as $a_\sigma = b_\sigma = 0.$ , and the fixed scale hyperparameter is $\lambda_0 = 5$. For the frequentist fused lasso method based on the $\ell_1$-fusion penalty, we use the genlasso package in R, and choose the fusion penalty parameter using 5-fold cross-validation. To evaluate the performance of the estimators, we use the Mean Squared Error (MSE) and the adjusted MSE, respectively defined as
$$\mathrm{MSE} = \|\hat\theta - \theta_0\|_2^2/n, \quad \mathrm{adj.\,MSE} = \|\hat\theta - \theta_0\|_2^2/\|\theta_0\|_2^2,$$
where $\hat\theta$ is the estimated signal, given by the posterior mean in case of the Bayesian methods and the fused lasso estimate for the frequentist method.
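For concreteness, the Gibbs updates in (6)-(7) can be sketched as follows. This is an illustrative reimplementation under the hierarchical specification (5), not the code used for the reported results; the default hyperparameter values and the small numerical floors on the scale parameters are our own assumptions:

```python
import numpy as np

def horseshoe_fusion_gibbs(y, n_iter=2000, a_sig=0.01, b_sig=0.01,
                           lam0=5.0, rng=None):
    """Gibbs sampler sketch for the Horseshoe fusion model, following the
    conditional updates in (6)-(7).  Defaults are illustrative assumptions."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y)

    def ig(shape, scale):
        # Inverse-Gamma(shape, scale) draw, element-wise over `scale`
        return scale / rng.gamma(shape, size=np.shape(scale))

    theta = y.astype(float).copy()
    lam2 = np.ones(n)   # lam2[i] = local scale of theta_i - theta_{i-1}, i >= 1
    nu = np.ones(n)     # auxiliaries for the half-Cauchy local scales
    tau2, xi, sig2 = 1.0, 1.0, float(np.var(y))
    keep = np.zeros((n_iter, n))
    for t in range(n_iter):
        # theta_i | rest is Gaussian, combining y_i with its neighbours (6)
        for i in range(n):
            prec, num = 1.0, y[i]
            if i == 0:
                prec += 1.0 / lam0**2
            else:
                prec += 1.0 / (lam2[i] * tau2)
                num += theta[i - 1] / (lam2[i] * tau2)
            if i < n - 1:
                prec += 1.0 / (lam2[i + 1] * tau2)
                num += theta[i + 1] / (lam2[i + 1] * tau2)
            theta[i] = rng.normal(num / prec, np.sqrt(sig2 / prec))
        d2 = (theta[1:] - theta[:-1])**2
        # local scales and their auxiliaries (7); floor guards against underflow
        lam2[1:] = np.maximum(ig(1.0, 1.0 / nu[1:] + d2 / (2 * tau2 * sig2)), 1e-12)
        nu[1:] = ig(1.0, 1.0 + 1.0 / lam2[1:])
        # global scale and its auxiliary
        tau2 = max(ig(n / 2.0, 1.0 / xi + np.sum(d2 / lam2[1:]) / (2 * sig2)), 1e-12)
        xi = ig(1.0, 1.0 + 1.0 / tau2)
        # error variance
        rss = np.sum((y - theta)**2)
        pen = np.sum(d2 / lam2[1:]) / tau2 + theta[0]**2 / lam0**2
        sig2 = ig(n + a_sig, b_sig + 0.5 * (rss + pen))
        keep[t] = theta
    return keep

# Small demonstration on a hypothetical piecewise constant signal
rng = np.random.default_rng(1)
theta0 = np.repeat([0.0, 2.0, 0.0, -1.0], 25)
y = theta0 + 0.3 * rng.normal(size=100)
samples = horseshoe_fusion_gibbs(y, n_iter=400, rng=rng)
theta_hat = samples[100:].mean(axis=0)  # posterior mean after burn-in
```

The one-at-a-time $\theta_i$ updates mirror display (6); a production implementation could instead sample $\theta$ jointly from its tridiagonal Gaussian conditional for better mixing.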
All the computations were performed in R on a laptop having an Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz with 16GB RAM and a 64-bit OS, x64-based processor.

The summary measures for the performance of the four methods using MSE, adjusted MSE and their associated standard errors (in brackets) based on 100 Monte-Carlo replications are presented in Table 1. We find that our proposed Horseshoe-based fusion estimation method has excellent signal estimation performance across all the different types of signal pieces and noise levels. Within a Bayesian framework, the performance of the Laplace-based fusion estimation method is not at all promising. Though the estimation performances of the Horseshoe fusion, $t$-fusion and fused lasso methods are comparable in case of low noise levels, our proposed method performs much better in the higher noise scenario. Additionally, the Bayesian methods provide credible bands that can be utilized further for uncertainty quantification. In that regard, we observe from Figures 1, 2 and 3 that the Horseshoe-based estimates have comparatively narrower credible bands than the other competing Bayesian methods considered here. Overall, we could successfully demonstrate the superiority of using a heavy-tailed prior for successive differences in the signals in contrast to an exponentially lighter one. In addition, using a prior with a comparatively sharper spike at zero results in better signal and structure recovery, especially in situations where the noise level is higher.

We check the performance of our proposed method on two real-world data examples where the observed data can be modeled using piecewise constant functions. The first example involves DNA copy number analysis of Glioblastoma multiforme (GBM) data, and the second one involves analysis of solar X-ray flux data for the period of October-November 2003 that encompasses the Halloween 2003 solar storm events.
Array Comparative Genomic Hybridization (aCGH) is a high-throughput technique used to identify chromosomal abnormalities in the genomic DNA. Identification and characterization of chromosomal aberrations are extremely important for the pathogenesis of various diseases, including cancer. In cancerous cells, genes in a particular chromosome may get amplified or deleted owing to mutations, thus resulting in gains or losses in DNA copies of the genes. Array CGH data measure the log-ratio between the DNA copy number of genes in tumor cells and those in control cells. The aCGH data thus may be considered as a piecewise constant signal with certain non-zero blocks corresponding to aberrations owing to additions/deletions at the DNA level.

[Figures 1-3: Fusion estimation performance (True, HS, $t$, Laplace, Fused) for evenly spaced, unevenly spaced, and very unevenly spaced signal pieces, one figure for each error sd level. Observations are represented in red dots, true signals in blue dots, point estimates in black dots, and 95% credible bands of the Bayesian procedures in green.]

Glioblastoma multiforme (GBM) is the most common malignant brain tumor in adults (Holland, 2000). Lai et al. (2005) analyzed array CGH data in GBM using glioma data from Bredel et al. (2005) along with control samples.
Tibshirani and Wang (2008) used the data from Lai et al. (2005), and created aCGH data corresponding to a pseudo-chromosome so that the resulting genome sequence has shorter regions of high-amplitude gains (copy number additions) as well as a broad region of low-amplitude loss (copy number deletions). To be precise, the combined array consists of aCGH data from chromosome 7 in GBM29 and chromosome 13 in GBM31. The aCGH data are presented in Figure 4, and are available from the archived version of the cghFlasso package in R.

We apply the Horseshoe prior based Bayesian fusion estimation method and also compare the performance with the fused lasso method. We post-process the Bayesian estimates by discretization via the thresholding method as outlined in the Appendix. The results are displayed in Figure 5. We find that our proposed Bayesian fusion estimation method has successfully detected the high-amplitude regions of copy number additions, as well as the broad low-amplitude region of copy number deletions. The results are very much comparable with the fused lasso estimates, and the findings are similar to those in Lai et al. (2005). The Bayesian credible intervals obtained using our proposed method may further be utilized in addressing future research problems, like controlling false discovery rates.

The Earth's ionosphere is responsible for blocking high-frequency radio waves on the side facing the Sun. Large solar flares affect the ionosphere, thus interrupting satellite communications, and posing threats of radiation hazards to astronauts and spacecraft. Large solar flares are also associated with Coronal Mass Ejections (CMEs) that have the potential to trigger geomagnetic storms. Geomagnetic storms have the ability to disrupt satellite and radio communications, cause massive blackouts via electric grid failures, and affect global navigation systems.
The Geostationary Operational Environmental Satellite (GOES), operated by the United States' National Oceanic and Atmospheric Administration (NOAA), records data on solar flares.

[Figure 4: Array CGH data for chromosome 7 in GBM29 and chromosome 13 in GBM31.]

Solar X-ray flux is measured in Watts per square meter (W/m²). Flares are primarily classified into five categories, namely A, B, C, M, and X, depending on the approximate peak flux at wavelengths in the range of 1-8 Å. Flares in category M have peak flux between $10^{-5}$ and $10^{-4}$ W/m², and those in category X are the most powerful ones ($> 10^{-4}$ W/m²).

Around Halloween in 2003, a series of strong solar storms affected our planet, leading to disruptions in satellite communication, damages to spacecraft, and massive power outages in Sweden. The increased levels of solar activity made it possible to observe the Northern Lights (Aurora Borealis) even from far southern latitudes in Texas, USA. We consider the shortwave solar X-ray flux data for the two months of October and November 2003. The data have been accessed from NOAA's website. We use the 5-minute average flux data for all the days in the two months, and extract the information on the flux level for shortwave X-rays.

The data contain some missing values, which we impute via a linear interpolation method. Also, we further use 1-hour averages of the X-ray flux data, so as to arrive at 24 data points for each day, thus leading to a total of 1464 data points.

[Figure 5: Estimation performances of (a) the proposed Bayesian Horseshoe prior based fusion method, and (b) the $\ell_1$ fused lasso method. The blue lines represent the estimated signals, and the red dots represent the log-ratios between the DNA copy numbers of genes in tumor cells and those in control cells.]

[Figure 6: Scatter plot of the logarithm of solar X-ray flare flux values for the months of October and November 2003, recorded by GOES.]

Figure 6 shows the plot of the logarithm of the flux levels for the two months. As mentioned in Little and Jones (2011), we can approximate these data as a piecewise constant signal, and use fusion-based estimation methods to identify days of extreme solar activity.
We apply our Horseshoe prior based Bayesian fusion estimation method, and post-process the estimates by using the thresholding approach as mentioned in the Appendix. We also apply the $\ell_1$-fusion estimation method. The fitted estimates are shown in Figure 7. We find that our Horseshoe prior based Bayesian approach has been able to detect the periods of increased solar activity around October 28 and also on November 04, 2003, besides giving piecewise constant estimates. Our findings match with related analyses; see Pulkkinen et al. (2005). On the other hand, the fused lasso method suffers from serious over-fitting problems.

[Figure 7: Estimation performances of (a) the proposed Bayesian Horseshoe prior based fusion method, and (b) the $\ell_1$ fused lasso method. The blue lines represent the estimated signals, and the red dots represent the logarithm of the shortwave X-ray solar flare flux values as recorded by GOES.]

Discussion

In this paper, we have addressed the problem of fusion estimation in a normal sequence model where the underlying mean parameter vector is piecewise constant. Noise removal and block structure learning is an important and widely studied problem with applications in a plethora of areas. We developed a Horseshoe prior based fusion estimation method and provided theoretical guarantees by establishing posterior convergence and selection consistency. Numerical studies demonstrated excellent performance of our method as compared to other competing Bayesian and frequentist approaches.

There are several directions for extending our work. In many applications, we observe noisy signals defined over an undirected graph. In a recent work, Banerjee and Shen (2020) proposed to use the $t$-shrinkage prior to address the graph signal denoising problem by reducing the original graph to a linear chain graph.
The excellent performance of the Horseshoe prior based learning can be extended to the graph signal denoising problem as well. Another direction of work is the graph fused lasso method for estimation and structure learning in multiple Gaussian graphical models (Danaher et al., 2014). We propose to explore these ideas in future work.

Data Availability
Data for the DNA copy number analysis are available for free from the archived version of the cghFlasso package in R. Data on solar X-ray flux are available for free from the National Oceanic and Atmospheric Administration (NOAA)'s website.

Acknowledgements
The author is supported by DST INSPIRE Faculty Award Grant No. 04/2015/002165, and also by an IIM Indore Young Faculty Research Chair Award grant.

Appendix
Additional lemmas and proofs of results
Lemma 6.
Under the assumptions on the true parameter $\theta_0$ and the prior distribution on $\theta$ as outlined in Theorem 4, for some $u' > 0$, we have
$$\int_{-a_n}^{a_n} p_{HS}(\eta; \tau)\, d\eta \ \ge \ 1 - n^{-(1+u')}, \quad a_n = s \log n / n, \qquad (10)$$
where $p_{HS}(\eta; \tau)$ denotes the Horseshoe prior density on $\eta$ with hyperparameter $\tau > 0$.

Proof. Define $a_n = s \log n / n$, and note that $a_n < \sqrt{n}\,\epsilon_n$, where $\epsilon_n$ is the posterior convergence rate $\sqrt{s \log n / n}$. We will show that $\int_{-a_n}^{a_n} p_{HS}(\eta; \tau)\, d\eta \ge 1 - n^{-(1+u')}$ for some $u' > 0$. Writing the Horseshoe density as a normal scale mixture over the half-Cauchy local scale $\lambda$, we have, for any $u > 0$,
$$1 - \int_{-a_n}^{a_n} p_{HS}(\eta; \tau)\, d\eta = 2\int_0^\infty \left\{1 - \Phi\left(\frac{a_n}{\lambda \tau}\right)\right\} f_{C^+}(\lambda)\, d\lambda$$
$$= 2\int_0^{n^{1+u}} \left\{1 - \Phi\left(\frac{a_n}{\lambda \tau}\right)\right\} f_{C^+}(\lambda)\, d\lambda + 2\int_{n^{1+u}}^\infty \left\{1 - \Phi\left(\frac{a_n}{\lambda \tau}\right)\right\} f_{C^+}(\lambda)\, d\lambda$$
$$\le 2\left\{1 - \Phi\left(\frac{a_n n^{-(1+u)}}{\tau}\right)\right\} + 2\int_{n^{1+u}}^\infty \frac{2}{\pi} \frac{1}{1 + \lambda^2}\, d\lambda$$
$$\le \sqrt{2/\pi}\, \frac{\exp\left(-a_n^2 n^{-2(1+u)}/(2\tau^2)\right)}{a_n n^{-(1+u)}/\tau} + n^{-(1+u)} \ \le \ n^{-(1+u')},$$
where $0 < u' < u$, for $\tau \le a_n n^{-(1+u)} \asymp n^{-(3+u)}$, and where we use the Gaussian tail bound $1 - \Phi(x) \le \phi(x)/x$ together with
$$2\int_{n^{1+u}}^\infty \frac{2}{\pi} \frac{1}{1 + \lambda^2}\, d\lambda = \frac{2}{\pi}\left(\frac{\pi}{2} - \arctan\left(n^{1+u}\right)\right) \asymp n^{-(1+u)}.$$
This completes the proof.

As mentioned earlier, the above result guarantees that the Horseshoe prior puts sufficient mass around zero for the successive differences $\eta_i$ so as to facilitate Bayesian shrinkage for the block-structured mean parameters. However, the prior should be able to retrieve the blocks effectively as well, so that successive differences that are non-zero are not shrunk too much. The following result guarantees that the Horseshoe prior is 'thick' at non-zero parameter values, so that it is not too sharp.

Lemma 7.
Consider the prior structure and the assumptions as mentioned in Theorem 4. Then we have
$$-\log\left(\inf_{\eta/\sigma \in [-L, L]} p_{HS}(\eta; \tau)\right) = O(\log n). \qquad (11)$$

Proof. The Horseshoe prior does not have an explicit functional form. However, Carvalho et al. (2010) provide a lower bound for the Horseshoe prior density that we utilise for proving the above result. We have
$$\inf_{\eta/\sigma \in [-L, L]} p_{HS}(\eta; \tau) = \int_0^\infty \frac{1}{\sqrt{2\pi}\,\lambda\tau} \exp\left(-\frac{L^2}{2\lambda^2\tau^2}\right) f_{C^+}(\lambda)\, d\lambda \ \gtrsim \ \log\left(1 + \frac{2\tau^2}{L^2}\right) \asymp \frac{\tau^2}{L^2},$$
for small $\tau/L$. Thus, for $-\log(\tau) = O(\log n)$ and $\log L = O(\log n)$, we get $-\log\left(\inf_{\eta/\sigma \in [-L, L]} p_{HS}(\eta; \tau)\right) = O(\log n)$, hence completing the proof.

We now present the proofs of the main results in our paper.

Proofs of Theorem 4 and Theorem 5.
The proof readily follows from Theorems 2.1 and 2.2 in Song and Cheng (2020), once we verify the conditions for posterior convergence specified therein (see display (2.5) of the aforementioned paper). Lemma 6 and Lemma 7 imply that the first two conditions are satisfied. The third condition is satisfied for a Normal prior distribution on $\theta$ and a fixed hyperparameter $\lambda > 0$. The fourth condition is trivially satisfied for fixed choices of (non-zero) hyperparameters $a_\sigma$ and $b_\sigma$, and for some fixed (unknown) true error variance $\sigma^2$.

A practical solution to block structure recovery

The Horseshoe prior is a continuous shrinkage prior, and hence block structure recovery is not straightforward. In Bayesian fusion estimation with a Laplace shrinkage prior or with a $t$-shrinkage prior, one may discretize the samples obtained via MCMC using a suitable threshold. Song and Cheng (2020) recommended using the $1/n$-th quantile of the corresponding prior for discretization of the scaled samples. In Section 4, we proposed a threshold based on the posterior contraction rate. However, that would also require an idea of the true block size, which may not always be available, or may be difficult to ascertain beforehand. Here we provide a practical approach for discretization of the samples. Note that in a sparse normal sequence problem, the posterior mean of the mean parameter $\theta_i$ is given by $\kappa_i y_i$, where $\kappa_i$ is the shrinkage weight. These shrinkage weights mimic the posterior selection probability of the means, and hence thresholding the weights at 0.5 provides excellent selection performance, as justified numerically (Carvalho et al., 2010) and later theoretically (Datta and Ghosh, 2013). Motivated by this, we propose the following thresholding rule for selection of the blocks in our proposed method: $\theta_{j_1}$ and $\theta_{j_2}$ are taken to be equal if $|\hat\theta_{j_1} - \hat\theta_{j_2}|/\hat\sigma < 0.5\,|y_{j_1} - y_{j_2}|$, where $\hat\theta_{j_1}$, $\hat\theta_{j_2}$, $\hat\sigma$ are the posterior means of $\theta_{j_1}$, $\theta_{j_2}$ and $\sigma$ respectively, for $1 \le j_1 \ne j_2 \le n$.
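As a rough numerical illustration (not the paper's implementation), the thresholding rule above can be sketched as follows. The posterior means $\hat\theta_j$ are mocked here by within-block averages of $y$ purely for illustration; in the actual method they would be computed from MCMC samples under the Horseshoe fusion prior, and the block sizes and signal levels below are hypothetical.

```python
# Illustrative sketch of the proposed 0.5-thresholding rule:
# declare theta_{j1} = theta_{j2} when
#   |thetahat_{j1} - thetahat_{j2}| / sigmahat < 0.5 * |y_{j1} - y_{j2}|.
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.repeat([0.0, 3.0, -2.0], 20)   # piecewise-constant signal
sigma = 0.5
y = theta_true + sigma * rng.normal(size=theta_true.size)

# Mock posterior means: average y within each (here, known) block.
theta_hat = np.concatenate(
    [np.full(20, y[i:i + 20].mean()) for i in range(0, 60, 20)]
)
sigma_hat = sigma  # mock posterior mean of sigma

def same_block(j1, j2):
    """Estimated block indicator from the 0.5-thresholding rule."""
    return abs(theta_hat[j1] - theta_hat[j2]) / sigma_hat < 0.5 * abs(y[j1] - y[j2])

print(same_block(0, 1))    # indices in the same true block: fused
print(same_block(0, 25))   # indices in different true blocks: not fused
```

With block-constant mock posterior means, within-block differences of $\hat\theta$ vanish while between-block differences remain large relative to $0.5\,|y_{j_1}-y_{j_2}|$, so the rule recovers the block structure in this toy setting.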
We thus estimate the block structure indicator $s_{j_1 j_2} = \mathbb{1}\{\theta_{j_1} = \theta_{j_2}\}$ by $\hat s_{j_1 j_2} = \mathbb{1}\{|\hat\theta_{j_1} - \hat\theta_{j_2}|/\hat\sigma < 0.5\,|y_{j_1} - y_{j_2}|\}$. To evaluate the performance of structure recovery using the proposed thresholding approach, we define the metrics $W$ and $B$, where $W$ is the average within-block variation, defined as $W := \mathrm{mean}_{\{s_{j_1 j_2} \ne 0\}} |\hat\theta_{j_1} - \hat\theta_{j_2}|$, and $B$ is the between-block separation, defined as $B := \min_{\{s_{j_1 j_2} = 0\}} |\hat\theta_{j_1} - \hat\theta_{j_2}|$.

This practical thresholding approach results in excellent block structure recovery performance with respect to the metrics $W$ and $B$. We compare our results with the other Bayesian fusion estimation methods and also with the frequentist fused lasso method. The fused lasso method achieves exact structure recovery, and hence no thresholding is required. As mentioned earlier, the thresholding rules for the competing Bayesian methods are taken as suggested in Song and Cheng (2020). We notice that for low error variances, the within-block average variation is lower for the Horseshoe fusion and $t$-fusion priors, with the Horseshoe fusion method having larger between-block separation, especially in case of higher noise, thus indicating superior structure learning capabilities.

Other possible directions for arriving at a threshold include using a multiple hypothesis testing approach for the successive differences in the means, and using simultaneous credible intervals. We leave the theoretical treatment of our practical thresholding approach (and its improvement, if possible), along with the exploration of these other proposals, as future work.

References
Banerjee, S., Castillo, I., and Ghosal, S. (2021). Bayesian inference in high-dimensional models. arXiv preprint arXiv:2101.04491.

Banerjee, S. and Shen, W. (2020). Graph signal denoising using t-shrinkage priors. arXiv preprint arXiv:2012.13696.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis, 12(4):1105–1131.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. T. (2019). The horseshoe-like regularization for feature subset selection. Sankhya B, pages 1–30.

Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479–1490.

Bredel, M., Bredel, C., Juric, D., Harsh, G. R., Vogel, H., Recht, L. D., and Sikic, B. I. (2005). High-resolution genome-wide mapping of genetic alterations in human glial brain tumors. Cancer Research, 65(10):4088–4096.

Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009). Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80. PMLR.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480.

Castillo, I., Schmidt-Hieber, J., and Van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Annals of Statistics, 43(5):1986–2018.

Danaher, P., Wang, P., and Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):373.

Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1):111–132.

Donoho, D. L., Johnstone, I. M., Hoch, J. C., and Stern, A. S. (1992). Maximum entropy and the nearly black object. Journal of the Royal Statistical Society: Series B (Methodological), 54(1):41–67.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.

Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171–188.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82.

Holland, E. C. (2000). Glioblastoma multiforme: the terminator. Proceedings of the National Academy of Sciences, 97(12):6242–6244.

Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, 33(2):730–773.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369–411.

Lai, W. R., Johnson, M. D., Kucherlapati, R., and Park, P. J. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21(19):3763–3770.

Little, M. A. and Jones, N. S. (2011). Generalized methods and solvers for noise removal from piecewise constant signals. I. Background theory. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 467(2135):3088–3114.

Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032.

Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686.

Piironen, J. and Vehtari, A. (2017a). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In Artificial Intelligence and Statistics, pages 905–913. PMLR.

Piironen, J. and Vehtari, A. (2017b). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018–5051.

Pulkkinen, A., Lindahl, S., Viljanen, A., and Pirjola, R. (2005). Geomagnetic storm of 29–31 October 2003: Geomagnetically induced currents and their relation to problems in the Swedish high-voltage power transmission system. Space Weather, 3(8).

Rinaldo, A. (2009). Properties and refinements of the fused lasso. Annals of Statistics, 37(5B):2922–2952.

Ročková, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521):431–444.

Shimamura, K., Ueki, M., Kawano, S., and Konishi, S. (2019). Bayesian generalized fused lasso modeling via NEG distribution. Communications in Statistics - Theory and Methods, 48(16):4132–4153.

Song, Q. (2020). Bayesian shrinkage towards sharp minimaxity. Electronic Journal of Statistics, 14(2):2714–2741.

Song, Q. and Cheng, G. (2020). Bayesian fusion estimation via t-shrinkage. Sankhya A, 82(2):353–385.

Song, Q. and Liang, F. (2017). Nearly optimal Bayesian shrinkage for high dimensional regression. arXiv preprint arXiv:1712.08964.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108.

Tibshirani, R. and Wang, P. (2008). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1):18–29.

van der Pas, S., Szabó, B., and van der Vaart, A. (2017). Adaptive posterior contraction rates for the horseshoe. Electronic Journal of Statistics, 11(2):3196–3225.

van der Pas, S. L., Kleijn, B. J., and van der Vaart, A. W. (2014). The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics, 8(2):2585–2618.

Wang, H. (2012). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4):867–886.

Wei, R. and Ghosal, S. (2020). Contraction properties of shrinkage priors in logistic regression. Journal of Statistical Planning and Inference, 207:215–229.

Zhang, Y., Reich, B. J., and Bondell, H. D. (2016). High dimensional linear regression via the R2-D2 shrinkage prior. arXiv preprint arXiv:1609.00046.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.

[Table: MSE and adjusted MSE values (with associated standard errors in parentheses) for the Horseshoe fusion, t-fusion, Laplace fusion, and fused lasso methods, when the true signal is evenly spaced ("Even"), unevenly spaced ("Uneven"), and very unevenly spaced ("V. Uneven").]

[Table: Within (W) and between (B) block variation (with associated standard errors in parentheses) for the different fusion estimation methods. For good structure recovery, W values should be low and B values should be high.]