Streamlined Variational Inference for Higher Level Group-Specific Curve Models

By M. Menictas, T. H. Nolan, D. G. Simpson and M. P. Wand

University of Technology Sydney and University of Illinois

Abstract
A two-level group-specific curve model is such that the mean response of each member of a group is a separate smooth function of a predictor of interest. The three-level extension is such that one grouping variable is nested within another one, and higher level extensions are analogous. Streamlined variational inference for higher level group-specific curve models is a challenging problem. We confront it by systematically working through two-level and then three-level cases and making use of the higher level sparse matrix infrastructure laid down in Nolan & Wand (2018). A motivation is analysis of data from ultrasound technology for which three-level group-specific curve models are appropriate. Whilst extension to the number of levels exceeding three is not covered explicitly, the pattern established by our systematic approach sheds light on what is required for even higher level group-specific curve models.
Keywords: longitudinal data analysis, multilevel models, panel data, mean field variational Bayes.
We provide explicit algorithms for fitting and approximate Bayesian inference for multilevel models involving, potentially, thousands of noisy curves. The algorithms include covariance parameter estimation and allow for pointwise credible intervals around the fitted curves. Contrast function fitting and inference is also supported by our approach. Both two-level and three-level situations are covered, and a template for even higher level situations is laid down.

Models and methodology for statistical analyses of grouped data for which the basic unit is a noisy curve continue to be an important area of research. A driving force is rapid technological change which is resulting in the generation of curve-type data at fine resolution levels. Examples of such technology include accelerometers (e.g. Goldsmith et al., 2015), personal digital assistants (e.g. Trail et al., 2014) and quantitative ultrasound (e.g. Wirtzfeld et al., 2015). In some applications curve-type data have higher levels of grouping, with groups at one level nested inside other groups. Our focus here is streamlined variational inference for such circumstances.

Some motivating data are shown in Figure 1 from an experiment involving quantitative ultrasound technology. Each curve corresponds to a logarithmically transformed backscatter coefficient over a fine grid of frequency values for tumors in laboratory mice, with exactly one tumor per mouse. The backscatter/frequency curves are grouped according to one of 5 slices of the same tumor, corresponding to probe locations. The slices are grouped according to being from one of 10 tumors. We refer to such data as three-level data, with frequency measurements at level 1, slices being the level 2 groups and tumors constituting the level 3 groups. The gist of this article is efficient and flexible variational fitting and inference for such data, that scales well to much larger multilevel data sets. Indeed, our algorithms are linear in the number of groups at both level 2 and level 3. Simulation study results given later in this article show that curve-type data with thousands of groups can be analyzed quickly using our new methodology. Depending on sample sizes and implementation language, fitting times range from a few seconds to a few minutes. In contrast, naive implementations become infeasible when the number of groups is in the several hundreds due to storage and computational demands.
Figure 1:
Illustrative three-level curve-type data. The response variable is
10 log(backscatter) according to ultrasound technology. Level 1 corresponds to different ultrasound frequencies and matches the horizontal axes in each panel. Level 2 corresponds to different slices of a tumor due to differing probe locations. Level 3 corresponds to different tumors, with one tumor for each of 10 laboratory mice.

We work with a variant of group-specific curve models that go back at least to Donnelly, Laird & Ware (1995). Other contributions of this type include Brumback & Rice (1998), Verbyla et al. (1999), Wang (1998) and Zhang et al. (1998). The specific formulation that we use is that given by Durban et al. (2005), which involves an embedding within the class of linear mixed models (e.g. Robinson, 1991) with low-rank smoothing splines used for flexible function modelling and fitting.

Even though approximate Bayesian variational inference is our overarching goal, we also provide an important parallelism involving classical frequentist inference. Contemporary mixed model software, such as the R packages nlme (Pinheiro et al., 2018) and lme4 (Bates et al., 2015), provides streamlined algorithms for obtaining the best linear unbiased predictions of fixed and random effects in multilevel mixed models, with details given in, for example, Pinheiro & Bates (2000). However, the sub-blocks of the covariance matrices required for construction of pointwise confidence interval bands around the estimated curves are not provided by such software. In the variational Bayesian analog, these sub-blocks are required for covariance parameter fitting and inference which, in turn, are needed for curve estimation. A significant contribution of this article is streamlined computation for both the best linear unbiased predictors and their corresponding covariance matrix components. Similar mathematical results lead to the mean field variational Bayesian inference equivalent. We present explicit ready-to-code algorithms for both two-level and three-level group-specific curve models. Extensions to higher level models could be derived using the blueprint that we establish here. Nevertheless, the algebraic overhead is increasingly burdensome with each increment in the number of levels. It is prudent to treat each multilevel case separately and here we already require several pages to cover two-level and three-level group-specific curve models. To our knowledge, this is the first article to provide streamlined algorithms for fitting three-level group-specific curve models.

Another important aspect of our group-specific curve fitting algorithms is the fact that they make use of the SolveTwoLevelSparseLeastSquares and SolveThreeLevelSparseLeastSquares algorithms developed for ordinary linear mixed models in Nolan et al. (2018). This realization means that the algorithms listed in Sections 2 and 3 are more concise and code-efficient: there is no need to repeat the implementation of these two fundamental algorithms for stable QR-based solving of higher level sparse linear systems. Sections S.11 and S.12 of the web-supplement provide details on the SolveTwoLevelSparseLeastSquares and SolveThreeLevelSparseLeastSquares algorithms.

Section 2 deals with the two-level case and the three-level case is covered in Section 3. In Section 4 we provide some assessments concerning the accuracy and speed of the new variational inference algorithms.
The simplest version of group-specific curve models involves pairs $(x_{ij},y_{ij})$, where $x_{ij}$ is the $j$th value of the predictor variable within the $i$th group and $y_{ij}$ is the corresponding value of the response variable. We let $m$ denote the number of groups and $n_i$ denote the number of predictor/response pairs within the $i$th group. The Gaussian response two-level group-specific curve model is
$$y_{ij}=f(x_{ij})+g_i(x_{ij})+\varepsilon_{ij},\quad \varepsilon_{ij}\stackrel{\rm ind.}{\sim}N(0,\sigma_{\varepsilon}^2),\quad 1\le i\le m,\ 1\le j\le n_i,\qquad(1)$$
where the smooth function $f$ is the global regression mean function and the smooth functions $g_i$, $1\le i\le m$, allow for flexible group-specific deviations from $f$. As in Durban et al. (2005), we use mixed model-based penalized basis functions to model $f$ and the $g_i$. Specifically,
$$f(x)=\beta_0+\beta_1x+\sum_{k=1}^{K_{\rm gbl}}u_{{\rm gbl},k}\,z_{{\rm gbl},k}(x),\qquad u_{{\rm gbl},k}\stackrel{\rm ind.}{\sim}N(0,\sigma_{\rm gbl}^2),$$
and
$$g_i(x)=u_{{\rm lin},i0}+u_{{\rm lin},i1}x+\sum_{k=1}^{K_{\rm grp}}u_{{\rm grp},ik}\,z_{{\rm grp},k}(x),\qquad
\begin{bmatrix}u_{{\rm lin},i0}\\ u_{{\rm lin},i1}\end{bmatrix}\stackrel{\rm ind.}{\sim}N(0,\Sigma),\quad u_{{\rm grp},ik}\stackrel{\rm ind.}{\sim}N(0,\sigma_{\rm grp}^2),$$
where $\{z_{{\rm gbl},k}(\cdot):1\le k\le K_{\rm gbl}\}$ and $\{z_{{\rm grp},k}(\cdot):1\le k\le K_{\rm grp}\}$ are suitable sets of basis functions. Splines and wavelet families are the most common choices for the $z_{{\rm gbl},k}(\cdot)$ and $z_{{\rm grp},k}(\cdot)$. In our illustrations and simulation studies we use the canonical cubic O'Sullivan spline basis as described in Section 4 of Wand & Ormerod (2008), which corresponds to a low-rank version of classical smoothing splines (e.g. Wahba, 1990). The variance parameters $\sigma_{\rm gbl}^2$ and $\sigma_{\rm grp}^2$ control the effective degrees of freedom used for the global mean and group-specific deviation functions respectively. Lastly, $\Sigma$ is a $2\times2$ unstructured covariance matrix for the coefficients of the group-specific linear deviations.

We also use the notation $x_i\equiv[x_{i1}\ \cdots\ x_{in_i}]^T$ and $y_i\equiv[y_{i1}\ \cdots\ y_{in_i}]^T$ for the vectors of predictors and responses corresponding to the $i$th group. Notation such as $z_{{\rm gbl},1}(x_i)$ denotes the $n_i\times1$ vector containing $z_{{\rm gbl},1}(x_{ij})$ values, $1\le j\le n_i$.

Model (1) is expressible as a Gaussian response linear mixed model as follows:
$$y\,|\,u\sim N(X\beta+Zu,\sigma_{\varepsilon}^2I),\qquad u\sim N(0,G),\qquad(2)$$
where
$$X\equiv\begin{bmatrix}X_1\\ \vdots\\ X_m\end{bmatrix}\ \mbox{with}\ X_i\equiv[1\ \ x_i]
\quad\mbox{and}\quad \beta\equiv\begin{bmatrix}\beta_0\\ \beta_1\end{bmatrix}$$
are the fixed effects design matrix and coefficients, corresponding to the linear component of $f$. The random effects design matrix $Z$ and corresponding random effects vector $u$ are partitioned according to
$$Z=\Big[Z_{\rm gbl}\ \ \underset{1\le i\le m}{\rm blockdiag}\big([X_i\ \ Z_{{\rm grp},i}]\big)\Big]
\quad\mbox{and}\quad
u=\begin{bmatrix}u_{\rm gbl}\\ u_{{\rm lin},1}\\ u_{{\rm grp},1}\\ \vdots\\ u_{{\rm lin},m}\\ u_{{\rm grp},m}\end{bmatrix}\qquad(3)$$
where $u_{\rm gbl}=[u_{{\rm gbl},1}\ \ldots\ u_{{\rm gbl},K_{\rm gbl}}]^T$ are the coefficients corresponding to the non-linear component of $f$, $u_{{\rm lin},i}=[u_{{\rm lin},i0}\ u_{{\rm lin},i1}]^T$ are the coefficients corresponding to the linear component of $g_i$ and $u_{{\rm grp},i}=[u_{{\rm grp},i1}\ \ldots\ u_{{\rm grp},iK_{\rm grp}}]^T$ are the coefficients corresponding to the non-linear component of $g_i$, $1\le i\le m$. In (3), $Z_{\rm gbl}\equiv{\rm stack}_{1\le i\le m}(Z_{{\rm gbl},i})$ and the matrices $Z_{{\rm gbl},i}$ and $Z_{{\rm grp},i}$, $1\le i\le m$, contain, respectively, spline basis functions for the global mean function $f$ and the $i$th group deviation functions $g_i$. Specifically,
$$Z_{{\rm gbl},i}\equiv[z_{{\rm gbl},1}(x_i)\ \cdots\ z_{{\rm gbl},K_{\rm gbl}}(x_i)]
\quad\mbox{and}\quad
Z_{{\rm grp},i}=[z_{{\rm grp},1}(x_i)\ \cdots\ z_{{\rm grp},K_{\rm grp}}(x_i)]$$
for $1\le i\le m$. The corresponding random effects vectors satisfy
$$u_{\rm gbl}\sim N(0,\sigma_{\rm gbl}^2I_{K_{\rm gbl}})
\quad\mbox{and}\quad
\begin{bmatrix}u_{{\rm lin},i}\\ u_{{\rm grp},i}\end{bmatrix}\stackrel{\rm ind.}{\sim}
N\left(\begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}\Sigma & O\\ O & \sigma_{\rm grp}^2I_{K_{\rm grp}}\end{bmatrix}\right),\quad 1\le i\le m.$$
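The following R sketch, which is not part of the paper's algorithms, illustrates how the design matrices in (2) and (3) can be assembled for simulated two-level data. A simple truncated-cubic basis stands in for the O'Sullivan basis used in the paper, and all object names, knot counts and sample sizes are this sketch's own illustrative choices.

```r
# Assemble X, Z.gbl and the blockdiagonal group design for the two-level model
# (2)-(3), using simulated data and a truncated-cubic spline basis as a stand-in
# for the O'Sullivan basis.
library(Matrix)
set.seed(1)
m <- 20; n.i <- rep(30, m)
x <- lapply(n.i, runif)                                   # x[[i]]: predictors for group i
y <- lapply(x, function(xi) sin(2*pi*xi) + 0.5*xi + rnorm(length(xi), sd = 0.3))
K.gbl <- 10; K.grp <- 5
knots.gbl <- quantile(unlist(x), seq(0, 1, length = K.gbl + 2))[-c(1, K.gbl + 2)]
knots.grp <- quantile(unlist(x), seq(0, 1, length = K.grp + 2))[-c(1, K.grp + 2)]
trunc.cubic <- function(x, knots) outer(x, knots, function(u, k) pmax(u - k, 0)^3)

X.i     <- lapply(x, function(xi) cbind(1, xi))           # [1 x_i] fixed effects blocks
Z.gbl.i <- lapply(x, trunc.cubic, knots = knots.gbl)      # global spline blocks
Z.grp.i <- lapply(x, trunc.cubic, knots = knots.grp)      # group spline blocks

X <- do.call(rbind, X.i)                                  # stacked fixed effects design
Z <- cbind(do.call(rbind, Z.gbl.i),                       # Z.gbl = stack_i(Z.gbl,i)
           as.matrix(bdiag(Map(cbind, X.i, Z.grp.i))))    # blockdiag_i([X_i Z.grp,i])
```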
Hence, the full random effects covariance matrix is
$$G={\rm Cov}(u)=\begin{bmatrix}\sigma_{\rm gbl}^2I_{K_{\rm gbl}} & O\\ O & I_m\otimes\begin{bmatrix}\Sigma & O\\ O & \sigma_{\rm grp}^2I_{K_{\rm grp}}\end{bmatrix}\end{bmatrix}.\qquad(4)$$
Next define the matrices
$$C\equiv[X\ \ Z],\qquad D_{\rm BLUP}\equiv\begin{bmatrix}O & O\\ O & G^{-1}\end{bmatrix}
\qquad\mbox{and}\qquad R_{\rm BLUP}\equiv\sigma_{\varepsilon}^2I.\qquad(5)$$
The best linear unbiased predictor of $[\beta^T\ u^T]^T$ and the corresponding covariance matrix are
$$\begin{bmatrix}\widehat\beta\\ \widehat u\end{bmatrix}
=(C^TR_{\rm BLUP}^{-1}C+D_{\rm BLUP})^{-1}C^TR_{\rm BLUP}^{-1}y
\quad\mbox{and}\quad
{\rm Cov}\left(\begin{bmatrix}\widehat\beta\\ \widehat u-u\end{bmatrix}\right)
=(C^TR_{\rm BLUP}^{-1}C+D_{\rm BLUP})^{-1}.\qquad(6)$$
This covariance matrix grows quadratically in $m$, so its storage becomes infeasible for large numbers of groups. However, only the following sub-blocks are required for adding pointwise confidence intervals to curve estimates:
$${\rm Cov}\left(\begin{bmatrix}\widehat\beta\\ \widehat u_{\rm gbl}-u_{\rm gbl}\end{bmatrix}\right)
=\mbox{top left-hand }(2+K_{\rm gbl})\times(2+K_{\rm gbl})\mbox{ sub-block of }(C^TR_{\rm BLUP}^{-1}C+D_{\rm BLUP})^{-1},$$
$${\rm Cov}\left(\begin{bmatrix}\widehat u_{{\rm lin},i}-u_{{\rm lin},i}\\ \widehat u_{{\rm grp},i}-u_{{\rm grp},i}\end{bmatrix}\right)
=\mbox{subsequent }(2+K_{\rm grp})\times(2+K_{\rm grp})\mbox{ diagonal sub-blocks of }(C^TR_{\rm BLUP}^{-1}C+D_{\rm BLUP})^{-1}$$
below ${\rm Cov}([\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}])$, $1\le i\le m$, and
$$E\left\{\begin{bmatrix}\widehat\beta\\ \widehat u_{\rm gbl}-u_{\rm gbl}\end{bmatrix}
\begin{bmatrix}\widehat u_{{\rm lin},i}-u_{{\rm lin},i}\\ \widehat u_{{\rm grp},i}-u_{{\rm grp},i}\end{bmatrix}^T\right\}
=\mbox{subsequent }(2+K_{\rm gbl})\times(2+K_{\rm grp})\mbox{ sub-blocks of }(C^TR_{\rm BLUP}^{-1}C+D_{\rm BLUP})^{-1}$$
to the right of ${\rm Cov}([\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}])$, $1\le i\le m$. $\qquad(7)$

As in Nolan, Menictas & Wand (2019), we define the generic two-level sparse matrix least squares problem to be determination of the vector $x$ which minimizes the least squares criterion
$$\|b-Bx\|^2\quad\mbox{where}\quad \|v\|^2\equiv v^Tv\mbox{ for any column vector }v,\qquad(8)$$
with $B$ having the two-level sparse form
$$B\equiv\begin{bmatrix}
B_1 & \overset{\bullet}{B}_1 & O & \cdots & O\\
B_2 & O & \overset{\bullet}{B}_2 & \cdots & O\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
B_m & O & O & \cdots & \overset{\bullet}{B}_m
\end{bmatrix}
\quad\mbox{and}\quad b\mbox{ partitioned according to }b\equiv\begin{bmatrix}b_1\\ b_2\\ \vdots\\ b_m\end{bmatrix}.\qquad(9)$$
In (9), for any $1\le i\le m$, the matrices $B_i$, $\overset{\bullet}{B}_i$ and $b_i$ each have the same number of rows. The numbers of columns in $B_i$ and $\overset{\bullet}{B}_i$ are arbitrary, whereas the $b_i$ are column vectors. In addition to solving for $x$, computation of the sub-blocks of $(B^TB)^{-1}$ corresponding to the non-sparse regions of $B^TB$ is included in our definition of a two-level sparse matrix least squares problem. Algorithm 2 of Nolan et al. (2018) provides a stable and efficient solution to this problem and labels it the SolveTwoLevelSparseLeastSquares algorithm. Section S.11 of the web-supplement contains details regarding this algorithm. In Nolan et al. (2018) we used SolveTwoLevelSparseLeastSquares for fitting two-level linear mixed models. However, precisely the same algorithm can be used for fitting two-level group-specific curve models because of the following result.
Result 1. Computation of $[\widehat\beta^T\ \widehat u^T]^T$ and each of the sub-blocks of ${\rm Cov}([\widehat\beta^T\ (\widehat u-u)^T]^T)$ listed in (7) are expressible as solutions to the two-level sparse matrix least squares problem
$$\left\|\,b-B\begin{bmatrix}\beta\\ u\end{bmatrix}\right\|^2$$
where the non-zero sub-blocks of $B$ and $b$, according to the notation in (9), are, for $1\le i\le m$,
$$b_i\equiv\begin{bmatrix}\sigma_{\varepsilon}^{-1}y_i\\ 0\\ 0\\ 0\end{bmatrix},\quad
B_i\equiv\begin{bmatrix}\sigma_{\varepsilon}^{-1}X_i & \sigma_{\varepsilon}^{-1}Z_{{\rm gbl},i}\\
O & m^{-1/2}\sigma_{\rm gbl}^{-1}I_{K_{\rm gbl}}\\ O & O\\ O & O\end{bmatrix}
\quad\mbox{and}\quad
\overset{\bullet}{B}_i\equiv\begin{bmatrix}\sigma_{\varepsilon}^{-1}X_i & \sigma_{\varepsilon}^{-1}Z_{{\rm grp},i}\\
O & O\\ \Sigma^{-1/2} & O\\ O & \sigma_{\rm grp}^{-1}I_{K_{\rm grp}}\end{bmatrix},$$
with each of these matrices having $\tilde n_i=n_i+K_{\rm gbl}+2+K_{\rm grp}$ rows and with $B_i$ having $p=2+K_{\rm gbl}$ columns and $\overset{\bullet}{B}_i$ having $q=2+K_{\rm grp}$ columns. The solutions are
$$\begin{bmatrix}\widehat\beta\\ \widehat u_{\rm gbl}\end{bmatrix}=x_1,\qquad
{\rm Cov}\left(\begin{bmatrix}\widehat\beta\\ \widehat u_{\rm gbl}-u_{\rm gbl}\end{bmatrix}\right)=A^{11}$$
and
$$\begin{bmatrix}\widehat u_{{\rm lin},i}\\ \widehat u_{{\rm grp},i}\end{bmatrix}=x_{2,i},\qquad
E\left\{\begin{bmatrix}\widehat\beta\\ \widehat u_{\rm gbl}-u_{\rm gbl}\end{bmatrix}
\begin{bmatrix}\widehat u_{{\rm lin},i}-u_{{\rm lin},i}\\ \widehat u_{{\rm grp},i}-u_{{\rm grp},i}\end{bmatrix}^T\right\}=A^{12,i},\qquad
{\rm Cov}\left(\begin{bmatrix}\widehat u_{{\rm lin},i}-u_{{\rm lin},i}\\ \widehat u_{{\rm grp},i}-u_{{\rm grp},i}\end{bmatrix}\right)=A^{22,i},\quad 1\le i\le m.$$

A derivation of Result 1 is given in Section S.1 of the web-supplement. Algorithm 1 encapsulates streamlined best linear unbiased prediction computation together with the coefficient covariance matrix sub-blocks of interest.
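To make the block structure of Result 1 concrete, the following R sketch constructs the per-group triplets $(b_i, B_i, \overset{\bullet}{B}_i)$ from the design matrices of the earlier sketch. The variance parameter values are purely illustrative, `chol(solve(Sigma))` serves as the matrix square root of $\Sigma^{-1}$ (any factor $M$ with $M^TM=\Sigma^{-1}$ works for least squares purposes), and the `SolveTwoLevelSparseLeastSquares` routine itself is not reproduced here.

```r
# Per-group inputs of Result 1, reusing X.i, Z.gbl.i, Z.grp.i, y, m, K.gbl and
# K.grp from the earlier design matrix sketch; variance parameters are illustrative.
sigsq.eps <- 0.09; sigsq.gbl <- 1; sigsq.grp <- 1
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
makeTwoLevelBlocks <- function(i) {
  Sigma.inv.sqrt <- chol(solve(Sigma))                       # M with t(M) %*% M = solve(Sigma)
  b.i <- c(y[[i]]/sqrt(sigsq.eps), rep(0, K.gbl + 2 + K.grp))
  B.i <- rbind(cbind(X.i[[i]], Z.gbl.i[[i]])/sqrt(sigsq.eps),
               cbind(matrix(0, K.gbl, 2), diag(K.gbl)/sqrt(m*sigsq.gbl)),
               matrix(0, 2 + K.grp, 2 + K.gbl))
  B.dot.i <- rbind(cbind(X.i[[i]], Z.grp.i[[i]])/sqrt(sigsq.eps),
                   matrix(0, K.gbl, 2 + K.grp),
                   cbind(Sigma.inv.sqrt, matrix(0, 2, K.grp)),
                   cbind(matrix(0, K.grp, 2), diag(K.grp)/sqrt(sigsq.grp)))
  list(b = b.i, B = B.i, B.dot = B.dot.i)                    # tilde(n_i) rows each
}
blocks <- lapply(1:m, makeTwoLevelBlocks)
```

These triplets are exactly the quantities that a SolveTwoLevelSparseLeastSquares implementation consumes, so the storage per group is linear in $n_i$ rather than quadratic in the total number of coefficients.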
We now consider the following Bayesian extension of (2) and (4):
$$y\,|\,\beta,u,\sigma_{\varepsilon}^2\sim N(X\beta+Zu,\sigma_{\varepsilon}^2I),\qquad
u\,|\,\sigma_{\rm gbl}^2,\sigma_{\rm grp}^2,\Sigma\sim N(0,G),\quad G\mbox{ as defined in (4)},$$
$$\beta\sim N(\mu_{\beta},\Sigma_{\beta}),\qquad
\sigma_{\varepsilon}^2\,|\,a_{\varepsilon}\sim\mbox{Inverse-}\chi^2(\nu_{\varepsilon},1/a_{\varepsilon}),\quad
a_{\varepsilon}\sim\mbox{Inverse-}\chi^2(1,1/(\nu_{\varepsilon}s_{\varepsilon}^2)),$$
$$\sigma_{\rm gbl}^2\,|\,a_{\rm gbl}\sim\mbox{Inverse-}\chi^2(\nu_{\rm gbl},1/a_{\rm gbl}),\quad
a_{\rm gbl}\sim\mbox{Inverse-}\chi^2(1,1/(\nu_{\rm gbl}s_{\rm gbl}^2)),$$
$$\sigma_{\rm grp}^2\,|\,a_{\rm grp}\sim\mbox{Inverse-}\chi^2(\nu_{\rm grp},1/a_{\rm grp}),\quad
a_{\rm grp}\sim\mbox{Inverse-}\chi^2(1,1/(\nu_{\rm grp}s_{\rm grp}^2)),$$
$$\Sigma\,|\,A_{\Sigma}\sim\mbox{Inverse-G-Wishart}(G_{\rm full},\nu_{\Sigma}+2,A_{\Sigma}^{-1}),\quad
A_{\Sigma}\sim\mbox{Inverse-G-Wishart}(G_{\rm diag},1,\Lambda_{A_{\Sigma}}),\quad
\Lambda_{A_{\Sigma}}\equiv\{\nu_{\Sigma}\,{\rm diag}(s_{\Sigma,1}^2,s_{\Sigma,2}^2)\}^{-1}.\qquad(10)$$
Here the $2\times1$ vector $\mu_{\beta}$ and $2\times2$ symmetric positive definite matrix $\Sigma_{\beta}$ are hyperparameters corresponding to the prior distribution on $\beta$, and $\nu_{\varepsilon},s_{\varepsilon},\nu_{\rm gbl},s_{\rm gbl},\nu_{\rm grp},s_{\rm grp},\nu_{\Sigma},s_{\Sigma,1},s_{\Sigma,2}>0$ are hyperparameters for the variance and covariance matrix parameters. Details on the Inverse-G-Wishart distribution, and the Inverse-$\chi^2$ special case, are given in Section S.3 of the web-supplement.
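As a quick numerical check, not taken from the paper, the R snippet below simulates from the auxiliary-variable hierarchy in (10) and compares quantiles of the induced prior on $\sigma_{\varepsilon}$ with those of a Half-$t(\nu_{\varepsilon},s_{\varepsilon})$ distribution, the prior that this construction is designed to produce. It assumes the parametrisation in which Inverse-$\chi^2(\kappa,\lambda)$ has density proportional to $x^{-(\kappa/2+1)}e^{-\lambda/(2x)}$.

```r
# Monte Carlo check: sigsq | a ~ Inverse-chisq(nu, 1/a), a ~ Inverse-chisq(1, 1/(nu*s^2))
# should give sigma = sqrt(sigsq) a Half-t(nu, s) prior.
set.seed(1)
nu <- 2; s <- 1; N <- 1e6
rinvchisq <- function(n, kappa, lambda) lambda/rchisq(n, kappa)  # density propto x^(-kappa/2-1) exp(-lambda/(2x))
a     <- rinvchisq(N, 1, 1/(nu*s^2))
sigsq <- rinvchisq(N, nu, 1/a)
qhalft <- function(p, nu, s) s*qt((1 + p)/2, nu)                 # Half-t quantile function
round(rbind(empirical = quantile(sqrt(sigsq), c(0.25, 0.5, 0.75, 0.9)),
            `Half-t`  = qhalft(c(0.25, 0.5, 0.75, 0.9), nu, s)), 3)
```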
Algorithm 1. Streamlined algorithm for obtaining best linear unbiased predictions and corresponding covariance matrix components for the two-level group-specific curves model.

Inputs: $y_i\ (n_i\times1)$, $X_i\ (n_i\times2)$, $Z_{{\rm gbl},i}\ (n_i\times K_{\rm gbl})$, $Z_{{\rm grp},i}\ (n_i\times K_{\rm grp})$, $1\le i\le m$; $\sigma_{\varepsilon}^2,\sigma_{\rm gbl}^2,\sigma_{\rm grp}^2>0$; $\Sigma\ (2\times2)$, symmetric and positive definite.

For $i=1,\ldots,m$: set $b_i$, $B_i$ and $\overset{\bullet}{B}_i$ exactly as in Result 1.
$\mathcal{S}\leftarrow$ SolveTwoLevelSparseLeastSquares$\big(\{(b_i,B_i,\overset{\bullet}{B}_i):1\le i\le m\}\big)$
$[\widehat\beta;\widehat u_{\rm gbl}]\leftarrow x_1$ component of $\mathcal{S}$;  ${\rm Cov}([\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}])\leftarrow A^{11}$ component of $\mathcal{S}$
For $i=1,\ldots,m$:
  $[\widehat u_{{\rm lin},i};\widehat u_{{\rm grp},i}]\leftarrow x_{2,i}$ component of $\mathcal{S}$
  ${\rm Cov}([\widehat u_{{\rm lin},i}-u_{{\rm lin},i};\widehat u_{{\rm grp},i}-u_{{\rm grp},i}])\leftarrow A^{22,i}$ component of $\mathcal{S}$
  $E\{[\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}][\widehat u_{{\rm lin},i}-u_{{\rm lin},i};\widehat u_{{\rm grp},i}-u_{{\rm grp},i}]^T\}\leftarrow A^{12,i}$ component of $\mathcal{S}$
Output: $[\widehat\beta;\widehat u_{\rm gbl}]$, its covariance matrix, and, for $1\le i\le m$, $[\widehat u_{{\rm lin},i};\widehat u_{{\rm grp},i}]$ together with the covariance and cross-covariance sub-blocks listed above.

The auxiliary variable $a_{\varepsilon}$ is defined so that $\sigma_{\varepsilon}$ has a Half-$t$ distribution with degrees of freedom parameter $\nu_{\varepsilon}$ and scale parameter $s_{\varepsilon}$, with larger values of $s_{\varepsilon}$ corresponding to greater noninformativity. Analogous comments apply to the other standard deviation parameters. Setting $\nu_{\Sigma}=2$ leads to the correlation parameter in $\Sigma$ having a Uniform distribution on $(-1,1)$ (Huang & Wand, 2013).

Throughout this article we use $p$ generically to denote a density function corresponding to random quantities in Bayesian models such as (10). For example, $p(\beta)$ denotes the prior density function of $\beta$ and $p(u\,|\,\sigma_{\rm gbl}^2,\sigma_{\rm grp}^2,\Sigma)$ denotes the density function of $u$ conditional on $(\sigma_{\rm gbl}^2,\sigma_{\rm grp}^2,\Sigma)$. Now consider the following mean field restriction on the joint posterior density function of all parameters in (10):
$$p(\beta,u,a_{\varepsilon},a_{\rm gbl},a_{\rm grp},A_{\Sigma},\sigma_{\varepsilon}^2,\sigma_{\rm gbl}^2,\sigma_{\rm grp}^2,\Sigma\,|\,y)
\approx q(\beta,u,a_{\varepsilon},a_{\rm gbl},a_{\rm grp},A_{\Sigma})\,q(\sigma_{\varepsilon}^2,\sigma_{\rm gbl}^2,\sigma_{\rm grp}^2,\Sigma).\qquad(11)$$
Here, generically, each $q$ denotes an approximate posterior density function of the random vector indicated by its argument according to the mean field restriction (11). Then application of the minimum Kullback-Leibler divergence equations (e.g. equation (10.9) of Bishop, 2006) leads to the optimal $q$-density functions for the parameters of interest being as follows:

$q^*(\beta,u)$ has a $N(\mu_{q(\beta,u)},\Sigma_{q(\beta,u)})$ distribution,
$q^*(\sigma_{\varepsilon}^2)$ has an Inverse-$\chi^2(\xi_{q(\sigma_{\varepsilon}^2)},\lambda_{q(\sigma_{\varepsilon}^2)})$ distribution,
$q^*(\sigma_{\rm gbl}^2)$ has an Inverse-$\chi^2(\xi_{q(\sigma_{\rm gbl}^2)},\lambda_{q(\sigma_{\rm gbl}^2)})$ distribution,
$q^*(\sigma_{\rm grp}^2)$ has an Inverse-$\chi^2(\xi_{q(\sigma_{\rm grp}^2)},\lambda_{q(\sigma_{\rm grp}^2)})$ distribution
and $q^*(\Sigma)$ has an Inverse-G-Wishart$(G_{\rm full},\xi_{q(\Sigma)},\Lambda_{q(\Sigma)})$ distribution.

The optimal $q$-density parameters are determined via an iterative coordinate ascent algorithm, with details given in Section S.5 of this article's web-supplement. The stopping criterion is based on the variational lower bound on the marginal likelihood (e.g.
Bishop, 2006, Section 10.2.2), denoted $p(y;q)$. Its logarithmic form and derivation are given in Section S.6 of the web-supplement.

Note that the updates for $\mu_{q(\beta,u)}$ and $\Sigma_{q(\beta,u)}$ may be written
$$\mu_{q(\beta,u)}\leftarrow(C^TR_{\rm MFVB}^{-1}C+D_{\rm MFVB})^{-1}(C^TR_{\rm MFVB}^{-1}y+o_{\rm MFVB})
\quad\mbox{and}\quad
\Sigma_{q(\beta,u)}\leftarrow(C^TR_{\rm MFVB}^{-1}C+D_{\rm MFVB})^{-1}\qquad(12)$$
where
$$R_{\rm MFVB}\equiv\mu_{q(1/\sigma_{\varepsilon}^2)}^{-1}I,\qquad
D_{\rm MFVB}\equiv\begin{bmatrix}\Sigma_{\beta}^{-1} & O & O\\
O & \mu_{q(1/\sigma_{\rm gbl}^2)}I & O\\
O & O & \underset{1\le i\le m}{\rm blockdiag}\begin{bmatrix}M_{q(\Sigma^{-1})} & O\\ O & \mu_{q(1/\sigma_{\rm grp}^2)}I\end{bmatrix}
\end{bmatrix}
\quad\mbox{and}\quad
o_{\rm MFVB}\equiv\begin{bmatrix}\Sigma_{\beta}^{-1}\mu_{\beta}\\ 0\end{bmatrix}.\qquad(13)$$
For increasingly large numbers of groups the matrix $\Sigma_{q(\beta,u)}$ approaches a size that is untenable for random access memory storage on standard 2020s workplace computers. However, only the following relatively small sub-blocks of $\Sigma_{q(\beta,u)}$ are required for variational inference concerning the variance and covariance matrix parameters:
$$\Sigma_{q(\beta,u_{\rm gbl})}=\mbox{top left-hand }(2+K_{\rm gbl})\times(2+K_{\rm gbl})\mbox{ sub-block of }(C^TR_{\rm MFVB}^{-1}C+D_{\rm MFVB})^{-1},$$
$$\Sigma_{q(u_{{\rm lin},i},u_{{\rm grp},i})}=\mbox{subsequent }(2+K_{\rm grp})\times(2+K_{\rm grp})\mbox{ diagonal sub-blocks of }(C^TR_{\rm MFVB}^{-1}C+D_{\rm MFVB})^{-1}$$
below $\Sigma_{q(\beta,u_{\rm gbl})}$, $1\le i\le m$, and
$$E_q\left\{\left(\begin{bmatrix}\beta\\ u_{\rm gbl}\end{bmatrix}-\mu_{q(\beta,u_{\rm gbl})}\right)
\left(\begin{bmatrix}u_{{\rm lin},i}\\ u_{{\rm grp},i}\end{bmatrix}-\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})}\right)^T\right\}
=\mbox{subsequent }(2+K_{\rm gbl})\times(2+K_{\rm grp})\mbox{ sub-blocks}$$
of $(C^TR_{\rm MFVB}^{-1}C+D_{\rm MFVB})^{-1}$ to the right of $\Sigma_{q(\beta,u_{\rm gbl})}$, $1\le i\le m$. $\qquad(14)$

For a streamlined mean field variational Bayes algorithm, we appeal to:
Result 2. The mean field variational Bayes updates of $\mu_{q(\beta,u)}$ and each of the sub-blocks of $\Sigma_{q(\beta,u)}$ in (14) are expressible as a two-level sparse matrix least squares problem of the form
$$\|b-B\mu_{q(\beta,u)}\|^2$$
where the non-zero sub-blocks of $B$ and $b$, according to the notation in (9), are, for $1\le i\le m$,
$$b_i\equiv\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}y_i\\ m^{-1/2}\Sigma_{\beta}^{-1/2}\mu_{\beta}\\ 0\\ 0\\ 0\end{bmatrix},\quad
B_i\equiv\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}X_i & \mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}Z_{{\rm gbl},i}\\
m^{-1/2}\Sigma_{\beta}^{-1/2} & O\\
O & m^{-1/2}\mu_{q(1/\sigma_{\rm gbl}^2)}^{1/2}I_{K_{\rm gbl}}\\
O & O\\ O & O\end{bmatrix}
\quad\mbox{and}\quad
\overset{\bullet}{B}_i\equiv\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}X_i & \mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}Z_{{\rm grp},i}\\
O & O\\ O & O\\ M_{q(\Sigma^{-1})}^{1/2} & O\\ O & \mu_{q(1/\sigma_{\rm grp}^2)}^{1/2}I_{K_{\rm grp}}\end{bmatrix},$$
with each of these matrices having $\tilde n_i=n_i+2+K_{\rm gbl}+2+K_{\rm grp}$ rows and with $B_i$ having $p=2+K_{\rm gbl}$ columns and $\overset{\bullet}{B}_i$ having $q=2+K_{\rm grp}$ columns. The solutions are
$$\mu_{q(\beta,u_{\rm gbl})}=x_1,\quad \Sigma_{q(\beta,u_{\rm gbl})}=A^{11},\quad
\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})}=x_{2,i},\quad \Sigma_{q(u_{{\rm lin},i},u_{{\rm grp},i})}=A^{22,i}$$
and
$$E_q\left\{\begin{bmatrix}\beta-\mu_{q(\beta)}\\ u_{\rm gbl}-\mu_{q(u_{\rm gbl})}\end{bmatrix}
\begin{bmatrix}u_{{\rm lin},i}-\mu_{q(u_{{\rm lin},i})}\\ u_{{\rm grp},i}-\mu_{q(u_{{\rm grp},i})}\end{bmatrix}^T\right\}=A^{12,i},
\quad 1\le i\le m.$$

Algorithm 2 utilizes Result 2 to facilitate streamlined computation of the variational parameters.

Lastly, we note that Algorithm 2 is loosely related to Algorithm 2 of Lee & Wand (2016). One difference is that we are treating the Gaussian, rather than Bernoulli, response situation here. In addition, we are using the recent sparse multilevel matrix results of Nolan & Wand (2018), which are amenable to higher level extensions, such as the three-level group-specific curve model treated in Section 3.
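The sub-blocks delivered by Result 2 are exactly what is needed to form pointwise credible bands for $f+g_i$. The R sketch below, which is not part of the paper, shows one way of doing so on a plotting grid; the argument names are this sketch's own labels for the $x_1$, $A^{11}$, $x_{2,i}$, $A^{22,i}$ and $A^{12,i}$ components, and `trunc.cubic()`, `knots.gbl` and `knots.grp` are reused from the earlier design matrix sketch.

```r
# Pointwise estimate and approximate 95% credible band for f + g_i on a grid xg,
# given the group-i q-density components named in Result 2.
curveCredBand <- function(xg, mu.beta.ugbl, Sig.beta.ugbl, mu.ui, Sig.ui, E.cross) {
  Cg  <- cbind(1, xg, trunc.cubic(xg, knots.gbl))   # global basis evaluated on the grid
  Ci  <- cbind(1, xg, trunc.cubic(xg, knots.grp))   # group-specific basis on the grid
  est <- as.vector(Cg %*% mu.beta.ugbl + Ci %*% mu.ui)
  Var <- Cg %*% Sig.beta.ugbl %*% t(Cg) + Ci %*% Sig.ui %*% t(Ci) +
         2 * Cg %*% E.cross %*% t(Ci)               # cross term uses the A^{12,i} sub-block
  se  <- sqrt(pmax(diag(Var), 0))
  data.frame(x = xg, est = est,
             lower = est - qnorm(0.975)*se, upper = est + qnorm(0.975)*se)
}
```

The cross term is the reason the $A^{12,i}$ sub-blocks must be retained: omitting it would misstate the pointwise variance of the fitted group-specific curve.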
In many curve-type data applications the data can be categorized as being from two or more types. Of particular interest in such circumstances are contrast function estimates and accompanying standard errors. The streamlined approaches used in Algorithms 1 and 2 still apply for the contrast function extension regardless of the number of categories. The two-category situation, where there is a single contrast function, is described here. The extension to higher numbers of categories is straightforward.

Suppose that the $(x_{ij},y_{ij})$ pairs are from one of two categories, labeled A and B, and introduce the indicator variable data
$$\iota_{ij}^{A}\equiv\begin{cases}1 & \mbox{if }(x_{ij},y_{ij})\mbox{ is from category A},\\ 0 & \mbox{if }(x_{ij},y_{ij})\mbox{ is from category B}.\end{cases}$$

Algorithm 2. QR-decomposition-based streamlined algorithm for obtaining mean field variational Bayes approximate posterior density functions for the parameters in the Bayesian two-level group-specific curves model (10) with product density restriction (11).
Data Inputs: $y_i\ (n_i\times1)$, $X_i\ (n_i\times2)$, $Z_{{\rm gbl},i}\ (n_i\times K_{\rm gbl})$, $Z_{{\rm grp},i}\ (n_i\times K_{\rm grp})$, $1\le i\le m$.
Hyperparameter Inputs: $\mu_{\beta}\ (2\times1)$, $\Sigma_{\beta}\ (2\times2)$ symmetric and positive definite, $s_{\varepsilon},\nu_{\varepsilon},s_{\rm gbl},\nu_{\rm gbl},s_{\Sigma,1},s_{\Sigma,2},\nu_{\Sigma},s_{\rm grp},\nu_{\rm grp}>0$.
For $i=1,\ldots,m$: $C_{{\rm gbl},i}\leftarrow[X_i\ Z_{{\rm gbl},i}]$; $C_{{\rm grp},i}\leftarrow[X_i\ Z_{{\rm grp},i}]$
Initialize: $\mu_{q(1/\sigma_{\varepsilon}^2)},\mu_{q(1/\sigma_{\rm gbl}^2)},\mu_{q(1/\sigma_{\rm grp}^2)},\mu_{q(1/a_{\varepsilon})},\mu_{q(1/a_{\rm gbl})},\mu_{q(1/a_{\rm grp})}>0$; $M_{q(\Sigma^{-1})}\ (2\times2)$ and $M_{q(A_{\Sigma}^{-1})}\ (2\times2)$, both symmetric and positive definite.
$\xi_{q(\sigma_{\varepsilon}^2)}\leftarrow\nu_{\varepsilon}+\sum_{i=1}^m n_i$; $\xi_{q(\sigma_{\rm gbl}^2)}\leftarrow\nu_{\rm gbl}+K_{\rm gbl}$; $\xi_{q(\Sigma)}\leftarrow\nu_{\Sigma}+2+m$;
$\xi_{q(\sigma_{\rm grp}^2)}\leftarrow\nu_{\rm grp}+mK_{\rm grp}$; $\xi_{q(a_{\varepsilon})}\leftarrow\nu_{\varepsilon}+1$; $\xi_{q(a_{\rm gbl})}\leftarrow\nu_{\rm gbl}+1$; $\xi_{q(a_{\rm grp})}\leftarrow\nu_{\rm grp}+1$; $\xi_{q(A_{\Sigma})}\leftarrow\nu_{\Sigma}+2$
Cycle:
  For $i=1,\ldots,m$: set $b_i$, $B_i$ and $\overset{\bullet}{B}_i$ exactly as in Result 2.
  $\mathcal{S}\leftarrow$ SolveTwoLevelSparseLeastSquares$\big(\{(b_i,B_i,\overset{\bullet}{B}_i):1\le i\le m\}\big)$
  $\mu_{q(\beta,u_{\rm gbl})}\leftarrow x_1$ component of $\mathcal{S}$; $\Sigma_{q(\beta,u_{\rm gbl})}\leftarrow A^{11}$ component of $\mathcal{S}$
  $\mu_{q(u_{\rm gbl})}\leftarrow$ last $K_{\rm gbl}$ rows of $\mu_{q(\beta,u_{\rm gbl})}$; $\Sigma_{q(u_{\rm gbl})}\leftarrow$ bottom-right $K_{\rm gbl}\times K_{\rm gbl}$ sub-block of $\Sigma_{q(\beta,u_{\rm gbl})}$
  $\lambda_{q(\sigma_{\varepsilon}^2)}\leftarrow\mu_{q(1/a_{\varepsilon})}$; $\Lambda_{q(\Sigma)}\leftarrow M_{q(A_{\Sigma}^{-1})}$; $\lambda_{q(\sigma_{\rm grp}^2)}\leftarrow\mu_{q(1/a_{\rm grp})}$
  For $i=1,\ldots,m$:
    $\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})}\leftarrow x_{2,i}$ component of $\mathcal{S}$; $\Sigma_{q(u_{{\rm lin},i},u_{{\rm grp},i})}\leftarrow A^{22,i}$ component of $\mathcal{S}$
    $\mu_{q(u_{{\rm lin},i})}\leftarrow$ first 2 rows of $\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})}$; $\Sigma_{q(u_{{\rm lin},i})}\leftarrow$ top left $2\times2$ sub-block of $\Sigma_{q(u_{{\rm lin},i},u_{{\rm grp},i})}$
    $\mu_{q(u_{{\rm grp},i})}\leftarrow$ last $K_{\rm grp}$ rows of $\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})}$; $\Sigma_{q(u_{{\rm grp},i})}\leftarrow$ bottom right $K_{\rm grp}\times K_{\rm grp}$ sub-block of $\Sigma_{q(u_{{\rm lin},i},u_{{\rm grp},i})}$
continued on a subsequent page . . .
Algorithm 2 continued. This is a continuation of the description of this algorithm that commences on a preceding page.

    $E_q\{([\beta;u_{\rm gbl}]-\mu_{q(\beta,u_{\rm gbl})})([u_{{\rm lin},i};u_{{\rm grp},i}]-\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})})^T\}\leftarrow A^{12,i}$ component of $\mathcal{S}$
    $\lambda_{q(\sigma_{\varepsilon}^2)}\leftarrow\lambda_{q(\sigma_{\varepsilon}^2)}+\|y_i-C_{{\rm gbl},i}\mu_{q(\beta,u_{\rm gbl})}-C_{{\rm grp},i}\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})}\|^2$
    $\lambda_{q(\sigma_{\varepsilon}^2)}\leftarrow\lambda_{q(\sigma_{\varepsilon}^2)}+{\rm tr}(C_{{\rm gbl},i}^TC_{{\rm gbl},i}\Sigma_{q(\beta,u_{\rm gbl})})+{\rm tr}(C_{{\rm grp},i}^TC_{{\rm grp},i}\Sigma_{q(u_{{\rm lin},i},u_{{\rm grp},i})})$
    $\lambda_{q(\sigma_{\varepsilon}^2)}\leftarrow\lambda_{q(\sigma_{\varepsilon}^2)}+2\,{\rm tr}\big[C_{{\rm grp},i}^TC_{{\rm gbl},i}\,E_q\{([\beta;u_{\rm gbl}]-\mu_{q(\beta,u_{\rm gbl})})([u_{{\rm lin},i};u_{{\rm grp},i}]-\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})})^T\}\big]$
    $\Lambda_{q(\Sigma)}\leftarrow\Lambda_{q(\Sigma)}+\mu_{q(u_{{\rm lin},i})}\mu_{q(u_{{\rm lin},i})}^T+\Sigma_{q(u_{{\rm lin},i})}$
    $\lambda_{q(\sigma_{\rm grp}^2)}\leftarrow\lambda_{q(\sigma_{\rm grp}^2)}+\|\mu_{q(u_{{\rm grp},i})}\|^2+{\rm tr}(\Sigma_{q(u_{{\rm grp},i})})$
  $\lambda_{q(\sigma_{\rm gbl}^2)}\leftarrow\mu_{q(1/a_{\rm gbl})}+\|\mu_{q(u_{\rm gbl})}\|^2+{\rm tr}(\Sigma_{q(u_{\rm gbl})})$
  $\mu_{q(1/\sigma_{\varepsilon}^2)}\leftarrow\xi_{q(\sigma_{\varepsilon}^2)}/\lambda_{q(\sigma_{\varepsilon}^2)}$; $\mu_{q(1/\sigma_{\rm gbl}^2)}\leftarrow\xi_{q(\sigma_{\rm gbl}^2)}/\lambda_{q(\sigma_{\rm gbl}^2)}$
  $M_{q(\Sigma^{-1})}\leftarrow(\xi_{q(\Sigma)}-1)\Lambda_{q(\Sigma)}^{-1}$; $\mu_{q(1/\sigma_{\rm grp}^2)}\leftarrow\xi_{q(\sigma_{\rm grp}^2)}/\lambda_{q(\sigma_{\rm grp}^2)}$
  $\lambda_{q(a_{\varepsilon})}\leftarrow\mu_{q(1/\sigma_{\varepsilon}^2)}+1/(\nu_{\varepsilon}s_{\varepsilon}^2)$; $\mu_{q(1/a_{\varepsilon})}\leftarrow\xi_{q(a_{\varepsilon})}/\lambda_{q(a_{\varepsilon})}$
  $\Lambda_{q(A_{\Sigma})}\leftarrow{\rm diag}\{{\rm diagonal}(M_{q(\Sigma^{-1})})\}+\{\nu_{\Sigma}{\rm diag}(s_{\Sigma,1}^2,s_{\Sigma,2}^2)\}^{-1}$; $M_{q(A_{\Sigma}^{-1})}\leftarrow\xi_{q(A_{\Sigma})}\Lambda_{q(A_{\Sigma})}^{-1}$
  $\lambda_{q(a_{\rm gbl})}\leftarrow\mu_{q(1/\sigma_{\rm gbl}^2)}+1/(\nu_{\rm gbl}s_{\rm gbl}^2)$; $\mu_{q(1/a_{\rm gbl})}\leftarrow\xi_{q(a_{\rm gbl})}/\lambda_{q(a_{\rm gbl})}$
  $\lambda_{q(a_{\rm grp})}\leftarrow\mu_{q(1/\sigma_{\rm grp}^2)}+1/(\nu_{\rm grp}s_{\rm grp}^2)$; $\mu_{q(1/a_{\rm grp})}\leftarrow\xi_{q(a_{\rm grp})}/\lambda_{q(a_{\rm grp})}$
until the increase in $p(y;q)$ is negligible.
Outputs: $\mu_{q(\beta,u_{\rm gbl})}$, $\Sigma_{q(\beta,u_{\rm gbl})}$, $\{\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})},\Sigma_{q(u_{{\rm lin},i},u_{{\rm grp},i})},E_q\{([\beta;u_{\rm gbl}]-\mu_{q(\beta,u_{\rm gbl})})([u_{{\rm lin},i};u_{{\rm grp},i}]-\mu_{q(u_{{\rm lin},i},u_{{\rm grp},i})})^T\}:1\le i\le m\}$, $\xi_{q(\sigma_{\varepsilon}^2)}$, $\lambda_{q(\sigma_{\varepsilon}^2)}$, $\xi_{q(\sigma_{\rm gbl}^2)}$, $\lambda_{q(\sigma_{\rm gbl}^2)}$, $\xi_{q(\Sigma)}$, $\Lambda_{q(\Sigma)}$, $\xi_{q(\sigma_{\rm grp}^2)}$, $\lambda_{q(\sigma_{\rm grp}^2)}$.

Penalized spline models for the global mean and deviation functions for each category are
$$f^{A}(x)=\beta_0^{A}+\beta_1^{A}x+\sum_{k=1}^{K_{\rm gbl}}u_{{\rm gbl},k}^{A}z_{{\rm gbl},k}(x),\qquad
g_i^{A}(x)=u_{{\rm lin},i0}^{A}+u_{{\rm lin},i1}^{A}x+\sum_{k=1}^{K_{\rm grp}}u_{{\rm grp},ik}^{A}z_{{\rm grp},k}(x)
\quad\mbox{for category A}$$
and
$$f^{B}(x)=\beta_0^{A}+\beta_0^{B{\rm vs}A}+(\beta_1^{A}+\beta_1^{B{\rm vs}A})x+\sum_{k=1}^{K_{\rm gbl}}u_{{\rm gbl},k}^{B}z_{{\rm gbl},k}(x),\qquad
g_i^{B}(x)=u_{{\rm lin},i0}^{B}+u_{{\rm lin},i1}^{B}x+\sum_{k=1}^{K_{\rm grp}}u_{{\rm grp},ik}^{B}z_{{\rm grp},k}(x)
\quad\mbox{for category B}.$$
This allows us to estimate the global contrast function
$$c(x)\equiv f^{B}(x)-f^{A}(x)=\beta_0^{B{\rm vs}A}+\beta_1^{B{\rm vs}A}x+\sum_{k=1}^{K_{\rm gbl}}(u_{{\rm gbl},k}^{B}-u_{{\rm gbl},k}^{A})z_{{\rm gbl},k}(x).\qquad(15)$$
The distributions on the random coefficients are
$$[u_{{\rm lin},i0}^{A}\ u_{{\rm lin},i1}^{A}\ u_{{\rm lin},i0}^{B}\ u_{{\rm lin},i1}^{B}]^T\stackrel{\rm ind.}{\sim}N(0,\Sigma)
\quad\mbox{and}\quad
u_{{\rm gbl},k}^{A}\stackrel{\rm ind.}{\sim}N(0,(\sigma_{\rm gbl}^{A})^2),\quad
u_{{\rm gbl},k}^{B}\stackrel{\rm ind.}{\sim}N(0,(\sigma_{\rm gbl}^{B})^2),\quad
u_{{\rm grp},ik}^{A}\stackrel{\rm ind.}{\sim}N(0,\sigma_{\rm grp}^2),\quad
u_{{\rm grp},ik}^{B}\stackrel{\rm ind.}{\sim}N(0,\sigma_{\rm grp}^2),$$
independently of each other. In this two-category extension, the matrix $\Sigma$ is an unstructured $4\times4$ covariance matrix.

Algorithms 1 and 2 can be used to achieve streamlined fitting and inference for the contrast curve extension, but with key matrices having new definitions.
Firstly, the $X_i$, $Z_{{\rm gbl},i}$ and $Z_{{\rm grp},i}$ matrices need to become
$$X_i=[1\ \ x_i\ \ 1-\iota_i^{A}\ \ (1-\iota_i^{A})\odot x_i],$$
$$Z_{{\rm gbl},i}=[\iota_i^{A}\odot z_{{\rm gbl},1}(x_i)\ \cdots\ \iota_i^{A}\odot z_{{\rm gbl},K_{\rm gbl}}(x_i)\ \ (1-\iota_i^{A})\odot z_{{\rm gbl},1}(x_i)\ \cdots\ (1-\iota_i^{A})\odot z_{{\rm gbl},K_{\rm gbl}}(x_i)]$$
and
$$Z_{{\rm grp},i}=[\iota_i^{A}\odot z_{{\rm grp},1}(x_i)\ \cdots\ \iota_i^{A}\odot z_{{\rm grp},K_{\rm grp}}(x_i)\ \ (1-\iota_i^{A})\odot z_{{\rm grp},1}(x_i)\ \cdots\ (1-\iota_i^{A})\odot z_{{\rm grp},K_{\rm grp}}(x_i)],$$
where $\iota_i^{A}$ is the $n_i\times1$ vector of $\iota_{ij}^{A}$ values and $\odot$ denotes element-wise multiplication. In the case of best linear unbiased prediction the updates for the $B_i$ and $\overset{\bullet}{B}_i$ matrices in Algorithm 1 need to be replaced by
$$B_i\leftarrow\begin{bmatrix}\sigma_{\varepsilon}^{-1}X_i & \sigma_{\varepsilon}^{-1}Z_{{\rm gbl},i}\\
O & m^{-1/2}\begin{bmatrix}(\sigma_{\rm gbl}^{A})^{-1}I_{K_{\rm gbl}} & O\\ O & (\sigma_{\rm gbl}^{B})^{-1}I_{K_{\rm gbl}}\end{bmatrix}\\
O & O\\ O & O\end{bmatrix}
\quad\mbox{and}\quad
\overset{\bullet}{B}_i\leftarrow\begin{bmatrix}\sigma_{\varepsilon}^{-1}X_i & \sigma_{\varepsilon}^{-1}Z_{{\rm grp},i}\\
O & O\\ \Sigma^{-1/2} & O\\ O & \sigma_{\rm grp}^{-1}I_{2K_{\rm grp}}\end{bmatrix},$$
and the output coefficient vectors change to
$$\begin{bmatrix}\widehat\beta\\ \widehat u_{\rm gbl}^{A}\\ \widehat u_{\rm gbl}^{B}\end{bmatrix}
\quad\mbox{and}\quad
\begin{bmatrix}\widehat u_{{\rm lin},i}^{A}\\ \widehat u_{{\rm lin},i}^{B}\\ \widehat u_{{\rm grp},i}^{A}\\ \widehat u_{{\rm grp},i}^{B}\end{bmatrix}.$$
In the case of mean field variational Bayes the updates of the $B_i$ and $\overset{\bullet}{B}_i$ matrices in Algorithm 2 need to be replaced by
$$B_i\leftarrow\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}X_i & \mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}Z_{{\rm gbl},i}\\
m^{-1/2}\Sigma_{\beta}^{-1/2} & O\\
O & m^{-1/2}\begin{bmatrix}\mu_{q(1/(\sigma_{\rm gbl}^{A})^2)}^{1/2}I_{K_{\rm gbl}} & O\\ O & \mu_{q(1/(\sigma_{\rm gbl}^{B})^2)}^{1/2}I_{K_{\rm gbl}}\end{bmatrix}\\
O & O\\ O & O\end{bmatrix}
\quad\mbox{and}\quad
\overset{\bullet}{B}_i\leftarrow\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}X_i & \mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}Z_{{\rm grp},i}\\
O & O\\ O & O\\ M_{q(\Sigma^{-1})}^{1/2} & O\\ O & \mu_{q(1/\sigma_{\rm grp}^2)}^{1/2}I_{2K_{\rm grp}}\end{bmatrix}.$$
A contrast curves adjustment to the mean field variational Bayes updates is also required for some of the covariance matrix parameters. However, these calculations are comparatively simple and analogous to those given in Section S.5.

We demonstrate the use of Algorithm 2 in this setting for data from a longitudinal study on adolescent somatic growth. More detail on these data can be found in Pratt et al. (1989). The variables of interest are $y_{ij}=j$th height measurement (centimetres) of subject $i$ and $x_{ij}=$ age (years) of subject $i$ when $y_{ij}$ is recorded, for $1\le i\le m$ and $1\le j\le n_i$. The subjects are categorized into black ethnicity and white ethnicity, and comparison of mean height between the two populations is of interest. Algorithm 2 is seen to have good agreement with the data in each sub-panel of the top two plots in Figure 2. The bottom panels of Figure 2 show the estimated height gap between black and white adolescents as a function of age. For the females, there is a significant height difference only at 16-17 years old. Between 5 and 15 years, there is no obvious height difference. For the males, the difference is highest and (marginally) statistically significant up to about 14 years of age, peaking at 13 years of age. Between 17 and 20 years old there is no discernible height difference between the two populations.

The three-level version of group-specific curve models corresponds to curve-type data having two nested groupings. For example, the data in each panel of Figure 1 are first grouped according to slice, which is the level 2 group, and the slices are grouped according to tumor, which is the level 3 group. We denote predictor/response pairs as $(x_{ijk},y_{ijk})$, where $x_{ijk}$ is the $k$th value of the predictor variable in the $i$th level 3 group and $(i,j)$th level 2 group, and $y_{ijk}$ is the corresponding value of the response variable. We let $m$ denote the number of level 3 groups, $n_i$ denote the number of level 2 groups in the $i$th level 3 group and $o_{ij}$ denote the number of units within the $(i,j)$th level 2 group.
The Figure 1 data, which happen to be balanced, are such that
$$m=\mbox{number of tumors}=10,\quad n_i=\mbox{number of slices for the }i\mbox{th tumor}=5$$
and
$$o_{ij}=\mbox{number of predictor/response pairs for the }i\mbox{th tumor and }j\mbox{th slice}=128.$$
The Gaussian response three-level group-specific curve model for such data is
$$y_{ijk}=f(x_{ijk})+g_i(x_{ijk})+h_{ij}(x_{ijk})+\varepsilon_{ijk},\quad
\varepsilon_{ijk}\stackrel{\rm ind.}{\sim}N(0,\sigma_{\varepsilon}^2),\quad
1\le i\le m,\ 1\le j\le n_i,\ 1\le k\le o_{ij},\qquad(16)$$
where the smooth function $f$ is the global mean function, the $g_i$ functions, $1\le i\le m$, allow for group-specific deviations according to membership of the $i$th level 3 group, and the $h_{ij}$, $1\le i\le m$, $1\le j\le n_i$, allow for an additional level of group-specific deviations according to membership of the $j$th level 2 group within the $i$th level 3 group.
Figure 2:
Top panels: fitted group-specific curves for 100 female subjects (left) and 116 male subjects (right) from the data on adolescent somatic growth (Pratt et al., 1989). The shading corresponds to approximate pointwise 99% credible intervals. Bottom panels: similar to the top panels but for the estimated contrast curve. The shaded regions correspond to approximate pointwise 95% credible intervals.

The mixed model-based penalized spline models for these functions are
$$f(x)=\beta_0+\beta_1x+\sum_{k=1}^{K_{\rm gbl}}u_{{\rm gbl},k}z_{{\rm gbl},k}(x),\qquad u_{{\rm gbl},k}\stackrel{\rm ind.}{\sim}N(0,\sigma_{\rm gbl}^2),$$
$$g_i(x)=u_{{\rm lin},i0}^{g}+u_{{\rm lin},i1}^{g}x+\sum_{k=1}^{K_{\rm grp}^{g}}u_{{\rm grp},ik}^{g}z_{{\rm grp},k}^{g}(x),\qquad
\begin{bmatrix}u_{{\rm lin},i0}^{g}\\ u_{{\rm lin},i1}^{g}\end{bmatrix}\stackrel{\rm ind.}{\sim}N(0,\Sigma^{g}),\quad
u_{{\rm grp},ik}^{g}\stackrel{\rm ind.}{\sim}N(0,\sigma_{{\rm grp},g}^2),$$
and
$$h_{ij}(x)=u_{{\rm lin},ij0}^{h}+u_{{\rm lin},ij1}^{h}x+\sum_{k=1}^{K_{\rm grp}^{h}}u_{{\rm grp},ijk}^{h}z_{{\rm grp},k}^{h}(x),\qquad
\begin{bmatrix}u_{{\rm lin},ij0}^{h}\\ u_{{\rm lin},ij1}^{h}\end{bmatrix}\stackrel{\rm ind.}{\sim}N(0,\Sigma^{h}),\quad
u_{{\rm grp},ijk}^{h}\stackrel{\rm ind.}{\sim}N(0,\sigma_{{\rm grp},h}^2),$$
with all random effect distributions independent of each other. For this three-level case we have three bases:
$\{z_{{\rm gbl},k}(\cdot):1\le k\le K_{\rm gbl}\}$, $\{z_{{\rm grp},k}^{g}(\cdot):1\le k\le K_{\rm grp}^{g}\}$ and $\{z_{{\rm grp},k}^{h}(\cdot):1\le k\le K_{\rm grp}^{h}\}$.
The variance and covariance matrix parameters are analogous to the two-level model. For example, $\Sigma^{g}$ and $\Sigma^{h}$ are both unstructured $2\times2$ matrices corresponding to the linear components of the $g_i$ and $h_{ij}$ respectively.

The following notation is useful for setting up the required design matrices: if $M_1,\ldots,M_d$ is a set of matrices each having the same number of columns, then
$$\underset{1\le i\le d}{\rm stack}(M_i)\equiv\begin{bmatrix}M_1\\ \vdots\\ M_d\end{bmatrix}.$$
We then define, for $1\le i\le m$ and $1\le j\le n_i$, $x_i\equiv{\rm stack}_{1\le j\le n_i}(x_{ij})$ and $x_{ij}\equiv{\rm stack}_{1\le k\le o_{ij}}(x_{ijk})$.

Model (16) is expressible as a Gaussian response linear mixed model as follows:
$$y\,|\,u\sim N(X\beta+Zu,\sigma_{\varepsilon}^2I),\qquad u\sim N(0,G),\qquad(17)$$
where the design matrices are
$$X={\rm stack}_{1\le i\le m}(X_i)\quad\mbox{with}\quad X_i={\rm stack}_{1\le j\le n_i}(X_{ij})\quad\mbox{and}\quad X_{ij}\equiv[1\ \ x_{ij}]$$
and
$$Z\equiv\Big[Z_{\rm gbl}\ \ \underset{1\le i\le m}{\rm blockdiag}\Big[\underset{1\le j\le n_i}{\rm stack}\big([X_{ij}\ Z_{{\rm grp},ij}^{g}]\big)\ \ \underset{1\le j\le n_i}{\rm blockdiag}\big([X_{ij}\ Z_{{\rm grp},ij}^{h}]\big)\Big]\Big],$$
where $Z_{\rm gbl}\equiv{\rm stack}_{1\le i\le m}\big({\rm stack}_{1\le j\le n_i}(Z_{{\rm gbl},ij})\big)$ and the matrices $Z_{{\rm gbl},ij}$, $Z_{{\rm grp},ij}^{g}$ and $Z_{{\rm grp},ij}^{h}$, $1\le i\le m$, $1\le j\le n_i$, contain, respectively, spline basis functions for the global mean function $f$, the $i$th level 3 group deviation functions $g_i$ and the $(i,j)$th level 2 group deviation functions $h_{ij}$. Specifically,
$$Z_{{\rm gbl},ij}\equiv[z_{{\rm gbl},1}(x_{ij})\ \cdots\ z_{{\rm gbl},K_{\rm gbl}}(x_{ij})],\quad
Z_{{\rm grp},ij}^{g}=[z_{{\rm grp},1}^{g}(x_{ij})\ \cdots\ z_{{\rm grp},K_{\rm grp}^{g}}^{g}(x_{ij})]\quad\mbox{and}\quad
Z_{{\rm grp},ij}^{h}\equiv[z_{{\rm grp},1}^{h}(x_{ij})\ \cdots\ z_{{\rm grp},K_{\rm grp}^{h}}^{h}(x_{ij})]$$
for $1\le i\le m$ and $1\le j\le n_i$. The fixed and random effects vectors are
$$\beta\equiv\begin{bmatrix}\beta_0\\ \beta_1\end{bmatrix}
\quad\mbox{and}\quad
u\equiv\begin{bmatrix}u_{\rm gbl}\\[0.5ex]
\underset{1\le i\le m}{\rm stack}\begin{bmatrix}
\begin{bmatrix}u_{{\rm lin},i}^{g}\\ u_{{\rm grp},i}^{g}\end{bmatrix}\\[0.5ex]
\begin{bmatrix}u_{{\rm lin},i1}^{h}\\ u_{{\rm grp},i1}^{h}\end{bmatrix}\\ \vdots\\
\begin{bmatrix}u_{{\rm lin},in_i}^{h}\\ u_{{\rm grp},in_i}^{h}\end{bmatrix}
\end{bmatrix}\end{bmatrix},$$
where $u_{{\rm lin},i}^{g}\equiv[u_{{\rm lin},i0}^{g}\ u_{{\rm lin},i1}^{g}]^T$, with $u_{{\rm grp},i}^{g}$, $u_{{\rm lin},ij}^{h}$ and $u_{{\rm grp},ij}^{h}$ defined similarly, and the covariance matrix of $u$ is
$$G={\rm Cov}(u)=\begin{bmatrix}\sigma_{\rm gbl}^2I & O\\
O & \underset{1\le i\le m}{\rm blockdiag}\begin{bmatrix}\Sigma^{g} & O & O\\
O & \sigma_{{\rm grp},g}^2I & O\\
O & O & I_{n_i}\otimes\begin{bmatrix}\Sigma^{h} & O\\ O & \sigma_{{\rm grp},h}^2I\end{bmatrix}\end{bmatrix}\end{bmatrix}.\qquad(18)$$
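The nested stack/blockdiag structure of $Z$ in (17) can be assembled directly, as the following R sketch illustrates. It is not part of the paper's algorithms: the nested lists `X.ij`, `Z.gbl.ij`, `Zg.grp.ij` and `Zh.grp.ij` (indexed `[[i]][[j]]`) are assumed to have been built in the same spirit as the earlier two-level sketch, with separate $g$ and $h$ spline bases.

```r
# Build the three-level random effects design Z of (17) from per-(i,j) blocks.
library(Matrix)
buildThreeLevelZ <- function(X.ij, Z.gbl.ij, Zg.grp.ij, Zh.grp.ij) {
  m <- length(X.ij)
  level3block <- function(i) {
    g.part <- do.call(rbind, Map(cbind, X.ij[[i]], Zg.grp.ij[[i]]))     # stack_j([X_ij Z^g_grp,ij])
    h.part <- as.matrix(bdiag(Map(cbind, X.ij[[i]], Zh.grp.ij[[i]])))   # blockdiag_j([X_ij Z^h_grp,ij])
    cbind(g.part, h.part)
  }
  Z.gbl <- do.call(rbind, lapply(1:m, function(i) do.call(rbind, Z.gbl.ij[[i]])))
  cbind(Z.gbl, as.matrix(bdiag(lapply(1:m, level3block))))              # [Z.gbl blockdiag_i(...)]
}
```

Forming $Z$ explicitly like this is only sensible for small examples; the point of the streamlined algorithms is precisely to avoid ever storing the corresponding full covariance matrices.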
We define matrices in a similar way to what is given in (5). The best linear unbiased predictor of $[\beta^T\ u^T]^T$ and the corresponding covariance matrix are as shown in (6), but with entries as described in this section. This covariance matrix grows quadratically in both $m$ and the $n_i$, and so storage becomes impractical for large numbers of level 2 and level 3 groups. However, only certain sub-blocks are required for the addition of pointwise confidence intervals to curve estimates. In particular, we only require the non-zero sub-blocks of the general three-level sparse matrix given in Section 3 of Nolan & Wand (2018) that correspond to $(C^TR_{\rm BLUP}^{-1}C+D_{\rm BLUP})^{-1}$. In the case of the three-level Gaussian response linear mixed model, Nolan & Wand's

$A^{11}$ sub-block corresponds to the $(2+K_{\rm gbl})\times(2+K_{\rm gbl})$ matrix ${\rm Cov}([\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}])$;
$A^{22,i}$ sub-block corresponds to the $(2+K_{\rm grp}^{g})\times(2+K_{\rm grp}^{g})$ matrix ${\rm Cov}([\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}])$;
$A^{12,i}$ sub-block corresponds to the $(2+K_{\rm gbl})\times(2+K_{\rm grp}^{g})$ matrix $E\{[\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}][\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}]^T\}$, $1\le i\le m$;
$A^{22,ij}$ sub-block corresponds to the $(2+K_{\rm grp}^{h})\times(2+K_{\rm grp}^{h})$ matrix ${\rm Cov}([\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}])$;
$A^{12,ij}$ sub-block corresponds to the $(2+K_{\rm gbl})\times(2+K_{\rm grp}^{h})$ matrix $E\{[\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}][\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}]^T\}$;
$A^{12,i,j}$ sub-block corresponds to the $(2+K_{\rm grp}^{g})\times(2+K_{\rm grp}^{h})$ matrix $E\{[\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}][\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}]^T\}$, $1\le i\le m$, $1\le j\le n_i$.

As described in Nolan, Menictas & Wand (2019), the SolveThreeLevelSparseLeastSquares algorithm arises in the special case where $x$ is the minimizer of the least squares problem given in equation (8), where $B$ has the three-level sparse form and $b$ is partitioned according to that shown in equation (7) of Nolan & Wand (2018). This algorithm can be used for fitting three-level group-specific curve models by making use of Result 3.
Result 3. Computation of $[\widehat\beta^T\ \widehat u^T]^T$ and each of the sub-blocks of ${\rm Cov}([\widehat\beta^T\ (\widehat u-u)^T]^T)$ listed above are expressible as the three-level sparse matrix least squares problem
$$\left\|\,b-B\begin{bmatrix}\beta\\ u\end{bmatrix}\right\|^2$$
where the non-zero sub-blocks of $B$ and $b$, according to the notation in Section 3.1 of Nolan & Wand (2018), are, for $1\le i\le m$ and $1\le j\le n_i$,
$$b_{ij}\equiv\begin{bmatrix}\sigma_{\varepsilon}^{-1}y_{ij}\\ 0\\ 0\\ 0\\ 0\\ 0\end{bmatrix},\quad
B_{ij}\equiv\begin{bmatrix}\sigma_{\varepsilon}^{-1}X_{ij} & \sigma_{\varepsilon}^{-1}Z_{{\rm gbl},ij}\\
O & \big(\sum_{i=1}^m n_i\big)^{-1/2}\sigma_{\rm gbl}^{-1}I_{K_{\rm gbl}}\\
O & O\\ O & O\\ O & O\\ O & O\end{bmatrix},\quad
\overset{\bullet}{B}_{ij}\equiv\begin{bmatrix}\sigma_{\varepsilon}^{-1}X_{ij} & \sigma_{\varepsilon}^{-1}Z_{{\rm grp},ij}^{g}\\
O & O\\ n_i^{-1/2}(\Sigma^{g})^{-1/2} & O\\ O & n_i^{-1/2}\sigma_{{\rm grp},g}^{-1}I_{K_{\rm grp}^{g}}\\ O & O\\ O & O\end{bmatrix}$$
and
$$\overset{\bullet\bullet}{B}_{ij}\equiv\begin{bmatrix}\sigma_{\varepsilon}^{-1}X_{ij} & \sigma_{\varepsilon}^{-1}Z_{{\rm grp},ij}^{h}\\
O & O\\ O & O\\ O & O\\ (\Sigma^{h})^{-1/2} & O\\ O & \sigma_{{\rm grp},h}^{-1}I_{K_{\rm grp}^{h}}\end{bmatrix},$$
with each of these matrices having $\tilde o_{ij}=o_{ij}+K_{\rm gbl}+2+K_{\rm grp}^{g}+2+K_{\rm grp}^{h}$ rows and with $B_{ij}$ having $p=2+K_{\rm gbl}$ columns, $\overset{\bullet}{B}_{ij}$ having $q_1=2+K_{\rm grp}^{g}$ columns and $\overset{\bullet\bullet}{B}_{ij}$ having $q_2=2+K_{\rm grp}^{h}$ columns. The solutions are
$$\begin{bmatrix}\widehat\beta\\ \widehat u_{\rm gbl}\end{bmatrix}=x_1,\qquad
{\rm Cov}\left(\begin{bmatrix}\widehat\beta\\ \widehat u_{\rm gbl}-u_{\rm gbl}\end{bmatrix}\right)=A^{11},$$
$$\begin{bmatrix}\widehat u_{{\rm lin},i}^{g}\\ \widehat u_{{\rm grp},i}^{g}\end{bmatrix}=x_{2,i},\qquad
E\{[\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}][\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}]^T\}=A^{12,i},\qquad
{\rm Cov}\left(\begin{bmatrix}\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g}\\ \widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}\end{bmatrix}\right)=A^{22,i},\quad 1\le i\le m,$$
$$\begin{bmatrix}\widehat u_{{\rm lin},ij}^{h}\\ \widehat u_{{\rm grp},ij}^{h}\end{bmatrix}=x_{2,ij},\qquad
E\{[\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}][\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}]^T\}=A^{12,ij},$$
$$E\{[\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}][\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}]^T\}=A^{12,i,j}
\quad\mbox{and}\quad
{\rm Cov}\left(\begin{bmatrix}\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h}\\ \widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}\end{bmatrix}\right)=A^{22,ij},\quad 1\le i\le m,\ 1\le j\le n_i.$$

Algorithm 3. Streamlined algorithm for obtaining best linear unbiased predictions and corresponding covariance matrix components for the three-level group-specific curves model.
Inputs: $y_{ij}\ (o_{ij}\times1)$, $X_{ij}\ (o_{ij}\times2)$, $Z_{{\rm gbl},ij}\ (o_{ij}\times K_{\rm gbl})$, $Z_{{\rm grp},ij}^{g}\ (o_{ij}\times K_{\rm grp}^{g})$, $Z_{{\rm grp},ij}^{h}\ (o_{ij}\times K_{\rm grp}^{h})$, $1\le i\le m$, $1\le j\le n_i$; $\sigma_{\varepsilon}^2,\sigma_{\rm gbl}^2,\sigma_{{\rm grp},g}^2,\sigma_{{\rm grp},h}^2>0$; $\Sigma^{g}\ (2\times2)$ and $\Sigma^{h}\ (2\times2)$, symmetric and positive definite.

For $i=1,\ldots,m$: For $j=1,\ldots,n_i$: set $b_{ij}$, $B_{ij}$, $\overset{\bullet}{B}_{ij}$ and $\overset{\bullet\bullet}{B}_{ij}$ exactly as in Result 3.
$\mathcal{S}\leftarrow$ SolveThreeLevelSparseLeastSquares$\big(\{(b_{ij},B_{ij},\overset{\bullet}{B}_{ij},\overset{\bullet\bullet}{B}_{ij}):1\le i\le m,\ 1\le j\le n_i\}\big)$
$[\widehat\beta;\widehat u_{\rm gbl}]\leftarrow x_1$ component of $\mathcal{S}$; ${\rm Cov}([\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}])\leftarrow A^{11}$ component of $\mathcal{S}$
For $i=1,\ldots,m$:
  $[\widehat u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}]\leftarrow x_{2,i}$ component of $\mathcal{S}$
  ${\rm Cov}([\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}])\leftarrow A^{22,i}$ component of $\mathcal{S}$
  $E\{[\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}][\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}]^T\}\leftarrow A^{12,i}$ component of $\mathcal{S}$
continued on a subsequent page . . .

A derivation of Result 3 is given in Section S.7 of the web-supplement. Result 3 combined with Theorem 4 of Nolan & Wand (2018) leads to Algorithm 3. The SolveThreeLevelSparseLeastSquares algorithm is given in Section S.12.

Algorithm 3 continued.
This is a continuation of the description of this algorithm that commences on a preceding page.
  For $j=1,\ldots,n_i$:
    $[\widehat u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}]\leftarrow x_{2,ij}$ component of $\mathcal{S}$
    $E\{[\widehat\beta;\widehat u_{\rm gbl}-u_{\rm gbl}][\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}]^T\}\leftarrow A^{12,ij}$ component of $\mathcal{S}$
    $E\{[\widehat u_{{\rm lin},i}^{g}-u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}-u_{{\rm grp},i}^{g}][\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}]^T\}\leftarrow A^{12,i,j}$ component of $\mathcal{S}$
    ${\rm Cov}([\widehat u_{{\rm lin},ij}^{h}-u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}-u_{{\rm grp},ij}^{h}])\leftarrow A^{22,ij}$ component of $\mathcal{S}$
Output: $[\widehat\beta;\widehat u_{\rm gbl}]$ and its covariance matrix; for $1\le i\le m$, $[\widehat u_{{\rm lin},i}^{g};\widehat u_{{\rm grp},i}^{g}]$ with its covariance and cross-covariance sub-blocks; and, for $1\le i\le m$, $1\le j\le n_i$, $[\widehat u_{{\rm lin},ij}^{h};\widehat u_{{\rm grp},ij}^{h}]$ with its covariance and cross-covariance sub-blocks, as listed above.

A Bayesian extension of (17) and (18) is:
$$y\,|\,\beta,u,\sigma_{\varepsilon}^2\sim N(X\beta+Zu,\sigma_{\varepsilon}^2I),\qquad
u\,|\,\sigma_{\rm gbl}^2,\sigma_{{\rm grp},g}^2,\Sigma^{g},\sigma_{{\rm grp},h}^2,\Sigma^{h}\sim N(0,G),\quad G\mbox{ as defined in (18)},$$
$$\beta\sim N(\mu_{\beta},\Sigma_{\beta}),\qquad
\sigma_{\varepsilon}^2\,|\,a_{\varepsilon}\sim\mbox{Inverse-}\chi^2(\nu_{\varepsilon},1/a_{\varepsilon}),\quad
a_{\varepsilon}\sim\mbox{Inverse-}\chi^2(1,1/(\nu_{\varepsilon}s_{\varepsilon}^2)),$$
$$\sigma_{\rm gbl}^2\,|\,a_{\rm gbl}\sim\mbox{Inverse-}\chi^2(\nu_{\rm gbl},1/a_{\rm gbl}),\quad
a_{\rm gbl}\sim\mbox{Inverse-}\chi^2(1,1/(\nu_{\rm gbl}s_{\rm gbl}^2)),$$
$$\sigma_{{\rm grp},g}^2\,|\,a_{{\rm grp},g}\sim\mbox{Inverse-}\chi^2(\nu_{{\rm grp},g},1/a_{{\rm grp},g}),\quad
a_{{\rm grp},g}\sim\mbox{Inverse-}\chi^2(1,1/(\nu_{{\rm grp},g}s_{{\rm grp},g}^2)),$$
$$\sigma_{{\rm grp},h}^2\,|\,a_{{\rm grp},h}\sim\mbox{Inverse-}\chi^2(\nu_{{\rm grp},h},1/a_{{\rm grp},h}),\quad
a_{{\rm grp},h}\sim\mbox{Inverse-}\chi^2(1,1/(\nu_{{\rm grp},h}s_{{\rm grp},h}^2)),$$
$$\Sigma^{g}\,|\,A_{\Sigma^{g}}\sim\mbox{Inverse-G-Wishart}(G_{\rm full},\nu_{\Sigma^{g}}+2,A_{\Sigma^{g}}^{-1}),\quad
A_{\Sigma^{g}}\sim\mbox{Inverse-G-Wishart}(G_{\rm diag},1,\Lambda_{A_{\Sigma^{g}}}),\quad
\Lambda_{A_{\Sigma^{g}}}\equiv\{\nu_{\Sigma^{g}}{\rm diag}(s_{\Sigma^{g},1}^2,s_{\Sigma^{g},2}^2)\}^{-1},$$
$$\Sigma^{h}\,|\,A_{\Sigma^{h}}\sim\mbox{Inverse-G-Wishart}(G_{\rm full},\nu_{\Sigma^{h}}+2,A_{\Sigma^{h}}^{-1}),\quad
A_{\Sigma^{h}}\sim\mbox{Inverse-G-Wishart}(G_{\rm diag},1,\Lambda_{A_{\Sigma^{h}}}),\quad
\Lambda_{A_{\Sigma^{h}}}\equiv\{\nu_{\Sigma^{h}}{\rm diag}(s_{\Sigma^{h},1}^2,s_{\Sigma^{h},2}^2)\}^{-1}.\qquad(19)$$
The following mean field restriction is imposed on the joint posterior density function of all parameters in (19):
$$p(\beta,u,a_{\varepsilon},a_{\rm gbl},a_{{\rm grp},g},A_{\Sigma^{g}},a_{{\rm grp},h},A_{\Sigma^{h}},\sigma_{\varepsilon}^2,\sigma_{\rm gbl}^2,\sigma_{{\rm grp},g}^2,\Sigma^{g},\sigma_{{\rm grp},h}^2,\Sigma^{h}\,|\,y)
\approx q(\beta,u,a_{\varepsilon},a_{\rm gbl},a_{{\rm grp},g},A_{\Sigma^{g}},a_{{\rm grp},h},A_{\Sigma^{h}})\,q(\sigma_{\varepsilon}^2,\sigma_{\rm gbl}^2,\sigma_{{\rm grp},g}^2,\Sigma^{g},\sigma_{{\rm grp},h}^2,\Sigma^{h}).\qquad(20)$$
The optimal $q$-density functions for the parameters of interest are:

$q^*(\beta,u)$ has a $N(\mu_{q(\beta,u)},\Sigma_{q(\beta,u)})$ distribution,
$q^*(\sigma_{\varepsilon}^2)$ has an Inverse-$\chi^2(\xi_{q(\sigma_{\varepsilon}^2)},\lambda_{q(\sigma_{\varepsilon}^2)})$ distribution,
$q^*(\sigma_{\rm gbl}^2)$ has an Inverse-$\chi^2(\xi_{q(\sigma_{\rm gbl}^2)},\lambda_{q(\sigma_{\rm gbl}^2)})$ distribution,
$q^*(\sigma_{{\rm grp},g}^2)$ has an Inverse-$\chi^2(\xi_{q(\sigma_{{\rm grp},g}^2)},\lambda_{q(\sigma_{{\rm grp},g}^2)})$ distribution,
$q^*(\sigma_{{\rm grp},h}^2)$ has an Inverse-$\chi^2(\xi_{q(\sigma_{{\rm grp},h}^2)},\lambda_{q(\sigma_{{\rm grp},h}^2)})$ distribution,
$q^*(\Sigma^{g})$ has an Inverse-G-Wishart$(G_{\rm full},\xi_{q(\Sigma^{g})},\Lambda_{q(\Sigma^{g})})$ distribution
and $q^*(\Sigma^{h})$ has an Inverse-G-Wishart$(G_{\rm full},\xi_{q(\Sigma^{h})},\Lambda_{q(\Sigma^{h})})$ distribution.

The optimal $q$-density parameters are determined through an iterative coordinate ascent algorithm, details of which are given in Section S.10 of the web-supplement. As in the two-level case, the updates for $\mu_{q(\beta,u)}$ and $\Sigma_{q(\beta,u)}$ may be written in the same form as (12) but with a three-level version of the $C$ matrix and
$$D_{\rm MFVB}\equiv\begin{bmatrix}\Sigma_{\beta}^{-1} & O & O\\
O & \mu_{q(1/\sigma_{\rm gbl}^2)}I & O\\
O & O & \underset{1\le i\le m}{\rm blockdiag}\begin{bmatrix}M_{q((\Sigma^{g})^{-1})} & O & O\\
O & \mu_{q(1/\sigma_{{\rm grp},g}^2)}I & O\\
O & O & I_{n_i}\otimes\begin{bmatrix}M_{q((\Sigma^{h})^{-1})} & O\\ O & \mu_{q(1/\sigma_{{\rm grp},h}^2)}I\end{bmatrix}\end{bmatrix}\end{bmatrix}.\qquad(21)$$
For large numbers of level 2 and level 3 groups the size of $\Sigma_{q(\beta,u)}$ becomes infeasible to deal with. However, only relatively small sub-blocks of $\Sigma_{q(\beta,u)}$ are needed for variational inference regarding the variance and covariance matrix parameters. These sub-block positions correspond to the non-zero sub-block positions of a general three-level sparse matrix defined in Section 3 of Nolan & Wand (2018). Here, Nolan & Wand's

$A^{11}$ sub-block corresponds to the $(2+K_{\rm gbl})\times(2+K_{\rm gbl})$ matrix $\Sigma_{q(\beta,u_{\rm gbl})}$;
$A^{22,i}$ sub-block corresponds to the $(2+K_{\rm grp}^{g})\times(2+K_{\rm grp}^{g})$ matrix $\Sigma_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}$;
$A^{12,i}$ sub-block corresponds to the $(2+K_{\rm gbl})\times(2+K_{\rm grp}^{g})$ matrix $E_q\{([\beta;u_{\rm gbl}]-\mu_{q(\beta,u_{\rm gbl})})([u_{{\rm lin},i}^{g};u_{{\rm grp},i}^{g}]-\mu_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})})^T\}$, $1\le i\le m$;
$A^{22,ij}$ sub-block corresponds to the $(2+K_{\rm grp}^{h})\times(2+K_{\rm grp}^{h})$ matrix $\Sigma_{q(u_{{\rm lin},ij}^{h},u_{{\rm grp},ij}^{h})}$;
$A^{12,ij}$ sub-block corresponds to the $(2+K_{\rm gbl})\times(2+K_{\rm grp}^{h})$ matrix $E_q\{([\beta;u_{\rm gbl}]-\mu_{q(\beta,u_{\rm gbl})})([u_{{\rm lin},ij}^{h};u_{{\rm grp},ij}^{h}]-\mu_{q(u_{{\rm lin},ij}^{h},u_{{\rm grp},ij}^{h})})^T\}$;
$A^{12,i,j}$ sub-block corresponds to the $(2+K_{\rm grp}^{g})\times(2+K_{\rm grp}^{h})$ matrix $E_q\{([u_{{\rm lin},i}^{g};u_{{\rm grp},i}^{g}]-\mu_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})})([u_{{\rm lin},ij}^{h};u_{{\rm grp},ij}^{h}]-\mu_{q(u_{{\rm lin},ij}^{h},u_{{\rm grp},ij}^{h})})^T\}$, $1\le i\le m$, $1\le j\le n_i$. $\qquad(22)$

We appeal to Result 4 for a streamlined mean field variational Bayes algorithm.
Result 4. The mean field variational Bayes updates of $\mu_{q(\beta,u)}$ and each of the sub-blocks of $\Sigma_{q(\beta,u)}$ in (22) are expressible as a three-level sparse matrix least squares problem of the form
$$\left\|\,b-B\begin{bmatrix}\beta\\ u\end{bmatrix}\right\|^2$$
where the non-zero sub-blocks of $B$ and $b$, according to the notation in Section 3.1 of Nolan & Wand (2018), are, for $1\le i\le m$ and $1\le j\le n_i$,
$$b_{ij}\equiv\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}y_{ij}\\ \big(\sum_{i=1}^m n_i\big)^{-1/2}\Sigma_{\beta}^{-1/2}\mu_{\beta}\\ 0\\ 0\\ 0\\ 0\\ 0\end{bmatrix},\quad
B_{ij}\equiv\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}X_{ij} & \mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}Z_{{\rm gbl},ij}\\
\big(\sum_{i=1}^m n_i\big)^{-1/2}\Sigma_{\beta}^{-1/2} & O\\
O & \big(\sum_{i=1}^m n_i\big)^{-1/2}\mu_{q(1/\sigma_{\rm gbl}^2)}^{1/2}I_{K_{\rm gbl}}\\
O & O\\ O & O\\ O & O\\ O & O\end{bmatrix},$$
$$\overset{\bullet}{B}_{ij}\equiv\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}X_{ij} & \mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}Z_{{\rm grp},ij}^{g}\\
O & O\\ O & O\\ n_i^{-1/2}M_{q((\Sigma^{g})^{-1})}^{1/2} & O\\ O & n_i^{-1/2}\mu_{q(1/\sigma_{{\rm grp},g}^2)}^{1/2}I_{K_{\rm grp}^{g}}\\ O & O\\ O & O\end{bmatrix}
\quad\mbox{and}\quad
\overset{\bullet\bullet}{B}_{ij}\equiv\begin{bmatrix}\mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}X_{ij} & \mu_{q(1/\sigma_{\varepsilon}^2)}^{1/2}Z_{{\rm grp},ij}^{h}\\
O & O\\ O & O\\ O & O\\ O & O\\ M_{q((\Sigma^{h})^{-1})}^{1/2} & O\\ O & \mu_{q(1/\sigma_{{\rm grp},h}^2)}^{1/2}I_{K_{\rm grp}^{h}}\end{bmatrix},$$
with each of these matrices having $\tilde o_{ij}=o_{ij}+2+K_{\rm gbl}+2+K_{\rm grp}^{g}+2+K_{\rm grp}^{h}$ rows and with $B_{ij}$ having $p=2+K_{\rm gbl}$ columns, $\overset{\bullet}{B}_{ij}$ having $q_1=2+K_{\rm grp}^{g}$ columns and $\overset{\bullet\bullet}{B}_{ij}$ having $q_2=2+K_{\rm grp}^{h}$ columns. The solutions are
$$\mu_{q(\beta,u_{\rm gbl})}=x_1,\qquad \Sigma_{q(\beta,u_{\rm gbl})}=A^{11},$$
$$\mu_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}=x_{2,i},\qquad
E_q\{[\beta-\mu_{q(\beta)};u_{\rm gbl}-\mu_{q(u_{\rm gbl})}][u_{{\rm lin},i}^{g}-\mu_{q(u_{{\rm lin},i}^{g})};u_{{\rm grp},i}^{g}-\mu_{q(u_{{\rm grp},i}^{g})}]^T\}=A^{12,i},\qquad
\Sigma_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}=A^{22,i},\quad 1\le i\le m,$$
$$\mu_{q(u_{{\rm lin},ij}^{h},u_{{\rm grp},ij}^{h})}=x_{2,ij},\qquad
E_q\{[\beta-\mu_{q(\beta)};u_{\rm gbl}-\mu_{q(u_{\rm gbl})}][u_{{\rm lin},ij}^{h}-\mu_{q(u_{{\rm lin},ij}^{h})};u_{{\rm grp},ij}^{h}-\mu_{q(u_{{\rm grp},ij}^{h})}]^T\}=A^{12,ij},$$
$$E_q\{[u_{{\rm lin},i}^{g}-\mu_{q(u_{{\rm lin},i}^{g})};u_{{\rm grp},i}^{g}-\mu_{q(u_{{\rm grp},i}^{g})}][u_{{\rm lin},ij}^{h}-\mu_{q(u_{{\rm lin},ij}^{h})};u_{{\rm grp},ij}^{h}-\mu_{q(u_{{\rm grp},ij}^{h})}]^T\}=A^{12,i,j}
\quad\mbox{and}\quad
\Sigma_{q(u_{{\rm lin},ij}^{h},u_{{\rm grp},ij}^{h})}=A^{22,ij},\quad 1\le i\le m,\ 1\le j\le n_i.$$

Algorithm 4 makes use of Result 4 to facilitate streamlined computation of all variational parameters in the three-level group-specific curves model.

Figure 3 provides illustration of Algorithm 4 by showing the fits to the Figure 1 ultrasound data. Posterior mean curves and (narrow) 99% pointwise credible intervals are shown. As discussed in the next section, such fits can be obtained rapidly and accurately and Algorithm 4 is scalable to much larger data sets of the type illustrated by Figures 1 and 3.
Algorithm 4. QR-decomposition-based streamlined algorithm for obtaining mean field variational Bayes approximate posterior density functions for the parameters in the Bayesian three-level group-specific curves model (19) with product density restriction (20).
Data Inputs: $y_{ij}\ (o_{ij}\times1)$, $X_{ij}\ (o_{ij}\times2)$, $Z_{{\rm gbl},ij}\ (o_{ij}\times K_{\rm gbl})$, $Z_{{\rm grp},ij}^{g}\ (o_{ij}\times K_{\rm grp}^{g})$, $Z_{{\rm grp},ij}^{h}\ (o_{ij}\times K_{\rm grp}^{h})$, $1\le i\le m$, $1\le j\le n_i$.
Hyperparameter Inputs: $\mu_{\beta}\ (2\times1)$, $\Sigma_{\beta}\ (2\times2)$ symmetric and positive definite, $s_{\varepsilon},\nu_{\varepsilon},s_{\rm gbl},\nu_{\rm gbl},s_{\Sigma^{g},1},s_{\Sigma^{g},2},\nu_{\Sigma^{g}},s_{{\rm grp},g},\nu_{{\rm grp},g},s_{\Sigma^{h},1},s_{\Sigma^{h},2},\nu_{\Sigma^{h}},s_{{\rm grp},h},\nu_{{\rm grp},h}>0$.
For $i=1,\ldots,m$: For $j=1,\ldots,n_i$: $C_{{\rm gbl},ij}\leftarrow[X_{ij}\ Z_{{\rm gbl},ij}]$; $C_{{\rm grp},ij}^{g}\leftarrow[X_{ij}\ Z_{{\rm grp},ij}^{g}]$; $C_{{\rm grp},ij}^{h}\leftarrow[X_{ij}\ Z_{{\rm grp},ij}^{h}]$
Initialize: $\mu_{q(1/\sigma_{\varepsilon}^2)},\mu_{q(1/\sigma_{\rm gbl}^2)},\mu_{q(1/\sigma_{{\rm grp},g}^2)},\mu_{q(1/\sigma_{{\rm grp},h}^2)},\mu_{q(1/a_{\varepsilon})},\mu_{q(1/a_{\rm gbl})},\mu_{q(1/a_{{\rm grp},g})},\mu_{q(1/a_{{\rm grp},h})}>0$; $M_{q((\Sigma^{g})^{-1})}$, $M_{q((\Sigma^{h})^{-1})}$, $M_{q(A_{\Sigma^{g}}^{-1})}$, $M_{q(A_{\Sigma^{h}}^{-1})}$ all $2\times2$, symmetric and positive definite.
$\xi_{q(\sigma_{\varepsilon}^2)}\leftarrow\nu_{\varepsilon}+\sum_{i=1}^m\sum_{j=1}^{n_i}o_{ij}$; $\xi_{q(\sigma_{\rm gbl}^2)}\leftarrow\nu_{\rm gbl}+K_{\rm gbl}$; $\xi_{q(\Sigma^{g})}\leftarrow\nu_{\Sigma^{g}}+2+m$;
$\xi_{q(\Sigma^{h})}\leftarrow\nu_{\Sigma^{h}}+2+\sum_{i=1}^m n_i$; $\xi_{q(\sigma_{{\rm grp},g}^2)}\leftarrow\nu_{{\rm grp},g}+mK_{\rm grp}^{g}$; $\xi_{q(\sigma_{{\rm grp},h}^2)}\leftarrow\nu_{{\rm grp},h}+K_{\rm grp}^{h}\sum_{i=1}^m n_i$;
$\xi_{q(a_{\varepsilon})}\leftarrow\nu_{\varepsilon}+1$; $\xi_{q(a_{\rm gbl})}\leftarrow\nu_{\rm gbl}+1$; $\xi_{q(a_{{\rm grp},g})}\leftarrow\nu_{{\rm grp},g}+1$; $\xi_{q(a_{{\rm grp},h})}\leftarrow\nu_{{\rm grp},h}+1$; $\xi_{q(A_{\Sigma^{g}})}\leftarrow\nu_{\Sigma^{g}}+2$; $\xi_{q(A_{\Sigma^{h}})}\leftarrow\nu_{\Sigma^{h}}+2$
Cycle:
  For $i=1,\ldots,m$: For $j=1,\ldots,n_i$: set $b_{ij}$, $B_{ij}$, $\overset{\bullet}{B}_{ij}$ and $\overset{\bullet\bullet}{B}_{ij}$ exactly as in Result 4.

continued on a subsequent page . . .

Algorithm 4 continued. This is a continuation of the description of this algorithm that commences on a preceding page.

  $\mathcal{S}\leftarrow$ SolveThreeLevelSparseLeastSquares$\big(\{(b_{ij},B_{ij},\overset{\bullet}{B}_{ij},\overset{\bullet\bullet}{B}_{ij}):1\le i\le m,\ 1\le j\le n_i\}\big)$
  $\mu_{q(\beta,u_{\rm gbl})}\leftarrow x_1$ component of $\mathcal{S}$; $\Sigma_{q(\beta,u_{\rm gbl})}\leftarrow A^{11}$ component of $\mathcal{S}$
  $\mu_{q(u_{\rm gbl})}\leftarrow$ last $K_{\rm gbl}$ rows of $\mu_{q(\beta,u_{\rm gbl})}$; $\Sigma_{q(u_{\rm gbl})}\leftarrow$ bottom-right $K_{\rm gbl}\times K_{\rm gbl}$ sub-block of $\Sigma_{q(\beta,u_{\rm gbl})}$
  $\lambda_{q(\sigma_{\varepsilon}^2)}\leftarrow\mu_{q(1/a_{\varepsilon})}$; $\Lambda_{q(\Sigma^{g})}\leftarrow M_{q(A_{\Sigma^{g}}^{-1})}$; $\Lambda_{q(\Sigma^{h})}\leftarrow M_{q(A_{\Sigma^{h}}^{-1})}$
  $\lambda_{q(\sigma_{{\rm grp},g}^2)}\leftarrow\mu_{q(1/a_{{\rm grp},g})}$; $\lambda_{q(\sigma_{{\rm grp},h}^2)}\leftarrow\mu_{q(1/a_{{\rm grp},h})}$
  For $i=1,\ldots,m$:
    $\mu_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}\leftarrow x_{2,i}$ component of $\mathcal{S}$; $\Sigma_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}\leftarrow A^{22,i}$ component of $\mathcal{S}$
    $\mu_{q(u_{{\rm lin},i}^{g})}\leftarrow$ first 2 rows of $\mu_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}$; $\Sigma_{q(u_{{\rm lin},i}^{g})}\leftarrow$ top left $2\times2$ sub-block of $\Sigma_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}$
    $\mu_{q(u_{{\rm grp},i}^{g})}\leftarrow$ last $K_{\rm grp}^{g}$ rows of $\mu_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}$; $\Sigma_{q(u_{{\rm grp},i}^{g})}\leftarrow$ bottom right $K_{\rm grp}^{g}\times K_{\rm grp}^{g}$ sub-block of $\Sigma_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})}$
    $E_q\{([\beta;u_{\rm gbl}]-\mu_{q(\beta,u_{\rm gbl})})([u_{{\rm lin},i}^{g};u_{{\rm grp},i}^{g}]-\mu_{q(u_{{\rm lin},i}^{g},u_{{\rm grp},i}^{g})})^T\}\leftarrow A^{12,i}$ component of $\mathcal{S}$
    For $j=1,\ldots,n_i$:
, n i : µ q ( u h lin, ij , u h grp, ij ) ←− x ,ij component of S Σ q ( u h lin, ij , u h grp, ij ) ←− A ,ij component of S µ q ( u h lin, ij ) ←− first rows of µ q ( u h lin, ij , u h grp, ij ) Σ q ( u h lin, ij ) ←− top left × sub-block of Σ q ( u h lin, ij , u h grp, ij ) µ q ( u h grp, ij ) ←− last K h grp rows of µ q ( u h lin, ij , u h grp, ij ) Σ q ( u h grp, ij ) ←− bottom right K h grp × K h grp sub-block of Σ q ( u h lin, ij , u h grp, ij ) E q (cid:40)(cid:18)(cid:20) βu gbl (cid:21) − µ q ( β , u gbl ) (cid:19) (cid:18)(cid:20) u h lin, ij u h grp, ij (cid:21) − µ q ( u g lin, i , u g grp, i )) (cid:19) T (cid:41) ←− A ,ij component of S continued on a subsequent page . . . lgorithm 4 continued. This is a continuation of the description of this algorithm that com-mences on a preceding page. E q (cid:40)(cid:18)(cid:20) u g lin, i u g grp, i (cid:21) − µ q ( u g lin, i , u g grp, i )) (cid:19) (cid:18)(cid:20) u h lin, ij u h grp, ij (cid:21) − µ q ( u h lin, ij , u h grp, ij )) (cid:19) T (cid:41) ←− A ,i,j component of S λ q ( σ ε ) ←− λ q ( σ ε ) + (cid:13)(cid:13) y ij − C gbl, ij µ q ( β , u gbl ) − C g grp, ij µ q ( u g lin, i , u g grp, i ) − C h grp, ij µ q ( u h lin, ij , u h grp, ij ) (cid:13)(cid:13) λ q ( σ ε ) ←− λ q ( σ ε ) + tr ( C T gbl, ij C gbl, ij Σ q ( β , u gbl ) ) + tr (( C g grp, ij ) T C g grp, ij Σ q ( u g lin, i , u g grp, i ) ) λ q ( σ ε ) ←− λ q ( σ ε ) + tr (( C h grp, ij ) T C h grp, ij Σ q ( u h lin, ij , u h grp, ij ) ) λ q ( σ ε ) ←− λ q ( σ ε ) + 2 tr (cid:20) C T grp, i C gbl, i E q (cid:26)(cid:18)(cid:20) βu gbl (cid:21) − µ q ( β , u gbl ) (cid:19) × (cid:18)(cid:20) u g lin, i u g grp, i (cid:21) − µ q ( u g lin, i , u g grp, i )) (cid:19) T (cid:41)(cid:35) λ q ( σ ε ) ←− λ q ( σ ε ) + 2 tr (cid:20) ( C g grp, ij ) T C gbl, ij E q (cid:26)(cid:18)(cid:20) βu gbl (cid:21) − µ q ( β , u gbl ) (cid:19) × (cid:18)(cid:20) u h lin, ij u h grp, ij (cid:21) − µ q ( u h lin, ij , u h grp, ij )) (cid:19) T (cid:41)(cid:35) λ q ( σ ε ) ←− λ q ( σ ε ) + 2 tr (cid:20) ( C g grp, ij ) T C h grp, ij E q (cid:26)(cid:18)(cid:20) u g lin, i u g grp, i (cid:21) − µ q ( u g lin, i , u g grp, i ) (cid:19) × (cid:18)(cid:20) u h lin, ij u h grp, ij (cid:21) − µ q ( u h lin, ij , u h grp, ij ) (cid:19) T (cid:41)(cid:35) Λ q ( Σ h ) ←− Λ q ( Σ h ) + µ q ( u h lin, ij ) µ T q ( u h lin, ij ) + Σ q ( u h lin, ij ) λ q ( σ grp, h ) ←− λ q ( σ grp, h ) + (cid:107) µ q ( u h grp, ij ) (cid:107) + tr (cid:18) Σ q ( u h grp, ij ) (cid:19) Λ q ( Σ g ) ←− Λ q ( Σ g ) + µ q ( u g lin, i ) µ T q ( u g lin, i ) + Σ q ( u g lin, i ) λ q ( σ grp, g ) ←− λ q ( σ grp, g ) + (cid:107) µ q ( u g grp, i ) (cid:107) + tr (cid:16) Σ q ( u g grp, i ) (cid:17) λ q ( σ gbl ) ←− µ q (1 /a gbl ) + (cid:107) µ q ( u gbl ) (cid:107) + tr (cid:16) Σ q ( u gbl ) (cid:17) µ q (1 /σ ε ) ← ξ q ( σ ε ) /λ q ( σ ε ) ; µ q (1 /σ gbl ) ← ξ q ( σ gbl ) /λ q ( σ gbl ) M q ( ( Σ g ) − ) ← ( ξ q ( Σ g ) − Λ − q ( Σ g ) ; M q (( Σ h ) − ) ← ( ξ q ( Σ h ) − Λ − q ( Σ h ) µ q (1 /σ grp, g ) ← ξ q ( σ grp, g ) /λ q ( σ grp, g ) ; µ q (1 /σ grp, h ) ← ξ q ( σ grp, h ) /λ q ( σ grp, h ) λ q ( a ε ) ←− µ q (1 /σ ε ) + 1 / ( ν ε s ε ) ; µ q (1 /a ε ) ←− ξ q ( a ε ) /λ q ( a ε ) M q ( A − Σ g ) ←− ξ q ( A Σ g ) Λ − q ( A Σ g ) ; M q ( A − Σ h ) ←− ξ q ( A Σ h ) Λ − q ( A Σ h ) Λ q ( A Σ g ) ←− diag (cid:8) diagonal (cid:0) M q ( Σ − g ) (cid:1)(cid:9) + { ν Σ g diag ( s Σ g, , s Σ g, ) } − Λ q ( A Σ h ) ←− diag (cid:8) diagonal (cid:0) M q ( Σ − h ) (cid:1)(cid:9) + { ν Σ h diag ( s Σ h, , s Σ h, ) } − λ q ( 
a gbl ) ←− µ q (1 /σ gbl ) + 1 / ( ν gbl s gbl ) ; µ q (1 /a gbl ) ←− ξ q ( a gbl ) /λ q ( a gbl ) λ q ( a grp, g ) ←− µ q (1 /σ grp, g ) + 1 / ( ν grp, g s grp, g ) ; µ q (1 /a grp, g ) ←− ξ q ( a grp, g ) /λ q ( a grp, g ) λ q ( a grp, h ) ←− µ q (1 /σ grp, h ) + 1 / ( ν grp, h s grp, h ) ; µ q (1 /a grp, h ) ←− ξ q ( a grp, h ) /λ q ( a grp, h ) until the increase in p ( y ; q ) is negligible. continued on a subsequent page . . . lgorithm 4 continued. This is a continuation of the description of this algorithm that com-mences on a preceding page.
Outputs: μ_{q(β,u_gbl)}, Σ_{q(β,u_gbl)}, {μ_{q(u^g_lin,i, u^g_grp,i)}, Σ_{q(u^g_lin,i, u^g_grp,i)}, E_q{([β; u_gbl] − μ_{q(β,u_gbl)})([u^g_lin,i; u^g_grp,i] − μ_{q(u^g_lin,i,u^g_grp,i)})^T} : 1 ≤ i ≤ m}, {E_q{([β; u_gbl] − μ_{q(β,u_gbl)})([u^h_lin,ij; u^h_grp,ij] − μ_{q(u^h_lin,ij,u^h_grp,ij)})^T}, E_q{([u^g_lin,i; u^g_grp,i] − μ_{q(u^g_lin,i,u^g_grp,i)})([u^h_lin,ij; u^h_grp,ij] − μ_{q(u^h_lin,ij,u^h_grp,ij)})^T}, μ_{q(u^h_lin,ij, u^h_grp,ij)}, Σ_{q(u^h_lin,ij, u^h_grp,ij)} : 1 ≤ i ≤ m, 1 ≤ j ≤ n_i}, ξ_{q(σ²_ε)}, λ_{q(σ²_ε)}, ξ_{q(σ²_gbl)}, λ_{q(σ²_gbl)}, ξ_{q(Σ^g)}, Λ_{q(Σ^g)}, ξ_{q(Σ^h)}, Λ_{q(Σ^h)}, ξ_{q(σ²_grp,g)}, λ_{q(σ²_grp,g)}, ξ_{q(σ²_grp,h)}, λ_{q(σ²_grp,h)}.
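The Gaussian q-density output of Algorithm 4 can be converted into fitted curves with pointwise credible bands, such as those in Figure 3, using standard mean and variance formulas for linear combinations. The following Python sketch illustrates the idea for a single (i, j) panel. The basis constructors c_gbl, c_g and c_h, the argument names and the data structures are illustrative assumptions rather than the paper's actual design-matrix code; the covariance arguments correspond to the Σ_q(·) and E_q{(·)(·)^T} outputs listed above.

import numpy as np
from scipy.stats import norm

def pointwise_band(x_grid, c_gbl, c_g, c_h, mu_gbl, mu_g, mu_h,
                   S_gbl, S_g, S_h, S_gbl_g, S_gbl_h, S_g_h, level=0.99):
    """Fitted curve and pointwise credible band for one (i, j) group.
    c_gbl(x), c_g(x), c_h(x) return the global, level-2 and level-3 basis
    vectors at x; the S_* arguments are the q-density (cross-)covariances."""
    z = norm.ppf(0.5 + level / 2)
    fit, half = [], []
    for x in x_grid:
        cg, cgi, chij = c_gbl(x), c_g(x), c_h(x)
        mean = cg @ mu_gbl + cgi @ mu_g + chij @ mu_h
        var = (cg @ S_gbl @ cg + cgi @ S_g @ cgi + chij @ S_h @ chij
               + 2 * cg @ S_gbl_g @ cgi + 2 * cg @ S_gbl_h @ chij
               + 2 * cgi @ S_g_h @ chij)
        fit.append(mean)
        half.append(z * np.sqrt(var))
    fit, half = np.array(fit), np.array(half)
    return fit, fit - half, fit + half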
Figure 3:
Illustrative three-level curve-type data with approximate fitted group-specific curves and corresponding 99% credible sets based on mean field variational Bayes via Algorithm 4. The response variable is 10 log(backscatter) according to ultrasound technology. Level 1 corresponds to different ultrasound frequencies and matches the horizontal axes in each panel. Level 2 corresponds to different slices of a tumor due to differing probe locations. Level 3 corresponds to different tumors, with one tumor for each of 10 laboratory mice.

In this section we provide some assessment of the accuracy and speed of the inference delivered by streamlined variational inference for group-specific curve models.

Accuracy Assessment
Mean field restrictions such as (11) and (20) imply that there is some loss of accuracy in inference produced by Algorithms 2 and 4. However, at least for the Gaussian response case treated here, approximate parameter orthogonality between the coefficient parameters and the covariance parameters from likelihood theory implies that such restrictions are mild and mean field accuracy is high. Figure 4 corroborates this claim by assessing accuracy of the mean function estimates and 95% credible intervals at the median values of frequency for each panel in Figure 3. As a benchmark we used Markov chain Monte Carlo-based inference via the rstan package (Guo et al., 2018). After a warmup period we retained Markov chain Monte Carlo samples from the posterior distributions of the mean function at the median frequency and used kernel density estimation to approximate the corresponding posterior density functions. For a generic univariate parameter θ, the accuracy of an approximation q(θ) to p(θ|y) is defined to be

  accuracy ≡ {1 − ½ ∫_{−∞}^{∞} |q(θ) − p(θ|y)| dθ} × 100%.   (23)

The accuracy percentages shown in the top right-hand corner of each panel of Figure 4 correspond to (23) with replacement of p(θ|y) by the above-mentioned kernel density estimate. In this case accuracy is seen to be excellent, with accuracy percentages between 97% and 99% for all 40 curves.
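For readers wishing to reproduce this type of accuracy summary, the following Python sketch computes (23) for a single univariate quantity, replacing p(θ|y) by a kernel density estimate of MCMC draws. The Gaussian form used for the q-density and all function and argument names are illustrative assumptions, not the paper's code.

import numpy as np
from scipy.stats import gaussian_kde, norm

def accuracy_score(q_mean, q_sd, mcmc_draws, grid_size=2001):
    """Accuracy (23): 100 * {1 - 0.5 * integral |q(theta) - p_hat(theta)| dtheta},
    with p(theta|y) approximated by a kernel density estimate of MCMC draws and
    q(theta) taken to be the Gaussian mean field approximate posterior."""
    kde = gaussian_kde(mcmc_draws)
    lo = min(q_mean - 6 * q_sd, mcmc_draws.min())
    hi = max(q_mean + 6 * q_sd, mcmc_draws.max())
    grid = np.linspace(lo, hi, grid_size)
    q_dens = norm.pdf(grid, loc=q_mean, scale=q_sd)
    p_hat = kde(grid)
    l1_dist = np.sum(np.abs(q_dens - p_hat)) * (grid[1] - grid[0])
    return 100.0 * (1.0 - 0.5 * l1_dist)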
Figure 4:
Accuracy assessment of Algorithm 4. Each panel displays approximate posterior density functions corresponding to mean function estimates according to the three-level group-specific curve model (16). In each case the estimate is at the median frequency value. The orange density functions are based on Markov chain Monte Carlo and the blue density functions are based on mean field variational Bayes. The accuracy percentage scores are defined by (23).

Speed Assessment

We also conducted some simulation studies to assess the speed of streamlined variational inference for higher level group-specific curve models, in terms of both comparative advantage over naïve implementation and absolute performance. The focus of these studies was variational inference in the two-level case and, to probe maximal speed potential, Algorithm 2 was implemented in the low-level computer language
Fortran 77. An implementation of the naïve counterpart of Algorithm 2, involving storage and direct calculations concerning the full Σ_{q(β,u)} matrix, was also carried out. We then simulated data according to model (1) with σ_ε = 0.·, f(x) = 3√{x(1.· − x)} Φ(6x − ·) and g_i(x) = α₁ α₂ sin(2π x^{α₃}), where, for each i, α₁, α₂ and α₃ are, respectively, random draws from the N(·, ·) distribution and from the sets {−·, ·} and {·, ·, ·}. The within-group sample sizes n_i were generated randomly from the set {·, ·, ..., ·} and the number of groups m ranged over the set {100, 200, 300, 400, 500}. All x_ij data were generated from a Uniform distribution over the unit interval. Table 1 summarizes the timings based on 100 replications with the number of mean field variational Bayes iterations fixed at 50. The study was run on a MacBook Air laptop with a 2.2 gigahertz processor and 8 gigabytes of random access memory.

  m     naïve          streamlined      naïve/streamlined
  100   75 (1.21)      0.748 (0.0334)   100
  200   660 (7.72)     1.490 (0.0491)   442
  300   2210 (22.00)   2.260 (0.0567)   974
  400   5180 (92.20)   3.040 (0.0718)   1700
  500   NA             3.780 (0.0593)   NA

Table 1: Average (standard deviation) of elapsed computing times in seconds for fitting model (1) naïvely versus with streamlining via Algorithm 2. The NA entries indicate non-applicability due to the naïve computations not being feasible.
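A rough back-of-the-envelope illustration of the scaling behind Table 1: in the two-level model the full (β, u) vector has dimension 2 + K_gbl + m(2 + K_grp), so the naïve approach stores a covariance matrix whose size grows quadratically in m and performs dense factorizations whose cost grows cubically, whereas the streamlined updates only touch O(m) small blocks. The basis sizes in the Python sketch below are hypothetical values chosen for illustration only, and the cost count ignores constants and memory traffic.

# Hypothetical spline basis sizes (not the values used in the paper's study).
K_gbl, K_grp = 23, 12
base = None
for m in (100, 200, 300, 400, 500):
    d = 2 + K_gbl + m * (2 + K_grp)   # dimension of the full (beta, u) vector
    cost = d ** 3                      # order of one dense factorization of Sigma_q(beta, u)
    base = base or cost
    print(f"m = {m:3d}: dim = {d:5d}, dense-cost growth versus m = 100: {cost / base:6.1f}x")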
For m ranging from 100 to 400 we see that the naïve to streamlined ratios increase from about 100 to 1,700. When m = 500 the naïve implementation fails to run due to its excessive storage demands. In contrast, the streamlined fits are produced in about 4 seconds. It is clear that streamlined variational inference is to be preferred and is the only option for large numbers of groups.

We then obtained timings for the streamlined algorithm for m becoming much larger, taking on the values 100, 500 and larger values up to 12,500. The iterations in Algorithm 2 were stopped when the relative increase in the marginal log-likelihood fell below a small tolerance. The average and standard deviation times in seconds over 100 replications are shown in Table 2. We see that the computing times are approximately linear in m. Even with twelve and a half thousand groups, Algorithm 2 is able to deliver fitting and inference on a contemporary laptop computer in about one and a half minutes.

Table 2: Average (standard deviation) of elapsed computing times in seconds for fitting model (1) with streamlining via Algorithm 2.

Acknowledgments

This research was partially supported by Australian Research Council Discovery Project DP140100441. The ultrasound data was provided by the Bioacoustics Research Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Illinois, U.S.A.
References
Atay-Kayis, A. & Massam, H. (2005). A Monte Carlo method for computing marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 317–335.

Bates, D., Mächler, M., Bolker, B. and Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 1–48.

Bishop, C.M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

Brumback, B.A. and Rice, J.A. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves (with discussion). Journal of the American Statistical Association, 961–994.

Donnelly, C.A., Laird, N.M. and Ware, J.H. (1995). Prediction and creation of smooth curves for temporally correlated longitudinal data. Journal of the American Statistical Association, 984–989.

Durban, M., Harezlak, J., Wand, M.P. and Carroll, R.J. (2005). Simple fitting of subject-specific curves for longitudinal data. Statistics in Medicine, 1153–1167.

Goldsmith, J., Zipunnikov, V. and Schrack, J. (2015). Generalized multilevel function-on-scalar regression and principal component analysis. Biometrics, 344–353.

Guo, J., Gabry, J. and Goodrich, B. (2018). rstan: R interface to Stan. R package version 2.18.2. http://mc-stan.org.

Huang, A. and Wand, M.P. (2013). Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis, 439–452.

Lee, C.Y.Y. and Wand, M.P. (2016). Variational inference for fitting complex Bayesian mixed effects models to health data. Statistics in Medicine, 165–188.

Nolan, T.H., Menictas, M. and Wand, M.P. (2019). Streamlined computing for variational inference with higher level random effects. Unpublished manuscript submitted to the arXiv.org e-Print archive; on hold as of 11th March 2019. Soon to be posted also on http://matt-wand.utsacademics.info/statsPapers.html.

Nolan, T.H. and Wand, M.P. (2018). Solutions to sparse multilevel matrix problems. Unpublished manuscript available at https://arxiv.org/abs/1903.03089.

Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS. New York: Springer.

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D. and R Core Team. (2018). nlme: Linear and nonlinear mixed effects models. R package version 3.1. http://cran.r-project.org/package=nlme.

Pratt, J.H., Jones, J.J., Miller, J.Z., Wagner, M.A. and Fineberg, N.S. (1989). Racial differences in aldosterone excretion and plasma aldosterone concentrations in children. New England Journal of Medicine, 1152–1157.

Robinson, G.K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science, 15–51.

Trail, J.B., Collins, L.M., Rivera, D.E., Li, R., Piper, M.E. and Baker, T.B. (2014). Functional data analysis for dynamical system identification of behavioral processes. Psychological Methods, 175–187.

Verbyla, A.P., Cullis, B.R., Kenward, M.G. and Welham, S.J. (1999). The analysis of designed experiments and longitudinal data by using smoothing splines (with discussion). Applied Statistics, 269–312.

Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: Society for Industrial and Applied Mathematics.

Wand, M.P. and Ormerod, J.T. (2008). On semiparametric regression with O'Sullivan penalized splines. Australian and New Zealand Journal of Statistics, 179–198.

Wand, M.P. and Ormerod, J.T. (2011). Penalized wavelets: embedding wavelets into semiparametric regression. Electronic Journal of Statistics, 1654–1717.

Wang, Y. (1998). Mixed effects smoothing spline analysis of variance. Journal of the Royal Statistical Society, Series B, 159–174.

Wirtzfeld, L.A., Ghoshal, G., Rosado-Mendez, I.M., Nam, K., Park, Y., Pawlicki, A.D., Miller, R.J., Simpson, D.G., Zagzebski, J.A., Oelze, M.I., Hall, T.J. and O'Brien, W.D. (2015). Quantitative ultrasound comparison of MAT and 4T1 mammary tumors in mice and rats across multiple imaging systems. Journal of Ultrasound Medicine, 1373–1383.

Zhang, D., Lin, X., Raz, J. and Sowers, M. (1998). Semi-parametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association, 710–719.

Web-Supplement for: Streamlined Variational Inference for Higher Level Group-Specific Curve Models

By M. Menictas, T.H. Nolan, D.G. Simpson and M.P. Wand

University of Technology Sydney and University of Illinois

S.1 Derivation of Result 1
Straightforward algebra can be used to verify that

  C^T R_BLUP^{-1} C + D_BLUP = B^T B  and  C^T R_BLUP^{-1} y = B^T b,

where B and b have sparse forms (9) with non-zero sub-blocks equal to

  b_i ≡ [ σ_ε^{-1} y_i ; 0 ; 0 ; 0 ],
  B_i ≡ [ σ_ε^{-1} X_i   σ_ε^{-1} Z_gbl,i ;  O   m^{-1/2} σ_gbl^{-1} I_{K_gbl} ;  O   O ;  O   O ]  and
  •B_i ≡ [ σ_ε^{-1} X_i   σ_ε^{-1} Z_grp,i ;  O   O ;  Σ^{-1/2}   O ;  O   σ_grp^{-1} I_{K_grp} ].

Therefore, in view of (6) and (7),

  [β̂ ; û] = (B^T B)^{-1} B^T b  and  Cov([β̂ ; û − u]) = (B^T B)^{-1}.

S.2 Derivation of Algorithm 1
Algorithm 1 is simply a proceduralization of Result 1.
S.3 The Inverse G-Wishart and Inverse χ² Distributions
The Inverse G-Wishart corresponds to the matrix inverses of random matrices that have a
G-Wishart distribution (e.g. Atay-Kayis & Massam, 2005). For any positive integer d, let G be an undirected graph with d nodes labeled 1, ..., d and set E consisting of sets of pairs of nodes that are connected by an edge. We say that the symmetric d × d matrix M respects G if M_ij = 0 for all {i, j} ∉ E. A d × d random matrix X has an Inverse G-Wishart distribution with graph G and parameters ξ > 0 and symmetric d × d matrix Λ, written X ~ Inverse-G-Wishart(G, ξ, Λ), if and only if the density function of X satisfies

  p(X) ∝ |X|^{−(ξ+2)/2} exp{−½ tr(Λ X^{−1})}

over arguments X such that X is symmetric and positive definite and X^{−1} respects G. Two important special cases are

  G = G_full ≡ totally connected d-node graph  and  G = G_diag ≡ totally disconnected d-node graph,

the latter being the case for which the Inverse G-Wishart distribution coincides with a product of independent Inverse Chi-Squared random variables. The subscripts of G_full and G_diag reflect the fact that X^{−1} is a full matrix and a diagonal matrix in each respective special case. The G = G_full case corresponds to the ordinary Inverse Wishart distribution. However, with message passing in mind, we will work with the more general Inverse G-Wishart family throughout this article. In the d = 1 special case the graph G = G_full = G_diag and the Inverse G-Wishart distribution reduces to the Inverse Chi-Squared distribution. We write x ~ Inverse-χ²(ξ, λ) for this Inverse-G-Wishart(G_diag, ξ, λ) special case with d = 1 and λ > 0 scalar.
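As a quick numerical sanity check of the d = 1 case, the density above matches an Inverse-Gamma distribution with shape ξ/2 and scale λ/2, and the reciprocal moment E(1/x) = ξ/λ is exactly the quantity used repeatedly in the q-density updates below. The Python sketch that follows, with arbitrary test values of ξ and λ, verifies this with scipy; it is an illustration only and not part of the paper's software.

import numpy as np
from scipy.stats import invgamma

xi, lam = 7.0, 3.2                                  # arbitrary test values
dist = invgamma(a=xi / 2, scale=lam / 2)            # Inverse-chi-squared(xi, lambda)
draws = dist.rvs(size=1_000_000, random_state=1)
print(np.mean(1.0 / draws), xi / lam)               # the two numbers should agree closely

S.4 Derivation of Result 2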
It is straightforward to verify that the μ_{q(β,u)} and Σ_{q(β,u)} updates, given at (12), may be written as

  μ_{q(β,u)} ⟵ (B^T B)^{-1} B^T b  and  Σ_{q(β,u)} ⟵ (B^T B)^{-1},

where B and b have the forms (9) with

  b_i ≡ [ μ^{1/2}_{q(1/σ²_ε)} y_i ; m^{-1/2} Σ_β^{-1/2} μ_β ; 0 ],
  B_i ≡ [ μ^{1/2}_{q(1/σ²_ε)} X_i   μ^{1/2}_{q(1/σ²_ε)} Z_gbl,i ;  m^{-1/2} Σ_β^{-1/2}   O ;  O   m^{-1/2} μ^{1/2}_{q(1/σ²_gbl)} I_{K_gbl} ;  O   O ;  O   O ]  and
  •B_i ≡ [ μ^{1/2}_{q(1/σ²_ε)} X_i   μ^{1/2}_{q(1/σ²_ε)} Z_grp,i ;  O   O ;  O   O ;  M^{1/2}_{q(Σ^{-1})}   O ;  O   μ^{1/2}_{q(1/σ²_grp)} I_{K_grp} ].

Result 2 immediately follows from Theorem 2 of Nolan & Wand (2018).
S.5 Derivation of Algorithm 2
We provide expressions for the q-densities for mean field variational Bayesian inference for the parameters in (10), with product density restriction (11). Arguments analogous to those given in, for example, Appendix C of Wand & Ormerod (2011) lead to the following.

q(β, u) is a N(μ_{q(β,u)}, Σ_{q(β,u)}) density function where

  Σ_{q(β,u)} = (C^T R_MFVB^{-1} C + D_MFVB)^{-1}  and  μ_{q(β,u)} = Σ_{q(β,u)}(C^T R_MFVB^{-1} y + o_MFVB)

with R_MFVB, D_MFVB and o_MFVB defined via (13).

q(σ²_ε) is an Inverse-χ²(ξ_{q(σ²_ε)}, λ_{q(σ²_ε)}) density function where ξ_{q(σ²_ε)} = ν_ε + Σ_{i=1}^m n_i and

  λ_{q(σ²_ε)} = μ_{q(1/a_ε)} + Σ_{i=1}^m E_q ‖y_i − C_gbl,i [β; u_gbl] − C_grp,i [u_lin,i; u_grp,i]‖²
  = μ_{q(1/a_ε)} + Σ_{i=1}^m [ ‖E_q(y_i − C_gbl,i [β; u_gbl] − C_grp,i [u_lin,i; u_grp,i])‖² + tr{Cov_q(C_gbl,i [β; u_gbl] + C_grp,i [u_lin,i; u_grp,i])} ]
  = μ_{q(1/a_ε)} + Σ_{i=1}^m [ ‖E_q(y_i − C_gbl,i [β; u_gbl] − C_grp,i [u_lin,i; u_grp,i])‖² + tr(C_gbl,i^T C_gbl,i Σ_{q(β,u_gbl)}) + tr(C_grp,i^T C_grp,i Σ_{q(u_lin,i,u_grp,i)}) + 2 tr(C_grp,i^T C_gbl,i E_q{([β; u_gbl] − μ_{q(β,u_gbl)})([u_lin,i; u_grp,i] − μ_{q(u_lin,i,u_grp,i)})^T}) ],

where C_gbl,i ≡ [X_i  Z_gbl,i] and C_grp,i ≡ [X_i  Z_grp,i], with reciprocal moment μ_{q(1/σ²_ε)} = ξ_{q(σ²_ε)}/λ_{q(σ²_ε)}.

q(σ²_gbl) is an Inverse-χ²(ξ_{q(σ²_gbl)}, λ_{q(σ²_gbl)}) density function where ξ_{q(σ²_gbl)} = ν_gbl + K_gbl and λ_{q(σ²_gbl)} = μ_{q(1/a_gbl)} + ‖μ_{q(u_gbl)}‖² + tr(Σ_{q(u_gbl)}), with reciprocal moment μ_{q(1/σ²_gbl)} = ξ_{q(σ²_gbl)}/λ_{q(σ²_gbl)}.

q(σ²_grp) is an Inverse-χ²(ξ_{q(σ²_grp)}, λ_{q(σ²_grp)}) density function where ξ_{q(σ²_grp)} = ν_grp + m K_grp and λ_{q(σ²_grp)} = μ_{q(1/a_grp)} + Σ_{i=1}^m {‖μ_{q(u_grp,i)}‖² + tr(Σ_{q(u_grp,i)})}, with reciprocal moment μ_{q(1/σ²_grp)} = ξ_{q(σ²_grp)}/λ_{q(σ²_grp)}.

q(Σ) is an Inverse-G-Wishart(G_full, ξ_{q(Σ)}, Λ_{q(Σ)}) density function where ξ_{q(Σ)} = ν_Σ + 2 + m and

  Λ_{q(Σ)} = M_{q(A_Σ^{-1})} + Σ_{i=1}^m (μ_{q(u_lin,i)} μ_{q(u_lin,i)}^T + Σ_{q(u_lin,i)}),

with inverse moment M_{q(Σ^{-1})} = (ξ_{q(Σ)} − 1) Λ_{q(Σ)}^{-1}.

q(a_ε) is an Inverse-χ²(ξ_{q(a_ε)}, λ_{q(a_ε)}) density function where ξ_{q(a_ε)} = ν_ε + 1 and λ_{q(a_ε)} = μ_{q(1/σ²_ε)} + 1/(ν_ε s²_ε), with reciprocal moment μ_{q(1/a_ε)} = ξ_{q(a_ε)}/λ_{q(a_ε)}.

q(a_gbl) is an Inverse-χ²(ξ_{q(a_gbl)}, λ_{q(a_gbl)}) density function where ξ_{q(a_gbl)} = ν_gbl + 1 and λ_{q(a_gbl)} = μ_{q(1/σ²_gbl)} + 1/(ν_gbl s²_gbl), with reciprocal moment μ_{q(1/a_gbl)} = ξ_{q(a_gbl)}/λ_{q(a_gbl)}.

q(a_grp) is an Inverse-χ²(ξ_{q(a_grp)}, λ_{q(a_grp)}) density function where ξ_{q(a_grp)} = ν_grp + 1 and λ_{q(a_grp)} = μ_{q(1/σ²_grp)} + 1/(ν_grp s²_grp), with reciprocal moment μ_{q(1/a_grp)} = ξ_{q(a_grp)}/λ_{q(a_grp)}, and
q(A_Σ) is an Inverse-G-Wishart(G_diag, ξ_{q(A_Σ)}, Λ_{q(A_Σ)}) density function where ξ_{q(A_Σ)} = ν_Σ + 2 and Λ_{q(A_Σ)} = diag{diagonal(M_{q(Σ^{-1})})} + Λ_{A_Σ}, with inverse moment M_{q(A_Σ^{-1})} = ξ_{q(A_Σ)} Λ_{q(A_Σ)}^{-1}.

S.6 Marginal Log-Likelihood Lower Bound and Derivation

The expression for the lower bound on the marginal log-likelihood for Algorithm 2 is

log p(y; q) = −½ log(2π) Σ_{i=1}^m n_i − ½ log|Σ_β|
  − ½ tr(Σ_β^{-1}{(μ_{q(β)} − μ_β)(μ_{q(β)} − μ_β)^T + Σ_{q(β)}})
  − ½ tr(M_{q(Σ^{-1})}{Σ_{i=1}^m (μ_{q(u_lin,i)} μ_{q(u_lin,i)}^T + Σ_{q(u_lin,i)})})
  + ½{K_gbl + m(2 + K_grp)}
  − ½ μ_{q(1/σ²_gbl)}{‖μ_{q(u_gbl)}‖² + tr(Σ_{q(u_gbl)})}
  − ½ μ_{q(1/σ²_grp)} Σ_{i=1}^m {‖μ_{q(u_grp,i)}‖² + tr(Σ_{q(u_grp,i)})}
  + ½ log|Σ_{q(β,u)}|
  + {ν_Σ + m + 1 + ½(ν_ε + ν_gbl + K_gbl + ν_grp + m K_grp)} log(2)
  − log Γ(½ν_ε) − ½ μ_{q(1/a_ε)} μ_{q(1/σ²_ε)} − ½ ξ_{q(σ²_ε)} log(½λ_{q(σ²_ε)}) + log Γ(½ξ_{q(σ²_ε)}) + ½ λ_{q(σ²_ε)} μ_{q(1/σ²_ε)}
  − ½ log(2 ν_ε s²_ε) − log Γ(½) − {1/(2 ν_ε s²_ε)} μ_{q(1/a_ε)} − ½ ξ_{q(a_ε)} log(½λ_{q(a_ε)}) + log Γ(½ξ_{q(a_ε)}) + ½ λ_{q(a_ε)} μ_{q(1/a_ε)}
  − log Γ(½ν_gbl) − ½ μ_{q(1/a_gbl)} μ_{q(1/σ²_gbl)} − ½ ξ_{q(σ²_gbl)} log(½λ_{q(σ²_gbl)}) + log Γ(½ξ_{q(σ²_gbl)}) − ½ log(2 ν_gbl s²_gbl)
  + ½ λ_{q(σ²_gbl)} μ_{q(1/σ²_gbl)} − {1/(2 ν_gbl s²_gbl)} μ_{q(1/a_gbl)} − ½ ξ_{q(a_gbl)} log(½λ_{q(a_gbl)}) − ½ μ_{q(1/a_grp)} μ_{q(1/σ²_grp)}
  + log Γ(½ξ_{q(a_gbl)}) + ½ λ_{q(a_gbl)} μ_{q(1/a_gbl)} − log Γ(½ν_grp) + log Γ(½ξ_{q(σ²_grp)}) − ½ log(2 ν_grp s²_grp)
  − ½ ξ_{q(σ²_grp)} log(½λ_{q(σ²_grp)}) + ½ λ_{q(σ²_grp)} μ_{q(1/σ²_grp)} − {1/(2 ν_grp s²_grp)} μ_{q(1/a_grp)} − ½ ξ_{q(a_grp)} log(½λ_{q(a_grp)})
  + log Γ(½ξ_{q(a_grp)}) + ½ λ_{q(a_grp)} μ_{q(1/a_grp)} − ½ tr(M_{q(A_Σ^{-1})} M_{q(Σ^{-1})}) + ½ tr(Λ_{q(Σ)} M_{q(Σ^{-1})})
  + Σ_{j=1}^2 log Γ(½(ξ_{q(A_Σ)} + 2 − j)) − Σ_{j=1}^2 log Γ(½(ν_Σ + 4 − j)) − ½(ξ_{q(Σ)} − 1) log|Λ_{q(Σ)}| + Σ_{j=1}^2 log Γ(½(ξ_{q(Σ)} + 2 − j))
  − ½ Σ_{j=1}^2 {1/(ν_Σ s²_{Σ,j})}(M_{q(A_Σ^{-1})})_{jj} − Σ_{j=1}^2 log Γ(½(3 − j)) − ½(ξ_{q(A_Σ)} − 1) log|Λ_{q(A_Σ)}| + ½ tr(Λ_{q(A_Σ)} M_{q(A_Σ^{-1})})
  − ½ μ_{q(1/σ²_ε)} Σ_{i=1}^m [ ‖E_q(y_i − C_gbl,i [β; u_gbl] − C_grp,i [u_lin,i; u_grp,i])‖² + tr(C_gbl,i^T C_gbl,i Σ_{q(β,u_gbl)}) + tr(C_grp,i^T C_grp,i Σ_{q(u_lin,i,u_grp,i)}) + 2 tr(C_grp,i^T C_gbl,i E_q{([β; u_gbl] − μ_{q(β,u_gbl)})([u_lin,i; u_grp,i] − μ_{q(u_lin,i,u_grp,i)})^T}) ].   (S.1)

Derivation:
The lower bound on the marginal log-likelihood is achieved through the following expression:

log p(y; q) = E_q{log p(y, β, u, σ²_ε, a_ε, σ²_gbl, a_gbl, σ²_grp, a_grp, Σ, A_Σ) − log q*(β, u, σ²_ε, a_ε, σ²_gbl, a_gbl, σ²_grp, a_grp, Σ, A_Σ)}
  = E_q{log p(y|β, u, σ²_ε)} + E_q{log p(β, u|σ²_gbl, σ²_grp, Σ)} − E_q{log q*(β, u)}
  + E_q{log p(σ²_ε|a_ε)} − E_q{log q*(σ²_ε)} + E_q{log p(a_ε)} − E_q{log q*(a_ε)}
  + E_q{log p(σ²_gbl|a_gbl)} − E_q{log q*(σ²_gbl)} + E_q{log p(a_gbl)} − E_q{log q*(a_gbl)}
  + E_q{log p(σ²_grp|a_grp)} − E_q{log q*(σ²_grp)} + E_q{log p(a_grp)} − E_q{log q*(a_grp)}
  + E_q{log p(Σ|A_Σ)} − E_q{log q*(Σ)} + E_q{log p(A_Σ)} − E_q{log q*(A_Σ)}.   (S.2)

First we note that

  log p(y|β, u, σ²_ε) = −½ log(2π) Σ_{i=1}^m n_i − ½ log(σ²_ε) Σ_{i=1}^m n_i − {1/(2σ²_ε)} ‖y − Xβ − Zu‖²,

where

  ‖y − Xβ − Zu‖² = ‖ stack_{1≤i≤m}(y_i) − stack_{1≤i≤m}(X_i) β − stack_{1≤i≤m}(Z_gbl,i) u_gbl − blockdiag_{1≤i≤m}([X_i  Z_grp,i]) stack_{1≤i≤m}([u_lin,i; u_grp,i]) ‖²
  = Σ_{i=1}^m ‖y_i − X_i β − Z_gbl,i u_gbl − X_i u_lin,i − Z_grp,i u_grp,i‖²
  = Σ_{i=1}^m ‖y_i − C_gbl,i [β; u_gbl] − C_grp,i [u_lin,i; u_grp,i]‖²,

and C_gbl,i ≡ [X_i  Z_gbl,i], C_grp,i ≡ [X_i  Z_grp,i]. Therefore

E_q{log p(y|β, u, σ²_ε)} = −½ log(2π) Σ_{i=1}^m n_i − ½ E_q{log(σ²_ε)} Σ_{i=1}^m n_i
  − ½ μ_{q(1/σ²_ε)} Σ_{i=1}^m [ ‖E_q(y_i − C_gbl,i [β; u_gbl] − C_grp,i [u_lin,i; u_grp,i])‖² + tr(C_gbl,i^T C_gbl,i Σ_{q(β,u_gbl)}) + tr(C_grp,i^T C_grp,i Σ_{q(u_lin,i,u_grp,i)}) + 2 tr(C_grp,i^T C_gbl,i E_q{([β; u_gbl] − μ_{q(β,u_gbl)})([u_lin,i; u_grp,i] − μ_{q(u_lin,i,u_grp,i)})^T}) ].

The remainder of the expectations in (S.2) are expressed as:

E_q{log p(β, u|σ²_gbl, σ²_grp, Σ)} = −½{K_gbl + m(2 + K_grp)} log(2π) − ½ log|Σ_β| − ½ K_gbl E_q{log(σ²_gbl)} − ½ m E_q{log|Σ|} − ½ m K_grp E_q{log(σ²_grp)}
  − ½ tr(Σ_β^{-1}{(μ_{q(β)} − μ_β)(μ_{q(β)} − μ_β)^T + Σ_{q(β)}}) − ½ μ_{q(1/σ²_gbl)}{‖μ_{q(u_gbl)}‖² + tr(Σ_{q(u_gbl)})}
  − ½ tr(M_{q(Σ^{-1})}{Σ_{i=1}^m (μ_{q(u_lin,i)} μ_{q(u_lin,i)}^T + Σ_{q(u_lin,i)})}) − ½ μ_{q(1/σ²_grp)} Σ_{i=1}^m {‖μ_{q(u_grp,i)}‖² + tr(Σ_{q(u_grp,i)})},

E_q{log q*(β, u)} = −½{K_gbl + m(2 + K_grp)} − ½{K_gbl + m(2 + K_grp)} log(2π) − ½ log|Σ_{q(β,u)}|,

E_q{log p(σ²_ε|a_ε)} = −½ ν_ε E_q{log(2 a_ε)} − log Γ(½ν_ε) − (½ν_ε + 1) E_q{log(σ²_ε)} − ½ μ_{q(1/a_ε)} μ_{q(1/σ²_ε)},

E_q{log q*(σ²_ε)} = ½ ξ_{q(σ²_ε)} log(½λ_{q(σ²_ε)}) − log Γ(½ξ_{q(σ²_ε)}) − (½ξ_{q(σ²_ε)} + 1) E_q{log(σ²_ε)} − ½ λ_{q(σ²_ε)} μ_{q(1/σ²_ε)},

E_q{log p(a_ε)} = −½ log(2 ν_ε s²_ε) − log Γ(½) − (½ + 1) E_q{log(a_ε)} − {1/(2 ν_ε s²_ε)} μ_{q(1/a_ε)},

E_q{log q*(a_ε)} = ½ ξ_{q(a_ε)} log(½λ_{q(a_ε)}) − log Γ(½ξ_{q(a_ε)}) − (½ξ_{q(a_ε)} + 1) E_q{log(a_ε)} − ½ λ_{q(a_ε)} μ_{q(1/a_ε)},

with E_q{log p(σ²_gbl|a_gbl)}, E_q{log q*(σ²_gbl)}, E_q{log p(a_gbl)}, E_q{log q*(a_gbl)} and E_q{log p(σ²_grp|a_grp)}, E_q{log q*(σ²_grp)}, E_q{log p(a_grp)}, E_q{log q*(a_grp)} given by the analogous expressions in which (σ²_ε, a_ε, ν_ε, s_ε) is replaced by (σ²_gbl, a_gbl, ν_gbl, s_gbl) and (σ²_grp, a_grp, ν_grp, s_grp) respectively, and

E_q[log{p(Σ|A_Σ)}] = −½(ν_Σ + 1) E_q{log|A_Σ|} − ½(ν_Σ + 4) E_q{log|Σ|} − log(π) − ½ tr(M_{q(A_Σ^{-1})} M_{q(Σ^{-1})}) − (ν_Σ + 3) log(2) − Σ_{j=1}^2 log Γ(½(ν_Σ + 4 − j)),

E_q[log{q(Σ)}] = ½(ξ_{q(Σ)} − 1) log|Λ_{q(Σ)}| − ½(ξ_{q(Σ)} + 2) E_q{log|Σ|} − ½ tr(Λ_{q(Σ)} M_{q(Σ^{-1})}) − (ξ_{q(Σ)} + 1) log(2) − log(π) − Σ_{j=1}^2 log Γ(½(ξ_{q(Σ)} + 2 − j)),

E_q[log{p(A_Σ)}] = −(½ + 1) E_q{log|A_Σ|} − ½ Σ_{j=1}^2 {1/(ν_Σ s²_{Σ,j})}(M_{q(A_Σ^{-1})})_{jj} − log(π) − Σ_{j=1}^2 log Γ(½(3 − j)),

E_q[log{q(A_Σ)}] = ½(ξ_{q(A_Σ)} − 1) log|Λ_{q(A_Σ)}| − ½(ξ_{q(A_Σ)} + 2) E_q{log|A_Σ|} − ½ tr(Λ_{q(A_Σ)} M_{q(A_Σ^{-1})}) − (ξ_{q(A_Σ)} + 1) log(2) − log(π) − Σ_{j=1}^2 log Γ(½(ξ_{q(A_Σ)} + 2 − j)).
In the summation of each of these log p(y; q) terms, note that the coefficient of E_q{log(σ²_ε)} is

  −½ Σ_{i=1}^m n_i − (½ν_ε + 1) + (½ξ_{q(σ²_ε)} + 1) = −½ Σ_{i=1}^m n_i − ½ν_ε + ½(ν_ε + Σ_{i=1}^m n_i) = 0.

The coefficient of E_q{log(σ²_gbl)} is −½K_gbl − (½ν_gbl + 1) + (½ξ_{q(σ²_gbl)} + 1) = −½K_gbl − ½ν_gbl + ½(ν_gbl + K_gbl) = 0.

The coefficient of E_q{log(σ²_grp)} is −½mK_grp − (½ν_grp + 1) + (½ξ_{q(σ²_grp)} + 1) = −½mK_grp − ½ν_grp + ½(ν_grp + mK_grp) = 0.

The coefficient of E_q{log|Σ|} is −½m − ½(ν_Σ + 4) + ½(ξ_{q(Σ)} + 2) = −½(m + ν_Σ + 4) + ½(m + ν_Σ + 4) = 0.

The coefficient of E_q{log(a_ε)} is −½ν_ε − (½ + 1) + (½ξ_{q(a_ε)} + 1) = −½ν_ε − ½ + ½(ν_ε + 1) = 0, and the coefficients of E_q{log(a_gbl)} and E_q{log(a_grp)} vanish analogously.

The coefficient of E_q{log|A_Σ|} is −½(ν_Σ + 1) − (½ + 1) + ½(ξ_{q(A_Σ)} + 2) = −½(ν_Σ + 4) + ½(ν_Σ + 4) = 0.

Therefore these terms can be dropped, and the cancellations led by the above expectations lead to the lower bound expression in (S.1).
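The first of these cancellations can be checked mechanically. The short sympy sketch below, with N standing for Σ_{i=1}^m n_i, confirms that the coefficient of E_q{log(σ²_ε)} vanishes once ξ_{q(σ²_ε)} = ν_ε + Σ_{i=1}^m n_i is substituted; sympy is used here purely as a convenience for the arithmetic.

import sympy as sp

nu_eps, N = sp.symbols("nu_eps N", positive=True)   # N stands for sum_i n_i
xi_q = nu_eps + N                                    # xi_{q(sigma^2_eps)}
coefficient = -N / 2 - (nu_eps / 2 + 1) + (xi_q / 2 + 1)
print(sp.simplify(coefficient))                      # prints 0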
S.7 Derivation of Result 3

If B and b have the same forms given by equation (7) in Nolan & Wand (2018) with

  b_ij ≡ [ σ_ε^{-1} y_ij ; 0 ],
  B_ij ≡ [ σ_ε^{-1} X_ij   σ_ε^{-1} Z_gbl,ij ;  O   (Σ_{i'=1}^m n_{i'})^{-1/2} σ_gbl^{-1} I_{K_gbl} ;  O   O ;  O   O ;  O   O ;  O   O ],
  •B_ij ≡ [ σ_ε^{-1} X_ij   σ_ε^{-1} Z^g_grp,ij ;  O   O ;  n_i^{-1/2} Σ_g^{-1/2}   O ;  O   n_i^{-1/2} σ_grp,g^{-1} I_{K^g_grp} ;  O   O ;  O   O ]  and
  ••B_ij ≡ [ σ_ε^{-1} X_ij   σ_ε^{-1} Z^h_grp,ij ;  O   O ;  O   O ;  O   O ;  Σ_h^{-1/2}   O ;  O   σ_grp,h^{-1} I_{K^h_grp} ],

then B^T B = C^T R_BLUP^{-1} C + D_BLUP and B^T b = C^T R_BLUP^{-1} y, where

  C ≡ [X  Z],  D_BLUP ≡ [ O   O ;  O   G^{-1} ]  and  R_BLUP ≡ σ²_ε I,   (S.3)

and G is as defined in (18). The remainder of the derivation of Result 3 is analogous to that of Result 1.

S.8 Derivation of Algorithm 3
Algorithm 3 is simply a proceduralization of Result 3.
S.9 Derivation of Result 4
It is straightforward to verify that the μ_{q(β,u)} and Σ_{q(β,u)} updates, given at (12) but with D_MFVB as given in (21), may be written as

  μ_{q(β,u)} ⟵ (B^T B)^{-1} B^T b  and  Σ_{q(β,u)} ⟵ (B^T B)^{-1},

where B and b have the forms given by equation (7) in Nolan & Wand (2018) with

  b_ij ≡ [ μ^{1/2}_{q(1/σ²_ε)} y_ij ; (Σ_{i'=1}^m n_{i'})^{-1/2} Σ_β^{-1/2} μ_β ; 0 ],
  B_ij ≡ [ μ^{1/2}_{q(1/σ²_ε)} X_ij   μ^{1/2}_{q(1/σ²_ε)} Z_gbl,ij ;  (Σ_{i'=1}^m n_{i'})^{-1/2} Σ_β^{-1/2}   O ;  O   (Σ_{i'=1}^m n_{i'})^{-1/2} μ^{1/2}_{q(1/σ²_gbl)} I_{K_gbl} ;  O   O ;  O   O ;  O   O ;  O   O ],
  •B_ij ≡ [ μ^{1/2}_{q(1/σ²_ε)} X_ij   μ^{1/2}_{q(1/σ²_ε)} Z^g_grp,ij ;  O   O ;  O   O ;  n_i^{-1/2} M^{1/2}_{q((Σ^g)^{-1})}   O ;  O   n_i^{-1/2} μ^{1/2}_{q(1/σ²_grp,g)} I_{K^g_grp} ;  O   O ;  O   O ]  and
  ••B_ij ≡ [ μ^{1/2}_{q(1/σ²_ε)} X_ij   μ^{1/2}_{q(1/σ²_ε)} Z^h_grp,ij ;  O   O ;  O   O ;  O   O ;  O   O ;  M^{1/2}_{q((Σ^h)^{-1})}   O ;  O   μ^{1/2}_{q(1/σ²_grp,h)} I_{K^h_grp} ].

Result 4 immediately follows from Theorem 4 of Nolan & Wand (2018).
S.10 Derivation of Algorithm 4
We provide expressions for the q-densities for mean field variational Bayesian inference for the parameters in (19) with product density restriction (20).

q(β, u) is a N(μ_{q(β,u)}, Σ_{q(β,u)}) density function where

  Σ_{q(β,u)} = (C^T R_MFVB^{-1} C + D_MFVB)^{-1}  and  μ_{q(β,u)} = Σ_{q(β,u)}(C^T R_MFVB^{-1} y + o_MFVB)

with R_MFVB ≡ μ^{-1}_{q(1/σ²_ε)} I, o_MFVB ≡ [Σ_β^{-1} μ_β ; 0] and D_MFVB as given in (21).

q(σ²_ε) is an Inverse-χ²(ξ_{q(σ²_ε)}, λ_{q(σ²_ε)}) density function where ξ_{q(σ²_ε)} = ν_ε + Σ_{i=1}^m Σ_{j=1}^{n_i} o_ij and

  λ_{q(σ²_ε)} = μ_{q(1/a_ε)} + Σ_{i=1}^m Σ_{j=1}^{n_i} E_q ‖y_ij − C_gbl,ij [β; u_gbl] − C^g_grp,ij [u^g_lin,i; u^g_grp,i] − C^h_grp,ij [u^h_lin,ij; u^h_grp,ij]‖²
  = μ_{q(1/a_ε)} + Σ_{i=1}^m Σ_{j=1}^{n_i} [ ‖E_q(y_ij − C_gbl,ij [β; u_gbl] − C^g_grp,ij [u^g_lin,i; u^g_grp,i] − C^h_grp,ij [u^h_lin,ij; u^h_grp,ij])‖² + tr{Cov_q(C_gbl,ij [β; u_gbl] + C^g_grp,ij [u^g_lin,i; u^g_grp,i] + C^h_grp,ij [u^h_lin,ij; u^h_grp,ij])} ]
  = μ_{q(1/a_ε)} + Σ_{i=1}^m Σ_{j=1}^{n_i} [ ‖E_q(y_ij − C_gbl,ij [β; u_gbl] − C^g_grp,ij [u^g_lin,i; u^g_grp,i] − C^h_grp,ij [u^h_lin,ij; u^h_grp,ij])‖²
   + tr(C_gbl,ij^T C_gbl,ij Σ_{q(β,u_gbl)}) + tr((C^g_grp,ij)^T C^g_grp,ij Σ_{q(u^g_lin,i,u^g_grp,i)}) + tr((C^h_grp,ij)^T C^h_grp,ij Σ_{q(u^h_lin,ij,u^h_grp,ij)})
   + 2 tr((C^g_grp,ij)^T C_gbl,ij E_q{([β; u_gbl] − μ_{q(β,u_gbl)})([u^g_lin,i; u^g_grp,i] − μ_{q(u^g_lin,i,u^g_grp,i)})^T})
   + 2 tr((C^h_grp,ij)^T C_gbl,ij E_q{([β; u_gbl] − μ_{q(β,u_gbl)})([u^h_lin,ij; u^h_grp,ij] − μ_{q(u^h_lin,ij,u^h_grp,ij)})^T})
   + 2 tr((C^h_grp,ij)^T C^g_grp,ij E_q{([u^g_lin,i; u^g_grp,i] − μ_{q(u^g_lin,i,u^g_grp,i)})([u^h_lin,ij; u^h_grp,ij] − μ_{q(u^h_lin,ij,u^h_grp,ij)})^T}) ],

where C_gbl,ij ≡ [X_ij  Z_gbl,ij], C^g_grp,ij ≡ [X_ij  Z^g_grp,ij] and C^h_grp,ij ≡ [X_ij  Z^h_grp,ij], with reciprocal moment μ_{q(1/σ²_ε)} = ξ_{q(σ²_ε)}/λ_{q(σ²_ε)}.

q(σ²_gbl) is an Inverse-χ²(ξ_{q(σ²_gbl)}, λ_{q(σ²_gbl)}) density function where ξ_{q(σ²_gbl)} = ν_gbl + K_gbl and λ_{q(σ²_gbl)} = μ_{q(1/a_gbl)} + ‖μ_{q(u_gbl)}‖² + tr(Σ_{q(u_gbl)}), with reciprocal moment μ_{q(1/σ²_gbl)} = ξ_{q(σ²_gbl)}/λ_{q(σ²_gbl)}.

q(σ²_grp,g) is an Inverse-χ²(ξ_{q(σ²_grp,g)}, λ_{q(σ²_grp,g)}) density function where ξ_{q(σ²_grp,g)} = ν_{grp,g} + m K^g_grp and λ_{q(σ²_grp,g)} = μ_{q(1/a_grp,g)} + Σ_{i=1}^m {‖μ_{q(u^g_grp,i)}‖² + tr(Σ_{q(u^g_grp,i)})}, with reciprocal moment μ_{q(1/σ²_grp,g)} = ξ_{q(σ²_grp,g)}/λ_{q(σ²_grp,g)}.

q(Σ^g) is an Inverse-G-Wishart(G_full, ξ_{q(Σ^g)}, Λ_{q(Σ^g)}) density function where ξ_{q(Σ^g)} = ν_{Σg} + 2 + m and Λ_{q(Σ^g)} = M_{q(A_{Σg}^{-1})} + Σ_{i=1}^m (μ_{q(u^g_lin,i)} μ_{q(u^g_lin,i)}^T + Σ_{q(u^g_lin,i)}), with inverse moment M_{q((Σ^g)^{-1})} = (ξ_{q(Σ^g)} − 1) Λ_{q(Σ^g)}^{-1}.

q(σ²_grp,h) is an Inverse-χ²(ξ_{q(σ²_grp,h)}, λ_{q(σ²_grp,h)}) density function where ξ_{q(σ²_grp,h)} = ν_{grp,h} + K^h_grp Σ_{i=1}^m n_i and λ_{q(σ²_grp,h)} = μ_{q(1/a_grp,h)} + Σ_{i=1}^m Σ_{j=1}^{n_i} {‖μ_{q(u^h_grp,ij)}‖² + tr(Σ_{q(u^h_grp,ij)})}, with reciprocal moment μ_{q(1/σ²_grp,h)} = ξ_{q(σ²_grp,h)}/λ_{q(σ²_grp,h)}.

q(Σ^h) is an Inverse-G-Wishart(G_full, ξ_{q(Σ^h)}, Λ_{q(Σ^h)}) density function where ξ_{q(Σ^h)} = ν_{Σh} + 2 + Σ_{i=1}^m n_i and Λ_{q(Σ^h)} = M_{q(A_{Σh}^{-1})} + Σ_{i=1}^m Σ_{j=1}^{n_i} (μ_{q(u^h_lin,ij)} μ_{q(u^h_lin,ij)}^T + Σ_{q(u^h_lin,ij)}), with inverse moment M_{q((Σ^h)^{-1})} = (ξ_{q(Σ^h)} − 1) Λ_{q(Σ^h)}^{-1}.

q(a_ε) is an Inverse-χ²(ξ_{q(a_ε)}, λ_{q(a_ε)}) density function where ξ_{q(a_ε)} = ν_ε + 1 and λ_{q(a_ε)} = μ_{q(1/σ²_ε)} + 1/(ν_ε s²_ε), with reciprocal moment μ_{q(1/a_ε)} = ξ_{q(a_ε)}/λ_{q(a_ε)}.

q(a_gbl) is an Inverse-χ²(ξ_{q(a_gbl)}, λ_{q(a_gbl)}) density function where ξ_{q(a_gbl)} = ν_gbl + 1 and λ_{q(a_gbl)} = μ_{q(1/σ²_gbl)} + 1/(ν_gbl s²_gbl), with reciprocal moment μ_{q(1/a_gbl)} = ξ_{q(a_gbl)}/λ_{q(a_gbl)}.

q(a_grp,g) is an Inverse-χ²(ξ_{q(a_grp,g)}, λ_{q(a_grp,g)}) density function where ξ_{q(a_grp,g)} = ν_{grp,g} + 1 and λ_{q(a_grp,g)} = μ_{q(1/σ²_grp,g)} + 1/(ν_{grp,g} s²_{grp,g}), with reciprocal moment μ_{q(1/a_grp,g)} = ξ_{q(a_grp,g)}/λ_{q(a_grp,g)}, and q(A_{Σg}) is an Inverse-G-Wishart(G_diag, ξ_{q(A_{Σg})}, Λ_{q(A_{Σg})}) density function where ξ_{q(A_{Σg})} = ν_{Σg} + 2 and Λ_{q(A_{Σg})} = diag{diagonal(M_{q((Σ^g)^{-1})})} + Λ_{A_{Σg}}, with inverse moment M_{q(A_{Σg}^{-1})} = ξ_{q(A_{Σg})} Λ_{q(A_{Σg})}^{-1}.

q(a_grp,h) is an Inverse-χ²(ξ_{q(a_grp,h)}, λ_{q(a_grp,h)}) density function where ξ_{q(a_grp,h)} = ν_{grp,h} + 1 and λ_{q(a_grp,h)} = μ_{q(1/σ²_grp,h)} + 1/(ν_{grp,h} s²_{grp,h}), with reciprocal moment μ_{q(1/a_grp,h)} = ξ_{q(a_grp,h)}/λ_{q(a_grp,h)}, and q(A_{Σh}) is an Inverse-G-Wishart(G_diag, ξ_{q(A_{Σh})}, Λ_{q(A_{Σh})}) density function where ξ_{q(A_{Σh})} = ν_{Σh} + 2 and Λ_{q(A_{Σh})} = diag{diagonal(M_{q((Σ^h)^{-1})})} + Λ_{A_{Σh}}, with inverse moment M_{q(A_{Σh}^{-1})} = ξ_{q(A_{Σh})} Λ_{q(A_{Σh})}^{-1}.
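To make the bookkeeping in these updates concrete, the following Python sketch implements just the paired q(σ²_grp,h) and q(a_grp,h) updates from the expressions above. The function name, argument names and data structures are assumptions made for the sketch; the inputs μ_{q(u^h_grp,ij)} and Σ_{q(u^h_grp,ij)} would come from the sparse least-squares step of Algorithm 4.

import numpy as np

def update_sigsq_grp_h(mu_q_u, Sigma_q_u, mu_q_recip_a, nu, s, n_i, K):
    """One coordinate update for sigma^2_{grp,h} and its auxiliary a_{grp,h}.
    mu_q_u[i][j] and Sigma_q_u[i][j] hold the q-density mean and covariance of
    u^h_{grp,ij}; nu, s are the hyperparameters; n_i is the list of level-2
    group sizes; K is K^h_grp."""
    xi_sigsq = nu + K * sum(n_i)                         # xi_{q(sigma^2_{grp,h})}
    lam_sigsq = mu_q_recip_a + sum(
        float(mu_q_u[i][j] @ mu_q_u[i][j]) + np.trace(Sigma_q_u[i][j])
        for i in range(len(n_i)) for j in range(n_i[i])
    )
    mu_q_recip_sigsq = xi_sigsq / lam_sigsq              # reciprocal moment xi/lambda
    xi_a = nu + 1                                        # auxiliary variable update
    lam_a = mu_q_recip_sigsq + 1.0 / (nu * s ** 2)
    mu_q_recip_a_new = xi_a / lam_a
    return mu_q_recip_sigsq, mu_q_recip_a_new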
S.11 The SolveTwoLevelSparseLeastSquares Algorithm
The SolveTwoLevelSparseLeastSquares algorithm is listed in Nolan et al. (2018) and based on Theorem 2 of Nolan & Wand (2018). Given its centrality to Algorithms 1 and 2 we list it again here. The algorithm solves a sparse version of the least squares problem

  min_x ‖b − Bx‖²,

which has solution x = A^{-1} B^T b where A = B^T B, and where B and b have the following structure:

  B ≡ [ B_1  •B_1  O   ...  O ;
        B_2  O    •B_2  ...  O ;
        ...
        B_m  O    O    ...  •B_m ]   and   b = [ b_1 ; b_2 ; ... ; b_m ].   (S.4)

The sub-matrices corresponding to the non-zero blocks of A^{-1} are labelled according to

  A^{-1} = [ A^{11}        A^{12,1}  A^{12,2}  ...  A^{12,m} ;
             (A^{12,1})^T  A^{22,1}  ×         ...  × ;
             (A^{12,2})^T  ×         A^{22,2}  ...  × ;
             ...
             (A^{12,m})^T  ×         ×         ...  A^{22,m} ],   (S.5)

with × denoting sub-blocks that are not of interest. The SolveTwoLevelSparseLeastSquares algorithm is given in Algorithm S.1.

Algorithm S.1 SolveTwoLevelSparseLeastSquares for solving the two-level sparse matrix least squares problem: minimise ‖b − Bx‖² in x and sub-blocks of A^{-1} corresponding to the non-zero sub-blocks of A = B^T B. The sub-block notation is given by (S.4) and (S.5).

Inputs: {(b_i (ñ_i × 1), B_i (ñ_i × p), •B_i (ñ_i × q)) : 1 ≤ i ≤ m}

ω ⟵ NULL ; Ω ⟵ NULL
For i = 1, ..., m:
    Decompose •B_i = Q_i [R_i ; 0] such that Q_i^{-1} = Q_i^T and R_i is upper-triangular.
    c_i ⟵ Q_i^T b_i ;  C_i ⟵ Q_i^T B_i
    c_{1i} ⟵ first q rows of c_i ;  c_{2i} ⟵ remaining rows of c_i ;  ω ⟵ [ω ; c_{2i}]
    C_{1i} ⟵ first q rows of C_i ;  C_{2i} ⟵ remaining rows of C_i ;  Ω ⟵ [Ω ; C_{2i}]
Decompose Ω = Q [R ; 0] such that Q^{-1} = Q^T and R is upper-triangular.
c_1 ⟵ first p rows of Q^T ω ;  x_1 ⟵ R^{-1} c_1 ;  A^{11} ⟵ R^{-1} R^{-T}
For i = 1, ..., m:
    x_{2,i} ⟵ R_i^{-1}(c_{1i} − C_{1i} x_1) ;  A^{12,i} ⟵ −A^{11}(R_i^{-1} C_{1i})^T
    A^{22,i} ⟵ R_i^{-1}(R_i^{-T} − C_{1i} A^{12,i})

Output: (x_1, A^{11}, {(x_{2,i}, A^{12,i}, A^{22,i}) : 1 ≤ i ≤ m}).
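The following NumPy sketch mirrors the steps of Algorithm S.1 on randomly generated blocks and then checks x_1, A^{11} and one x_{2,i} against a brute-force dense solution. Block dimensions and variable names are arbitrary illustrative choices, not the curve-model quantities, and the code is a sketch under those assumptions rather than the paper's implementation.

import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
m, p, q = 5, 3, 2
n_tilde = q + p + 4                                   # rows per group block
b_blk  = [rng.standard_normal((n_tilde, 1)) for _ in range(m)]
B_blk  = [rng.standard_normal((n_tilde, p)) for _ in range(m)]
Bd_blk = [rng.standard_normal((n_tilde, q)) for _ in range(m)]   # the "bullet" B_i blocks

omega, Omega, kept = [], [], []
for b_i, B_i, Bd_i in zip(b_blk, B_blk, Bd_blk):
    Q_i, R_i_full = np.linalg.qr(Bd_i, mode="complete")
    R_i = R_i_full[:q, :]
    c_i, C_i = Q_i.T @ b_i, Q_i.T @ B_i
    kept.append((R_i, c_i[:q], C_i[:q]))              # first q rows: used for x_{2,i}
    omega.append(c_i[q:])
    Omega.append(C_i[q:])                             # remaining rows: the x_1 problem

omega, Omega = np.vstack(omega), np.vstack(Omega)
Q, R_full = np.linalg.qr(Omega, mode="complete")
R = R_full[:p, :]
x1 = np.linalg.solve(R, (Q.T @ omega)[:p])
A11 = np.linalg.solve(R, np.linalg.solve(R, np.eye(p)).T)        # R^{-1} R^{-T}
x2 = [np.linalg.solve(R_i, c1 - C1 @ x1) for (R_i, c1, C1) in kept]

# Brute-force check: assemble the dense B and b of (S.4) and compare.
B_dense = np.hstack([np.vstack(B_blk), block_diag(*Bd_blk)])
b_dense = np.vstack(b_blk)
x_dense = np.linalg.lstsq(B_dense, b_dense, rcond=None)[0]
A_dense = np.linalg.inv(B_dense.T @ B_dense)
assert np.allclose(x_dense[:p], x1)
assert np.allclose(x_dense[p:p + q], x2[0])
assert np.allclose(A_dense[:p, :p], A11)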
S.12 The SolveThreeLevelSparseLeastSquares Algorithm
The SolveThreeLevelSparseLeastSquares algorithm, listed in Nolan et al. (2018), is a proceduralization of Theorem 4 of Nolan & Wand (2018). Since it is central to Algorithms 3 and 4 we list it here. The SolveThreeLevelSparseLeastSquares algorithm is concerned with solving the sparse three-level version of

  min_x ‖b − Bx‖²,

with solution x = A^{-1} B^T b where A = B^T B, and where B and b have the following structure:

  B ≡ [ stack_{1≤i≤m}{stack_{1≤j≤n_i}(B_ij)} | blockdiag_{1≤i≤m}{[stack_{1≤j≤n_i}(•B_ij) | blockdiag_{1≤j≤n_i}(••B_ij)]} ]   (S.6)

and

  b ≡ stack_{1≤i≤m}{stack_{1≤j≤n_i}(b_ij)}.   (S.7)

The three-level sparse matrix inverse problem involves determination of the sub-blocks of A^{-1} corresponding to the non-zero sub-blocks of A. Our notation for these sub-blocks, illustrated in (S.8) for the m = 2, n_1 = 2 and n_2 = 3 case, is as follows: A^{11} is the leading block, A^{12,i} and A^{22,i} are the blocks linking and covering the level-2 group i, A^{12,ij} and A^{22,ij} are the corresponding blocks for the level-3 group (i, j), A^{12,i,j} links level-2 group i with its level-3 group (i, j), and the remaining sub-blocks, denoted ×, are not of interest.   (S.8)

The SolveThreeLevelSparseLeastSquares algorithm is given in Algorithm S.2.

Algorithm S.2 SolveThreeLevelSparseLeastSquares for solving the three-level sparse matrix least squares problem: minimise ‖b − Bx‖² in x and sub-blocks of A^{-1} corresponding to the non-zero sub-blocks of A = B^T B. The sub-block notation is given by (S.6), (S.7) and (S.8).

Inputs: {(b_ij (õ_ij × 1), B_ij (õ_ij × p), •B_ij (õ_ij × q₁), ••B_ij (õ_ij × q₂)) : 1 ≤ i ≤ m, 1 ≤ j ≤ n_i}

ω ⟵ NULL ; Ω ⟵ NULL
For i = 1, ..., m:
    ω_i ⟵ NULL ; Ω_i ⟵ NULL ; •Ω_i ⟵ NULL
    For j = 1, ..., n_i:
        Decompose ••B_ij = Q_ij [R_ij ; 0] such that Q_ij^{-1} = Q_ij^T and R_ij is upper-triangular.
        d_ij ⟵ Q_ij^T b_ij ;  D_ij ⟵ Q_ij^T B_ij ;  •D_ij ⟵ Q_ij^T •B_ij
        d_{1ij} ⟵ first q₂ rows of d_ij ;  d_{2ij} ⟵ remaining rows of d_ij ;  ω_i ⟵ [ω_i ; d_{2ij}]
        D_{1ij} ⟵ first q₂ rows of D_ij ;  D_{2ij} ⟵ remaining rows of D_ij ;  Ω_i ⟵ [Ω_i ; D_{2ij}]
        •D_{1ij} ⟵ first q₂ rows of •D_ij ;  •D_{2ij} ⟵ remaining rows of •D_ij ;  •Ω_i ⟵ [•Ω_i ; •D_{2ij}]
    Decompose •Ω_i = Q_i [R_i ; 0] such that Q_i^{-1} = Q_i^T and R_i is upper-triangular.
    c_i ⟵ Q_i^T ω_i ;  C_i ⟵ Q_i^T Ω_i
    c_{1i} ⟵ first q₁ rows of c_i ;  c_{2i} ⟵ remaining rows of c_i ;  ω ⟵ [ω ; c_{2i}]
    C_{1i} ⟵ first q₁ rows of C_i ;  C_{2i} ⟵ remaining rows of C_i ;  Ω ⟵ [Ω ; C_{2i}]
Decompose Ω = Q [R ; 0] so that Q^{-1} = Q^T and R is upper-triangular.
c_1 ⟵ first p rows of Q^T ω ;  x_1 ⟵ R^{-1} c_1 ;  A^{11} ⟵ R^{-1} R^{-T}
For i = 1, ..., m:
    x_{2,i} ⟵ R_i^{-1}(c_{1i} − C_{1i} x_1) ;  A^{12,i} ⟵ −A^{11}(R_i^{-1} C_{1i})^T
    A^{22,i} ⟵ R_i^{-1}(R_i^{-T} − C_{1i} A^{12,i})
    For j = 1, ..., n_i:
        x_{2,ij} ⟵ R_ij^{-1}(d_{1ij} − D_{1ij} x_1 − •D_{1ij} x_{2,i})
        A^{12,ij} ⟵ −{R_ij^{-1}(D_{1ij} A^{11} + •D_{1ij} (A^{12,i})^T)}^T
        A^{12,i,j} ⟵ −{R_ij^{-1}(D_{1ij} A^{12,i} + •D_{1ij} A^{22,i})}^T
        A^{22,ij} ⟵ R_ij^{-1}(R_ij^{-T} − D_{1ij} A^{12,ij} − •D_{1ij} A^{12,i,j})

Output: (x_1, A^{11}, {(x_{2,i}, A^{12,i}, A^{22,i}) : 1 ≤ i ≤ m}, {(x_{2,ij}, A^{12,ij}, A^{22,ij}, A^{12,i,j}) : 1 ≤ i ≤ m, 1 ≤ j ≤ n_i}).
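When developing or testing a streamlined implementation of Algorithm S.2, it can be convenient to have a brute-force reference that assembles the dense B and b of (S.6) and (S.7) directly. The short NumPy sketch below does exactly that for small arbitrary block dimensions; all sizes and names are chosen purely for illustration.

import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
m, n_i, p, q1, q2 = 2, [2, 3], 3, 2, 2
o_tilde = p + q1 + q2 + 2                      # rows per (i, j) block

B_ij  = [[rng.standard_normal((o_tilde, p))  for _ in range(n_i[i])] for i in range(m)]
Bg_ij = [[rng.standard_normal((o_tilde, q1)) for _ in range(n_i[i])] for i in range(m)]
Bh_ij = [[rng.standard_normal((o_tilde, q2)) for _ in range(n_i[i])] for i in range(m)]
b_ij  = [[rng.standard_normal((o_tilde, 1))  for _ in range(n_i[i])] for i in range(m)]

# blockdiag over i of [ stack_j(bullet B_ij) | blockdiag_j(double-bullet B_ij) ]
level2 = [np.hstack([np.vstack(Bg_ij[i]), block_diag(*Bh_ij[i])]) for i in range(m)]
B = np.hstack([np.vstack([np.vstack(B_ij[i]) for i in range(m)]), block_diag(*level2)])
b = np.vstack([np.vstack(b_ij[i]) for i in range(m)])

x = np.linalg.lstsq(B, b, rcond=None)[0]       # x_1 = x[:p], then the (i) and (i, j) blocks
A = np.linalg.inv(B.T @ B)                     # holds the A^{11}, A^{12,i}, A^{22,ij}, ... sub-blocks
print(B.shape, x.shape, A.shape)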