# Incompletely observed nonparametric factorial designs with repeated measurements: A wild bootstrap approach

IIncompletely observed nonparametric factorial designs with repeatedmeasurements: A wild bootstrap approach

Lubna Amro ∗ , Frank Konietschke , and Markus Pauly ARTICLE HISTORY

Compiled February 8, 2021

ABSTRACT

In many life science experiments or medical studies, subjects are repeatedly ob-served and measurements are collected in factorial designs with multivariate data.The analysis of such multivariate data is typically based on multivariate analysisof variance (MANOVA) or mixed models, requiring complete data, and certain as-sumption on the underlying parametric distribution such as continuity or a speciﬁccovariance structure, e.g., compound symmetry. However, these methods are usuallynot applicable when discrete data or even ordered categorical data are present. Insuch cases, nonparametric rank-based methods that do not require stringent distri-butional assumptions are the preferred choice. However, in the multivariate case,most rank-based approaches have only been developed for complete observations.It is the aim of this work is to develop asymptotic correct procedures that are ca-pable of handling missing values, allowing for singular covariance matrices and areapplicable for ordinal or ordered categorical data. This is achieved by applying awild bootstrap procedure in combination with quadratic form-type test statistics.Beyond proving their asymptotic correctness, extensive simulation studies validatetheir applicability for small samples. Finally, two real data examples are analyzed.

KEYWORDS

Missing Values; Wild bootstrap; Rank tests; Repeated measures; Orderedcategorical data; Nonparametric hypotheses.

1. Introduction

Factorial designs have a long history in several scientiﬁc ﬁelds such as biology, ecologyor medicine [Traux and Carkhuﬀ, 1965, Nesnow et al., 1998, Wildsmith et al., 2001,Biederman et al., 2008, Lekberg et al., 2018]. The reasons are simple: They are aneﬃcient way to study main and interaction eﬀects between diﬀerent factors. Typically,such designs are inferred by parametric mean-based procedures such as linear mixedmodels, or multivariate analysis of variance (MANOVA). These procedures, however, Mathematical Statistics and Applications in Industry, Faculty of Statistics, Technical University of Dortmund,Germany Institute of Biometry and Clinical Epidemiology, Charit´e—Universit¨atsmedizin Berlin, Berlin, Germany Berlin Institute of Health (BIH), Berlin, Germany

Corresponding author :Lubna Amro, Mathematical Statistics and Applications in Industry, Faculty of Statistics, Technical Universityof Dortmund, Germany.Email: [email protected] a r X i v : . [ s t a t . M E ] F e b ely on restrictive distributional assumptions, such as multivariate normality, or spe-cial dependencies [Lawley, 1938, Bartlett, 1939, Davis, 2002, Johnson and Wichern,2007, Fitzmaurice et al., 2012].However, assessing multivariate normality or a speciﬁc type of covariance matrix isdiﬃcult in practice [Micceri, 1989, Qian and Huang, 2005, Bourlier et al., 2008, Xuand Cui, 2008], especially when the sample sizes are small. In particular, Erceg-Hurnand Mirosevich [2008] pointed out that ”Researchers relying on statistical tests (e.g.,Levene’s test) to identify assumption violations may frequently fail to detect deviationsfrom normality and homoscedasticity that are large enough to seriously aﬀect the type-Ierror rate and power of classic parametric tests ”(p. 600). In fact, classical MANOVAprocedures are usually non robust to deviations and may result in inaccurate decisionscaused by possibly conservative or inﬂated type-I error rates [Arnau et al., 2012,Pesarin and Salmaso, 2012, Brombin et al., 2013, Konietschke et al., 2015, Paulyet al., 2015b, Friedrich et al., 2017a, Friedrich and Pauly, 2018]. Moreover, if ordinal,or ordered categorical data are present, mean-based approaches are not applicable. Atempting alternative are nonparametric rank-based methods, which are applicable fornon-normal data, in particular discrete data or even ordered categorical data. Theirkey features are their robustness and their invariance under monotone transformationsof the data. Consequently, Akritas and Arnold [1994], Akritas and Brunner [1997],Munzel and Brunner [2000], Brunner and Puri [2001], Akritas [2011], Brunner et al.[2017], and Friedrich et al. [2017b] proposed nonparametric ranking methods for allkinds of factorial designs. Thereby, hypotheses are no longer formulated in terms ofmeans (which do not exist for ordinal data), distribution functions are used instead.However, all these methods are only applicable for completely observed factorialdesigns data and cannot be used to analyze multivariate data with missing values.In contrast, there are only a few methods which are applicable in case of missingvalues, require no parametric assumptions and also lead to valid inferences in caseof arbitrary covariance structures, skewed distributions or unbalanced experimentaldesigns. Examples in the nonparametric framework for the matched pairs designsinclude the ranking methods proposed by Akritas et al. [2002], Konietschke et al.[2012], Fong et al. [2018]. A promising approach for factorial designs with repeatedmeasurements is given by the proposals of Brunner et al. [1999] and Domhof et al.[2002] who recommended two types of quadratic forms for testing nonparametrichypotheses in terms of distribution functions: rank based Wald and ANOVA-typestatistics. The Wald-type test is an asymptotically valid test which usually needslarge sample sizes to obtain accurate test decisions, see Brunner [2001], Friedrich et al.[2017b] for the case of complete observations and our simulation study below. Apartfrom that, the ANOVA-type test is based upon an approximation of its distributionwith scaled χ -distributions. As the latter does not coincide with the ANOVA-typetest limiting distribution under H , the test is in general not asymptotically correctand also exhibits a more or less conservative behaviour under small sample sizes, seeBrunner [2001] for the case of complete observations. Another applicable techniqueis the nonparametric imputation method proposed by Gao [2007], which possesses agood type-I error control but a lower power behavior than the methods by Brunneret al. [1999] and Domhof et al. [2002], see the simulation study of Gao [2007].Thus, the aims of the present paper are (i) to provide statistical testsfor hypotheses formulated in distribution functions that are capable of treatingmissing values in factorial designs, (ii) work without parametric assumptions such2s continuity of the distribution functions, or nonsingular covariance matrices, (iii)are asymptotically correct while (iv) showing a satisfactorily type-I error control andgood power properties. To accomplish this, we propose three diﬀerent quadratic-formtype test statistics and equip them with a nonparametric Wild bootstrap procedurefor calculating critical values to enhance their small sample performance.The paper is organized as follows: First we introduce the statistical model and thehypotheses of interest in the next section. In Section 3, we introduce the test statisticsand analyze their asymptotic behaviour. In Section 4, the proposed wild bootstraptechnique is explained. Section 5 displays the results from our extensive simulationstudy and two real data examples from a ﬂuvoxamine study and a skin disorderclinical trial are analyzed in Section 6. All proofs as well as additional simulationresults can be found in the supplement.To facilitate the presentation, we introduce the following notations: Let I d denotethe d-dimensional unit matrix, J d denote the d × d matrix of ’s i.e. J d = d Td , where d = (1 , ..., Td × denotes the d-dimensional column vector and P d = I d − d J d is theso-called d-dimensional centering matrix. Finally, by A ⊗ B we denote the Kroneckerproduct of the matrices A and B .

2. Statistical model and nonparametric hypotheses

We consider a nonparametric repeated measures model with a independent and poten-tially unbalanced treatment groups and d diﬀerent time points given by independentrandom vectors X ik = ( X i k , ..., X idk ) T , i = 1 , ..., a, k = 1 , ..., n i , (2.1)where X ijk ∼ F ij ( x ) = [ F + ij ( x ) + F − ij ( x )] , i = 1 , ..., a, j = 1 , ..., d, k = 1 , ..., n i . Here, F + ij ( x ) = P ( X ijk ≤ x ) is the right continuous version and F − ij ( x ) = P ( X ijk < x ) isthe left continuous version of the distribution function. Using the normalized version F ij ( x ) is useful for handling ties and including continuous as well as discontinuousdistribution functions [Brunner et al., 2018]. To include the case of missing values, wefollow the notation of Brunner et al. [1999] and let λ ijk = (cid:40) , if X ijk is observed0 , if X ijk is non-observed i = 1 , ..., a, j = 1 , ..., d, k = 1 , ..., n i . (2.2)Moreover, let n = (cid:80) ai =1 n i denote the total number of subjects and let N = (cid:80) ai =1 (cid:80) dj =1 (cid:80) n i k =1 λ ijk denote the total number of observations.To formulate the null hypothesis in this nonparametric setup, let F = ( F , ..., F ad ) T denote the vector of the distribution functions F ij , i = 1 , ..., a, j = 1 , ..., d. and C denote a contrast matrix , i.e., C = where = (1 , ..., T and = (0 , ..., T .Then, the null hypotheses are formulated by H : { CF = } . This framework coversdiﬀerent factorial repeated measures designs. For example, the hypothesis of notreatment group eﬀect, i.e. H G : { ¯ F . = ... = ¯ F a. } , is equivalently written in matrixnotation as H G : { ( P a ⊗ d Td ) F = } . Similarly, the hypothesis of no time eﬀect,i.e. H T : { ¯ F . = ... = ¯ F .d } is equivalently written as H T : { ( a Ta ⊗ P d ) F = } ,3nd the hypothesis of no interaction eﬀect between treatment and time is written as H GT : { ( P a ⊗ P d ) F = } .To entail a parameter for describing diﬀerences between distributions, Brunner et al.[1999] and Domhof et al. [2002] considered the relative marginal eﬀects p ij = (cid:90) H ( x ) dF ij ( x ) , (2.3)where H ( x ) = N − (cid:80) ai =1 (cid:80) dj =1 (cid:80) n i k =1 λ ijk F ij ( x ) is the weighted average of all distri-bution functions in the experiment. Estimators thereof are given by plugging-in theempirical versions of F ij ( x ) and H ( x )ˆ F ij ( x ) = 1 λ ij. n i (cid:88) k =1 λ ijk ˆ F ijk ( x ) (2.4)= 1 λ ij. n i (cid:88) k =1 λ ijk c ( x − X ijk ) , λ ij. = n i (cid:88) k =1 λ ijk , (2.5)ˆ H ( x ) = 1 N a (cid:88) i =1 d (cid:88) j =1 n i (cid:88) k =1 c ( x − X ijk ) , (2.6)where, c ( u ) is the normalized version of the counting function, i.e. c ( u ) = 0 , / u < , u = 0 or u > c ( x − X ijk ) is assumed to equal 0 if theobservation X ijk is missing. Note that ˆ F ij ( x ) = 0 in case of λ ij. = 0. Thus, the relativemarginal eﬀects p ij are estimated byˆ p ij = (cid:90) ˆ Hd ˆ F ij = 1 λ ij. n i (cid:88) k =1 λ ijk ˆ H ( X ijk ) (2.7)= 1 λ ij. n i (cid:88) k =1 λ ijk N ( R ijk −

12 ) , (2.8)where R ijk is the mid-rank of X ijk among all N dependent and independentobservations. For convenience, the relative marginal eﬀect estimators are collected inthe vector ˆ p = (ˆ p , ..., ˆ p ad ) T .In the sequel, we derive asymptotic theory under the following sample size assump-tion and missing values: Assumption 1. λ ij. n → κ i ∈ (0 , i = 1 , ..., a, j = 1 , ..., d , as n ≡ min { λ ij. } →∞ . This ensures that there are ’enough’ subjects without missing values in each group.

Asymptotic distribution.

Brunner et al. [1999] derived the asymptotic distribu-tion of √ n C ˆ p under the null hypothesis H : CF = . Setting ˆ Y ijk = ˆ H ( X ijk ) and Y ijk = H ( X ijk ), they proved that under H , √ n C ˆ p and √ n C ¯ Y . are asymptoticallyequivalent. Accordingly, they showed that √ n C ˆ p follows under H , asymptotically, a4ultivariate normal distribution with expectation and covariance matrix CV n C T .Here V n = d (cid:77) i =1 nn i V i = d (cid:77) i =1 nn i n i (cid:88) k =1 Λ ik V ik Λ ik , (2.9)where Λ ik = n i diag { λ i k λ i . , ..., λ idk λ id. } , and V ik = Cov ( Y ik ) denotes the covariancematrix of the random vectors Y ik = ( H ( X i k ) , ..., H ( X idk )) T .Brunner et al. [1999] propose to estimate the covariance matrix V n by ˆ V n = r (cid:77) i =1 nn i ˆ V i , (2.10)where ˆ V i = [ˆ v i ( j, j (cid:48) )] with diagonal and oﬀ-diagonal elements ˆ v i ( j, j ) and ˆ v i ( j, j (cid:48) ),respectively, deﬁned asˆ v i ( j, j ) = n i (cid:80) n i k =1 λ ijk [ R ijk − ¯ R ij. ] ( N ) λ ij. ( λ ij. − , (2.11)ˆ v i ( j, j (cid:48) ) = n i (cid:80) n i k =1 λ ijk λ ij (cid:48) k [( R ijk − ¯ R ij. )( R ij (cid:48) k − ¯ R ij (cid:48) . )]( N )(( λ ij. − λ ij (cid:48) . −

1) + ∆ i,jj (cid:48) − . (2.12)Here, ¯ R ij. = λ ij. (cid:80) n i k =1 λ ijk ( R ijk ), and ∆ i,jj (cid:48) = (cid:80) n i k =1 λ ijk λ ij (cid:48) k . Under Assumption1, Brunner et al. [1999] have shown that ˆ V i is a consistent estimator of V i .

3. Statistics and Asymptotics

In this section, we propose three diﬀerent quadratic forms for testing the null hy-pothesis: a Wald-type statistic (WTS), an ANOVA-type statistic (ATS) as suggestedin Brunner et al. [1999] and Domhof et al. [2002] as well as a modiﬁed ANOVA-typestatistic (MATS). For mean-based analyses, the latter was proposed by Friedrich andPauly [2018].The rank version of the WTS is deﬁned as T W = n ˆ p T C T [ C ˆ V n C T ] + C ˆ p , (3.1)where [ B ] + denotes the Moore-penrose inverse of a matrix B . Its asymptotic nulldistribution is summarized below. Theorem 3.1.

Under Assumption (1) and V n → V > as n → ∞ , the statistic T W has under the null hypothesis H : CF = , asymptotically, as n → ∞ , a central χ f -distribution with f = rank ( C ) degrees of freedom. χ -distribution isusually slow and large sample sizes are required to obtain adequate results [Vallejoet al., 2010, Konietschke et al., 2015, Pauly et al., 2015a, Smaga, 2017]. Thus, Brunneret al. [1999] and Domhof et al. [2002] proposed an alternative quadratic form bydeleting the variance ˆ V n involved in the computation of the WTS, resulting in theANOVA-type statistic (ATS) deﬁned as T A = 1 tr ( T ˆ V n ) n ˆ p T T ˆ p , (3.2)where T = C T [ CC T ] + C is a projection matrix. Note that H : T F = ⇔ CF = because C T [ CC T ] + is a generalized inverse of C .It is also worth to note, that diﬀerent to T W , the asymptotic distribution of T A canalso be derived if V is singular. Theorem 3.2.

Under Assumption (1) and under the null hypothesis H : CF = ,the test statistic T A has asymptotically, as n → ∞ , the same distribution as therandom variable A = a (cid:88) i =1 d (cid:88) j =1 ζ ij B ij /tr ( T V n ) , (3.3)where B ij i.i.d ∼ χ and the weights ζ ij are the eigenvalues of T V n .The limiting distribution is approximated by a scaled gχ f distribution. Brunneret al. [1999] proposed a Box [1954]-type approximation such that the ﬁrst two momentsof T A and gχ f approximately coincide where f can be estimated by ˆ f = ( tr ( T ˆ V n )) tr ( T ˆ V n T ˆ V n ) .Thus, the distribution of T A is approximated by a central F ( ˆ f , ∞ )-distribution. Thecorresponding ANOVA-type test φ A = { T A > F α ( ˆ f , ∞ ) } , where F α ( ˆ f , ∞ ) denotesthe (1 − α )-quantile of the F ( ˆ f , ∞ )-distribution.Despite its advantage of being applicable in case of singular covariance matrices, theATS has the drawback of being an approximative test and thus, even its asymptoticexactness cannot be guaranteed.Another possible test statistic is the modiﬁed version of the ANOVA type-statistic(MATS) that was developed by Friedrich and Pauly [2018] for mean-based MANOVAmodels. Here, we transfer it to the ranked-based set-up, where it is given by T M = n ˆ p T C T [ C ˆ D n C T ] + C ˆ p , (3.4)where ˆ D n = diag ( ˆ V n ) ii . In this way it is a compromise between the WTS and theATS as it only uses the diagonal entries of ˆ V n for the multivariate studentization. Infact, the nonsingularity assumption of V is not needed to derive its asymptotics. It is6eplaced by the weaker requirement that D = diag ( V ) ii >

0, which is fulﬁlled whenall it’s diagonal elements v i ( j ) are positive. This is the same as V ar ( H ( X ij )) > i ∈ { , ..., a } , j ∈ { , ..., d } . It is supposed to be met in a wide range of applicablesettings, except only cases where, for example, at least one time point of any of thetreatment groups is a discrete variable with very few distinct values. Theorem 3.3.

Under Assumption (1) and assuming that v i ( j, j ) > for all i ∈{ , ..., a } , j ∈ { , ..., d } , the test statistic T M , has under the null hypothesis H : CF = , asymptotically, as n → ∞ , the same distribution as the random variable M = a (cid:88) i =1 d (cid:88) j =1 ˜ ζ ij ˜ B ij , where ˜ B ij i.i.d ∼ χ and the weights ˜ ζ ij are the eigenvalues of V / n C T [ CDC T ] + C V / n and D = diag ( V ) ii . Since, the limit distributions of both, the ATS and the MATS depend on un-known weights ζ i and ˜ ζ i , we cannot directly calculate critical values. In addition,the χ f − approximation to T W is rather slow. To this end, we develop asymptoticallycorrect testing procedures based on bootstrap versions of T W , T A , and T M in the sub-sequent section.

4. Wild bootstrap approach

We consider a wild bootstrap approach to derive new asymptotically valid testingprocedures with good ﬁnite sample properties. To this end, let Z ik = ( R ik − ¯ R i. )denote the centered rank vectors, where R ik = ( R i k , ..., R idk ) T , ¯ R i. = ( ¯ R i . , ..., ¯ R id. ) T and ¯ R ij. = λ ij. (cid:80) n i k =1 λ ijk ( R ijk ). Moreover, let W ik denote independent and identicallydistributed random weights with E ( W ik ) = 0 and V ar ( W ik ) = 1. Although, thereare diﬀerent possible choices for these random weights [Mammen, 1992, Davidsonand Flachaire, 2008], some particular choices have become popular. Following theinvestigations in Friedrich et al. [2017b] for the corresponding complete case scenario,we use Rademacher random variables, which are deﬁned by P ( W ik = −

1) = P ( W ik =1) = 1 /

2. Then, a wild bootstrap sample is deﬁned as Z ∗ ik = W ik . Z ik , i = 1 , ..., a, k = 1 , ..., n i . (4.1)In other words, Z ∗ ik is a symmetrization of the rank vector Z ik . Now, we can deﬁnethe bootstrap version of the relative eﬀect estimator ˆ p ij ˆ p ∗ ij = 1 λ ij. n i (cid:88) k =1 λ ijk N ( Z ∗ ijk ) = 1 λ ij. n i (cid:88) k =1 λ ijk N ( W ik ( R ijk − ¯ R ij. )) . (4.2)We pool them into the vector ˆ p ∗ = (ˆ p ∗ , ..., ˆ p ∗ ad ) T to obtain the bootstrap counter-part of ˆ p .In the same way, the bootstrap covariance matrix estimator ˆ V ∗ i = [ˆ v ∗ i ( j, j (cid:48) )] withdiagonal and oﬀ-diagonal elements ˆ v ∗ i ( j, j ) and ˆ v ∗ i ( j, j (cid:48) ), respectively, is given by7 v ∗ i ( j, j ) = n i (cid:80) n i k =1 λ ijk [ Z ∗ ijk − ¯ Z ∗ ij. ] ( N ) λ ij. ( λ ij. − , (4.3)ˆ v ∗ i ( j, j (cid:48) ) = n i (cid:80) n i k =1 λ ijk λ ij (cid:48) k [( Z ∗ ijk − ¯ Z ∗ ij. )( Z ∗ ij (cid:48) k − ¯ Z ∗ ij (cid:48) . )]( N )(( λ ij. − λ ij (cid:48) . −

1) + ∆ i,jj (cid:48) − . (4.4)From this, the bootstrapped versions of the quadratic forms, i.e. the Wald-typestatistic T ∗ W , the ANOVA-type statistic T ∗ A and the modiﬁed ANOVA-type statistic T ∗ M are computed as T ∗ W = n ˆ p ∗ T C T [ C ˆ V ∗ n C T ] + C ˆ p ∗ , (4.5) T ∗ A = 1 tr ( T ˆ V ∗ n ) n ˆ p ∗ T T ˆ p ∗ , (4.6) T ∗ M = n ˆ p ∗ T C T [ C ˆ D ∗ n C T ] + C ˆ p ∗ , (4.7)where ˆ V ∗ n = a (cid:76) i =1 nn i ˆ V ∗ i and ˆ D ∗ n = diag ( ˆ V ∗ n ) ii . To get an asymptotically valid boot-strap test, we have to assure that the conditional distribution of the Wald, ANOVA-type, and MATS-type bootstrap statistics T ∗ W , T ∗ A , and T ∗ M approximate the nulldistribution of T W , T A , and T M , respectively. Theorem 4.1.

Under Assumption (1) , the following results hold:(1) For i = 1 , ..., a , the conditional distribution of √ n ˆ p ∗ i , given the data X , convergesweakly to the multivariate N (0 , κ − i V i ) -distribution in probability.(2) The conditional distribution of √ n ˆ p ∗ , given the data X , converges weakly to themultivariate N (0 , a (cid:76) i =1 κ − i V i ) -distribution in probability. Thus, the distributions of √ n C ˆ p ∗ and √ n C ( ˆ p − p ) asymptotically coincide underthe null hypothesis H . Theorem 4.2.

Under Assumption (1) , for any choice [ − ] ∈ { A, M, W } , the condi-tional distribution of T ∗ [ − ] converges weakly to the null distribution of T [ − ] in probabilityfor any choice of p ∈ R ad . In particular we have that sup x ∈ R | P p ( T ∗ [ − ] ≤ x | X ) − P H ( T [ − ] ≤ x ) | p −→ holds, provided that V > or v i ( j, j ) > for all i ∈ { , ..., a } , j ∈ { , ..., d } is fulﬁlledin case of T W or T M , respectively. Therefore, the corresponding wild bootstrap tests are given by φ ∗ W = { T W > c ∗ w } , φ ∗ A = { T A > c ∗ A } , and φ ∗ M = { T M > c ∗ M } , where c ∗ w , c ∗ A , and c ∗ M denoting the(1 − α ) quantiles of the conditional bootstrap distributions of T ∗ W , T ∗ A , and T ∗ M giventhe data, respectively. 8heorem 4.2 implies that the wild bootstrap tests are of asymptotic level α underthe null hypothesis and consistent for any ﬁxed alternative H : CF (cid:54) = , i.e., theyhave asymptotically power 1. In addition, it follows from Jansen et al. [2003] thatthey have the same local power under contiguous alternatives as their original tests. H H H T A T W* T A* T M* T A T W* T A* T M* T A T W* T A* T M* T y pe − I e rr o r d Figure 1.: Type-I error of the asymptotic ANOVA-type test T A and the bootstrappedtests T ∗ W , T ∗ A , and T ∗ M based on several hypotheses of interest for varying time points d ∈ { , } . Each boxplot summarizes the type-I error results from 96 diﬀerent simula-tion scenarios for this hypothesis. For all individual simulations, see the tables in thesupplement.

5. Monte-Carlo simulations

The above results are valid for large sample sizes. To analyze the ﬁnite sample behaviorof the asymptotic quadratic tests and their wild bootstrap counterparts described inSections 3 and 4, we conduct extensive simulations. As an assessment criteria, allprocedures were studied with respect to their(i) type-I error rate control at level α = 5% and their(ii) power to detect deviations from the null hypothesis.All simulations were operated by means of the R computing environment, version3.2.3 R Core Team [2013] and each setting was based on 10 ,

000 simulation runs and B = 999 bootstrap runs. The algorithm for the computation of the p -value of the wildbootstrap tests in any of the test statistics T ∈ { T A , T W , T M } is as follows:(1) For the given incomplete multi-factor data, calculate the observed test statistic,say T .(2) Compute the rank values Z ik . 9 T A T W* T A* T M* T A T W* T A* T M* T y pe − I e rr o r d Figure 2.: Type-I error of the asymptotic ANOVA-type test T A and the bootstrappedtests T ∗ W , T ∗ A , and T ∗ M based on missing rates r = { , } for varying time points d ∈{ , } . Each boxplot summarizes the type-I error results from 144 diﬀerent simulationscenarios for this missing rate. For all individual simulations, see the tables in thesupplement.(3) Using i.i.d Rademacher weights W i , ..., W in i , generate bootstrapped rank values Z ∗ ik = W ik . Z ik .(4) Calculate the value of the test statistic for the bootstrapped sample T ∗ .(5) Repeat the Steps 3 and 4 independently B = 999 times and collect the observedtest statistic values in T ∗ b , b = 1 , ....., B .(6) Finally, estimate the wild bootstrap p -value as P − value = (cid:80) Bb =1 I ( T ∗ b > = T ) B .We considered a two-way layout design with a = 2 independent groups andtwo diﬀerent time points d ∈ { , } underlying discrete and continuous distri-butions. As in Konietschke et al. [2015] and Bathke et al. [2018] for the case ofcomplete observations, we investigated balanced situations with sample size vectors( n , n ) ∈ { (5 , , (10 , , (20 , } , and unbalanced situation with sample size vector( n , n ) = (10 , H G : ( P a ⊗ d Td ), no time eﬀect H T : ( a Ta ⊗ P d ), aswell as no interaction eﬀect H GT : ( P a ⊗ P d ).For each scenario, we generated missingness within MCAR as well as MARframeworks as described below: For the MCAR mechanism, we simulated X ik = ( δ ik X i k , ..., δ idk X idk ) , k = 1 , ..., n i , for independent Bernoulli distributed δ i k ∼ B ( r ) and a zero entry was interpreted as a missing observation. The missingprobability was chosen from r ∈ { , } .For the MAR mechanism , we considered several pairs of features { X obs , X miss } .For each pair, there is a determining feature X obs that determines the missingpattern of its corresponding X miss [Santos et al., 2019]. Thus, for d = 4, we10 P AR CS10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 600.040.080.120.16 r H G T TP AR CS10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 600.040.080.120.16 r H T TP AR CS10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 600.040.080.120.16 r H G (a) Normal TP AR CS10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 600.040.080.120.16 r H G T TP AR CS10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 600.040.080.120.16 r H T TP AR CS10 20 30 40 50 60 10 20 30 40 50 60 10 20 30 40 50 600.040.080.120.16 r H G (b) Chisquare Figure 3.: Type-I error simulation results ( α = 0 .

05) of the tests T W (–––), T A (–– ––) , T ∗ W ( − · − ), T ∗ A ( · · · ), and T ∗ M ( – - – ) under diﬀerent covariance structureswith sample sizes ( n , n ) = (15 ,

15) and d = 4 for varying percentages of MCAR data r ∈ { , , , , , } with observations generated from (a) a Normaland (b) a χ distribution, respectively.deﬁne the pairs { X i k , X i k } and { X i k , X i k } . Whereas, for d = 8, we deﬁnethe following pairs { X i k , X i k } , { X i k , X i k } , { X i k , X i k } , and { X i k , X i k } . Twodiﬀerent MAR scenarios are investigated - MAR1 and MAR2-. In MAR1, for eachpair, we divided X into three groups based on their X obs values corresponding toa 2 σ − rule: the ﬁrst group is given by { X ik : X obs ∈ ( −∞ , − σ ) , k = 1 , .., n i } ,the second by { X ik : X obs ∈ ( − σ , σ ) , k = 1 , .., n i } and the last group by { X ik : X obs ∈ (2 σ , ∞ ) , k = 1 , .., n i } , where σ is the variance of X obs . Then, werandomly inserted missing values on X miss based on the following missing percentages:15% for group one and three and 30% for the second group.For generating MAR2 scenario, we considered the median of each X obs to deﬁne themissing pattern of X miss [Zhu et al., 2012, Pan et al., 2015]. For each pair from above,two groups were deﬁned: the ﬁrst one is { X ik : X obs ∈ ( −∞ , median ( X obs )] , k =1 , .., n i } , and the second is { X ik : X obs ∈ ( median ( X obs ) , ∞ ) , k = 1 , .., n i } . Then,missing values were inserted on X miss as follows: 10% for group one and 30% for thesecond group. Continuous data

To investigate the type-I error control of the suggested methods, data samples weregenerated from the model X ik ∼ F i ( µ , Σ i ) , i = 1 , , k = 1 , ..., n i , F i ( µ , Σ i ) represents a multivariate distribution with expectation vector µ and covariance matrix Σ i . Marginal data distributions were generated from diﬀerentsymmetric distributions (normal, double exponential) as well as skewed distributions(lognormal, χ ) (similar to Pauly et al. [2015a]). We used normal copulas to generatethe dependency structures of the repeated measurements using the R package copula [Yan et al., 2007]. For the covariance matrix Σ i , we investigated the three followingcovariance structures (AR) Setting 1: Σ i = ( ρ | l − j | ) l,j ≤ d , ρ = 0 . i = 1 , (CS) Setting 2: Σ i = I t , for i = 1 , , (TP) Setting 3: Σ i = ( d − | l − j | ) l,j ≤ d , for i = 1 , Type-I Error Results.

The type-I error simulation results of the studiedprocedures for testing the hypotheses of no time eﬀect H T and no group × timeinteraction H GT under the MCAR and MAR frameworks are shown in Tables S.1 -S.8 (MCAR framework) and Tables S.9 - S.16 (MAR framework) in the supplement.It can be readily seen that the suggested bootstrap approaches based upon T ∗ W , T ∗ A and T ∗ M tend to result in quite accurate type-I error rate control for most hypothesesunder symmetric as well as skewed distributions and under MCAR and MARmechanisms. The type-I error control is surprisingly not aﬀected by less stringentmissing mechanisms and the bootstrap tests are robust under fairly large amountsof missing observations. In addition, the data dependency structures do not aﬀectthe quality of the approximations. Moreover, there was no signiﬁcant impact of thesample sizes balanced or unbalancedness in the control of type-I error. On the otherhand, the asymptotic ANOVA-type test T A also shows a quite accurate type-I errorcontrol for large sample sizes. However, under small sample sizes, T A tends to besensitive to the missing rates. In particular, it exhibits a liberal behavior for largermissing rates. In contrast, the asymptotic Wald test T W shows an extremely liberalbehavior in all considered situations and under all investigated missing mechanisms. Acloser look at the type-I error simulation results for all considered settings under theMCAR framework is provided. Compact boxplots factorized in terms of hypothesesof interest or missing rates are given in Figure 1 and 2, respectively. They are basedupon diﬀerent sample size settings (4) x diﬀerent distributions (4) x diﬀerent missingrates (2) x diﬀerent covariance structures (3) x diﬀerent time points (2) x diﬀerenthypotheses (3). Due to the extreme liberal behavior of the asymptotic Wald test, ithas been removed from those plots in order to get a better view of the remainingtests behavior. It can be clearly seen that the asymptotic ANOVA-type test has theless accurate control among all other considered tests. The hypothesis of interestand number of time points obviously impact the quality of the approximations ofthe asymptotic ANOVA-type test. Furthermore, the type-I error control behavior ofthe bootstrapped MATS depends on the hypothesis of interest. Consequently, thebootstrapped Wald and ANOVA-type tests are recommended over all consideredmethods.In order to cover the eﬀect of increasing missing rates , we additionally studiedtype-I error control for a = 2 groups, d = 4 time points, ( n = 15 , n = 15) sample12izes with r ∈ { , , , , , } covering missingness in observationsranging from 10% to 60%. Figure 3 and Figures S.1 - S.2 in the supplement summarizetype-I error rate control for these settings under symmetric and asymmetric distribu-tions. The results indicate that the asymptotic Wald test T W tends to be liberal in allconsidered situations. In particular, It is extremely liberal when testing the hypothesesof no time eﬀect ( H T ) and no interaction eﬀect ( H GT ). In contrast, the asymptoticANOVA-type test T A tends to be sensitive to missing rates and hypothesis type. Inparticular, it exhibits an accurate or liberal behavior for small or large missing rates,respectively. Moreover, it shows a quite constant liberal behavior when testing thehypothesis of no group eﬀect ( H G ). Its behaviour appears to be independent of thecovariance pattern. In contrast, the suggested bootstrap approaches tend to controltype-I error rate more accurate over the range of missing rates r for almost all settings.

10 30(10,10) (10,20) (20,10) (20,20)(10,10) (10,20) (20,10) (20,20)0.040.060.080.100.120.14 n H G T

10 30(10,10) (10,20) (20,10) (20,20)(10,10) (10,20) (20,10) (20,20)0.040.060.080.100.120.14 n H T

10 30(10,10) (10,20) (20,10) (20,20)(10,10) (10,20) (20,10) (20,20)0.040.060.080.100.120.14 n H G Figure 4.: Type-I error simulation results ( α = 0 .

05) of the tests T W (–––), T A (–– ––) , T ∗ W ( − · − ), T ∗ A ( · · · ), and T ∗ M ( – - – ) for ordinal data under MCAR frame-work with sample sizes ( n , n ) = { (10 , , (10 , , (20 , , (20 , } and d = 4.13 .2. Ordinal data

In order to address all the goals outlined above, we simulated ordinal data. The ob-servations were simulated similar to Brunner and Langer [2000] as follows: X ijk = int (cid:0) cZ ik + Y ijk c + 1 (cid:1) + 1 , i = 1 , ..., a, k = 1 , ..., n i , j = 1 , ..., d, where Z ik and Y ijk are independently uniformly distributed in the interval [0 , c > int ( x ) indicates the integer part of x . The elements of X ijk take values between 1 and 4. The correlation between X ij ‘ k and X ijk is determinedby the choice of the constant c . In our simulation study, we considered c = 1 assuringa compound symmetric covariance structure. We considered a = 2 groups and d ∈ { , } dimensions, and the same sample sizes as in the continuous data settings.The type-I error results of the considered methods under MCAR and MAR frame-works are summarized in Figure 4 and Figures S.3 - S.5 in the supplement. It can beseen that the asymptotic Wald test is too liberal in most situations and under all con-sidered hypotheses. The ANOVA-type test of Brunner et al. [1999] shows much betterbehavior. In contrast, the type-I error control for the bootstrap-based procedures isthe best, particularly for the bootstrapped Wald and ANOVA-type tests which arealso less eﬀected by an increased missing rate or strict MAR assumptions. TP AR CS0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.00.250.500.751.00

Delta (Normal) po w e r TP AR CS0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.00.250.500.751.00

Delta (Double exponential) po w e r Figure 5.: Power simulation results of the tests T W ( ––– ), T A ( –– –– ) , T ∗ W ( – · – ), T ∗ A ( · · · ), and T ∗ M ( – - – ) under diﬀerent covariance structures with sample size n = 15and d = 4 under alternative 1, under the MCAR framework with missing rate r = 30%with observations generated from a Normal (upper row) and a Double exponential(bottom row) distribution, respectively. 14 P AR CS0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.00.250.500.75

Delta (Chi−square) po w e r TP AR CS0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.00.250.500.75

Delta (Lognormal) po w e r Figure 6.: Power simulation results of the tests T W ( ––– ), T A ( –– –– ) , T ∗ W ( – · – ), T ∗ A ( · · · ), and T ∗ M ( – - – ) under diﬀerent covariance structures with sample size n = 15and d = 4 under alternative 1, under the MCAR framework with missing rate r = 30%with observations generated from a χ (upper row) and a Lognormal (bottom row)distribution, respectively. Power

In order to assess the empirical power of all studied methods, we considered a one-sample repeated measures design with d = 4 repeated measures, sample size n = 15,and covariance structures as given in Settings 1 − X k ∼ F ( µ , Σ ) + µ , k = 1 , ..., , where, we were interested in detecting two speciﬁc alternatives • Alternative 1: µ = (0 , , ζ, ζ ), • Alternative 2: µ = (0 , , , ζ ),for varying shift parameter ζ ∈ { , . , , . , , . , } .The power analysis results of the considered methods under the MCAR frameworkfor several distributions, involving various covariance settings for detecting Alternative1 are summarized in Figures 5 - 6 (missing rate r = 30%) and Figures S.6 - S.7in the supplement ( r = 10%). The simulation results for investigating Alterna-tive 2 are displayed in Figures S.8 - S.11 in the supplement. The power analysisresults of the considered methods under the respective MAR framework are summa-rized in Figures S.12 - S.15 (MAR1 scenario) and Figures S.16 - S.19 (MAR2 scenario).15s the asymptotic Wald test is too liberal compared to the other studiedmethods, its power function is larger. Moreover, the bootstrapped Wald test ex-hibits the lowest power behavior, while the bootstrapped MATS and ANOVA-typehave a quite similar power behavior as the Brunner et al. [1999] ANOVA-type test.The diﬀerences between the procedures is less pronounced under the MAR framework.To sum up, we recommend the bootstrap ANOVA-type statistic. It exhibits theoverall best type-I-error control combined with a good power behaviour and needs theless stringent assumptions for application. Session 1 Session 2 Session 3 (a)

Side eﬀect

Session 1 Session 2 Session 3 (b)

Therapeutic eﬀect

Figure 7.: Frequencies of the side eﬀect and therapeutic eﬀect scores observed inthe ﬂuvoxamine trial.

6. Application to empirical data6.1.

The ﬂuvoxamine trial

In this section, we re-examine a clinical ﬂuvoxamine study. It has already beenanalyzed by Molenberghs and Lesaﬀre [1994], Molenberghs et al. [1997], Van Steenet al. [2001], Jansen et al. [2003], Molenberghs and Verbeke [2004], and Molenberghsand Kenward [2007]. The study has been performed to establish the proﬁle ofﬂuvoxamine in ambulatory clinical psychiatric operations. Hereby, 315 patients whosuﬀer from depression, panic disorder, and/or obsessive-compulsive disorder werescored every two weeks over six weeks of treatment (d=3). At each clinical visit,scores for both side eﬀect and therapeutic eﬀect scales which are based upon about20 psychiatric symptoms were recorded. The side eﬀect scale ranges from 1 to 4. Thelower the score, the better the clinical record. For example, score 1 stands for ”no sideeﬀect” while score 4 indicates that ”the side eﬀect surpasses the therapeutic eﬀect”.Similarly, the therapy eﬀect is a 4 category ordinal scale: (1) ”no improvement orworsening”; (2) ”minimal improvement, not changing functionality”; (3) ”moderateimprovement, partial disappearance of symptoms”; and (4) ”important improvement,almost disappearance of symptoms”. The higher the score, the better the patientshealth.Several patients missed the recording of their measurements in some sessions whichleaded to a large amount of missing values. A closer look to our data shows that,from the total of 315 initially recruited patients, 14 patients dropped oﬀ, 31 patientswere scored on the ﬁrst session only, and 44 patients on Session 1 and Session 3 only.Thus, only 224 patients have complete observations. Two patients were excluded fromthe analysis due to their non-monotone missing pattern. This leaves us with 299 pa-16ients. Waﬄe plots representing the distributions of the side eﬀect and therapeuticeﬀect among the three sessions are shown in Figure 7. We aim to test the hypotheseswhether side eﬀect or therapeutic eﬀect scores are signiﬁcantly diﬀerent between thethree sessions for patients with psychiatric disorder. To this end, we applied all con-sidered testing methods; asymptotic Wald and ANOVA-type tests ( T W , T A ), and thebootstrap procedures ( T ∗ W , T ∗ A , T ∗ M ) to detect the null hypothesis H T : { CF = } .The results are summarized in Table 1. It can be seen that all tests indicate a signiﬁ-cant diﬀerence between the therapeutic eﬀect scores as well as the side eﬀect scores ofthe three sessions (two-sided p-value < . Eﬀect T W T A T ∗ W T ∗ A T ∗ M Therapeutic eﬀect . e − . e − Side eﬀect . e − . e − The skin disorder trial

Here, we study data from a randomized, multi-center, parallel group study for treatinga skin condition. Treating skin conditions can be tough, thus the goal of the study wasto assess the severe rate of the skin condition over time and to compare the eﬃciencyand safety of two continuous therapy treatments drug and placebo. Patients wererandomly assigned to drug or placebo therapy treatment. Prior to treatment, patientswere assessed to determine the initial severity of the skin condition (moderate orsevere). At three follow-up visits, the treatment outcome was measured accordingto a ﬁve-point ordinal response scale that assess the extent of improvement (1 =rapidly improving, 2 = slowly improving, 3 = stable, 4 = slowly worsening, 5 =rapidly worsening). The study consists of 88 and 84 subjects allocated to the activetreatment group and the placebo group, respectively. And, the proportion of missingobservations is around 30%. The distribution of patients improvement across thetreatment groups and the follow-up visits is displayed in Figure 8. The study isdescribed in full detail in Landis et al. [1988] and is published in Davis [2002].Similar to the ﬂuvoxamine study above, we applied all asymptotic and bootstrapprocedures to infer the following null hypotheses: ”no group eﬀect”, ”no time eﬀect ”and ”no group × time eﬀect”. The results are summarized in Table 2: All approachesreject the null hypothesis of no group eﬀect, the null hypothesis of no time eﬀectand the null hypothesis of no treatment group × time interaction. This impliesthat the clinical outcome of the patients signiﬁcantly improves with time and thisprogression is signiﬁcantly diﬀerent between the two treatment groups drug andplacebo. Therefore, the data is further analyzed and split by the factor initial severityand the analysis is replicated separately for each baseline severity level (moderate orsevere). The results are provided in Table 3.It can be seen from Table 3 that all approaches detect a signiﬁcant group eﬀect aswell as a signiﬁcant time eﬀect in both moderate and severe groups. In contrast, a sig-niﬁcant group × time interaction eﬀect arises only in the moderate severity group.This17 T r ea t m en t rapidly improving slowly improving stable slowly worsening rapidly worsening NA Follow−up visits

Figure 8.: Frequencies of observed patients treatment outcome in the skin disordertrial.indicates that the change in clinical outcomes of patients of moderate severity variesover time depending on treatment group membership. All ﬁve approaches share theprevious ﬁndings. Table 2.: P-values of the skin disorder trial.

Hypothesis T W T A T ∗ W T ∗ A T ∗ M Group

Time

Group × Time .

032 0 .

016 0 .

026 0 .

01 0 . Hypothesis

Initial severity = Moderate Initial severity = Severe T W T A T ∗ W T ∗ A T ∗ M T W T A T ∗ W T ∗ A T ∗ M Group

Time .

003 0 .

004 0 .

003 0 .

002 0 .

002 0 0 0 0 0

Group × Time .

041 0 .

032 0 .

043 0 .

027 0 .

029 0 .

423 0 .

318 0 .

447 0 .

338 0 . Simulations for data with similar attributes to the trials data

It was also interesting to discover the type-I error rate control of the tests undersimilar attributes to the data sets of the ﬂuvoxamine and skin disorder clinicaltrials. The data sets reﬂect large sample sizes and a small or moderate amount ofmissing values from either a one sample repeated measurements design or a two waylayout design. Simulation results for the type-I error rate of the studied proceduresfor ( n = 299 , d = 3) and ( n = 88 , n = 84 , d = 3) sample sizes are presented18n Table S.21 and Table S.22 in the supplement, respectively. It can be seen thatmost tests are robust under both settings and control type-I error rate accurately.Only the Wald type tests exhibit a liberal behaviour for some of the lognormal settings.

7. Summary

Multi-group repeated measures design with ordinal or skewed observations are quitecommon. If the observations are additionally subject to missing values existing meth-ods for testing null hypotheses in terms of distribution function may be either liberal(Wald-type statistics) or run (asymptotically) on a wrong type-I-error level (ANOVA-type statistic, [Domhof et al., 2002]). To this end we investigated three alternativesbased upon resampling. We proved their asymptotic validity and analyzed their smallsample behaviour regarding type-I-error control and power in extensive simulations.Under all of the 5 considered methods, an ANOVA-type statistic with critical valuescalculated by means of a Wild bootstrap approach exhibits the best behaviour and isrecommended.In the future, we will include the present methodology into the R package Miss-Pair [Amro et al., 2019a]. Moreover, we plan to extend our investigations to generalMANOVA settings, e.g., extending the results of Dobler et al. [2019a] to the situationwith missing values.

Acknowledgement

Lubna Amro and Markus Pauly acknowledge the support of the German ResearchFoundation (DFG) (No. PA 2409/3-2). Frank Konietschke’s work was also supportedby the German Research Foundation (DFG) (No. KO 4680/3-2).

References

Michael G. Akritas.

Nonparametric Models for ANOVA and ANCOVA Designs , pages964–968. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. ISBN 978-3-642-04898-2. .Michael G Akritas and Steven F Arnold. Fully nonparametric hypotheses for fac-torial designs i: Multivariate repeated measures designs.

Journal of the AmericanStatistical Association , 89(425):336–343, 1994.Michael G Akritas and Edgar Brunner. A uniﬁed approach to rank tests for mixedmodels.

Journal of Statistical Planning and Inference , 61(2):249–277, 1997.Michael G Akritas, Jouni Kuha, and D Wayne Osgood. A nonparametric approach tomatched pairs with missing data.

Sociological methods & research , 30(3):425–454,2002.Lubna Amro, Frank Konietschke, and Markus Pauly. R package misspair. https://github.com/lubnaamro/MissPair , 2019a.Lubna Amro, Markus Pauly, and Burim Ramosaj. Asymptotic based bootstrapapproach for matched pairs with missingness in a single-arm. arXiv preprintarXiv:1912.04902 , 2019b. 19aume Arnau, Roser Bono, Mar´ıa J Blanca, and Rebecca Bendayan. Using the lin-ear mixed model to analyze nonnormal data distributions in longitudinal designs.

Behavior research methods , 44(4):1224–1238, 2012.Maurice S Bartlett. A note on tests of signiﬁcance in multivariate analysis. In

Mathe-matical Proceedings of the Cambridge Philosophical Society , volume 35, pages 180–185. Cambridge University Press, 1939.Arne C Bathke, Sarah Friedrich, Markus Pauly, Frank Konietschke, Wolfgang Staﬀen,Nicolas Strobl, and Yvonne H¨oller. Testing mean diﬀerences among groups: mul-tivariate and repeated measures analysis with minimal assumptions.

Multivariatebehavioral research , 53(3):348–359, 2018.Lori A Biederman, Thomas W Boutton, and Steven G Whisenant. Nematode commu-nity development early in ecological restoration: The role of organic amendments.

Soil Biology and Biochemistry , 40(9):2366–2374, 2008.Virginie Bourlier, Alexia Zakaroﬀ-Girard, Alexandra Miranville, Sandra De Barros,Marie Maumus, Coralie Sengen`es, Jean Galitzky, Max Lafontan, Fredrik Karpe,KN Frayn, et al. Remodeling phenotype of human subcutaneous adipose tissuemacrophages.

Circulation , 117(6):806, 2008.George EP Box. Some theorems on quadratic forms applied in the study of analysisof variance problems, i. eﬀect of inequality of variance in the one-way classiﬁcation.

The annals of mathematical statistics , 25(2):290–302, 1954.Chiara Brombin, Edoardo Midena, and Luigi Salmaso. Robust non-parametric testsfor complex-repeated measures problems in ophthalmology.

Statistical methods inmedical research , 22(6):643–660, 2013.Edgar Brunner. Asymptotic and approximate analysis of repeated measures designsunder heteroscedasticity.

Mathematical Statistics with Applications in Biometry ,page 2, 2001.Edgar Brunner and Frank Langer. Nonparametric analysis of ordered categoricaldata in designs with longitudinal observations and small sample sizes.

BiometricalJournal: Journal of Mathematical Methods in Biosciences , 42(6):663–675, 2000.Edgar Brunner and Madan L Puri. Nonparametric methods in factorial designs.

Sta-tistical papers , 42(1):1–52, 2001.Edgar Brunner, Ulrich Munzel, Madan L Puri, et al. Rank-score tests in factorialdesigns with repeated measures.

Journal of Multivariate Analysis , 70(2):286–317,1999.Edgar Brunner, Frank Konietschke, Markus Pauly, and Madan L Puri. Rank-basedprocedures in factorial designs: Hypotheses about non-parametric treatment eﬀects.

Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 79(5):1463–1485, 2017.Edgar Brunner, Arne C Bathke, and Frank Konietschke.

Rank and Pseudo-RankProcedures for Independent Observations in Factorial Designs . Springer, 2018.Russell Davidson and Emmanuel Flachaire. The wild bootstrap, tamed at last.

Journalof Econometrics , 146(1):162–169, 2008.Charles S Davis.

Statistical methods for the analysis of repeated measurements .Springer Science & Business Media, 2002.Dennis Dobler, Sarah Friedrich, and Markus Pauly. Nonparametric manova in mean-ingful eﬀects.

Annals of the Institute of Statistical Mathematics , pages 1–26, 2019a.Dennis Dobler, Markus Pauly, and ThomasH Scheike. Conﬁdence bands for multiplica-tive hazards models: Flexible resampling approaches.

Biometrics , 75(3):906–916,2019b.Sebastian Domhof, Edgar Brunner, and D Wayne Osgood. Rank procedures for re-20eated measures with missing values.

Sociological methods & research , 30(3):367–393, 2002.David M Erceg-Hurn and Vikki M Mirosevich. Modern robust statistical methods:an easy way to maximize the accuracy and power of your research.

AmericanPsychologist , 63(7):591, 2008.Garrett M Fitzmaurice, Nan M Laird, and James H Ware.

Applied longitudinal anal-ysis , volume 998. John Wiley & Sons, 2012.Youyi Fong, Ying Huang, Maria P Lemos, and M Juliana Mcelrath. Rank-basedtwo-sample tests for paired data with missing values.

Biostatistics , 19(3):281–294,2018.Sarah Friedrich and Markus Pauly. Mats: Inference for potentially singular and het-eroscedastic manova.

Journal of Multivariate Analysis , 165:166–179, 2018.Sarah Friedrich, Edgar Brunner, and Markus Pauly. Permuting longitudinal data inspite of the dependencies.

Journal of Multivariate Analysis , 153:255–265, 2017a.Sarah Friedrich, Frank Konietschke, and Markus Pauly. A wild bootstrap approach fornonparametric repeated measurements.

Computational Statistics & Data Analysis ,113:38–52, 2017b.Xin Gao. A nonparametric procedure for the two-factor mixed model with missingdata.

Biometrical journal , 49(5):774–788, 2007.Ivy Jansen, Geert Molenberghs, Marc Aerts, Herbert Thijs, and Kristel Van Steen. Alocal inﬂuence approach applied to binary data from a psychiatric study.

Biometrics ,59(2):410–419, 2003.Richard Arnold Johnson and Dean W Wichern.

Applied multivariate statistical anal-ysis , volume 6. Prentice hall Upper Saddle River, NJ, 2007.Frank Konietschke, Solomon W Harrar, Katharina Lange, and Edgar Brunner. Rank-ing procedures for matched pairs with missing data—asymptotic theory and a smallsample approximation.

Computational Statistics & Data Analysis , 56(5):1090–1102,2012.Frank Konietschke, Arne C Bathke, Solomon W Harrar, and Markus Pauly. Parametricand nonparametric bootstrap methods for general manova.

Journal of MultivariateAnalysis , 140:291–301, 2015.K Krishnamoorthy and Fei Lu. A parametric bootstrap solution to the manova underheteroscedasticity.

Journal of Statistical Computation and Simulation , 80(8):873–887, 2010.J Richard Landis, Michael E Miller, Charles S Davis, and Gary G Koch. Some generalmethods for the analysis of categorical data in longitudinal studies.

Statistics inmedicine , 7(1-2):109–137, 1988.DN Lawley. A generalization of ﬁsher’s z test.

Biometrika , 30(1/2):180–187, 1938.Ylva Lekberg, James D Bever, Rebecca A Bunn, Ragan M Callaway, Miranda MHart, Stephanie N Kivlin, John Klironomos, Beau G Larkin, John L Maron, Kurt OReinhart, et al. Relative importance of competition and plant–soil feedback, theirsynergy, context dependency and implications for coexistence.

Ecology letters , 21(8):1268–1281, 2018.Enno Mammen. Bootstrap, wild bootstrap, and asymptotic normality.

ProbabilityTheory and Related Fields , 93(4):439–455, 1992.Torben Martinussen and Thomas H Scheike.

Dynamic regression models for survivaldata . Springer Science & Business Media, 2007.Theodore Micceri. The unicorn, the normal curve, and other improbable creatures.

Psychological bulletin , 105(1):156, 1989.Geert Molenberghs and Michael Kenward.

Missing data in clinical studies , volume 61.21ohn Wiley & Sons, 2007.Geert Molenberghs and Emmanuel Lesaﬀre. Marginal modeling of correlated ordinaldata using a multivariate plackett distribution.

Journal of the American StatisticalAssociation , 89(426):633–644, 1994.Geert Molenberghs and Geert Verbeke. Meaningful statistical model formulations forrepeated measures.

Statistica Sinica , pages 989–1020, 2004.Geert Molenberghs, Michael G Kenward, and Emmanuel Lesaﬀre. The analysis oflongitudinal ordinal data with nonrandom drop-out.

Biometrika , 84(1):33–44, 1997.Ullrich Munzel and Edgar Brunner. Nonparametric methods in multivariate factorialdesigns.

Journal of Statistical Planning and Inference , 88(1):117–132, 2000.Stephen Nesnow, Marc J Mass, Jeﬀrey A Ross, Anthony J Galati, Guy R Lambert,Chris Gennings, Walter H Carter Jr, and Gary D Stoner. Lung tumorigenic inter-actions in strain a/j mice of ﬁve environmental polycyclic aromatic hydrocarbons.

Environmental Health Perspectives , 106(suppl 6):1337–1346, 1998.Ruilin Pan, Tingsheng Yang, Jianhua Cao, Ke Lu, and Zhanchao Zhang. Missing dataimputation by k nearest neighbours based on grey relational structure and mutualinformation.

Applied Intelligence , 43(3):614–632, 2015.Markus Pauly, Edgar Brunner, and Frank Konietschke. Asymptotic permutation testsin general factorial designs.

Journal of the Royal Statistical Society: Series B (Sta-tistical Methodology) , 77(2):461–473, 2015a.Markus Pauly, David Ellenberger, and Edgar Brunner. Analysis of high-dimensionalone group repeated measures designs.

Statistics , 49(6):1243–1261, 2015b.Fortunato Pesarin and Luigi Salmaso. A review and some new results on permutationtesting for multivariate problems.

Statistics and Computing , 22(2):639–646, 2012.Hui-Rong Qian and Shuguang Huang. Comparison of false discovery rate methods inidentifying genes with diﬀerential expression.

Genomics , 86(4):495–503, 2005.R Core Team.

R: A Language and Environment for Statistical Computing . RFoundation for Statistical Computing, Vienna, Austria, 2013. URL .Miriam Seoane Santos, Ricardo Cardoso Pereira, Adriana Fonseca Costa, Jastin Pom-peu Soares, Jo˜ao Santos, and Pedro Henriques Abreu. Generating synthetic missingdata: A review by missing mechanism.

IEEE Access , 7:11651–11667, 2019.(cid:32)Lukasz Smaga. Bootstrap methods for multivariate hypothesis testing.

Communica-tions in Statistics-Simulation and Computation , 46(10):7654–7667, 2017.CB Traux and Robert R Carkhuﬀ. Personality change in hospitalized mental pa-tients during group psychotherapy as a function of the use of alternate sessions andvicarious therapy pretraining.

Journal of Clinical Psychology , 1965.Maria Umlauft, Marius Placzek, Frank Konietschke, and Markus Pauly. Wild boot-strapping rank-based procedures: Multiple testing in nonparametric split-plot de-signs. arXiv preprint arXiv:1710.04532 , 2017.G Vallejo, MP Fern´andez, and Pablo Esteban Livacic-Rojas. Analysis of unbalancedfactorial designs with heteroscedastic data.

Journal of Statistical Computation andSimulation , 80(1):75–88, 2010.Kristel Van Steen, Geert Molenberghs, Geert Verbeke, and Herbert Thijs. A localinﬂuence approach to sensitivity analysis of incomplete longitudinal ordinal data.

Statistical Modelling , 1(2):125–142, 2001.SE Wildsmith, GE Archer, AJ Winkley, PW Lane, and PJ Bugelski. Maximization ofsignal derived from cdna microarrays.

Biotechniques , 30(1):202–208, 2001.Jin Xu and Xinping Cui. Robustiﬁed manova with applications in detecting diﬀeren-tially expressed genes from oligonucleotide arrays.

Bioinformatics , 24(8):1056–1062,22008.Li-Wen Xu, Fang-Qin Yang, Shuang Qin, et al. A parametric bootstrap approach fortwo-way anova in presence of possible interactions with unequal variances.

Journalof Multivariate Analysis , 115:172–180, 2013.Jun Yan et al. Enjoy the joy of copulas: with a package copula.

Journal of StatisticalSoftware , 21(4):1–21, 2007.Bing Zhu, Changzheng He, and Panos Liatsis. A robust missing value imputationmethod for noisy data.