Studentized Permutation Method for Comparing Restricted Mean Survival Times with Small Sample from Randomized Trials

Abstract

Recent observations, especially in cancer immunotherapy clinical trials with time-to-event outcomes, show that the commonly used proportional hazard assumption is often not justifiable, hampering an appropriate analysis of the data by hazard ratios. An attractive alternative is the restricted mean survival time (RMST), which does not rely on any model assumption and can always be interpreted intuitively. As pointed out recently by Horiguchi and Uno (2020a), methods for the RMST based on asymptotic theory suffer from inflated type-I error under small sample sizes. To overcome this problem, they suggested a permutation strategy leading to more convincing results in simulations. However, their proposal requires an exchangeable data set-up between comparison groups, which may be limiting in practice. In addition, it is not possible to invert their testing procedure to obtain valid confidence intervals, which can provide more in-depth information. In this paper, we address these limitations by proposing a studentized permutation test as well as the corresponding permutation-based confidence intervals. In our extensive simulation study, we demonstrate the advantage of our new method, especially in situations with relatively small sample sizes and unbalanced groups. Finally, we illustrate the application of the proposed method by re-analysing data from a recent lung cancer clinical trial.

Marc Ditzhaus, Menggang Yu, and Jin Xu

Department of Statistics, TU Dortmund University, Germany.
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA.
Key Laboratory of Advanced Theory and Application in Statistics and Data Science - MOE and School of Statistics, East China Normal University, China.

February 23, 2021 (arXiv preprint, stat.ME)


Keywords: hazard ratio, permutation methods, restricted mean survival time, survival analysis, time-to-event outcomes.

While the log-rank test and hazard ratios were the gold standard in time-to-event analysis for a long time, there is a recent trend towards alternative methods that do not rely on the proportional hazard assumption. The reason for this change is the recently observed violation of the proportional hazard assumption in real data. For example, Trinquart et al. (2016) analysed 54 phase III oncology clinical trials from five leading journals, and in 13 (24%) of them the proportional hazard assumption could be rejected significantly. Especially in immunotherapy trials, a delayed treatment effect often leads to a violation of the proportional hazard assumption (Mick and Chen, 2015; Alexander et al., 2018), and similar violations could be observed when comparing bone marrow transplant and chemotherapy for hematologic malignancies (Zittoun et al., 1995; Scott et al., 2017). More classical and well-known effect sizes such as landmark survival (Taori et al., 2009) and the median survival time (Brookmeyer and Crowley, 1982; Chen and Zhang, 2016; Ditzhaus et al., 2020a) provide a snapshot at a single time point rather than information about the complete Kaplan–Meier curves.

We consider the two-sample survival set-up given by mutually independent survival and censoring times

$$T_{ij} \sim S_i, \qquad C_{ij} \sim G_i, \qquad i = 1, 2, \; j = 1, \ldots, n_i,$$

respectively. Here, $S_i$ and $G_i$ denote the survival functions of the survival and censoring times of the $i$th group, respectively. Neither is required to be continuous, and ties in the data are explicitly allowed, e.g. survival times rounded to days or months. Based on the right-censored event times $X_{ij} = \min(T_{ij}, C_{ij})$ and the censoring statuses $\delta_{ij} = \mathbb{1}\{X_{ij} = T_{ij}\}$, we would like to infer differences between the two groups in terms of their RMSTs

$$\mu_i = \int_0^\tau S_i(t)\,\mathrm{d}t \qquad (i = 1, 2)$$

over a time window $[0, \tau]$ that is practically relevant (e.g. $\tau = 2$ years). It needs to be guaranteed that event times $X_{ij}$ as large as $\tau$ are observable with positive probability, i.e. $P(X_{ij} \ge \tau) > 0$. In practice, a typical choice for $\tau$ is the end-of-study time. While $\tau$ is usually chosen as a pre-specified constant allowing a straightforward interpretation of $\mu_i$, Tian et al. (2020) discuss an empirical choice of $\tau$, e.g. the largest observed time, under appropriate regularity assumptions on the censoring distribution.

The RMST can be naturally estimated by plugging in the Kaplan–Meier estimator $\hat S_i$:

$$\hat\mu_i = \int_0^\tau \hat S_i(t)\,\mathrm{d}t \qquad (i = 1, 2).$$

Asymptotic inference for this estimator relies on a normal approximation, which can be justified by martingale arguments (Andersen et al., 1993) combined with the continuous mapping theorem. In fact, under the assumption of non-vanishing groups, i.e. $n_i/n \to \kappa_i \in (0, 1)$ as $n \to \infty$ (with $n = n_1 + n_2$), which is supposed throughout the paper, we obtain

$$\sqrt{n}\,\{(\hat\mu_1 - \hat\mu_2) - (\mu_1 - \mu_2)\} \xrightarrow{d} Z \sim N(0, \sigma^2), \qquad \sigma^2 = \sigma_1^2 + \sigma_2^2. \tag{1}$$

Here, $\sigma_i^2$ denotes the asymptotic variance of $\sqrt{n}(\hat\mu_i - \mu_i)$ and is given by

$$\sigma_i^2 = \kappa_i^{-1} \int_0^\tau \left\{\int_x^\tau S_i(t)\,\mathrm{d}t\right\}^2 \frac{1}{\{1 - \Delta A_i(x)\}\, G_{i-}(x)\, S_{i-}(x)}\,\mathrm{d}A_i(x) \qquad (i = 1, 2),$$

where $A_i = -\log(S_i)$ is the cumulative hazard rate function and $\Delta A_i(x) = A_i(x) - A_{i-}(x)$ is its increment at $x$. Moreover, $G_{i-}$, $S_{i-}$ and $A_{i-}$ denote the left-continuous versions of $G_i$, $S_i$ and $A_i$, respectively, e.g. $G_{i-}(t) = P(C_{i1} \ge t)$ (cf. $G_i(t) = P(C_{i1} > t)$).

While the convergence in (1) is well established for continuously distributed survival and censoring times (see e.g. Zhao et al., 2016), it remains true even when ties are allowed. See the supplement for a detailed proof. The variance can be estimated straightforwardly by replacing $S_i$, $G_i$ and $A_i$ by their respective Kaplan–Meier ($\hat S_i$, $\hat G_i$) and Nelson–Aalen ($\hat A_i$) estimators. In detail, $\hat\sigma^2 = \hat\sigma_1^2 + \hat\sigma_2^2$ and

$$\hat\sigma_i^2 = \frac{n}{n_i} \int_0^\tau \left\{\int_x^\tau \hat S_i(t)\,\mathrm{d}t\right\}^2 \frac{1}{\{1 - \Delta \hat A_i(x)\}\, \hat S_{i-}(x)\, \hat G_{i-}(x)}\,\mathrm{d}\hat A_i(x). \tag{2}$$

Combining (1) and (2), we obtain an asymptotically valid test $\varphi = \mathbb{1}\{\sqrt{n}\,|\hat\mu_1 - \hat\mu_2| / \hat\sigma > z_{1-\alpha/2}\}$ for the null hypothesis of equal RMSTs, $H_0: \mu_1 = \mu_2$. Here, $z_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$-quantile of the standard normal distribution.

Following the idea of exact permutation tests (Lehmann and Romano, 2006; Hemerik and Goeman, 2018), Horiguchi and Uno (2020a) recently proposed a permutation test for $H_0: \mu_1 = \mu_2$, which we call the unstudentized test hereafter. In detail, given the observed data $(X, \delta) \equiv \{(X_{ij}, \delta_{ij}) : i = 1, 2;\ j = 1, \ldots, n_i\}$, let $(X^\pi, \delta^\pi) \equiv \{(X^\pi_{ij}, \delta^\pi_{ij}) : i = 1, 2;\ j = 1, \ldots, n_i\}$ be its permuted version corresponding to a scramble of the treatment indicator. Note that the permutation is at the subject level and the pairs $(X_{ij}, \delta_{ij})$ are permuted jointly. Horiguchi and Uno (2020a) suggested using the permutation test $\varphi^\pi_{HU} = \mathbb{1}\{|\hat\mu_1 - \hat\mu_2| > q^\pi_{1-\alpha,HU}\}$ in case of small sample sizes, where $q^\pi_{1-\alpha,HU}$ is the $(1 - \alpha)$-quantile of the permutation distribution $t \mapsto P\{|\hat\mu^\pi_1 - \hat\mu^\pi_2| \le t \mid (X, \delta)\}$ given the observed data $(X, \delta)$. Here, $\hat\mu^\pi_i$ and $\hat S^\pi_i$ denote the permutation counterparts of the original estimators, obtained by replacing the data $(X, \delta)$ with a permuted sample $(X^\pi, \delta^\pi)$.

Such permutation tests are known to be finitely exact under exchangeable data, i.e. the type-I error is controlled not only asymptotically but for every fixed sample size. In the context of right-censored survival data, exchangeability implies equal survival and equal censoring distributions between the groups, i.e. $S_1 = S_2$ and $G_1 = G_2$. This is obviously a much stronger assumption on both the time-to-event outcome of interest and the censoring distributions. In our context of RMST comparison, having potentially crossing survival curves in mind, it may occur that the null hypothesis $H_0: \mu_1 = \mu_2$ is true although $S_1 \ne S_2$, as shown by the four examples in Figure 1. In addition, the assumption of equal censoring distributions alone is also too restrictive, since side effects related to the treatment may lead to different drop-out rates, for example. An additional disadvantage is that this unstudentized permutation strategy cannot be used to obtain valid confidence intervals, because $\mu_1 \ne \mu_2$ clearly violates the exchangeability assumption.

[Figure 1: four panels showing the survival functions of Group 1 and Group 2 against time (0 to 10). Panel titles: S3: Exponential vs piece-wise Exponential (crossing curves); S5: Weibull (shape alternatives); S6: Weibull (scale alternatives); S7: Weibull and piece-wise Exponential.]

Figure 1: Four examples, for which the groups' survival curves are different but their restricted mean survival time over [0, 10] coincides. The examples correspond to Scenarios S3, S5, S6 and S7 from the simulation study, see Section 3.

To address all these issues, we propose a studentized permutation test. To explain our idea, we first need to understand the asymptotic behavior of the permuted, unstudentized statistic, here $\hat\mu^\pi_1 - \hat\mu^\pi_2$, under non-exchangeable settings.
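The plug-in estimator $\hat\mu_i$ and the unstudentized permutation test described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of Horiguchi and Uno: the function names `km_rmst` and `unstudentized_perm_pvalue`, the toy data, the horizontal extension of the Kaplan–Meier curve when the largest observation is censored, and the use of a Monte Carlo p-value in place of the quantile $q^\pi_{1-\alpha,HU}$ are all assumptions of this sketch.

```python
import random


def km_rmst(times, events, tau):
    """Area under the Kaplan-Meier curve on [0, tau] (plug-in RMST estimate).

    times: observed times X_ij; events: 1 = event, 0 = censored.
    Assumption of this sketch: if the largest observation is censored
    before tau, the curve is extended horizontally up to tau.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, last_t, area = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        j, deaths = i, 0
        while j < len(data) and data[j][0] == t:  # ties are allowed
            deaths += data[j][1]
            j += 1
        area += surv * (t - last_t)       # rectangle up to time t
        surv *= 1.0 - deaths / at_risk    # Kaplan-Meier step at time t
        at_risk -= j - i
        last_t, i = t, j
    return area + surv * (tau - last_t)


def unstudentized_perm_pvalue(t1, e1, t2, e2, tau, n_perm=2000, seed=1):
    """Monte Carlo permutation p-value based on |RMST_1 - RMST_2|."""
    rng = random.Random(seed)
    observed = abs(km_rmst(t1, e1, tau) - km_rmst(t2, e2, tau))
    pooled = list(zip(t1, e1)) + list(zip(t2, e2))
    n1 = len(t1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)               # permute the pairs (X, delta) at subject level
        pt1, pe1 = zip(*pooled[:n1])
        pt2, pe2 = zip(*pooled[n1:])
        if abs(km_rmst(pt1, pe1, tau) - km_rmst(pt2, pe2, tau)) >= observed:
            hits += 1
    return hits / n_perm


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    g1 = ([1.0, 2.0, 2.0, 4.0, 6.0], [1, 1, 0, 1, 0])
    g2 = ([3.0, 5.0, 7.0, 8.0, 9.0], [1, 0, 1, 1, 0])
    print(km_rmst(*g1, tau=8.0), km_rmst(*g2, tau=8.0))
    print(unstudentized_perm_pvalue(*g1, *g2, tau=8.0))
```

Because the permuted statistic is not divided by a variance estimate, this procedure inherits the exchangeability requirement discussed above.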
To describe the asymptotic behavior of the permuted, unstudentized statistic, we introduce the pooled Kaplan–Meier estimator $\hat S$ and the pooled Nelson–Aalen estimator $\hat A$. In detail, let $N(t) = \sum_{i,j} \delta_{ij} \mathbb{1}\{X_{ij} \le t\}$ be the number of events up to time $t$ and $Y(t) = \sum_{i,j} \mathbb{1}\{X_{ij} \ge t\}$ be the number of individuals at risk at time $t$. Moreover, let $t_1, \ldots, t_d$, $d \in \mathbb{N}$, be the distinct time points within $X$. Then

$$\hat S(t) = \prod_{k:\, t_k \le t} \left[1 - \frac{\Delta N(t_k)}{Y(t_k)}\right] \qquad \text{and} \qquad \hat A(t) = \sum_{k:\, t_k \le t} \frac{\Delta N(t_k)}{Y(t_k)}.$$

Now, define $y(t) = \sum_{i=1}^{2} \kappa_i S_{i-}(t) G_{i-}(t)$ and $\nu(t) = \sum_{i=1}^{2} \kappa_i \int_0^t G_{i-}(s)\,\mathrm{d}F_i(s)$, where $F_i = 1 - S_i$. Combining the Glivenko–Cantelli theorem and the continuous mapping theorem, we obtain almost surely that $\hat S(t)$ and $\hat A(t)$ converge uniformly on $[0, \tau]$ to $S(t) = \exp[-A(t)]$ and $A(t) = \int_0^t 1/y(s)\,\mathrm{d}\nu(s)$, respectively; see the supplement for more details. Having this additional notation at hand, we are now able to derive the asymptotic limit of the permuted, unstudentized statistic $\hat\mu^\pi_1 - \hat\mu^\pi_2$:

Theorem 1. Under $H_0: \mu_1 = \mu_2$ as well as under $H_1: \mu_1 \ne \mu_2$, we have the following conditional convergence in distribution

$$\sqrt{n}\,(\hat\mu^\pi_1 - \hat\mu^\pi_2) \xrightarrow{d} Z_{\mathrm{perm}} \sim N(0, \sigma^2_{\mathrm{perm}}) \qquad \text{as } n \to \infty,$$

given the data in probability, where the limiting variance is given by

$$\sigma^2_{\mathrm{perm}} = \frac{1}{\kappa_1 \kappa_2} \int_0^\tau \left\{\int_x^\tau S(t)\,\mathrm{d}t\right\}^2 \frac{1}{\{1 - \Delta A(x)\}\, y(x)}\,\mathrm{d}A(x).$$

In the special case $S_1 = S_2$ and $G_1 = G_2$, the variances $\sigma^2$ in (1) and $\sigma^2_{\mathrm{perm}}$ coincide. But, in general, they are different. Thus, applying the unstudentized permutation test in a non-exchangeable setting may lead to a systematic error, which is caused by the different variance of the permuted statistic. However, this can be solved by studentization, i.e. by including an appropriate variance estimator in the original test statistic as well as in its permutation counterpart. In fact, it can be shown that the permutation counterpart $\hat\sigma^{\pi 2}$ of the variance estimator $\hat\sigma^2$ converges, given the observed data, to the variance $\sigma^2_{\mathrm{perm}}$ from Theorem 1.
See the supplement for a detailed proof. In other words, including the variance estimator in the permutation step corrects the wrong variance. Consequently, we obtain:

Theorem 2. Under $H_0: \mu_1 = \mu_2$ as well as under $H_1: \mu_1 \ne \mu_2$, we have the following conditional convergence in distribution

$$\sqrt{n}\,(\hat\mu^\pi_1 - \hat\mu^\pi_2) / \hat\sigma^\pi \xrightarrow{d} Z \sim N(0, 1) \qquad \text{as } n \to \infty,$$

given the observed data in probability.

From Theorem 2 we obtain that $\sqrt{n}\,|\hat\mu_1 - \hat\mu_2 - (\mu_1 - \mu_2)| / \hat\sigma$ and $\sqrt{n}\,|\hat\mu^\pi_1 - \hat\mu^\pi_2| / \hat\sigma^\pi$ have the same asymptotic distribution, namely that of $|Z|$ for $Z \sim N(0, 1)$. Let $q^\pi_{1-\alpha}$ denote the $(1 - \alpha)$-quantile of the conditional distribution $t \mapsto P\{\sqrt{n}\,|\hat\mu^\pi_1 - \hat\mu^\pi_2| / \hat\sigma^\pi \le t \mid (X, \delta)\}$. Then the studentized permutation test $\varphi^\pi$ and the permutation-based confidence interval $I^\pi$ for $\mu_1 - \mu_2$ are given by

$$\varphi^\pi = \mathbb{1}\left\{\frac{\sqrt{n}\,|\hat\mu_1 - \hat\mu_2|}{\hat\sigma} > q^\pi_{1-\alpha}\right\}, \qquad I^\pi = \left[\hat\mu_1 - \hat\mu_2 \pm n^{-1/2}\, \hat\sigma\, q^\pi_{1-\alpha}\right].$$

Combining (1), Theorem 2, as well as Lemma 1 and Theorem 7 of Janssen and Pauls (2003), we can deduce that the conditional quantile $q^\pi_{1-\alpha}$ tends to $z_{1-\alpha/2}$ and we obtain:

Corollary 1. (i) The permutation test $\varphi^\pi$ has asymptotic level $\alpha$ for $H_0: \mu_1 = \mu_2$ and is consistent for general alternatives $H_1: \mu_1 \ne \mu_2$, i.e. $E_{H_0}(\varphi^\pi) \to \alpha$ and $E_{H_1}(\varphi^\pi) \to 1$ as $n \to \infty$. (ii) The permutation-based confidence interval $I^\pi$ has asymptotic confidence level $1 - \alpha$, i.e. $P(\mu_1 - \mu_2 \in I^\pi) \to 1 - \alpha$ as $n \to \infty$.

In this subsection, we briefly explain how the permutation strategy can also be adapted to obtain confidence intervals for the ratio $\mu_1/\mu_2$. Since the studentization idea applied directly to the ratio would lead to inappropriate confidence intervals for a ratio, i.e. intervals of the form $\hat\mu_1/\hat\mu_2 \pm D_n$, we consider the log-transformation $\log(\hat\mu_1) - \log(\hat\mu_2)$ instead. Analogous to (1), it can be shown that

$$\sqrt{n}\,[\{\log(\hat\mu_1) - \log(\hat\mu_2)\} - \{\log(\mu_1) - \log(\mu_2)\}] \xrightarrow{d} Z \sim N(0, \sigma^2_{\mathrm{rat}}), \qquad \sigma^2_{\mathrm{rat}} = \frac{\sigma_1^2}{\mu_1^2} + \frac{\sigma_2^2}{\mu_2^2}. \tag{3}$$

The asymptotic variance can be estimated by $\hat\sigma^2_{\mathrm{rat}} = (\hat\sigma_1^2 / \hat\mu_1^2) + (\hat\sigma_2^2 / \hat\mu_2^2)$.
Consequently, an asymptotically valid confidence interval for $\mu_1/\mu_2$ and its studentized permutation counterpart are given respectively by

$$I_{\mathrm{rat}} = \left[\exp\left\{\log(\hat\mu_1) - \log(\hat\mu_2) \pm n^{-1/2}\, \hat\sigma_{\mathrm{rat}}\, z_{1-\alpha/2}\right\}\right],$$

$$I^\pi_{\mathrm{rat}} = \left[\exp\left\{\log(\hat\mu_1) - \log(\hat\mu_2) \pm n^{-1/2}\, \hat\sigma_{\mathrm{rat}}\, q^\pi_{1-\alpha,\mathrm{rat}}\right\}\right],$$

where $q^\pi_{1-\alpha,\mathrm{rat}}$ denotes the $(1 - \alpha)$-quantile of the conditional distribution $t \mapsto P\{\sqrt{n}\,|\log(\hat\mu^\pi_1) - \log(\hat\mu^\pi_2)| / \hat\sigma^\pi_{\mathrm{rat}} \le t \mid (X, \delta)\}$. Similarly to Corollary 1, we can prove that the permutation-based confidence interval is asymptotically valid:

Corollary 2. The permutation-based confidence interval $I^\pi_{\mathrm{rat}}$ for the ratio $\mu_1/\mu_2$ has asymptotic confidence level $1 - \alpha$, i.e. $P(\mu_1/\mu_2 \in I^\pi_{\mathrm{rat}}) \to 1 - \alpha$ as $n \to \infty$.

To complement our theoretical discussion from the previous section, we conducted an extensive simulation study to examine the performance of the permutation test as well as the permutation-based confidence intervals. For ease of presentation, we restricted ourselves to the difference of the RMSTs. Additional results for the ratio are deferred to the supplement.
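The studentized permutation test and the permutation-based confidence interval for the difference $\mu_1 - \mu_2$ examined in this study can be sketched in Python as follows. This is a hedged illustration, not the authors' R code. The function names `rmst_and_var` and `studentized_perm` are hypothetical. The variance uses the Greenwood-type sum $\sum_k a_k^2 d_k / \{Y_k (Y_k - d_k)\}$, with $a_k$ the area under the Kaplan–Meier curve from the $k$th event time to $\tau$, which is what (2) divided by $n$ reduces to if $\hat S_{i-} \hat G_{i-}$ is replaced by the at-risk fraction $Y_i / n_i$ (an assumption of this sketch). Terms of the form 0/0 are skipped, the curve is extended horizontally when the largest observation is censored, and a permuted statistic with zero variance estimate is set to 0; these conventions are also assumptions.

```python
import math
import random


def rmst_and_var(times, events, tau):
    """Plug-in RMST on [0, tau] and a Greenwood-type estimate of its variance."""
    data = sorted(zip(times, events))
    at_risk, surv, i = len(data), 1.0, 0
    steps = []  # (event time, number of events, number at risk, KM value after the step)
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        j, deaths = i, 0
        while j < len(data) and data[j][0] == t:   # ties are allowed
            deaths += data[j][1]
            j += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            steps.append((t, deaths, at_risk, surv))
        at_risk -= j - i
        i = j
    # Walk backwards so that `area` is the integral of the KM curve from t_k to tau.
    area, right, var = 0.0, tau, 0.0
    for t, deaths, y, s_after in reversed(steps):
        area += s_after * (right - t)
        right = t
        if y > deaths:   # skip the 0/0 term when the curve drops to zero
            var += area ** 2 * deaths / (y * (y - deaths))
    return area + right, var   # the KM curve equals 1 before the first event


def studentized_perm(t1, e1, t2, e2, tau, alpha=0.05, n_perm=2000, seed=1):
    """Studentized permutation test and confidence interval for mu_1 - mu_2."""
    def stat(a_t, a_e, b_t, b_e):
        m1, v1 = rmst_and_var(a_t, a_e, tau)
        m2, v2 = rmst_and_var(b_t, b_e, tau)
        return m1 - m2, math.sqrt(v1 + v2)

    rng = random.Random(seed)
    diff, se = stat(t1, e1, t2, e2)
    pooled = list(zip(t1, e1)) + list(zip(t2, e2))
    n1 = len(t1)
    perm = []
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # permute the pairs (X, delta)
        p_diff, p_se = stat(*zip(*pooled[:n1]), *zip(*pooled[n1:]))
        perm.append(abs(p_diff) / p_se if p_se > 0 else 0.0)  # guard: degenerate permutation
    perm.sort()
    q = perm[min(n_perm, math.ceil((1 - alpha) * n_perm)) - 1]  # empirical (1 - alpha)-quantile
    reject = se > 0 and abs(diff) / se > q
    return {"diff": diff, "se": se, "q": q, "reject": reject,
            "ci": (diff - se * q, diff + se * q)}


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    g1 = ([1.0, 2.0, 2.0, 4.0, 6.0, 7.0], [1, 1, 0, 1, 0, 1])
    g2 = ([3.0, 5.0, 7.0, 8.0, 9.0, 9.5], [1, 0, 1, 1, 0, 0])
    print(studentized_perm(*g1, *g2, tau=8.0))
```

Both the observed and the permuted statistics are divided by their own standard error, so the interval is obtained by inverting the test: up to floating-point ties, `reject` is true exactly when 0 lies outside `ci`.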

We considered seven different choices for the survival time distributions:

S1 Exponential distributions (proportional hazards): $T_1 \sim \mathrm{Exp}(0.2)$ and $T_2 \sim \mathrm{Exp}(\lambda_{\delta,[\ldots]})$.

S2 Exponential distribution vs piece-wise Exponential (late departures): $T_1 \sim \mathrm{Exp}(0.2)$ and $T_2$ has piece-wise constant hazard function $\alpha(t) = 0.[\ldots] \cdot \mathbb{1}\{t \le [\ldots]\} + \lambda_{\delta,[\ldots]} \mathbb{1}\{t > [\ldots]\}$.

S3 Exponential distribution vs piece-wise Exponential (crossing curves): $T_1 \sim \mathrm{Exp}(0.2)$ and $T_2$ has piece-wise constant hazard function $\alpha(t) = 0.[\ldots] \cdot \mathbb{1}\{t \le c_{\delta,[\ldots]}\} + 0.[\ldots] \cdot \mathbb{1}\{t > c_{\delta,[\ldots]}\}$.

S4 Lognormal scale alternatives: $T_1 \sim \mathrm{logN}(2, [\ldots].25)$ and $T_2 \sim \mathrm{logN}(\mu_\delta, [\ldots])$.

S5 Weibull shape alternatives (crossing curves): $T_1 \sim \mathrm{Weib}(3, 8)$ and $T_2 \sim \mathrm{Weib}(\mathrm{shape}_\delta, [\ldots])$.

S6 Weibull scale alternatives (crossing curves): $T_1 \sim \mathrm{Weib}(3, 8)$ and $T_2 \sim \mathrm{Weib}(1.[\ldots], \mathrm{scale}_\delta)$.

S7 Weibull vs piece-wise Exponential (crossing curves): $T_1 \sim \mathrm{Weib}(2, 7)$ and $T_2$ has piece-wise constant hazard function $\alpha(t) = 0.[\ldots] \cdot \mathbb{1}\{t \le c_{\delta,[\ldots]}\} + 0.[\ldots] \cdot \mathbb{1}\{t > c_{\delta,[\ldots]}\}$.

The parameters $\lambda_{\delta,k}$, $c_{\delta,k}$, $\mu_\delta$, $\mathrm{shape}_\delta$ and $\mathrm{scale}_\delta$ depend on the difference $\delta = \mu_1 - \mu_2$ of the RMSTs. For our simulations, we considered $\delta = 0$ for the settings under the null hypothesis and $\delta \in \{[\ldots]\}$ for the different alternative scenarios. See Figure 1 for an illustration of Scenarios S3, S5, S6 and S7 with crossing curves under the null hypothesis ($\delta = 0$). Under the null hypothesis ($\delta = 0$), Scenarios S1 and S2 coincide. That is why just one of these two scenarios was included in the simulation study whenever $\delta = 0$ was considered. For the censoring, we chose the following three censoring configurations, see also Figure 2 for the respective survival functions:

C1 unequally Weibull distributed censoring (Weib, uneq): $C_1 \sim \mathrm{Weib}(3, 18)$ and $C_2 \sim \mathrm{Weib}(0.[\ldots], [\ldots])$.

C2 equally uniformly distributed censoring (Unif, eq): $C_1 \sim \mathrm{Unif}[0, 25]$ and $C_2 \sim \mathrm{Unif}[0, 25]$.

C3 equally Weibull distributed censoring (Weib, eq): $C_1 \sim \mathrm{Weib}(3, 15)$ and $C_2 \sim \mathrm{Weib}(3, 15)$.

For the sample sizes, we considered one balanced setting, $n_{\mathrm{bal}} = (20, 20)$, and two unbalanced, $n_{\mathrm{incr}} = (16, 24)$ and $n_{\mathrm{decr}} = (24, 16)$, as well as $K n_{\mathrm{bal}}$, $K n_{\mathrm{incr}}$, $K n_{\mathrm{decr}}$ with $K = 2, 4$. The unstudentized permutation test is implemented in the R package survRM2perm (Horiguchi and Uno, 2020b). Horiguchi and Uno (2020a) discussed extensively different strategies for tackling the problem of possibly inestimable Kaplan–Meier estimators for permuted data sets. Their numerical findings do not reveal a clearly favorable method and all six
studied strategies lead to comparable results. That is why we restricted ourselves here to the simple horizontal extension of the Kaplan–Meier curves, which corresponds to Method 2 in their paper and R package. In detail, we set $\hat S^\pi_i(u) = \hat S^\pi_i(t)$ for all $u \in [t, \tau]$ when $S^\pi_i$ was only estimable up to $t < \tau$.

[Figure 2: three panels showing the censoring survival functions against time (0 to 10). Panel titles: C1: unequally Weibull distributed censoring (Group 1, Group 2); C2: equally uniformly distributed censoring (Group 1 and 2); C3: equally Weibull distributed censoring (Group 1 and 2).]

Figure 2: The survival curves of the three different censoring scenarios.

The unstudentized permutation method, which relies on the assumption of exchangeable data, cannot be used to derive confidence intervals. Consequently, only the asymptotic and studentized permutation methods were included in the respective comparisons.

The simulations were conducted by means of the computing environment R (R Core Team, 2021), version 3.6.1, generating $N_{\mathrm{sim}} = 5{,}000$ simulation runs and $N_{\mathrm{res}} = 2{,}000$ resampling iterations for the two permutation procedures. Analogous to Horiguchi and Uno (2020a), we regenerated the data whenever the Kaplan–Meier estimator was not estimable, i.e. when for at least one group the largest observed time was censored and lay within $[0, \tau]$. The nominal significance level was set to $\alpha = 5\%$ and the end point of the time window was set to $\tau = 10$.

The simulation results for the type-I error control are presented in Table 1. Since the results for the two equally distributed censoring settings lead to the same conclusions, only Scenario C2 is included in the table and the results for C3 can be found in the supplement. To judge the tests' performance, we recall that the 95%-confidence interval for the estimated sizes based on $N_{\mathrm{sim}} = 5{,}000$ simulation runs equals [4.4, 5.6] (in %) when the true type-I error is $\alpha = 5\%$. Having this at hand, it can readily be seen that the asymptotic approach leads to rather liberal decisions. In 70 out of the 108 settings, the empirical type-I error rate was above the upper bound 5.6% of the confidence interval [4.4, 5.6]. […] ($K = 4$) the empirical size was inside the confidence interval [4.4, 5.6] […] $n = K(20, 20)$ […] $n = K(16, 24)$ […] $n = K \cdot (24, 16)$ with empirical sizes around 6%. The empirical sizes under the remaining Scenarios S3, S5, S6 and S7 with crossing survival curves become even more unstable. For the equally distributed censoring setting, the unstudentized permutation test led to rather liberal decisions for $n = K(24, 16)$ with values up to 7.7% and quite conservative decisions for $n = K(16, 24)$ with values reaching down to 3.[…] […].2% under the balanced sample size settings. The liberality is even more pronounced for $n = K(24, 16)$ with values even up to 9.[…] $n = K(16, 24)$ […] $n_1 + n_2 <$ […] ($K = 1, 2$) and can be explained by the liberal behavior of the asymptotic test, which we observed under the null hypotheses. Comparing the power results of the two permutation approaches, the power values are almost indistinguishable in most of the cases. However, in some cases the unstudentized permutation test led to higher power values, with a difference of up to 6 percentage points, and the reverse, i.e. higher power values for the studentized permutation test, can be observed as well. These diverse findings can be explained by the unstable type-I error control of the unstudentized permutation test, with decisions that are too liberal in some settings and too conservative in others. Overall, the results need to be taken with a pinch of salt, because only the studentized permutation test exhibited a generally convincing performance under the null hypotheses.

We finally turn to the performance of the confidence intervals. We summarized the results for all seven distributional choices S1–S7, the three censoring distributions C1–C3 and the five different choices for $\delta \in \{0, [\ldots]\}$ in Figure 3, for each of the nine different sample sizes. In total, each boxplot summarizes the results of 102 different settings; recall that S1 and S2 coincide under $\delta = 0$ and, thus, only S1 is considered in this case. It is apparent that the asymptotic confidence interval is liberal, i.e. its empirical coverage is too low, similar to our findings regarding the type-I error control. The liberality or undercoverage is most pronounced for the small sample size cases ($K = 1$) and becomes less pronounced when the sample sizes increase. But even for the largest sample size settings ($K = 4$), the median empirical coverage is just slightly above the lower border 94.4% of the binomial 95%-confidence interval [94.4, 95.6]. […] [94.4, 95.6] […] $n = (24, 16)$ […] In summary, the studentized permutation method can be recommended for inference on $\mu_1 - \mu_2$ for small sample sizes, as it leads to the most accurate type-I error and coverage control, respectively. Moreover, it can compete in terms of power with the other strategies whenever a comparison is fair and not influenced by liberal decisions under the null hypothesis.

[Figure 3: boxplots of the coverage in % (y-axis) by method (x-axis: Asym, stud P), one panel per sample size: n=(24,16), n=(48,32), n=(96,64), n=(16,24), n=(32,48), n=(64,96), n=(20,20), n=(40,40), n=(80,80).]

Figure 3: Coverage in % (nominal level $\alpha = 5\%$) of the confidence intervals based on the asymptotic approximation (Asym) and the studentized permutation approach (stud P). The dashed, horizontal lines represent the binomial 95%-confidence interval [94.4, 95.6].

To illustrate the presented permutation-based methods, we re-consider the data analysis of Hellmann et al. (2018), who compared a combination treatment of nivolumab plus ipilimumab with chemotherapy among 299 patients with non-small-cell lung cancer. Their study focused on patients with a high tumor mutational burden, i.e. at least ten mutations per megabase, and the study endpoint was progression-free survival. Since the present methods are designed for small sample sizes, we conduct a relevant subgroup analysis, which was also done by Hellmann et al. (2018). In detail, we restrict to the patients having PD-L1 (tumor programmed death ligand 1) expression of at least 1%. On the basis of the published Kaplan–Meier curves in Hellmann et al. (2018) and some additional information therein, e.g. the risk table, we reconstructed the individual patient data following the procedure of Guyot et al. (2012). The respective Kaplan–Meier curves of the two treatment groups are displayed in Figure 4. Therein, we can observe a delayed treatment effect of nivolumab plus ipilimumab. Thus, the assumption of proportional hazards is questionable and can even be formally rejected by the well-established test of Grambsch and Therneau (1994) or the recent permutation-based proposal of Ditzhaus and Janssen (2020) (with 10,000 permutations). Both tests lead to a p-value less than 0.[…]. […] The p-values of the asymptotic, studentized and unstudentized permutation tests (both permutation tests based on 5,000 permutations) for inferring $H_0: \mu_1 = \mu_2$ are presented in Table 2. The confidence intervals for the difference $\mu_1 - \mu_2$ as well as for the ratio $\mu_1/\mu_2$ are shown in Table 3. In both tables, the different end points $\tau \in \{12, 15, 18\}$ were considered. In practice, the end point needs to be chosen jointly with the physician regarding clinical relevance.

[Figure 4: Kaplan–Meier curves, survival probability (y-axis) against time in months (x-axis, 0 to 24), for the treatments Chemotherapy and Nivolumab+ipilimumab. Number at risk at months 0, 3, 6, 9, 12, 15, 18, 21, 24, one row per treatment group: 48, 30, 16, 4, 1, 1, 1, 0, 0 and 38, 20, 16, 15, 10, 8, 4, 1, 1.]

Figure 4: Kaplan–Meier curves of the reconstructed data.

For $\tau = 15$ and $\tau = 18$, the results confirm the findings of Hellmann et al. (2018) that the combination nivolumab plus ipilimumab improves the progression-free time compared to the chemotherapy. The point estimates and confidence intervals in Table 3 help to quantify the improvement and can be interpreted easily. For example, the combination treatment leads on average to a longer progression-free time of 4.[…] ± […].05 months (95% confidence based on 5,000 permutations) compared to the chemotherapy over the first 1.5 years.

In general, it is observed that the asymptotic approach leads to smaller p-values and narrower confidence intervals than its permutation counterpart. Moreover, the unstudentized permutation test led to p-values comparable to those of the asymptotic approach. As pointed out in Section 3, the results of the asymptotic and unstudentized permutation tests need to be considered carefully, especially for small and unbalanced sample sizes as we have here. Thus, we would rather trust the results of the studentized permutation test than those of the other two, especially for $\tau = 12$ months, where the decisions are diverse.

Discussion and remarks

In recent years, the RMST has become an important part of the statistical toolbox for survival data. Various researchers (Stensrud and Hernán, 2020; Trinquart et al., 2016; A’Hern) advise using it, at least as a complementary summary statistic, especially when the assumption of proportional hazards is in doubt. As raised by Horiguchi and Uno (2020a), the type-I error rate of related asymptotic methods is inflated for small sample sizes. The permutation procedure of Horiguchi and Uno (2020a), as well as their detailed discussion of how to deal with inestimable Kaplan–Meier curves of the permuted data, was an important step towards solving that problem. However, their test's application is limited to exchangeable data settings and, in particular, to equal survival and censoring distributions, respectively.

In this paper, we explained how studentization can tackle these limitations. For the present survival two-sample comparison, it allows us to apply permutation tests even in non-exchangeable data situations, i.e. for different survival and/or censoring distributions, as well as to formulate corresponding confidence intervals for the quantities of interest, $\mu_1 - \mu_2$ and $\mu_1/\mu_2$. Moreover, the control of the type-I error, which was the initial motivation for permutation tests, is not affected by the studentization strategy. Compared to their asymptotic counterparts, studentized permutation tests usually show a satisfactory type-I error control even for small sample sizes, as seen in Section 3.

The theoretical justification of studentized permutation tests and respective confidence intervals is complemented by an extensive simulation study. The corresponding results support the usage of the developed methods for small sample sizes.

Our framework can be extended in various directions, e.g. to competing risks (Zhao et al., 2018; Lyu et al., 2020). More general study designs may be part of future research. For that purpose, we can follow Dobler and Pauly (2020) and Ditzhaus et al. (2020a), who recently discussed permutation-based inference for the concordance measure and median survival times, respectively, in the general context of factorial designs. Sample size determination can also be developed, in parallel to the results for the asymptotic test (Ye and Yu, 2018).

Acknowledgement

Marc Ditzhaus was funded by the Deutsche Forschungsgemeinschaft (grant no. PA-2409 5-1). Moreover, the authors gratefully acknowledge the computing time provided on the Linux HPC cluster at TU Dortmund (LiDO3), partially funded in the course of the Large-Scale Equipment Initiative by the Deutsche Forschungsgemeinschaft as project 271512359.

References

R.P. A’Hern. Restricted mean survival time: an obligatory end point for time-to-event analysis in cancer trials? Journal of Clinical Oncology, 34(28):3474–3476.

B.M. Alexander, J.D. Schoenfeld, and L. Trippa. Hazards of hazard ratios - deviations from model assumptions in immunotherapy. The New England Journal of Medicine, 378(12):1158–1159, 2018.

P.K. Andersen, Ø. Borgan, R.D. Gill, and N. Keiding. Statistical Models Based on Counting Processes. Springer, New York, 1993.

T.B. Berrett, Y. Wang, R.F. Barber, and R.J. Samworth. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):175–197, 2020.

R. Brookmeyer and J. Crowley. A confidence interval for the median survival time. Biometrics, 38:29–41, 1982.

Z. Chen and G. Zhang. Comparing survival curves based on medians. BMC Medical Research Methodology, 16(1):1–7, 2016.

E.Y. Chung and J.P. Romano. Exact and asymptotically robust permutation tests. The Annals of Statistics, 41:484–507, 2013.

M. Ditzhaus and S. Friedrich. More powerful logrank permutation tests for two-sample survival data. Journal of Statistical Computation and Simulation, 90(12):2209–2227, 2020.

M. Ditzhaus and A. Janssen. Bootstrap and permutation rank tests for proportional hazards under right censoring. Lifetime Data Analysis, 26(3):493–517, 2020.

M. Ditzhaus, D. Dobler, and M. Pauly. Inferring median survival differences in general factorial designs via permutation tests. Statistical Methods in Medical Research, page 0962280220980784, 2020a.

M. Ditzhaus, A. Janssen, and M. Pauly. Permutation inference in factorial survival designs with the CASANOVA. arXiv preprint arXiv:2004.10818, 2020b.

M. Ditzhaus, R. Fried, and M. Pauly. QANOVA: Quantile-based permutation methods for general factorial designs. TEST (to appear), 2021.

D. Dobler and M. Pauly. Bootstrap- and permutation-based inference for the Mann–Whitney effect for right-censored and tied data. TEST, 27(3):639–658, 2018.

D. Dobler and M. Pauly. Factorial analyses of treatment effects under independent right-censoring. Statistical Methods in Medical Research, 29(2):325–343, 2020.

P.M. Grambsch and T.M. Therneau. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81(3):515–526, 1994.

P. Guyot, A.E. Ades, M. Ouwens, and N.J. Welton. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Medical Research Methodology, 12(1):1–13, 2012.

M.D. Hellmann, T.-E. Ciuleanu, A. Pluzanski, J.S. Lee, G.A. Otterson, C. Audigier-Valette, E. Minenza, H. Linardou, S. Burgers, P. Salman, et al. Nivolumab plus ipilimumab in lung cancer with a high tumor mutational burden. New England Journal of Medicine, 378(22):2093–2104, 2018.

J. Hemerik and J. Goeman. Exact testing with random permutations. TEST, 27(4):811–825, 2018.

M. Horiguchi and H. Uno. On permutation tests for comparing restricted mean survival time with small sample from randomized trials. Statistics in Medicine, 39(20):2655–2670, 2020a.

M. Horiguchi and H. Uno. survRM2perm: Permutation Test for Comparing Restricted Mean Survival Time, 2020b. URL https://CRAN.R-project.org/package=survRM2perm. R package version 0.1.0.

A. Janssen. Studentized permutation tests for non-iid hypotheses and the generalized Behrens-Fisher problem. Statistics & Probability Letters, 36:9–21, 1997.

A. Janssen and T. Pauls. How do bootstrap and permutation tests work? Annals of Statistics, 31(3):768–806, 2003.

D.H. Kim, H. Uno, and L.-J. Wei. Restricted mean survival time as a measure to interpret clinical trial results. JAMA Cardiology, 2(11):1179–1180, 2017.

E.L. Lehmann and J.P. Romano. Testing statistical hypotheses. Springer, New York, 2006.

J. Lyu, Y. Hou, and Z. Chen. The use of restricted mean time lost under competing risks data. BMC Medical Research Methodology, 20(1):1–11, 2020.

T. Mick and T.-T. Chen. Statistical challenges in the design of late-stage cancer immunotherapy studies. Cancer Immunology Research, 3(12):1292–1298, 2015.

G. Neuhaus. Conditional rank tests for the two-sample problem under random censorship. The Annals of Statistics, 21:1760–1779, 1993.

M. Pauly and L. Smaga. Asymptotic permutation tests for coefficients of variation and standardised means in general one-way ANOVA models. Statistical Methods in Medical Research, 29(9):2733–2748, 2020.

M. Pauly, E. Brunner, and F. Konietschke. Asymptotic permutation tests in general factorial designs. Journal of the Royal Statistical Society: Series B, 77:461–473, 2015.

R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2021. URL .

P. Royston and M.K.B. Parmar. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt.

Statisticsin Medicine , 30(19):2409–2421, 2011.P. Royston and M.K.B. Parmar. Restricted mean survival time: an alternative to the hazard ratio forthe design and analysis of randomized trials with a time-to-event outcome.

BMC Medical ResearchMethodology , 13(1):1–15, 2013.B.L. Scott, M.C. Pasquini, B.R. Logan, J. Wu, S.M. Devine, D.L. Porter, R.T. Maziarz, E.D. Warlick,H.F. Fernandez, E.P. Alyea, et al. Myeloablative versus reduced-intensity hematopoietic cell trans-plantation for acute myeloid leukemia and myelodysplastic syndromes.

Journal of Clinical Oncology ,35(11):1154, 2017. L. Smaga. Diagonal and unscaled Wald-type tests in general factorial designs.

Electronic Journal ofStatistics , 11(1):2613–2646, 2017.M.J. Stensrud and M.A. Hern´an. Why test for proportional hazards?

JAMA , 323(14):1401–1402,2020.G. Taori, K.M. Ho, C. George, R. Bellomo, S.A.R. Webb, G.K. Hart, and M.J. Bailey. Landmarksurvival as an end-point for trials in critically ill patients–comparison of alternative durations offollow-up: an exploratory analysis.

Critical Care , 13(4):1–8, 2009.L. Tian, H. Jin, H. Uno, Y. Lu, B. Huang, K. M. Anderson, and L.J. Wei. On the empirical choice ofthe time window for restricted mean survival time.

Biometrics , 2020.L. Trinquart, J. Jacot, S.C. Conner, and R. Porcher. Comparison of treatment eﬀects measured by thehazard ratio and by the ratio of restricted mean survival times in oncology randomized controlledtrials.

Journal of Clinical Oncology , 34(15):1813–1819, 2016.H. Uno, B. Claggett, L. Tian, E. Inoue, P. Gallo, T. Miyata, D. Schrag, M. Takeuchi, Y. Uyama,L. Zhao, et al. Moving beyond the hazard ratio in quantifying the between-group diﬀerence insurvival analysis.

Journal of Clinical Oncology , 32(22):2380, 2014.A.W. van der Vaart and J.A. Wellner.

Weak convergence and empirical processes . Springer Series inStatistics. Springer-Verlag, New York, 1996. With applications to statistics.T. Ye and M. Yu. A robust approach to sample size calculation in cancer immunotherapy trials withdelayed treatment eﬀect.

Biometrics , 74(4):1292–1300, 2018.L. Zhao, B. Claggett, L. Tian, H. Uno, M.A. Pfeﬀer, S.D. Solomon, L. Trippa, and L.J. Wei. On therestricted mean survival time curve in survival analysis.

Biometrics , 72(1):215–221, 2016.L. Zhao, L. Tian, B. Claggett, M. Pfeﬀer, D. H. Kim, S. Solomon, and L.-J. Wei. Estimating treatmenteﬀect with clinical interpretation from a comparative clinical trial with an end point subject tocompeting risks.

JAMA Cardiology , 3(4):357–358, 2018.13.A. Zittoun, F. Mandelli, R. Willemze, T. De Witte, B. Labar, L. Resegotti, F. Leoni, E. Damasio,G. Visani, G. Papa, et al. Autologous or allogeneic bone marrow transplantation compared withintensive chemotherapy in acute myelogenous leukemia.

New England Journal of Medicine , 332(4):217–223, 1995. 14 =(24,16) n=(48,32) n=(96,64)n=(16,24) n=(32,48) n=(64,98)n=(20,20) n=(40,40) n=(80,80)Asym stud P Asym stud P Asym stud P919293949596919293949596919293949596

[Figure 5 axes: Methods (horizontal), Coverage in % (vertical).]

Figure 5: Coverage in % (nominal level $\alpha = 5\%$) of the confidence intervals for the ratio $\mu_1/\mu_2$ based on the asymptotic approximation (Asym) and the studentized permutation approach (stud P). The dashed, horizontal lines represent the binomial 95% confidence interval [94.4%, 95.6%].

A Additional simulation results

First, we present the results for the power comparison, see Tables 5–7, and for the type-I error comparison in Table 4 under the remaining censoring setting C3, i.e. equally Weibull distributed censoring. The results were already briefly discussed in the main paper, including all relevant main conclusions. In addition, we present here the simulation results of the asymptotic and permutation-based confidence intervals for the ratio $\mu_1/\mu_2$ from Section 2.2. For the respective simulation study, we used the same set-up as in the simulation study for the difference-based methods from Section 3. In particular, each boxplot in Figure 5 summarizes the results for 102 different settings. It is apparent that the performance of the asymptotic confidence interval is less extreme than the one for the differences. But for small sample sizes, here $n = (20, 20), (16, 24), (24, 16)$, …

Table 1: Type-I error rates in % (nominal level $\alpha = 5\%$) for the asymptotic (Asym), the studentized permutation (st P) and the unstudentized permutation (un P) tests in Scenarios S1–S4. The values inside the binomial confidence interval [4.4%, 5.6%] are printed bold.

n = K · (24, 16)   n = K · (20, 20)   n = K · (16, 24)
K   Asym  st P  un P   Asym  st P  un P   Asym  st P  un P

S1 and S2: Exponential

Weib (uneq) (7%, 26%) 1 7.2

Unif (eq) (20%, 20%) 1 6.8

S3: Exponential vs. piece-wise Exponential (crossing curves)

Weib (uneq) (7%, 28%) 1 7.1

S4: Lognormal

Weib (uneq) (14%,35%) 1 7.4

Unif (eq) (33%,33%) 1 7.1

S5: Weibull (different shape)

Weib (uneq) (8%,38%) 1 8.0 6.0 9.5 7.9 6.0 7.2 6.5

S6: Weibull (different scale)

Weib (uneq) (8%,35%) 1 7.9 6.0 8.8 7.3

Unif (eq) (29%,35%) 1 7.7

S7: Weibull vs. piece-wise Exponential

Weib (uneq) (7%, 41%) 1 7.1

Table 2: Testing RMST difference based on the asymptotic (Asym), the studentized (st P) and unstudentized (un P) tests for the reconstructed data

            τ = 12 months           τ = 15 months           τ = 18 months
            Asym    un P    st P    Asym    un P    st P    Asym    un P    st P
p-values    0.045   0.045   0.067   0.01    0.011   0.02    0.004   0.005   0.011

Table 3: Point estimates and 95%-confidence intervals of the difference $\mu_1 - \mu_2$ and the ratio $\mu_1/\mu_2$, respectively, based on the asymptotic approximation (Asym) and the studentized permutation method. The first group is the chemotherapy group and the second the nivolumab plus ipilimumab group

                                      τ = 12 months                   τ = 15 months                   τ = 18 months
                                      Asym            st P            Asym            st P            Asym            st P
$\widehat\mu_1 - \widehat\mu_2$       -1.85                           -2.99                           -4.02
95%-CI                                [-3. , -0.04]   [-3. , .13]     [-5. , -0.70]   [-5. , -0.43]   [-6. , -1.26]   [-7. , -0. ]
$\widehat\mu_1 / \widehat\mu_2$

Table 4: Type-I error rates in % (nominal level $\alpha = 5\%$) for the asymptotic (Asym), the studentized permutation (st P) and the unstudentized permutation (un P) tests in Scenarios S1–S7 under equally Weibull distributed censoring, i.e. censoring setting C3. The values inside the binomial confidence interval [4.4%, 5.6%] are printed bold.

n = K · (24, 16)   n = K · (20, 20)   n = K · (16, 24)
K   Asym  st P  un P   Asym  st P  un P   Asym  st P  un P

S1 and S2: Exponential
(11%, 11%) 1 6.1

S3: Exponential vs. piece-wise Exponential (crossing curves)
(11%, 27%) 1 7.7 5.9 7.2 6.3

S4: Lognormal
(21%,21%) 1 6.3

S5: Weibull (different shape)
(13%,40%) 1 7.5 5.7 7.7 7.2 5.7

S6: Weibull (different scale)
(13%,26%) 1 6.6

S7: Weibull vs. piece-wise Exponential
(11%, 43%) 1 6.5

Table 5: Power in % (nominal level $\alpha = 5\%$) under the alternative $\mu_1 - \mu_2 = \delta \in \{1, 2\}$ for the asymptotic (Asym), the studentized permutation (st P) and the unstudentized permutation (un P) tests in Scenarios S1, S2 and S3.

                                      n = K · (24,16)   |   n = K · (16,24)   |   n = K · (20,20)
Censoring    δ      Cens. rates  K   Asym   st P   un P |  Asym   st P   un P |  Asym   st P   un P

S1: Exponential (proportional hazards)

Weib (uneq)  δ = 1  (7%,30%)     1   15.3   12.2   15.1 |  15.3   12.5   13.6 |  16.1   12.6   12.2
                                 2   23.3   21.4   24.6 |  23.8   21.8   22.9 |  23.8   21.6   20.6
                                 4   38.9   37.4   41.7 |  41.2   40.0   41.4 |  41.1   39.4   38.1
             δ = 2  (7%,34%)     1   40.6   35.2   40.2 |  42.3   37.7   39.0 |  42.8   37.0   36.2
                                 2   64.0   61.4   65.9 |  69.2   67.1   68.3 |  67.3   65.0   63.9
                                 4   90.1   89.4   91.4 |  92.5   92.2   92.5 |  92.4   91.9   91.2
Unif (eq)    δ = 1  (20%,27%)    1   16.0   12.3   13.7 |  16.3   13.8   13.8 |  18.1   14.2   13.3
                                 2   24.1   22.0   23.3 |  25.7   23.6   23.5 |  25.3   23.1   22.3
                                 4   41.3   40.3   41.4 |  41.5   40.4   40.4 |  42.4   41.4   40.0
             δ = 2  (20%,37%)    1   41.7   35.4   38.0 |  43.6   38.5   38.7 |  42.2   36.9   36.1
                                 2   65.8   63.5   64.9 |  69.9   68.0   67.9 |  67.8   65.2   64.7
                                 4   91.2   90.6   91.1 |  92.8   92.4   92.3 |  91.6   91.2   91.1
Weib (eq)    δ = 1  (11%,19%)    1   16.8   13.5   15.0 |  16.1   13.7   13.7 |  17.5   14.5   13.7
                                 2   26.0   24.1   25.3 |  26.6   24.8   24.9 |  25.3   23.3   22.8
                                 4   41.6   40.8   41.7 |  46.1   44.8   44.9 |  44.6   43.3   42.4
             δ = 2  (11%,29%)    1   44.9   40.3   42.1 |  47.0   42.0   42.1 |  44.9   40.1   39.2
                                 2   71.6   69.4   70.9 |  72.4   70.6   70.3 |  72.2   69.6   69.0
                                 4   93.4   93.0   93.5 |  95.1   95.0   95.0 |  94.2   93.8   93.4

S2: Exponential (late departures)

Weib (uneq)  δ = 1  (7%,30%)     1   13.8   10.9   14.2 |  14.8   12.1   13.3 |  16.2   12.8   11.5
                                 2   20.9   19.1   23.8 |  21.7   20.0   21.4 |  23.2   21.2   19.5
                                 4   36.1   34.7   40.2 |  39.6   38.5   40.3 |  38.9   37.9   35.5
             δ = 2  (7%,39%)     1   34.8   30.2   36.6 |  37.8   32.8   34.7 |  38.1   32.7   30.1
                                 2   56.1   53.2   60.3 |  61.7   59.4   61.2 |  62.7   59.6   57.2
                                 4   84.8   84.0   87.9 |  88.3   87.7   88.6 |  87.7   87.1   85.6
Unif (eq)    δ = 1  (20%,31%)    1   14.8   11.2   13.1 |  15.1   12.3   12.3 |  15.8   12.9   11.1
                                 2   22.0   20.3   22.2 |  23.8   21.8   22.2 |  24.2   22.0   19.9
                                 4   36.5   35.3   37.3 |  39.4   38.2   38.1 |  37.8   36.6   34.1
             δ = 2  (20%,49%)    1   36.8   31.3   34.5 |  38.3   33.9   33.4 |  38.6   33.5   30.7
                                 2   59.5   56.9   59.8 |  62.5   60.1   59.8 |  61.8   59.4   56.2
                                 4   85.7   84.8   86.4 |  88.6   87.9   88.0 |  89.4   88.6   86.6
Weib (eq)    δ = 1  (11%,24%)    1   15.4   12.5   14.6 |  16.2   13.4   13.3 |  15.7   12.8   11.5
                                 2   22.6   20.6   23.0 |  24.3   22.3   22.6 |  24.8   23.0   20.6
                                 4   38.8   37.5   40.0 |  42.1   41.1   41.0 |  42.2   40.9   38.7
             δ = 2  (11%,46%)    1   38.4   33.8   37.2 |  41.0   36.8   36.8 |  40.8   36.4   33.5
                                 2   62.4   60.1   63.3 |  65.8   63.9   63.8 |  66.4   64.1   61.1
                                 4   88.3   87.8   89.4 |  90.7   90.0   90.1 |  91.2   90.7   89.0

S3: Exponential vs. piece-wise Exponential (crossing curves)

Weib (uneq)  δ = 1  (7%,32%)     1   14.3   11.5   15.1 |  14.3   11.8   12.6 |  15.1   12.0   10.0
                                 2   18.7   16.9   21.9 |  21.2   19.4   21.1 |  21.6   19.5   17.3
                                 4   31.0   29.8   37.1 |  32.6   31.7   33.2 |  35.3   34.2   30.9
             δ = 2  (7%,36%)     1   36.3   31.4   37.3 |  38.6   33.6   35.1 |  38.9   33.4   30.9
                                 2   57.7   54.9   61.9 |  61.1   58.6   60.0 |  61.7   59.2   55.8
                                 4   85.1   84.7   87.5 |  87.5   86.8   87.5 |  88.6   88.0   86.3
Unif (eq)    δ = 1  (20%,38%)    1   13.7   10.8   12.5 |  14.3   11.4   11.2 |  15.8   13.1   10.8
                                 2   20.7   18.8   21.5 |  20.5   18.6   18.4 |  21.6   19.7   16.4
                                 4   34.1   32.8   36.5 |  34.6   33.6   33.2 |  36.2   34.9   30.9
             δ = 2  (20%,46%)    1   36.8   31.4   34.9 |  38.6   33.9   33.7 |  38.2   33.4   30.6
                                 2   59.8   56.8   60.1 |  62.2   59.7   59.6 |  62.5   59.7   56.7
                                 4   86.5   85.7   87.5 |  89.3   88.8   88.5 |  88.5   87.7   86.0
Weib (eq)    δ = 1  (11%,35%)    1   14.3   11.9   14.0 |  14.9   12.6   12.6 |  14.7   11.9    9.8
                                 2   20.9   19.1   22.0 |  22.3   20.8   20.9 |  23.4   21.8   18.3
                                 4   33.5   32.7   36.7 |  38.5   37.7   37.7 |  37.3   36.2   32.4
             δ = 2  (11%,42%)    1   38.0   33.0   36.7 |  39.5   35.3   35.4 |  41.5   36.5   33.8
                                 2   61.6   58.9   62.6 |  64.2   62.0   62.3 |  65.5   63.0   59.6
                                 4   88.5   87.9   89.9 |  91.3   90.9   90.8 |  90.4   89.9   88.2

Table 6: Power in % (nominal level $\alpha = 5\%$) under the alternative $\mu_1 - \mu_2 = \delta \in \{1, 2\}$ for the asymptotic (Asym), the studentized permutation (st P) and the unstudentized permutation (un P) tests in Scenarios S4 and S5.

                                      n = K · (24,16)   |   n = K · (16,24)   |   n = K · (20,20)
Censoring    δ      Cens. rates  K   Asym   st P   un P |  Asym   st P   un P |  Asym   st P   un P

S4: Lognormal (scale alternatives)

Weib (uneq)  δ = 1  (14%,39%)    1   30.2   25.5   24.2 |  28.4   24.1   23.1 |  27.3   22.5   22.5
                                 2   45.7   42.9   43.5 |  45.3   42.8   42.2 |  45.1   41.5   40.9
                                 4   70.6   69.5   70.6 |  72.0   71.1   71.3 |  71.8   70.6   69.8
             δ = 2  (12%,44%)    1   83.0   79.1   76.7 |  84.5   80.8   79.2 |  82.5   76.9   77.9
                                 2   98.2   97.7   97.6 |  98.5   98.2   98.0 |  98.1   97.6   97.6
                                 4  100.0  100.0  100.0 | 100.0  100.0  100.0 | 100.0  100.0  100.0
Unif (eq)    δ = 1  (33%,42%)    1   27.3   22.1   20.6 |  26.5   22.2   22.1 |  25.5   20.4   22.9
                                 2   42.7   39.9   38.1 |  43.2   40.6   40.3 |  42.5   38.9   41.2
                                 4   69.9   68.7   67.6 |  71.6   70.5   70.3 |  68.6   67.3   68.8
             δ = 2  (33%,56%)    1   81.7   77.2   74.2 |  81.4   77.3   77.5 |  76.6   70.2   76.7
                                 2   97.7   97.1   96.5 |  97.8   97.4   97.5 |  96.4   95.5   96.8
                                 4  100.0  100.0  100.0 | 100.0  100.0  100.0 | 100.0  100.0  100.0
Weib (eq)    δ = 1  (21%,32%)    1   32.8   28.5   26.2 |  30.8   27.2   27.2 |  29.3   24.5   27.4
                                 2   50.4   47.6   45.5 |  50.4   48.0   48.3 |  45.8   43.5   45.8
                                 4   78.2   77.0   75.9 |  78.2   77.3   77.5 |  75.5   74.8   75.9
             δ = 2  (21%,52%)    1   87.8   84.6   82.2 |  87.5   84.4   84.2 |  84.5   80.7   85.1
                                 2   99.2   99.1   98.9 |  99.3   99.2   99.1 |  98.8   98.5   98.9
                                 4  100.0  100.0  100.0 | 100.0  100.0  100.0 | 100.0  100.0  100.0

S5: Weibull (different shape)

Weib (uneq)  δ = 1  (8%,40%)     1   25.8   22.4   25.7 |  25.0   22.0   22.7 |  24.5   20.3   19.4
                                 2   35.7   33.4   39.7 |  38.5   36.3   38.1 |  37.1   35.0   32.8
                                 4   55.7   54.7   62.0 |  60.1   59.1   61.8 |  62.5   60.8   57.9
             δ = 2  (8%,42%)     1   75.4   71.5   73.3 |  77.2   73.2   73.1 |  77.8   72.7   72.3
                                 2   94.9   94.1   95.4 |  96.0   95.6   95.8 |  96.1   95.3   95.2
                                 4   99.9   99.8   99.9 |  99.9   99.9   99.9 | 100.0   99.9   99.9
Unif (eq)    δ = 1  (29%,48%)    1   25.0   20.5   21.6 |  25.2   20.6   20.6 |  23.9   19.4   18.1
                                 2   35.8   33.3   35.2 |  37.9   35.3   34.8 |  38.4   35.5   33.6
                                 4   59.0   57.7   60.4 |  60.9   59.8   59.2 |  61.2   59.9   56.3
             δ = 2  (29%,50%)    1   75.5   70.6   70.0 |  77.2   72.9   72.8 |  74.5   67.7   70.3
                                 2   94.3   93.5   93.7 |  96.3   95.6   95.8 |  95.6   94.6   94.7
                                 4   99.9   99.9   99.9 |  99.9   99.9   99.9 | 100.0  100.0  100.0
Weib (eq)    δ = 1  (13%,42%)    1   26.4   22.9   23.9 |  27.5   24.2   24.0 |  26.1   22.4   21.3
                                 2   40.2   37.9   40.8 |  40.3   38.2   38.1 |  43.0   40.9   37.9
                                 4   63.0   61.6   65.0 |  68.3   67.4   67.4 |  67.9   66.9   63.4
             δ = 2  (13%,44%)    1   81.0   77.3   77.4 |  81.6   78.4   78.3 |  82.2   78.5   79.3
                                 2   97.5   96.9   97.2 |  98.0   97.6   97.7 |  97.9   97.7   97.6
                                 4  100.0  100.0  100.0 | 100.0  100.0  100.0 | 100.0  100.0  100.0

Table 7: Power in % (nominal level $\alpha = 5\%$) under the alternative $\mu_1 - \mu_2 = \delta \in \{1, 2\}$ for the asymptotic (Asym), our studentized permutation (st P) and the unstudentized permutation (un P) tests in Scenarios S6 and S7.

                                      n = K · (24,16)   |   n = K · (16,24)   |   n = K · (20,20)
Censoring    δ      Cens. rates  K   Asym   st P   un P |  Asym   st P   un P |  Asym   st P   un P

S6: Weibull (different scale)

Weib (uneq)  δ = 1  (8%,40%)     1   25.3   21.7   25.5 |  24.8   21.2   22.0 |  23.5   19.2   18.2
                                 2   35.5   33.2   39.7 |  38.9   36.8   38.7 |  38.1   35.4   33.3
                                 4   57.3   55.9   63.7 |  61.2   60.0   62.1 |  62.7   60.8   57.8
             δ = 2  (8%,48%)     1   70.5   66.1   70.0 |  72.3   68.5   69.0 |  73.5   68.2   66.7
                                 2   90.4   89.2   92.0 |  92.3   91.5   92.0 |  93.7   92.6   91.6
                                 4   99.7   99.5   99.8 |  99.8   99.7   99.8 |  99.9   99.9   99.9
Unif (eq)    δ = 1  (29%,48%)    1   24.5   20.3   21.4 |  24.5   20.5   20.2 |  23.8   19.0   18.2
                                 2   36.1   33.7   35.4 |  37.6   35.2   34.9 |  37.1   34.3   31.8
                                 4   57.9   56.5   59.4 |  61.2   60.0   59.7 |  61.4   60.1   56.7
             δ = 2  (29%,67%)    1   69.6   64.5   65.2 |  73.0   68.3   67.8 |  69.4   63.6   64.1
                                 2   91.6   90.7   91.0 |  92.8   91.9   91.8 |  92.5   91.5   90.8
                                 4   99.6   99.5   99.7 |  99.9   99.9   99.8 |  99.8   99.8   99.8
Weib (eq)    δ = 1  (13%,42%)    1   26.5   22.9   24.3 |  25.8   22.3   22.2 |  26.5   22.5   21.4
                                 2   40.4   38.0   40.5 |  42.4   40.6   40.2 |  42.1   39.4   36.8
                                 4   63.7   62.6   65.7 |  67.9   66.8   66.5 |  68.2   67.1   63.4
             δ = 2  (13%,65%)    1   74.9   71.2   72.9 |  77.7   74.1   73.6 |  75.8   71.7   71.7
                                 2   94.3   93.4   94.5 |  96.0   95.6   95.6 |  95.3   94.7   94.1
                                 4   99.8   99.8   99.8 | 100.0  100.0  100.0 |  99.9   99.9   99.8

S7: Weibull vs. piece-wise Exponential

Weib (uneq)  δ = 1  (7%,46%)     1   16.9   14.3   18.0 |  17.4   14.5   15.6 |  17.5   14.6   12.4
                                 2   24.2   22.3   28.9 |  26.3   24.6   26.0 |  24.9   22.5   19.1
                                 4   37.0   36.1   44.4 |  42.3   41.0   43.5 |  43.4   42.0   38.0
             δ = 2  (7%,52%)     1   46.1   41.7   47.3 |  49.8   45.4   45.9 |  49.3   43.4   39.8
                                 2   70.0   67.4   74.3 |  74.0   72.0   72.9 |  75.1   72.9   68.6
                                 4   92.2   91.9   94.5 |  95.4   95.0   95.5 |  95.9   95.6   94.4
Unif (eq)    δ = 1  (25%,59%)    1   17.4   14.3   16.1 |  17.2   14.4   13.9 |  17.5   14.0   11.8
                                 2   25.1   23.2   25.9 |  26.0   24.0   23.4 |  25.8   23.5   19.5
                                 4   38.2   37.1   40.9 |  40.5   39.6   38.7 |  44.3   43.0   37.5
             δ = 2  (25%,69%)    1   48.6   43.6   46.0 |  49.1   44.7   43.5 |  48.6   42.9   40.3
                                 2   71.9   69.4   72.6 |  73.3   71.2   70.1 |  76.0   73.8   69.6
                                 4   93.2   92.9   94.1 |  95.5   95.1   95.0 |  96.0   95.5   94.2
Weib (eq)    δ = 1  (11%,56%)    1   18.0   14.9   17.8 |  18.9   15.9   15.8 |  17.8   14.5   12.3
                                 2   26.3   24.7   28.5 |  27.5   25.1   25.1 |  28.0   25.8   21.9
                                 4   41.5   40.3   45.7 |  44.1   43.2   42.9 |  46.3   45.1   39.5
             δ = 2  (11%,67%)    1   50.7   45.9   50.1 |  51.7   47.5   47.3 |  53.4   48.1   44.7
                                 2   71.9   70.1   74.1 |  78.3   76.7   76.6 |  79.1   76.8   73.2
                                 4   95.0   94.8   95.8 |  96.1   95.8   95.9 |  96.7   96.5   95.6

B Counting process notation
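As a concrete companion to the notation of this section, the following minimal sketch evaluates the at-risk count $Y_i(t)$, the left limit of the Kaplan–Meier estimator and the Nelson–Aalen estimator for a single group, and checks identity (4) numerically. The function names `at_risk`, `km_left_limit` and `nelson_aalen` as well as the simulated exponential data are assumptions made purely for illustration; the check presumes continuous data, so that event and censoring times are not tied.

```python
import numpy as np

def at_risk(x, t):
    """Y(t): number of observed times X_j >= t."""
    return int(np.sum(x >= t))

def km_left_limit(x, d, t):
    """Kaplan-Meier product over distinct observed times t_k < t, i.e. the
    left limit at t of the estimator that treats d == 1 as the event."""
    s = 1.0
    for tk in np.unique(x[x < t]):
        dN = np.sum((x == tk) & (d == 1))      # increment of N at t_k
        s *= 1.0 - dN / at_risk(x, tk)
    return s

def nelson_aalen(x, d, t):
    """Nelson-Aalen sum over distinct observed times t_k <= t."""
    return float(sum(np.sum((x == tk) & (d == 1)) / at_risk(x, tk)
                     for tk in np.unique(x[x <= t])))

# Simulated right-censored sample (illustrative; continuous, hence no ties).
rng = np.random.default_rng(1)
n = 12
event_time = rng.exponential(1.0, n)
cens_time = rng.exponential(2.0, n)
x = np.minimum(event_time, cens_time)          # observed times X_j
delta = (event_time <= cens_time).astype(int)  # event indicators delta_j

# Identity (4): S_hat(t-) * G_hat(t-) = Y(t) / n, where G_hat is the
# Kaplan-Meier estimator of the censoring distribution (roles swapped).
for t in np.sort(x):
    lhs = km_left_limit(x, delta, t) * km_left_limit(x, 1 - delta, t)
    assert np.isclose(lhs, at_risk(x, t) / n)
```

The product telescopes because every observation before $t$, whether event or censoring, contributes exactly one factor $(Y - 1)/Y$ to one of the two estimators.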

For the proofs, we adopt the counting process notation of Andersen et al. (1993). Let

$$N_i(t) = \sum_{j=1}^{n_i} \delta_{ij}\,\mathbf{1}\{X_{ij} \le t\}$$

be the number of observed events up until $t$ in group $i = 1, 2$, and let

$$Y_i(t) = \sum_{j=1}^{n_i} \mathbf{1}\{X_{ij} \ge t\}$$

denote the number of individuals at risk just before $t$ in group $i = 1, 2$. Moreover, let $N = N_1 + N_2$ and $Y = Y_1 + Y_2$ be the respective versions for the pooled sample. It is easy to check that

$$\widehat S_{i-}(t)\,\widehat G_{i-}(t) = \frac{1}{n_i}\sum_{j=1}^{n_i} \mathbf{1}\{X_{ij} \ge t\} = \frac{1}{n_i}\,Y_i(t). \tag{4}$$

Given the counting process notation, we can write the Kaplan–Meier and Nelson–Aalen estimators as follows:

$$\widehat S_i(t) = \prod_{k:\, t_{ik} \le t}\Bigl(1 - \frac{\Delta N_i(t_{ik})}{Y_i(t_{ik})}\Bigr), \qquad \widehat A_i(t) = \sum_{k:\, t_{ik} \le t} \frac{\Delta N_i(t_{ik})}{Y_i(t_{ik})} \qquad (i = 1, 2;\ t \ge 0),$$

where $\Delta N_i(t) = N_i(t) - N_{i-}(t)$ is the increment of $N_i$ at $t$ and $t_{i1}, t_{i2}, \ldots$ are the distinct time points within the observed times $(X_{ij})_j$ of group $i$. Moreover, we introduce their pooled counterparts:

$$\widehat S(t) = \prod_{k:\, t_k \le t}\Bigl(1 - \frac{\Delta N(t_k)}{Y(t_k)}\Bigr), \qquad \widehat A(t) = \sum_{k:\, t_k \le t} \frac{\Delta N(t_k)}{Y(t_k)} \qquad (t \ge 0),$$

where $t_1, \ldots, t_d$ are the distinct time points within the pooled observation times $\boldsymbol{X}$.

C Proof of (1) and (3)

The convergences in (1) and (3) follow directly from the continuous mapping theorem, the $\delta$-method and the following proposition.

Proposition 1. As $n \to \infty$, $\sqrt{n}\,(\widehat\mu_i - \mu_i) \xrightarrow{d} Z_i \sim N(0, \sigma_i^2)$ with variance

$$\sigma_i^2 = \kappa_i^{-1} \int_0^\tau \Bigl(\int_x^\tau S_i(t)\,\mathrm{d}t\Bigr)^2 \frac{1}{(1 - \Delta A_i(x))\,G_{i-}(x)\,S_{i-}(x)}\,\mathrm{d}A_i(x). \tag{5}$$

Proof of Proposition 1.

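Let $D$ be the Skorohod space consisting of all càdlàg functions on $[0, \tau]$. By Example 3.9.31 of van der Vaart and Wellner (1996),

$$\sqrt{n_i}\,(\widehat S_i - S_i) \xrightarrow{d} \mathbb{G}_i \quad \text{on } D \tag{6}$$

for a centered Gaussian process $\mathbb{G}_i$ with covariance structure

$$(s, t) \mapsto S_i(t)\,S_i(s) \int_0^{\min(s,t)} \frac{1}{(1 - \Delta A_i(x))\,G_{i-}(x)\,S_{i-}(x)}\,\mathrm{d}A_i(x).$$

Thus, we can deduce from (6) and the continuous mapping theorem

$$\sqrt{n}\,(\widehat\mu_i - \mu_i) = \sqrt{\frac{n}{n_i}} \int_0^\tau \sqrt{n_i}\,\bigl(\widehat S_i(t) - S_i(t)\bigr)\,\mathrm{d}t \xrightarrow{d} \kappa_i^{-1/2} \int_0^\tau \mathbb{G}_i(t)\,\mathrm{d}t = Z_i.$$

By Fubini's Theorem (van der Vaart and Wellner, 1996, Sec. 3.9.2), $Z_i$ is indeed centered normally distributed with variance given by

$$\begin{aligned}
\sigma_i^2 &= \kappa_i^{-1} \int_0^\tau \int_0^\tau E\bigl(\mathbb{G}_i(t)\,\mathbb{G}_i(s)\bigr)\,\mathrm{d}t\,\mathrm{d}s \\
&= \kappa_i^{-1} \int_0^\tau \int_0^\tau S_i(t)\,S_i(s) \int_0^{\min(s,t)} \frac{1}{(1 - \Delta A_i(x))\,G_{i-}(x)\,S_{i-}(x)}\,\mathrm{d}A_i(x)\,\mathrm{d}t\,\mathrm{d}s \\
&= \kappa_i^{-1} \int_0^\tau \Bigl(\int_x^\tau S_i(t)\,\mathrm{d}t\Bigr)^2 \frac{1}{(1 - \Delta A_i(x))\,G_{i-}(x)\,S_{i-}(x)}\,\mathrm{d}A_i(x).
\end{aligned}$$

The last step exchanges the order of integration and uses that $x \le \min(s, t)$ holds exactly when $s \ge x$ and $t \ge x$, so that the integrals over $s$ and $t$ factorize into $\bigl(\int_x^\tau S_i(t)\,\mathrm{d}t\bigr)^2$.

D Proof of the variance estimator's consistency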
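For orientation, the plug-in estimators whose consistency is established in this section can be evaluated as in the following minimal sketch. It computes $\widehat\mu_i$ as the area under the Kaplan–Meier step function and $\widehat\sigma_i^2$ as a sum over the distinct event times up to $\tau$, using $\mathrm{d}\widehat A_i(x) = \Delta N_i(x)/Y_i(x)$. The function name `rmst_sigma2`, its argument `n_total` (the pooled sample size $n$) and the convention of skipping terms with $Y_i(x) = \Delta N_i(x)$, where the integrand is of the form 0/0, are assumptions of this sketch rather than part of the paper.

```python
import numpy as np

def rmst_sigma2(x, d, tau, n_total):
    """Return (mu_hat, sigma2_hat) for one group.

    x: observed times, d: event indicators (1 = event), tau: restriction time,
    n_total: pooled sample size n, so that sigma2_hat / n_total estimates
    the variance of mu_hat.
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=int)
    times = np.unique(x[(d == 1) & (x <= tau)])          # distinct event times <= tau
    Y = np.array([np.sum(x >= t) for t in times], dtype=float)
    dN = np.array([np.sum((x == t) & (d == 1)) for t in times], dtype=float)

    # Kaplan-Meier step function: height 1 on [0, t_1), S_hat(t_k) on [t_k, t_{k+1}).
    heights = np.concatenate(([1.0], np.cumprod(1.0 - dN / Y)))
    widths = np.diff(np.concatenate(([0.0], times, [tau])))
    pieces = heights * widths
    mu_hat = float(pieces.sum())

    # area[k] = integral of S_hat from the k-th event time to tau.
    area = np.cumsum(pieces[::-1])[::-1][1:]

    # Each event time contributes area^2 * dN / (Y * (Y - dN)), scaled by n_total.
    ok = Y > dN                                          # skip 0/0 terms (assumed convention)
    sigma2_hat = n_total * float(np.sum(area[ok] ** 2 * dN[ok] / (Y[ok] * (Y[ok] - dN[ok]))))
    return mu_hat, sigma2_hat

# Without censoring and with all times below tau, mu_hat is the sample mean.
mu_hat, sigma2_hat = rmst_sigma2([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1], tau=5.0, n_total=8)
```

In this uncensored toy example, `sigma2_hat / n_total` coincides with the empirical variance of the four times divided by the group size, which is a convenient sanity check of the formula.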

Define $y_i = S_{i-} G_{i-}$ and $\nu_i$ by $\nu_i(t) = \int_0^t G_{i-}(s)\,\mathrm{d}F_i(s)$ $(t \ge 0)$. By the Glivenko–Cantelli theorem,

$$\sup_{t \in [0,\tau]} \bigl| n_i^{-1} Y_i(t) - y_i(t) \bigr| + \sup_{t \in [0,\tau]} \bigl| n_i^{-1} N_i(t) - \nu_i(t) \bigr| \to 0 \quad \text{as } n \to \infty \tag{7}$$

almost surely. It is well known that this combined with the continuous mapping theorem implies the uniform consistency of the Kaplan–Meier and Nelson–Aalen estimators:

$$\sup_{t \in [0,\tau]} \bigl| \widehat S_i(t) - S_i(t) \bigr| + \sup_{t \in [0,\tau]} \bigl| \widehat A_i(t) - A_i(t) \bigr| \to 0 \quad \text{as } n \to \infty \tag{8}$$

almost surely. Obviously, it follows that

$$\sup_{x \in [0,\tau]} \Bigl| \int_x^\tau \widehat S_i(t)\,\mathrm{d}t - \int_x^\tau S_i(t)\,\mathrm{d}t \Bigr| \to 0 \quad \text{as } n \to \infty \tag{9}$$

almost surely. In particular,

$$\widehat\mu_i = \int_0^\tau \widehat S_i(t)\,\mathrm{d}t \to \int_0^\tau S_i(t)\,\mathrm{d}t = \mu_i \quad \text{as } n \to \infty \tag{10}$$

with probability one. Moreover, we can deduce from (4), (7) and (9) that we have almost surely as $n \to \infty$

$$\widehat\sigma_i^2 = \frac{n}{n_i} \int_0^\tau \Bigl(\int_x^\tau \widehat S_i(t)\,\mathrm{d}t\Bigr)^2 \frac{1}{(1 - \Delta \widehat A_i(x))\,n_i^{-1} Y_i(x)}\,\mathrm{d}\widehat A_i(x) \to \kappa_i^{-1} \int_0^\tau \Bigl(\int_x^\tau S_i(t)\,\mathrm{d}t\Bigr)^2 \frac{1}{(1 - \Delta A_i(x))\,S_{i-}(x)\,G_{i-}(x)}\,\mathrm{d}A_i(x) = \sigma_i^2. \tag{11}$$

Clearly, the consistency of $\widehat\sigma^2 = \widehat\sigma_1^2 + \widehat\sigma_2^2$ follows. In the same way, we can deduce the consistency of $\widehat\sigma_{\mathrm{rat}}^2$, which was defined after Equation (3).

E Proof of Theorems 1 and 2
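To make the objects of this proof concrete, the sketch below shows one way the studentized permutation statistic can be computed in practice: the group labels are reshuffled over the pooled pairs $(X, \delta)$, and for each reshuffle the RMST difference is divided by its own re-estimated standard error. Since $\sqrt{n}\,(\widehat\mu_1 - \widehat\mu_2)/\widehat\sigma$ equals the difference divided by the square root of the summed variance estimates of $\widehat\mu_1$ and $\widehat\mu_2$, the factor $\sqrt{n}$ cancels. The function names, the number of random permutations `n_perm`, the seed, the two-sided p-value with the "+1" correction and the handling of a zero standard error are all assumptions of this illustration, not the authors' implementation.

```python
import numpy as np

def rmst(x, d, tau):
    """RMST estimate and estimated variance of that estimate for one group."""
    times = np.unique(x[(d == 1) & (x <= tau)])
    Y = np.array([np.sum(x >= t) for t in times], dtype=float)
    dN = np.array([np.sum((x == t) & (d == 1)) for t in times], dtype=float)
    heights = np.concatenate(([1.0], np.cumprod(1.0 - dN / Y)))
    pieces = heights * np.diff(np.concatenate(([0.0], times, [tau])))
    area = np.cumsum(pieces[::-1])[::-1][1:]      # integral of S_hat from t_k to tau
    ok = Y > dN                                   # skip 0/0 terms (assumed convention)
    var = np.sum(area[ok] ** 2 * dN[ok] / (Y[ok] * (Y[ok] - dN[ok])))
    return float(pieces.sum()), float(var)

def studentized_stat(x, d, g, tau):
    """(mu1_hat - mu2_hat) / standard error, with groups coded 1 and 2 in g."""
    m1, v1 = rmst(x[g == 1], d[g == 1], tau)
    m2, v2 = rmst(x[g == 2], d[g == 2], tau)
    diff, se = m1 - m2, np.sqrt(v1 + v2)
    if se == 0.0:                                 # degenerate resample (assumed convention)
        return 0.0 if diff == 0.0 else float(np.sign(diff) * np.inf)
    return diff / se

def studentized_permutation_test(x, d, g, tau, n_perm=999, seed=0):
    """Two-sided p-value from n_perm random permutations of the group labels."""
    x, d, g = np.asarray(x, float), np.asarray(d, int), np.asarray(g, int)
    rng = np.random.default_rng(seed)
    t_obs = studentized_stat(x, d, g, tau)
    t_perm = np.array([studentized_stat(x, d, rng.permutation(g), tau)
                       for _ in range(n_perm)])
    p_value = (1 + np.sum(np.abs(t_perm) >= abs(t_obs))) / (n_perm + 1)
    return t_obs, float(p_value)

# Toy data: group 1 fails early, group 2 late; no censoring.
x = np.arange(1.0, 21.0)
d = np.ones(20, dtype=int)
g = np.repeat([1, 2], 10)
t_obs, p_value = studentized_permutation_test(x, d, g, tau=15.0)
```

Re-estimating the variance inside every permutation is what distinguishes the studentized from the unstudentized permutation approach, which would compare the raw differences only.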

We first introduce the limits of $Y/n$, $N/n$, $\widehat S$ and $\widehat A$:

$$y(t) = \kappa_1 y_1(t) + \kappa_2 y_2(t), \qquad \nu(t) = \kappa_1 \nu_1(t) + \kappa_2 \nu_2(t),$$

$$S(t) = \exp\Bigl\{-\int_0^t \frac{1}{y(s)}\,\mathrm{d}\nu(s)\Bigr\}, \qquad A(t) = \int_0^t \frac{1}{y(s)}\,\mathrm{d}\nu(s),$$

where $y_i(t) = S_{i-}(t)\,G_{i-}(t)$ and $\nu_i(t) = \int_0^t G_{i-}(s)\,\mathrm{d}F_i(s)$ were already defined in the proof of Proposition 1. In fact, from the Glivenko–Cantelli Theorem (and the continuous mapping theorem for the convergence of $\widehat S$) we obtain immediately

$$\sup_{t \in [0,\tau]} \Bigl| \frac{N(t)}{n} - \nu(t) \Bigr| + \sup_{t \in [0,\tau]} \Bigl| \frac{Y(t)}{n} - y(t) \Bigr| + \sup_{t \in [0,\tau]} \bigl| \widehat S(t) - S(t) \bigr| + \sup_{t \in [0,\tau]} \bigl| \widehat A(t) - A(t) \bigr| \to 0 \quad \text{as } n \to \infty. \tag{12}$$

In particular,

$$\widehat\mu = \int_0^\tau \widehat S(t)\,\mathrm{d}t \to \int_0^\tau S(t)\,\mathrm{d}t = \mu$$

almost surely as $n \to \infty$.

For the first step of the proof, we follow the argumentation of the previous proof of (1). As explained by Dobler and Pauly (2018) (see Theorem 5 in their supplement), the following conditional convergence is a straightforward consequence of Theorems 3.7.1 and 3.7.2 in van der Vaart and Wellner (1996):

$$\sqrt{n}\,\bigl(\widehat S_1^\pi - \widehat S,\ \widehat S_2^\pi - \widehat S\bigr) \xrightarrow{d} \mathbb{G}^\pi \quad \text{on } D^2 \text{ as } n \to \infty$$

given the data in probability, where $\mathbb{G}^\pi = (\mathbb{G}_1^\pi, \mathbb{G}_2^\pi)$ is a centered Gaussian process on $D^2$ with covariance structure given by

$$E\bigl(\mathbb{G}_i^\pi(s)\,\mathbb{G}_{i'}^\pi(t)\bigr) = \Bigl(\frac{\mathbf{1}\{i = i'\}}{\kappa_i} - 1\Bigr) S(t)\,S(s) \int_0^{\min(s,t)} \frac{1}{(1 - \Delta A(x))\,y(x)}\,\mathrm{d}A(x).$$

Consequently, we obtain from the continuous mapping theorem that given the data in probability

$$\sqrt{n}\,\bigl(\widehat\mu_1^\pi - \widehat\mu,\ \widehat\mu_2^\pi - \widehat\mu\bigr) \xrightarrow{d} \Bigl(\int_0^\tau \mathbb{G}_1^\pi(s)\,\mathrm{d}s,\ \int_0^\tau \mathbb{G}_2^\pi(s)\,\mathrm{d}s\Bigr) = (Z_1^\pi, Z_2^\pi), \tag{13}$$

where $(Z_1^\pi, Z_2^\pi)$ is 2-dimensional, centered normally distributed with covariance structure

$$E\bigl(Z_i^\pi Z_{i'}^\pi\bigr) = \Bigl(\frac{\mathbf{1}\{i = i'\}}{\kappa_i} - 1\Bigr) \sigma_\pi^2, \qquad \sigma_\pi^2 = \int_0^\tau \Bigl(\int_x^\tau S(t)\,\mathrm{d}t\Bigr)^2 \frac{1}{(1 - \Delta A(x))\,y(x)}\,\mathrm{d}A(x).$$
Applying again the continuous mapping theorem yields that given the data in probability

$$\sqrt{n}\,\bigl(\widehat\mu_1^\pi - \widehat\mu_2^\pi\bigr) \xrightarrow{d} Z_1^\pi - Z_2^\pi \sim N\bigl(0, (\sigma^\pi)^2\bigr) \quad \text{with} \quad (\sigma^\pi)^2 = \frac{\sigma_\pi^2}{\kappa_1 \kappa_2} \quad \text{as } n \to \infty. \tag{14}$$

Indeed, since $\kappa_1 + \kappa_2 = 1$, the variance of $Z_1^\pi - Z_2^\pi$ equals $\bigl(\kappa_1^{-1} - 1 + \kappa_2^{-1} - 1 + 2\bigr)\,\sigma_\pi^2 = \sigma_\pi^2/(\kappa_1 \kappa_2)$. This proves Theorem 1.

To verify Theorem 2, it remains to discuss the consistency of the variance estimator. To this end, we fix the original observations $(\boldsymbol{X}, \boldsymbol{\delta})$. Note that $N$, $Y$ and $\widehat S$ do not change when permuting the data. Thus, we can treat them all as fixed functions. Moreover, we can assume without loss of generality that (12) holds for them. Following Neuhaus (1993, equation 6.1), we can deduce

$$\sup_{t \in [0,\tau]} \Bigl| \frac{Y_i^\pi(t)}{Y(t)} - \kappa_i \Bigr| \xrightarrow{p} 0 \quad \text{as } n \to \infty.$$

Using similar arguments, the statement remains true for $N_i^\pi/N$. Combining both with (12) and the continuous mapping theorem yields

$$\sup_{t \in [0,\tau]} \Bigl| \frac{Y_i^\pi(t)}{n} - \kappa_i\,y(t) \Bigr| + \sup_{t \in [0,\tau]} \Bigl| \frac{N_i^\pi(t)}{n} - \kappa_i\,\nu(t) \Bigr| + \sup_{t \in [0,\tau]} \bigl| \widehat S_i^\pi(t) - S(t) \bigr| + \sup_{t \in [0,\tau]} \bigl| \widehat A_i^\pi(t) - A(t) \bigr| \xrightarrow{p} 0.$$

In particular, we obtain

$$\bigl| \widehat\mu_i^\pi - \mu \bigr| + \sup_{t \in [0,\tau]} \Bigl| \int_t^\tau \widehat S_i^\pi(s)\,\mathrm{d}s - \int_t^\tau S(s)\,\mathrm{d}s \Bigr| \xrightarrow{p} 0.$$

Combining all previous statements yields that as $n \to \infty$

$$(\widehat\sigma_i^\pi)^2 = \frac{n}{n_i} \int_0^\tau \Bigl(\int_x^\tau \widehat S_i^\pi(t)\,\mathrm{d}t\Bigr)^2 \frac{1}{(1 - \Delta \widehat A_i^\pi(x))\,n_i^{-1} Y_i^\pi(x)}\,\mathrm{d}\widehat A_i^\pi(x) \xrightarrow{p} \kappa_i^{-1} \int_0^\tau \Bigl(\int_x^\tau S(t)\,\mathrm{d}t\Bigr)^2 \frac{1}{(1 - \Delta A(x))\,y(x)}\,\mathrm{d}A(x) = \kappa_i^{-1} \sigma_\pi^2.$$

Finally, the desired convergence of the variance estimator follows, i.e. as $n \to \infty$

$$(\widehat\sigma^\pi)^2 = (\widehat\sigma_1^\pi)^2 + (\widehat\sigma_2^\pi)^2 \xrightarrow{p} \kappa_1^{-1} \sigma_\pi^2 + \kappa_2^{-1} \sigma_\pi^2 = (\sigma^\pi)^2.$$

F Proof of Corollary 2

It is sufficient to show that given the data in probability

$$\sqrt{n}\,\frac{1}{\widehat\sigma^\pi_{\mathrm{rat}}}\,\bigl[\log(\widehat\mu_1^\pi) - \log(\widehat\mu_2^\pi)\bigr] \xrightarrow{d} Z^\pi_{\mathrm{rat}} \sim N(0, 1) \quad \text{as } n \to \infty. \tag{15}$$

The corresponding proof can again be separated into two steps: (1) verification of the asymptotic normality of $\log(\widehat\mu_1^\pi) - \log(\widehat\mu_2^\pi)$ and (2) showing the consistency of the variance estimator $\widehat\sigma^\pi_{\mathrm{rat}}$. It is easy to see that (1) follows immediately from (13) and the δδ