[PDF] Claiming trend in toxicological and pharmacological dose-response studies: an overview on statistical methods and related R-Software

Abstract

There are very different statistical methods for demonstrating a trend in pharmacological experiments. Here, the focus is on sparse models with only one parameter to be estimated and interpreted: the increase in the regression model and the difference to control in the contrast model. This provides both p-values and confidence intervals for an appropriate effect size. A combined test consisting of the Tukey regression approach and the multiple contrast test according to Williams is recommended, which can be generalized to the generalized linear (mixed effect) model. Thus numerous variable types occurring in pharmacology/toxicology can be adequately evaluated. Software is available through CRAN packages. The most significant limitation of this approach is for designs with very small sample sizes, often in pharmacology/toxicology.

Full PDF

CClaiming trend in toxicological and pharmacologicaldose-response studies:an overview on statistical methods and related R-Software

Ludwig A. Hothorn,Im Grund 12, D-31867 Lauenau, Germany(e-mail: ludwig()hothorn.de. Retired from Leibniz University Hannover)July 21, 2020

Abstract

There are very different statistical methods for demonstrating a trend in pharmacological experi-ments. Here, the focus is on sparse models with only one parameter to be estimated and interpreted:the increase in the regression model and the difference to control in the contrast model. This providesboth p-values and conﬁdence intervals for an appropriate effect size. A combined test consisting of theTukey regression approach and the multiple contrast test according to Williams is recommended, whichcan be generalized to the generalized linear (mixed effect) model. Thus numerous variable types oc-curring in pharmacology/toxicology can be adequately evaluated. Software is available through CRANpackages. The most signiﬁcant limitation of this approach is for designs with very small sample sizes,often in pharmacology/toxicology.

The problem

One of the main statistical principles in pharmacology/toxicology is: ’Characterization of Dose-responseRelationships Inferred by Statistically Signiﬁcant Trend Tests’ [30].But the vast majority of articles on trend tests deal with time trends. Exactly this is not discussed here,as the speciﬁc dependency between the times, e.g. 2014, 2015 and 2016, is modeled. Trends in dose-response dependencies, recommended in guidelines, e.g. for in-vivo micronucleus assay [3], assumeindependent doses in a randomized design (classiﬁcation I). The dose is thus a grouped variable, e.g. 0,10, 50 mg/kg, whereby the dose levels can be considered either as qualitative levels of a factor (analysisof variance model) or quantitative values of a covariate (regression model) (or both; see below) (classi-ﬁcation II). The essential classiﬁcation depends on the objective, and this can be very different: simplyto show an increasing global trend, to estimate a speciﬁc dose, such as MED, NOAEL (in the sense ofParacelsus) or using a model-based approach to estimate a lower conﬁdence interval for a regulatorydeﬁned dose, such as ED or the benchmark dose (BMD) (classiﬁcation III). A related overview onmodel-based approaches is available [36]. Some authors [29] argue to avoid trend test-based approaches(NOAEL estimation) altogether because of their design dependency and to use the generally superiormodel-based approaches, e.g. the BMDL (where L stands for lower conﬁdence limit). Because of theusually few doses, the small sample sizes, the non-interpolation, the critical model-assumption depen-dence and the uncertainty of setting limits in BMD model, I would not prefer a BMDL of a NOAEL onprinciple.A misunderstanding between toxicologists and statisticians is the simple deﬁnition of what is actually a1 a r X i v : . [ s t a t . A P ] J u l rend. Often trend tests are used which were formulated for linear trends, such as the Armitage trendtest [5] and the Jonckheere trend test [28]. But a speciﬁc shape of dose-response dependence is rarelyan a-priori assumption, but rather a result of the evaluation. Therefore, trend tests are needed which aresensitive to as many shapes as possible, as well as supra- and sub-linear shapes, e.g. the common inpharmacology/toxicology plateau-like shapes, and also allow a statement to be made as to which shapeof the alternative is most likely. We describe trend tests for near-to-linear trend, any shape trend, strictlymonotone trend and non-monotone trend, especially with down-turn effects at high doses (classiﬁcationVI).Nowadays, the existence of a trend and its magnitude is usually represented by a p-value (or a derivedstar). Its brevity and simplicity are bought at the expense of numerous serious disadvantages. Therefore,the biologically relevant choice of an effect measure and its conﬁdence interval is the better alternative.For this purpose, the slope in 2-parametric regression models or the difference/ratio to control is particu-larly suitable (classiﬁcation V).Numerous statistical tests are based on normally distributed homoscedastic error terms in a randomizedone-way layout. But in pharmacology/toxicology, many other endpoints such as proportions, counts,time-to-events, etc. occur in more complex systems, including random terms such as within-litter depen-dencies. One needs trend tests in the generalized linear model (GLM) and generalized linear mixed effectmodel (GLMM). Also robust variants for any distribution and even informative variance heterogeneity,e.g. increasing with effect increase (classiﬁcation VI).A basic idea is a model that is as simple (’spares’) as possible, which is realized by exactly one, inter-pretable parameter, e.g. the slope in regression models or the difference to the control in contrast tests.Of course you can ﬁnd much better ﬁtting models, e.g. polynomials, splines or 5-parametric nonlin-ear models. But both their comparison across endpoints, sex, species, studies, relevant in pharmacol-ogy/toxicology, and their biological interpretability are rather difﬁcult (classiﬁcation VII).Limiting factors are the very small sample size per dose group n i common in pharmacology/toxicology,up to the triplicates in the Ames assay, and the usually few dose levels (k=3+1): in terms of the false+/false-decision ratio, model discrimination, robustness, etc. In practice, asymptotic tests are used (which assume n → inf ) and sometimes the bias is massive. There is no universal solution to this problem, but in someplaces the mitigation of this limitation is pointed out. Classiﬁcation: (bold ... discussed here; italic ... not here)I: independent dose groups in a randomized design vs. dependent time points

II: a) dose modeled qualitatively, b) dose modeled quantitatively, c) jointly

III: a) just claiming global trend , b) estimate NOAEL (MED), c) model-based estimation of ED50,BMDL IV: tests for a) near-to-linear trend, b) any shape trend, c) strictly monotone trend and d) non-monotone trend (especially with down-turn effects at high doses)V: avoid p-values.

Use an appropriate effect measure and its conﬁdence interval for interpreta-tion

VI: a) trend tests in GLM, GLMM, b) robust trend tests

VII: using sparse models with preferably only one interpretable parameter

Motivating examples

Three data sets were selected. One from regulatory toxicology (repeated administration of sodium dichro-mate dihydrate in a chronic study on rat), a second from environmental toxicology (aquatic assay usingceriodaphnia dubia) and a third from veterinary pharmacology using beagle dogs. In the left panel of2igure 1 the number of total brood in ceriodaphnia dubia after treatment with increasing tank concentra-tions of nitrofen [6] are show, in the middle panel the blood urea nitrogen (BUN) endpoint of the 13-weekstudy with sodium dichromate dihydrate [1] where a downturn effect at high dose occur, and in the rightpanel are the 24 hours diuresis values for 0, 1, 2, and 4 mg/kg i.m. furosemide to treat congestive heartfailure in dogs [7]. The occurring data diversity is thus somewhat depicted. Increasing and decreasingtrends, monotone trends and those with downturn effects at high doses (and those with plateau effect),counts and continuous endpoints, variance homogeneity and heterogeneity, designs with ( k = 3 + 1) , ( k = 4 + 1) and ( k = 5 + 1) :Figure 1: Boxplots for example data: nitrofen (left), blood urea nitrogen (middle), furosemide (right) Trend tests

Trend tests where dose is modeled qualitatively: multiple contrast tests

To consider the dose as a factor with qualitative levels seems absurd at ﬁrst, since the dose levels areknown. However, the difference between the applied dose and the concentration at the target and theassumptions-sparing modeling can be exactly that.Due to the sparse model strategy (class. VII), only linear contrast tests are considered here. A contrastis a suitable linear combination of means ¯ x i (or other effect sizes, such as ratio-to-control, odds ratio,hazard ratio): (cid:80) ki =0 c i ¯ x i , where the contrast coefﬁcient c i are appropriate chosen weights. Speciﬁcfor all experiments in pharmacology/toxicology is the comparison to the control i = 0 , ..., k . A singlecontrast test is its standardized version t Single contrast = (cid:80) ki =0 c i ¯ x i /S (cid:113)(cid:80) ki c i /n i and (cid:80) ki =0 c i = 0 ensures a t df, − α distributed level- α -test. A multiple contrast test (MCT) is deﬁned as maximum testis: t MCT = max ( t , ..., t q ) which follows jointly ( t , . . . , t q ) (cid:48) a q -variate t - distribution with degree offreedom df and the correlation matrix R ⇒ depending on c i , n i but also s i , ρ i , ... and may be complex.The choice of the weights c i and hence the particular contrast matrix deﬁnes the special version, e.g. fora simple balanced design with k=2: Dunnett one-sided test [11] c i C T T c a -1 0 1 c b -1 -1 0Williams one-sided procedure (formulated as multiple contrast [8]) The simple Williams contrast c i C D D c a -1 0 1 c b -1 1/2 1/2 3llustrates the idea and interoperability of an MCT: either the comparison of D vs. C is signiﬁcant(i.e. strict monotone), or to the pooled ( D + D ) / , i.e. a plateau (or both, or neither ( H ). Eithermultiplicity-adjusted p-values or simultaneous conﬁdence limits (here one-side lower one) are available: [ (cid:80) ki =0 c i ¯ x i − S ∗ t q,df,R, − sided, − α (cid:113)(cid:80) ki c i /n i ] This makes the Williams test the recommended trendtest in pharmacology/toxicology: sensitive to some monotonic and partially non-monotonic forms, a com-parison to the control, easily interpretable conﬁdence intervals for the required effect sizes (difference ofproportions [20], risk ratio or odds ratio of proportions [21], ratio-to-control estimates [18], relative effectsizes [31], hazard rates [15], multiple endpoints [14], heteroscedastic error terms [16], poly-k-adjustedtumor rates [42]).Sometimes the question arises whether to use the Dunnett-test (unrestricted H ) or the Williams-test(restricted H )? The simple answer is use both [26]: The slight increase in conservatism (correlated con- c i C D D c a -1 0 1 c b -1 1 0 c c -1 1/2 1/2trasts!) is usually overcompensated by the simultaneous interpretability of disordered or ordered effects. Considering the doses as qualitative factor levels, the William test can be recommendedusing the R-package multcomp

Trend tests where dose is modeled quantitatively: Tukey-type regression models

There are numerous approaches to the relationship between response and dose formulated as quantitativecovariates, but because of the sparse-model strategy quasilinear regression models are considered here(Notice, The MCP-Mod approach, see below, can be seen as a linear regression with a single predictorvariable [48]). Tukey’s trend test [49] is close to pharmacological principles and quite simple: the si-multaneous considering of three regression models for the metameters D i , (untransformed) rank ( D i ) (ordinal-transformed) and log ( D i ) (lin-logarithmic transformed dose). This joint test represents alsomax-T-test, where the joint distribution over these linear (generalized) models (instead of just contrasts)is achieved by the multiple marginal model approach (mmm) [35] where a related CRAN package isavailable [40]. Trend tests where dose is modeled jointly qualitatively and quantitatively

One can combine the advantages of both Williams (qualitative) and Tukey (quantitative) test by formu-lating a related joint max-T test [24], since both models belong to the class of linear models and thusthe common adjustment by means of mmm is available. This test has the average best balanced power,i.e. shows high power for either linear or plateau shapes. Of course it shows a reduced power, if oneknew a-priori that there is a linear relationship and one would only use a linear regression model. Butthis is far from reality in pharmacology/toxicology, because the shape is the result of the bioassay, not ana-priori assumption. Because of the rather high correlated model, this decline is acceptable compared tothe power gain and more precise interpretation of the likely shape alternative/model.4 onsidering the doses as quantitative covariate levels, the Tukey test can be recom-mended, even its joint test together with Williams-type contrasts using the R-package tukeytrend . Analyzing the BUN example shows the appropriateness of the Tukey-Williams joint test: althoughthe quantitative regression part does not show any monotonous increase, while at least a Williams contrast(most in the alternative the contrast (1000 + 500) / − ) shows a trend up to the reversal point at 500mg/kg. The effect size is the means value difference-to-control (and the regression slope in the regressionpart- not in the alternative) where the most in the alternative effect size is (¯ y BUN + ¯ y BUN ) / − ¯ y BUN ) =2 . mg/dl with a lower limit of 1.8 mg/dL.Table 1: Tukey-Williams Trend Test: BUN example Model Test statistics p -valueTukey: arithmetic -0.275 0.905Tukey: ordinal 1.436 0.231Tukey: ari-logarithmic 1.436 0.231Williams: − (1000 + 500) / − < . Williams: (1000 + 500 + 250) / − < . Williams: (1000 + 500 + 250 + 125) / − < . Williams: (1000 + 500 + 250 + 125 + 62 . / − < . The R-Code is quite simple using the package tukeytrend : ﬁrst a simple linear model is ﬁtted intothe object mod1 . Its estimators are used in the function tukeytrendﬁt to perform the MaxT test using the3 regression models (ari, ord, arilog) and the William contrasts ( cytpe=Williams ), where ddf uses theempirical degree of freedom and vcov the sandwich variance estimator to be robust to heteroscedasticity.In the object ex1 the function glht() (from the package multcomp ) estimates the adjusted p-values, or evenbetter the simultaneous conﬁdence limits for a potential monotonous increase of the BUN values withdose (see the data and the executable R code in the Appendix).A further underrated issue is the trend test when heteroscedasticity occur- this can lead to a bias. As longas n i are not too small, the use of a sandwich variance estimator is recommended [16] (see the BUNexample above). mod1 <- lm(BUN ~ dose, data=mm)tw1 <- tukeytrendfit(mod1, dose="dose", ddf="residual",scaling=c("ari", "ord", "arilog", "treat"),ctype="Williams")\\ex1<-summary(glht(tw1 mmm, tw1 mlf,vcov = sandwich, alternative="greater")) An underrated fact: shape speciﬁcity

The shape of the dose-response dependence is the result of the experiment, rarely an a-priori assumption.In addition to the existence of a global trend, the information on what shape of the dose-response depen-dence exists is relevant once again. This is quite challenging.Nevertheless, trend tests that are designed for a linear alternative, such as the CA test for proportions [5]or the nonparametric Jonckheere-test [28], are widely used. This is disturbing, because one rarely wantsto test for linear trend exactly, and a loss of power occurs, e.g. if there is a plateau at high doses whichis not uncommon in pharmacology/toxicology. In the sense of an average high power for any mono-tone shape, trend tests which are sensitive to several alternatives are to be preferred. These are e.g. theWilliams and Tukey tests demonstrated above whereby their joint test is best in pharmacology/toxicology5rom this point of view, sensitive to a wide range of alternatives, with still acceptable conservatism [41].Notice, for some issues in toxicology it is unacceptable to infer global trend with the usual trend tests,although there is a signiﬁcant downturn at high doses. One connects trend with monotonous, and a down-turn effect at high dose(s) is just not monotonous. Here you can use a combination of trend test and testbetween control and high dose: as an intersection-union tests [33], an union-intersection test with thecomplete power approach [24] or surprisingly just the simple pairwise (or contrast) test control vs. highdose.On the other hand, there are situations, like the muta-tox issue in in-vitro mutagenicity assay, which, de-spite a possible down-turn at high doses, would still like to decide on global trend. The toxicologist wouldsimply follow a trend test up to the optical maximum effect (and ignore the higher doses). Statisticallycorrect this can be achieved by means of maxT-test over all possible maximum doses (ignoring the higherones) [9].

Trend tests in generalized linear models (GLM)

Another contradiction exists: most trend tests assume normally distributed, homoscedastic errors, butmost endpoints in pharmacology/toxicology are skewed distributed or are counts, proportions, time-to-event variables [46]. One solution is the use of the GLM, where suitable link-functions are used totransform the different endpoint types appropriately. This is a general but asymptotic approach, i.e. itrequires a very large sample size and can cause problems with the common small n i . Proportions are quite common in pharmacology/toxicology, such as microscopic incidences or crudemortality/ tumor rates. For both the Williams-trend test [20] and the Tukey test [24] related GLM-modiﬁcations exist, whereas the add-1 correction for common-used small n i can be recommended tocontrol the α -level approximately [43]. Note two very different models. On the one hand, the occurrenceof a condition such as a tumor is recorded at the level of the experimental unit, e.g. the animal. On theother hand, proportions are repeatedly recorded within an experimental unit, such as (e.g., the number ofmicronucleated erythocytes per scored polychromatic erythrocytes within a single mouse in the in vivomicronucleus assay [19]. In the latter case, a link function for overdispersion can be used, since poolingthe individual proportions within the experimental unit usually underestimates the variance.Similarly, counts can be evaluated, such as number of micronuclei; also with or without modeling ofoverdispersion [21]. Count data are particularly sensitive to the underlying assumption, e.g. for ties andvariance heterogeneity structure. The CRAN package cotram [45] offers ﬂexible count transformationmodels where count responses may arise from various and complex data-generating processes. Becausethe endpoint are counts with some tied data, the Dunnett test assuming approximate normal distributionis not appropriate here. Instead the cotram package [45] offers count transformation models, providing asimple but ﬂexible approach for the regression analysis of count responses arising from various, and pos-sibly complex, data-generating processes. (Notice, also for continuous variables there are robust methodswith the most likely transformation model behind [22, 25]).The above example with the total brood count in an aquatic assay on ceriodaphnia dubia with nitrofen [6]is used to demonstrate this approach: data("nitrofen", package="boot")nitrofen$Conc<-as.factor(nitrofen$conc)library(multcomp)library(cotram)NSH<-cotram(total~Conc, data=nitrofen,order=5)WISH<-glht(NSH, linfct = mcp(Conc = "Williams"), alternative="less") All Williams contrasts are seriously in the alternative, most is the contrast (310 + 235) / − withthe effect size of an odds ratio, i.e. a signiﬁcant decreasing trend exists. Notice, the count transformationmodel belongs not directly to a GLM (here a negative binomial link function can be used). Attention6 ontrast Estimate SE Test statistics p-value − -12.9 2.32 -5.56 < . / − -10.0 1.75 -5.75 < . / − -7.2 1.27 -5.68 < . / − -5.4 1.03 -5.28 < . Table 2: Williams-type p-values for count transformation modelshould be paid to the speciﬁcity of data, where near-to-zero counts occur frequently, especially in control.Simple transformations may be appropriate here [27].

Multinomial vectors occur also, such as the differential blood count. There analysis is more complexand related multiple contrast modiﬁcations are available [39].

Time-to-event data occur commonly, such as time-to-death or time-to-tumor occurrence in long-termcarcinogenicity assays. These survival functions can be compared by the Williams-trend test assuming aCox proportional hazard model [15].Because of their high toxicological relevance, the evaluation of graded histopathological ﬁndings (withseverity scores) is particularly interesting. Very different models can be helpful, especially if near-to-zerocategories occur in the control: collapsing categories [12] (or a closed permutation test [21], Chapter2.1.8.1), cumulative link model for ordinal endpoints citeHothorn2016, Freeman-Tukey transformationmodels or nonparametric contrast tests using the relative effect size estimator (see the section below).Due to the complex interaction between tumor development and mortality, the evaluation of crude tumorrates can be biased. Therefore, a mortality-adjusted analysis is indicated, relatively simple on the basisof poly-k estimators , which do not need the cause of death. Individual weights w ij = ( t ij /t max ) k considering the individual mortality pattern are used where t ij is time of death of animal j in dose i andthe tuning parameter k is data-dependent. The weights result in adjusted sample sizes n ∗ i = (cid:80) n i j =1 w ij tobe used instead of the randomized number of animals n i . Therefore adjusted proportions p ∗ i = y i /n ∗ i areused in the GLM instead of the crude tumor proportions p i = y i /n i . This procedure is available for theWilliams test [43] and the Tukey test [24].As a somewhat extreme example of simultaneous inference, the tumor and mortality data of the US-NTPstudy on the carcinogenic potential of methyleugenol for the incidence of skin ﬁbromas used [2] (availablein the CRAN package MCPAN ). Since dose either quantitative (Tukey) or qualitative (Williams) ﬁts best,or the tuning parameter k=3 or k=6, we use a max(maxT)-test on these ((3 + 3) ∗ marginal models inweighted GLM’s for an identity link for difference of weighted proportions to control: me$weightpoly3[wt0] <- (me$death[wt0]/max(me$death))^3me$weightpoly6[wt0] <- (me$death[wt0]/max(me$death))^6fitpoly3 <- glm(tumour ~ dose, data=me, family=binomial(link="identity"),weight=weightpoly3)fitpoly6 <- glm(tumour ~ dose, data=me, family=binomial(link="identity"),weight=weightpoly6)library(tukeytrend)ttpoly3 <- tukeytrendfit(fitpoly3, dose="dose",scaling=c("ari", "ord", "arilog", "treat"), ctype="Williams")ttpoly6 <- tukeytrendfit(fitpoly6, dose="dose",scaling=c("ari", "ord", "arilog", "treat"), ctype="Williams")compttpoly6 <- glht(model=ttpoly6$mmm, linfct=ttpoly6$mlf, alternative="greater")tt36 <- combtt(ttpoly3,ttpoly6)TT36 <- summary(asglht(tt36, alternative="greater")) Most in the alternative is the plateau-type Williams contrast for poly-6 adjustment, where as the differ-ence to poly-3 is tiny in this particular example. 7able 3: Tukey-Williams-type trend test models for poly k=3 and k=6 in the skin ﬁbroma tumor data

Poly-k Model Test statistic p-value3 Tukey: arithmetic 1.77 0.0877Tukey: ordinal 2.50 0.0181Tukey: ari-logarithmic 2.48 0.0190Williams − (150 + 75) / − (150 + 75 + 37) / − − (150 + 75) / − (150 + 75 + 37) / − Trend tests when adjusting against covariate and secondary factors is required

Sometimes the consideration of a covariate, such as initial body weight, or an additional factor, such assex, is important. Name-giving tests, such as the CA test, are usually difﬁcult to extend - in GLM this isvery easy [21]. A typical application is the evaluation of organ weights, where the body weight can bemodeled as covariates.

Trend tests in generalized linear mixed effect models (GLMM)

The heterogeneity between repeated times, between litter mates, between cages, between tanks in aquatox-assays, the cells within the slides within the organ-speciﬁc samples of the Comet assay, or independentrepeated studies on a particular test substance can be modeled as a random factor and thus the hierarchyof the different sources of variance can be reproduced correctly by GLMM. Such models are alreadycomplex and data-dependent, sometimes numerically unstable, especially when only a few levels can beconsidered.Repeated bioassays are common for relevant chemicals by several sponsors, e.g. the glyphosate bioas-says. Here, simplifying the crude proportions of four studies (abbreviated by A-D) for malign lymphomain male mice are considered [47]. Figure 2 shows considerable heterogeneity between the studies, notFigure 2: Crude malign lymphoma tumor rates of 4 carcinogenicity assays on glyphosate in male miceonly with regard to the shape of the dose-response relationship, but also with regard to the spontaneous8ates, especially with regard to the very different doses chosen - across orders of magnitude. This alonemakes the qualitative modeling of the dose (Williams test) less appropriate. Therefore, the regression ap-proach according to Tukey is the method of choice here, whereby the heterogeneity between studies witha GLMM is estimated. The function glmmPQL() provides a robust penalized quasi-likelihood approachin GLMM (with the canonical link function binomial() ) and the internal function lmer2lm() transformsGLMM-objects to GLM objects within the multiple marginal model approach [35]- the basis for thedecisive function glht(mmm()) with in the package multcomp . library(tukeytrend)Lmic <- dosescalett(Lmice, dose="dose", scaling=c("ari", "ord", "arilog")) dataglmmari <- glmmPQL(fixed=cbind(tumor,mice-tumor) ~ doseari, random = ~ 1 +doseari |study,family = binomial, data=Lmic)glmmord <- glmmPQL(fixed=cbind(tumor,mice-tumor) ~ doseord, random = ~ 1 +doseari |study,family = binomial, data=Lmic, niter = 100)glmmarilog <- glmmPQL(cbind(tumor,mice-tumor) ~ dosearilog, random = ~ 1 +doseari |study,family = binomial, data=Lmic)lmari <- tukeytrend:::lmer2lm(glmmari)lmord <- tukeytrend:::lmer2lm(glmmord)lmarilog <- tukeytrend:::lmer2lm(glmmarilog)linf <- matrix(c(0,1), ncol=2)ttglmm <- glht(mmm("mari"=lmari, "mord"=lmord, "marilog"=lmarilog),mlf("mari"=linf, "mord"=linf, "marilog"=linf), alternative="greater") Table 4: Odds ratios and their lower conﬁdence limits using the Tukey trend test on crude malign lym-phoma tumor rates of 4 carcinogenicity assays

Model Odds ratio Lower conﬁdence limitTukey: arithmetic 1.000 0.999Tukey: ordinal 1.071 0.976Tukey: ari-logarithmic 1.214 0.960

For none of the 3 regression models is the lower limit greater than 1, i.e. there is no increasing trend,which is not surprising given these pronounced heterogeneities.

Use the modiﬁcations of the Tukey-Williams joint test for non-normal endpoints and/orcomplex designs in the generalized linear (mixed effects) model.

Nonparametric trend tests

Since the normal distribution assumption cannot be tested because of the small n i , nor can it be derivedfrom historical controls, nonparametric trend tests are often used, especially the Jonckheere-test and theglobal-ranking version of the Williams test [44]. For tied values (see above the graded ﬁndings), and forthe usual heterogeneous variances, these tests are problematic. As an alternative, non-parametric testsfor relative effect size can be used, which are also available for multiple contrast tests and thus as amodiﬁcation of the Williams test [31]. A corresponding CRAN package nparcomp [32] is easily used forthis purpose. Effect measures and their conﬁdence interval

If one follows the recommendation to avoid p-values for the interpretation of trend tests, and instead to useeffect size estimates adapted to the toxicological problem and its intervals (conﬁdence intervals or rathercompatibility intervals [23]), then the choice of a suitable effect size estimate is of primary importance.9his contradicts the dominant choice of the difference-to-control as an effect size estimator. But thereare much more effect sizes. One of them has to do with the scale of the response variable, e.g. the riskratio for proportions or hazard rate for time-to-event variables. On the other hand, the effect size cansigniﬁcantly deﬁne the modeling objective, such as ratio-to-control or odds ratio for continuous outcomelogistic regression [22].

Using ratio-to-control instead of the common-used difference-to-control

Difference-to-control is the mostly used effect size, mostly hidden, because their conﬁdence intervals arenot explicitly interpreted, but simply the p-values. Sometimes one wants to interpret ratio-to-control, suchas the k-fold rule of the Ames assay. The usual procedure of log transformation of the response variablecan be problematic [38]. Instead, a direct modiﬁcation of the Williams test is available for normallydistributed, even heteroscedastic errors [18] (numerically easily available in the package mratios , [10]).In most cases the interpretation of a percentage change in pharmacology/toxicology is appropriate, butnot for very small near-to-zero values in the control.

Select effect sizes that are biologically appropriate and interpretable.

Further trend test issues

Phase IIb randomized clinical dose-ﬁnding trials (RCT) are similar to the above assays, with the ratherfewer doses (k=2, 3) arguing for low-parameter models [37]. The common-used MCP-Mod approachrepresents a hybrid method consisting of contrast tests and related nonlinear models [34].Sometimes, after detecting a trend, one wants to determine a speciﬁc experimental dose, such as minimaleffective dose (MED) in RCT or no-observed adverse event level (NOEAL) (in toxicology). In doing so,one should pay attention to a possible bias due to the pooling properties of the trend tests.We had only assumed randomized dose groups, but sometimes a trend was observed in continuous expo-sure, e.g. transforming into groups via post-hoc categorization which may be misleading [13].Sometimes you want to show that no trend exists. This cannot be done simply by a non-signiﬁcant trendtest, since ’absence of evidence is no evidence of absence’ [4]. Instead, one tests the slope parameter fornon-inferiority, whereby the determination of the threshold limit is a delicate problem.

Conclusions

Trend tests sensitive to many possible shapes, considering dose qualitatively or quantitatively (or bestjointly), formulating for a single, interpretable parameter, in the same (or similar) framework for as manyeffect sizes (and its conﬁdence limits) and thus variable types as possible are available nowadays- a me-thodical step forward. Not only the statistical methods are available, but also the corresponding softwarein form of CRAN packages. A structure like a construction kit was described above, whereby real dataassays were evaluated exemplarily (further examples see [21, 23, 17, 24]).The biggest limitation are the widely used small to very small n i , especially with often associatedunclear data conditions. Sometimes in these situations the use of simple transformations, such as arcsinor FT, followed by the standard trend tests even makes sense - as least bad test.10 eferences [1] National Toxicology Program. 13 Weeks gavage study on female F344 rats administered with Sodium dichromate dihydrate(VI) (CASRN: 7789-12-0, Study Number: C20114,TDMS Number:2011402 .[2]

National Toxicology Program. Toxicology and carcinogenesis studies of methyleugenol in F344/N rats and B6C3F1 mice. ,2000.[3]

OECD Guideline for testing of chemicals No. 474. In vivo micronucleus test , 2006.[4] D G Altman and J M Bland. Statistics notes – absence of evidence is not evidence of absence.

Br. Med. J. , 311(7003):485–485,1995.[5] P. Armitage. Tests for linear trends in proportions and frequencies.

Biometrics , 11(3):375–386, 1955.[6] A.J. Bailer and J.T. Oris.

Assessing toxicity of pollutants in aquatic systems.

John Wiley, 1994.[7] B. Bieth, B. Bornkamp, C. Toutain, R. Garcia, and J. P. Mochel. Multiple comparison procedure and modeling: a versatile toolfor evaluating dose-response relationships in veterinary pharmacology - a case study with furosemide.

Journal of VeterinaryPharmacology and Therapeutics , 39(6):539–546, December 2016.[8] F. Bretz. An extension of the williams trend test to general unbalanced linear models.

Computational Statistics and DataAnalysis , 50(7):1735–1748, April 2006.[9] F Bretz and LA Hothorn. Statistical analysis of monotone or non-monotone dose-response data from in vitro toxicologicalassays.

ATLA-Altern Lab Anim , 31(Suppl. 1):81–96, JUN 2003.[10] D. Dilba, F. Schaarschmidt, and LA. Hothorn. Inferences for ratios of normal means.

RNews 7 (2007) 1, 20 -23. , 2007.[11] C. W. Dunnett. A multiple comparison procedure for comparing several treatments with a control.

J Am Stat Assoc ,50(272):1096–1121, 1955.[12] J. W. Green, T. A. Springer, A. N. Saulnier, and J. Swintek. Statistical analysis of histopathological endpoints.

EnvironmentalToxicology and Chemistry , 33(5):1108–1116, May 2014.[13] S. Greenland. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis.

Epidemiology , 6(4):450–454, July 1995.[14] M. Hasler and L. A. Hothorn. A multivariate williams-type trend procedure.

Statistics in Biopharmaceutical Research ,4(1):57–65, 2012.[15] E. Herberich and L.A. Hothorn. Statistical evaluation of mortality in long-term carcinogenicity bioassays using a Williams-type procedure.

Regul Toxicol Pharm , 64:26–34, 2012.[16] E. Herberich, J. Sikorski, and T. Hothorn. A robust procedure for comparing multiple means under heteroscedasticity inunbalanced designs.

PLOS One , 5(3):e9788, March 2010.[17] Hothorn. Trend test jointly with a control against high dose pairwisetest to analyse toxicological bioassays.[18] L. A. Hothorn and G. D. Djira. A ratio-to-control williams-type test for trend.

Pharmaceutical Statistics , 10(4):289–292, July2011.[19] L. A. Hothorn and D. Gerhard. Statistical evaluation of the in vivo micronucleus assay.

Arch Toxicol , 83(6):625–634, 2009.[20] L. A. Hothorn, M. Sill, and F. Schaarschmidt. Evaluation of incidence rates in pre-clinical studies using a williams-typeprocedure.

International Journal of Biostatistics , 6(1):15, 2010.[21] L.A. Hothorn.

Statistics in Toxicoloyg- using R . Chapman Hall, 2016.[22] L.A. Hothorn and F.M. Kluxen. Robust multiple comparisons against a controlgroup with application in toxicology. arXiv:1905.01838v1 [stat.AP] 6 May 2019 , 2019.[23] L.A. Hothorn and Pirow R. Use compatibility intervals in regulatory toxicology.

Arch. Tox. , 2020.[24] L.A. Hothorn and F. Schaarschmidt. A modiﬁed armitage test for more than a linear trend on proportions. arXiv:2006.14880 ,2020.[25] T. Hothorn, L. Most, and P. Buhlmann. Most likely transformations.

Scandinavian Journal of Statistics , 45(1):110–134,March 2018.[26] T. Jaki and L. A. Hothorn. Statistical evaluation of toxicological assays: Dunnett or williams test-take both.

Archives ofToxicology , 87(11):1901–1910, November 2013.[27] T. Jaki, A Kitsche, and L. A Hothorn. Statistical evaluation of toxicological assays with zero or near-to-zero proportions orcounts in the concurrent negative control group: A tutorial.

JP J Biostatistics , 0:0, 2014.[28] A. R. Jonckheere. A distribution-free kappa-sample test against ordered alternatives.

Biometrika , 41(1-2):133–145, 1954.

29] R. L. Kodell. Replace the noael and loael with the bmdl01 and bmdl10.

Environmental and Ecological Statistics , 16(1):3–12,March 2009.[30] R. L. Kodell and J. J. Chen. Characterization of dose-response relationships inferred by statistically signiﬁcant trend tests.

Biometrics , 47(1):139–146, March 1991.[31] F. Konietschke and L. A. Hothorn. Evaluation of toxicological studies using a nonparametric shirley-type trend test forcomparing several dose levels with a control group.

Statistics in Biopharmaceutical Research , 4(1):14–27, 2012.[32] F. Konietschke, M. Placzek, F. Schaarschmidt, and L. A. Hothorn. nparcomp: An r software package for nonparametricmultiple comparisons and simultaneous conﬁdence intervals.

Journal of Statistical Software , 64(9), March 2015.[33] K. K. Lin and M. A. Rahman. Comparisons of false negative rates from a trend test alone and from a trend test jointly witha control-high groups pairwise test in the determination of the carcinogenicity of new drugs.

Journal of BiopharmaceuticalStatistics , 29(1):128–142, January 2019.[34] J. Pinheiro, B. Bornkamp, E. Glimm, and F. Bretz. Model-based dose ﬁnding under model uncertainty using general paramet-ric models.

Statistics in Medicine , 33(10):1646–1661, May 2014.[35] C. B. Pipper, C. Ritz, and H. Bisgaard. A versatile method for conﬁrmatory evaluation of the effects of a covariate in multiplemodels.

Journal of the Royal Statistical Society Series C-Applied Statistics , 61:315–326, 2012.[36] C. Ritz, F. Baty, J. C. Streibig, and D. Gerhard. Dose-response analysis using r.

Plos One , 10(12):e0146021, December 2015.[37] S. Saha and W. Brannath. Comparison of different approaches for dose response analysis.

Biometrical Journal , 61(1):83–100,January 2019.[38] F. Schaarschmidt. Simultaneous conﬁdence intervals for multiple comparisons among expected values of log-normal vari-ables.

Computational Statistics and Data Analysis , 58:265–275, February 2013.[39] F. Schaarschmidt, D. Gerhard, and C. Vogel. Simultaneous conﬁdence intervals for comparisons of several multinomialsamples.

Computational Statistics and Data Analysis , 106:65–76, February 2017.[40] F. Schaarschmidt and L.A. Hothorn. library(tukeytrend). 2018.[41] F. Schaarschmidt, C. Ritz, and L.A. Hothorn. The tukey trend test: Multiplicity adjustment using multiple marginal models.

Report, LUH , 2020.[42] F. Schaarschmidt, M. Sill, and L. A. Hothorn. Poly-k-trend tests for survival adjusted analysis of tumor rates formulated asapproximate multiple contrast test.

Journal of Biopharmaceutical Statistics , 18(5):934–948, 2008.[43] F. Schaarschmidt, M. Sill, and L.A. Hothorn. Approximate simultaneous conﬁdence intervals for multiple contrasts of bino-mial proportions.

Biometrical Journal , 50(5, SI):782–792, OCT 2008.[44] E Shirley. A nonparametric equivalent of williams’ test for contrasting increasing dose levels of a treatment.

Biometrics ,33(2):386–389, 1977.[45] S. Siegfried and T. Hothorn. Count transformation models.

Methods in Ecology and Evolution , 11(7):818–827, July 2020.[46] E. Szocs and R. B. Schafer. Ecotoxicology is not normal.

Environmental Science and Pollution Research , 22(18):13990–13999, September 2015.[47] J. V. Tarazona, D. Court-Marques, M. Tiramani, H. Reich, R. Pfeil, F. Istace, and F. Crivellente. Glyphosate toxicity andcarcinogenicity: a review of the scientiﬁc basis of the european union assessment and its differences with iarc.

Archives ofToxicology , 91(8):2723–2743, August 2017.[48] N. Thomas. Understanding mcp-mod dose ﬁnding as a method based on linear regression.

Statistics in Medicine , 36(27):4401–4413, November 2017.[49] J. W. Tukey, J. L. Ciminera, and J. F. Heyse. Testing the statistical certainty of a response to increasing doses of a drug.

Biometrics , 41(1):295–301, 1985.

Appendix: Data and R-Code L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L),.Label = c("0", "62.5", "125", "250", "500", "1000"),class = "factor")), row.names = c(NA,-60L), class = "data.frame")library("toxbox")boxclust(data=mm, outcome="BUN", treatment="Dose", cluster=NULL,ylabel="BUN", xlabel="Dose", option="uni", psize=1.1,nsize=2.5,hjitter=0.3, legpos="none", printN="FALSE", white=TRUE, titlesize=8, labelsize=6) caling=c("ari", "ord", "arilog", "treat"), ctype="Williams")compttpoly6 <- glht(model=ttpoly6$mmm, linfct=ttpoly6$mlf, alternative="greater")tt36 <- combtt(ttpoly3,ttpoly6)TT36 <- summary(asglht(tt36, alternative="greater"))library("ggplot2")library(xtable)polyk36<-fortify(TT36) [, c(1,5,6)]colnames(polyk36)<-c("Comparisons","Test stats", "p-value")print(xtable(polyk36, digits=c(0,0,2,4), caption="Tukey-Williams-type trend test models for poly k=3 and k=6", label="tab:exa14"),caption.placement = "top", include.rownames=FALSE)caling=c("ari", "ord", "arilog", "treat"), ctype="Williams")compttpoly6 <- glht(model=ttpoly6$mmm, linfct=ttpoly6$mlf, alternative="greater")tt36 <- combtt(ttpoly3,ttpoly6)TT36 <- summary(asglht(tt36, alternative="greater"))library("ggplot2")library(xtable)polyk36<-fortify(TT36) [, c(1,5,6)]colnames(polyk36)<-c("Comparisons","Test stats", "p-value")print(xtable(polyk36, digits=c(0,0,2,4), caption="Tukey-Williams-type trend test models for poly k=3 and k=6", label="tab:exa14"),caption.placement = "top", include.rownames=FALSE)