Psychological Medicine | 2019

Treatment of anorexia nervosa: is it lacking power?

 

Abstract


Two recent studies in PSM (Brockmeyer et al., 2018; Murray et al., 2018) have independently synthesised recent evidence on treatments for anorexia nervosa (AN) in adults and have reached similar, sobering, conclusions. In essence, despite the large amount of time, effort, and money that has been invested in evaluating psychological treatments (one, for example, reviewed studies comprising 2092 patients in 19 trials over 5 years), ‘no single psychotherapy has emerged as clearly superior to others in the treatment of adults with AN’ (Brockmeyer et al., 2018, p. 1250). Whilst this particular conclusion is largely supported by the evidence, results have often been interpreted as meaning that the psychological therapies evaluated are therefore equivalent in effectiveness. Although it has been argued that this supports a ‘common factors’ approach to treatment (e.g. Lose et al., 2014), one shortcoming is often given insufficient attention: low statistical power.

When comparing two treatments of an illness, the researcher’s aim is critical. Consider a novel treatment, treatment B, and an existing treatment, treatment A. Is the aim to determine that treatment B is: (1) different from treatment A (either better or worse); (2) at least as effective as treatment A; or (3) equivalent to treatment A (Tamayo-Sarver et al., 2005)? The apparent confusion around equivalence and non-inferiority trials has been raised by other authors (e.g. see Leichsenring et al., 2018), but suffice it to say here that the null and alternative hypotheses differ for each of these aims, such that a traditional comparative test – as might be used in scenario (1) – is likely to be inappropriate in tests of non-inferiority and equivalence (Walker and Nowacki, 2011). Sticking with perhaps the simplest example, suppose a researcher wishes to show that treatment B is superior to treatment A. A power calculation is vital in determining the minimum sample size needed to detect an effect.
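The three aims imply different decision rules on the same estimated treatment difference. A minimal sketch, stated in terms of a confidence interval (CI) for the mean difference (B minus A); the margin and interval values below are hypothetical illustrations, not figures from the reviewed trials:

```python
# Decision rules for a two-arm comparison, expressed on the CI for the
# difference in means (treatment B minus treatment A). `margin` is a
# pre-specified non-inferiority/equivalence margin (hypothetical here).

def superiority(ci_low, ci_high):
    # B is superior only if the entire CI lies above zero.
    return ci_low > 0

def non_inferiority(ci_low, ci_high, margin):
    # B may be worse than A, but by no more than `margin`:
    # the lower CI bound must clear -margin.
    return ci_low > -margin

def equivalence(ci_low, ci_high, margin):
    # The entire CI must fall within (-margin, +margin).
    return ci_low > -margin and ci_high < margin

# The same interval, (-0.10, 0.45), with margin 0.20, supports a claim
# of non-inferiority but neither superiority nor equivalence.
lo, hi, m = -0.10, 0.45, 0.20
print(superiority(lo, hi))           # → False
print(non_inferiority(lo, hi, m))    # → True
print(equivalence(lo, hi, m))        # → False
```

This makes concrete why a traditional comparative test cannot simply be re-used: each aim fixes a different null hypothesis, and failing to reject the superiority null says nothing, by itself, about non-inferiority or equivalence.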
Looking at the two review articles, effect sizes of the difference between two treatments on body mass index (BMI) outcomes rarely exceeded d = 0.30, a small-to-medium-sized effect (and were typically smaller than this, particularly for psychological outcomes). In the absence of an agreed non-inferiority margin (or ‘clinically acceptable difference’), this seems a reasonable assumption of magnitude, and it is less than half the anticipated effect of the comparator treatment (for BMI, for example, effect sizes are usually around d = 0.60–1.00 for established treatments; e.g. Zipfel et al., 2014). Given conventions around acceptable error rates (i.e. α = 0.05, 1 − β = 0.80), this would suggest a minimum sample size of 139 in each arm to demonstrate that treatment B is superior to treatment A. None of the outpatient trials in adults reached this level. The issue is underlined by the confidence intervals of treatment comparisons, which are often wide and encompass what might be proffered as a clinically acceptable difference. Even if other study biases are limited, low statistical power will engender a tendency towards more false negatives. By way of illustration, in their review, Brockmeyer et al. (2018) considered only studies ‘with a minimal sample size of n = 100’ (p. 1229) and included having a ‘sample size n > 30 in each condition’ (p. 1229) in their quality appraisal. Although this may seem encouraging, it remains possible that any lack of differences found between treatments rests on low statistical power; a sample size of 30 seems arbitrary and, if based on the hypothesised effectiveness (effect size) of one treatment, is unlikely to be sufficient to detect a difference between two treatments. Studies with low statistical power also risk reporting findings as true when in fact they are not, and over-estimating the magnitude of those effects (Button et al., 2013).
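A figure close to the 139 per arm quoted above can be reproduced with the standard normal-approximation formula for a two-sample comparison of means, n = 2(z₁₋α + z₁₋β)²/d². A minimal sketch, assuming the figure rests on a one-sided α = 0.05 (the approximation gives 138; the quoted 139 presumably reflects a small-sample correction); a two-sided test at the same α would require roughly 175 per arm:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(d, alpha=0.05, power=0.80, one_sided=True):
    # Normal-approximation sample size per arm for detecting a
    # standardised mean difference d between two independent groups:
    # n = 2 * (z_{1-alpha} + z_{1-beta})^2 / d^2
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha) if one_sided else z(1 - alpha / 2)
    z_b = z(power)
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2)

print(n_per_arm(0.30))                   # one-sided  → 138
print(n_per_arm(0.30, one_sided=False))  # two-sided  → 175
print(n_per_arm(0.80))                   # a 'large' within-treatment
                                         # effect, one-sided → 20
```

The last line shows why a sample of around 30 per arm can look adequate: it is roughly what the hypothesised effect of a single treatment (d ≈ 0.6–1.0) would demand, while the far smaller between-treatment difference (d ≈ 0.30) demands several times that.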
This can have impacts on later research, whereby sample sizes determined on ‘historical precedent rather than through formal power calculation’ may hamper attempts at replication (Button et al., 2013, p. 367). This brief summary echoes the conclusions of Rief and Hofmann (2018) that the scientific community needs to take issues of study design more seriously. Of concern, a number of authors have repeatedly argued for the issue of low statistical power to be addressed, with studies suggesting that the problem is, in fact, endemic (e.g. see Le Henanff et al., 2006; Button et al., 2013; Vankov et al., 2014). Power to detect an effect ought to be an essential element of research design within null-hypothesis significance testing; failing to meet a minimum sample size is likely to render subsequent conclusions questionable at best. Guidance has been published on the reporting of non-inferiority and equivalence trials (Piaggio et al., 2012).
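The effect-size inflation noted above (Button et al., 2013) can be demonstrated with a short simulation. A simplified sketch, using the normal approximation to the sampling distribution of an observed d (true d = 0.30, n = 30 per arm, matching the figures discussed above; results are illustrative only):

```python
import random
from math import sqrt

random.seed(1)

d_true, n = 0.30, 30          # true effect; per-arm n flagged in the reviews
se = sqrt(2 / n)              # approx. standard error of the observed d
crit = 1.96 * se              # two-sided 5% significance threshold on d

# Simulate many trials: each observed d is drawn around the true d.
sims = [random.gauss(d_true, se) for _ in range(100_000)]
significant = [d for d in sims if abs(d) > crit]

power = len(significant) / len(sims)
mean_sig_d = sum(significant) / len(significant)
print(f"power ≈ {power:.2f}, mean significant d ≈ {mean_sig_d:.2f}")
```

Under these assumptions power comes out near 0.2, and the trials that do reach significance report an average d of roughly twice the true 0.30: exactly the combination of frequent false negatives and inflated published effects that makes underpowered trials poor ground for historical-precedent sample sizes.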

Volume 49, Issue 6
Pages 1055–1056
DOI 10.1017/S0033291718003434