JAMA Psychiatry | 2019
Small Sample Sizes and a False Economy for Psychiatric Clinical Trials.
Abstract
In 2013, JAMA Psychiatry published an exciting new finding by Hallak et al1: patients with schizophrenia who were treated with a single infusion of the antihypertensive agent sodium nitroprusside showed a dramatic, instantaneous, and sustained improvement in psychotic and negative symptoms. The study by Hallak et al1 was not only double-blind, randomized, and placebo-controlled, but the authors had also presented a prospective power analysis, ensured good interrater reliability, and reported all the relevant aspects of the trial. Understandably, this study generated tremendous excitement, and several attempts were made to replicate the finding. In this issue of JAMA Psychiatry, Brown et al2 report an equally systematic study that fails to replicate the original finding. In fact, the findings by Brown et al2 join those of 2 other prior studies3,4 that also failed to replicate the original finding. A careful analysis of the competing studies fails to reveal an obvious explanation for this discrepancy, except that the original finding was based on a rather small sample of just 10 patients per treatment arm. Failure to reproduce such dramatic findings is neither new nor limited to psychiatry. As provocatively claimed by Ioannidis5 in 2005, perhaps most published studies are indeed false. Alert to this possibility, the Academy of Medical Sciences in the United Kingdom published a report on the reproducibility and reliability of biomedical research and identified the many causes for the low replicability of published studies, from deficiencies in the conduct and reporting of studies to incentive structures that promote publication over rigor.6,7 However, in the matter of the sodium nitroprusside studies, the methods seem to be strong and the reporting seems to be thorough. But rigor of conduct and reporting cannot overcome the challenge of starting with a small number of patients.
A review of 83 psychiatric intervention studies indicated that only half had been subject to any attempt at replication.8 Of those that had been subject to an attempt at replication, only 37% replicated the original effect size, while the others either contradicted the original finding or observed a much smaller effect. The review clearly showed that the larger the original study, the more likely it was to be replicated. It is a statistical inevitability that (all other things being equal) smaller studies will provide more imprecise estimates of any true effect. Therefore, if a number of small studies are performed, there will be a wider variation of findings (sometimes described as vibration of effects). If many small studies are conducted, only those that generate a large effect size (and therefore reach a P value below the conventional 5% threshold) will be published, and these findings from studies with a small sample size are likely to represent inflated effect size estimates and, at worst, false positives. Based on this finding, it would be easy to recommend larger sample sizes in general, given evidence that studies in the biomedical sciences are usually too small to detect credible or likely effect sizes.9 But any such general commandment would be too simplistic. Ultimately, studies should be large enough to give a sufficiently precise estimate, but not so large as to be wasteful. Funders and journals increasingly require a justification of sample size. However, calculation of sample size is not a perfect science: it requires an estimation of the expected outcomes in the drug and placebo groups and an estimation of the expected variance. Getting these estimates right and realistic is at the heart of the matter. Hallak et al1 predicted an effect size of d = 1.5. How does this size compare with effect sizes in this field? We provide not 2, but 3 comparators.
A study of clinical interventions in medicine and psychiatry across a range of domains indicates a median observed effect size of 0.37 for general medicine and 0.41 for psychiatry.10 A review of more than 7000 patients treated with antipsychotics in randomized clinical trials showed that the effect size of the newer atypical antipsychotics, compared with placebo, was 0.48.11 Finally, studies of schizophrenia clinicians show that they consider an improvement from baseline (which includes placebo response plus drug response) equivalent to an approximate effect size of d = 1.0 to be clinically significant.12 Thus, seen from these 3 perspectives, an expectation of d = 1.5 vs placebo seems rather optimistic. The temptation is understandable. One would like to discover the big findings, and such an assumption comes with the payoff that it decreases the number of patients required by sample size calculators. However, such overoptimism comes with 2 wasteful consequences: a higher than appropriate chance of a false negative and, in the event of a positive finding, a higher chance that the positive finding might be a false positive with an inflated effect. So, what is the right effect size to aim for? There is not, and cannot be, a single answer. What seems prudent is that trials of any new treatment should assume the median effect size observed in the field (which is usually in the range of 0.3 to 0.5), and those who hope for a much larger effect size should be required to provide a strong justification for such optimism. Overoptimism that leads to too small a sample may just be a false economy.
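To make the dependence of sample size on the assumed effect size concrete, the following is a minimal sketch, not the calculation used in any of the cited trials. It assumes a two-arm parallel design with equal group sizes, a two-sided test, and the standard normal approximation n = 2(z_{1-alpha/2} + z_{power})^2 / d^2 per arm. The function names `n_per_arm` and `approx_power`, the 5% alpha, and the 80% power default are illustrative assumptions; a t-test-based calculation would give slightly larger numbers for small samples.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Standard normal distribution, used for quantiles and tail probabilities.
_STD_NORMAL = NormalDist()


def n_per_arm(d, alpha=0.05, power=0.80):
    """Approximate patients per arm needed to detect a standardized effect d.

    Normal approximation for a two-sided, two-sample comparison with equal
    arms. alpha and power defaults are illustrative assumptions.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power with n patients per arm if the true effect is d.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(d * sqrt(n / 2) - z_alpha)


if __name__ == "__main__":
    # Effect sizes mentioned in the text: the assumed 1.5, the clinicians'
    # 1.0, and values near the field medians (0.3 to 0.5).
    for d in (1.5, 1.0, 0.5, 0.4, 0.3):
        print(f"d = {d:.1f}: about {n_per_arm(d)} patients per arm")

    # Power of a 10-per-arm trial under different true effect sizes.
    for d in (1.5, 0.5):
        print(f"n = 10 per arm, true d = {d:.1f}: "
              f"power is roughly {approx_power(d, 10):.2f}")
```

Under these assumptions the required sample grows with 1/d², so planning for d = 0.5 rather than d = 1.5 calls for roughly 9 times as many patients per arm. Conversely, a trial with 10 patients per arm that is adequately powered for d = 1.5 has only about 20% approximate power if the true effect is closer to d = 0.5.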