R J. | 2021

Towards a Grammar for Processing Clinical Trial Data

 

Abstract


The goal of this paper is to help define a path toward a grammar for processing clinical trials by a) defining a format in which we would like to represent standardized clinical trial data, b) describing a standard set of operations to transform clinical trial data into this format, and c) identifying a set of verbs and other functionality to facilitate data processing and encourage reproducibility in the processing of these data. It provides a background on standard clinical trial data and goes through a simple preprocessing example illustrating the value of the proposed approach through the use of the forceps package, which is currently being used for data of this kind.

Introduction: On the use of historical clinical data

There are few areas of data science research that provide more promise to improve human quality of life and treat disease than the development of methods and analysis in clinical trials. While adjacent, data-focused areas of biomedicine and health-related research have recently seen increased attention, especially the analysis of real-world evidence (RWE) and electronic health records (EHR), clinical trial data maintains several distinct quality advantages, enumerated here.

1. Features and measurements are selected for their relevance. Unlike EHRs or other similar data, variables collected for a clinical trial are included because they are potentially relevant to the disease under consideration or the treatment whose efficacy is being analyzed. This makes the variable selection process considerably easier than when data collection has not been designed for a targeted analysis of this type.
2. Data collection procedures are carefully prescribed. Clinical trial data is uniform in both which variables are collected and how they are collected. This ensures data quality across trial sites, so that variables are relatively complete as well as consistent.
3. Inclusion/exclusion criteria define the population. Since RWE studies are observational, the populations they consider are not always well understood due to bias in the collection process. On the other hand, clinical trial data sets are generally controlled and randomized, with well-documented inclusion and exclusion criteria.

Along with maintaining higher quality, clinical trial data is more available and more easily accessible than real-world data sources, which often require affiliations with appropriate research institutions as well as infrastructure and appropriate staff, including data managers, to extract data. By contrast, modern clinical trial data organizations allow users to quickly search and download thousands of trials, including anonymized patient-level information. These data sets tend to include control-arm data, which can be used to understand prognostic disease populations and construct historical controls for existing trials. However, some also include treatment data, which can be used to characterize predictive patient subtypes for a given treatment, understand safety profiles for classes of drugs, and aid in the design of new trials. We note that, for oncology, Project Data Sphere (PDS, 2020) and, outside of oncology, Immport (Imm, 2020) have been invaluable in our own experience by facilitating these types of analyses.

Clinical trial analysis data sets

During a clinical trial, patient-level data is collected in case report forms (CRFs). The format and data collected in these forms are prescribed in the trial design. These forms are the basis for the construction of analysis data sets and other documents that will be submitted to governing bodies, including the Food and Drug Administration (FDA) and European Medicines Agency (EMA), for approval if the sponsor (the party funding the trial) decides it is appropriate.
The Clinical Data Interchange Standards Consortium (CDISC) (CDI, 2020) develops standards dealing with medical research data, including the submission of trial results. Adhering to these standards is necessary for a successful trial submission. There are several data sets included with a submission that tend to be useful for analysis. This paper focuses on the Analysis Data Model (ADaM) data, which provides patient-level data that has been validated and used for data derivation and analysis. An ADaM data set is itself composed of several data sets, including a Subject-Level Analysis Data Set (ADSL) holding analysis and treatment information. Other information, including baseline characteristics, demographic data, visit information, etc., is held in the Basic Data Structure (BDS) formatted data sets. Finally, adverse events are held in the Analysis Data Sets for Adverse Events (ADAE).

Challenges to analyzing these data sets

ADaM data for a clinical trial is generally made available as a set of SAS7BDAT (Shotwell et al., 2013) files. While neither the FDA nor the EMA requires this format for submission, nor do they require the use of SAS (Institute, 2020) for analysis, there is a heavy bias toward the data format and computing platform. This is partially because they are validated and approved by governing bodies and because a large effort has gone into their use in submissions. Packages like sas7bdat (Shotwell, 2014) and, more recently, haven (Wickham and Miller, 2020) have gone a long way to make these data sets easily accessible to R (R Core Team, 2012) users working with clinical trial data. Despite the effort that has gone into defining a structure for the data, as well as the tools implemented to aid in their analysis, the data sets themselves are not particularly easy to analyze for two reasons. First, the standard is not “tidy” as defined by Wickham et al. (2014).
In particular, it is not required that each variable forms a column. In fact, multiple variables may be stored in one column, with another column acting as a key indicating which variable's value is given. This case is often seen in the ADSL data set, where a single column may hold primary and secondary endpoints. For data sets like these, the values are held in the Analysis Value (AVAL) column if the corresponding variable is numeric, or in the Analysis Value Character (AVALC) column if the variable is a string, with the Parameter Code (PARAMCD) column giving a shortened variable name and the Parameter (PARAM) column providing a text description of the variable. As an example, consider the adakiep.xpt data set, which is provided on the CDISC website and whose data is included in the supplementary material.
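
Independently of that data set, the key-value layout described above can be illustrated with a toy table. The following is a minimal sketch, written in Python with only the standard library rather than in R, and it does not use the forceps package or its API. The rows, the USUBJID subject identifier column, and the parameter codes OS and RESP are assumptions made up for illustration; they are not taken from adakiep.xpt. The helper `pivot_wider` is hypothetical as well.

```python
from collections import defaultdict

# Hypothetical long-format rows: one row per (subject, parameter) pair.
# All identifiers and values below are invented for illustration.
long_rows = [
    {"USUBJID": "S01", "PARAMCD": "OS", "PARAM": "Overall survival (days)",
     "AVAL": 412.0, "AVALC": None},
    {"USUBJID": "S01", "PARAMCD": "RESP", "PARAM": "Best response",
     "AVAL": None, "AVALC": "PR"},
    {"USUBJID": "S02", "PARAMCD": "OS", "PARAM": "Overall survival (days)",
     "AVAL": 230.0, "AVALC": None},
    {"USUBJID": "S02", "PARAMCD": "RESP", "PARAM": "Best response",
     "AVAL": None, "AVALC": "SD"},
]


def pivot_wider(rows, id_col="USUBJID", key_col="PARAMCD"):
    """Spread key-value rows so that each parameter becomes its own column."""
    wide = defaultdict(dict)
    for row in rows:
        # Use the numeric value when present, otherwise the character value.
        value = row["AVAL"] if row["AVAL"] is not None else row["AVALC"]
        wide[row[id_col]][row[key_col]] = value
    # One output record per subject, in order of first appearance.
    return [{id_col: subject, **params} for subject, params in wide.items()]


wide_rows = pivot_wider(long_rows)
for record in wide_rows:
    print(record)
```

After the pivot, each subject occupies a single row and each parameter code is a column, which is the "each variable forms a column" property that the long layout lacks. The text descriptions in PARAM are dropped here for brevity; a fuller version might keep them as column labels.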

Volume 13
Pages 563
DOI 10.32614/rj-2021-052
Language English
Journal R J.
