Practical Considerations for Data Collection and Management in Mobile Health Micro-randomized Trials
Nicholas J. Seewald, Shawna N. Smith, Andy Jinseok Lee, Predrag Klasnja, Susan A. Murphy
University of Michigan, Department of Statistics; University of Michigan, Departments of Psychiatry and General Medicine; University of Michigan, School of Information; Harvard University, Departments of Statistics and Computer Science

December 31, 2018
Abstract
There is a growing interest in leveraging the prevalence of mobile technology to improve health by delivering momentary, contextualized interventions to individuals’ smartphones. A just-in-time adaptive intervention (JITAI) adjusts to an individual’s changing state and/or context to provide the right treatment, at the right time, in the right place. Micro-randomized trials (MRTs) allow for the collection of data which aid in the construction of an optimized JITAI by sequentially randomizing participants to different treatment options at each of many decision points throughout the study. Often, this data is collected passively using a mobile phone. To assess the causal effect of treatment on a near-term outcome, care must be taken when designing the data collection system to ensure it is of appropriately high quality. Here, we make several recommendations for collecting and managing data from an MRT. We provide advice on selecting which features to collect and when, choosing between “agents” to implement randomization, identifying sources of missing data, and overcoming other novel challenges. The recommendations are informed by our experience with HeartSteps, an MRT designed to test the effects of an intervention aimed at increasing physical activity in sedentary adults. We also provide a checklist which can be used in designing a data collection system so that scientists can focus more on their questions of interest, and less on cleaning data.
1 Introduction

The increasing prevalence of mobile phones and wearable sensors has led to a great deal of interest in using these technologies to improve health. In particular, ubiquitous computing holds the promise of delivering behavioral interventions that can be tailored to an individual’s current context at precisely the right time. A just-in-time adaptive intervention (JITAI) is an emerging mobile health intervention which is intended to provide treatment “at the right time and in the right place” [1, 2]. At each of many decision points, a JITAI uses decision rules to determine whether, and if so, which intervention option to deliver to an individual. These decision rules are functions which map data about the individual collected up until the decision point onto an intervention option [1].

The micro-randomized trial (MRT) is an experimental design that provides data for the construction of JITAIs [3, 4]. Participants in an MRT are sequentially randomized to different treatment options (including no treatment) at each of many decision points at which treatment delivery might be effective. Often, treatments in this setting are designed to have a nearly real-time impact; as a result, primary analyses for an MRT often focus on the effects of treatment on a “proximal” outcome: a near-term, measurable effect of an intervention component through which that component is hypothesized to affect desired distal health endpoints [4, 1]. Repeated randomization allows researchers to assess the average causal effects of treatment on proximal outcomes, as well as how these effects change over time and are moderated by participants’ context, such as their location, social setting, or mood [3].
The randomizations in an MRT can inform the construction of an effective JITAI by testing intervention components for which there is either insufficient evidence for an effect on a proximal outcome, or for which the dynamics of this effect over time are not well understood [3].

MRTs differ from standard clinical trials in several important ways. First, participants in MRTs may be randomized hundreds or thousands of times at relevant decision points. Second, MRTs are akin to factorial trial designs in that MRTs are designed to provide data that is useful in optimizing/constructing a multifactorial intervention, namely a JITAI. In an MRT, the data analyses focus on examining if the different treatments that might be included in a JITAI have their intended proximal effects. This is very different from a standard clinical trial used to contrast distal health effects of a completely formed intervention (e.g., a JITAI) versus a control. Third, a key goal of MRTs is to inform “just-in-time” decision rules—rules for when, where, and for whom a particular intervention component should be delivered—which requires consideration of the specific aspects of context that may affect treatment effectiveness. In comparison to standard clinical trials, then, MRTs require investigators to collect data at many more time points and, to the extent that contextual moderation effects are of interest, potentially across many more dimensions. Because of the volume of data they need to collect, MRTs frequently make use of automated, passive data collection in order to minimize participant burden. However, MRTs that choose to integrate this passive data collection into the same systems used to deliver the intervention face distinct challenges related to ensuring the collection of high-quality data necessary for evaluating intervention effectiveness and moderation.

Here, we offer practical guidance on data collection and management in an MRT.
Specifically, we consider issues related to determining which features to collect, the “agents” used to collect those features, differentiating amongst causes of missing data, and novel challenges in preparing MRT data for analysis. Throughout, we provide examples from our experiences with HeartSteps, an MRT designed to optimize a JITAI that used tailored activity suggestions delivered to a participant’s smartphone to encourage bouts of physical activity throughout the day. In the Appendix, we present a general checklist which can be used to guide the design of data collection systems in an MRT.

2 HeartSteps
The goal of the HeartSteps project is to develop an effective JITAI for supporting physical activity [5]. The system deployed in the first HeartSteps MRT consisted of an Android application we developed and the Jawbone Up Move activity tracker that collected minute-level step count data [3, 6]. The study involved sedentary adults, and one focus of the study was to examine the effect of two different intervention components included in the HeartSteps application: contextually-tailored activity suggestions and activity planning.

Activity suggestions prompted participants to walk or to break sedentary behavior, and they were tailored to the user’s current context in order to make them immediately actionable. Suggestions could be provided at five times (decision points) each day and were delivered as notifications to the lock screen of participants’ phones. Suggestions remained on the lock screen either until participants interacted with them or until they timed out after 30 minutes. Participants could acknowledge an activity suggestion by giving it a “thumbs up” or “thumbs down” rating. The five daily decision points were spaced evenly throughout the day, roughly corresponding to morning commute, lunch, mid-afternoon, evening commute, and after dinner. The proximal outcome of interest for the activity suggestions—the outcome that they were intended to directly influence—was the participant’s step count in the 30 minutes following the decision point.

The second treatment assessed in the HeartSteps MRT was activity planning. The planning intervention was designed to help participants specify when, where, and how they would be active on the following day. Planning could be provided each evening as part of the end-of-day questionnaire, which assessed contextual data that could not be captured automatically, such as how stressful and/or hectic the participant’s day was or whether s/he experienced any illness or travel.
The proximal outcome for the evening planning intervention was total step count on the following day.

The two intervention components above were randomized for each participant during the study as follows: activity suggestions were randomized to be delivered with probability 0.6 at each of the five decision points on each day of the study. Evening planning was randomized to be delivered with probability 0.5 in the evening of each day of the study. Note that the randomization of the planning component was independent of the randomization of activity suggestions. The two components were randomized on different time-scales and had different proximal outcomes. At each decision point for each intervention, we collected data on a number of variables related to treatment delivery and context, resulting in a large, dense dataset. This required careful planning to ensure quality.

The HeartSteps MRT lasted 42 days and was completed by 37 participants. Each participant experienced at least 210 decision points (5 per day for 42 days) for activity suggestions, and 42 decision points for evening planning. Thus, for 37 participants, there were 7770 possible decision points at which suggestions could be randomized, and a further 1554 possible decision points at which planning could be randomized. The sample size was chosen to achieve 80% power to detect, for the activity suggestions, an effect size of 0.1, assuming a two-sided type-I error rate of 0.05 and 70% “availability” (see Section 5) throughout the study [5]. Under these conditions, the minimum required sample size is 32 participants [4, 7].
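The randomization scheme and decision-point arithmetic described above can be sketched in a few lines. This is an illustrative simulation under the probabilities and counts stated in the text, not the HeartSteps software itself; the function name is ours.

```python
import random

DAYS = 42
PARTICIPANTS = 37
SUGGESTION_POINTS_PER_DAY = 5
P_SUGGESTION = 0.6  # randomization probability for activity suggestions
P_PLANNING = 0.5    # randomization probability for evening planning

# Total possible decision points across the trial, matching the counts in the text.
suggestion_points = PARTICIPANTS * DAYS * SUGGESTION_POINTS_PER_DAY  # 7770
planning_points = PARTICIPANTS * DAYS                                # 1554

def randomize_day(rng):
    """One participant-day: five independent suggestion draws plus one planning draw."""
    suggestions = [rng.random() < P_SUGGESTION for _ in range(SUGGESTION_POINTS_PER_DAY)]
    planning = rng.random() < P_PLANNING  # independent of the suggestion draws
    return suggestions, planning
```

Because the planning draw uses a separate call to the random number generator, it is independent of the five suggestion draws, mirroring the independence of the two components noted above.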
Note that the sample size may appear to be quite small; this is because the primary analyses for MRTs concern main effects of the intervention components, in which the primary hypothesis test statistic uses both within- as well as between-person contrasts in proximal outcomes; these types of test statistics are made possible by the within-person randomizations [3].

Figure 1: Screenshots from the smartphone app used in HeartSteps. At left, an activity suggestion is delivered to the participant’s phone lock screen. The participant can rate the suggestion thumbs-up or -down, or turn off the intervention for up to 12 hours. At right, a screen from the app shows the participant her step counts for the current day. [6]

Note that at each decision point in the HeartSteps MRT, treatment assignments were made independently of the individual’s past proximal outcomes, past treatment assignment, and context (conditional on them being “available” at that decision point; see Section 5). This, however, does not impede the study’s ability to provide data which can aid in the development of a JITAI. The primary aim of the initial HeartSteps MRT was to examine the average causal effect of sending an activity suggestion (vs. not sending) on the participant’s step count in the 30 minutes following the decision point. Secondary aims were to examine how this effect changed over time and/or was moderated by context, as well as to evaluate the time-varying causal effect of evening planning on the next day’s step count [5]. Addressing these aims allowed us to determine (1) whether to include each micro-randomized intervention component in the JITAI being developed, and (2) the circumstances under which the intervention component is most effective. For instance, we discovered that suggestions encouraging the participant to walk significantly increased 30-minute post-suggestion step count, whereas suggestions designed to break sedentary behavior had a smaller, positive, non-significant effect.
Furthermore, the effect of the walking suggestions could no longer be detected by day 29. Subsequent versions of the intervention could be modified to reduce habituation to activity suggestions, which may extend their usefulness [5].
Passive sensors and mobile phones offer investigators a wide variety of features that can be used to develop a high-quality JITAI. However, restraint must be exercised in choosing which data to collect. This choice is an important aspect of designing an MRT and, as with any clinical trial, it should be primarily driven by the scientific question(s) of interest. However, the science motivating MRTs is typically concerned with proximal effects of treatments, which are often delivered by an application (“app”) on a participant’s smartphone in order to achieve timeliness and contextualization. The phone typically also collects the relevant data of interest, both from the treatment and from other sources of (often passively-collected) data. Thus, with MRTs, the choice of which features to collect is inherently linked to both the science and the intervention (app) development process. Scientists should work closely with app developers to ensure that appropriate data is collected and that the system functions reliably in order to ensure the data is high-quality [8, 6].
Particular attention should be given to collecting features that are instrumental in constructing or assessing the primary proximal outcome. As in any clinical trial, the primary outcomes—here, the primary proximal outcomes—are specified a priori and care should be taken to ensure their proper collection. For example, when this outcome is measured passively, such as with a sensor or wearable fitness tracker, the scientist must be aware of limitations of both the device itself and of the application programming interfaces (APIs) provided by the manufacturer to access participant data.

Scientists may also consider incorporating redundancy into collection of the primary proximal outcome to ensure robust results. Recall from Section 2 that HeartSteps measured its proximal outcome of step count primarily using a Jawbone Up Move fitness tracker. Step count data was accessible via a Jawbone-provided API at several time-scales, including minute- and daily-level. We also opted to measure participant step count through the Google Fit app installed on participants’ phones. However, Google Fit step count data was less granular than Jawbone’s, as the app aggregated step counts across full periods of activity (e.g., from when an individual started walking to when the individual stopped), rather than every minute. Furthermore, Google Fit uses the phone’s accelerometer to track step count, which may be less reliable than a wearable sensor (e.g., some individuals may carry their phones in a bag, versus in a pants pocket, which could yield systematically different step counts). Figure 2 illustrates the difference in step counts as measured by different sensors in HeartSteps.

Figure 2: Average daily step counts of HeartSteps participants as measured by Jawbone (dashed black) and Google Fit (solid blue). Google Fit recorded consistently lower step counts than did the Jawbone wearable, but the two data streams follow approximately similar patterns.

Nonetheless, having a second measure of step count allowed for greater confidence in the outcome measure and results by enabling certain sensitivity analyses. For one, Google Fit provided a backup source of data when Jawbone data was missing (e.g., when participants forgot to wear their device). In analysis of data from HeartSteps, we performed sensitivity analyses in which missing Jawbone step count data was singly imputed using data from Google Fit, when available [5]. Additionally, collecting another measure of step count provided confidence that our analyses addressed the effect of treatment on physical activity more broadly, rather than simply step count as measured by a particular device. Indeed, there may be notable differences in data provided by different sensors. Step count is particularly susceptible to this, and Jawbone accuracy could have been affected by such things as where the tracker is worn (wrist vs.
waist) or participant age [9, 10]. Data analyses, then, might fit models to both the primary outcome and its proxy(ies) to ensure the effects of the intervention are not artifacts of sensor choice. For more on the use of Google Fit data in HeartSteps analyses, see Section 6.

3.2 Treatment Delivery
In order to address questions about the effect of a treatment on its associated proximal outcome, data must be collected on whether, when, and how the intervention was randomized and delivered. The exact information required will vary with the analysis method used, but at minimum, at each decision point the system should store the probability with which the treatment was randomized, the result and time of the randomization, and when and whether the intervention was delivered. Storing randomization probabilities is particularly important: the weighted and centered least-squares approach developed by Boruvka et al. to analyze MRT data centers the treatment indicator with the randomization probability [11]. While randomization probabilities remained constant in HeartSteps, many MRTs may allow them to change over the study: for example, when interventions are triggered by dynamic behavior, such as sedentary periods that are highly variable both within and between people (see, e.g., Dempsey et al. [12]). Finally, to enable robust causal inference, the same information needs to be recorded at each decision point regardless of whether the treatment is provided. Further discussion of record-keeping in regard to treatment delivery can be found in Section 4.

To the extent that adherence to treatment is of interest, collection of data related to whether or when the treatment was accessed may also be important. In HeartSteps, for example, while we were able to determine if a participant was randomized to receive an activity suggestion, and whether they rated it, we were not able to directly determine whether they saw or read the activity suggestion (without rating it), nor were we able to sense whether the participant followed the specific suggestion s/he received (although we did observe subsequent step count, as discussed above).
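The minimum per-decision-point record described above can be made concrete as a small schema. This is a hedged sketch with illustrative field names, not the actual HeartSteps data model; the key point is that a row exists for every decision point, whether or not treatment was delivered.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class DecisionPointRecord:
    """One row per participant per decision point, written whether or not
    treatment was actually delivered. Field names are illustrative, not
    HeartSteps' actual schema."""
    participant_id: str
    decision_point_utc: str             # ISO-8601 UTC time of the decision point
    available: bool                     # availability at the moment of randomization
    randomization_prob: float           # probability used for this randomization
    treatment_assigned: Optional[bool]  # None if not randomized (e.g., unavailable)
    delivered_utc: Optional[str] = None # set only once delivery to the phone is confirmed

# A record is created at every decision point, even when nothing is delivered.
row = DecisionPointRecord(
    participant_id="P001",
    decision_point_utc="2015-08-01T12:00:00Z",
    available=True,
    randomization_prob=0.6,
    treatment_assigned=True,
)
```

Storing the randomization probability alongside the assignment is what later allows the weighted-and-centered analysis to center the treatment indicator, even if probabilities vary over the study.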
A key motivation for building and optimizing JITAIs is the ability to tailor both the content and the delivery of treatment to individuals and their immediate contexts [13, 1]. Tailoring intervention content has been shown to be effective at improving health behavior change, so JITAIs often seek to provide highly context-specific treatments [14, 15]. For example, the HeartSteps activity suggestions were intended to be immediately actionable, and thus the content of the suggestions was tailored to reflect the participant’s current location, time of day, day of the week, and/or weather at each decision point.

Scientists must ensure that apps used in MRTs are able to collect all relevant contextual data in a timely manner. Further, scientists must have protocols in place for dealing with potential lag times in this collection to ensure that the tailored intervention is not using “stale” or inappropriate contextual information. For example, in HeartSteps, we did not want to send an activity suggestion that mentioned poor weather when the sun was shining. Anecdotally, this can be jarring to participants, who expect the intervention to be appropriately contextualized [6]. Risks of mistailored content can be mitigated by collecting contextual data as close to each decision point as possible, and by having protocols for dealing with stale or absent contextual information (e.g., delivering a generic form of the intervention).

Contextual data is also useful for discovering the best times to provide treatment by investigating possible moderation effects—potentially time-varying contextual factors that strengthen or weaken intervention effects [11, 3]. For example, with HeartSteps we might hypothesize that providing an evening planning intervention is more effective on weekdays than weekends, or that the effect of activity suggestions declines with time-on-study as participants habituate to the suggestions.
Other potential moderators might include participant burden (which could be assessed by the number of treatments provided in some recent window), stress, weather, time of day, or location. These moderating effects can easily be assessed using the method developed by Boruvka et al. by including in the “treatment effects model” (their equation (5)) an interaction between the moderator of interest and a centered treatment indicator [11].

Determining a priori which moderator effects are of interest is crucial, so that appropriate data can be collected over the course of the study. Examining time-varying treatment effect moderation requires that data on any and all potential moderators is available at all decision points. For example, to investigate whether stress moderates the impact of an intervention, analysts must have access to a relevant, time-varying measure of stress for every randomization. While this does not necessarily mean that stress should be assessed at every decision point (it could be measured, for instance, every day), the measure used should be scientifically sensible. Note that this also introduces questions regarding the time scales on which variables of interest are available; these are addressed in Section 7. Investigators interested in a large number of potential moderators should try to pre-specify the relative “priority” of each such variable. Moderator analyses may require fitting many models; having an a priori hypothesized level of importance for each variable can improve confidence in the results.
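The centering described above can be sketched directly: at each decision point, the treatment indicator A_t is centered by the randomization probability p_t, and a moderator interaction term is the product (A_t - p_t) * S_t. This is only a sketch of the design columns under assumed data, not the full weighted-and-centered estimator of Boruvka et al.; the function and variable names are ours.

```python
def centered_treatment_columns(A, p, S):
    """Build (A_t - p_t) and its interaction with a moderator S_t at each
    decision point. A: 0/1 treatment indicators; p: randomization
    probabilities; S: moderator values."""
    centered = [a - pi for a, pi in zip(A, p)]
    interaction = [c * s for c, s in zip(centered, S)]
    return centered, interaction

A = [1, 0, 1]        # was treatment delivered at each decision point?
p = [0.6, 0.6, 0.6]  # constant randomization probability, as in HeartSteps
S = [0, 1, 1]        # hypothetical moderator, e.g., a weekend indicator
centered, interaction = centered_treatment_columns(A, p, S)
```

Both columns would then enter the treatment effects model as regressors; this is possible only if the probability p_t used at each randomization was stored.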
Choosing which features to collect in an MRT often involves a trade-off between the investigator’s scientific interests and the participants’ privacy. Data collection through passive sensing can hide potential privacy risks from participants [16]. Proper anonymization of MRT data is also important, as certain features collected via a smartphone could be used to identify participants. GPS coordinate data, for instance, could be used to infer a participant’s home address, even with some form of masking [17]. While the need to exercise restraint in collecting participant data is a challenge for clinical research in general [18] and not specific to MRTs, privacy risks in mobile health are often greater due to the diversity and temporal density of the collected information. These risks have been much discussed [19, 20, 16], and a number of approaches have been developed for increasing participant privacy, including methods for anonymizing geographic data [21, 17]. Here, we emphasize mitigation of privacy risks through thoughtful collection of contextual features in an MRT.

Over-collection or improper storage of contextual variables may exacerbate privacy concerns in participants. As discussed above, location data can enable JITAIs to deliver treatments that are highly tailored to a participant’s current physical context. However, highly-specific location information is likely of less utility in time-varying moderator analyses. In HeartSteps, we chose to categorize participant location as “home”, “work”, or “other” to assess effect moderation, ensuring that no specific location information was stored in the long term. If other contextual moderators that depend on precise location, like weather, are collected accurately and consistently at each decision point, it may be unnecessary to store GPS coordinates or other fine-grained data.
We recommend that study designers carefully consider both scientific justifications and participants’ privacy concerns when designing MRT data collection systems.

Privacy should also be considered when storing data. Mobile health data might include protected health information (PHI), which may be subject to storage and safety requirements in the United States under the Health Insurance Portability and Accountability Act (HIPAA). If participants’ data is sent from their phones to a central server for later analysis, those servers may need to be HIPAA-compliant, and access to the data and PHI should be tightly controlled. However, information that is collected and used solely for contextualizing the intervention may not need to be sent to a server, and could instead stay on the participant’s phone, minimizing risk.
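One way to realize the “home”/“work”/“other” coarsening described above is to map raw coordinates to a label on the phone, so that only the label ever needs to leave the device. The sketch below assumes per-participant anchor coordinates and a distance threshold; the anchors, radius, and function names are hypothetical, not taken from HeartSteps.

```python
import math

def _distance_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in meters (haversine formula)."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def categorize_location(lat, lon, anchors, radius_m=250):
    """Coarsen raw coordinates to a label; only the label is stored long-term.
    `anchors` maps a label (e.g., 'home', 'work') to its coordinates."""
    for label, (alat, alon) in anchors.items():
        if _distance_m(lat, lon, alat, alon) <= radius_m:
            return label
    return "other"

anchors = {"home": (42.2808, -83.7430), "work": (42.2780, -83.7382)}  # hypothetical
```

Because the raw latitude/longitude is discarded after categorization, the long-term dataset supports moderator analyses without retaining re-identifiable location traces.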
Randomization is a key part of MRTs; therefore, choosing how to randomize is critical. The technical aspects of this issue are discussed by Smith and colleagues [6]; here, we focus on the choice between randomizing treatment assignments on the participant’s phone, or on a central server which is also used for data collection. This decision has consequences for both data management and the participant’s experience with the intervention. In particular, phone-side randomization can react faster to changing context and provide more precise timing of treatment, but it is also susceptible to a wide variety of technical issues which may interfere with data integrity and/or structure. Conversely, server-side decisions can facilitate a cleaner data structure, but at the expense of speed and precision in timing.

Consider the HeartSteps MRT: in the version of the app we tested in our first study, the decision to push an activity suggestion (and, subsequently, suggestion delivery) was made on the phone. Because the decisions were made “locally”, an internet connection was not required at the time of the decision for the user to receive the intervention. The suggestions could be tailored to the participant’s context within 90 seconds of a decision point, and messages could be delivered consistently and at appropriate times. Further, anticipating the possibility of a lost internet connection at the next decision point, our system collected contextual data and “pre-fetched” an activity suggestion from the server 30 minutes prior to randomization. This suggestion would be delivered at the decision point if the participant was randomized to receive treatment and did not have an internet connection. This maximized the chances that activity suggestions could be successfully randomized and delivered even if the participant lost connectivity.

Phone-side randomization also has downsides.
In particular, it requires extensive safeguards, such as “handshakes” which check for a stable connection between phone and server, to ensure the integrity of transmitted data. As an example, the order of the questions in the HeartSteps evening survey was randomized by the phone. The app was designed to send participants’ responses to the server immediately after each question was answered. Unfortunately, in several cases, technical difficulties led to incomplete survey responses being recorded, which prevented analysts from knowing the precise order of questions that should have been given to the participant. Furthermore, the stored results of randomization for the planning intervention were found to be unreliable and were frequently missing due to data loss. Server-side randomization would not have this issue.

Making decisions on the server can facilitate the automatic creation of clean (and possibly more complete) datasets, but introduces some ambiguity in the timing and tailoring of the intervention. Data tables on the server can be pre-built: the system can be designed to expect one row per decision per participant, and randomization status is always known. These pre-built tables can be filled in over the course of the study. This eliminates possible duplication due to time zone changes or other bugs (see Section 7). However, push notifications delivered by Apple (iOS) or Google (Android) from the cloud can be delayed, resulting in uncertainty in the time at which the intervention was delivered to the participant. This delay could also cause the intervention to be tailored to a context which has changed and is no longer appropriate.
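The pre-built server-side table described above can be sketched as follows: every expected decision point gets a placeholder row before the study begins, so the final dataset has a known, duplicate-free shape regardless of what the phones report. This is an illustrative in-memory version; a real system would use a database, and the field names are ours.

```python
from itertools import product

def prebuild_decision_table(participant_ids, n_days, points_per_day):
    """Create one placeholder row per participant per decision point before
    the study starts. Rows are filled in as randomizations occur, so every
    expected decision point exists even if a phone never reports back."""
    rows = {}
    for pid, day, slot in product(participant_ids,
                                  range(1, n_days + 1),
                                  range(1, points_per_day + 1)):
        rows[(pid, day, slot)] = {"randomized": None, "prob": None,
                                  "treatment": None, "delivered_utc": None}
    return rows

table = prebuild_decision_table(["P001", "P002"], n_days=42, points_per_day=5)
```

Keying rows by (participant, day, slot) rather than by arrival time is what makes duplication from time-zone changes or retransmissions structurally impossible.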
A notable feature of an MRT is the incorporation of “availability”, which protocolizes the notion that it may only be appropriate to provide treatment when the participant is in certain contexts or “states” [3]. Participants may be considered “unavailable” for treatment for reasons of safety, burden/annoyance, or feasibility. For example, in HeartSteps, participants were considered unavailable if they were currently driving, did not have an active internet connection, had manually turned off the intervention, or were walking within 90 seconds of a decision point, as measured by their phone [6].

By design, if a participant is unavailable at a decision point, then the participant is not randomized at that time [11, 3, 4]. However, it may be of interest to investigate reasons for unavailability. In HeartSteps, for example, we hypothesized that the proportion of suggestion decision points at which participants were unavailable due to walking would increase over time. This would suggest the intervention helped them become more active without the need for prompts. Investigating this, however, requires collecting the same data from both unavailable and available participants, even though treatment cannot be delivered to the former.

Ideally, when the participant is unavailable at a certain decision point, the data collection system will be able to identify and record the specific reason for that unavailability. This is partially facilitated by a clear, protocolized definition of availability, which is likely to be multifaceted and time-sensitive [6], and partially by collecting the usual set of features at unavailable decision points.
At each decision point, then, an ideal system would make a determination as to participant availability, output an availability indicator for that participant and, for unavailable participants, indicate which availability criterion/criteria were not met.

Though unavailability and missingness are distinct concepts, ideal MRT data collection will handle them in similar ways. As much as possible, systems should be designed to capture data necessary to pinpoint reason(s) for data missingness. This is critical for making valid inference [22], and identifying missingness mechanisms can aid analysts in discovering technical issues which may impact analyses or in identifying participant disengagement. For example, if participants do not respond to intervention components which invite or require an action (e.g., daily planning in HeartSteps), the system should always store “no response” rather than a blank entry. This can tell the analyst that the intervention was delivered, but the participant did not engage with it, thus allowing the analyst to distinguish between missing data due to disengagement and missing data due to, e.g., technical issues that prevented intervention delivery.

Some reasons for unavailability or missing data may be difficult or impossible to ascertain with certainty. For example, to preserve battery life, fitness trackers may only record positive step counts; that is, if a participant took zero steps in a certain period, the tracker will not report anything for that time. In some cases, then, it may not be possible to tell if gaps in step counts are due to true inactivity, or if the participant stopped wearing the tracker. Because of this, primary analyses of HeartSteps data used a zero-imputed version of Jawbone step count [5].
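The two record-keeping ideas above, per-criterion availability reasons and explicit “no response” markers, can be sketched together. The criteria follow the HeartSteps examples in this section, but the function and label names are illustrative.

```python
def availability_record(is_driving, has_connection, intervention_on, recently_walking):
    """Evaluate each availability criterion separately and record which ones
    failed, rather than a bare available/unavailable flag. The criteria follow
    the HeartSteps examples; names are illustrative."""
    reasons = []
    if is_driving:
        reasons.append("driving")
    if not has_connection:
        reasons.append("no_internet")
    if not intervention_on:
        reasons.append("intervention_off")
    if recently_walking:
        reasons.append("already_walking")
    return {"available": not reasons, "unavailable_reasons": reasons}

def planning_response(response):
    """Store an explicit 'no_response' marker instead of a blank entry, so
    disengagement can be distinguished from failed delivery."""
    return response if response is not None else "no_response"
```

Recording the failed criteria, not just the flag, is what later makes analyses such as “unavailability due to walking over time” possible.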
Having redundant data streams can sometimes help in such cases—in HeartSteps, Google Fit provided another source of step count data, allowing for sensitivity analyses in which missing Jawbone step counts were singly imputed from Google Fit—but resolving all cases of uncertain missingness may still prove difficult. To anticipate and mitigate the various scenarios that can result in missing data before the study starts, scientists should employ systematic software testing and deployments of beta versions of the intervention.
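The two imputation strategies discussed in this section, zero-imputation for the primary analysis and single imputation from the redundant Google Fit stream for sensitivity analyses, can be sketched as follows. The function is a simplified illustration with hypothetical data, not the actual HeartSteps analysis code.

```python
def impute_steps(jawbone, googlefit, mode="zero"):
    """Fill gaps (None) in the primary wearable step-count stream.
    mode='zero' mirrors the zero-imputation used in primary analyses;
    mode='googlefit' mirrors the sensitivity analysis that borrows from the
    phone's stream when it is available."""
    out = []
    for jb, gf in zip(jawbone, googlefit):
        if jb is not None:
            out.append(jb)
        elif mode == "googlefit" and gf is not None:
            out.append(gf)
        else:
            out.append(0)
    return out

jawbone   = [120, None, 45, None]  # wearable stream with missing intervals
googlefit = [95, 80, None, None]   # phone stream, also imperfect
```

Comparing results under both modes indicates how sensitive the estimated treatment effects are to the assumption that missing intervals represent true inactivity.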
Passive data collection through mobile devices like smartphones can be idiosyncratic in ways that would not arise in more traditional clinical settings. Primarily, it is possible for participants to (inadvertently) fail to provide data, potentially for long periods of time. This is a serious issue in MRTs because of the density of the data collected. Study designers should develop protocols to identify when participants are not contributing data, and take appropriate steps to correct this.

There are several ways in which a participant’s data collection could be interrupted. Phones might lose power or be manually turned off, thus limiting the ability for sensors to collect data. Generally, apps are not notified when the phone is shutting down, and so this situation may be impossible to identify. The operating system may also shut down any background processes of the MRT app to free up memory or improve battery life, stopping data collection until the user manually opens the application again. Batteries in wearable sensors may die, requiring the participant to notice their sensor is dead, charge it, and put it back on. All of these situations create missingness for which the cause is difficult to identify from the data. Similarly, a participant might turn off Bluetooth, which is often required to sync data from sensors to the phone, or GPS, necessary for location detection. In some cases, data might be stored on the sensor but not synced to the server, so step counts, for example, might need to be recovered after a Bluetooth connection is restored. Location data, however, is likely not recoverable if GPS is turned off.

Missingness due to data loss should be mitigated as much as possible. App developers should attempt to anticipate situations that could lead to data loss, such as attempting to deliver the intervention over a WiFi network with a “captive portal” that requires users to accept terms of service before connecting (e.g., hotel WiFi).
With HeartSteps, we discovered that some participants were used to closing apps by "swiping them away" from the phone's multitasking menu. If the participant closed the HeartSteps app this way before it ensured that data sent from the app was received by the server, data loss could result. Both of these situations might be mitigated by phone-to-server "handshakes", which track data exchange between the phone and the server and can signal to the phone that the exchange was successfully completed so that the phone-side data can be discarded.
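A minimal sketch of such a handshake, with the network layer abstracted away, is below: the phone buffers records locally and discards them only after verifying the server's acknowledgement, so a dropped connection or force-closed app leaves the records queued for the next sync. All class and method names here are hypothetical illustrations, not part of the HeartSteps implementation.

```python
import hashlib
import json

class Server:
    """Receives batches of records and returns an acknowledgement token."""
    def __init__(self):
        self.received = []

    def receive(self, payload: str) -> str:
        self.received.extend(json.loads(payload))
        # Acknowledge by echoing a digest of exactly what was received.
        return hashlib.sha256(payload.encode()).hexdigest()

class Phone:
    """Buffers records locally; discards them only after a valid ack."""
    def __init__(self, server: Server):
        self.server = server
        self.outbox = []

    def log(self, record: dict):
        self.outbox.append(record)

    def sync(self) -> bool:
        payload = json.dumps(self.outbox)
        ack = self.server.receive(payload)
        # Verify the ack before discarding phone-side data; on a dropped
        # connection or a mismatch, records stay queued for the next sync.
        if ack == hashlib.sha256(payload.encode()).hexdigest():
            self.outbox.clear()
            return True
        return False

# Usage: buffered records survive on the phone until the server confirms receipt.
phone = Phone(Server())
phone.log({"event": "notification_sent", "time_utc": "2018-01-01T14:00:00Z"})
synced = phone.sync()
```

The key property is that deletion of phone-side data is gated on a verifiable acknowledgement rather than on the send attempt itself, which is what protects against the "swiped away mid-transfer" scenario.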
MRTs require careful consideration of time, especially as it relates to the treatments delivered and participants' experiences with those treatments. Special attention must be given to collecting time stamps, particularly for events related to treatment delivery. Not only is this crucial for aligning data from multiple sources (e.g., step count data from the fitness tracker and information from the phone), but it also allows for easier troubleshooting in case of bugs. Careful collection of time stamps allows for the reconstruction of the participant's experience with the MRT.

The ability to infer the participant's "timeline" from data is critical. Since data collection is automated and no code is perfect, technical difficulties will arise. Time stamps can identify and troubleshoot duplicated records, and can allow for the linking of data from different components of the intervention. Immediately translating all time stamps into Coordinated Universal Time (UTC) helps avoid problems with participants changing time zones and Daylight Saving Time. UTC time stamps allow for the cleanest possible reconciliation of different variables that need to be connected to each other by time stamps.

In HeartSteps, information about message delivery and the participant's response to the message were kept in separate tables on the central server, and later linked by participant ID and time stamps. Careful collection of time stamps can identify delays in the intervention delivery system: in MRTs that use server-side randomization, the time of the decision to provide treatment will likely not be the time at which the treatment is delivered via the participant's phone. Recording the time at which this delivery does occur, or some surrogate thereof, can indicate the proper time at which to start measuring the proximal outcome.

For longer studies, participant travel during the study should also be considered.
One of the advantages of mobile intervention delivery is that it can follow participants as they, for example, go on vacation. However, due to the highly time-sensitive nature of the data collected in an MRT, changes in time zones can result in aberrant behavior in the app and/or data collection systems, such as repeated or missed decision points. As such, time stamps should be stored with the participant's current time zone when possible (note that this might require access to participants' locations). Smartphone system times may not adjust to time zone changes unless the phone is rebooted. If this is the case, duplication of decision points may not be an issue, but if the decision times are chosen in advance by the user, improperly-timed treatments may be jarring. For example, consider a participant traveling from the east coast of the US to Hawaii, a six-hour time difference. If the interventions are delivered according to a system time which is not sensitive to time zone, the participant might receive a suggestion tailored for the evening around noon, or for very different weather conditions.

Study designers should decide a priori how to handle participant travel. One approach might be to exclude time spent traveling from an individual's data, though this requires strong scientific justification and a comprehensive definition of what constitutes "travel". Any information needed to assess whether the user is traveling should also be recorded. Alternatively, if the app and intervention can handle time zone changes without creating jarring experiences like duplicated or mistimed decision points, time could be conceptualized as the participant's local time and decision points indexed according to this local time.

Just as important as proper time stamp collection is consideration of the various time scales on which variables are measured, which are likely reflective of how dynamic or changeable these features are over time.
Some features might be collected only at baseline (e.g., age) or monthly (e.g., self-efficacy); others, daily (e.g., how typical a day the participant had) or even at every decision point (e.g., weather). In preparing data for analysis, these time scales should be carefully accounted for. For example, in HeartSteps, weather conditions were collected at every decision point. In determining whether weather moderates the effect of the daily planning intervention on daily step count, average temperature or total precipitation throughout the day might be used as a summary measure. On the other hand, when merging a daily-level variable (e.g., self-reported stress level) onto finer data (such as activity suggestions), care should be taken so that all decision points on the same day are associated with the same, most-recent daily observation. Recall that adjustment for post-treatment variables can lead to biased causal inference [23].
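For the daily-to-decision-point merge described above, an "as-of" join implements the most-recent-observation rule directly. The sketch below uses pandas merge_asof on UTC time stamps with hypothetical variable names; direction="backward" attaches to each decision point the latest daily report recorded at or before it, which also guards against attaching a stress report collected after treatment.

```python
import pandas as pd

# Hypothetical decision-point data (UTC) and a once-daily stress self-report.
decisions = pd.DataFrame({
    "utc_time": pd.to_datetime([
        "2018-01-01 14:00", "2018-01-01 18:00",
        "2018-01-02 14:00", "2018-01-02 18:00",
    ]),
    "suggestion_sent": [1, 0, 1, 1],
})
daily = pd.DataFrame({
    "utc_time": pd.to_datetime(["2018-01-01 02:00", "2018-01-02 02:00"]),
    "stress": [3, 7],
})

# Both frames must be sorted by the merge key. "backward" attaches the
# most recent daily observation at or before each decision point, so all
# decision points on a day share the same stress value.
merged = pd.merge_asof(decisions, daily, on="utc_time", direction="backward")
```

In a real MRT the join would typically also be grouped by participant ID (the `by` argument of merge_asof); this single-participant sketch shows only the time-alignment logic.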
Mobile health interventions delivered through smartphones have the potential to effect meaningful change in individuals' lives. The power and prevalence of mobile devices has led to a rapid pace of innovation and a growing interest in harnessing them to improve health. The micro-randomized trial allows for the collection of data which can be used to construct optimized just-in-time adaptive interventions and make causal inference about the effects of intervention components on a proximal outcome. However, care must be taken to ensure that this data is high-quality.

To this end, we have presented a series of recommendations in several key areas which, if implemented, can help scientists collect and manage data which can be used to assess the effectiveness of momentary, mobile interventions. The central theme of these recommendations is that careful planning is required before the trial begins to ensure that the proper data is collected in a robust way. Passive data collection systems in MRTs are not able to document their procedures (as a human might) other than through their source code, so testing the application is critical.

Designing a smartphone app and data collection system to address scientific questions involves different needs than does typical app development. Using data from smartphones or sensors in an MRT to perform causal inference requires that careful attention is paid to precisely how the data is gathered. The checklist presented in the Appendix may be a useful start to ensure that appropriate, high-quality data can be collected in a general MRT. The recommendations presented here can allow scientists to focus more on their questions of interest, and spend less time working with disorderly data.
Acknowledgements
Research reported in this publication was supported by NIAAA, NIDA, NIBIB, and NHLBI of the National Institutes of Health under award numbers R01AA023187, P50DA039838, U54EB020404, and R01HL125440. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors wish to thank Dr. Audrey Boruvka for her work in managing data from HeartSteps, as well as the two anonymous reviewers for their thoughtful comments.

Appendix: Checklist for Preparing a Data Collection System for an MRT
Here, we summarize the ideas presented above into a checklist which can be used when collaborating with app developers to design a data collection system for an MRT. The following is a general guide to key steps which should be addressed in the planning stages of most MRTs to improve the quality of data collected.

1. Proximal Outcome. Know how the proximal outcome is measured and collected, and on what time scale(s) it is available. (Section 3.1)

2. Randomization Agent. Decide whether randomization should occur on the participant's phone or on a central server, being careful to weigh advantages and disadvantages of both in the context of the MRT being designed. (Section 4)

3. Treatment Delivery. Ensure robust collection of data on treatment delivery at each decision point. This includes the results of each randomization and the probability with which individuals were randomized to receive treatment. (Section 3.2)

4. Contextual Data. List contextual variables which may be of interest for tailoring the intervention, or as possible moderators for later analysis. Ensure that all contextual variables are collected regardless of the decision to treat or not treat. Assess the time scales on which these features are available, and determine how to ensure those used for tailoring are kept "fresh". Be sure to consider participant privacy when choosing what to collect and storing the data. (Sections 3.3 and 3.4)

5. Time Stamps. As much as possible, collect the times at which key events in the trial occurred, such as randomization times, and times at which interventions were delivered to and/or seen by the participant. Immediately translate all time stamps into UTC. If location is available, collecting the participant's local time zone with every time stamp can help troubleshoot issues caused by travel. (Section 7)

6. Unavailability and Missing Data. Develop an appropriately comprehensive definition of unavailability that allows the study app to accurately record why participants were unavailable. Ensure that all contextual data is collected at each decision point regardless of availability status. Anticipate and address sources of missing data by understanding how sensors record a null event (e.g., zero steps), and by storing "no response" when an intervention is delivered but the participant does not engage. Use phone-to-server "handshakes" to ensure data is transferred successfully before it is discarded on the phone. As much as possible, missingness mechanisms should be identified and recorded in the data. (Section 5)

7. Understanding Participants.
List and design around common issues participants might have with the study app and/or phone. Protocolize strategies for data recovery when, say, a participant turns Bluetooth off or the battery on his/her sensor dies. Carefully pilot test the app before launching the study to help identify bugs which may arise in day-to-day usage. (Section 6)

References

[1] Nahum-Shani, I., Smith, S.N., Spring, B.J., et al. (2016) Just-in-time adaptive interventions (JITAIs) in mobile health: Key components and design principles for ongoing health behavior support, Ann. Behav. Med., doi:10.1007/s12160-016-9830-8.

[2] Spruijt-Metz, D., Wen, C.K.F., O'Reilly, G., et al. (2015) Innovations in the use of interactive technology to support weight management, Curr. Obes. Rep., 4(4):510–519, doi:10.1007/s13679-015-0183-6.

[3] Klasnja, P., Hekler, E.B., Shiffman, S., et al. (2015) Microrandomized trials: An experimental design for developing just-in-time adaptive interventions, Health Psychol., 34(Suppl):1220–1228, doi:10.1037/hea0000305.

[4] Liao, P., Klasnja, P., Tewari, A., et al. (2016) Sample size calculations for micro-randomized trials in mHealth, Stat. Med., 35(12):1944–1971, doi:10.1002/sim.6847.

[5] Klasnja, P., Smith, S., Seewald, N.J., et al. (2018) Efficacy of Contextually Tailored Suggestions for Physical Activity: A Micro-randomized Optimization Trial of HeartSteps, Ann. Behav. Med., pp. 1–10, doi:10.1093/abm/kay067.

[6] Smith, S.N., Lee, A.J., Hall, K., et al. (2017) Design lessons from a micro-randomized pilot study in mobile health, in Mobile Health (eds. J.M. Rehg, S.A. Murphy, and S. Kumar), pp. 59–82, Cham: Springer, doi:10.1007/978-3-319-51394-2_4.

[7] Seewald, N.J., Sun, J., and Liao, P. (2016) MRT-SS Calculator: An R Shiny Application for Sample Size Calculation in Micro-Randomized Trials, arXiv:1609.00695 [stat.ME].

[8] Price, M., Yuen, E.K., Goetter, E.M., et al. (2014) mHealth: A mechanism to deliver more accessible, more effective mental health care, Clin. Psychol. Psychother., 21(5):427–436, doi:10.1002/cpp.1855.

[9] Kumar, S., Nilsen, W.J., Abernethy, A., et al. (2013) Mobile health technology evaluation: The mHealth evidence workshop, Am. J. Prev. Med., 45(2):228–236, doi:10.1016/j.amepre.2013.03.017.

[10] Modave, F., Guo, Y., Bian, J., et al. (2017) Mobile device accuracy for step counting across age groups, JMIR mHealth uHealth, 5(6), doi:10.2196/mhealth.7870.

[11] Boruvka, A., Almirall, D., Witkiewitz, K., et al. (2018) Assessing Time-Varying Causal Effect Moderation in Mobile Health, J. Am. Stat. Assoc., 113(523):1112–1121, doi:10.1080/01621459.2017.1305274.

[12] Dempsey, W., Liao, P., Kumar, S., et al. (2017) The stratified micro-randomized trial design: sample size considerations for testing nested causal effects of time-varying treatments.

[13] Kumar, S., Nilsen, W., Pavel, M., et al. (2013) Mobile health: Revolutionizing healthcare through transdisciplinary research, Computer, 46(1):28–35, doi:10.1109/MC.2012.392.

[14] Kreuter, M. (2000) Tailoring Health Messages: Customizing Communication with Computer Technology, LEA's Communication Series, Mahwah, N.J.: Routledge.

[15] Noar, S.M., Harrington, N.G., Stee, S.K.V., et al. (2011) Tailored health communication to change lifestyle behaviors, Am. J. Lifestyle Med., 5(2), doi:10.1177/1559827610387255.

[16] Raij, A., Ghosh, A., Kumar, S., et al. (2011) Privacy risks emerging from the adoption of innocuous wearable sensors in the mobile environment, Proc. 2011 Annu. Conf. Hum. Factors Comput. Syst. (CHI '11), p. 11, doi:10.1145/1978942.1978945.

[17] Seidl, D.E., Paulus, G., Jankowski, P., et al. (2015) Spatial obfuscation methods for privacy protection of household-level data, Appl. Geogr., 63:253–263, doi:10.1016/j.apgeog.2015.07.001.

[18] Saczynski, J.S., McManus, D.D., and Goldberg, R.J. (2013) Commonly used data-collection approaches in clinical research, Am. J. Med., 126(11):946–950, doi:10.1016/j.amjmed.2013.04.016.

[19] Kotz, D. (2011) A threat taxonomy for mHealth privacy, doi:10.1109/COMSNETS.2011.5716518.

[20] Martinez-Perez, B., de la Torre-Diez, I., and Lopez-Coronado, M. (2015) Privacy and security in mobile health apps: A review and recommendations, J. Med. Syst., 39(1), doi:10.1007/s10916-014-0181-3.

[21] Cassa, C.A., Wieland, S.C., and Mandl, K.D. (2008) Re-identification of home addresses from spatial locations anonymized by Gaussian skew, Int. J. Health Geogr., 7:45, doi:10.1186/1476-072X-7-45.

[22] Rubin, D.B. (1976) Inference and missing data, Biometrika, 63(3):581, doi:10.2307/2335739.

[23] Rosenbaum, P.R. (1984) The consequences of adjustment for a concomitant variable that has been affected by the treatment,