Robust training of recurrent neural networks to handle missing data for disease progression modeling
Mostafa Mehdipour Ghazi, Mads Nielsen, Akshay Pai, M. Jorge Cardoso, Marc Modat, Sebastien Ourselin, Lauge Sørensen
Biomediq A/S, Copenhagen, DK
Department of Computer Science, University of Copenhagen, DK
Centre for Medical Image Computing, University College London, UK
[email protected]
Abstract
Disease progression modeling (DPM) using longitudinal data is a challenging task in machine learning for healthcare that can provide clinicians with better tools for diagnosis and monitoring of disease. Existing DPM algorithms neglect temporal dependencies among measurements and make parametric assumptions about biomarker trajectories. In addition, they do not model multiple biomarkers jointly and need to align subjects' trajectories. In this paper, recurrent neural networks (RNNs) are utilized to address these issues. However, in many cases, longitudinal cohorts contain incomplete data, which hinders the application of standard RNNs and requires a pre-processing step such as imputation of the missing values. We, therefore, propose a generalized training rule for the most widely used RNN architecture, long short-term memory (LSTM) networks, that can handle missing values in both target and predictor variables. This algorithm is applied for modeling the progression of Alzheimer's disease (AD) using magnetic resonance imaging (MRI) biomarkers. The results show that the proposed LSTM algorithm achieves a lower mean absolute error for prediction of measurements across all considered MRI biomarkers compared to using standard LSTM networks with data imputation or using a regression-based DPM method. Moreover, applying linear discriminant analysis to the biomarker values predicted by the proposed algorithm results in a larger area under the receiver operating characteristic curve (AUC) for clinical diagnosis of AD compared to the same alternatives, and the AUC is comparable to state-of-the-art AUCs from a recent cross-sectional medical image classification challenge. This paper shows that built-in handling of missing values in LSTM network training paves the way for application of RNNs in disease progression modeling.
Alzheimer's disease (AD) is a chronic neurodegenerative disorder that begins with short-term memory loss and develops over time, causing issues in conversation, orientation, and control of bodily functions [1]. Early diagnosis of the disease is challenging, and the diagnosis is usually made once cognitive impairment has already compromised daily living. Hence, developing robust, data-driven methods for disease progression modeling (DPM) utilizing longitudinal data is necessary to yield a complete perspective of the disease for better diagnosis, monitoring, and prognosis [2].

Existing DPM techniques attempt to describe biomarker measurements as a function of disease progression through continuous curve fitting. In the AD progression literature, a variety of regression-based methods have been applied to fit logistic or polynomial functions to the longitudinal dynamics of each biomarker [3–8]. However, parametric assumptions on the biomarker trajectories limit the applicability of such methods; in addition, none of the existing approaches considers the temporal dependencies among measurements. Furthermore, the available methods mostly rely on independent biomarker modeling and require alignment of subjects' trajectories, either as a pre-processing step or as part of the algorithm.

Recurrent neural networks (RNNs) are sequence learning based methods that can offer continuous, non-parametric, joint modeling of longitudinal data while taking temporal dependencies amongst measurements into account [9]. However, since longitudinal cohort data often contain missing values due to, for instance, patient dropout, unsuccessful measurements, and/or varied trial design, standard RNNs require pre-processing steps for data imputation, which may result in suboptimal analyses and predictions [10]. Therefore, the lack of methods to inherently handle incomplete data in RNNs is evident [11].

Long short-term memory (LSTM) networks are widely used types of RNNs developed to effectively capture long-term temporal dependencies by dealing with the exploding and vanishing gradient problem during backpropagation through time [12–14]. They employ a memory cell with nonlinear reset units, so-called constant error carousels (CECs), and learn to store history for either long or short time periods. Since their introduction, a variety of LSTM networks have been developed for different time-series applications [15]. The vanilla LSTM, among others, is the most commonly used architecture; it utilizes three reset gates with full gate recurrence and applies the backpropagation-through-time algorithm using full gradients. Its complete topology can also include biases and cell-to-gate (peephole) connections.

The most common approach to handling missing data with LSTM networks is a data imputation pre-processing step, usually using mean or forward imputation. This two-step procedure decouples missing data handling from network training, resulting in suboptimal performance, and it is heavily influenced by the choice of data imputation scheme. Other approaches update the architecture to utilize possible correlations between the patterns of missing values and the target to improve prediction results [10, 11].
Our goal is different; we want to make the training of LSTM networks robust to missing values to more faithfully capture the true underlying signal, and to make the learned model generalizable across cohorts, without relying on specific cohort or demographic circumstances correlated with the target.

In this paper, we propose a generalized method for training LSTM networks that can handle missing values in both target and predictor variables. This is achieved by applying the batch gradient descent algorithm together with normalizing the loss function and its gradients with respect to the number of missing points in target and input, to ensure a proportional contribution of each weight per epoch. The proposed LSTM algorithm is applied for modeling the progression of AD in the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort [16] based on magnetic resonance imaging (MRI) biomarkers, and the estimated biomarker values are used to predict the clinical status of subjects.

Our main contribution is three-fold. Firstly, we propose a generalized formulation of backpropagation through time for LSTM networks to handle incomplete data, and show that such built-in handling of missing values provides better modeling and prediction performance compared to using data imputation with standard LSTM networks. Secondly, we model temporal dependencies among measurements within the ADNI data using the proposed LSTM network via sequence-to-sequence learning. To the best of our knowledge, this is the first time such multi-dimensional sequence learning methods have been applied to neurodegenerative DPM. Lastly, we introduce an end-to-end approach for modeling the longitudinal dynamics of imaging biomarkers, without need for trajectory alignment, and for clinical status prediction. This is a practical way to implement a robust DPM for both research and clinical applications.

The main goal of this study is to minimize the influence of missing values on the learned LSTM network parameters. This is achieved by using the batch gradient descent scheme together with the backpropagation through time algorithm, modified to take into account missing data in the input and target vectors. More specifically, the algorithm accumulates the input weight gradients proportionally weighted according to the number of available time points per input biomarker node, using the subject-specific normalization factor $\beta_j^n$. In addition, it uses an L2-norm loss function with residuals
weighted according to the number of available time points per output biomarker node, using the subject-specific normalization factor $\beta_j^m$, and normalized with respect to the total number of available input values for all visits of all biomarkers, propagated through the forward pass, using the subject-specific normalization factor $\beta_j^x$. Such modification of the loss function also ensures that all gradients of the network weights are indirectly normalized. Finally, the use of batch gradient descent ensures that there is at least one visit available per biomarker so that each input node can proportionally contribute to the weight updates.

Figure 1: An illustration of a vanilla LSTM unit with peephole connections in red. The solid and dashed lines show weighted and unweighted connections, respectively.

Figure 1 shows a typical schematic of a vanilla LSTM architecture. As can be seen, the topology includes a memory cell, an input modulation gate, a hidden activation function, and three nonlinear reset gates, namely the input gate, forget gate, and output gate, each of which accepts current and recurrent inputs. The memory cell learns to maintain its state over time, while the multiplicative gates learn to open and close access to the constant error/information flow, to prevent exploding or vanishing gradients. The input gate protects the memory contents from perturbation by irrelevant inputs, while the output gate protects other units from perturbation by currently irrelevant memory contents. The forget gate deals with continual or very long input sequences, and finally, peephole connections allow the gates to access the CEC of the same cell state.
Assume $x_j^t \in \mathbb{R}^{N \times 1}$ is the $j$-th observation of an $N$-dimensional input vector at current time $t$. If $M$ is the number of output units, the feedforward calculations of the LSTM network under study can be summarized as

$$
\begin{aligned}
f_j^t &= W_f x_j^t + U_f h_j^{t-1} + V_f \odot c_j^{t-1} + b_f && \rightarrow \quad \tilde{f}_j^t = \sigma_g(f_j^t)\,, \\
i_j^t &= W_i x_j^t + U_i h_j^{t-1} + V_i \odot c_j^{t-1} + b_i && \rightarrow \quad \tilde{i}_j^t = \sigma_g(i_j^t)\,, \\
z_j^t &= W_c x_j^t + U_c h_j^{t-1} + b_c && \rightarrow \quad \tilde{z}_j^t = \sigma_c(z_j^t)\,, \\
c_j^t &= \tilde{f}_j^t \odot c_j^{t-1} + \tilde{i}_j^t \odot \tilde{z}_j^t && \rightarrow \quad \tilde{c}_j^t = \sigma_h(c_j^t)\,, \\
o_j^t &= W_o x_j^t + U_o h_j^{t-1} + V_o \odot c_j^t + b_o && \rightarrow \quad \tilde{o}_j^t = \sigma_g(o_j^t)\,, \\
h_j^t &= \tilde{o}_j^t \odot \tilde{c}_j^t\,,
\end{aligned}
$$

where $\{f_j^t, i_j^t, z_j^t, c_j^t, o_j^t, h_j^t\} \in \mathbb{R}^{M \times 1}$ and $\{\tilde{f}_j^t, \tilde{i}_j^t, \tilde{z}_j^t, \tilde{c}_j^t, \tilde{o}_j^t\} \in \mathbb{R}^{M \times 1}$ are the $j$-th observations of the forget gate, input gate, modulation gate, cell state, output gate, and hidden output at time $t$, before and after activation, respectively. Moreover, $\{W_f, W_i, W_o, W_c\} \in \mathbb{R}^{M \times N}$ and $\{U_f, U_i, U_o, U_c\} \in \mathbb{R}^{M \times M}$ are the sets of weights connecting the input and recurrent input, respectively, to the gates and cell, $\{V_f, V_i, V_o\} \in \mathbb{R}^{M \times 1}$ is the set of peephole connections from the cell to the gates, $\{b_f, b_i, b_o, b_c\} \in \mathbb{R}^{M \times 1}$ represents the corresponding neuron biases, and $\odot$ denotes element-wise multiplication. Finally, $\sigma_g$, $\sigma_c$, and $\sigma_h$ are nonlinear activation functions assigned to the gates, input modulation, and hidden output, respectively. Logistic sigmoid functions with range $[0, 1]$ are applied for the gates, while hyperbolic tangent functions with range $[-1, 1]$ are applied for modulation of both the cell input and the hidden output.

Let $L \in \mathbb{R}^{M \times 1}$ be the loss function defined based on the actual target $s$ and the network output $y$. Here, we consider one layer of LSTM units for sequence learning, which means that the network output is the hidden output. The main idea is to calculate the partial derivatives of the normalized loss function with respect to the weights using the chain rule. Hence, the backpropagation calculations through time using full gradients can be obtained as

$$
\begin{aligned}
L(m) &= \frac{1}{2} \sum_{j,t} \frac{1}{\beta_j^x \beta_j^m} \bigl(y_j^t(m) - s_j^t(m)\bigr)^2 \;\rightarrow\; \delta y_j^t(m) = \frac{\partial L_j^t(m)}{\partial y_j^t(m)} = \frac{1}{\beta_j^x \beta_j^m} \bigl(y_j^t(m) - s_j^t(m)\bigr)\,, \\
\delta h_j^t &= \delta y_j^t + U_f^T \delta f_j^{t+1} + U_i^T \delta i_j^{t+1} + U_c^T \delta z_j^{t+1} + U_o^T \delta o_j^{t+1}\,, \\
\delta \tilde{o}_j^t &= \delta h_j^t \odot \tilde{c}_j^t \;\rightarrow\; \delta o_j^t = \delta \tilde{o}_j^t \odot \sigma_g'(o_j^t)\,, \\
\delta \tilde{c}_j^t &= \delta h_j^t \odot \tilde{o}_j^t \;\rightarrow\; \delta c_j^t = \delta \tilde{c}_j^t \odot \sigma_h'(c_j^t) + \delta c_j^{t+1} \odot \tilde{f}_j^{t+1} + V_f \odot \delta f_j^{t+1} + V_i \odot \delta i_j^{t+1} + V_o \odot \delta o_j^t\,, \\
\delta \tilde{z}_j^t &= \delta c_j^t \odot \tilde{i}_j^t \;\rightarrow\; \delta z_j^t = \delta \tilde{z}_j^t \odot \sigma_c'(z_j^t)\,, \\
\delta \tilde{i}_j^t &= \delta c_j^t \odot \tilde{z}_j^t \;\rightarrow\; \delta i_j^t = \delta \tilde{i}_j^t \odot \sigma_g'(i_j^t)\,, \\
\delta \tilde{f}_j^t &= \delta c_j^t \odot c_j^{t-1} \;\rightarrow\; \delta f_j^t = \delta \tilde{f}_j^t \odot \sigma_g'(f_j^t)\,, \\
\delta x_j^t &= W_f^T \delta f_j^t + W_i^T \delta i_j^t + W_c^T \delta z_j^t + W_o^T \delta o_j^t\,,
\end{aligned}
$$

where $\beta_j^x = \frac{J |x_j|}{T N}$ and $\beta_j^m = |y_j(m)|$ are normalization factors to handle missing values of the $j$-th observation with batch size $J$ and sequence length $T$. Also, $|x_j|$ and $|y_j(m)|$ denote the total number of available input values and the number of available target time points in the $m$-th node, respectively.
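To make these equations concrete, the following is a minimal NumPy sketch of the forward pass and the normalized loss. The paper's in-house implementation is in MATLAB, so this Python translation, its variable names, and the toy dimensions and initialization range in the demo are illustrative assumptions only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_forward(x, params):
    """Unroll one peephole LSTM layer over a (T, N) sequence.

    NaNs in x mark missing inputs and are zero-filled before the gate
    computations, as in the proposed training rule.  Returns the (T, M)
    hidden outputs h_t (the network outputs y_t for one-layer learning).
    """
    Wf, Wi, Wc, Wo = params["Wf"], params["Wi"], params["Wc"], params["Wo"]
    Uf, Ui, Uc, Uo = params["Uf"], params["Ui"], params["Uc"], params["Uo"]
    Vf, Vi, Vo = params["Vf"], params["Vi"], params["Vo"]
    bf, bi, bc, bo = params["bf"], params["bi"], params["bc"], params["bo"]
    T, M = x.shape[0], Wf.shape[0]
    h, c = np.zeros(M), np.zeros(M)
    H = np.zeros((T, M))
    for t in range(T):
        xt = np.nan_to_num(x[t])                      # zero-fill missing inputs
        f = sigmoid(Wf @ xt + Uf @ h + Vf * c + bf)   # forget gate
        i = sigmoid(Wi @ xt + Ui @ h + Vi * c + bi)   # input gate
        z = np.tanh(Wc @ xt + Uc @ h + bc)            # input modulation
        c = f * c + i * z                             # cell state (CEC)
        o = sigmoid(Wo @ xt + Uo @ h + Vo * c + bo)   # output gate peeks at c_t
        h = o * np.tanh(c)                            # hidden output
        H[t] = h
    return H

def masked_loss(x, Y, S, J):
    """L2 loss for one subject, normalized by the beta factors.

    x: (T, N) inputs; Y, S: (T, M) predictions/targets; NaN = missing.
    beta_x = J*|x_j|/(T*N) and beta_m = |y_j(m)| rescale each residual so
    that every node contributes proportionally to its available data;
    residuals at missing targets are set to zero.
    """
    T, N = x.shape
    beta_x = J * np.count_nonzero(~np.isnan(x)) / (T * N)
    avail = ~np.isnan(S)
    beta_m = avail.sum(axis=0).clip(min=1)    # avoid 0/0; such residuals are 0
    resid = np.where(avail, Y - np.nan_to_num(S), 0.0)
    return 0.5 * np.sum(resid**2 / (beta_x * beta_m))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N, M, J = 11, 6, 6, 32                 # hypothetical sizes
    shapes = {"Wf": (M, N), "Wi": (M, N), "Wc": (M, N), "Wo": (M, N),
              "Uf": (M, M), "Ui": (M, M), "Uc": (M, M), "Uo": (M, M),
              "Vf": M, "Vi": M, "Vo": M, "bf": M, "bi": M, "bc": M, "bo": M}
    params = {k: rng.uniform(-0.1, 0.1, s) for k, s in shapes.items()}
    x = rng.standard_normal((T, N))
    x[rng.random((T, N)) < 0.3] = np.nan      # simulate ~30% missing inputs
    S = rng.standard_normal((T, M))
    S[rng.random((T, M)) < 0.3] = np.nan      # and missing targets
    print("normalized loss:", masked_loss(x, lstm_forward(x, params), S, J))
```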
To complete the backward pass, if $\theta \in \{f, i, z, o\}$ and $\phi \in \{f, i\}$, the gradients of the loss function with respect to the weights are calculated as

$$
\begin{aligned}
\delta W_\theta(n) &= \sum_{j=1}^{J} \frac{1}{\beta_j^n} \, \delta\theta_j^{\{1 \rightarrow T\}} x_j^{\{1 \rightarrow T\}}(n)\,, \\
\delta U_\theta &= \sum_{j=1}^{J} \delta\theta_j^{\{1 \rightarrow T\}} h_j^{\{0 \rightarrow T-1\}}\,, \\
\delta V_\phi &= \sum_{j=1}^{J} \sum_{t=0}^{T-1} \delta\phi_j^{t+1} \odot c_j^{t}\,, \\
\delta V_o &= \sum_{j=1}^{J} \sum_{t=1}^{T} \delta o_j^{t} \odot c_j^{t}\,, \\
\delta b_\theta &= \sum_{j=1}^{J} \sum_{t=1}^{T} \delta\theta_j^{t}\,,
\end{aligned}
$$

where $\beta_j^n = \frac{|x_j(n)|}{T}$ is the normalization factor handling missing input values, and $|x_j(n)|$ is the number of available input time points in the $n$-th node.

As an efficient iterative algorithm, momentum batch gradient descent is applied to find a local minimum of the loss function calculated over a batch while speeding up convergence. The update rule can be written as

$$
\begin{aligned}
\vartheta_{\mathrm{new}} &= \mu \vartheta_{\mathrm{old}} - \alpha (\delta\omega + \gamma \omega_{\mathrm{old}})\,, \\
\omega_{\mathrm{new}} &= \omega_{\mathrm{old}} + \vartheta_{\mathrm{new}}\,,
\end{aligned}
$$

where $\vartheta$ is the weight update initialized to zero, $\omega$ is the to-be-updated weight array, $\delta\omega$ is the gradient of the loss function with respect to $\omega$, and $\alpha$, $\gamma$, and $\mu$ are the learning rate, weight decay (regularization) factor, and momentum weight, respectively.
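As a concrete illustration of this update rule, here is a minimal sketch in the same NumPy style; the hyperparameter values in the usage comment are hypothetical stand-ins, not the tuned values from the paper.

```python
import numpy as np

def momentum_step(w, dw, v, alpha, gamma, mu):
    """One momentum batch-gradient-descent step with weight decay.

    w: weight array, dw: gradient of the batch loss w.r.t. w, v: running
    update (initialized to zeros), alpha: learning rate, gamma: weight
    decay, mu: momentum weight.  Returns the updated (w, v).
    """
    v = mu * v - alpha * (dw + gamma * w)  # v_new = mu*v_old - alpha*(dw + gamma*w_old)
    return w + v, v                        # w_new = w_old + v_new

# Apply the same rule to every weight array after each batch, e.g.:
#   for k in params:
#       params[k], velocity[k] = momentum_step(params[k], grads[k], velocity[k],
#                                              alpha=0.01, gamma=1e-4, mu=0.9)
```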
We utilize the dataset from The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) challenge [17] (https://tadpole.grand-challenge.org) for DPM using the LSTM network. The dataset is composed of data from the three ADNI phases: ADNI 1, ADNI GO, and ADNI 2. This includes roughly 1,500 biomarkers acquired from 1,737 subjects (957 males and 780 females) during 12,741 visits at 22 distinct time points between 2003 and 2017. Table 1 summarizes the demographic statistics of the TADPOLE dataset. Note that the subjects have missing measurements during their visits, and not all of them are clinically labeled.

Table 1: Demographics of the TADPOLE dataset per diagnostic group (CN, MCI, AD): number of visits and age (mean ± SD, years) for males and females, and education (mean ± SD, years).

In this work, we have merged the existing groups labeled as cognitively normal (CN), significant memory concern (SMC), and normal (NL) under CN; mild cognitive impairment (MCI), early MCI (EMCI), and late MCI (LMCI) under MCI; and Alzheimer's disease (AD) and Dementia under AD. Moreover, groups with labels converting from one status to another, e.g., "MCI-to-AD", are assumed to belong to the next status ("AD" in this example).

MRI biomarkers are used for AD progression modeling. These include T1-weighted brain MRI volumes of the ventricles, hippocampus, whole brain, fusiform, middle temporal gyrus, and entorhinal cortex. We normalize the MRI measurements with respect to the corresponding intracranial volume (ICV). Out of the 22 visits, we select 11 visits, including baseline, with a fixed interval of one year to span the majority of measurements and subjects. Next, we filter data outliers based on the specified range of each biomarker and normalize the measurements to the range $[-1, 1]$. Finally, subjects with fewer than three distinct visits for any biomarker are removed, leaving 742 subjects. This ensures that at least two visits are available per biomarker for performing sequence learning through the feedforward step, plus an additional visit for backpropagation.

For evaluation purposes, we partition the entire dataset into three non-overlapping subsets for training, validation, and testing. To achieve this, we randomly select 10% of the within-class subjects for validation and 10% for testing.

Mean absolute error (MAE) and multi-class area under the receiver operating characteristic (ROC) curve (AUC) are used to assess the modeling and classification performance, respectively. MAE measures the accuracy of continuous prediction per biomarker by computing the difference between actual and estimated values as follows:
$$
\mathrm{MAE} = \frac{1}{I} \sum_{j,t} \bigl| y_j^t - s_j^t \bigr|\,,
$$

where $s_j^t$ and $y_j^t$ are the ground-truth and estimated values of the specific biomarker for the $j$-th subject at the $t$-th visit, respectively, and $I$ is the number of existing points in the target array $s$.

Multi-class AUC [18], on the other hand, is a measure for examining the diagnostic performance in a multi-class test set using ROC analysis. It can be calculated from the posterior probabilities as

$$
\mathrm{AUC} = \frac{1}{n_c (n_c - 1)} \sum_{i=1}^{n_c - 1} \sum_{k=i+1}^{n_c} \frac{1}{n_i n_k} \Bigl[ SR_i - \frac{n_i (n_i + 1)}{2} + SR_k - \frac{n_k (n_k + 1)}{2} \Bigr]\,,
$$

where $n_c$ is the number of distinct classes, $n_i$ denotes the number of available points belonging to the $i$-th class, and $SR_i$ is the sum of the ranks of the posteriors $p(c_i \mid s_i)$ after sorting all concatenated posteriors $\{p(c_i \mid s_i), p(c_i \mid s_k)\}$ in increasing order, where $s_i$ and $s_k$ are vectors of scores belonging to the true classes $c_i$ and $c_k$, respectively.
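Both metrics are straightforward to implement directly from these formulas. Below is a small NumPy/SciPy sketch, written for this text rather than taken from the authors' MATLAB code, of the masked MAE and the Hand-and-Till multi-class AUC; it assumes the columns of the posterior matrix follow the sorted class labels.

```python
import numpy as np
from scipy.stats import rankdata

def masked_mae(Y, S):
    """MAE over the I target entries that exist (NaN in S = missing)."""
    avail = ~np.isnan(S)
    return np.abs(Y[avail] - S[avail]).mean()

def multiclass_auc(posteriors, labels):
    """Hand-and-Till multi-class AUC [18] from posterior probabilities.

    posteriors: (n_samples, n_classes), columns ordered as np.unique(labels);
    labels: true class per sample.  Averages the pairwise separability over
    all class pairs, exactly as in the formula above.
    """
    classes = np.unique(labels)
    n_c = len(classes)
    total = 0.0
    for a in range(n_c - 1):
        for b in range(a + 1, n_c):
            pair = (labels == classes[a]) | (labels == classes[b])
            is_i = labels[pair] == classes[a]
            n_i, n_k = is_i.sum(), (~is_i).sum()
            # SR_i: rank sum of class-i samples when sorting by p(c_i | s)
            SR_i = rankdata(posteriors[pair, a])[is_i].sum()
            # SR_k: rank sum of class-k samples when sorting by p(c_k | s)
            SR_k = rankdata(posteriors[pair, b])[~is_i].sum()
            total += (SR_i - n_i * (n_i + 1) / 2
                      + SR_k - n_k * (n_k + 1) / 2) / (n_i * n_k)
    return total / (n_c * (n_c - 1))
```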
All the evaluated methods in this study are developed in-house in MATLAB R2017b and run on a 2.80 GHz CPU with 16 GB RAM. We initialize the LSTM network weights by generating uniformly distributed random values in the range [−0.…, 0.…] and set the weight updates and weight gradients to zero. We set the batch size to the number of available training subjects. Furthermore, for simplicity, we use the first ten visits to estimate the second to eleventh visits per subject and use the estimated values for evaluation. Finally, we train the network using the feedforward pass and the proposed method of backpropagation through time, where the network replaces the missing input values, and the output errors corresponding to missing target values, with zero.

We utilize the validation set to tune the network optimization parameters, each time adjusting one of the parameters while keeping the rest fixed, to achieve the lowest average MAE. Peephole connections are used in the network as they tend to improve the performance. Based on these strategies, the optimal parameters are obtained as α = 0.…, µ = 0.…, and γ = 0.… with 1,000 epochs. The corresponding MAEs for the validation set are also calculated as … × 10^…, … × 10^…, … × 10^…, … × 10^…, … × 10^…, and … × 10^…, respectively, for the ventricles, hippocampus, whole brain, entorhinal cortex, fusiform, and middle temporal gyrus. Moreover, training and validation take about 340 seconds and 0.025 seconds, respectively. It is worth mentioning that all the estimated biomarker measurements are transformed back to their actual ranges when calculating MAEs.

After successfully training our LSTM network, we examine it using the obtained test subset. Next, we train the network using mean imputation (LSTM-Mean) [11] and forward imputation (LSTM-Forward) [10]. Moreover, we use the parametric, regression-based method of [3] to model AD progression. Table 2 compares the test modeling performance (MAE) of the MRI biomarkers using the aforementioned approaches. As can be deduced from Table 2, our proposed method outperforms all other modeling techniques in all categories.

Table 2: Test modeling performance (MAE) of the MRI biomarkers using different DPM methods.

                        Proposed    LSTM-Mean [11]   LSTM-Forward [10]   Jedynak et al. [3]
Ventricles              … × 10^…    … × 10^…         … × 10^…            … × 10^…
Hippocampus             … × 10^…    … × 10^…         … × 10^…            … × 10^…
Whole brain             … × 10^…    … × 10^…         … × 10^…            … × 10^…
Entorhinal cortex       … × 10^…    … × 10^…         … × 10^…            … × 10^…
Fusiform                … × 10^…    … × 10^…         … × 10^…            … × 10^…
Middle temporal gyrus   … × 10^…    … × 10^…         … × 10^…            … × 10^…

Table 3: Test diagnostic performance (AUC) of the MRI biomarkers using LDA with different DPM methods.

                        Proposed   LSTM-Mean [11]   LSTM-Forward [10]   Jedynak et al. [3]
CN vs. MCI              0.5914     0.5838           0.5800              0.5468
CN vs. AD               0.9029     0.8404           0.8150              0.7826
MCI vs. AD              0.7844     0.6936           0.6890              0.7330
CN vs. MCI vs. AD       0.7596     0.7059           0.6947              0.6875
It should be noted that when we apply data imputation, the backpropagation formulas simply generalize to those of the standard LSTM network.

To assess the ability of the estimated biomarker measurements to predict the clinical labels, we apply a linear discriminant analysis (LDA) classifier to the multi-dimensional training-data estimates and compute the posterior probability scores on the test data, as sketched at the end of this section. The obtained scores are then used to calculate the AUCs. The diagnostic prediction results for the test set are shown in Table 3 for the utilized methods. As can be seen, the proposed method outperforms all other schemes in predicting the clinical status of subjects per visit. This, in turn, reveals the effect of modeling on classification performance. One could of course use different classifiers to improve the results, but our focus in this paper is on DPM, i.e., sequence-to-sequence learning. On the other hand, it is possible to train the LSTM network for a classification (sequence-to-label) problem. However, since this approach requires labeled data, it would only be able to use a subset of the utilized data in training.

Furthermore, the diagnostic classification results of the predicted MRI biomarker measurements using the proposed approach are comparable to state-of-the-art cross-sectional MRI-based classification results in the recent challenge on Computer-Aided Diagnosis of Dementia (CADDementia) [19]. To be more specific, LDA classification of the features predicted by the proposed method achieves a multi-class AUC of 0.76, which is within the top five multi-class AUCs in the challenge, ranging from 0.79 to 0.75.
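For reference, the LDA step can be reproduced with any standard implementation. The sketch below uses scikit-learn as a stand-in for the in-house MATLAB classifier and reuses the multiclass_auc helper from the metrics sketch; the predicted-biomarker variable names are hypothetical.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_posteriors(train_feats, train_labels, test_feats):
    """Fit LDA on LSTM-predicted biomarkers, return test posterior scores.

    train_feats/test_feats: per-visit predicted values of the six MRI
    biomarkers; train_labels: clinical status per visit (CN/MCI/AD).
    """
    lda = LinearDiscriminantAnalysis()
    lda.fit(train_feats, train_labels)
    return lda.predict_proba(test_feats)   # inputs to the multi-class AUC

# posteriors = lda_posteriors(train_feats, train_labels, test_feats)
# auc = multiclass_auc(posteriors, test_labels)
```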
In this paper, a training algorithm was proposed for LSTM networks aiming to improve robustness against missing data, and the robustly trained LSTM network was applied to AD progression modeling using longitudinal measurements of imaging biomarkers. To the best of our knowledge, this is the first time RNNs have been studied and applied for DPM within the ADNI cohort. The proposed training method demonstrated better performance than using imputation prior to a standard LSTM network and outperformed an established parametric, regression-based DPM method, in terms of both biomarker prediction and subsequent diagnostic classification.

Moreover, the classification results using the predicted MRI measurements of the proposed method are comparable to those of the CADDementia challenge. It should, however, be noted that there are important differences between this study and the CADDementia challenge. Firstly, this work has the advantage of training and testing on features from the same cohort, whereas the CADDementia algorithms were applied to classify data from independent cohorts. Secondly, the top performing CADDementia algorithms incorporated different types of MRI features besides volumetry. Thirdly, in contrast to CADDementia, where features were completely available, this work predicts features based on longitudinal data before classification.

This study highlights the potential of RNNs for modeling the progression of AD using longitudinal measurements, provided that proper care is taken to handle missing values and time intervals. In general, standard LSTM networks are designed to handle sequences with a fixed temporal or spatial sampling rate within longitudinal data. We used the same approach in the AD progression modeling application by disregarding, for example, visiting months 3, 6, and 18, and confining the experiments to yearly follow-up in the ADNI data. However, one could utilize modified LSTM architectures such as the time-aware LSTM [20] to address irregular time steps in longitudinal patient records.
Acknowledgments
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 721820. This work uses the TADPOLE data sets (https://tadpole.grand-challenge.org) constructed by the EuroPOND consortium (http://europond.eu), funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 666992.
References

[1] Guy McKhann, David Drachman, Marshall Folstein, Robert Katzman, Donald Price, and Emanuel M. Stadlan. Clinical diagnosis of Alzheimer's disease. Neurology, 34(7):939–939, 1984.
[2] Neil P. Oxtoby and Daniel C. Alexander. Imaging plus X: multimodal models of neurodegenerative disease. Current Opinion in Neurology, 30(4):371, 2017.
[3] Bruno M. Jedynak, Andrew Lang, Bo Liu, Elyse Katz, Yanwei Zhang, Bradley T. Wyman, David Raunig, C. Pierre Jedynak, Brian Caffo, and Jerry L. Prince. A computational neurodegenerative disease progression score: method and results with the Alzheimer's Disease Neuroimaging Initiative cohort. NeuroImage, 63(3):1478–1486, 2012.
[4] Anders M. Fjell, Lars T. Westlye, Håkon Grydeland, Inge Amlien, Thomas Espeseth, Ivar Reinvang, Naftali Raz, Dominic Holland, Anders M. Dale, and Kristine B. Walhovd. Critical ages in the life course of the adult brain: nonlinear subcortical aging. Neurobiology of Aging, 34(10):2239–2247, 2013.
[5] Neil P. Oxtoby, Alexandra L. Young, Nick C. Fox, Pankaj Daga, David M. Cash, Sebastien Ourselin, Jonathan M. Schott, and Daniel C. Alexander. Learning imaging biomarker trajectories from noisy Alzheimer's disease data using a Bayesian multilevel model. In Bayesian and grAphical Models for Biomedical Imaging, pages 85–94, 2014.
[6] Michael C. Donohue, Helene Jacqmin-Gadda, Mélanie Le Goff, Ronald G. Thomas, Rema Raman, Anthony C. Gamst, Laurel A. Beckett, Clifford R. Jack, Michael W. Weiner, Jean-Francois Dartigues, and Paul S. Aisen. Estimating long-term multivariate progression from short-term data. Alzheimer's & Dementia: the Journal of the Alzheimer's Association, 10(5):S400–S410, 2014.
[7] Wai-Ying Wendy Yau, Dana L. Tudorascu, Eric M. McDade, Snezana Ikonomovic, Jeffrey A. James, Davneet Minhas, Wenzhu Mowrey, Lei K. Sheu, Beth E. Snitz, Lisa Weissfeld, et al. Longitudinal assessment of neuroimaging and clinical markers in autosomal dominant Alzheimer's disease: a prospective cohort study. The Lancet Neurology, 14(8):804–813, 2015.
[8] Ricardo Guerrero, Alexander Schmidt-Richberg, Christian Ledig, Tong Tong, Robin Wolz, and Daniel Rueckert. Instantiated mixed effects modeling of Alzheimer's disease markers. NeuroImage, 142:113–125, 2016.
[9] Barak A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.
[10] Zachary C. Lipton, David C. Kale, and Randall Wetzel. Modeling missing data in clinical time series with RNNs. Machine Learning for Healthcare, 2016.
[11] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. arXiv:1606.01865, 2016.
[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[13] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: continual prediction with LSTM. In Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN 99), volume 2, pages 850–855, 1999.
[14] Felix A. Gers and Jürgen Schmidhuber. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340, 2001.
[15] Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.
[16] Ronald Carl Petersen, P.S. Aisen, L.A. Beckett, M.C. Donohue, A.C. Gamst, D.J. Harvey, C.R. Jack, W.J. Jagust, L.M. Shaw, A.W. Toga, J.Q. Trojanowski, and M.W. Weiner. Alzheimer's Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurology, 74(3):201–209, 2010.
[17] Razvan V. Marinescu, Neil P. Oxtoby, Alexandra L. Young, Esther E. Bron, Arthur W. Toga, Michael W. Weiner, Frederik Barkhof, Nick C. Fox, Stefan Klein, and Daniel C. Alexander. TADPOLE challenge: prediction of longitudinal evolution in Alzheimer's disease. arXiv:1805.03909, 2018.
[18] David J. Hand and Robert J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2):171–186, 2001.
[19] Esther E. Bron, Marion Smits, Wiesje M. van der Flier, Hugo Vrenken, Frederik Barkhof, Philip Scheltens, Janne M. Papma, Rebecca M.E. Steketee, Carolina Méndez Orellana, Rozanna Meijboom, et al. Standardized evaluation of algorithms for computer-aided diagnosis of dementia based on structural MRI: the CADDementia challenge. NeuroImage, 111:562–579, 2015.
[20] Inci M. Baytas, Cao Xiao, Xi Zhang, Fei Wang, Anil K. Jain, and Jiayu Zhou. Patient subtyping via time-aware LSTM networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 65–74, 2017.