Predicting Parkinson's Disease using Latent Information extracted from Deep Neural Networks
Ilianna Kollia, Andreas-Georgios Stafylopatis, Stefanos Kollias
Ilianna Kollia
Big Data & Analytics Center, IBM Hellas
Athens, Greece
[email protected]

Andreas-Georgios Stafylopatis
School of Electrical & Computer Engineering, National Technical University of Athens
Athens, Greece
[email protected]

Stefanos Kollias
School of Computer Science, University of Lincoln
Lincoln, United Kingdom
[email protected]
Abstract—This paper presents a new method for medical diagnosis of neurodegenerative diseases, such as Parkinson's, by extracting and using latent information from trained Deep convolutional, or convolutional-recurrent, Neural Networks (DNNs). In particular, our approach adopts a combination of transfer learning, k-means clustering and k-Nearest Neighbour classification of deep neural network learned representations to provide enriched prediction of the disease based on MRI and/or DaT Scan data. A new loss function is introduced and used in the training of the DNNs, so as to perform adaptation of the generated learned representations between data from different medical environments. Results are presented using a recently published database of Parkinson's related information, which was generated and evaluated in a hospital environment.

Index Terms—latent variable information, deep convolutional and recurrent neural networks, transfer learning and domain adaptation, modified loss function, prediction, Parkinson's disease, MRI, DaT Scan data.
I. INTRODUCTION
Machine learning techniques have been largely used in medical signal and image analysis for prediction of neurodegenerative disorders, such as Alzheimer's and Parkinson's, which significantly affect elderly people, especially in developed countries [1], [2], [3].

In the last few years, the development of deep learning technologies has boosted the investigation of using deep neural networks for early prediction of the above-mentioned neurodegenerative disorders. In [4], stacked auto-encoders were used for diagnosis of Alzheimer's disease. 3-D Convolutional Neural Networks (CNNs) were used in [5] to analyze imaging data for Alzheimer's diagnosis. Both methods were based on the Alzheimer's Disease Neuroimaging Initiative dataset, which includes medical images and assessments of several hundred subjects. Recently, CNN and convolutional-recurrent neural network (CNN-RNN) architectures have been developed for prediction of Parkinson's disease [6], based on a new database including Magnetic Resonance Imaging (MRI) data and Dopamine Transporters (DaT) Scans from patients with Parkinson's and non patients [7].

In this paper we focus on the early prediction of Parkinson's. It is these two types of medical image data, i.e., MRI and DaT Scans, that we explore for predicting an asymptomatic (healthy) status, or the stage of Parkinson's at which a subject appears to be. In particular, MRI data show the internal structure of the brain, using magnetic fields and radio waves. An atrophy of the lentiform and caudate nucleus can be detected in MRI data of patients with Parkinson's. DaT Scans are a specific form of single-photon emission computed tomography, using Ioflupane Iodide-123 to detect lack of dopamine in patients' brains.

In this paper we base our developments on the deep neural network (DNN) structures (CNNs, CNN-RNNs) developed in [6] for predicting Parkinson's using MRI, or DaT Scan, or combined MRI & DaT Scan data from the recently developed Parkinson's database [7].
We extend these developments by extracting latent variable information from the DNNs trained with MRI & DaT Scan data and generating clusters of this information; these are evaluated by medical experts with reference to the corresponding status/stage of Parkinson's. The generated and medically annotated cluster centroids are then used in three different scenarios of major medical significance:

1) Transparently predicting a new subject's status/stage of Parkinson's; this is performed using nearest neighbour classification of the new subject's MRI and DaT Scan data with reference to the cluster centroids and the respective medical annotations.

2) Retraining the DNNs with the new subjects' data, without forgetting the current medical cluster annotations; this is performed by treating the retraining as a constrained optimization problem and using a gradient projection training algorithm instead of the usual gradient descent method.

3) Transferring the learning achieved by DNNs fed with MRI & DaT Scan data to medical centers that only possess MRI information about subjects, thus improving their prediction capabilities; this is performed through a domain adaptation methodology, in which a new error criterion is introduced that includes the above-derived cluster centroids as desired outputs during training.

Section II describes related work where machine learning techniques have been applied to MRI and DaT Scan data for detecting Parkinson's. The new Parkinson's database we are using in this paper is also described in this section. Section III first describes the extraction of latent variable information from trained deep neural networks and then presents the proposed approach in the framework of the three considered testing, transfer learning and domain adaptation scenarios. Section IV provides the experimental evaluation, which illustrates the performance of the proposed approach using an augmented version of the Parkinson's database, which we also make publicly available. Conclusions and future work are presented in Section V.

II. RELATED WORK
Medical image data constitute a rich source of information regarding cell degeneration in the human nervous system of Parkinson's patients. MRI and DaT Scan data have been the focus of related research; in [8], principal component analysis and support vector machines were applied to MRI data, while the same techniques and empirical mode decomposition were applied to DaT Scans in [9].

A Parkinson's database comprising MRI and DaT Scan data from 78 subjects, 55 patients with Parkinson's and 23 non patients, has recently been released [7]; it includes, in total, 41,528 MRI images (31,147 from patients and 10,381 from non patients) and 925 DaT Scans (595 and 330 respectively). Our developments below are based on this database.

CNN architectures [10], [11] include convolutional, pooling and fully connected layers, in which convolutional kernel and fully connected layer weights are usually learned through gradient descent, while pooling layers reduce the input sizes through averaging operations. CNN-RNN architectures [11], [12] are capable of effectively analyzing temporal variations of the inputs, by permitting intra-layer connections and using appropriate gating operations.

Recent advances in deep neural networks [11], [13], [14], [15] have been explored in [6], where convolutional (CNN) and convolutional-recurrent (CNN-RNN) neural networks were developed and trained to classify the information in the above Parkinson's database into two categories, i.e., patients and non patients, based on either MRI inputs, or DaT Scan inputs, or combined MRI and DaT Scan inputs.

DaT Scans, which are a specific examination for Parkinson's, generally convey more information than MRI; however, using both inputs can provide better prediction performance. The developed networks included: transfer learning of the ResNet-50 network [16] as far as the convolutional part of the networks was concerned, with retraining of the fully connected network layers; and, on top of this, a recurrent network using Gated Recurrent Units (GRU) [17], trained in an end-to-end manner.

In this paper we focus first on the analysis of the combined MRI and DaT Scan dataset. It should be mentioned that the target in Parkinson's disease detection through MRI data is the estimation of the volume of the lentiform and of the capita of the caudate nucleus. To deal with volume estimation, we analyse MRIs in triplets of consecutive frames. Thus, an MRI triplet of (gray-scale) images and a DaT Scan (colour) image constitute the input to the CNN and/or CNN-RNN architectures that we use in our developments. Fig. 1 shows such a triplet of consecutive frames from an MRI sequence and a corresponding DaT Scan image.

Fig. 1: An MRI triplet of consecutive frames and a corresponding DaT Scan

Section III.A presents the methodology used to extract latent variables from the trained DNNs and to achieve diagnosis of Parkinson's. Section III.B describes the approach for retraining the DNNs with new information, while preserving the already extracted information. In Section III.C we examine DNN-based analysis of only MRI input triplets and show how this analysis can be improved by adaptation of the latent variable information extracted from the DNNs trained with both MRI and DaT Scan data.

III. THE PROPOSED APPROACH

A. Extracting Latent Variables from Trained Deep Neural Networks
The proposed approach begins with training a CNN, or a CNN-RNN architecture, on the (train) dataset of MRI and DaT Scan data. The CNN networks include a convolutional part and one or more Fully Connected (FC) layers, using neurons with a ReLU activation function. In the CNN-RNN case, these are followed by a recurrent part, including one or more hidden layers, composed of GRU neurons.

We then focus on the neuron outputs in the last FC layer (CNN case), or in the last RNN hidden layer (CNN-RNN case). These latent variables, extracted from the trained DNNs, represent the higher level information through which the networks produce their predictions, i.e., whether the input information indicates that the subject is a patient, or not.

In particular, let us consider the following dataset for training the DNN to predict Parkinson's:

P = {(x(j), d(j)); j = 1, ..., n}    (1)

and the corresponding test dataset:

Q = {(x̃(j), d̃(j)); j = 1, ..., m}    (2)

where x(j) and d(j) represent the n network training inputs (each of which consists of an MRI triplet and a DaT Scan) and respective desired outputs (with a binary value 0/1, where 0 represents a non patient and 1 represents a patient case); x̃(j) and d̃(j) similarly represent the m network test inputs and respective desired outputs.

After training the deep neural network using dataset P, the outputs of the l neurons in its final FC, or hidden, layer, {r(j)} and {r̃(j)}, both in R^l, are extracted as latent variables, obtained through forward propagation of each image, forming the train set R_p and test set R_q respectively:

R_p = {r(j); j = 1, ..., n}    (3)

and

R_q = {r̃(j); j = 1, ..., m}    (4)

The following clustering procedure is then implemented on the {r(j)} in R_p. We generate a set of clusters T = {t_1, ..., t_k} by minimizing the within-cluster squared L2 norms:

T̂ = arg min_T Σ_{j=1}^{k} Σ_{r ∈ t_j} ||r − μ_j||²    (5)

where μ_j is the mean value of the data in cluster j. This is done using the k-means++ [18] algorithm, with the first cluster centroid u(1) being selected at random from R_p. The class label of a given cluster is simply the mode class of the data points within it.

As a consequence, we generate a set of cluster centroids, representing the different types of input data included in our train set P:

U = {u(j); j = 1, ..., k}    (6)

Through medical evaluation of the MRI and DaT Scan images corresponding to the cluster centroids, we can annotate each cluster according to the stage of Parkinson's that its centroid represents.

By computing the Euclidean distances between the test data in R_q and the cluster centroids in U, and by then using the nearest neighbour criterion, we can assign each one of the test data to a specific cluster and evaluate the obtained classification (disease prediction) performance. This is an alternative to the prediction obtained when the trained DNN is applied to the test data.

This alternative prediction is, however, of great significance: in the case of a non-annotated new subject's data, selecting the nearest cluster centroid in U can be a transparent way of diagnosing the subject's Parkinson's stage, with the available MRI and DaT Scan data and related medical annotations of the cluster centroids being compared to the new subject's data.

B. Retraining of Deep Neural Networks with Annotated Latent Variables
Whenever new data, either from patients or from non patients, are collected, they should be used to extend the knowledge already acquired by the DNN, by adapting its weights to the new data. In such a case, let us assume that a new train dataset, say P_1, usually of small size, say s, is generated and that an updated DNN should be created based on this dataset as well.

There are different methods, developed in the framework of transfer learning [19], for training a new DNN on P_1 using the structure and weights of the above-described DNN. However, a major problem is that of catastrophic forgetting, i.e., the fact that the DNN forgets some formerly learned information when fine-tuning to the new data. This can lead to loss of annotations related to the latent variables extracted from the formerly trained DNN. To avoid this, we propose the following DNN adaptation method, which preserves annotated latent variables.

For simplicity of presentation, let us consider a CNN architecture, in which we keep the convolutional and pooling layers fixed and retrain the FC and output layers. Let W be a vector including the weights of the FC and output network layers of the original network, before retraining, and let W′ denote the new (updated) weight vector, obtained through retraining. Let us also denote by w and w′, respectively, the weights connecting the outputs of the last FC layer, defined as r in Eq. (3), to the network outputs, y.

During retraining, the new network weights, W′, are computed by minimizing the following error criterion:

E = E_P1 + λ · E_P    (7)

where E_P1 represents the misclassifications made in P_1, which includes the new data, and E_P represents the misclassifications in P, which includes the old information. λ is used to differentiate the focus between the new and old data. In the following we make the hypothesis that a small change of the weights W is enough to achieve good classification performance in the current conditions.
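As a concrete toy illustration of the criterion in Eq. (7), the following NumPy sketch uses squared-error stand-ins for the two misclassification terms; all sizes and values are illustrative assumptions, not the paper's data or code:

```python
import numpy as np

# Toy sketch of the retraining criterion of Eq. (7):
# E = E_P1 + lambda * E_P, with mean-squared-error stand-ins for the
# misclassification terms. All values are illustrative, not real data.
rng = np.random.default_rng(0)
d_new = rng.integers(0, 2, size=5).astype(float)   # desired outputs on P_1 (new data)
y_new = rng.random(5)                              # network outputs on P_1
d_old = rng.integers(0, 2, size=50).astype(float)  # desired outputs on P (old data)
y_old = rng.random(50)                             # network outputs on P

lam = 0.1                                          # lambda < 1 shifts the focus to the new data
E_P1 = np.mean((d_new - y_new) ** 2)
E_P = np.mean((d_old - y_old) ** 2)
E = E_P1 + lam * E_P                               # Eq. (7)
```

Choosing a small λ, as here, weights the new data P_1 more heavily, which matches the constraint-based treatment of E_P1 in the derivation that follows.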
Consequently, we get:

W′ = W + ΔW    (8)

and, in the output layer case:

w′ = w + Δw    (9)

in which ΔW and Δw denote small weight increments. Under this formulation, we can apply a first-order Taylor series expansion to make the neurons' activation linear.

Let us now give more attention to the new data in P_1. We can do this by expressing E_P1 in Eq. (7) in terms of the following constraint:

y′(j) = d(j);  j = 1, ..., s    (10)

which requests that the new network outputs and the desired outputs are identical.

Moreover, to preserve the formerly extracted latent variables, we move the input data corresponding to the annotated cluster centroids in U from dataset P to P_1. Consequently, Eq. (10) includes these inputs as well; the size of P_1 becomes:

s′ = s + k    (11)

where k is the number of clusters in U.

Let the difference of the retrained network output y′ from the original one, y, be:

Δy(j) = y′(j) − y(j)    (12)

Expressing the output y′ as a weighted average of the last FC layer outputs r′ with the w′ weights, we get [6]:

y′(j) = y(j) + f′_h · [w · Δr(j) + Δw · r(j)]    (13)

where f′_h denotes the derivative of the former DNN output layer's neurons' activation function. Inserting Eq. (10) into Eq. (13) results in:

d(j) − y(j) = f′_h · [w · Δr(j) + Δw · r(j)]    (14)

All terms in Eq. (14) are known, except the differences in weights, Δw, and in last FC neuron outputs, Δr. As a consequence, Eq. (14) can be used to compute the new DNN weights of the output layer in terms of the neuron outputs of the last FC layer.

If there is more than one FC layer, we apply the same procedure, i.e., we linearize the difference of the r′ iteratively through the previous FC layers and express the Δr in terms of the weight differences in these layers.
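Eq. (14) is linear in the unknown weight increments; collecting such relations over all s′ inputs yields an underdetermined linear system, with fewer equations than unknowns. The following NumPy sketch shows one way to pick the minimum-norm solution, i.e., the smallest weight change, via the pseudo-inverse; the paper itself uses gradient projection, and all sizes here are illustrative assumptions:

```python
import numpy as np

# Sketch: solve v = V @ dW for the weight increments dW, where the
# number of equations (s') is smaller than the number of unknowns.
# The Moore-Penrose pseudo-inverse returns the minimum-norm exact
# solution here, matching the "minimal modification of the original
# weights" choice. Sizes and values are illustrative stand-ins.
rng = np.random.default_rng(0)
s_prime, n_w = 6, 24                  # s' equations, n_w weight increments
V = rng.normal(size=(s_prime, n_w))   # built from the original DNN's weights
v = rng.normal(size=s_prime)          # v(j) = d(j) - y(j)

dW = np.linalg.pinv(V) @ v            # minimum-norm dW satisfying V @ dW = v
```

Any other solution differs from dW by a null-space component of V and therefore has a larger norm, which is why the pseudo-inverse serves as a simple stand-in for the minimal-change criterion.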
When reaching the convolutional/pooling layers, where no retraining is to be performed, the procedure ends, since the respective Δr is zero.

It can be shown, similarly to [6], that the weight updates ΔW are finally estimated through the solution of a set of linear equations defined on P_1:

v = V · ΔW    (15)

where the matrix V includes weights of the original DNN and the vector v is defined as follows:

v(j) = d(j) − y(j);  j = 1, ..., s′    (16)

with y(j) denoting the output of the original DNN applied to the data in P_1.

Similarly to [6], the size of v is lower than the size of ΔW; many methods exist, therefore, for solving Eq. (15). Following the assumption made at the beginning of this section, we choose the solution that provides minimal modification of the original DNN weights. This is the one that provides the minimum change in the value of E in Eq. (7).

Summarizing, the targeted adaptation can be solved as a nonlinear constrained optimization problem, minimizing Eq. (7), subject to Eq. (10) and the selection of minimal weight increments. In our implementation, we use the gradient projection method [20] for computing the network weight updates and, consequently, the adapted DNN architecture.

C. Domain Adaptation of Deep Neural Networks through Annotated Latent Variables
In the two previous subsections we have focused on the generation, based on extraction of latent variables from a trained DNN, and use of cluster centroids for prediction and adaptation of a Parkinson's diagnosis system. To do this, we have considered all available imaging information, consisting of MRI and DaT Scan data.

However, in many cases, especially in general purpose medical centers, DaT Scan equipment may not be available, whilst access to MRI technology exists. In the following we present a domain adaptation methodology, using the annotated latent variables extracted from the originally trained DNN, to improve the prediction of Parkinson's achieved when using only MRI input data. A new DNN training loss function is used to achieve this target.

Let us consider the following train and test datasets, similar to P and Q in Eq. (1) and Eq. (2) respectively, in which the input consists only of triplets of MRI data:

P′ = {(x′(j), d′(j)); j = 1, ..., n′}    (17)

and

Q′ = {(x̃′(j), d̃′(j)); j = 1, ..., m′}    (18)

where x′(j) and d′(j) represent the n′ network training inputs (each of which consists of only an MRI triplet) and respective desired outputs (with a binary value 0/1, where 0 represents a non patient and 1 represents a patient case); x̃′(j) and d̃′(j) similarly represent the m′ network test inputs and respective desired outputs.

Using P′, we train a similar DNN structure, as in the full MRI and DaT Scan case, producing the following set of l-dimensional neuron output vectors in its last FC, or hidden, layer:

R′_p = {r′(j); j = 1, ..., n′}    (19)

with the dimension of each r′ vector being l, as in the original DNN's last FC, or hidden, layer.

As far as the r′ outputs are concerned, it would be desirable for these latent variables to lie closer, e.g., according to the mean squared error criterion, to one of the cluster centroids in Eq. (6) that belongs to the same category (patient/non patient) as them. In this way, training the DNN with only MRI inputs would also bring its output y′ closer to the one generated by the original DNN; this would potentially improve the network's performance, towards the much better one produced by the original DNN (trained with both MRI and DaT Scan data).

Let us compute the Euclidean distances between the latent variables in R′_p and the cluster centroids in U, as defined in Eq. (6). Using the nearest neighbour criterion, we can define a set of desired vector values for the r′ latent variables, with respect to the k cluster centroids, as follows:

Z_p = {z(i, j); i = 1, ..., k; j = 1, ..., n′}    (20)

where z(i, j) is equal to 1 in the case of the cluster centroid u(i) that was selected as closest to r′(j) during the above-described procedure, and equal to 0 in the case of the rest of the cluster centroids.

In the following, we introduce the z(i, j) values in a modified error criterion to be used in DNN learning to correctly classify the MRI inputs. Normally, the DNN (CNN, or CNN-RNN) training is performed through minimization of the error criterion in Eq. (21) in terms of the DNN weights:

E = (1/n′) Σ_{j=1}^{n′} (d′(j) − y′(j))²    (21)

where y′ and d′ denote the actual and desired network outputs and n′ is equal to the number of all MRI input triplets.

We propose a modified error criterion, introducing an additional term, using the following definitions:

g(i, j) = u(i) − r′(j);  i = 1, ..., k; j = 1, ..., n′    (22)

and

G(i, j) = g(i, j) · (g(i, j))ᵀ    (23)

with T indicating the transpose operator.

It is desirable that the G(i, j) term whose respective z(i, j) value equals one is minimized, whilst the G(i, j) values corresponding to the rest of the z(i, j) values, which are equal to zero, are maximized. Similarly to [21], we pass G(i, j) through a softmax function f and subtract its output from 1, so as to obtain the above-described respective minimum and maximum values.

The generated loss function is expressed in terms of the differences of the transformed G(i, j) values from the corresponding desired responses z(i, j), as follows:

E_1 = (1/(kn′)) Σ_{i=1}^{k} Σ_{j=1}^{n′} (z(i, j) − [1 − f(G(i, j))])²    (24)

calculated on the n′ data and the k cluster centroids.

In general, our target is to minimize Eq. (21) and Eq. (24) together. We can achieve this using the following loss function:

E_new = η · E + (1 − η) · E_1    (25)

where η is chosen in the interval [0, 1]. Using a value of η towards zero gives more importance to the introduced centroids of the clusters of the latent variables extracted from the best performing DNN, trained with both MRI and DaT Scan data. On the contrary, a value towards one leads to normal error criterion minimization.

IV. EXPERIMENTAL EVALUATION
In this section we present a variety of experiments evaluating the proposed approach. The implementation of all algorithms described in the previous section has been performed in Python using the TensorFlow library.
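As a self-contained illustration of the nearest-centroid targets and the modified loss of Eqs. (20)-(25), the following NumPy sketch uses random stand-ins for the latents and centroids, with illustrative sizes; the sign convention inside the softmax is our assumption, chosen so that the nearest centroid receives the smallest softmax value, as the text requires:

```python
import numpy as np

# Illustrative NumPy sketch of the modified loss, Eqs. (20)-(25).
# u: annotated cluster centroids (Eq. 6); r: latents r'(j) of the
# MRI-only network (Eq. 19). All data here are random stand-ins.
rng = np.random.default_rng(0)
k, n, l = 5, 8, 16
u = rng.normal(size=(k, l))                      # cluster centroids
r = rng.normal(size=(n, l))                      # latent vectors r'(j)
d = rng.integers(0, 2, size=n).astype(float)     # desired outputs d'(j)
y = rng.random(n)                                # actual outputs y'(j)

G = ((r[:, None, :] - u[None, :, :]) ** 2).sum(axis=2)  # G(i,j): squared distances, Eq. (23)
z = np.eye(k)[G.argmin(axis=1)]                  # one-hot nearest-centroid targets, Eq. (20)

Gs = G - G.max(axis=1, keepdims=True)            # numerically stable softmax over centroids
f = np.exp(Gs) / np.exp(Gs).sum(axis=1, keepdims=True)

E = np.mean((d - y) ** 2)                        # classification term, Eq. (21)
E1 = np.mean((z - (1.0 - f)) ** 2)               # centroid term, Eq. (24)
eta = 0.5
E_new = eta * E + (1 - eta) * E1                 # combined loss, Eq. (25)
```

Since the nearest centroid has the smallest G(i, j), its softmax value f(G(i, j)) is the smallest in its row, so 1 − f is largest exactly where z(i, j) = 1, matching the stated minimization/maximization targets.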
A. The Parkinson’s Dataset
The data used in our experiments come from the Parkinson's database described in Section II. For training the CNN and CNN-RNN networks, we performed an augmentation procedure on the train dataset, as follows. After forming all triplets of consecutive MRI frames, we generated combinations of these image triplets with each one of the DaT Scans in each category (patients, non patients).

Consequently, we created a dataset of 66,176 training inputs, each of them consisting of 3 MRI and 1 DaT Scan images. In the test dataset, which referred to different subjects than the train dataset, we made this combination per subject; this created 1,130 test inputs.

For possible reproduction of our experiments, both the training and test datasets, each split in two folders (patients and non patients), are available upon request from the mlearn.lincoln.ac.uk web site.
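The augmentation step above, pairing every triplet of consecutive MRI frames with every DaT Scan of the same category, can be sketched as follows; the file names are hypothetical placeholders:

```python
from itertools import product

# Sketch of the described augmentation: all consecutive-MRI-frame
# triplets are combined with each DaT Scan of the same category.
# File names below are hypothetical placeholders, not dataset files.
mri_frames = [f"mri_{i:03d}.png" for i in range(5)]   # one subject's MRI sequence
dat_scans = ["dat_a.png", "dat_b.png"]                # DaT Scans in the same category

triplets = [tuple(mri_frames[i:i + 3]) for i in range(len(mri_frames) - 2)]
inputs = list(product(triplets, dat_scans))           # each input: (3 MRI frames, 1 DaT Scan)
```

With F frames and D DaT Scans per category, this yields (F − 2) · D combined inputs, which is how a modest number of images grows into the 66,176 training inputs reported above.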
B. Testing the Proposed Approach for Parkinson's Prediction
We used the DNN structures described in [6], including both CNN and CNN-RNN architectures, to perform Parkinson's diagnosis, using the train and test data of the above-described database. The convolutional and pooling part of the architectures was based on the ResNet-50 structure; GRU units were used in the RNN part of the CNN-RNN architecture. The best performing CNN and CNN-RNN structures, when trained with both MRI and DaT Scan data, are presented in Table I.

It is evident that the CNN-RNN architecture was able to provide excellent prediction results on the database test set. We, therefore, focus on this architecture for extracting latent variables. For comparison purposes, it can be mentioned that the performance of a similar CNN-RNN architecture, when trained only with MRI inputs, was about 70%.

It can be seen from Table I that the number l of neurons in the last FC layer of the CNN-RNN architecture was 128. This is, therefore, the dimension of the vectors r extracted as in Eq. (3) and used in the cluster generation procedure of Eq. (5).

We then implemented this cluster generation procedure, as described in the previous section. The k-means algorithm provided five clusters of the data in the 128-dimensional space. Fig. 2 depicts a 3-D visualization of the five cluster centroids; stars in blue colour denote the two centroids corresponding to non patient data, while squares in red colour represent the three cluster centroids corresponding to patient data.

With the aid of medical experts, we generated annotations of the images (3 MRI and 1 DaT Scan) corresponding to the 5 cluster centroids. It was very interesting to discover that these centroids represent different levels of Parkinson's evolution. Since the DaT Scans conveyed the major part of this discrimination, we show in Fig. 3 the DaT Scans corresponding to each one of the cluster centroids.

According to the provided medical annotation, the 1st centroid (t1) corresponds to a typical non patient case.
The 2nd centroid (t2) represents a non patient case as well, but with some findings that seem to be pathological. Moving to the patient cases, the 3rd centroid (t3) shows an early stage of Parkinson's, in stage 1 to stage 2, while the 4th centroid (t4) denotes a typical Parkinson's case, in stage 2. Finally, the 5th centroid (t5) represents an advanced stage of Parkinson's, in stage 3.

TABLE I: DNN best performing structures on DaT Scan and MRI data

Structure | No of FC Layers | No of Hidden Layers | No of Units in FC Layer(s) | No of Units in Hidden Layers | Accuracy (%)
CNN       | 2               | -                   | 2622-1500                  | -                            | 94%
CNN-RNN   | 1               | 2                   | 1500                       | 128-128                      | 98%

Fig. 2: The five cluster centroids in 3-D; 2 of them (stars, blue colour) depict non patients and 3 of them (squares, red colour) represent patients

It is interesting to note here that, although the DNN was trained to classify input data in two categories, patients and non patients, by extracting and clustering the latent variables we were able to generate a richer representation of the diagnosis problem in five categories. It should be mentioned that the purity of each generated cluster was almost perfect.

Table II shows the percentages of training data included in each one of the five generated clusters. It should be mentioned that almost two thirds of the data belong to clusters 2 and 3, i.e., to the categories which are close to the borderline between patients and non patients. These cases require major attention by the medical experts and the proposed procedure can be very helpful for the diagnosis of such subjects' cases.

TABLE II: Training data in each generated cluster (percentages per cluster t1-t5) [values illegible in the source]

We tested this procedure on the Parkinson's test dataset, by computing the Euclidean distances of the corresponding extracted latent variables from the 5 cluster centroids and by classifying them to the closest centroid.

TABLE III: Classification of 6 subjects' data in clusters t1-t5 [only partially legible in the source; surviving entries: Non Patient 1: 44, 17, 20; Patient 3: 3, 0, 18, 38; Patient 4: 0, 0, 0, 8]

Table III shows the number of test data referring to six different subjects that were classified to each cluster. All non patient cases were correctly classified. In the patient cases, the great majority of the data of each patient were correctly classified to one of the respective centroids. In the small number of misclassifications, the disease symptoms were not so evident. However, based on the large majority of correct classifications, the subject would certainly attract the necessary interest from the medical expert.

We next examined the ability of the above-described DNN to be retrained using the procedure described in Subsection III.B. In the developed scenario, we split the above test data in two parts: we included 3 of them (Non Patient 2, Patient 2 and Patient 3) in the retraining dataset P_1 and left the other 3 subjects in the new test dataset. The size s′ of P_1 was equal to 493 inputs, including the five inputs corresponding to the cluster centroids in U; the size of the new test set was equal to 642 inputs.

We applied the proposed procedure to minimize the error over all train data in P_1 and P, focusing more on P_1, as described by Eq. (10).

The network managed to learn and correctly classify all 493 P_1 inputs, including the inputs corresponding to the cluster centroids, with a minimal degradation of its performance on the P input data. We then applied the retrained network to the test dataset consisting of three subjects. In this case, there was also a slight improvement, since the performance was raised to 98.91%, compared to the corresponding performance on the same three subjects' data, shown in Table III, which was 98.44%.

Table IV shows the clusters to which the newly extracted latent variables r̃ were classified.
A comparison with the corresponding results in Table III shows the differences produced through retraining.

TABLE IV: Classification of 3 subjects' data, after retraining, in clusters t1-t5 [only partially legible in the source; surviving entry: Non Patient 1: 41]

We finally examined the performance of the domain adaptation approach that was presented in Subsection III.C. We started by training the CNN-RNN network with only the MRI triplets in P′ as inputs. The obtained performance when the trained network was applied to the test set Q′ was only 70.6%.

(a) Centroid t1 (b) Centroid t2 (c) Centroid t3 (d) Centroid t4 (e) Centroid t5
Fig. 3: The DaT Scans corresponding to the five cluster centroids t1-t5

For illustration of the proposed developments, we extracted the r′ latent variables from this trained network and classified them to a set of respectively extracted cluster centroids. Table V presents the results of this classification task, which is consistent with the acquired DNN performance. It can be seen that the MRI information leads the DNN prediction towards the patient class, which indeed contained more samples in the train dataset. Most errors were made in the non patient class (subjects 1 and 2).

TABLE V: MRI-based Classification of 6 subjects' data in clusters t1-t5 [one entry per row is illegible in the source; surviving entries per row:]
Non Patient 1: 74, 179, 8, 0
Non Patient 2: 14, 4, 33, 5
Patient 1: 16, 0, 49, 2
Patient 2: 6, 0, 80, 15
Patient 3: 26, 3, 35, 10
Patient 4: 12, 0, 11, 6

We then examined the ability of the proposed approach to train the CNN-RNN network using the modified loss function, with various values of η; here we present the case of a value equal to 0.5. The obtained performance when the trained network was applied to the test set Q′ was raised to 81.1%. To illustrate this improvement, we also extracted the r′ latent variables from this trained network and classified them to one of the five annotated original cluster centroids in U.

TABLE VI: MRI-based Classification of 6 subjects' data, after domain adaptation, in clusters t1-t5 [one entry per row is illegible in the source; surviving entries per row:]
Non Patient 1: 147, 114, 5, 0
Non Patient 2: 13, 25, 18, 3
Patient 1: 13, 0, 35, 2
Patient 2: 5, 0, 54, 9
Patient 3: 20, 2, 34, 8
Patient 4: 9, 0, 31, 5

Table VI presents the results of this classification task. It is evident that minimization of the modified loss function managed to force the extracted latent variables to get closer to cluster centroids belonging to the correct class for Parkinson's diagnosis.

V. CONCLUSIONS AND FUTURE WORK
The paper proposed a new approach for extracting latent variables from trained DNNs, in particular CNN and CNN-RNN architectures, and using them in a clustering and nearest-neighbor classification method for achieving high performance and transparency in Parkinson's diagnosis. We have used augmentation of the MRI and DaT Scan data in a recent Parkinson's database and provide the resulting datasets upon request from mlearn.lincoln.ac.uk.

A DNN retraining procedure was presented, which is able to preserve the knowledge provided by annotated, formerly extracted clustered latent variables. Moreover, a domain adaptation approach has been developed, which is able to use the extracted clustered latent variable information to improve the performance of the DNN architecture when presented with less input (only MRI) data.

An experimental study has been presented, using the above datasets, which illustrates the ability of the proposed approach to achieve high performance.

Future work will be based on a close collaboration of the National Technical University of Athens and the University of Lincoln with IBM, particularly relating the presented research to the IBM Watson Health initiative. The target will be the generation of novel performance-aware and transparent systems for better diagnosis of neurodegenerative diseases like Parkinson's, based on a combination of MRI and other images, epidemiological data, historical data of treatments, and clinical data.

ACKNOWLEDGMENT
The authors wish to thank the Department of Neurology of the Georgios Gennimatas General Hospital in Athens, Greece, and particularly Dr Georgios Tagaris, for the creation and provision of the main Parkinson's dataset and for his collaboration in the evaluation of the results of the performed analysis.

REFERENCES

[1] P. Sajda, "Machine learning for detection and diagnosis of disease," Annual Review of Biomedical Engineering, vol. 8, pp. 537-565, 2006.
[2] R. Cuingnet, E. Gerardin, J. Tessieras, G. Auzias, et al., "Automatic classification of patients with Alzheimer's disease from structural MRI: a comparison of ten methods using the ADNI database," NeuroImage, vol. 56, no. 2, pp. 766-781, 2011.
[3] R. Das, "A comparison of multiple classification methods for diagnosis of Parkinson disease," Expert Systems with Applications, vol. 37, no. 2, pp. 1568-1572, 2010.
[4] S. Liu, W. Cai, S. Pujol, et al., "Early diagnosis of Alzheimer's disease with deep learning," in Proc. IEEE International Symposium on Biomedical Imaging (ISBI), pp. 1015-1018, 2014.
[5] R. Li, W. Zhang, H. Suk, et al., "Deep learning based imaging data completion for improved brain disease diagnosis," in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 305-312, 2014.
[6] D. Kollias, A. Tagaris, A. Stafylopatis, S. Kollias, and G. Tagaris, "Deep neural architectures for prediction in healthcare," Complex & Intelligent Systems, vol. 4, no. 2, pp. 119-131, 2018.
[7] A. Tagaris, D. Kollias, A. Stafylopatis, G. Tagaris, and S. Kollias, "Machine learning for neurodegenerative disorder diagnosis - survey of practices and launch of benchmark dataset," International Journal on Artificial Intelligence Tools, vol. 27, no. 3, 2018.
[8] C. Salvatore, A. Cerasa, I. Castiglioni, et al., "Machine learning on brain MRI data for differential diagnosis of Parkinson's disease," Journal of Neuroscience Methods, vol. 222, pp. 230-237, 2014.
[9] A. Rojas, J. Gorriz, J. Ramirez, et al., "Application of empirical mode decomposition on DaTSCAN SPECT images to explore Parkinson disease," Expert Systems with Applications, vol. 40, no. 7, pp. 2756-2766, 2013.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[12] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: encoder-decoder approaches," arXiv:1409.1259, 2014.
[13] D. Kollias, M. A. Nicolaou, I. Kotsia, G. Zhao, and S. Zafeiriou, "Recognition of affect in the wild using deep neural networks," in Proc. IEEE CVPR Workshops, pp. 1972-1979, 2017.
[14] D. Kollias, M. Yu, A. Tagaris, G. Leontidis, A. Stafylopatis, and S. Kollias, "Adaptation and contextualization of deep neural network models," in Proc. IEEE Symposium Series on Computational Intelligence (SSCI), Nov. 2017.
[15] F. Caliva, F. D. S. Ribeiro, A. Mylonakis, C. Demaziere, P. Vinai, G. Leontidis, and S. Kollias, "A deep learning approach to anomaly detection in nuclear reactors," in Proc. International Joint Conference on Neural Networks (IJCNN), July 2018.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE CVPR, pp. 770-778, 2016.
[17] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in NIPS 2014 Workshop on Deep Learning, 2014.
[18] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proc. 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007.
[19] H. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, "Deep learning for emotion recognition on small datasets using transfer learning," in Proc. ACM International Conference on Multimodal Interaction (ICMI), 2015.
[20] J. Rosen, "The gradient projection method for nonlinear programming," Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 1, pp. 181-217, 1960.
[21] D. Kollias and S. P. Zafeiriou, "Training deep neural networks with different datasets in-the-wild: the emotion recognition paradigm," in Proc. International Joint Conference on Neural Networks (IJCNN), 2018.