G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification
Sayan Ghosal, Qiang Chen, Giulio Pergola, Aaron L. Goldman, William Ulrich, Karen F. Berman, Giuseppe Blasi, Leonardo Fazio, Antonio Rampino, Alessandro Bertolino, Daniel R. Weinberger, Venkata S. Mattay, Archana Venkataraman
Department of Electrical and Computer Engineering, Johns Hopkins University, USA
Lieber Institute for Brain Development, USA
Clinical and Translational Neuroscience Branch, NIMH, NIH, USA
Department of Basic Medical Sciences, Neuroscience and Sense Organs, University of Bari Aldo Moro, Italy
Azienda Ospedaliero-Universitaria Consorziale Policlinico, Bari, Italy
ABSTRACT
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers. Our model consists of an encoder, a decoder, and a classifier. The encoder learns a non-linear subspace shared between the input data modalities. The classifier and the decoder act as regularizers to ensure that the low-dimensional encoding captures predictive differences between patients and controls. We use a learnable dropout layer to extract interpretable biomarkers from the data, and our unique training strategy can easily accommodate missing data modalities across subjects. We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data. Using 10-fold cross-validation, we demonstrate that our model achieves better classification accuracy than baseline methods, and that this performance generalizes to a second dataset collected at a different site. In an exploratory analysis, we further show that the biomarkers identified by our model are closely associated with the well-documented deficits in schizophrenia.
Keywords:
Deep Neural Networks, Learnable Dropout, Imaging-Genetics, Schizophrenia
1. INTRODUCTION
Neuropsychiatric disorders, such as autism and schizophrenia, are typically characterized by cognitive and behavioral deficits. At the same time, these diseases show high genetic heritability, which suggests an important link between genotypic variations and the observed phenotypic traits. Understanding this relationship might lead to targeted biomarkers and, eventually, better therapeutics. Functional MRI (fMRI) and Single Nucleotide Polymorphism (SNP) data are non-invasive and commonly used modalities that capture brain activity and genetic variation, respectively. However, integrating them in a single framework is hard due to their inherent complexity, high data dimensionality, and our limited knowledge about the underlying relationships.

Imaging-genetics has become an increasingly popular field of study to link these modalities. Data-driven methods can be grouped into three categories. The first category uses multivariate regularized regression to model the effect of genetic variations on brain activity [4, 5]. These methods rely on sparsity to identify an interpretable set of biomarkers; however, they do not incorporate clinical diagnosis, meaning that the biomarkers may not align with predictive group differences. The second category uses correlation analysis to identify associations between genetic variations and quantitative traits [6-8]. However, the representations are rarely guided by the clinical factors, and it is not clear how they can be extended to accommodate more than two data modalities. Finally, the recent works of [9, 10] use probabilistic modelling and dictionary learning, respectively, to integrate imaging, genetics, and diagnosis. The generative nature of these methods makes it harder to integrate additional data modalities. However, the field is moving towards multimodal imaging acquisitions to capture different snapshots of the brain, all of which may be linked to the genotype. Another limitation is that none of the above methods can handle the problem of missing data. With the growing emphasis on big datasets comes the challenge of missing data modalities. Traditionally, missing data has been managed by removing subjects from the analysis, which does not make use of all the information. In this paper we introduce a novel model and training strategy that accommodates these missing modalities to maximize the size and utility of the dataset.

Figure 1: G-MIND architecture. The inputs {i_n^1, i_n^2} and g_n correspond to the two imaging modalities and the genetic data, respectively. E_m(·) and D_m(·) capture the encoding and decoding operations, and Y(·) captures the classification operation. z_n^m is the learnable dropout mask, and ℓ_n is the low-dimensional latent space.

The above limitations have motivated our use of deep learning, and specifically the autoencoder architecture. First, the autoencoder provides a natural way to integrate new data modalities simply by adding new encoder-decoder branches. Mathematically, a new branch introduces another term to the loss function but does not alter the optimization procedure (e.g., backpropagating gradients). Second, missing data can easily be handled by freezing the affected part of the network and updating the remaining weights. This simplicity is in stark contrast to the classical methods, where the entire model and optimization procedure must be changed for each new modality and missing-data configuration.
Third, the latent encoding provides a data-driven feature space that can be used for patient/control classification. Again, this is in contrast to classical approaches, which are highly dependent on hand-crafted features. Finally, the classifier part of our model guides the autoencoder to extract clinically interpretable features that are representative of the disease.

In this paper we introduce the Genetic and Multimodal Imaging data using Neural-network Designs (G-MIND) framework to identify predictive biomarkers from neuroimaging and genetics data for disease diagnosis. We use a coupled autoencoder and classifier to learn a shared latent space among all the input modalities that is representative of the population differences between patients and controls. We also incorporate a learnable dropout layer by which the model selects a random subset of input features to pass to the encoder. The feature importances are captured in the probability of dropout, which is learned via backpropagation. We evaluate G-MIND on a study of schizophrenia that includes two task fMRI paradigms and SNP data. Our method achieves better classification performance than standard baselines and also identifies clinically relevant biomarkers. We further applied G-MIND to a cross-site dataset, without any fine-tuning, to show the transferability of our model.

2. THE MULTI-MODAL ENCODER-DECODER FRAMEWORK

Fig. 1 illustrates our full model. The inputs i_n^1 and i_n^2 denote the input imaging modalities for subject n. In our case i_n^1 and i_n^2 are activation maps from two different fMRI paradigms. The input g_n represents the SNP genotype, and y_n is a binary class label (patient or control). Let N_1, N_2, and N_g denote the number of subjects from whom we have the corresponding imaging or genetic modality. Let R be the total number of ROIs in the brain, and G the total number of SNPs. The imaging data has dimensionality i_n^1, i_n^2 ∈ R^{R×1}, and the genetic data has dimensionality g_n ∈ R^{G×1}.
We jointly model the imaging and genetic modalities using an autoencoder framework. The first layer of the encoder incorporates a learnable dropout parameterized by p_m for each modality m. We use the resulting low-dimensional representation ℓ_n for subject classification.

Standard Bernoulli dropout independently drops nodes with a fixed probability. The Bayesian interpretation of dropout bears a close resemblance to Bayesian feature selection; however, the user must fix the dropout probability a priori. Here we wish to learn these values, so we reparameterize the Bernoulli dropout mask with a Gumbel-Softmax distribution. This continuous relaxation of the Bernoulli random variable enables us to update the dropout probabilities while training the network. During each forward pass through the network, we sample random variables z_n^{i1}, z_n^{i2} ∈ R^{R×1} and z_n^g ∈ R^{G×1} for the imaging and genetic data, respectively, from a Gumbel-Softmax distribution and use them as dropout masks for subject n:

z_n^m = σ( ( log(p_m) − log(1 − p_m) + log(u_n^m) − log(1 − u_n^m) ) / t )    (1)

where u_n^m is a random vector sampled from Uniform(0, 1), t (temperature) controls the extent of relaxation from the Bernoulli distribution, and p_m captures the probabilities with which the features of modality m are selected. As seen in Eq. (1), when the probability p_{mk} is close to 1, that feature will be selected most of the time, as compared to a feature whose probability is close to 0. We further incorporate a sparsity penalty over the probabilities p_{mk} via the KL divergence KL(Ber(q) || Ber(p_{mk})), where q is a hyperparameter fixed to a small value (0.001 in our experiments) that drives p_{mk} towards zero for all features.

The encoder learns a nonlinear latent space that is shared between all the data modalities. As shown in Fig. 1, we encode the data following the dropout using a cascade of fully connected layers followed by a PReLU activation.
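The sampling step in Eq. (1) can be sketched in NumPy as follows. `concrete_dropout_mask` is a hypothetical helper name, and in G-MIND `p` would be a learnable network parameter updated by backpropagation rather than a fixed array:

```python
import numpy as np

def concrete_dropout_mask(p, t=0.1, rng=None):
    """Sample a relaxed Bernoulli (Gumbel-Softmax) dropout mask, Eq. (1).

    p   : per-feature keep probabilities (learnable in G-MIND).
    t   : temperature; smaller t pushes the mask toward hard 0/1 values.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Uniform(0, 1) noise, clipped away from the endpoints for numerical safety.
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(p))
    logits = np.log(p) - np.log(1 - p) + np.log(u) - np.log(1 - u)
    return 1.0 / (1.0 + np.exp(-logits / t))  # sigmoid relaxation

# Features with keep probability near 1 are almost always retained,
# and those near 0 almost always suppressed.
p = np.array([0.99, 0.5, 0.01])
mask = concrete_dropout_mask(p)
```

Because the sigmoid is differentiable in `p`, gradients can flow through the sampled mask, which is what lets the dropout probabilities be learned during training.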
Unlike standard autoencoder-based networks, we couple the low-dimensional representations of each data modality to leverage the common structure shared between them. The latent embedding ℓ_n is computed as

ℓ_n = (1 / M_n) ( E_1(i_n^1, z_n^{i1}) + E_2(i_n^2, z_n^{i2}) + E_g(g_n, z_n^g) )    (2)

Here E_m(·) represents the encoding operation for modality m, and M_n is the number of modalities present for subject n. As seen in Eq. (2), our latent representation is the sum of the individual projections, scaled by the amount of available data M_n. This fusion strategy encourages the latent encoding for an individual patient to have a consistent scale, even when constructed using a subset of the modalities.

The decoder reconstructs the data from the latent representation to ensure that the encoder preserves sufficient information about the inputs. We use fully connected layers along with PReLU, dropout, and batch normalization for decoding. Mathematically, the autoencoder loss is the ℓ2 norm between the input and its reconstruction:

Σ_{n=1}^{N_1} || i_n^1 − D_1(ℓ_n) ||_2^2 + Σ_{n=1}^{N_2} || i_n^2 − D_2(ℓ_n) ||_2^2 + Σ_{n=1}^{N_g} || g_n − D_g(ℓ_n) ||_2^2

where D_m(·) is the decoding operation for modality m.

Figure 2: Information flow during the forward pass (green) and backward pass (red) when an imaging modality is absent.

Table 1: The number of subjects present for each modality from the two institutions. Note that the SDMT task was not acquired at BARI.

Institution | NBack | SDMT | SNP
LIBD        | 160   | 110  | 210
BARI        | 97    | —    | 97

The final piece of our network is a classifier for disease prediction, which encourages the dropout mask and latent embeddings to select discriminative features from the data.
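The fusion rule in Eq. (2) reduces to averaging whichever per-modality encodings are available. A minimal sketch (the encoders themselves are elided; `fuse_latents` is a hypothetical helper, with `None` standing in for a missing modality):

```python
import numpy as np

def fuse_latents(encodings):
    """Average the available per-modality encodings, as in Eq. (2).

    encodings : list of latent vectors (np.ndarray), with None marking a
    missing modality; the divisor M_n is the number of modalities present.
    """
    present = [e for e in encodings if e is not None]
    return sum(present) / len(present)

# Subject with both imaging modalities but no genetic data (M_n = 2):
l_n = fuse_latents([np.ones(4), 3 * np.ones(4), None])  # -> [2., 2., 2., 2.]
```

Dividing by the number of present modalities, rather than a fixed constant, is what keeps ℓ_n on a consistent scale regardless of how many branches contribute.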
We employ fully connected layers and a cross-entropy loss for classification: − Σ_{n=1}^{N} ( y_n log(ŷ_n) + (1 − y_n) log(1 − ŷ_n) ), where y_n is the original class label and ŷ_n is the predicted class label.

Our combined G-MIND objective function can be written as follows:

L(i^1, i^2, g) = λ_1 Σ_{n=1}^{N_1} || i_n^1 − D_1(ℓ_n) ||_2^2 + λ_2 Σ_{n=1}^{N_2} || i_n^2 − D_2(ℓ_n) ||_2^2 + λ_3 Σ_{n=1}^{N_g} || g_n − D_g(ℓ_n) ||_2^2
        − λ_4 Σ_{n=1}^{N} ( y_n log(ŷ_n) + (1 − y_n) log(1 − ŷ_n) ) + λ_5 Σ_m Σ_k KL( Ber(q) || Ber(p_{mk}) )    (3)

where N is the total number of subjects. The parameters {λ_1, λ_2, λ_3} control the contributions of the data reconstruction errors, λ_4 controls the contribution of the classification error, and λ_5 regularizes the sparsity on p_m.

The summation in Eq. (3) enables G-MIND to handle missing data. For example, if i_n^2 is not available for subject n, then the gradients with respect to encoder E_2(·) and decoder D_2(·) will be zero. As illustrated in Fig. 2, information will flow into and out of the latent space through the other network branches, and subject n will only be used to update those parameters.

During training, we learn the encoder, decoder, and classifier weights, along with the probabilistic masks p_m, by minimizing Eq. (3). We then threshold the probabilistic mask, p̂_m = (p_m > τ_m), to select the most important features for reconstruction and classification. When testing on a new subject, we premultiply the available modalities by the thresholded dropout mask, i.e., î_n^1 = i_n^1 ⊗ p̂_1. The masked input î_n^1 is sent through the encoder and the classifier for diagnosis. We do not use the learned dropout procedure during testing, since different samples of z_n^m may lead to different diagnoses, whereas our goal is to obtain a deterministic label for each subject.
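The deterministic test-time procedure described above can be sketched as follows; `masked_prediction`, and the toy encoder and classifier passed into it, are illustrative stand-ins for the trained G-MIND branches:

```python
import numpy as np

def masked_prediction(x, p, tau, encoder, classifier):
    """Deterministic test-time path: threshold the learned keep
    probabilities (p-hat = p > tau) and gate the input before encoding,
    instead of sampling a stochastic dropout mask."""
    mask = (p > tau).astype(x.dtype)
    return classifier(encoder(x * mask))

# Toy example: identity encoder, sum-of-features "classifier".
p = np.array([0.9, 0.05, 0.8])        # learned keep probabilities
x = np.array([1.0, 1.0, 1.0])         # one subject's input features
yhat = masked_prediction(x, p, 0.1, lambda v: v, lambda v: v.sum())
# Only the two features with p > 0.1 survive the mask, so yhat == 2.0
```

Thresholding rather than sampling is what makes the test-time label deterministic: the same subject always receives the same diagnosis.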
We set the regularization parameters of our model {λ_1, λ_2, λ_3, λ_4, λ_5} as 10^{−β_i}, where each β_i is selected such that λ_i multiplied by the corresponding loss term lies within the same order of magnitude (1–10). This criterion is intuitive (i.e., equal importance is given to both the imaging and genetic data), and it is not performance driven (i.e., we do not cherry-pick the values to optimize prediction accuracy). The corresponding values for all the experiments are: λ_1 = 0. , λ_2 = 0. , λ_3 = 0. , λ_4 = 0. , and λ_5 = 0.01. We fix the Bernoulli probability to q = 0.001 and the temperature variable to t = 0.1. Based on 10-fold cross-validation results, we fix all detection threshold values to τ_i = 0.1. The architecture of our model (layer sizes and nonlinearities) is shown in Fig. 1.
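One plausible reading of the order-of-magnitude balancing criterion above can be sketched as follows; the helper name and the example loss values are our assumptions, not taken from the paper:

```python
import numpy as np

def balance_weight(loss_value):
    """Pick lambda = 10**(-beta) so that lambda * loss_value lands in [1, 10).

    beta is the order of magnitude of the (unweighted) loss term, so each
    weighted term contributes comparably to the combined objective.
    """
    beta = int(np.floor(np.log10(loss_value)))
    return 10.0 ** (-beta)

# A hypothetical reconstruction term of ~350 gets lambda = 0.01,
# since 0.01 * 350 = 3.5 lies in [1, 10).
lam = balance_weight(350.0)
```

The point of the rule is that no single loss term (imaging reconstruction, genetic reconstruction, classification, or sparsity) dominates the gradient purely because of its raw scale.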
We compare G-MIND to classical machine learning techniques and to architectural variants that omit key features.

• Multimodal Support Vector Machine (SVM):
We construct a linear SVM classifier after concatenating all the data modalities [i^{1T}, i^{2T}, g^T]^T. Notice that this model cannot handle missing data. Therefore, we fit a multivariate regression to impute a missing imaging modality from the available one for each subject. For example, if i_n^2 is absent, we impute it as î_n^2 = β i_n^1, where β is the regression coefficient matrix obtained from the training data. We use a grid search to find the best set of hyperparameters. Notice that this tuning gives the SVM an added advantage over G-MIND.
Canonical correlation analysis (CCA) identifies bi-multivariate associations between imaging and genetics data. This approach is similar to our coupled latent projection, but traditional CCA does not accommodate more than two data modalities. To overcome this, we concatenate the imaging features obtained from the two experimental paradigms and perform CCA with the genetics data. We then construct a random forest classifier on the latent projections. We use the same approach as above for data imputation and hyperparameter selection.
We compare our model to an ANN architecture consisting of only the encoder and the classifier of G-MIND. This comparison shows the importance of using the decoder and the learnable dropout layer.
We compare our model to another ANN architecture comprising only the encoder, the classifier, and the learnable dropout layer. This experiment shows the performance improvement from including a decoder. Based on our 10-fold cross-validation, we fix the learned dropout threshold values to { τ_{i1} = 0. , τ_{i2} = 0. , τ_g = 0. }.
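The regression-based imputation used for the SVM and CCA+RF baselines can be sketched as follows; `fit_imputer` and the toy data are illustrative assumptions (a least-squares fit of one imaging modality from the other on training subjects who have both):

```python
import numpy as np

def fit_imputer(I1, I2):
    """Least-squares coefficient matrix beta so that i2 ~ i1 @ beta,
    estimated from training subjects with both imaging modalities.
    I1: (subjects, d1) available modality; I2: (subjects, d2) target."""
    beta, *_ = np.linalg.lstsq(I1, I2, rcond=None)
    return beta  # impute a missing i2 for a new subject as i1 @ beta

# Noiseless toy data: the linear map is recovered exactly.
rng = np.random.default_rng(0)
I1 = rng.normal(size=(50, 5))
true_beta = rng.normal(size=(5, 3))
I2 = I1 @ true_beta
beta = fit_imputer(I1, I2)
imputed = I1[0] @ beta  # stands in for a subject whose i2 was missing
```

On real data the fit is of course inexact, which is one reason a baseline that depends on imputation is disadvantaged relative to G-MIND's native handling of missing modalities.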
3. EXPERIMENTAL RESULTS

3.1 Data and Preprocessing
Our first dataset includes two task fMRI paradigms and SNP data provided by the Lieber Institute for Brain Development (LIBD) in Baltimore, MD, USA. The first fMRI paradigm is a working memory task (Nback) that alternates between 0-back and 2-back trial blocks. During the 0-back blocks the subjects are asked to press the number shown on the screen, and during the 2-back blocks the subjects are instructed to press the number shown two stimuli previously. The second fMRI paradigm is an event-related simple declarative memory task (SDMT), which involves incidental encoding of complex visual scenes. Our replication dataset includes just the Nback and SNP data acquired at the University of Bari Aldo Moro, Italy (BARI). The distribution of subjects is shown in Table 1.

All fMRI data was acquired on a 3-T General Electric Sigma scanner (EPI, TR/TE = 2000/28 msec; flip angle = 90; field of view = 24 cm; resolution 3.75 mm in the x and y dimensions, and 6 mm in the z dimension for NBack and 5 mm for SDMT). fMRI preprocessing includes slice timing correction, realignment, spatial normalization to an MNI template, smoothing, and motion regression. SPM12 is used to generate activation and contrast maps for each paradigm. We use the Brainnetome atlas to define 246 cortical and subcortical regions. The input to our model is the contrast map over these ROIs.

In parallel, genotyping was done using variate Illumina Bead Chips, including 510K/610K/660K/2.5M. Quality control and imputation were performed using PLINK and IMPUTE2, respectively. The resulting 102K linkage-disequilibrium-independent SNPs are used to calculate the polygenic risk score of schizophrenia via a log-odds ratio of the imputation probability for the reference allele. By thresholding the P-values, we obtain 1242 linkage-disequilibrium-independent SNPs. As a preprocessing step, we remove the effect of age, IQ, and education from the imaging modalities, and we mean-centre all the data modalities.

Method | Sens | Spec | Acc  | AUC
SVM    | 0.66 | 0.47 | 0.58 | 0.55
CCA+RF | 0.15 |      |      |
Table 2: Testing performance of each method on LIBD during 10-fold cross-validation.

Figure 3: Distribution of accuracies of the models trained in all 10 CV folds, when directly evaluated on BARI.
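The confound-removal step described in the preprocessing above (regressing age, IQ, and education out of each imaging feature, then mean-centring) can be sketched as follows; the helper name and toy data are our assumptions:

```python
import numpy as np

def remove_confounds(X, C):
    """Regress the confounds out of each imaging feature and
    mean-centre the residuals.

    X: (subjects, features) imaging data.
    C: (subjects, confounds) e.g. columns for age, IQ, education.
    """
    C1 = np.column_stack([np.ones(len(C)), C])       # add an intercept
    beta, *_ = np.linalg.lstsq(C1, X, rcond=None)    # per-feature fits
    resid = X - C1 @ beta                            # remove fitted effects
    return resid - resid.mean(axis=0)                # mean-centre
```

After this step, no linear combination of the listed confounds can explain variance in the features fed to the model, so group differences are less likely to be driven by demographics.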
Table 2 quantifies the 10-fold testing performance of all the methods on the multimodal LIBD data. We can clearly see that G-MIND achieves the best overall accuracy. Even in the presence of missing data, our multimodal approach successfully extracts from all the data modalities meaningful information that is essential for predicting diagnosis. Our results also show the importance of the decoder and the dropout layer.

To show the generalizability of our method, we trained our model on the LIBD data and tested it, without fine-tuning, on the cross-site dataset from BARI. This experiment captures the transference property of our model. We note that the SDMT task was not acquired at BARI, so the corresponding branch of G-MIND is not used. We evaluate the 10 best models obtained from the 10 different folds in this experiment. Fig. 3 shows the distribution of accuracies of all the models as a boxplot. Here we can see that our method shows the best transference compared to all the baselines. This is an interesting result, as it shows the robustness of our model against acquisition noise and population-specific noise. This performance gain further suggests that the learnable dropout mask can identify a robust set of features that are most predictive of the disease.
Fig. 4 illustrates the most important brain regions as identified by the median concrete dropout probability maps {p_{i1}, p_{i2}} across the 10 validation folds. We further show a more global picture of the high-importance brain regions as a surface plot in Fig. 5. From both Fig. 4 and Fig. 5, for the Nback task we can see regions that include the superior frontal gyrus (SFG) and inferior frontal gyrus (IFG), which are known to subserve executive cognition. Moreover, we can see regions (SFG, IFG) of the dorsolateral prefrontal cortex and regions (SPL, STG) of the posterior parietal cortex that overlap with the fronto-parietal network, which is known to be altered in schizophrenia. Further clusters incorporate components of the default mode network, also implicated in schizophrenia. The SDMT biomarkers implicate the hippocampal, parahippocampal, and superior frontal regions, along with the anteromedial thalamus, which are also affected in schizophrenia. These regions control executive cognition and memory encoding and are also known to be associated with the disorder.

Figure 4: The representative set of brain regions as captured by the dropout probabilities {p_{i1}, p_{i2}}. The color bar denotes the median value across 10 folds.

Figure 5: The surface plot of the brain regions as captured by the dropout probabilities {p_{i1}, p_{i2}}. The color bar denotes the median value across 10 folds. From left to right, the images show the internal surface of the left hemisphere (L-IN), the external surface of the left hemisphere (L-OUT), the internal surface of the right hemisphere (R-IN), and the external surface of the right hemisphere (R-OUT).

We further use Neurosynth to decode the higher-order brain states associated with the Nback and SDMT biomarkers. This analysis allows us to quantitatively compare the selected brain regions with previously published results and gives us a level of association with different brain states as identified by other studies. Fig. 6 shows the Neurosynth terms that are strongly correlated with our biomarkers. We note that the terms associated with the Nback task correspond to recognition and solving, while the brain states for SDMT are associated with emotions and memory encoding. These results provide further evidence that G-MIND can extract potential imaging biomarkers that are highly relevant to the task and the disorder under study.

Figure 6: The level of association with different cognitive states of all the brain regions identified by our model, as found in the Neurosynth database.

Fig. 7 shows the importance map across the 1242 SNPs as computed by the median p_g across the 10 folds. We annotated each SNP based on its overlapping or nearest gene, as found via the SNPnexus web interface. In addition, we ran a gene ontology enrichment analysis of the genes overlapping the top 300 SNPs to identify the enriched biological processes. This enrichment analysis gives us a way to identify the set of over-represented genes in a biological pathway that may have an association with the disease phenotype. Table 3 captures the most significant biological processes implicated by this set of SNPs, which include nervous system development and calcium ion regulation, both known to be strongly associated with schizophrenia.

Figure 7: The median importance map of all the SNPs and their overlapping genes across the 10 folds.

Table 3: The enriched biological processes and their level of significance obtained via GO enrichment analysis.

Biological Process | FDR
Central nervous system development | 0.005
→ Nervous system development | 0.001
→ System development | 0.005
Generation of neurons | 0.005
→ Neurogenesis | 0.004
Regulation of calcium ion transport into cytosol | 0.04
→ Regulation of sequestering of calcium ion | 0.008

In parallel with the Neurosynth analysis, we perform a gene-expression-based analysis over the 10 overlapping (or nearest, if there is no overlap) genes of the top SNPs identified by our analysis. Here we use the GTEx database to identify the set of brain tissues where these genes show high levels of expression. This exploratory analysis may help us understand the cis-effects of the SNPs and how they alter the function of genes expressed in different tissues of the brain. Fig. 8 shows the gene expression pattern of each gene across different brain tissues. Here, LINC00599 shows high expression levels in brain and is also known to be associated with schizophrenia and neuroticism. These findings show that the model can be used to explore potential genetic biomarkers and their interactions in a multivariate framework.

Figure 8: The gene expression pattern of the selected set of genes in different brain tissues, based on the GTEx database. A higher level of gene expression in a brain tissue implies that alteration in that gene may have a stronger effect on those specific brain regions.

4. CONCLUSION
We have presented G-MIND, a novel deep network that integrates multimodal imaging and genetic data for targeted biomarker discovery and class prediction. Our unique use of learnable dropout with a classification module helps us identify discriminative biomarkers of the disease. Our loss function enables us to handle missing modalities while mining all the available information in the dataset. We demonstrate our framework on fMRI and SNP data of schizophrenia patients and controls from two different sites. The improved performance of G-MIND across all the experiments shows the capability of this model to build a comprehensive view of the disorder from the incomplete information obtained from different modalities. We note that our framework can easily be applied to other imaging modalities, such as structural and diffusion MRI, simply by adding autoencoder branches. In future work we will develop a hybrid extension of G-MIND that incorporates pathway-specific information into the deep learning architecture for a better understanding of disease propagation.
Acknowledgements:
This work was supported by NSF CRCNS 1822575 and the National Institute of Mental Health extramural research program.
Previous Submission
This work has not been published in any other venue.
REFERENCES
1. J. Zihl, G. Grön, and A. Brunnauer, "Cognitive Deficits in Schizophrenia and Affective Disorders: Evidence for A Final Common Pathway Disorder," Acta Psychiatrica Scandinavica (5), pp. 351–357, 1998.
2. C. Lee et al., "Prevention of Schizophrenia: Can It be Achieved?," CNS Drugs (3), pp. 193–206, 2005.
3. S. Erk et al., "Functional Neuroimaging Effects of Recently Discovered Genetic Risk Loci for Schizophrenia and Polygenic Risk Profile in Five RDoC Subdomains," Translational Psychiatry (1), p. e997, 2017.
4. Wang et al., "Identifying Quantitative Trait Loci via Group-sparse Multitask Regression and Feature Selection: An Imaging Genetics Study of The ADNI Cohort," Bioinformatics (2), pp. 229–237, 2012.
5. J. Liu et al., "A Review of Multivariate Analyses in Imaging Genetics," Frontiers in Neuroinformatics, p. 29, 2014.
6. E. C. Chi et al., "Imaging Genetics via Sparse Canonical Correlation Analysis," Proceedings of the IEEE International Symposium on Biomedical Imaging, pp. 740–743, 2013.
7. S. A. Meda et al., "A Large Scale Multivariate Parallel ICA Method Reveals Novel Imaging-genetic Relationships for Alzheimer's Disease in The ADNI Cohort," NeuroImage (3), pp. 1608–1621, 2012.
8. G. Pergola et al., "DRD2 Co-expression Network and a Related Polygenic Index Predict Imaging, Behavioral and Clinical Phenotypes Linked to Schizophrenia," Translational Psychiatry (1), p. e1006, 2017.
9. N. K. Batmanghelich et al., "Probabilistic Modeling of Imaging, Genetics and Diagnosis," IEEE Transactions on Medical Imaging (7), pp. 1765–1779, 2016.
10. S. Ghosal et al., "Bridging Imaging, Genetics, and Diagnosis in a Coupled Low-dimensional Framework," in MICCAI: Medical Image Computing and Computer Assisted Intervention, pp. 647–655, Springer, 2019.
11. H. Kang, "The Prevention and Handling of the Missing Data," Korean Journal of Anesthesiology (5), pp. 402–406, 2013.
12. A. Ben Said, A. Mohamed, T. Elfouly, K. Harras, and Z. J. Wang, "Multimodal Deep Learning Approach for Joint EEG-EMG Data Compression and Classification," in IEEE Wireless Communications and Networking Conference (WCNC), IEEE, 2017.
13. D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," CoRR abs/1412.6980, 2015.
14. N. Jaques et al., "Multimodal Autoencoder: A Deep Learning Approach to Filling In Missing Sensor Data and Enabling Better Mood Prediction," pp. 202–208, IEEE, 2018.
15. Y. Gal et al., "Concrete Dropout," in Advances in Neural Information Processing Systems 30, pp. 3581–3590, Curran Associates, Inc., 2017.
16. Y. Gal and Z. Ghahramani, "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning," in International Conference on Machine Learning, pp. 1050–1059, 2016.
17. E. Jang et al., "Categorical Reparameterization with Gumbel-Softmax," 2016.
18. K. He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, IEEE Computer Society, 2015.
19. L. Fan et al., "The Human Brainnetome Atlas: A New Brain Atlas Based on Connectional Architecture," Cerebral Cortex (8), pp. 3508–3526, 2016.
20. Q. Chen et al., "Schizophrenia Polygenic Risk Score Predicts Mnemonic Hippocampal Activity," Brain (4), pp. 1218–1228, 2018.
21. J. H. Callicott et al., "Abnormal fMRI Response of The Dorsolateral Prefrontal Cortex in Cognitively Intact Siblings of Patients with Schizophrenia," American Journal of Psychiatry (4), pp. 709–719, 2003.
22. F. Sambataro et al., "Treatment with Olanzapine is Associated with Modulation of The Default Mode Network in Patients with Schizophrenia," Neuropsychopharmacology (4), pp. 904–912, 2010.
23. R. Rasetti et al., "Altered Hippocampal-Parahippocampal Function During Stimulus Encoding," JAMA Psychiatry (3), p. 236, 2014.
24. T. D. Wager, "NeuroSynth: A New Platform for Large-Scale Automated Synthesis of Human Functional Neuroimaging Data," Frontiers in Neuroinformatics, 2011.
25. A. Z. Dayem Ullah et al., "SNPnexus: Assessing the Functional Relevance of Genetic Variation to Facilitate the Promise of Precision Medicine," Nucleic Acids Research, pp. 109–113, 2018.
26. Mi et al., "Protocol Update for Large-scale Genome and Gene Function Analysis with The PANTHER Classification System (v.14.0)," Nature Protocols (3), pp. 703–721, 2019.
27. B. Dean, "Is Schizophrenia The Price of Human Central Nervous System Complexity?," Australian and New Zealand Journal of Psychiatry (1), pp. 13–24, 2009.
28. M. J. Berridge, "Dysregulation of Neural Calcium Signaling in Alzheimer Disease, Bipolar Disorder and Schizophrenia," Prion (1), pp. 2–13, Landes Bioscience, 2013.
29. J. Lonsdale et al., "The Genotype-Tissue Expression (GTEx) Project," (6), pp. 580–585, 2013.
30. F. S. Goes et al., "Genome-wide Association Study of Schizophrenia in Ashkenazi Jews," American Journal of Medical Genetics Part B: Neuropsychiatric Genetics (8), pp. 649–659, 2015.
31. M. Luciano et al., "Association Analysis in Over 329,000 Individuals Identifies 116 Independent Variants Influencing Neuroticism," Nature Genetics 50.