Interpretable multimodal fusion networks reveal mechanisms of brain cognition
Wenxing Hu, Xianghe Meng, Yuntong Bai, Aiying Zhang, Biao Cai, Gemeng Zhang, Tony W. Wilson, Julia M. Stephen, Vince D. Calhoun, Fellow, IEEE, and Yu-Ping Wang, Senior Member, IEEE
Abstract—Multimodal fusion benefits disease diagnosis by providing a more comprehensive perspective. Developing fusion algorithms is challenging due to data heterogeneity and the complex within- and between-modality associations. Deep-network-based data-fusion models have been developed to capture these complex associations, and diagnostic performance has improved accordingly. Moving beyond diagnosis prediction, evaluation of disease mechanisms is critically important for biomedical research. Deep-network-based data-fusion models, however, are difficult to interpret, which hinders the study of biological mechanisms. In this work, we develop an interpretable multimodal fusion model, namely gCAM-CCL, which can perform automated diagnosis and result interpretation simultaneously. The gCAM-CCL model generates interpretable activation maps that quantify pixel-level contributions of the input features. This is achieved by combining intermediate feature maps using gradient-based weights. Moreover, the estimated activation maps are class-specific, and the captured cross-data associations are interest/label-related, which further facilitates class-specific analysis and biological mechanism analysis. We validate the gCAM-CCL model on a brain imaging-genetic study and show that it performs well in both classification and mechanism analysis. Mechanism analysis suggests that during task-fMRI scans, several object-recognition-related regions of interest (ROIs) are first activated, after which several downstream encoding ROIs become involved. The results also suggest that the higher-cognition group may have stronger neurotransmission signaling, while the lower-cognition group may have problems in brain/neuron development resulting from genetic variations.
Index Terms—Interpretable, multimodal fusion, brain functional connectivity, CAM.
Y.-P. Wang* is with the Biomedical Engineering Department, Tulane University, New Orleans, LA 70118 (corresponding author). W. Hu, Y. Bai, A. Zhang, B. Cai, and G. Zhang are with the Biomedical Engineering Department, Tulane University, New Orleans, LA 70118. X. Meng is with the Center of System Biology, Data Information and Reproductive Health, School of Basic Medical Science, Central South University, Changsha, Hunan, 410008, China. J. M. Stephen is with the Mind Research Network, Albuquerque, NM 87106. T. W. Wilson is with the Department of Neurological Sciences, University of Nebraska Medical Center, Omaha, NE 68198. V. D. Calhoun is with the Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, and Emory University, Atlanta, GA 30030. Manuscript received **** **, ****; revised **** **, ****.

I. INTRODUCTION

Recently, there has been increasing recognition that multimodal imaging data fusion can exploit the complementary information across different data, leading to better performance in terms of diagnosis and the analysis of mechanisms [1]. Conventional multimodal fusion is often focused on matrix decomposition approaches. Among these methods, canonical correlation analysis (CCA) [2] has been widely used to integrate multimodal data by detecting linear cross-data correlations. However, CCA fails when data have complex nonlinear interactions. To capture complex cross-data associations, deep neural network (DNN) based models, e.g., deep CCA [3], have been developed, which employ deep networks to extract high-level cross-data associations. These methods can lead to improved performance in terms of prediction/diagnosis [3], [4].

Beyond diagnosis, it is also important to uncover hidden disease mechanisms. This requires the data analysis model to be interpretable, i.e., to have explicit and interpretable data representations. However, a DNN is composed of a large number of layers, and each layer consists of several nonlinear transforms/operations, e.g., nonlinear activation and convolution, resulting in difficulties in interpreting its data representations. Moreover, the captured cross-data associations are not guaranteed to be relevant to the variable of interest, e.g., disease. Instead, the associations may result from interest-irrelevant signals, e.g., noise and background. Therefore, it is not clear how to use the captured associations for disease mechanism analysis.

To address these issues, we develop an interpretable DNN-based multimodal fusion model, Grad-CAM guided convolutional collaborative learning (gCAM-CCL), which can perform automated diagnosis and result interpretation simultaneously. The gCAM-CCL model can generate interpretable activation maps indicating pixel-wise contributions of the inputs, enabling automated result interpretation. Moreover, the activation maps are class-specific, which can further promote class-difference analysis and biological mechanism analysis. In addition, the cross-data associations captured by gCAM-CCL are interest-related, e.g., disease-related. This is achieved by feeding the network representations to a collaborative layer [5], which considers both cross-data interactions and the fitting to traits.

The rest of the paper is organized as follows. Section II describes the limitations of several existing multimodal fusion methods and how the proposed model addresses them. Data collection and preprocessing procedures, as well as experiments and results of applying gCAM-CCL to an imaging genetic study, can be found in Section III. A brief discussion is given in Section IV.
II. METHOD
A. Multimodal data fusion: analyzing cross-data association
Classical multimodal data fusion methods are often focused on cross-data matrix factorization. Among them, canonical correlation analysis (CCA) [2] has been widely used in multi-view/omics studies [6], [7]. CCA aims to find the most correlated variable pairs, i.e., canonical variables, and further association analysis can be performed accordingly.

Specifically, given two data matrices $X_1 \in \mathbb{R}^{n \times r}$ and $X_2 \in \mathbb{R}^{n \times s}$ ($n$ represents the sample/subject size, and $r, s$ represent the feature/variable sizes in the two data sets), CCA seeks two optimal loading vectors $u_1 \in \mathbb{R}^{r \times 1}$ and $u_2 \in \mathbb{R}^{s \times 1}$ which maximize the Pearson correlation $\mathrm{corr}(X_1 u_1, X_2 u_2)$, as in Eq. (1):

$$(u_1^*, u_2^*) = \operatorname*{argmax}_{u_1, u_2} \; u_1' \Sigma_{12} u_2 \quad \text{subject to } u_1' \Sigma_{11} u_1 = 1, \; u_2' \Sigma_{22} u_2 = 1, \tag{1}$$

where $\Sigma_{ij} := X_i' X_j$, $i, j = 1, 2$. Solving the optimization in Eq. (1) yields the most correlated canonical variable pair, i.e., $X_1 u_1$ and $X_2 u_2$. More canonical variable pairs (with lower correlations) can be obtained subsequently by solving the extended optimization problem, as formulated in Eq. (2):

$$(U_1^*, U_2^*) = \operatorname*{argmax}_{U_1, U_2} \; \mathrm{Trace}\left(U_1' \Sigma_{12} U_2\right) \quad \text{subject to } U_1' \Sigma_{11} U_1 = U_2' \Sigma_{22} U_2 = I_k, \tag{2}$$

where $U_1 \in \mathbb{R}^{r \times k}$, $U_2 \in \mathbb{R}^{s \times k}$, and $k = \min(\mathrm{rank}(X_1), \mathrm{rank}(X_2))$.

CCA captures only linear associations, and therefore it requires that different data/views follow the same distribution. However, data from different modalities, e.g., fMRI imaging and genetic data, may follow different distributions and have different data structures. As a result, CCA fails to detect associations between heterogeneous data sets. To address this problem, deep CCA (DCCA) was proposed by Andrew et al. [3] to detect more complicated correlations. DCCA introduces a deep network representation before applying the CCA framework. Unlike linear CCA, which seeks the optimal canonical vectors $U_1, U_2$, DCCA seeks the optimal network representations $f_1(X_1), f_2(X_2)$, as shown in Eq. (3):

$$(f_1^*, f_2^*) = \operatorname*{argmax}_{f_1, f_2} \left\{ \max_{U_1, U_2} \frac{U_1' f_1(X_1)' f_2(X_2) U_2}{\|f_1(X_1) U_1\| \, \|f_2(X_2) U_2\|} \right\}, \tag{3}$$

where $f_1, f_2$ are two deep networks.

The introduction of a deep network representation leads to a more flexible ability to detect both linear and nonlinear correlations. According to experiments on both speech data and handwritten digit data [3], DCCA's representation was more effective in finding correlations than other methods, e.g., linear CCA and kernel CCA. Despite DCCA's superior performance, the detected associations are not guaranteed to be relevant to any phenotype of interest, e.g., disease. Instead, the detected associations may be caused by irrelevant signals, e.g., background and noise. As a result, using the detected associations for further disease mechanism analysis is challenging.
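To make the notation concrete, the following is a minimal NumPy sketch of linear CCA (Eqs. (1)-(2)), computed by whitening each view and taking an SVD of the whitened cross-covariance; the function and variable names are ours, not from any released code.

```python
import numpy as np

def linear_cca(X1, X2, k, eps=1e-8):
    """Top-k canonical loadings (U1, U2) and canonical correlations."""
    n = X1.shape[0]
    X1 = X1 - X1.mean(0)
    X2 = X2 - X2.mean(0)
    S11 = X1.T @ X1 / n + eps * np.eye(X1.shape[1])   # Sigma_11 (regularized)
    S22 = X2.T @ X2 / n + eps * np.eye(X2.shape[1])   # Sigma_22
    S12 = X1.T @ X2 / n                               # Sigma_12
    L1 = np.linalg.cholesky(S11)                      # S11 = L1 L1'
    L2 = np.linalg.cholesky(S22)
    C = np.linalg.solve(L1, S12) @ np.linalg.inv(L2).T  # L1^{-1} S12 L2^{-T}
    U, rho, Vt = np.linalg.svd(C)                     # rho = canonical correlations
    U1 = np.linalg.solve(L1.T, U[:, :k])              # satisfies U1' S11 U1 = I_k
    U2 = np.linalg.solve(L2.T, Vt.T[:, :k])           # satisfies U2' S22 U2 = I_k
    return U1, U2, rho[:k]
```

DCCA (Eq. (3)) amounts to the same computation applied to $f_1(X_1)$ and $f_2(X_2)$, with the correlation back-propagated through the two networks.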
B. Deep collaborative learning (DCL): phenotype-related cross-data association

To address the limitations of DCCA, we proposed a multimodal fusion model, deep collaborative learning (DCL) [5], which can capture phenotype-related cross-data associations by enforcing an additional fit to the phenotype label, as formulated in Eq. (4):

$$
\begin{aligned}
(Z_1^*, Z_2^*) &= \operatorname*{argmax}_{Z_1, Z_2} \Big\{ \max_{U_1, U_2} \mathrm{Trace}(U_1' Z_1' Z_2 U_2) - \min_{\beta_1} \|Y - Z_1 \beta_1\|^2 - \min_{\beta_2} \|Y - Z_2 \beta_2\|^2 \Big\} \\
&= \operatorname*{argmax}_{Z_1, Z_2} \Big\{ \|\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}\|_{tr} - \|Y - Z_1 (Z_1' Z_1)^{-1} Z_1' Y\|^2 - \|Y - Z_2 (Z_2' Z_2)^{-1} Z_2' Y\|^2 \Big\} \\
&= \operatorname*{argmax}_{Z_1, Z_2} F(Z_1, Z_2),
\end{aligned} \tag{4}
$$

where $U_1, U_2$ are subject to $U_1' \Sigma_{11} U_1 = U_2' \Sigma_{22} U_2 = I$; $\|A\|_{tr} := \mathrm{Trace}(\sqrt{A'A}) = \sum_i \sigma_i$; $Z_1 = f_1(X_1) \in \mathbb{R}^{n \times p}$ and $Z_2 = f_2(X_2) \in \mathbb{R}^{n \times q}$, with $f_1, f_2$ representing two deep networks; and $Y \in \mathbb{R}^{n \times 1}$ represents the phenotype or label data.

As shown in Eq. (4), DCL seeks the optimal network representations $Z_1 = f_1(X_1)$ and $Z_2 = f_2(X_2)$ that maximize cross-data correlations. Compared to DCCA, DCL's representation retains label-related information, which guarantees label/phenotype-related associations. In this way, further analysis of disease mechanisms can be performed, and better classification performance can be achieved, according to the work described in [5]. Moreover, DCL relaxes the requirement that the projections $u_1$ and $u_2$ have to be in the same direction. This leads to a more effective representation of both phenotypical information and cross-data correlation.

With the ability to capture both cross-data associations and trait-related signals, DCL can exploit complementary information from multimodal data, as demonstrated in a brain imaging study [5]. However, DCL uses deep networks to extract high-level features, which are difficult to interpret, resulting in obstacles to identifying significant features/biomarkers. As a result, DCL can only be used for classification/diagnosis rather than for exploring disease mechanisms, and consequently the medical impact of its applications is limited.
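To make Eq. (4) concrete, here is a minimal NumPy sketch that evaluates the DCL objective $F(Z_1, Z_2)$ for given network outputs; in the actual model, $f_1$ and $f_2$ are trained by back-propagating through this quantity, and all names here are illustrative rather than the authors' code.

```python
import numpy as np

def dcl_objective(Z1, Z2, Y, eps=1e-8):
    """F(Z1, Z2) of Eq. 4: trace-norm correlation minus the two regression residuals."""
    n = Z1.shape[0]
    S11 = Z1.T @ Z1 / n + eps * np.eye(Z1.shape[1])
    S22 = Z2.T @ Z2 / n + eps * np.eye(Z2.shape[1])
    S12 = Z1.T @ Z2 / n
    L1, L2 = np.linalg.cholesky(S11), np.linalg.cholesky(S22)
    T = np.linalg.solve(L1, S12) @ np.linalg.inv(L2).T  # S11^{-1/2} S12 S22^{-1/2}
    corr_term = np.linalg.svd(T, compute_uv=False).sum()  # trace norm = sum of singular values

    def residual(Z):
        # ||Y - Z (Z'Z)^{-1} Z' Y||^2, computed via least squares
        beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        return np.sum((Y - Z @ beta) ** 2)

    return corr_term - residual(Z1) - residual(Z2)
```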
C. Deep network interpretation: CAM-based methods

Both DCCA and DCL use deep neural networks (DNNs) for feature extraction. A DNN employs a sequence of intermediate layers to extract high-level features, and each layer is composed of a number of complex operations, e.g., nonlinear activation, kernel convolution, and batch normalization. DNN-based models have found numerous successful applications in both computer vision and medical imaging, as a result of their superior ability to extract high-level features. However, the large number of layers and the complex/nonlinear operations in each layer make network explanation and feature identification difficult. As a result, users may doubt the reliability of deep networks: whether they make decisions based on the object of interest or based on irrelevant/background information.
Fig. 1: The workflow of convolutional collaborative learning (CCL), an end-to-end model for automated classification and interpretation in multimodal fusion. Genetic data are fed into a ConvNet and then flattened into a fully connected (FC) layer. Brain functional connectivity (FC) data are fed into a deep network. A collaborative learning layer fuses the two deep networks and passes two composite gradients mutually during the back-propagation process.
1) Class Activation Mapping (CAM):
To make DNNs explainable, the Class Activation Mapping (CAM) method [8] was proposed. CAM generates an activation map for each sample/image indicating pixel-wise contributions to the decision of interest, e.g., the class label. Moreover, as its name suggests, CAM's activation maps are class-specific, providing more discriminative information for further class-specific analysis. This dramatically helps build trust in deep networks: for correctly classified images/samples, CAM explains how the classification decision is made by highlighting the object of interest; for incorrectly classified images/samples, CAM illustrates why incorrect decisions are made by highlighting the misleading regions.

CAM's activation maps are obtained by computing an optimal combination of intermediate feature maps. As feature maps only exist in convolutional layers, CAM can be applied only to convolutional neural networks (CNNs). A weight coefficient is needed for each feature map to evaluate its importance to the decision of interest. However, for most CNN-based models, this weight is not provided. To solve this problem, a re-training procedure is introduced, in which the feature maps are used directly by a newly introduced layer to re-conduct classification. The weights can then be calculated from the parameters of the introduced layer. The detailed CAM method is described as follows.

For a pre-trained CNN-based model, assume that a target feature map layer consists of $K$ channels/feature maps $F^k \in \mathbb{R}^{h \times w}$ ($k = 1, 2, \cdots, K$), where $h, w$ represent the height and width of each feature map, respectively. CAM discards all the subsequent layers and then introduces a new layer (with softmax activation) to re-conduct classification using these feature maps $F^k$. A prediction score $S_c$ is then calculated by the newly introduced layer for each class $c$ ($c = 1, 2, \cdots, C$), as formulated in Eq. (5):

$$S_c = \sum_{k=1}^{K} w_k^c \, \mathrm{GAP}\left(F^k\right), \tag{5}$$

where $\mathrm{GAP}(\cdot)$ denotes global average pooling and $w_k^c$ represents the weight coefficient of feature map $F^k$ for class $c$.

After that, class-specific activation maps can be generated by first combining the feature maps using the trained weights $w_k^c$ and then upsampling to project the result onto the input images, as in Eq. (6):

$$\mathrm{map}_{cam} = \mathrm{upsampling}\left(\sum_{k=1}^{K} w_k^c F^k\right). \tag{6}$$

The re-training procedure, however, is time-consuming, which limits CAM's application. Moreover, classification accuracy is sacrificed due to the modification of the model's architecture, and consequently the accuracy of the activation maps decreases.
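As an illustration, here is a short NumPy sketch of Eqs. (5)-(6), assuming the re-trained weights are already available; the shapes and names are ours for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

def cam(F, W, c, out_hw):
    """F: (K, h, w) feature maps; W: (C, K) re-trained softmax weights; c: class index."""
    scores = W @ F.mean(axis=(1, 2))            # Eq. 5: S_c = sum_k w_k^c * GAP(F^k)
    m = np.tensordot(W[c], F, axes=1)           # Eq. 6 combination: sum_k w_k^c F^k -> (h, w)
    h, w = F.shape[1:]
    m = zoom(m, (out_hw[0] / h, out_hw[1] / w))  # upsample to input resolution
    return scores, m
```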
2) Gradient-weighted CAM (Grad-CAM):
To address the limitations of the CAM method, gradient-weighted CAM (Grad-CAM) was proposed [9] to compute activation maps without modifying the model's architecture. Similar to CAM, Grad-CAM also needs a set of weight coefficients to combine the feature maps. These are obtained by first calculating the gradients of the decision of interest with respect to each feature map and then performing global average pooling on the gradients to get scalar weights. In this way, Grad-CAM avoids adding extra layers, and consequently both the model-retraining and performance-decrease problems are solved.
Grad-CAM calculates the weights $g_k^c$ and the activation map $\mathrm{map}_{gradcam}$ as follows:

$$g_k^c = \mathrm{GAP}\left(\frac{\partial y^c}{\partial F^k}\right), \tag{7}$$

where $y^c$ represents the prediction score for class $c$, and

$$\mathrm{map}_{gradcam} = \mathrm{upsampling}\left(\sum_{k=1}^{K} g_k^c F^k\right). \tag{8}$$
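A hedged PyTorch sketch of Grad-CAM (Eqs. (7)-(8)) using forward/backward hooks; `model` and `target_layer` are placeholders for the user's own network and one of its convolutional layers. For convenience the sketch already applies the ReLU of Eq. (10) (Section II-D) to the gradients before pooling.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, c):
    """x: input of shape (1, C_in, H, W); c: target class index."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    y = model(x)                    # forward pass records the feature maps
    y[0, c].backward()              # gradients of the class-c score w.r.t. F^k
    h1.remove(); h2.remove()
    A, dA = feats[0], grads[0]      # each of shape (1, K, h, w)
    g = torch.relu(dA).mean(dim=(2, 3), keepdim=True)   # Eq. 7/10: GAP of (rectified) gradients
    cam = torch.relu((g * A).sum(dim=1, keepdim=True))  # Eq. 8: weighted combination
    return F.interpolate(cam, size=x.shape[2:], mode='bilinear')  # upsample to input size
```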
3) Guided Grad-CAM: high-resolution class-specific activation maps:
Both CAM’s and Grad-CAM’s activation mapsare coarse due to the upsampling procedure, as feature mapsnormally are of smaller size compared to input images.This brings about difficulties in identifying small but im-portant object-features. Fine-grained visualization methods,e.g., guided backpropagation (BP) [10] and deconvolution[11], can generate high resolution activation maps. Thesemethods use backward projections which operate on layer-to-layer gradients. Upsampling procedure is not involved inthese back projection methods, and therefore high resolutionactivation maps can be obtained. Nevertheless, the activationmaps are not class-specific, bringing about obstacles in inter-preting the activation maps, especially for multi-class (morethan 2) scenarios. To obtain both high resolution and class-specific activation maps, guided Grad-CAM was proposed inthe work [9] which incorporated guided BP into Grad-CAM.Guided Grad-CAM computes activation maps by performinga Hadamard product between the Grad-CAM map and theGuided BP map, as formulated in Eq. 9.map guided gradcam = map guidedBP (cid:12) map gradcam (9)where map guidedBP represents the map computed usingguided BP algorithm [10], and (cid:12) represents the Hadamardproduct operation. For example, given two arbitrary matri-ces A, B ∈ R m × n , their Hadamard product is defined as ( A (cid:12) B ) ij := A ij B ij . D. Grad-CAM guided convolutional collaborative learning(gCAM-CCL)
D. Grad-CAM guided convolutional collaborative learning (gCAM-CCL)

For the purpose of interpretable multimodal fusion, we develop a new model, Grad-CAM guided convolutional collaborative learning (gCAM-CCL), which incorporates both the guided BP and Grad-CAM methods into the DCL model. As shown in Fig. 1, gCAM-CCL first integrates the two data modalities using the collaborative networks, and then computes class-specific activation maps using guided BP and Grad-CAM. In this way, gCAM-CCL can simultaneously perform automated classification/diagnosis, automated biomarker identification, and result interpretation.

To be more specific, gCAM-CCL uses a 1D ConvNet to learn features from SNP data and a 2D ConvNet to learn features from brain imaging data. The outputs of the two ConvNets are flattened and then fused in the collaborative layer with the loss function in Eq. (11), which considers both the cross-data associations and their fits to the phenotype/label $y$. After that, two intermediate layers are selected, from which the feature maps are combined using the gradient-based weights (Eqs. (7)-(8)), and class-specific Grad-CAM activation maps are generated accordingly. Meanwhile, fine-grained activation maps are computed by projecting the gradients back from the collaborative layer to the input layer using guided BP. The obtained activation maps indicate pixel-wise contributions to the decision of interest, e.g., the prediction, and significant biomarkers, e.g., brain FCs and genes, can be identified accordingly.

Compared to the DCL model [5], gCAM-CCL employs both a new architecture and a new loss function so as to incorporate Grad-CAM. As computing activation maps requires a layer of feature maps, gCAM-CCL replaces the multilayer perceptron (MLP) network with two ConvNets so that multi-channel feature maps can be obtained. This also benefits model training, as a ConvNet dramatically reduces the number of parameters by enforcing shared kernel weights.

Moreover, to compute class-specific activation maps, the gradients $\frac{\partial y^c}{\partial F^k}$ with respect to each class $c$ ($c = 1, 2, \cdots, C$) are needed, as illustrated in Eq. (7). However, DCL uses external classifiers, e.g., support vector machines (SVM), and therefore no class information is available in DCL's gradients. To solve this problem, gCAM-CCL replaces the external classifiers with an embedded softmax classifier so that class-specific gradients $g_k^c$ can be obtained.

Furthermore, ideal class-specific activation maps should highlight only the features relevant to the corresponding class, e.g., the 'dog' class. However, features related to other classes, e.g., fish-related features, may have strong but negative contributions to predicting the 'dog' class, resulting in noise features in the activation maps. To remove the noise features, we apply a ReLU function to the gradients, as shown in Eq. (10). The ReLU function retains only positive effects, so pixels with negative contributions are filtered out:

$$g_k^c = \mathrm{GAP}\left(\mathrm{ReLU}\left(\frac{\partial y^c}{\partial F^k}\right)\right), \tag{10}$$

where $y^c$ represents the prediction score for class $c$.

In addition, as pointed out in Wang's work [4], both DCCA [3] and DCL [5] include the sample size in their loss functions, causing a problem in batch size tuning. In other words, their loss functions depend on the batch size through the population-level correlation term $U_1' Z_1' Z_2 U_2$. As a result, a large batch size is required [4], making batch size tuning and network training challenging. In this work, we propose a new loss function which resolves the batch-size dependence, as formulated in Eq. (11).
As shown in Eq. (11), the population-level correlation term is replaced with a summation of sample-level losses. Moreover, the correlation term is replaced with a regression loss, i.e., a cross-entropy loss, as it has been shown that optimizing the correlation term is equivalent to optimizing the regression loss [4]:

$$\mathrm{Loss} = -\sum_{i=1}^{n} \Big( (1 - y^{(i)}) \log(h_1^{(i)}) + y^{(i)} \log(1 - h_2^{(i)}) \Big) - \sum_{i=1}^{n} \Big( h_1^{(i)} \log(h_2^{(i)}) + h_2^{(i)} \log(h_1^{(i)}) \Big), \tag{11}$$

where $h_1^{(i)}, h_2^{(i)}$ are the outputs of the two ConvNets for sample $i$, as illustrated in Fig. 1.

This batch-independent loss function is easier to extend to multi-class, multi-view scenarios, and the extended loss function is formulated as follows:

$$\mathrm{Loss} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{C} y_c \log(h_c^{(i)}) - \frac{1}{m(m-1)} \sum_{i,j\,(i \neq j)}^{m} \sum_{c=1}^{C} h_c^{(i)} \log(h_c^{(j)}), \tag{12}$$

where $m$ represents the number of views and $C$ represents the number of classes.
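A hedged PyTorch sketch of the extended loss in Eq. (12): per-view cross-entropy against the label plus a pairwise cross-view agreement term. Averaging over the mini-batch is our addition; the two-view case of Eq. (11) corresponds to m = 2.

```python
import torch

def ccl_loss(probs, y, eps=1e-8):
    """probs: list of m softmax outputs, each (batch, C); y: (batch,) class indices."""
    m = len(probs)
    # First term of Eq. 12: average per-view cross-entropy against the label.
    ce = sum(-torch.log(p[torch.arange(len(y)), y] + eps).mean() for p in probs) / m
    # Second term: cross-entropy between every ordered pair of views (i != j).
    agree = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                agree += -(probs[i] * torch.log(probs[j] + eps)).sum(1).mean()
    return ce + agree / (m * (m - 1))
```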
III. APPLICATION TO BRAIN IMAGING GENETIC STUDY

We apply the gCAM-CCL model to an imaging genetic study, in which brain FC data are integrated with single nucleotide polymorphism (SNP) data to classify low/high cognitive groups. Multiple brain regions of interest (ROIs) function as a group when performing a specific task, e.g., reading, and brain FC depicts the functional associations between different brain ROIs [12]. On the other hand, genetic factors may also influence brain function, as brain dysfunction is genetically heritable. Imaging-genetic integration enables exploring brain function from a more comprehensive view, which may further contribute to the study of normal and pathological brain mechanisms. The proposed gCAM-CCL model, which can perform automated diagnosis and feature interpretation, is used to extract and analyze the complex interactions both within and between the brain FC data and the genetic data.
A. Brain imaging data
Several brain fMRI modalities from the Philadelphia Neurodevelopmental Cohort (PNC) [13] were used in the experiments. The PNC is a large-scale collaborative study between the Brain Behavior Laboratory at the University of Pennsylvania and the Children's Hospital of Philadelphia. It contains multiple types of neuroimaging data, e.g., fMRI, and genomic data, e.g., SNPs, from adolescents aged 8 to 21 years. Three types of fMRI data are available in the PNC cohort: resting-state fMRI, emotion task fMRI, and nback task fMRI (nback-fMRI). As our work focused on analyzing cognitive ability, only nback-fMRI, which is related to working memory and lexical processing, was used in the experiments. The duration of the nback-fMRI scan was 11.6 minutes (231 TRs), during which subjects were asked to perform standard nback tasks.

SPM12 was used to conduct motion correction, spatial normalization, and spatial smoothing. Movement artefacts (head motion effects) were removed via a regression procedure using a rigid-body model (6 parameters: 3 translation and 3 rotation parameters) [14], and the functional time series were band-pass filtered using a 0.01 Hz to 0.1 Hz frequency range, as significant signals mainly reside at low frequencies. For quality control, we excluded high-motion subjects (excessive translation) and subjects with a signal-to-fluctuation-noise ratio (SFNR) below 275, following the work in [15].

TABLE I: 13 brain-relevant tissues from the GTEx database
Brain amygdala
Brain nucleus accumbens
Brain caudate
Brain cerebellar hemisphere
Brain cerebellum
Brain frontal cortex
Brain cortex
Brain substantia nigra
Brain putamen
Brain anterior cingulate cortex
Brain spinal cord
Brain hypothalamus
Brain hippocampus

264 regions of interest (ROIs) (containing 21,384 voxels) were extracted based on the Power coordinates [16] with a sphere radius of 5 mm. For each subject, a 264 × 264 image was then obtained based on the 264 × 264 ROI-ROI connections, which was used as the image input for the gCAM-CCL model.
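For illustration, a minimal sketch of constructing the 264 × 264 FC image from the filtered ROI time series; the paper does not state the connectivity measure explicitly, so Pearson correlation, the common choice, is assumed here.

```python
import numpy as np

def fc_matrix(ts):
    """ts: (T, n_roi) ROI time series, e.g., (231, 264) after band-pass filtering."""
    fc = np.corrcoef(ts.T)        # (n_roi, n_roi) ROI-ROI Pearson correlations
    np.fill_diagonal(fc, 0.0)     # optionally zero the self-connections
    return fc                     # used as the 264 x 264 input "image"
```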
B. SNP data
The genomic data were collected from three platforms: the Illumina HumanHap 610 array, the Illumina HumanHap 500 array, and the Illumina Human Omni Express array, which generated 620k, 561k, and 731k SNPs, respectively [13]. A common set of SNPs (313k) was extracted, and PLINK [17] was then used to perform standard quality control, including the Hardy-Weinberg equilibrium test for genotyping errors (p-value < 10^-5) and the extraction of common SNPs by minor allele frequency (MAF); SNPs with missing rates > 10% and samples with missing SNPs > 5% were removed. The remaining missing values were imputed by Minimac 3 [18] using the reference genome from the 1000 Genomes Project. In addition, only the SNPs within gene bodies were kept for further analysis, resulting in 98,804 SNPs in 14,131 genes.

As the study aimed to investigate the brain, we further narrowed the scope to brain-expression-related SNPs. This was achieved using the expression quantitative trait loci (eQTL) data from the Genotype-Tissue Expression (GTEx) database¹ [19], a large-scale consortium studying tissue-specific gene regulation and expression. The GTEx data were collected from 53 tissue sites in around 1000 subjects. Among the 53 tissue sites, 13 are brain-related; they are listed in Table I. A set of 108 SNP loci showing significant tissue-level regulation (eQTL p-value < 1 × 10^-8) in all 13 brain-relevant tissues was selected. In addition, SNPs in the top 100 brain-expressed genes were also selected based on the GTEx database. These procedures resulted in 750 SNP loci, which were used as the genetic input for the gCAM-CCL model.

¹ https://gtexportal.org/
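A hedged pandas sketch of this brain-eQTL selection step; the column names ('snp', 'tissue', 'pval') are illustrative and do not reflect GTEx's actual file schema.

```python
import pandas as pd

def select_brain_eqtl_snps(eqtl: pd.DataFrame, brain_tissues, thresh=1e-8):
    """Keep SNPs with eQTL p-value below `thresh` in every listed brain tissue."""
    hits = eqtl[eqtl.tissue.isin(brain_tissues) & (eqtl.pval < thresh)]
    per_snp = hits.groupby('snp')['tissue'].nunique()      # tissues passing per SNP
    return per_snp[per_snp == len(brain_tissues)].index.tolist()
```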
C. Integrating brain imaging and genetic data: classification

The gCAM-CCL model was then applied to integrate the brain imaging data with the SNP data to classify subjects with low/high cognitive abilities. The wide range achievement test (WRAT) [20] score, a measure of comprehensive cognitive ability including reading, comprehension, and math skills, was used to evaluate the cognitive ability of each subject. The 854 subjects were divided into three classes: a high cognitive/WRAT group (top 20% of WRAT scores), a low cognitive/WRAT group (bottom 20% of WRAT scores), and a middle group (the rest), following the procedures in [5].

The gCAM-CCL model adopted a 1D convolutional neural network (CNN) to learn the interactions between alleles at different SNP loci. ConvNets have been widely used on sequencing and gene expression data [21], [22] to learn local genetic structures. According to these studies, 1D kernels of relatively large size are preferred; accordingly, two large 1D kernels were used (see Table IV for the layer-wise kernel sizes). The detailed architecture of gCAM-CCL is listed in Table IV. The data were partitioned into a training set (70%), a validation set (15%), and a test set (15%). The gCAM-CCL model was trained on the training set; hyper-parameters were selected based on the loss on the validation set; and the classification performance was reported on the test set.

Hyper-parameters, including momentum, activation function, learning rate, decay rate, batch size, and maximum number of epochs, were selected using the validation set, and their values are listed in Table II. Mini-batch SGD was used to solve the optimization problem. Over-fitting occurred due to the small sample size; to address it, dropout was used, with the dropout probability of the middle layers set to 0.2, and early stopping was applied during network training. In addition, batch normalization was implemented after each layer to relieve the gradient vanishing/exploding problem resulting from ReLU activation. Computational experiments were conducted on a desktop with an Intel(R) Core(TM) i7-8700K CPU (@3.70 GHz), 16 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU (11 GB).

For comparison, several classical classifiers, e.g., SVM, random forest (RF), and decision tree, were implemented for classifying the low/high WRAT groups. In addition, several deep-network-based classifiers were implemented, including CCL with external classifiers (SVM/RF) and a multilayer perceptron (MLP). The results of classifying the high/low cognitive groups are shown in Table III. As the table shows, gCAM-CCL outperforms both conventional classifiers, e.g., SVM, and a regular deep-network fusion method in which the two data sets were concatenated as the input. This is consistent with the result in [5], which also showed that the collaborative network can improve classification performance for multimodal data. Moreover, gCAM-CCL with its intrinsic softmax classifier achieved better classification performance than 'CCL+SVM' and 'CCL+RF'. This may be due to the incorporation of the cross-entropy loss, i.e., Eq. (11), which helps the network learn the loss gradient more efficiently during back-propagation at each iteration. A simplified sketch of the two-branch network and its classification heads is given below.
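The following is a simplified PyTorch reconstruction of the two-branch network of Table IV; it is an illustrative sketch under our reading of the table, not the authors' released code, and the collaborative layer is reduced to the two softmax heads whose outputs h1, h2 feed the loss of Eq. (11).

```python
import torch
import torch.nn as nn

class GCamCCL(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        # 2D branch for the 264 x 264 FC image (cf. f_conv1..f_conv4 in Table IV)
        self.fmri = nn.Sequential(
            nn.Conv2d(1, 16, 7, padding=3), nn.ReLU(), nn.BatchNorm2d(16), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.BatchNorm2d(32), nn.MaxPool2d(4),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(32), nn.MaxPool2d(3),
            nn.Conv2d(32, 64, 11), nn.ReLU(),            # -> (b, 64, 1, 1)
        )
        # 1D branch for the 750 SNP loci (cf. s_conv1..s_conv3 in Table IV)
        self.snp = nn.Sequential(
            nn.Conv1d(1, 16, 31), nn.ReLU(), nn.BatchNorm1d(16), nn.MaxPool1d(6),
            nn.Conv1d(16, 32, 31), nn.ReLU(), nn.BatchNorm1d(32), nn.MaxPool1d(6),
            nn.Conv1d(32, 64, 15), nn.ReLU(),            # -> (b, 64, 1)
        )
        self.head1 = nn.Linear(64, n_classes)            # fMRI softmax head (h1)
        self.head2 = nn.Linear(64, n_classes)            # SNP softmax head (h2)

    def forward(self, img, snp):
        z1 = self.fmri(img).flatten(1)                   # (b, 64)
        z2 = self.snp(snp).flatten(1)                    # (b, 64)
        return self.head1(z1).softmax(1), self.head2(z2).softmax(1)

# Shape check: h1, h2 = GCamCCL()(torch.randn(4, 1, 264, 264), torch.randn(4, 1, 750))
```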
D. Integrating brain imaging and genetic data: result interpretation

The class-specific activation maps for the low WRAT group and the high WRAT group are plotted in Figs. 2 and 3, respectively.
Fig. 2: The brain FC activation maps for the low WRAT group: Grad-CAM (top 4 subfigures) and gradient-guided Grad-CAM (bottom 4 subfigures).

Fig. 3: The brain FC activation maps for the high WRAT group: Grad-CAM (top 4 subfigures) and gradient-guided Grad-CAM (bottom 4 subfigures).
As shown in Fig. 2, the low WRAT group exhibits a relatively larger number of activated FCs, which contributed to the 'low WRAT group' decision. In comparison, the high WRAT group (Fig. 3) shows a relatively smaller number of significant FCs, which contributed to the 'high WRAT group' decision. This is further validated by the averaged histogram of the activation maps in Fig. 4: for the low WRAT group (Fig. 4, left), a large portion of the FCs were activated (high grey-scale values), while for the high WRAT group (Fig. 4, right), only a small portion were activated.

Fig. 4: The histogram of the Grad-CAM activation maps of brain FCs (see Figs. 2-3).

To identify significant brain FCs and SNPs, pixels with gray values exceeding a fixed fraction of the maximum gray value were selected, following the instructions in [9]. After that, FCs and SNPs whose occurrence frequency across all subjects exceeded a threshold were further selected as significant FCs (see Figs. 5-6) and SNPs (listed in Tables V-VI).
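A hedged sketch of this selection rule; the two thresholds below are placeholders to be set per the protocol above, since the exact values are not stated here.

```python
import numpy as np

def significant_features(maps, pix_frac=0.6, freq_thresh=0.5):
    """maps: (n_subjects, H, W) class-specific activation maps."""
    # Per-subject mask: pixels above pix_frac of that map's maximum value.
    masks = maps > pix_frac * maps.max(axis=(1, 2), keepdims=True)
    freq = masks.mean(axis=0)          # occurrence frequency across subjects
    return freq > freq_thresh          # boolean map of significant FCs/SNPs
```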
TABLE II: Hyper-parameter setting

Method     Epochs  Batch size  Activation     Learning rate  Decay rate             Dropout              Momentum
gCAM-CCL   500     4           ReLU, Sigmoid  0.00001        halved per 200 epochs  0.2 (middle layers)  0.9
Fig. 5: The identified class-discriminative brain FCs by gCAM-CCL. The full names of the ROIs can be found in Table IX. Each circle arc represents an ROI (based on the Power parcellation [16]). The length of a circle arc indicates the number of ROI-ROI connections on that ROI.

TABLE III: The comparison of classification performances (low/high WRAT classification).
Classifier  ACC     SEN     SPF     F1
gCAM-CCL    0.7501  0.7762  0.7157  0.7610
CCL+SVM     0.7387  0.7637  0.7083  0.7504
CCL+RF      0.7419  0.7666  0.7014  0.7523
MLP+SVM     0.7231  0.7555  0.6915  0.7215
SVM         0.7082  0.7562  0.6785  0.7093
DT          0.6626  0.6778  0.6430  0.6605
RF          0.7119  0.7559  0.6714  0.7138
Logistic    0.6745  0.7386  0.6285  0.6900

The identified brain FCs (ROI-ROI connections) and their corresponding ROIs are visualized in Fig. 5 and Fig. 6, respectively. For the high WRAT group (Fig. 5b), three hub ROIs (lingual gyrus, middle occipital gyrus, and inferior occipital gyrus) exhibited dominant ROI-ROI connections over the others. All three hub ROIs are occipital-related. The lingual gyrus, also known as the medial occipitotemporal gyrus, plays an important role in visual processing [23], [24], object recognition, and word processing [23]. The other two hubs, i.e., the middle and inferior occipital gyri, also play a role in object recognition [25]. As shown in Fig. 5b, the hub ROIs also connect to several other ROIs, e.g., the cuneus and the parahippocampal gyrus. Among them, the cuneus receives visual signals and is involved in basic visual processing.
Fig. 6: The identified brain functional connectivity. Top 3 subfigures: low WRAT group (axial, coronal, and sagittal views, respectively); bottom 3 subfigures: high WRAT group (axial, coronal, and sagittal views, respectively).

The parahippocampal gyrus is related to encoding and recognition [26]. These findings suggest that the three occipital gyri are first activated when processing visual and word signals during the WRAT test, and that several downstream processing ROIs, e.g., the parahippocampal gyrus, are then activated for further complex encoding. As a result, strong FCs in these ROI-ROI connections may lead gCAM-CCL to select the high WRAT group.
TABLE IV: The architecture of gCAM-CCL

fMRI ConvNet:
Layer name  Input shape        Operations          Connects to
f_conv1     (b, 1, 264, 264)   K, P, MP = 7, 3, 2  f_conv2
f_conv2     (b, 16, 132, 132)  K, P, MP = 5, 2, 4  f_conv3
f_conv3     (b, 32, 33, 33)    K, P, MP = 3, 1, 3  f_conv4
f_conv4     (b, 32, 11, 11)    K = 11              f_flatten
f_flatten   (b, 64, 1)         -                   collab_layer

SNP ConvNet:
Layer name  Input shape   Operations     Connects to
s_conv1     (b, 1, 750)   K, MP = 31, 6  s_conv2
s_conv2     (b, 16, 120)  K, MP = 31, 6  s_conv3
s_conv3     (b, 32, 15)   K = 15         s_flatten
s_flatten   (b, 64, 1)    -              collab_layer

collab_layer: (b, 4)
Notations: b (batch size), K (kernel size), P (padding), MP (max pooling).
TABLE V: Identified SNP loci (Low WRAT group)
SNP rs
TABLE VI: Identified SNP loci (High WRAT group)
SNP rs
For the low WRAT group (Fig. 5a), no significant hub ROIs were identified. Instead, several previously reported task-negative regions, e.g., the temporal-parietal region and the cingulate gyrus [27], were identified. This indicates that the low WRAT group may be weaker in activating cognition-processing ROIs, so that task-negative regions are relatively more active, which leads gCAM-CCL to make the 'low WRAT group' decision.

As seen in Figs. 5a-b, a relatively larger number of FCs contributed to the low WRAT group than to the high WRAT group. Despite this, as shown in Table III, the sensitivity is lower than the specificity, which means that the accuracy of classifying the low WRAT group is lower. This suggests that the identified FCs for the high WRAT group are relatively more discriminative, while those for the low WRAT group may contain more noisy FCs.

Gene enrichment analysis was conducted on the identified SNPs (Tables V-VI) using the ConsensusPathDB-human (CPDB) database², and the enriched pathways are listed in Tables VII-VIII. Several neurotransmission-related pathways, e.g., regulation of neurotransmitter levels and synaptic signaling, are enriched in the genes identified for the high WRAT group. This suggests that the high WRAT group may have stronger neuronal signaling ability, which may benefit the daily training and development of ROI-ROI connections and further contribute to stronger cognitive ability. For the low WRAT group, several brain-development and neuron-growth related pathways, e.g., midbrain development and growth cone, were enriched, which suggests that the low WRAT group may have problems in brain/neuron development. This may further affect the ROI-ROI connections, leading to weaker cognitive ability.

² http://cpdb.molgen.mpg.de/

IV. CONCLUSION
In this work, we developed an interpretable deep multimodal fusion model, namely gCAM-CCL, which can perform automated classification/diagnosis and result interpretation. The gCAM-CCL model generates activation maps that indicate the pixel-wise contributions of the inputs, e.g., images and genetic vectors, by first calculating each feature map's gradients and then merging the gradients via global average pooling into weights for combining the feature maps. Moreover, the activation maps are class-specific, which further promotes class-difference analysis and biological mechanism analysis.
TABLE VII: Gene enrichment analysis of the identified genes (Low WRAT group). Q-values represent multiple-testing-corrected p-values.

Pathway name | Pathway source | Set size | Contained | p-value | q-value
Eukaryotic Translation Elongation | Reactome | 106 | 5 | 1.18E-06 | 1.65E-05
Peptide chain elongation | Reactome | 101 | 4 | 3.12E-05 | 2.18E-04
Calcium Regulation in the Cardiac Cell | Wikipathways | 149 | 4 | 1.48E-04 | 6.89E-04
Translation | Reactome | 310 | 5 | 2.09E-04 | 7.33E-04
Metabolism of proteins | Reactome | 2008 | 11 | 3.96E-04 | 1.11E-03
Midbrain development | Gene Ontology | 94 | 4 | 1.22E-05 | 1.53E-03
Site of polarized growth | Gene Ontology | 167 | 4 | 1.25E-04 | 2.19E-03
Growth cone | Gene Ontology | 165 | 4 | 1.20E-04 | 4.07E-03
Cellular catabolic process | Gene Ontology | 2260 | 12 | 7.22E-05 | 4.55E-03
Pathways in cancer - Homo sapiens (human) | KEGG | 526 | 5 | 2.39E-03 | 5.58E-03
Metabolism of amino acids and derivatives | Reactome | 342 | 4 | 3.20E-03 | 6.40E-03
TABLE VIII: Gene enrichment analysis of the identified genes (High WRAT group). Q-values represent multiple-testing-corrected p-values.

Pathway name | Pathway source | Set size | Contained | p-value | q-value
Regulation of neurotransmitter levels | Gene Ontology | 335 | 9 | 6.77E-10 | 1.04E-07
Transmission across Chemical Synapses | Reactome | 224 | 7 | 7.75E-08 | 2.40E-06
Synaptic signaling | Gene Ontology | 711 | 10 | 3.17E-08 | 2.43E-06
Insulin secretion - Homo sapiens (human) | KEGG | 85 | 5 | 3.26E-07 | 5.06E-06
Organelle localization by membrane tethering | Gene Ontology | 170 | 6 | 1.34E-07 | 6.84E-06
Regulation of synaptic plasticity | Gene Ontology | 179 | 6 | 1.89E-07 | 7.21E-06
Exocytosis | Gene Ontology | 909 | 10 | 3.06E-07 | 9.36E-06
Membrane docking | Gene Ontology | 179 | 6 | 1.82E-07 | 1.28E-05
Vesicle docking involved in exocytosis | Gene Ontology | 45 | 4 | 5.11E-07 | 1.30E-05
Plasma membrane bounded cell projection part | Gene Ontology | 1452 | 12 | 3.05E-07 | 1.34E-05
Cell projection part | Gene Ontology | 1452 | 12 | 3.05E-07 | 1.37E-05
Neurotransmitter release cycle | Reactome | 51 | 4 | 1.76E-06 | 1.37E-05
Synaptic Vesicle Pathway | Wikipathways | 51 | 4 | 1.76E-06 | 1.37E-05
Neuronal System | Reactome | 368 | 7 | 2.25E-06 | 1.39E-05
Secretion by cell | Gene Ontology | 1493 | 12 | 4.13E-07 | 1.45E-05
Adrenergic signaling in cardiomyocytes - Homo sapiens (human) | KEGG | 144 | 5 | 4.47E-06 | 2.31E-05
Gastric acid secretion - Homo sapiens (human) | KEGG | 75 | 4 | 8.34E-06 | 3.70E-05
Neuron part | Gene Ontology | 1713 | 12 | 1.83E-06 | 4.09E-05
Plasma membrane bounded cell projection | Gene Ontology | 2098 | 13 | 2.22E-06 | 4.87E-05
TABLE IX: Abbreviations of the ROIs

Inferior Parietal Lobule (Inf Pari)   | Angular Gyrus (Angular)
Inferior Occipital Gyrus (Inf Occi)   | Fusiform Gyrus (Fusiform)
Inferior Frontal Gyrus (Inf Fron)     | Cingulate Gyrus (Cingu)
Middle Occipital Gyrus (Mid Occi)     | Sub-Gyral (SubGyral)
Middle Frontal Gyrus (Mid Fron)       | Paracentral Lobule (ParaCetr)
Parahippocampal Gyrus (Parahippo)     | Postcentral Gyrus (PostCetr)
Middle Temporal Gyrus (Mid Temp)      | Precuneus (Precun)
Superior Parietal Lobule (Sup Pari)   |
The proposed model was applied to an imaging-genetic study to classify low/high WRAT groups. Experimental results demonstrate gCAM-CCL's superior performance in both classification and biological mechanism analysis. Based on the generated activation maps, a number of significant brain FCs and SNPs were identified. Among the significant FCs (ROI-ROI connections), three visual-processing ROIs exhibited dominant ROI-ROI connections over the others. In addition, several signal-encoding ROIs, e.g., the parahippocampal gyrus, showed connections to the three hub ROIs. These findings suggest that during task-fMRI scans, object-recognition-related ROIs are first activated, and downstream ROIs then become involved in further signal encoding. The results also suggest that the high cognitive group may have higher neurotransmitter signaling levels, while the low cognitive group may have problems in brain/neuron development resulting from genetic-level differences. These results demonstrate that gCAM-CCL is superior in both classification and result interpretation, and therefore it can find wide applications in multimodal integration and imaging-genetic studies.

ACKNOWLEDGMENT
The authors would like to thank the NIH (R01 GM109068, R01 MH104680, R01 MH107354, P20 GM103472, R01 EB020407, R01 EB006841, R01 MH121101, P20 GM130447) and the NSF (…).
REFERENCES

[1] J. Sui et al., "Neuroimaging-based individualized prediction of cognition and behavior for mental disorders and health: Methods and promises," Biological Psychiatry, 2020.
[2] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, pp. 321-377, 1936.
[3] G. Andrew et al., "Deep canonical correlation analysis," in International Conference on Machine Learning, 2013, pp. 1247-1255.
[4] W. Wang et al., "On deep multi-view representation learning," in International Conference on Machine Learning, 2015, pp. 1083-1092.
[5] W. Hu et al., "Deep collaborative learning with application to multimodal brain development study," IEEE Transactions on Biomedical Engineering, 2019.
[6] W. Hu et al., "Adaptive sparse multiple canonical correlation analysis with application to imaging (epi)genomics study of schizophrenia," IEEE Transactions on Biomedical Engineering, vol. 65, no. 2, pp. 390-399, 2017.
[7] D. Lin et al., "Correspondence between fMRI and SNP data by group sparse canonical correlation analysis," Medical Image Analysis, vol. 18, no. 6, pp. 891-902, 2014.
[8] B. Zhou et al., "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921-2929.
[9] R. R. Selvaraju et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618-626.
[10] J. T. Springenberg et al., "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[11] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818-833.
[12] V. D. Calhoun and T. Adali, "Time-varying brain connectivity in fMRI data: whole-brain data-driven approaches for capturing and characterizing dynamic states," IEEE Signal Processing Magazine, vol. 33, no. 3, pp. 52-66, 2016.
[13] T. D. Satterthwaite et al., "Neuroimaging of the Philadelphia Neurodevelopmental Cohort," NeuroImage, vol. 86, pp. 544-553, 2014.
[14] K. J. Friston et al., "Characterizing dynamic brain responses with fMRI: a multivariate approach," NeuroImage, vol. 2, no. 2, pp. 166-172, 1995.
[15] B. Rashid et al., "Dynamic connectivity states estimated from resting fMRI identify differences among schizophrenia, bipolar disorder, and healthy control subjects," Frontiers in Human Neuroscience, vol. 8, p. 897, 2014.
[16] J. D. Power et al., "Functional network organization of the human brain," Neuron, vol. 72, no. 4, pp. 665-678, 2011.
[17] S. Purcell et al., "PLINK: a tool set for whole-genome association and population-based linkage analyses," The American Journal of Human Genetics, vol. 81, no. 3, pp. 559-575, 2007.
[18] S. Das et al., "Next-generation genotype imputation service and methods," Nature Genetics, vol. 48, no. 10, p. 1284, 2016.
[19] J. Lonsdale et al., "The Genotype-Tissue Expression (GTEx) project," Nature Genetics, vol. 45, no. 6, p. 580, 2013.
[20] G. S. Wilkinson and G. J. Robertson, Wide Range Achievement Test. Psychological Assessment Resources, 2006.
[21] R. Singh et al., "DeepChrome: deep-learning for predicting gene expression from histone modifications," Bioinformatics, vol. 32, no. 17, pp. i639-i648, 2016.
[22] J. Zhou et al., "Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk," Nature Genetics, vol. 50, no. 8, p. 1171, 2018.
[23] A. Mechelli et al., "Differential effects of word length and visual contrast in the fusiform and lingual gyri during reading," Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 267, no. 1455, pp. 1909-1913, 2000.
[24] G. R. Mangun et al., "ERP and fMRI measures of visual spatial selective attention," Human Brain Mapping, vol. 6, no. 5-6, pp. 383-389, 1998.
[25] K. Grill-Spector et al., "The lateral occipital complex and its role in object recognition," Vision Research, vol. 41, no. 10-11, pp. 1409-1422, 2001.
[26] P. Mégevand et al., "Seeing scenes: topographic visual hallucinations evoked by direct electrical stimulation of the parahippocampal place area," Journal of Neuroscience, vol. 34, no. 16, pp. 5399-5405, 2014.
[27] J. P. Hamilton et al., "Default-mode and task-positive network activity in major depressive disorder: implications for adaptive and maladaptive rumination," Biological Psychiatry, vol. 70, no. 4, pp. 327-333, 2011.