Explaining Local, Global, And Higher-Order Interactions In Deep Learning
Samuel Lerman
Department of Computer Science, University of Rochester, Rochester, NY 14627, [email protected]
Chenliang Xu
Department of Computer Science, University of Rochester, Rochester, NY 14627, [email protected]
Charles Venuto
Department of Neurology, University of Rochester Medical Center, Rochester, NY 14627
Henry Kautz
Department of Computer Science, University of Rochester, Rochester, NY 14627, [email protected]
Abstract
We present a simple yet highly generalizable method for explaining interacting parts within a neural network's reasoning process. In this work, we consider local, global, and higher-order statistical interactions. Generally speaking, local interactions occur between features within individual datapoints, while global interactions come in the form of universal features across the whole dataset. With deep learning, combined with some heuristics for tractability, we achieve state-of-the-art measurement of global statistical interaction effects, including at higher orders (3-way interactions or more). We generalize this to the multidimensional setting to explain local interactions in multi-object detection and relational reasoning using the COCO annotated-image and Sort-Of-CLEVR toy datasets respectively. Here, we submit a new task for testing feature vector interactions, conduct a human study, propose a novel metric for relational reasoning, and use our interaction interpretations to innovate a more effective Relation Network. Finally, we apply these techniques on a real-world biomedical dataset to discover the higher-order interactions underlying Parkinson's disease clinical progression. Code for all experiments, fully reproducible, is available at: https://github.com/slerman12/ExplainingInteractions.

Introduction

The universe is made up of myriad interacting parts. To truly understand complex systems and processes, it is not enough to view their functions as an amalgamation of independent contributors. Rather, they are a complex web of inter-operating influences [3]. With some exceptions [38], explainable deep learning has hitherto concerned itself with identifying important features, feature vectors, and isolated concepts. However, in the real world, humans intuitively understand that decisions are consequences of complex relations, not merely extrapolations from rankings of singular phenomena.
Preprint. Under review.

For example, upon seeing a yield sign, it is natural to look to see if there are also passing cars. If not, the yield sign may be safely dismissed and one could keep driving without stopping. If there is a passing car, the law is to yield to the other car.

If an intelligent agent made the decision to stop upon approaching a yield sign and a passing car, explaining their actions with precision would require an explanation of this interaction. As far as individual factors go, perhaps a nearby pedestrian is also present, but without an interactional interpretation, one would not be able to distinguish the independence of the yield sign and passing car from the pedestrian, and one would not be privy to the knowledge of the salient interaction. Furthermore, a naive observer might think that yield signs always indicate "stop" without realizing that the agent's response to the yield sign would depend on the presence of a passing car. Similarly, explaining an agent's strategies in any task — be it computer vision, natural language processing, biomedicine, reinforcement learning, or future forecasting — is imprecise without an interactional approach. In chess, good strategies are derived from different interactions of pieces; a strategy may not be wholly inferred from just seeing what individual pieces the agent prioritized. In the economy, crashes are not easily summarized, and if one is forecasted, preventing it requires an understanding of many dependencies.

In light of all of this, we propose a number of contributions towards explaining interactions in deep learning. To begin, we design a novel method for extracting interaction effects based on input cross derivatives that we call T-NID. Interaction effects are a fundamental notion in statistics [40]. Our method generalizes existing formalisms and achieves state-of-the-art performance against baselines from recent works, making gains with pairwise and higher-order interactions.
We make this computation tractable by translating local interaction effects into global interaction effects via representative samples and employing a simple subsampling heuristic.

Then, we generalize Grad-CAM [31], an input gradient-based method for explaining feature vector importances, to the two-way and higher-order setting using our interaction effects formalism, and in doing so, we enable the explanation of interactions of multidimensional representations in arbitrary deep neural networks. This method, which we call TaylorCAM, is demonstrated on the task of object detection using the COCO annotated-image dataset [21].

We also explore the use of this technique as a magnifying glass on a neural network's relational reasoning, i.e., precisely how it reasons about relations and the extent to which it does so, verifying the quality of our explanations quantitatively with a small human study. To our knowledge, this is the first bridge between the statistical notion of interaction effects and relational reasoning in deep learning. Our approach explains a subtle limitation of the existing Relation Network architecture [30], which allows us to make a suitable adjustment to its design and use our new architecture to achieve improved performance on the Sort-Of-CLEVR toy visual question-answering (VQA) dataset. We also propose and examine a new metric for measuring a neural network's proclivity towards relational reasoning, showing a correlation between our metric, the network's achieved performance, and its relational capacity.

Finally, we conduct a real-world application of these techniques on a many-dimensional biomedical dataset with which we explain the interacting factors behind the progression of Parkinson's disease. In the real world, if one asks a clinician about a single variable such as age — "How does age affect disease progression?" — the answer is usually "it depends." The natural question, which we attempt to answer, is "it depends on what?"
For example, what is the individual's gender? What medications are they taking? How severe is their current disease status? In order to reflect reality and the true complexity of disease progression, such higher-order interactions must be understood in biomedicine.

Related Work

Recently, there have been several attempts to compute statistical interactions with deep learning. Neural Interaction Detection (NID) [37] used neural network weights to interpret interactions, observing that interactions occur at nonlinear activations in the first hidden layer of an MLP. Like our approach T-NID, [8] used gradient information to compute statistical interaction effects. However, they relied on Bayesian neural networks, required averaging a high number of Hessians, and only computed global interaction effects, not focusing on local or higher-order interactions. [32] relied on self-attention [38] to compute a measure analogous to non-emergent interaction effects and applied this to an analysis in the biomedical domain. Higher-order interactions have been considered throughout biomedicine, particularly for understanding gene interactions [41, 2, 22, 7].

[28, 34, 13] used input gradients to explain the reasoning of a neural network. [43] did so with class activation maps. Grad-CAM [31] and Grad-CAM++ [6] combined both approaches to localize important feature vectors in computer vision with class activation maps and gradients.

We also connect the notion of interaction effects with relational reasoning, which has received increased attention in deep learning [3, 30, 42, 29], and use our method of TaylorCAM to interpret the reasoning process of Relation Networks [30]. While most past works have mainly focused on explaining individual factors of a neural network's predictions, the weights in multi-head dot product attention [38] could be interpreted as interactional explanations for neural networks that include MHDPA in their architecture [32].
The interactions identified in this manner may not necessarily be emergent or naively extrapolated to higher orders. In contrast, TaylorCAM is applicable to explaining any sufficiently differentiable neural network directly from its gradient information.
Interaction Effects

We will discuss three kinds of interaction effects: local, global, and higher-order. Local interactions occur within individual datapoints and vary across the dataset. The automated driving example with an interaction between the yield sign and oncoming car indicating "stop" illustrates this idea. In computer vision, objects — typically represented by feature vectors projected by a Convolutional Neural Network (CNN) — interact differently from point to point. Global interactions come in the form of universal features across the whole dataset. These are summarized not for one point, but for general points in the entire domain. An example of this may be the various interactions of biomedical features that hold across patients, e.g., how two medications, when administered separately, may generally be beneficial, but when administered together, may instead be harmful.

We will begin formalizing this notion by defining statistical interaction as follows:
Definition 3.1 (Statistical Interaction). An interaction of order ℓ is a set of unique variables x_1, ..., x_ℓ which have a nonzero interaction effect.

Next, we will define interaction effect as follows:

Definition 3.2 (Interaction Effect). An interaction effect IE_{1,...,ℓ} between variables x_1, ..., x_ℓ ∈ x on a function F(x) with inputs x is measured as:

IE_{1,...,ℓ} = ∂^ℓ F(x) / (∂x_1 ⋯ ∂x_ℓ). (1)

This definition is inspired by the theory suggested by [1]. In plain English, an interaction effect is how much the meaning of one variable changes for a unit change in another variable. Naturally, this change is reflected by the cross partial derivative. "Change" is an intuitive measure for interaction. From the earlier example, given a representation of a yield sign and an oncoming car, changing the representation of the oncoming car into a representation of an empty road also changes the meaning of the yield sign from "stop" to "go." For a more formal example, consider F(x) = x_1 sin(x_2) + cos(x_3). F consists of an interaction between x_1 and x_2 for some x since ∂²F(x)/(∂x_1 ∂x_2) is nonzero. However, x_3 does not belong to an interaction since any cross derivative w.r.t. x_3 is zero. For thoroughness, we formally unify our definition above with the colloquial understanding of "interaction" as well as the mathematical meaning of relation in the Appendix.

Adapt to Neural Networks
Substituting F with a trained neural network, we can compute the local interaction effects for a datapoint up to order ℓ as long as the neural network F is ℓ-times differentiable. In classification, softmax ensures this to be the case. In regression, we substitute ReLUs with Gaussian-error rectified linear units (GELUs), which have been shown to be comparable in performance [14]. Otherwise, this formalism affords the computation of interaction effects for arbitrary neural network architectures.

Translate Local Effects to Global Effects
While computing local interaction effects is relevant to two of our application domains — computer vision and relational reasoning — typically in statistics, there is greater interest in computing global interaction effects. In tandem with our work, [8] converted local pairwise interaction effects to global pairwise interaction effects by averaging a set of representative samples retrieved via k-means clustering, in effect dividing the dataset by Euclidean distance and computing the global average from the centroids. We will similarly average representative local interaction effects in order to compute a global summary, but we will use a simpler and more efficient technique. In our case, efficiency is of more concern because computing higher-order interaction effects requires the computation of higher-order derivatives, which for many samples can become intractable. To translate local interaction effects into global interaction effects at any order, we sample representative samples that have a wide range over the dataset and that are potentially meaningful. We choose the samples that are closest to a subset of common aggregates, including mean, median, min, max, and mode, as well as a random sample for good measure. Likewise, we used L2 distance to measure closeness. In addition to this, we considered different ways to aggregate the interaction effects of these samples — again, namely mean, median, min, max, or mode. We ran a wide sweep of the complete power set of these potential samples and aggregates to find which combination performed best on a wide array of synthetic datasets selected from prior works [37, 33, 23, 16], chosen to test for various types of interactions. Results of this power sweep are reported in the Appendix. We ended up using the mean interaction effect of the samples closest to the mean, minimum, and mode of all samples, as well as a random sample.
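As a minimal sketch of this heuristic (helper names are hypothetical, and finite differences stand in for the autograd derivatives used in practice; for brevity the sketch uses samples closest to the mean and minimum plus a random sample, omitting the mode):

```python
import random

def cross_partial(F, x, i, j, h=1e-4):
    # Finite-difference estimate of the local interaction effect
    # IE_ij = d^2 F / (dx_i dx_j) at sample x (Definition 3.2).
    def at(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return F(y)
    return (at(h, h) - at(h, -h) - at(-h, h) + at(-h, -h)) / (4 * h * h)

def closest(X, target):
    # Index of the sample nearest to `target` under L2 distance.
    return min(range(len(X)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(X[k], target)))

def global_interaction_effect(F, X, i, j, seed=0):
    # Representative samples: closest to the dataset mean and to the
    # per-feature minimum, plus one random sample; average their
    # squared local interaction effects into a global summary.
    n, d = len(X), len(X[0])
    mean = [sum(x[p] for x in X) / n for p in range(d)]
    mins = [min(x[p] for x in X) for p in range(d)]
    idxs = {closest(X, mean), closest(X, mins), random.Random(seed).randrange(n)}
    effects = [cross_partial(F, X[k], i, j) ** 2 for k in idxs]
    return sum(effects) / len(effects)

# F(x) = x1*x2 + x3 has a constant interaction effect of 1 between x1 and x2.
F = lambda x: x[0] * x[1] + x[2]
X = [[0.0, 0.0, 1.0], [1.0, 2.0, 0.5], [3.0, 1.0, 2.0], [2.0, 2.0, 0.0]]
ie_12 = global_interaction_effect(F, X, 0, 1)  # ≈ 1.0
ie_13 = global_interaction_effect(F, X, 0, 2)  # ≈ 0.0
```

With a trained network in place of the toy F, the same averaging applies unchanged; only the derivative computation would move to autograd.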
Improve Efficiency
Another heuristic for efficiency that we employed was subsampling the interactions that would be computed. Naturally, testing for every combination up to order ℓ would be very expensive. Every double, every triple, every quadruple, etc. — the problem grows combinatorially. We were able to mitigate this to a degree by taking advantage of the property of statistical interaction effects that an ℓ-way interaction can only exist if all its corresponding (ℓ-1)-way interactions exist [33]. In turn, we were able to reduce the search space by only selecting non-redundant combinations of the k interactions from the previous order whose interaction effects were highest, beginning with using every combination up to order o and then subsampling the top k for every order thereafter.

Our complete algorithm, which we call Taylor-Neural Interaction Detection (T-NID) due to the higher-order derivatives, is described in pseudocode in the Appendix.

Finally, we need to make a point about the sign of the resulting cross partial derivatives. A positive value indicates change in the positive direction; a negative value, in the negative direction. Since in regression we are interested in the overall effect of an interaction and are agnostic to the direction, we take the squared value of the cross partial as our measure of interaction effect. In contrast, for classification, we use the sign — positive or negative — corresponding to the class of interest. And for multi-class classification, we take F to be the network corresponding to the class output of interest, usually sampling the class with the highest estimated probability, and use its squared cross partial derivatives.
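The candidate-pruning step can be sketched as follows (a hypothetical helper name; interactions are represented as sorted tuples of variable indices mapped to their interaction effects):

```python
from itertools import combinations

def next_order_candidates(prev_scored, k):
    # Keep the top-k interactions of order (l-1) by effect, then propose
    # only those l-way combinations whose every (l-1)-subset survived --
    # the downward-closure property that prunes the combinatorial search.
    top = sorted(prev_scored, key=prev_scored.get, reverse=True)[:k]
    top_set = set(top)
    variables = sorted({v for s in top for v in s})
    l = len(top[0]) + 1
    return [c for c in combinations(variables, l)
            if all(s in top_set for s in combinations(c, l - 1))]

# Pairwise effects: {x1,x2}, {x1,x3}, {x2,x3} are strong, {x3,x4} is weak,
# so the only 3-way candidate worth testing is (1, 2, 3).
pairs = {(1, 2): 5.0, (1, 3): 4.0, (2, 3): 3.0, (3, 4): 0.1}
candidates = next_order_candidates(pairs, k=3)  # → [(1, 2, 3)]
```

Each round only requires scoring the surviving candidates with cross partial derivatives, rather than all combinations of the next order.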
To this point, we have generalized our computation of interaction effects to the local, global, and higher-order setting, but we have not yet considered the case where features are multidimensional, as is the case in higher-level deep neural network representations.

Explaining the influence of feature vectors is common in computer vision and is a mainstay of interpreting CNNs. However, we have illustrated with multiple examples why a precise explanation of a model's decisions requires an explanation of its interacting components, not just singular entities. To our knowledge, the notion of statistical interaction effects has never been generalized to the multidimensional setting. [8] applied their approach to a toy MNIST dataset consisting of a fixed set of feature vectors such that they could compute global interaction effects, but they mapped those feature vectors to single neurons and computed standard interaction effects between those mapped neurons. The limitation of this approach is that it cannot be used to explain local phenomena, which is traditionally what is of interest in computer vision, NLP, and other areas where multidimensional feature vectors are used.
Up until now, we have discussed interactions in terms of how changing one variable changes the meaning of changing another variable — however, we would like to point out now that this is not precisely what we are interested in. Take the yield sign and passing car for example. The interaction is meaningful because one changes the meaning of the other [11], not because one changes the meaning of changing the other. Changing the passing car into, say, an empty road changes the meaning of the yield sign from "stop" to "go", but a cross derivative measures how changing the passing car changes the meaning of changing the yield sign, not the meaning of the yield sign itself.

What we are interested in is indeed what the effect is of changing the car on the meaning of the yield sign. But how does one quantify the meaning of the yield sign? We would like to know how the yield sign's main effect, while fixed, would change if the passing car were out of the picture or otherwise changed. Thus, we will represent "meaning" in the only way we can, by how much the object contributes to the output, which as it happens is the characteristic problem of Grad-CAM and other explanatory tools in deep learning [31, 43, 6, 28, 34, 13]. So what we are interested in is how much changing the car C changes the importance IMP of yield sign Y, where importance is relative to the class output for decision "go" G. For local changes, this is equivalent to:

S_{Y,C} = ∂ IMP(Y, G) / ∂C, (2)

where S_{Y,C} represents the interaction salience between the yield sign and passing car, and IMP(Y, G) represents the importance of the yield sign to the neural network's decision to go or stop. We use the term interaction salience due to the deviation from interaction effects in Definition 3.2.

Suppose we have an ℓ-times differentiable function F: R^{n,d} → R, which will stand for our neural network, where ℓ ≥ 2. F takes in a matrix x consisting of n feature vectors x_1, ..., x_n ∈ R^d of dimension d.
So F is the portion of the network downstream of a set of feature vectors such as those projected by a CNN, which we flatten along the height and width dimensions to produce x_1, ..., x_n.

Quantify Importance

To fill in IMP in Equation 2, we turn to class activation maps (CAMs) [43]. However, as observed by the solution of [31], to find out how a class activation map increases the class's likelihood, we would like to know how its features contribute to the output, which we can do with their gradients. We can estimate the global effect by summing the gradient of each feature vector x_k and weighing the sum to each CAM. This amounts exactly to Grad-CAM [31]:

IMP(x_i, F(x)) = GradCAM(x_i, F(x)) = Σ_p x_{ip} Σ_k ∂F(x)/∂x_{kp}. (3)

Generalize Grad-CAM to Compute Interactions
Now that we have the importance of a feature vector (via essentially Grad-CAM), we can formulate S_ij, the interaction salience between feature vectors x_i and x_j, by substituting Equation 3 into Equation 2 and summing the dimensions as follows:

S_ij = Σ_m ∂[Σ_p x_{ip} Σ_k ∂F(x)/∂x_{kp}] / ∂x_{jm}. (4)

Merge with Statistical Interaction Effects
Finally, we bring this to an easy-to-compute form by realizing that the partial derivative in the denominator ∂x_j can be computed together with the partial derivative in the numerator. We also square the salience because a change of importance in either direction would be significant. We note that the following is a generalization of Grad-CAM that reduces elegantly to a modified interaction effects Definition 3.2:

S_ij = (Σ_m Σ_p x_{ip} Σ_k ∂²F(x) / (∂x_{kp} ∂x_{jm}))² = (Σ_{m,p,k} x_{ip} IE_{kp,jm})². (5)

In tests, we found setting k = i in Equations 3-5 without the global sum over k to perform just as well and often better, perhaps because the local gradients in Equation 3 more precisely correspond to features. We call Equation 5 HessianCAM. HessianCAM may be further differentiated with respect to a cross partial ∂x_q to get a 3-way interaction salience, and that can be further differentiated up to any order ℓ. Thus, we name this TaylorCAM, a higher-order generalization of Grad-CAM, where Grad-CAM (or a close variant) is the special case ℓ = 1 and HessianCAM is the special case ℓ = 2.

Note that interaction saliences are conditional. The interaction salience of feature x_i on feature x_j is not necessarily the same as that of x_j on x_i. Interaction salience S_ij represents the influence of x_i on the importance of x_j. Interaction salience S_ijk... represents the influence of x_i on the interaction salience of interaction x_j, x_k, .... To address this, we sum the mutual pairs, e.g., S_ij + S_ji, although we note that we did so only to make the presentation clearer and not because it is required. For many interpretation tasks, understanding that the meaning of the yield sign depends on the car, but the meaning of the car does not depend on the yield sign, is crucial to getting the most precise understanding.
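Summing the mutual pairs S_ij + S_ji and zeroing the self-interaction diagonal can be sketched as (a hypothetical helper name):

```python
def symmetrize_saliences(S):
    # Sum mutual pairs (S_ij + S_ji) so each cell holds the total salience
    # of an unordered feature pair, and zero the self-interaction diagonal.
    n = len(S)
    return [[0.0 if i == j else S[i][j] + S[j][i] for j in range(n)]
            for i in range(n)]

S = [[1.0, 2.0], [3.0, 4.0]]
sym = symmetrize_saliences(S)  # → [[0.0, 5.0], [5.0, 0.0]]
```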
Computing the mutual pairs does not require re-computation of any derivatives, and can be achieved easily by permuting the resulting interaction saliences and summing them, as demonstrated in our public code. Lastly, we zero out the diagonals and redundant grid cells of the resulting interaction saliences to only consider interactions between non-redundant feature vectors.

Table 1: AUC scores for pairwise interaction effects, comparing ANOVA, HierLasso, RuleFit, AG, NID, NID + MLP-M, and T-NID. Top-1 scores are bolded.

One limitation of TaylorCAM is that "meaning" is defined as contribution to the output, so if two different objects have the same contribution to the output, then changing one into the other would be considered meaningless, and so the interactions might not be identified. An example of this limitation is when an agent is asked, "What is the color of the circle furthest from the red square?" If the furthest circle is blue, and the second furthest is also blue, then changing the furthest into a square does not have a meaningful impact on the red square's contribution to the output, as determined by Grad-CAM, since the output would be unchanged (blue). Grad-CAM++ [6] may hold an insight as to how to address this, via even-higher-order derivatives. Another limitation is that "change" is being measured locally, as derivatives do not account for non-local rates of change. This means that TaylorCAM, like other deep learning explanatory tools, depends on the local regions of representations. Lastly, of course, is the time complexity of computing higher-order derivatives. Higher-order differentiation has become increasingly more accessible with Taylor-mode autograd methods like JAX [4] and libraries like the new PyTorch functional autograd API [24], yet remains a challenge as the order grows. For HessianCAM, we had no trouble computing 2nd-order derivatives of Relation Networks using PyTorch and CPU memory.
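For the 2nd-order case, a minimal numerical sketch of Equation 5 with the k = i simplification (helper names are hypothetical, and finite differences stand in for the autograd derivatives used in practice):

```python
def cross_partial(F, x, a, b, h=1e-3):
    # Central-difference estimate of d^2 F / (dx[a] dx[b]), where a and b
    # index (feature, dimension) cells of the n x d input matrix x.
    def at(deltas):
        y = [row[:] for row in x]
        for (k, p), d in deltas:
            y[k][p] += d
        return F(y)
    return (at([(a, h), (b, h)]) - at([(a, h), (b, -h)])
            - at([(a, -h), (b, h)]) + at([(a, -h), (b, -h)])) / (4 * h * h)

def hessian_cam(F, x, i, j):
    # Eq. 5 with k = i:
    # S_ij = (sum_{m,p} x[i][p] * d^2 F / (dx[i][p] dx[j][m]))^2
    d = len(x[0])
    s = sum(x[i][p] * cross_partial(F, x, (i, p), (j, m))
            for p in range(d) for m in range(d))
    return s * s

# Toy "network": the output multiplies the first dimensions of feature
# vectors 0 and 1, so they interact, while feature 2 interacts with nothing.
F = lambda x: x[0][0] * x[1][0] + x[2][1]
x = [[2.0, 0.0], [3.0, 0.0], [1.0, 1.0]]
s_01 = hessian_cam(F, x, 0, 1)  # (x[0][0] * 1)^2 = 4.0
s_02 = hessian_cam(F, x, 0, 2)  # ≈ 0.0: no interaction
```

In practice, the inner derivatives come from a single Hessian-vector computation in an autograd framework rather than O(d²) function evaluations.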
Experiments

We evaluate T-NID's ability to rank interactions on the suite of synthetic functions proposed by [37, 33, 23, 16], which were "designed to have a mixture of pairwise and higher-order interactions, with varying order, strength, nonlinearity, and overlap" [37]. These are available to see in the
Appendix and in Table 1 of [37].

For pairwise interaction effects (see Table 1), we report or reproduce the experiments of [37] verbatim, measuring AUC scores between predicted interaction rankings and ground truths. A pair x_i, x_j is considered an interaction either by itself or when it is a subset of a higher-order interaction, as in [33, 23]. Included for comparison are benchmarks from various statistical and machine learning methods [40, 35, 33, 37], as reported by [37]. NID [37] uses an interpretation of the weights from a standard MLP to detect interactions, whereas NID + MLP-M uses an MLP with additional univariate networks summed at the output to discourage modeling of main effects and false spurious interactions. In contrast, our T-NID uses only a standard MLP with GELU activations. Unlike NID, we found no significant benefit from MLP-M or sparsity regularization. Despite the simpler architecture, T-NID is immune to some of the deficits of NID and NID + MLP-M. T-NID is able to distinguish main effects and spurious interactions, and while NID + MLP-M modeled spurious main effects in a three-variable interaction, T-NID recognizes it as an interaction, as the cross derivative is nonzero across the domain of its variables. All around, T-NID performs on par with or better than prior baselines at computing pairwise statistical interaction effects on these synthetic tasks.

Table 2: AUC scores for higher-order n-way interaction effects (3-way, 4-way, and 5-way) for NID and T-NID.

For higher-order interactions, we do not report AUC scores against the full ground truth, as that would grow combinatorially more expensive with higher orders. Since NID also extracts interactions one order at a time, we compare the AUC scores of NID and T-NID one order at a time and use ground truths from the union of their discovered interactions. That way, they can be assessed relative to one another, albeit not universally. In addition to the results reported in Table 2, we tested many variants of architectures and report results with NID + MLP-M in the Appendix. In all cases, the relative results were largely the same, with T-NID achieving the highest scores, except less so at 3-way interactions when equipped with its own main effects network (MLP-M). Since any-order NID tends to find supersets much better than subsets, at 3-way interactions, NID misses nearly all present interactions, whereas T-NID fares relatively well.

We ran two qualitative assessments of TaylorCAM in multi-object detection. In both, the task was to identify whether a pair of objects were present in tandem. We tested the objects "car" and "person" in the COCO annotated-image dataset [21], and we designed our own toy dataset consisting of cars (rectangles), signs (triangles), and a yield sign (red triangle) with labels "go" or "stop." Most interestingly, in the Yield-or-Go task, we found TaylorCAM rightfully interprets no nonzero interaction saliences or inter-object interactions in negative samples. We showcase figures and discuss these experiments further in the
Appendix.

Sort-Of-CLEVR is a toy dataset for relational reasoning proposed by [30]. It is a less computationally expensive 2D form of the CLEVR VQA dataset [18] with a focus on relational questions. In our setup, these questions include distance relationships and compare-and-count tasks. To demonstrate TaylorCAM's potential for revealing a neural network's relational reasoning, we train a Relation Network (RN) [30] on Sort-Of-CLEVR and visualize its top consecutive interactions in Figure 1.

Figure 1: Shown are the top 4 interactions identified from a Relation Network's predictions on 6 visual question-answering samples: (a) Q: "Which shape is closest to the green square?" (b) Q: "Which shape is furthest from the blue circle?" (c) Q: "How many objects have shape of green object?" (d) Q: "Which shape is closest to the purple square?" (e) Q: "Which shape is furthest from the pink circle?" (f) Q: "How many objects have shape of yellow object?" The boxes can be interpreted as saying, "the meaning of one region depends on the contents of the other region." We recommend testing yourself to see if you can guess (1) the object of interest and (2) the question being asked, without looking at the caption. The 6 objects are "blue", "purple", "red", "yellow", "orange", and "green", and the 3 questions are "Which shape is closest to the object of interest?", "Which shape is furthest from the object of interest?", and "How many objects have the same shape as the object of interest?"

Table 3: Human study

            Objects  Questions
Grad-CAM    14.0%    29.3%
Random      16.7%    33.3%
TaylorCAM

Table 4: Interaction salience and relational capacity correlation

       CNN + IRN  RN  MLP  Avg Pooling
AMIS   89%  56%  5%

Table 5: Top MoCA interactions

Top N-Way Interaction                           Strength
np3rign, handed                                 2.92E-05
id_num, scau20, mcarec4                         4.77E-06
scau13, np1slpn, np1cnst, nhy                   6.00E-07
slplmbmv, np1dprs, np2walk, np3rigru, np3pstbl  1.23E-07

We observed that TaylorCAM affords clear explanations of the RN's reasoning for a question. The object of interest usually belongs to each top interaction, while its corresponding interactions are usually sensible for the question, focusing nearby or far away in proximity questions, and on the appropriate shapes in counting questions. To quantify, we selected a random batch of samples and their ordered interaction saliences, and conducted a small human study (n = 10), asking each individual to guess (1) the object of interest and (2) the question being asked, from just looking at the ranked interaction visuals. We report the results in Table 3, demonstrating strong explainability with significantly higher guess-accuracy than using Grad-CAM or random guessing. A complete breakdown of this and Grad-CAM's low performance is available in the
Appendix.

Additionally, we explore the Average Maximum Interaction Salience (AMIS) as a predictor of the performance and capacity of a model for relational reasoning. To be clear, this metric is the mean interaction salience of each test sample's maximum interaction salience. These stats are reported in Table 4, where we show this correlation across four CNN architectures for relational reasoning in VQA. We report additional architectures and discuss the results in the Appendix. We used our explanations to devise a minor adjustment to the RN architecture, which we call Interactional Relation Network (IRN), that mitigated non-relational behavior and achieved better performance. Due to the need for brevity, these details, as well as all architectural details (hyperparameters, layer sizes, epochs), may be found in the Appendix. IRN performance is included in Table 4.
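The AMIS statistic itself is simple to compute from per-sample salience matrices (a sketch with a hypothetical function name):

```python
def average_maximum_interaction_salience(saliences):
    # AMIS: take the maximum interaction salience within each test
    # sample's salience matrix, then average those maxima over samples.
    maxima = [max(max(row) for row in S) for S in saliences]
    return sum(maxima) / len(maxima)

# Two samples' pairwise salience matrices (toy values).
saliences = [[[0.0, 1.0], [2.0, 3.0]],
             [[5.0, 0.0], [0.0, 1.0]]]
amis = average_maximum_interaction_salience(saliences)  # → 4.0
```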
Parkinson's disease (PD) is a neurodegenerative disease characterized clinically by motor and non-motor symptoms that vary over time, progressing interdependently. We classified patients from the PPMI study dataset with more severe progression in decline of cognitive function, as measured by the Montreal Cognitive Assessment (MoCA) scale. Top interactions are displayed in Table 5. The top pairwise interaction was handedness and severity of rigidity in the neck. Handedness has been significantly associated with specific genetic loci implicated in the pathogenesis of neurologic disorders including PD [39]. More severe rigidity symptoms in PD are also associated with faster cognitive decline [27]. Our analysis suggests that various measures previously thought to be unrelated should be considered together when predicting faster cognitive progression in PD. See the Appendix for more details, a full analysis, and many more interpretations.
With T-NID and TaylorCAM, we have shown that input cross derivatives, combined with a few simple heuristics and intuitions, are a powerful tool for explaining interactions in deep learning. T-NID, using GELU activations, representative samples, and interaction subsampling, achieves state-of-the-art scores at ranking statistical interactions. Meanwhile, TaylorCAM generalizes Grad-CAM to the higher order and effectively explains interactions in object detection and relational reasoning, affording a human cohort the insight to guess questions in VQA from only seeing the top discovered visual interactions. We also tied these metrics to relational reasoning and note that we used them to better customize the Relation Network architecture. To cap it off, we applied T-NID to the real-world problem of classifying the rate of clinical progression in Parkinson's disease and made some expected as well as novel observations about potential underlying mechanisms of PD progression. By making our code publicly available, we hope that these simple explanatory tools can be used and built upon to better explain the complex interoperating factors underlying neural network reasoning and the world.
Broader Impact
A common critique of deep neural networks has been their apparent "black box" nature. Any field that benefits from understanding why a neural network predicts something, not just what it predicts, may benefit from an explanatory tool that affords more precise understandings of the relations and dependencies underlying predictions, e.g., biomedicine, economics, and areas where AI might have authority, like the judicial system. However, it would be dangerous to trust these explanatory systems indiscriminately. If an explanation of clinical disease progression points to a beneficial interaction between two drugs, careful study is needed to determine whether those drugs are indeed implicated before administering them as treatment. Although these explanations are useful tools, they are not perfect, and they are only as good as the model they are applied to. A racist model, due to racial biases in data, may explain that the cause of something is racial when in fact the real cause is something more complicated; as always, these technologies should not be trusted blindly.
Acknowledgments and Disclosure of Funding
Research reported in this publication was supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under Award Number P50NS108676. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
References

[1] C. Ai and E. C. Norton. Interaction terms in logit and probit models. Economics Letters, 80(1):123–129, 2003.
[2] H. Aschard. A perspective on interaction effects in genetic association studies. Genetic Epidemiology, 40(8):678–688, 2016.
[3] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[4] J. Bettencourt, M. J. Johnson, and D. Duvenaud. Taylor-mode automatic differentiation for higher-order derivatives in JAX. In Advances in Neural Information Processing Systems, Workshop on Program Transformations, 2019.
[5] R. Caruana, S. Lawrence, and C. L. Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408, 2001.
[6] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018.
[7] G. K. Chen and D. C. Thomas. Using biological knowledge to discover higher order interactions in genetic association studies. Genetic Epidemiology, 34(8):863–878, 2010.
[8] T. Cui, P. Marttinen, and S. Kaski. Recovering pairwise interactions using neural networks. In Advances in Neural Information Processing Systems, Bayesian Deep Learning Workshop, 2019.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[10] D. W. Dickson. Parkinson's disease and parkinsonism: neuropathology. Cold Spring Harbor Perspectives in Medicine, 2(8):a009258, 2012.
[11] J. H. Friedman, B. E. Popescu, et al. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954, 2008.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Y. Hechtlinger. Interpretation of prediction models using the input gradient. ArXiv, abs/1611.07634, 2016.
[14] D. Hendrycks and K. Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415, 2016.
[15] E. Heremans, E. Nackaerts, S. Broeder, G. Vervoort, S. P. Swinnen, and A. Nieuwboer. Handwriting impairments in people with Parkinson's disease and freezing of gait. Neurorehabilitation and Neural Repair, 30(10):911–919, 2016.
[16] G. Hooker. Discovering additive structure in black box functions. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), page 575. ACM Press.
[17] S. Hwang, P. Agada, S. Grill, T. Kiemel, and J. J. Jeka. A central processing sensory deficit with Parkinson's disease. Experimental Brain Research, 234(8):2369–2379, 2016.
[18] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
[19] N. Jozwiak, R. B. Postuma, J. Montplaisir, V. Latreille, M. Panisset, S. Chouinard, P.-A. Bourgouin, and J.-F. Gagnon. REM sleep behavior disorder and cognitive impairment in Parkinson's disease. Sleep, 40(8), 2017.
[20] V. Kelly, C. Johnson, E. McGough, A. Shumway-Cook, F. Horak, K. Chung, A. Espay, F. Revilla, J. Devoto, C. Wood-Siverio, et al. Association of cognitive domains with postural instability/gait disturbance in Parkinson's disease. Parkinsonism & Related Disorders, 21(7):692–697, 2015.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[22] G. Liu, H. Zeng, and D. K. Gifford. Visualizing complex feature interactions and feature sharing in genomic deep neural networks. BMC Bioinformatics, 20(1):1–14, 2019.
[23] Y. Lou, R. Caruana, J. Gehrke, and G. Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13), page 623. ACM Press.
[24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems, 2017.
[25] W. Poewe. Dysautonomia and cognitive dysfunction in Parkinson's disease. Movement Disorders: Official Journal of the Movement Disorder Society, 22(S17):S374–S378, 2007.
[26] M. M. Ponsen, A. Daffertshofer, E. C. Wolters, P. J. Beek, and H. W. Berendse. Impairment of complex upper limb motor function in de novo Parkinson's disease. Parkinsonism & Related Disorders, 14(3):199–204, 2008.
[27] A. Rajput, A. Voll, M. Rajput, C. Robinson, and A. Rajput. Course in Parkinson disease subtypes: a 39-year clinicopathologic study. Neurology, 73(3):206–212, 2009.
[28] A. S. Ross, M. C. Hughes, and F. Doshi-Velez. Right for the right reasons: training differentiable models by constraining their explanations. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI '17), pages 2662–2670. AAAI Press.
[29] A. Santoro, R. Faulkner, D. Raposo, J. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. Lillicrap. Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pages 7299–7310, 2018.
[30] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
[31] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
[32] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, and J. Tang. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1161–1170, 2019.
[33] D. Sorokina, R. Caruana, M. Riedewald, and D. Fink. Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 1000–1007. ACM Press.
[34] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (ICML '17), pages 3319–3328. JMLR.org, 2017.
[35] R. Tibshirani. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):273–282, 2011.
[36] A. Troster, A. Paolo, K. Lyons, S. Glatt, J. Hubble, and W. Koller. The influence of depression on cognition in Parkinson's disease: a pattern of impairment distinguishable from Alzheimer's disease. Neurology, 45(4):672–676, 1995.
[37] M. Tsang, D. Cheng, and Y. Liu. Detecting statistical interactions from neural network weights. In International Conference on Learning Representations, 2018.
[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[39] A. Wiberg, M. Ng, Y. Al Omran, F. Alfaro-Almagro, P. McCarthy, J. Marchini, D. L. Bennett, S. Smith, G. Douaud, and D. Furniss. Handedness, language areas and neuropsychiatric diseases: insights from brain imaging and genetics. Brain, 142(10):2938–2947, 2019.
[40] T. Wonnacott and R. Wonnacott. Introductory Statistics. Wiley Series in Probability and Mathematical Statistics. Wiley, 1977.
[41] N. Yi. Statistical analysis of genetic interactions. Genetics Research, 92(5-6):443–459, 2010.
[42] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, and P. Battaglia. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, 2019.
[43] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.

Appendix
Table of Contents
A Note On Terminology
B Representative Samples & Aggregations
C T-NID Algorithm
D Test Suite Of Synthetic Functions
E Additional Architectures For N-Way Interactions
F Object Detection Figures & Analysis
G Human Study Analysis
H AMIS Scores Of Different Architectures
I Interactional Relation Network (IRN)
J Architecture Configurations
K Biomedical Analysis

A Note On Terminology
In colloquial terms, two things are said to interact when they depend on each other in some way. Similar to [11], this can be formalized as follows:
Definition A.1. Entity Interaction
Given an entity e_1 with attributes (a_1, ..., a_n), an interaction exists with another entity e_2 with attributes (b_1, ..., b_n) if some a_i depends on some b_j or some b_j depends on some a_i.

Now we will define mathematical relation.

Definition A.2. Relation
Given sets A and B, the binary relation from A to B is a subset of the Cartesian product A × B.

We would like to unify our colloquial understanding of interaction in Definition A.1, our mathematical definition of relation in Definition A.2, and our definition of statistical interaction effects in Definition 2 of the main paper. To connect this to Definition 2, we will reframe features as entities with the following theorem:

Theorem A.1.
Given a function F(x) and feature x_i, let entity e_i consist of attributes (x_i, ∂F/∂x_i). An interaction exists between e_1 and e_2 if there is a nonzero interaction effect between x_1 and x_2.

Proof.
If there is a nonzero interaction effect between x_1 and x_2, then ∂²F(x)/(∂x_1 ∂x_2) ≠ 0 for some input x. Then ∂F/∂x_1 depends on x_2 and, consequently, there exists an interaction between entities e_1 and e_2.

We have shown that our statistical interaction implies an interaction according to our colloquial understanding. An interaction exists between e_1 and e_2 if (but not only if, since the change need not be local) ∂²F(x)/(∂x_1 ∂x_2) ≠ 0, meaning ∂F/∂x_1 depends on x_2. This is considered a binary relation between the two attributes, as all functions are relations, though not all relations are functions. Formally: given a function F(x), a feature x_i, and entity e_i consisting of attributes (x_i, ∂F/∂x_i), if there is a nonzero interaction effect between x_1 and x_2, then a relation exists between the attributes of the two entities.

We have shown that, under this framing, an interaction effect is a relation, and if the interaction effect is nonzero, there must be a dependency/interaction between those entities. Since feature vectors in CNNs can be treated as entities [30, 42, 29], and if one interprets their gradients on the output to be implicit attributes, computing interaction effects between CNN feature vectors is equivalent to identifying the colloquial interactions and relations described in this formulation. This is trivially generalized to interactions/relations of higher orders.

To summarize, a mathematical relation is implied by a colloquial interaction, which is in turn implied by a statistical interaction, and this hierarchy can be formalized by regarding a feature x_i as an entity whose attributes include its gradients with respect to the function of interest. Thus, we offer a simple, formal connection between our statistical interaction-effects definition and mathematical relations, as well as an integration of both into the colloquial understanding of "interaction" as merely a dependency between two "things."

B Representative Samples & Aggregations
Table 6 displays the top 10 aggregations and representative samples discovered via our power sweep.
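The representative-sample selection behind these aggregations (pick the real sample closest to each feature-wise aggregate, as in Algorithm 1) can be sketched as follows; the tiny dataset and the choice of aggregation set are illustrative only, not the paper's code:

```python
import random
import statistics

def representative_samples(X, seed=0):
    """For each aggregation, return the dataset row nearest (in squared
    Euclidean distance) to the feature-wise aggregate of X."""
    rng = random.Random(seed)
    columns = list(zip(*X))  # one tuple per feature
    aggregates = {
        "mean": [statistics.mean(col) for col in columns],
        "min": [min(col) for col in columns],
        "mode": [statistics.mode(col) for col in columns],
        "random": [rng.choice(col) for col in columns],
    }
    def closest(target):
        return min(X, key=lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target)))
    return {name: closest(agg) for name, agg in aggregates.items()}

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [1.0, 0.0]]
reps = representative_samples(X)
print(reps["min"])   # nearest real sample to the feature-wise minimum (0, 0)
print(reps["mean"])  # nearest real sample to the feature-wise mean (1, 0.75)
```

Note that the aggregate itself need not be a real datapoint; projecting back onto the dataset keeps every representative sample on the data manifold.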
C T-NID Algorithm
Our complete T-NID is described in Algorithm 1. Note that each derivation of an interaction effect using Definition 2 of the main paper, for an interaction I = Î ∪ {j} of size ℓ̂ where |Î| = ℓ̂ − 1, for sample x, can be derived as a single-order partial derivative ∂IE_Î/∂x_j and does not need to be recomputed from the ground up.

Table 6: Top average (across all orders) AUC scores for different aggregations of representative samples

Aggregation Of Representative Samples       AUC Score
Mean Of Mean-Min-Mode-Rand                  0.61825
Mean Of Med-Min-Mode-Rand                   0.61825
Mean Of Mean-Med-Min-Mode-Rand              0.61775
Mean Of Mean-Min-Max-Mode-Rand              0.6155
Mean Of Med-Min-Max-Mode-Rand               0.6155
Med Of Mean-Min-Mode-Rand                   0.61525
Med Of Med-Min-Mode-Rand                    0.61525
Mean Of Mean-Med-Min-Max-Mode-Rand          0.61525
Mean Of Mean-Min-Rand                       0.614
Mean Of Med-Min-Rand                        0.614

Algorithm 1: T-NID algorithm in pseudocode
Inputs: ℓ-times differentiable trained neural network F; dataset X with i-th sample features X_i1, ..., X_in; order ℓ; orders without subsampling o; subsampling size k.
Outputs: Interaction effects IE_I for the top estimated interactions I ⊆ {1, ..., n}, where |I| ≤ ℓ.

Get representative samples:
    For the j-th aggregation ∈ {mean, minimum, mode, random}:
        c = argmin_i ‖X_i − aggregation(X, axis = 0)‖
        r_j = X_c

For each representative sample r_j ∈ r:
    Compute all non-redundant partial derivatives up to order o:
        For I ⊆ {1, ..., n}, where |I| ≤ o:
            I = sort(I)
            If IE^(j)_I is uninitiated:
                Initiate IE^(j)_I according to Definition 2 of the main paper
    Compute remaining partial derivatives up to order ℓ by subsampling the top k from previous orders:
        For ℓ̂ ∈ o + 1, ..., ℓ:
            For Î ∈ the top-k argmax of IE^(j)_I, where |I| = ℓ̂ − 1:
                For I ⊆ {1, ..., n}, where |I| = ℓ̂ and Î ⊂ I:
                    If IE^(j)_I is uninitiated:
                        Initiate IE^(j)_I according to Definition 2 of the main paper

Take the mean interaction effects across representative samples:
    For I ⊆ {1, ..., n}:
        If IE^(j)_I is initiated for some j:
            IE_I = mean of IE^(j)_I over all j where IE^(j)_I is initiated

Return IE

D Test Suite Of Synthetic Functions
The test-suite of synthetic functions used to evaluate T-NID may be found in Table 7, courtesy of[37].
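The cross-derivative interaction test that T-NID applies to these benchmarks (Definition 2 of the main paper) can be sanity-checked numerically; a toy sketch using central finite differences on a hand-written function of the same flavor as the test suite (the paper itself differentiates trained networks with autodiff, not finite differences):

```python
import math

def cross_partial(F, x, i, j, h=1e-4):
    """Central finite-difference estimate of d^2 F / (dx_i dx_j) at x.
    A nonzero value flags a pairwise interaction between features i and j."""
    def at(di, dj):
        y = list(x)
        y[i] += di * h
        y[j] += dj * h
        return F(y)
    return (at(1, 1) - at(1, -1) - at(-1, 1) + at(-1, -1)) / (4 * h * h)

# Toy function: x1*x2 interact, x3 enters purely additively.
F = lambda x: x[0] * x[1] + math.sin(x[2])
x0 = [0.5, -1.2, 0.3]
print(round(cross_partial(F, x0, 0, 1), 3))  # ≈ 1.0 (interaction present)
print(round(cross_partial(F, x0, 0, 2), 3))  # ≈ 0.0 (no interaction)
```

Higher-order effects follow the same pattern, differentiating an already-computed effect with respect to one additional feature, which is what lets Algorithm 1 reuse lower-order results.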
Table 7: Synthetic test-suite functions, from [37]

F1(x) = π^{x1 x2} √(2 x3) − sin⁻¹(x4) + log(x3 + x5) − (x9/x10) √(x7/x8) − x2 x7
F2(x) = π^{x1 x2} √(2 |x3|) − sin⁻¹(0.5 x4) + log(|x3 + x5| + 1) + (x9/(1 + |x10|)) √(x7/(1 + |x8|)) − x2 x7
F3(x) = exp|x1 − x2| + |x2 x3| − x3^{2|x4|} + log(x4² + x5² + x7² + x8²) + x9 + 1/(1 + x10²)
F4(x) = exp|x1 − x2| + |x2 x3| − x3^{2|x4|} + (x1 x4)² + log(x4² + x5² + x7² + x8²) + x9 + 1/(1 + x10²)
F5(x) = 1/(1 + x1² + x2² + x3²) + √|x4 + x5| + |x6 + x7| + x8 x9 x10
F6(x) = exp(|x1 x2| + 1) − exp(|x3 + x4| + 1) + cos(x5 + x6 − x8) + √(x8² + x9² + x10²)
F7(x) = (arctan(x1) + arctan(x2))² + max(x3 x4 + x6, 0) − 1/(1 + (x4 x5 x6 x7 x8)²) + (|x7|/(1 + |x9|))⁵ + Σ_{i=1}^{10} x_i
F8(x) = x1 x2 + 2^{x3 + x5 + x6} + 2^{x3 + x4 + x5 + x7} + sin(x7 sin(x8 + x9)) + arccos(0.9 x10)
F9(x) = tanh(x1 x2 + x3 x4) √|x5| + exp(x5 + x6) + log((x6 x7 x8)² + 1) + x9 x10 + 1/(1 + |x10|)
F10(x) = sinh(x1 + x2) + arccos(tanh(x3 + x5 + x7)) + cos(x4 + x5) + sec(x7 x9)

E Additional Architectures For N-Way Interactions

Table 8 shows results for T-NID + MLP-M (T-NID using a neural network equipped with a main-effects network, as well as trained with sparsity regularization) and NID + MLP-M, the architecture used in [37].

Table 8: N-way AUC scores (mean ± standard deviation on F1(x) through F10(x), plus their average) for T-NID + MLP-M and NID + MLP-M at the 3-way, 4-way, and 5-way orders, both using a main-effects network and sparsity regularization, as described in [37]; some function-order combinations at the 4-way and 5-way orders are marked N/A.

F Object Detection Figures & Analysis
We evaluated two datasets in multi-object detection. In both, the task is to identify whether a pair of objects are each present in tandem. If only one is present, then the class label is negative. We tested this on the objects "car" and "person" in the COCO annotated-image dataset, and we designed our own toy dataset consisting of cars (rectangles), signs (triangles), and a yield sign (red triangle), with the task being to decide "go" or "stop" depending on whether both a car and a yield sign are present. In both cases, we configured the frequency of the labels such that an even number of positive and negative samples were in the training set.

We found the COCO task to be somewhat inconclusive, because of model overfitting and rather low test accuracy, but we still observed some sensible explanations, as seen in the top left part of Figure 2. In the Yield-or-Go task, we found the explanations to be more elucidating. To our surprise, the model in the Yield-or-Go task appears to have two strategies. The first is what we expected: it interacts the yield sign (red triangle) with a car (rectangle), as seen in the bottom left row of Figure 2. In the second one, about as frequent, it interacts one particular car of interest with the other cars. One would expect it to always interact the car and the yield sign, but actually, it seems the model discovered that it can solve the problem just as well by checking whether (1) a car is present, and (2) a red car is not present. Because of how the task was set up, with each object having a different color, (2) implies that a yield sign is present. So interestingly, what we found is that the model alternates between two strategies: one where it acts predictably, and one where it prioritizes the cars and looks at each pair of them, in which case it accurately predicts "stop."

(a) Top: interacting pairs "person" and "car". Bottom: interacting yield sign (red triangle) and car (rectangle).
(b) When no yield sign is present, interactions are frequently zero or occur primarily between adjacent regions.

Figure 2: Simple interactions in multi-object detection.

However, the more interesting result comes from when the correct label is "go," i.e., a car and yield sign are not present together. In this case, we find that the model rarely interacts anything; rather, either all interaction saliences are zero or it interacts objects with themselves (immediately adjacent regions). An example of this is emphasized in Figure 2b. This self-interacting appears to be one intuitive and convenient way to interpret that the model does not perceive any salient interactions. When it does interact multiple objects, it usually does so using the red shape as the central object that it interacts all others with.
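The shortcut the model appears to exploit can be made concrete; a toy sketch with a hypothetical scene encoding (ours, not the dataset code), under the illustrative assumption that each scene contains exactly one red object, which is either a red car or the yield sign:

```python
def label(scene):
    # Ground-truth rule: "stop" iff both a car and a yield sign are present.
    car = any(o in ("car", "red_car") for o in scene)
    return "stop" if car and "yield_sign" in scene else "go"

def shortcut(scene):
    # Discovered strategy: a car is present and the red object is not a car,
    # which (under the one-red-object assumption) implies a yield sign exists.
    car = any(o in ("car", "red_car") for o in scene)
    return "stop" if car and "red_car" not in scene else "go"

# Scenes respecting the assumption: exactly one red object each.
scenes = [["car", "yield_sign"], ["car", "red_car"],
          ["sign", "yield_sign"], ["red_car", "sign"]]
print(all(label(s) == shortcut(s) for s in scenes))  # True
```

The equivalence breaks for scenes with no red object at all, which is one way such color-based shortcuts can fail to generalize outside the training distribution.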
G Human Study Analysis
Table 9: Object-of-interest guess-accuracies

Color     TaylorCAM Accuracy    Grad-CAM Accuracy
Green     40%                   13.
Red       .7%                   30%
Blue      40%                   10%
Purple                          N/A
Orange    15%                   3.
Yellow    .3%                   25%
Table 10: Question guess-accuracies

Question      TaylorCAM Accuracy    Grad-CAM Accuracy
Question 1    76%                   44%
Question 2    55%                   14%
Question 3    .3%                   30%
Tables 9 and 10 refer to the human study. We found a wide range of explainability across different colors and questions. Consistently, the color red in both TaylorCAM and Grad-CAM exceeds all others in guess-accuracy, achieving significantly better guess-accuracy than random guessing in both cases. This could indicate a psychological bias in our population toward the color red, or a bias in the model that results in more intuitive explanations with respect to the color red. We considered the possibility that red was simply more frequently guessed, but that was not the case. The most guessed color was green, followed by blue, red, orange, yellow, and purple.

Due to random sampling, none of the sampled images for Grad-CAM included a purple object of interest, so it is marked "N/A" in Table 9. Other than purple, the least accurately guessed color, consistently across both Grad-CAM and TaylorCAM, was orange.

While some Grad-CAM colors strongly outperform random guessing (red and yellow), on average, people struggled to guess the object of interest with Grad-CAM. This is shown quantitatively, as well as having been expressed to us by the study participants. This is because Grad-CAM only explains which individual objects contribute to the output, which in the case of relational VQA is all of them, with an equal importance assigned to the object of interest and any objects that are included in the question-answer, such as the furthest or nearest object. This results in uninterpretable and sometimes misleading visualizations, making it very hard, both quantitatively and subjectively, to guess an object of interest in VQA from the visual only. Without knowing the object of interest, it is consequently much harder to guess the question asked.

Both Grad-CAM and TaylorCAM did surprisingly well on question 1, which asked "Which shape is nearest to the object of interest?"
Closeness is easier to interpret with both explanatory tools, since it is usually more visually apparent. However, we found question 2 ("Which shape is furthest from the object of interest?") to be much harder to interpret for Grad-CAM, perhaps because it is unclear what the object of interest is, resulting in multiple "far away" objects of arbitrary distance from each other being ranked highly. For example, two objects that are far away from the object of interest might be close to each other, creating the false impression that the question is asking about closeness. Thus, without confidence regarding the object of interest and the interacting parts, we found ranked importances alone to be unintuitive and even misleading.

H AMIS Scores Of Different Architectures
Table 11: Interaction salience and relational capacity correlation

CNN +        RN-GELU    RN-Sigmoid    RN-Tanh
AMIS         0.30       0             0
Accuracy     71%        12%           43%

To better control for variables such as architecture size, we further investigate our observed correlation by comparing the same architecture while varying only the activation function. We test GELU, Sigmoid, and Tanh. Here, we found that the correlation between AMIS score and performance still holds when only the activation functions are changed. Of course, further study is needed to determine statistical significance and to understand if and why there may be a connection between the magnitude of interaction salience and relational reasoning.
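As a rough check, the correlation suggested by Table 11 can be quantified with a Pearson coefficient over its three (AMIS, accuracy) pairs; a sketch only, since three points carry little statistical weight:

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

amis_scores = [0.30, 0.0, 0.0]   # RN-GELU, RN-Sigmoid, RN-Tanh (Table 11)
accuracies = [0.71, 0.12, 0.43]
print(round(pearson(amis_scores, accuracies), 2))  # ≈ 0.85
```

With only three architectures the coefficient is suggestive at best, which is consistent with the caveat above about statistical significance.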
I Interactional Relation Network (IRN)
A standard RN pools a set of feature vectors O = {o_1, ..., o_n}, their corresponding positional encodings C = {c_1, ..., c_n}, and a question q as follows:

RN(O, C, q) = f_φ( Σ_{i,j} g_θ(o_i, o_j, c_i, c_j, q) ),    (6)

where f and g are modeled by neural networks parameterized by φ and θ, respectively.

We observed through TaylorCAM that many of the top interactions in the RN's reasoning were between individual regions and themselves, even when we zeroed out diagonals. This is illustrated in Figure 2b. To mitigate this, we made a simple modification to the RN architecture, which we found to yield better test accuracy:

IRN(O, C, q) = f_φ( Σ_{i,j} g_θ(h_ψ(o_i, c_i, q), h_ψ(o_j, c_j, q), c_i, c_j, q) ),    (7)

where h is an MLP parameterized by ψ.

For lack of a better name, we call this architecture the Interactional Relation Network (IRN) because it explicitly separates, within its architecture, the concerns of reasoning about interactions from reasoning about individual objects.

(a) Top 2-way interactions for MoCA fast progression.

(b) Top 3-way, 4-way, and 5-way interactions for MoCA fast progression:

N-Way Interaction                                    Strength
id_num, scau20, mcarec4                              4.77E-06
id_num, drmagrac, mcarec4                            4.69E-06
educyrs, np1apat, bmi                                4.56E-06
np1dprs, np2walk, np3pstbl                           4.27E-06
scau13, np1slpn, np1cnst, nhy                        6.00E-07
scau11, scau13, scau20, bmi                          5.66E-07
scau11, scau13, np1slpn, nhy                         5.64E-07
scau13, scau20, np1urin, nhy                         5.43E-07
slplmbmv, np1dprs, np2walk, np3rigru, np3pstbl       1.23E-07
slplmbmv, np1dprs, np2walk, np3rign, np3pstbl        1.22E-07
scau5, np1dprs, np2walk, np3rigru, np3pstbl          1.19E-07
slplmbmv, np1dprs, np2walk, np3pstbl, mcarec2        1.18E-07

Figure 3: Interaction effects for classifying fast clinical progression of MoCA scores from baseline.

J Architecture Configurations
T-NID
For T-NID, we trained a GELU-activated multi-layer perceptron with hidden layer sizes 140, 100, 60, and 20 for 200 epochs with a learning rate of 0.003, using early stopping [5] with a patience of 10. Results were averaged across 10 trials. Input data was normalized by standard deviation. T-NID hyperparameters were set as ℓ = 5, o = 2, k = 10.

TaylorCAM
For our COCO [21] task, we used PyTorch's ResNet-50 [12] pretrained on ImageNet [9], except we replaced the global average pooling layer with an additional convolutional layer composed of 1024 out-channels, a size-2 kernel, stride 2, and padding 2, followed by 3 hidden linear layers of sizes 512, 256, and 64, because global average pooling resulted in zero-valued higher-order derivatives. For our Relation Network, we used an open-source reference implementation, which can be found here: https://github.com/kimhc6028/relational-networks, since [30] did not release their code to the public. For IRN, we modeled h_ψ with two linear layers of sizes 128 and 32. We trained for 50 epochs.

K Biomedical Analysis
We applied these techniques to the Parkinson's Progression Marker Initiative (PPMI) study ( ) dataset, which follows persons living with early-stage Parkinson's disease for up to approximately eight years, collecting clinical and biological data from participants. Parkinson's disease (PD) is a progressive neurodegenerative disease, characterized clinically by motor (e.g., tremor, rigidity) and non-motor (e.g., cognition and autonomic dysfunction) symptoms that vary over time within and between patients. Progression of motor and non-motor symptoms is likely not independent. Instead, collateral damage may be inflicted multilaterally, with non-motor and motor pathological features progressing interdependently. As an example, depressive symptoms in Parkinson's disease are common and may perpetuate motor and cognitive deficits, which could impact function and ultimately diminish quality of life. Therefore, it is necessary to take as comprehensive an approach as possible in unraveling the clinical progression of Parkinson's disease. As PD progresses, cognitive impairment leading to dementia may affect up to 80% of patients, ultimately impairing one's functional independence.
Within the PPMI study, we tested 2-, 3-, 4-, and 5-way interactions to understand multivariable features at baseline that distinguish patients with a more severe progression in decline of cognitive function ("fast progressors") compared to those with a more benign course of cognitive changes, as measured by the Montreal Cognitive Assessment (MoCA) scale.

(a) Top 2-way interactions for uMCA fast progression.

(b) Top 3-way, 4-way, and 5-way interactions for uMCA fast progression:

N-Way Interaction                                    Strength
scau1, np3lgagr, np3risng                            1.97E+00
scau1, scau9, np3lgagr                               1.41E+00
scau1, np2hwrt, np3lgagr                             1.31E+00
scau1, senllrsp, np3lgagr                            1.19E+00
time_from_diag, scau1, np2frez, np3lgagr             5.39E+00
scau1, np2walk, np2frez, np3lgagr                    5.28E+00
ranos, scau1, np2frez, np3lgagr                      5.25E+00
scau1, rls, np2frez, np3lgagr                        5.21E+00
scau1, np3lgagr, np3risng, np3gait, np3rtarl         1.93E+01
scau1, np2hygn, np3lgagr, np3risng, np3gait          1.80E+01
scau1, mslarsp, np3lgagr, np3risng, np3gait          1.68E+01
dxrigid, scau1, np3lgagr, np3risng, np3gait          1.47E+01

Figure 4: Interaction effects for classifying fast clinical progression of uMCA scores from baseline.

Top 2-way interaction effects identified (Figure 3) among "fast progressors" included feature interactions between handedness (handed) and severity of rigidity in the neck (np3rign); presence of resting tremor at disease diagnosis (dxtremor) and severity of rigidity in the lower extremities (np3rigll); and severity of tremor (np2trmr) and the alternating trail-making test from the MoCA scale (mcaalttm), which ultimately is a measure of processing speed, mental flexibility, ability to sequence, and visuo-motor skills. Each of these features individually has some established associations with cognitive dysfunction or neuropsychological disorders; however, their interactions together have not been previously considered.
For example, handedness has been significantly associated with functional connectivity between language networks, as well as with specific genetic loci implicated in the pathogenesis of neurologic disorders, including Parkinson's disease [39]. More severe rigidity symptoms in Parkinson's disease are also associated with faster cognitive decline [27]. Our analysis, for the first time, suggests that measures of both handedness and rigidity severity together are important to consider when predicting faster cognitive progression in Parkinson's disease. As shown in Figure 3, we provide 3-, 4-, and 5-way