Obtaining Faithful Interpretations from Compositional Neural Networks
Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, Matt Gardner
Allen Institute for AI, Tel-Aviv University, University of Pennsylvania, University of California, Irvine
{sanjays,mattg}@allenai.org, {ben.bogin,joberant}@cs.tau.ac.il, [email protected], [email protected], [email protected]
∗ Sanjay Subramanian, Ben Bogin, and Nitish Gupta contributed equally.

Abstract
Neural module networks (NMNs) are a popular approach for modeling compositionality: they achieve high accuracy when applied to problems in language and vision, while reflecting the compositional structure of the problem in the network architecture. However, prior work implicitly assumed that the structure of the network modules, describing the abstract reasoning process, provides a faithful explanation of the model's reasoning; that is, that all modules perform their intended behaviour. In this work, we propose and conduct a systematic evaluation of the intermediate outputs of NMNs on NLVR2 and DROP, two datasets which require composing multiple reasoning steps. We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour. To remedy that, we train the model with auxiliary supervision and propose particular choices for module architecture that yield much better faithfulness, at a minimal cost to accuracy.

Introduction

Models that can read text and reason about it in a particular context (such as an image, a paragraph, or a table) have been recently gaining increased attention, leading to the creation of multiple datasets that require reasoning in both the visual and textual domain (Johnson et al., 2016; Suhr et al., 2017; Talmor and Berant, 2018; Yang et al., 2018a; Suhr et al., 2019; Hudson and Manning, 2019; Dua et al., 2019). Consider the example in Figure 1 from NLVR2: a model must understand the compositional sentence in order to then ground dogs in the input, count those that are black, and verify that the count of all dogs in the image is equal to the number of black dogs.

Figure 1: An example for a visual reasoning problem where both the Basic and Faithful NMNs produce the correct answer. The Basic NMN, however, fails to give meaningful intermediate outputs for the find and filter modules, whereas our improved Faithful-NMN assigns correct probabilities in all cases. Boxes are green if probabilities are as expected, red otherwise.
Both models that assume an intermediate structure (Andreas et al., 2016; Jiang and Bansal, 2019) and models without such structure (Tan and Bansal, 2019; Hu et al., 2019; Min et al., 2019) have been proposed for these reasoning problems. While good performance can be obtained without a structured representation, an advantage of structured approaches is that the reasoning process in such approaches is more interpretable. For example, a structured model can explicitly denote that there are two dogs in the image, but that one of them is not black. Such interpretability improves our scientific understanding, aids in model development, and improves overall trust in a model.

Figure 2: An example for a mapping of an utterance to a gold program and a perfect execution in a reasoning problem from NLVR2 (top) and DROP (bottom).
Neural module networks (NMNs; Andreas et al., 2016) parse an input utterance into an executable program composed of learnable modules that are designed to perform atomic reasoning tasks and can be composed to perform complex reasoning against an unstructured context. NMNs are appealing since their output is interpretable; they provide a logical meaning representation of the utterance and also the outputs of the intermediate steps (modules) to reach the final answer. However, because module parameters are typically learned from end-task supervision only, it is possible that the program will not be a faithful explanation of the behaviour of the model (Ross et al., 2017; Wiegreffe and Pinter, 2019), i.e., the model will solve the task by executing modules according to the program structure, but the modules will not perform the reasoning steps as intended. For example, in Figure 1, a basic NMN predicts the correct answer False, but incorrectly predicts the output of the find[dogs] operation. It does not correctly locate one of the dogs in the image because two of the reasoning steps (find and filter) are collapsed into one module (find). This behavior of the find module is not faithful to its intended reasoning operation; a human reading the program would expect find[dogs] to locate all dogs. Such unfaithful module behaviour yields an unfaithful explanation of the model behaviour.

Unfaithful behaviour of modules, such as multiple reasoning steps collapsing into one, is undesirable in terms of interpretability; when a model fails to answer some question correctly, it is hard to tell which modules are the sources of error. While recent work (Yang et al., 2018b; Jiang and Bansal, 2019) has shown that one can obtain good performance when using NMNs, the accuracy of individual module outputs was mostly evaluated through qualitative analysis, rather than systematically evaluating the intermediate outputs of each module.

We provide three primary contributions regarding faithfulness in NMNs. First, we propose the concept of module-wise faithfulness – a systematic evaluation of individual module performance in NMNs that judges whether they have learned their intended operations – and define metrics to quantify this for both visual and textual reasoning (§3). Empirically, we show on both datasets that training an NMN with end-task supervision, even with gold programs, does not yield module-wise faithfulness, i.e., the modules do not perform their intended reasoning task. Second, we provide strategies for improving module-wise faithfulness in NMNs (§4): we show how the choice of module architecture affects faithfulness (§4.1), how auxiliary supervision of module outputs improves it (§4.2), and how decontextualized word representations help (§4.3). Figure 1 illustrates how our improved NMN (Faithful-NMN) results in expected module outputs as compared to the Basic-NMN. Last, we collect human-annotated intermediate outputs for 536 examples in NLVR2 and for 215 examples in DROP to measure the module-wise faithfulness of models, and publicly release them for future work. Our code and data are available at https://github.com/allenai/faithful-nmn.

Neural Module Networks
Overview
Neural module networks (NMNs; Andreas et al., 2016) are a class of models that map a natural language utterance into an executable program, composed of learnable modules that can be executed against a given context (images, text, etc.), to produce the utterance's denotation (truth value in NLVR2, or a text answer in DROP). Modules are designed to solve atomic reasoning tasks and can be composed to perform complex reasoning. For example, in Figure 1, the utterance "All the dogs are black" is mapped to the program equal(count(find[dogs]), count(filter[black](find[dogs]))). The find module is expected to find all dogs in the image and the filter module is expected to output only the black ones from its input. Figure 2 shows two other example programs with the expected output of each module in the program.

An NMN has two main components: (1) a parser, which maps the utterance into an executable program; and (2) an executor, which executes the program against the context to produce the denotation. In our setup, programs are always trees where each tree node is a module. In this work, we focus on the executor, and specifically the faithfulness of module execution. We examine NMNs for both text and images, and describe their modules next.
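To make the tree-structured execution concrete, the following is a minimal sketch; the function and node layout are hypothetical illustrations, not code from the released repository.

```python
# A minimal sketch of NMN program execution: each tree node names a
# module, children are executed first, and their outputs become the
# module's inputs. `modules` maps module names to callables; `context`
# holds the encoded image or passage.

def execute(node, context, modules):
    """node: (module_name, utterance_attention, children)."""
    name, utt_att, children = node
    child_outputs = [execute(child, context, modules) for child in children]
    return modules[name](context, utt_att, *child_outputs)

# "All the dogs are black" corresponds to the tree
# equal(count(find[dogs]), count(filter[black](find[dogs]))).
```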
Visual-NMN

In this task, given two images and a sentence that describes the images, the model should output True iff the sentence correctly describes the images. We base our model, the Visual-NMN, on LXMERT (Tan and Bansal, 2019), which takes as input the sentence x and raw pixels, uses Faster R-CNN (Ren et al., 2015) to propose a set of bounding boxes, B, that cover the objects in the image, and passes the tokens of x and the bounding boxes through a Transformer (Vaswani et al., 2017), encoding the interaction between both modalities. This produces a contextualized representation t ∈ R^{|x|×h} for each one of the tokens, and a representation v ∈ R^{|B|×h} for each one of the bounding boxes, for a given hidden dimension h.

We provide a full list of modules and their implementation in Appendix A. Broadly, modules take as input representations of utterance tokens through an utterance attention mechanism (Hu et al., 2017), i.e., whenever the parser outputs a module, it also predicts a distribution over the utterance tokens (p_1, ..., p_{|x|}), and the module takes as input ∑_{i=1}^{|x|} p_i t_i, where t_i is the hidden representation of token i. In addition, modules produce as output (and take as input) vectors p ∈ [0, 1]^{|B|}, indicating for each bounding box the probability that it should be output by the module (Mao et al., 2019). For example, in the program filter[black](find[dog]), the find module takes the word 'dog' (using utterance attention, which puts all probability mass on the word 'dog'), and outputs a probability vector p ∈ [0, 1]^{|B|}, where ideally all bounding boxes corresponding to dogs have high probability. Then, the filter module takes p as input as well as the word 'black', and is meant to output high probabilities for bounding boxes with 'black dogs'.

For the Visual-NMN we do not use a parser, but rely on a collected set of gold programs (including gold utterance attention), as described in §5. We will see that despite this advantageous setup, a basic NMN does not produce interpretable outputs.
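As an illustration of this interface, here is a minimal sketch of a find-style module in PyTorch, simplified from the parameterization listed in Appendix A; the class and variable names are ours, not from the released code.

```python
import torch
import torch.nn as nn

class Find(nn.Module):
    """Sketch of a Visual-NMN find module: an attention-weighted
    utterance representation is paired with each box's visual features,
    and a linear scorer yields one probability per bounding box."""

    def __init__(self, h):
        super().__init__()
        self.w = nn.Linear(2 * h, 1)  # roughly W^T [x; v] + b (Appendix A)

    def forward(self, token_reps, utt_attention, box_reps):
        """token_reps: (|x|, h); utt_attention: (|x|,); box_reps: (|B|, h)."""
        x = utt_attention @ token_reps            # (h,), e.g. mass on "dog"
        x = x.unsqueeze(0).expand(box_reps.size(0), -1)
        logits = self.w(torch.cat([x, box_reps], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)  # p in [0, 1]^{|B|}
```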
Text-NMN

Our Text-NMN is used to answer questions in the DROP dataset and uses the modules as designed for DROP in prior work (Gupta et al., 2020), along with three new modules we define in this work. The modules introduced in Gupta et al. (2020) and used as-is in our Text-NMN are find, filter, relocate, count, find-num, find-date, find-max-num, find-min-num, num-compare and date-compare. All these modules are probabilistic and produce, as output, a distribution over the relevant support. For example, find outputs a distribution over the passage tokens and find-num outputs a distribution over the numbers in the passage. We extend their model and introduce additional modules: addition and subtraction to add or subtract passage numbers, and extract-answer, which directly predicts an answer span from the representations of passage tokens without any explicit compositional reasoning. We use BERT-base (Devlin et al., 2019) to encode the input question and passage.

The Text-NMN does not have access to gold programs, and thus we implement a parser as an encoder-decoder model with attention, similar to Krishnamurthy et al. (2017), which takes the utterance as input and outputs a linearized abstract syntax tree of the predicted program.

Module-wise Faithfulness
Neural module networks (NMNs) facilitate interpretability of their predictions via the reasoning steps in the structured program and by providing the outputs of those intermediate steps during execution. For example, in Figure 2, all reasoning steps taken by both the Visual-NMN and Text-NMN can be discerned from the program and the intermediate module outputs. However, because module parameters are learned from an end-task, there is no guarantee that the modules will learn to perform their intended reasoning operation. In such a scenario, when modules do not perform their intended reasoning, the program is no longer a faithful explanation of the model behavior, since it is not possible to reliably predict the outputs of the intermediate reasoning steps given the program. Work on NMNs thus far (Yang et al., 2018b; Jiang and Bansal, 2019) has overlooked systematically evaluating faithfulness, performing only qualitative analysis of intermediate outputs.

We introduce the concept of module-wise faithfulness, aimed at evaluating whether each module has correctly learned its intended operation by judging the correctness of its outputs in a trained NMN. For example, in Figure 2 (top), a model would be judged module-wise faithful if the outputs of all the modules, find, relocate, and with-relation, are correct – i.e., similar to the outputs that a human would expect. We provide gold programs when evaluating faithfulness, to not conflate faithfulness with parser accuracy.

Measuring faithfulness in Visual-NMN

Modules in Visual-NMN provide for each bounding box a probability for whether it should be a module output. To evaluate intermediate outputs, we sampled examples from the development set, and annotated gold bounding boxes for each instance of find, filter, with-relation and relocate. The annotator draws the correct bounding boxes for each module in the gold program, similar to the output in Figure 2 (top).

A module of a faithful model should assign high probability to bounding boxes that are aligned with the annotated bounding boxes and low probabilities to other boxes. Since the annotated bounding boxes do not align perfectly with the model's bounding boxes, our evaluation must first induce an alignment. We consider two bounding boxes as "aligned" if the intersection-over-union (IOU) between them exceeds a pre-defined threshold T = 0.5. Note that it is possible for an annotated bounding box to be aligned with several proposed bounding boxes and vice versa. Next, we consider an annotated bounding box B_A as "matched" w.r.t. a module output if B_A is aligned with a proposed bounding box B_P, and B_P is assigned by the module a probability > 0.5. Similarly, we consider a proposed bounding box B_P as "matched" if B_P is assigned by the module a probability > 0.5 and is aligned with some annotated bounding box B_A.

We compute precision and recall for each module type (e.g., find) in a particular example by considering all instances of the module in that example. We define precision as the ratio between the number of matched proposed bounding boxes and the number of proposed bounding boxes assigned a probability of more than 0.5. We define recall as the ratio between the number of matched annotated bounding boxes and the total number of annotated bounding boxes. F1 is the harmonic mean of precision and recall. (The numerators of the precision and the recall are different; please see Appendix B.1 for an explanation.) Similarly, we compute an "overall" precision, recall, and F1 score for an example by considering all instances of all module types in that example. The final score is an average over all examples. Please see Appendix B.2 for further discussion on this averaging.

Measuring faithfulness in Text-NMN

Each module in Text-NMN produces a distribution over passage tokens (§2). A faithful module (find, filter, etc.) should predict high probability for tokens that appear in the gold spans and zero probability for other tokens. To measure a module output's correctness, we use a metric akin to cross-entropy loss to measure the deviation of the predicted module output p^att from the gold spans S = [s^1, ..., s^N], where each span s^i = (t_s^i, t_e^i) is annotated with its start and end tokens. Faithfulness of a module is measured by

I = − ∑_{i=1}^{N} log ( ∑_{j=t_s^i}^{t_e^i} p_j^att ).

Lower cross-entropy corresponds to better faithfulness of a module.
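The two metrics can be summarized in the following minimal sketch; the array layout and helper names are our own, not the released evaluation code.

```python
import numpy as np

def module_faithfulness(iou, probs, t_align=0.5, t_prob=0.5):
    """Visual-NMN metric for one module instance.
    iou: (n_annotated, n_proposed) IOUs between annotated and proposed
    boxes; probs: (n_proposed,) module output probabilities."""
    aligned = iou > t_align               # alignment matrix
    predicted = probs > t_prob            # boxes the module "selects"
    # a proposed box is matched if it is predicted and aligned with some
    # annotated box; an annotated box is matched if one of its aligned
    # proposed boxes is predicted
    matched_proposed = predicted & aligned.any(axis=0)
    matched_annotated = (aligned & predicted[None, :]).any(axis=1)
    precision = matched_proposed.sum() / max(predicted.sum(), 1)
    recall = matched_annotated.sum() / max(aligned.shape[0], 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

def text_module_faithfulness(p_att, gold_spans):
    """Text-NMN metric (lower is better). p_att: distribution over
    passage tokens; gold_spans: list of inclusive (start, end) indices."""
    return -sum(np.log(p_att[s:e + 1].sum() + 1e-12) for s, e in gold_spans)
```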
Improving Faithfulness in NMNs

Module-wise faithfulness is affected by various factors: the choice of modules and their implementation (§4.1), auxiliary supervision of module outputs (§4.2), and the use of contextualized utterance representations (§4.3). We discuss each of these next.

Choice of modules

Visual reasoning

The count module always appears in NLVR2 as one of the top-level modules (see Figures 1 and 2); top-level modules are Boolean quantifiers, such as number comparisons like equal (which require count) or exist, and we implement exist using a call to count and greater-equal (see Appendix A), so count always occurs in the program. We now discuss how its architecture affects faithfulness. Consider the program count(filter[black](find[dogs])). Its gold denotation (the correct count value) provides only a minimal signal from which the descendant modules in the program tree, such as filter and find, must learn their intended behavior. However, if count is implemented as an expressive neural network, it might learn to perform tasks designated for find and filter, hurting faithfulness. Thus, an architecture that allows counting, but also encourages descendant modules to learn their intended behaviour through backpropagation, is desirable. We discuss three possible count architectures, which take as input the bounding box probability vector p ∈ [0, 1]^{|B|} and the visual features v ∈ R^{|B|×h}.

Layer-count module is motivated by the count architecture of Hu et al. (2017), which uses a linear projection from image attention, followed by a softmax. This architecture explicitly uses the visual features, v, giving it greater expressivity compared to simpler methods. First we compute p · v, the weighted sum of the visual representations based on their probabilities, and then output a scalar count using FF_2(LayerNorm(FF_1(p · v))), where FF_1 and FF_2 are feed-forward networks, and the activation function of FF_2 is ReLU in order to output positive numbers only. As discussed, since this implementation has access to the visual features of the bounding boxes, it can learn to perform certain tasks itself, without providing proper feedback to descendant modules. We show in §5 that this indeed hurts faithfulness.

Sum-count module, on the other extreme, ignores v and simply computes the sum ∑_{i=1}^{|B|} p_i. Being parameter-less, this architecture provides direct feedback to descendant modules on how to change their output to produce better probabilities. However, such a simple functional form ignores the fact that bounding boxes are overlapping, which might lead to over-counting objects. In addition, we would want count to ignore boxes with low probability. For example, if filter predicts a 5% probability for 20 different bounding boxes, we would not want the output of count to be 1.

Graph-count module (Zhang et al., 2018) is a middle ground between both approaches – the naïve Sum-count and the flexible Layer-count. Like Sum-count, it does not use visual features, but learns to ignore overlapping and low-confidence bounding boxes while introducing only a minimal number of parameters (less than 300). It does so by treating each bounding box as a node in a graph, and then learning to prune edges and cluster nodes based on the amount of overlap between their bounding boxes (see the original paper for further details). Because this is a light-weight implementation that does not access visual features, proper feedback from the module can propagate to its descendants, encouraging them to produce better predictions.
Textual reasoning
In the context of Text-NMN (on DROP), we study the effect of several modules on interpretability.

First, we introduce an extract-answer module. This module bypasses all compositional reasoning and directly predicts an answer from the input contextualized representations. This has potential to improve performance in cases where a question describes reasoning that cannot be captured by pre-defined modules, in which case the program can be the extract-answer module only. However, introducing extract-answer adversely affects interpretability and learning of other modules, specifically in the absence of gold programs. First, extract-answer does not provide any interpretability. Second, whenever the parser predicts the extract-answer module, the parameters of the more interpretable modules are not trained. Moreover, the parameters of the encoder are trained to perform reasoning internally in a non-interpretable manner. We study the interpretability vs. performance trade-off by training Text-NMN with and without extract-answer.

Second, consider the program find-max-num(find[touchdown]) that aims to find the longest touchdown. find-max-num should sort spans by their value and return the maximal one; if we remove find-max-num, the program would reduce to find[touchdown], and the find module would have to select the longest touchdown rather than all touchdowns, following the true denotation. More generally, omitting atomic reasoning modules pushes other modules to compensate and perform complex tasks that were not intended for them, hurting faithfulness. To study this, we train Text-NMN by removing sorting and comparison modules (e.g., find-max-num and num-compare), and evaluate how this affects module-wise interpretability.

Supervising module output

As explained, given end-task supervision only, modules may not act as intended, since their parameters are only trained for minimizing the end-task loss. Thus, a straightforward way to improve interpretability is to train modules with additional atomic-task supervision.
Visual reasoning
For Visual-NMN, we pre-train find and filter modules with explicit intermediate supervision, obtained from the GQA balanced dataset (Hudson and Manning, 2019). Note that this supervision is used only during pre-training – we do not assume we have full supervision for the actual task at hand. GQA questions are annotated with gold programs; we focus on "exist" questions that use find and filter modules only, such as "Are there any red cars?". Given gold annotations from Visual Genome (Krishna et al., 2017), we can compute a label for each of the bounding boxes proposed by Faster R-CNN. We label a proposed bounding box as 'positive' if its IOU with a gold bounding box is > 0.75, and 'negative' if it is < 0.25. We then train on GQA examples, minimizing both the usual denotation loss, as well as an auxiliary loss for each instance of find and filter, which is binary cross-entropy for the labeled boxes. This loss rewards high probabilities for 'positive' bounding boxes and low probabilities for 'negative' ones.
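A minimal sketch of this auxiliary objective, under our reading of the thresholds above (the function name and tensor layout are ours):

```python
import torch
import torch.nn.functional as F

def auxiliary_module_loss(module_probs, ious):
    """module_probs: (|B|,) probabilities output by find or filter.
    ious: (|B|,) max IOU of each proposed box with any gold box."""
    positive = ious > 0.75          # boxes labeled 'positive'
    negative = ious < 0.25          # boxes labeled 'negative'
    labeled = positive | negative   # boxes in between are ignored
    if not labeled.any():
        return module_probs.new_zeros(())
    targets = positive[labeled].float()
    return F.binary_cross_entropy(module_probs[labeled], targets)

# total pre-training loss = denotation loss + the auxiliary loss summed
# over every instance of find and filter in the program
```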
Textual reasoning
Prior work (Gupta et al., 2020) proposed heuristic methods to extract supervision for the find-num and find-date modules in DROP. On top of the end-to-end objective, they use an auxiliary objective that encourages these modules to output the "gold" numbers and dates according to the heuristic supervision. They show that supervising intermediate module outputs helps improve model performance. In this work, we evaluate the effect of such supervision on the faithfulness of both the supervised modules, as well as other modules that are trained jointly.

Decontextualized word representations

The goal of decomposing reasoning into multiple steps, each focusing on different parts of the utterance, is at odds with the widespread use of contextualized representations such as BERT or LXMERT. While the utterance attention is meant to capture information only from tokens relevant for the module's reasoning, contextualized token representations carry global information. For example, consider the program filter[red](find[car]) for the phrase red car. Even if find attends only to the token car, its representation might also express the attribute red, so find might learn to find just red cars, rather than all cars, rendering the filter module useless and harming faithfulness. To avoid such contextualization in Visual-NMN, we zero out the representations of tokens that are unattended; the input to the module is thus computed (with LXMERT) from the remaining tokens only.
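A minimal sketch of this masking step, assuming token embeddings are zeroed before being re-encoded (the names are ours, not from the released code):

```python
import torch

def decontextualize(token_embeddings, attended):
    """token_embeddings: (|x|, h) input embeddings; attended: (|x|,)
    with 1.0 for tokens the module attends to, 0.0 otherwise. The
    masked embeddings are then passed through LXMERT, so the module
    input is computed from the attended tokens only."""
    return token_embeddings * attended.unsqueeze(-1)
```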
Experiments

We first introduce the datasets used and the experimental setup for measuring faithfulness, then present results showing how the different design choices affect it. Please see Appendix C for further detail about the experimental setups.

Visual reasoning

We automatically generate gold program annotations for 26,311 training set examples and for 5,772 development set examples from NLVR2. The input to this generation process is the set of crowdsourced question decompositions from the Break dataset (Wolfson et al., 2020). See Appendix C.1 for details. For module-wise faithfulness evaluation, 536 examples from the development set were annotated with the gold output for each module by experts.
| Model | Accuracy | Prec. (↑) | Rec. (↑) | F1 (↑) | find | filter | with-relation | relocate |
| LXMERT | 71.7 | – | – | – | – | – | – | – |
| Upper Bound | – | 1 | 0.84 | 0.89 | 0.89 | 0.92 | 0.95 | 0.75 |
| NMN w/ Layer-count | 71.2 | 0.39 | 0.39 | 0.11 | 0.12 | 0.20 | 0.37 | |
| NMN w/ Sum-count | 68.4 | | | | | | | |
| NMN w/ Graph-count | 69.6 | 0.37 | 0.39 | 0.28 | 0.31 | 0.29 | 0.37 | 0.19 |
| NMN w/ Graph-count + decont. | 67.3 | 0.29 | 0.51 | 0.33 | 0.38 | 0.30 | 0.36 | 0.13 |
| NMN w/ Graph-count + pretraining | 69.6 | 0.44 | 0.49 | 0.36 | 0.39 | 0.34 | 0.42 | 0.21 |
| NMN w/ Graph-count + decont. + pretraining | 68.7 | 0.42 | | 0.47 | | | | |

Table 1: Faithfulness and accuracy on NLVR2. "decont." refers to decontextualized word representations. Prec./Rec./F1 report overall faithfulness (↑); the module columns report module-wise faithfulness F1 (↑). Precision, recall, and F1 are averages across examples, and thus F1 is not the harmonic mean of the corresponding precision and recall.

| Model | Performance (F1) | Overall (cross-entropy*, ↓) | find | filter | relocate | min-max† | find-arg† |
| Text-NMN w/o prog-sup, w/ extract-answer | | | | | | | |
| Text-NMN w/o prog-sup, w/o extract-answer | | | | | | | |
| Text-NMN w/ prog-sup, no auxiliary sup | 65.3 | 11.2 | 13.7 | 16.9 | 1.5 | 2.2 | 13.0 |
| Text-NMN w/ prog-sup, w/o sorting & comparison | 63.8 | 8.4 | 9.6 | 11.1 | 1.6 | 1.3 | 10.6 |
| Text-NMN w/ prog-sup, w/ module-output-sup | | | | | | | |

Table 2: Faithfulness and performance scores for various NMNs on DROP. * lower is better. † min-max is the average faithfulness of find-min-num and find-max-num; find-arg that of find-num and find-date.

Textual reasoning
We train Text-NMN on DROP, which is augmented with program supervision for 4,000 training questions collected heuristically as described in Gupta et al. (2020). The model is evaluated on the complete development set of DROP, which does not contain any program supervision. Module-wise faithfulness is measured on 215 manually-labeled questions from the development set, which are annotated with gold programs and module outputs (passage spans).
Visual reasoning

Results are seen in Table 1. Accuracy for LXMERT, when trained and evaluated on the same subset of data, is 71.7%; slightly higher than NMNs, but without providing evidence for the compositional structure of the problem.

For faithfulness, we measure an upper bound on the faithfulness score. Recall that this score measures the similarity between module outputs and annotated outputs. Since module outputs are constrained by the bounding boxes proposed by Faster R-CNN (§2), the row Upper Bound shows the maximal faithfulness score conditioned on the proposed bounding boxes.

We now compare the performance and faithfulness scores of the different components. When training our NMN with the most flexible count module (NMN w/ Layer-count), an accuracy of 71.2% is achieved, a slight drop compared to LXMERT but with low faithfulness scores. Using Sum-count drops about 3% of performance, but increases faithfulness. Using Graph-count increases accuracy while faithfulness remains similar.

Next, we analyze the effect of decontextualized word representations (abbreviated "decont.") and pre-training. First, we observe that NMN w/ Graph-count + decont. increases the faithfulness score to 0.33 F1 at the expense of accuracy, which drops to 67.3%. Pre-training (NMN w/ Graph-count + pretraining) achieves higher faithfulness scores with a higher accuracy of 69.6%. Combining the two achieves the best faithfulness (0.47 F1) with a minimal accuracy drop. We perform a paired permutation test to compare NMN w/ Graph-count + decont. + pretraining with NMN w/ Layer-count and find that the difference in F1 is statistically significant (p < 0.001).

Textual reasoning

As seen in Table 2, when trained on DROP using question-program supervision, the model achieves 65.3 F1 performance and a faithfulness score of 11.2. When adding supervision for intermediate modules (§4.2), faithfulness improves considerably (see Table 2, w/ module-output-sup). Similar to Visual-NMN, this shows that supervising intermediate modules in a program leads to better faithfulness.

To analyze how the choice of modules affects faithfulness, we train without sorting and comparison modules (find-max-num, num-compare, etc.). We find that while performance drops slightly, faithfulness deteriorates significantly to 8.4, showing that modules that perform atomic reasoning are crucial for faithfulness. When trained without program supervision, removing extract-answer improves faithfulness but at the cost of performance. This shows that such a black-box module encourages reasoning in an opaque manner, but can improve performance by overcoming the limitations of pre-defined modules. All improvements in faithfulness are significant as measured using paired permutation tests.
Generalization

A natural question is whether models that are more faithful also generalize better. We conducted a few experiments to see whether this is true for our models. For NLVR2, we performed (1) an experiment in which programs in training have length at most 7, and programs at test time have length greater than 7, (2) an experiment in which programs in training have at most 1 filter module and programs at test time have at least 2 filter modules, and (3) an experiment in which programs in training do not have both filter and with-relation modules in the same program, while each program in test has both modules. We compared three of our models – NMN w/ Layer-count, NMN w/ Sum-count, and NMN w/ Graph-count + decont. + pretraining. We did not observe that faithful models generalize better (in fact, the most unfaithful model tended to achieve the best generalization).

To measure if faithful model behavior leads to better generalization in Text-NMN, we conducted the following experiment. We selected the subset of data for which we have gold programs and split the data such that questions that require maximum and greater-than operations are present in the training data, while questions that require computing minimum and less-than are in the test data. We train and test our model by providing gold programs under two conditions: in the presence and absence of additional module supervision. We find that providing auxiliary module supervision (which leads to better module faithfulness; see above) also greatly helps in model generalization.

Qualitative analysis

We analyze outputs of different modules in Figure 3. Figures 3a and 3b show the output of find[llamas] when trained with contextualized and decontextualized word representations. With contextualized representations (3a), find fails to select any of the llamas, presumably because it can observe the word eating, thus effectively searching for eating llamas, which are not in the image. Conversely, the decontextualized model correctly selects the boxes. Figure 3c shows that find outputs meaningless probabilities for most of the bounding boxes when trained with Layer-count, yet the count module produces the correct value (three). Figure 3d shows that find fails to predict all relevant spans when trained without sorting modules in Text-NMN.
Error analysis
We analyze cases where outputs were unfaithful. First, for visual reasoning, we notice that faithfulness scores are lower for long-tail objects. For example, for dogs, a frequent noun in NLVR2, the execution of find[dogs] yields an average faithfulness score of 0.71, while items such as roll of toilet paper, barbell and safety pin receive lower scores (0.22, 0.29 and 0.05 respectively; an example of a failure case for safety pin is in Fig. 3e). In addition, some objects are harder to annotate with a box (water, grass, ground) and therefore receive low scores. The issue of small objects can also explain the low scores of relocate. In the gold box annotations used for evaluation, the average areas for find, filter, with-relation, and relocate (as a fraction of the total image area) are 0.19, 0.19, 0.15, and 0.07, respectively. Evidently, relocate is executed with small objects that are harder to annotate (tongue, spots, top of), and indeed the upper-bound and model scores for relocate are lowest among the module types.

Figure 3: Comparison of module outputs between NMN versions: (a) Visual-NMN with contextualized representations, (b) Visual-NMN with decontextualized representations, (c) model using a parameter-rich count layer (Layer-count), (d) Text-NMN trained without sorting modules, producing an incorrect find output, and (e) a Visual-NMN failure case with a rare object (outputs of NMN w/ Graph-count + decont. + pretraining).

Related Work

NMNs were originally introduced for visual question answering and applied to datasets with synthetic language and images, as well as VQA (Antol et al., 2015), whose questions require few reasoning steps (Andreas et al., 2016; Hu et al., 2017; Yang et al., 2018b). In such prior work, module-wise faithfulness was mostly assessed via qualitative analysis of a few examples (Jiang and Bansal, 2019; Gupta et al., 2020). Yang et al. (2018b) did an evaluation where humans rated the clarity of the reasoning process and also tested whether humans could detect model failures based on module outputs. In contrast, we quantitatively measure each module's predicted output against the annotated gold outputs.

A related systematic evaluation of interpretability in VQA was conducted by Trott et al. (2018). They evaluated the interpretability of their VQA counting model, where the interpretability score is given by the semantic similarity between the gold label for a bounding box and the relevant word(s) in the question. However, they studied only counting questions, which were also far less compositional than those in NLVR2 and DROP.

Similar to the gold module output annotations that we provide and evaluate against, the HotpotQA (Yang et al., 2018a) and CoQA (Reddy et al., 2019) datasets include supporting facts or rationales for the answers to their questions, which can be used for both supervision and evaluation.

In concurrent work, Jacovi and Goldberg (2020) recommend studying faithfulness on a scale rather than as a binary concept. Our evaluation method can be viewed as one example of this approach.
Conclusion

We introduce the concept of module-wise faithfulness, a systematic evaluation of faithfulness in neural module networks (NMNs) for visual and textual reasoning. We show that naïve training of NMNs does not produce faithful modules and propose several techniques to improve module-wise faithfulness in NMNs. We show how our approach leads to much higher module-wise faithfulness at a low cost to performance. We encourage future work to judge model interpretability using the proposed evaluation and publicly published annotations, and to explore techniques for improving faithfulness and interpretability in compositional models.
Acknowledgements
We thank members of UCI NLP, TAU NLP, and the AllenNLP teams, as well as Daniel Khashabi, for comments on earlier drafts of this paper. We also thank the anonymous reviewers for their comments. This research was partially supported by The Yandex Initiative for Machine Learning, the European Research Council (ERC) under the European Union Horizons 2020 research and innovation programme (grant ERC DELPHI 802800), funding by the ONR under Contract No. N00014-19-1-2620, and by sponsorship from the LwLL DARPA program under Contract No. FA8750-19-2-0201. This work was completed in partial fulfillment for the Ph.D. degree of Ben Bogin.

References
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. In Proceedings of NAACL-HLT, pages 1545–1554.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In International Conference on Computer Vision (ICCV).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378.

Nitish Gupta, Kevin Lin, Dan Roth, Sameer Singh, and Matt Gardner. 2020. Neural module networks for reasoning over text. In International Conference on Learning Representations (ICLR).

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019. A multi-type multi-span network for reading comprehension that requires discrete reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1596–1606, Hong Kong, China. Association for Computational Linguistics.

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 804–813.

Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6700–6709.

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 2020 Conference of the Association for Computational Linguistics.

Yichen Jiang and Mohit Bansal. 2019. Self-assembling modular networks for interpretable multi-hop reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4473–4483, Hong Kong, China. Association for Computational Linguistics.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2016. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. Pages 1988–1997.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526.

Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations.

Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4249–4257, Florence, Italy. Association for Computational Linguistics.

E. W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 91–99, Cambridge, MA, USA. MIT Press.

Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. In IJCAI.

Howard Seltman. 2018. Approximations for mean and variance of a ratio.

Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. 2017. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 217–223.

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of NAACL-HLT, pages 641–651.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China. Association for Computational Linguistics.

Alexander Trott, Caiming Xiong, and Richard Socher. 2018. Interpretable counting for visual question answering. In International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Dan Ventura. 2007. CS478 Paired Permutation Test Overview. Accessed April 29, 2020.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 8:183–198.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018a. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018b. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 947–953. Association for Computational Linguistics.

Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. 2018. Learning to count objects in natural images for visual question answering. In International Conference on Learning Representations.

A Modules
We list all modules for Visual-NMN in Table 3. For Text-NMN, as mentioned, we use all modules as described in Gupta et al. (2020). In this work, we introduce (a) the addition and subtraction modules, which take as input two distributions over numbers mentioned in the passage and produce a distribution over all possible addition and subtraction values; the output distribution here is the distribution of the random variable Z = X + Y (for addition); and (b) extract-answer, which produces two distributions over the passage tokens denoting the probabilities for the start and end of the answer span. These distributions are computed by mapping the passage token representations through a simple MLP and softmax operation.
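Minimal sketches of these two additions follow; both are our illustrations under the stated assumptions (integer value bins for addition, BERT-sized token representations for extract-answer), not the released implementations.

```python
import numpy as np
import torch
import torch.nn as nn

def addition_module(p_x, p_y):
    """Distribution of Z = X + Y for independent categorical X and Y
    over integer value bins: the convolution of the two distributions.
    Subtraction is analogous, with one distribution reversed."""
    return np.convolve(p_x, p_y)

class ExtractAnswer(nn.Module):
    """Predicts start/end distributions over passage tokens with a
    simple MLP, bypassing compositional reasoning."""
    def __init__(self, h):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(h, h), nn.GELU(), nn.Linear(h, 2))

    def forward(self, token_reps):           # (seq_len, h)
        logits = self.mlp(token_reps)        # (seq_len, 2)
        return torch.softmax(logits, dim=0)  # start/end distributions
```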
B Measuring Faithfulness in Visual-NMN

B.1 Numerators of Precision and Recall
As stated in Section 3.1, for a given module type and a given example, precision is defined as the number of matched proposed bounding boxes divided by the number of proposed bounding boxes to which the module assigns a probability more than 0.5. Recall is defined as the number of matched annotated bounding boxes divided by the number of annotated bounding boxes. Therefore, the numerators of the precision and the recall need not be equal. In short, the reason for the discrepancy is that there is no one-to-one alignment between annotated and proposed bounding boxes. To further illustrate why we chose not to have a common numerator, we will consider two sensible choices for this shared numerator and explain the issues with them.

One choice for the common numerator is the number of matched proposed bounding boxes. If we were to keep the denominator of the recall the same, then the recall would be defined as the number of matched proposed bounding boxes divided by the number of annotated bounding boxes. Consider an example in which there is a single annotated bounding box that is aligned with five proposed bounding boxes. When this definition of recall is applied to this example, the numerator would exceed the denominator. Another choice would be to set the denominator to be the number of proposed bounding boxes that are aligned with some annotated bounding box. In the example, this approach would penalize a module that gives high probability to only one of the five aligned proposed bounding boxes. However, it is not clear that a module giving high probability to all five proposed boxes is more faithful than a module giving high probability to only one bounding box (e.g., perhaps one proposed box has a much higher IOU with the annotated box than the other proposed boxes). Hence, this choice for the numerator does not make sense.

Another choice for the common numerator is the number of matched annotated bounding boxes. If we were to keep the denominator of the precision the same, then the precision would be defined as the number of matched annotated bounding boxes divided by the number of proposed bounding boxes to which the module assigns probability more than 0.5. Note that since a single proposed bounding box can align with multiple annotated bounding boxes, it is possible for the numerator to exceed the denominator.

Thus, these two choices for a common numerator have issues, and we avoid these issues by defining the numerators of precision and recall separately.
B.2 Averaging Faithfulness Scores
The method described in Section 3.1 computes a precision, recall, and F1 score for each example for every module type occurring in that example. The faithfulness scores reported in Table 1 are averages across examples. We also considered two other ways of aggregating scores across examples:

1. Cumulative P/R/F1: For each module type, we compute a single cumulative precision and recall across all examples. We then compute the dataset-wide F1 score as the harmonic mean of the precision and the recall. The results using this method are in Table 4. There are some differences between these results and those in Table 1, e.g., in these results, NMN w/ Graph-count + decont. + pretraining has the highest faithfulness score for every module type, including relocate.

2. Average over module occurrences: For each module type, for each occurrence of the module we compute a precision and recall, and compute F1 as the harmonic mean of precision and recall. Then for each module type, we compute the overall precision as the average precision across module occurrences, and similarly compute the overall recall and F1. Note that a module can occur multiple times in a single program and that each image is considered a separate occurrence. The results using this method are in Table 5. Again, there are some differences between these results and those in Table 1, e.g., NMN w/ Sum-count has a slightly higher score for with-relation than NMN w/ Graph-count + decont. + pretraining.

With both of these alternative score aggregation methods, we still obtained p < 0.001 in our significance tests.

We also noticed qualitatively that the metric can penalize modules that assign high probability to proposed bounding boxes that have a relatively high IOU that does not quite pass the IOU threshold of 0.5. In such cases, while it may not make sense to give the model credit in its recall score, it also may not make sense to penalize the model in its precision score. Consequently, we also performed an evaluation in which, for the precision calculation, we set a separate "negative" IOU threshold of 10⁻⁸ (effectively 0) and only penalized modules for high probabilities assigned to proposed boxes whose IOU is below this threshold. The results computed with example-wise averaging are provided in Table 6.
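The sketch below contrasts per-example averaging (Table 1) with cumulative aggregation (Table 4); the per-example count tuples are our own representation, not the released code.

```python
import numpy as np

def average_f1(per_example):
    """Per-example P/R/F1, then averaged across examples.
    per_example: list of (matched_proposed, predicted, matched_annotated,
    annotated) counts, one tuple per example."""
    f1s = []
    for m_p, n_p, m_a, n_a in per_example:
        p = m_p / n_p if n_p else 0.0
        r = m_a / n_a if n_a else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(f1s))

def cumulative_f1(per_example):
    """Counts pooled across the whole dataset before computing P/R/F1."""
    m_p, n_p, m_a, n_a = np.sum(per_example, axis=0)
    p = m_p / n_p if n_p else 0.0
    r = m_a / n_a if n_a else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```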
C Details about Experiments

Visual Reasoning
We use the published pre-trained weights and the same training configuration of LXMERT (Tan and Bansal, 2019), with 36 bounding boxes proposed per image. Due to memory constraints, we restrict training data to examples having a gold program with at most 13 modules.
C.1 Program Annotations
We generated program annotations for NLVR2 by automatically canonicalizing its question decompositions in the Break dataset (Wolfson et al., 2020). Decompositions were originally annotated by Amazon Mechanical Turk workers. For each utterance, the workers were asked to produce the correct decomposition and an utterance attention for each operator (module), whenever relevant.
Limitations of Program Annotations
Though our annotations for gold programs in NLVR2 are largely correct, we find that there are some examples for which the programs are unnecessarily complicated. For instance, for the sentence "the right image contains a brown dog with its tongue extended," the gold program is shown in Figure 4. This program could be simplified by replacing the with-relation with the second argument of with-relation. Programs like this make learning more difficult for the NMNs, since they use modules (in this case, with-relation) in degenerate ways. There are also several sentences that are beyond the scope of our language, e.g., comparisons such as "the right image shows exactly two virtually identical trifle desserts."

Figure 4: An example of a gold program for NLVR2 that is unnecessarily complicated.

D Significance tests
D.1 Visual Reasoning
We perform a paired permutation test to test the hypothesis H_0: NMN w/ Graph-count + decont. + pretraining has the same inherent faithfulness as NMN w/ Layer-count. We follow the procedure described by Ventura (2007), which is similar to tests described by Yeh (2000) and Noreen (1989). Specifically, we perform N_total trials in which we do the following. For every example, with probability 1/2 we swap the F1 scores obtained by the two models for that example. Then we check whether the difference in the aggregated F1 scores for the two models is at least as extreme as the original difference in the aggregated F1 scores of the two models. The p-value is given by N_exceed / N_total, where N_exceed is the number of trials in which the new difference is at least as extreme as the original difference.
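A minimal sketch of this procedure (our implementation of the description above, not the authors' script):

```python
import numpy as np

def paired_permutation_test(a, b, n_trials=1000, seed=0):
    """a, b: per-example F1 scores of the two models (aggregated here
    by their mean). Returns the p-value N_exceed / N_total."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    exceed = 0
    for _ in range(n_trials):
        swap = rng.random(len(a)) < 0.5       # swap each pair w.p. 1/2
        a_perm = np.where(swap, b, a)
        b_perm = np.where(swap, a, b)
        if abs(a_perm.mean() - b_perm.mean()) >= observed:
            exceed += 1
    return exceed / n_trials
```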
| Module | Output | Implementation |
| find[q_att] | p | W_1^T([x; v]) + b |
| filter[q_att](p) | p | p ⊙ (W_2^T([x; v]) + b) |
| with-relation[q_att](p_1, p_2) | p | max(p_2) · p_1 ⊙ MLP([x; v_1; v_2]) |
| project[q_att](p) | p | max(p) · find(q_att) ⊙ MLP([x; v_1; v_2]) |
| count(p) | N | number(∑(p), σ) |
| exist(p) | B | greater-equal(count(p), 1) |
| greater-equal(a: N, b: N) | B | greater(a, b) + equal(a, b) |
| less-equal(a: N, b: N) | B | less(a, b) + equal(a, b) |
| equal(a: N, b: N) | B | ∑_{k=0}^{K} Pr[a = k] Pr[b = k] |
| less(a: N, b: N) | B | ∑_{k=0}^{K} Pr[a = k] Pr[b > k] |
| greater(a: N, b: N) | B | ∑_{k=0}^{K} Pr[a = k] Pr[b < k] |
| and(a: B, b: B) | B | a · b |
| or(a: B, b: B) | B | a + b − a · b |
| number(m: F, v: F) | N | Normal(mean = m, var = v) |
| sum(a: N, b: N) | N | number(a_mean + b_mean, a_var + b_var) |
| difference(a: N, b: N) | N | number(a_mean − b_mean, a_var + b_var) |
| division(a: N, b: N) | N | number(a_mean/b_mean + b_var a_mean/b_mean^3, (a_mean^2/b_mean^2)(a_var/a_mean^2 + b_var/b_mean^2)) |
| intersect(p_1, p_2) | p | p_1 · p_2 |
| discard(p_1, p_2) | p | max(p_1 − p_2, 0) |
| in-left-image(p) | p | p s.t. probabilities for right image are 0 |
| in-right-image(p) | p | p s.t. probabilities for left image are 0 |
| in-at-least-one-image | B | macro (see caption) |
| in-each-image | B | macro (see caption) |
| in-one-other-image | B | macro (see caption) |

Table 3: Implementations of modules for the NLVR2 NMN. The first five contain parameters; the rest are deterministic. The implementation of count shown here is the Sum-count version; please see Section 4 for a description of other count module varieties and a discussion of their differences. 'B' denotes the Boolean type, which is a probability value in [0, 1]. 'N' denotes the Number type, which is a probability distribution. K = 72 is the maximum count value supported by our model. To obtain probabilities, we first convert each Normal random variable X to a categorical distribution over {0, 1, ..., K} by setting Pr[X = k] = Φ(k + 0.5) − Φ(k − 0.5) if k ∈ {1, ..., K − 1}, Pr[X = 0] = Φ(0.5), and Pr[X = K] = 1 − Φ(K − 0.5), where Φ(·) denotes the cumulative distribution function of the Normal distribution. W_1 and W_2 are weight vectors with shapes 2h × 1 and h × 1, respectively. Here h = 768 is the size of LXMERT's representations. b is a scalar weight. MLP denotes a two-layer neural network with a GeLU activation (Hendrycks and Gimpel, 2016) between layers. x denotes a question representation, and v_i denotes encodings of objects in the image. x and v_i have shape h × |B|, where |B| is the number of proposals. p denotes a vector of probabilities for each proposal and has shape 1 × |B|. ⊙ and [;] represent elementwise multiplication and matrix concatenation, respectively. The expressions for the mean and variance in the division module are based on the approximations in Seltman (2018). The macros execute a given program on the two input images: in-at-least-one-image returns true iff the program returns true when executed on at least one of the images; in-each-image returns true iff the program returns true when executed on both of the images; in-one-other-image takes two programs and returns true iff one program returns true on the left image and the second program returns true on the right image, or vice versa.

| Model | Accuracy | Prec. (↑) | Rec. (↑) | F1 (↑) | find | filter | with-relation | relocate |
| LXMERT | 71.7 | – | – | – | – | – | – | – |
| Upper Bound | – | 1 | 0.63 | 0.77 | 0.78 | 0.79 | 0.73 | 0.71 |
| NMN w/ Layer-count | 71.2 | 0.069 | 0.29 | 0.11 | 0.13 | 0.09 | 0.07 | 0.05 |
| NMN w/ Sum-count | 68.4 | 0.25 | 0.18 | 0.21 | 0.23 | 0.20 | 0.16 | 0.05 |
| NMN w/ Graph-count | 69.6 | 0.20 | 0.22 | 0.21 | 0.24 | 0.19 | 0.17 | 0.04 |
| NMN w/ Graph-count + decont. | 67.3 | 0.21 | 0.29 | 0.24 | 0.28 | 0.22 | 0.19 | 0.04 |
| NMN w/ Graph-count + pretraining | 69.6 | 0.28 | 0.31 | 0.30 | 0.34 | 0.27 | 0.25 | 0.09 |
| NMN w/ Graph-count + decont. + pretraining | 68.7 | | | | | | | |

Table 4: Faithfulness scores on NLVR2 using the cumulative precision/recall/F1 evaluation.
| Model | Accuracy | Prec. (↑) | Rec. (↑) | F1 (↑) | find | filter | with-relation | relocate |
| LXMERT | 71.7 | – | – | – | – | – | – | – |
| Upper Bound | – | 1 | 0.91 | 0.92 | 0.90 | 0.95 | 0.96 | 0.82 |
| NMN w/ Layer-count | 71.2 | 0.67 | 0.64 | 0.39 | 0.21 | 0.50 | 0.61 | |
| NMN w/ Sum-count | 68.4 | | | | | | | |
| NMN w/ Graph-count | 69.6 | 0.55 | 0.64 | 0.43 | 0.36 | 0.47 | 0.54 | 0.41 |
| NMN w/ Graph-count + decont. | 67.3 | 0.47 | 0.70 | 0.45 | 0.42 | 0.47 | 0.55 | 0.33 |
| NMN w/ Graph-count + pretraining | 69.6 | 0.58 | 0.70 | 0.47 | 0.42 | 0.49 | 0.58 | 0.41 |
| NMN w/ Graph-count + decont. + pretraining | 68.7 | 0.58 | | | | | | |

Table 5: Faithfulness scores on NLVR2 using the average over module occurrences evaluation.

| Model | Accuracy | Prec. (↑) | Rec. (↑) | F1 (↑) | find | filter | with-relation | relocate |
| LXMERT | 71.7 | – | – | – | – | – | – | – |
| Upper Bound | – | 1 | 0.8377 | 0.89 | 0.89 | 0.92 | 0.95 | 0.75 |
| NMN w/ Layer-count | 71.2 | 0.59 | 0.39 | 0.25 | 0.31 | 0.28 | 0.45 | 0.30 |
| NMN w/ Sum-count | 68.4 | | | | | | | |
| NMN w/ Graph-count | 69.6 | 0.68 | 0.39 | 0.38 | 0.43 | 0.36 | 0.44 | 0.22 |
| NMN w/ Graph-count + decont. | 67.3 | 0.62 | 0.51 | 0.47 | 0.53 | 0.39 | 0.43 | 0.16 |
| NMN w/ Graph-count + pretraining | 69.6 | 0.70 | 0.49 | 0.47 | 0.52 | 0.41 | 0.51 | 0.27 |
| NMN w/ Graph-count + decont. + pretraining | 68.7 | 0.71 | | | | | | |

Table 6: Faithfulness scores on NLVR2 using a negative IOU threshold of 10⁻⁸.