Graph-based Heuristic Search for Module Selection Procedure in Neural Module Network
Yuxuan Wu and Hideki Nakayama
The University of Tokyo
{wuyuxuan,nakayama}@nlab.ci.i.u-tokyo.ac.jp

Abstract.
Neural Module Network (NMN) is a machine learning model for solving visual question answering tasks. NMN uses programs to encode modules' structures, and its modularized architecture enables it to solve logical problems more reasonably. However, because of the non-differentiable procedure of module selection, NMN is hard to train end-to-end. To overcome this problem, existing work either included ground-truth programs in the training data or applied reinforcement learning to explore the program. However, both of these methods still have weaknesses. In consideration of this, we propose a new learning framework for NMN. Graph-based Heuristic Search is the algorithm we propose to discover the optimal program through a heuristic search on the data structure named Program Graph. Our experiments on the FigureQA and CLEVR datasets show that our method can realize the training of NMN without ground-truth programs and achieves superior efficiency over existing reinforcement learning methods in program exploration.
With the development of machine learning in recent years, more and more tasks have been accomplished, such as image classification, object detection, and machine translation. However, there are still many tasks that human beings perform much better than machine learning systems, especially those in need of logical reasoning ability. Neural Module Network (NMN) is a recently proposed model targeted at these reasoning tasks [1,2]. It first predicts a program indicating the required modules and their layout, and then constructs a complete network with these modules to accomplish the reasoning. With the ability to break down complicated tasks into basic logical units and to reuse previous knowledge, NMN achieved super-human performance on challenging visual reasoning tasks like CLEVR [3]. However, because module selection is a discrete and non-differentiable process, it is not easy to train NMN end-to-end.

To deal with this problem, a general solution is to separate the training into two parts: the program predictor and the modules. In this case, the program becomes a necessary intermediate label. The two common ways to provide this program label are either to include ground-truth programs in the training data or to apply reinforcement learning to explore the optimal candidate program.

Fig. 1. Our learning framework enables the NMN to solve the visual reasoning problem without ground-truth program annotation.

However, these two solutions still have the following limitations. The dependency on ground-truth program annotation makes it hard to extend NMN's application to datasets without this kind of annotation. This annotation is also highly expensive when hand-made by humans. Therefore, program annotation cannot always be expected to be available for tasks in real-world environments. In view of this, methods relying on ground-truth program annotation cannot be considered complete solutions for training NMN. On the other hand, the main problem with the approaches based on reinforcement learning is that, with the growth of the length of programs and the number of modules, the search space of possible programs becomes so huge that a reasonable program may not be found in an acceptable time.

In consideration of this, we still regard the training of NMN as an open problem. With the motivation to take advantage of NMN on broader tasks while overcoming the difficulty of its training, in this work we propose a new learning framework to solve the non-differentiable module selection problem in NMN.

In this learning framework, we put forward the Graph-based Heuristic Search algorithm to enable the model to find the most appropriate program by itself. Basically, this algorithm is inspired by Monte Carlo Tree Search (MCTS). Similar to MCTS, our algorithm conducts a heuristic search to discover the most appropriate program in the space of possible programs. Besides, inspired by the intrinsic connection between programs, we propose the data structure named Program Graph to represent the space of possible programs in a way more reasonable than the tree structure used by MCTS.
Further, to deal with cases where the search space is extremely huge, we propose the Candidate Selection Mechanism to narrow down the search space.

With these proposed methods, our learning framework implements the training of NMN regardless of the non-differentiable module selection procedure. Compared to existing work, our proposed learning framework has the following notable characteristics:

– It can implement the training of NMN with only the triplets of {question, image, answer} and without ground-truth program annotation.
– It can explore larger search spaces more reasonably and efficiently.
– It can work on both trainable modules with neural architectures and non-trainable modules with discrete processing.
Generally, Visual Reasoning can be considered a kind of Visual Question Answering (VQA) [4]. Besides the requirement of understanding information from both images and questions in common VQA problems, Visual Reasoning further asks for the capacity to recognize abstract concepts such as spatial, mathematical, and logical relationships. CLEVR [5] is one of the most famous and widely used datasets for Visual Reasoning. It provides not only the triplets of {question, image, answer} but also the functional programs paired with each question. FigureQA [6] is another Visual Reasoning dataset we focus on in this work. It provides questions in fifteen different templates asked on five different types of figures.

To solve Visual Reasoning problems, a naive approach would be the combination of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). Here, the CNN and RNN are responsible for extracting information from images and questions, respectively. Then, the extracted information is combined and fed to a decoder to obtain the final answer. However, this methodology of treating Visual Reasoning simply as a classification problem sometimes cannot achieve desirable performance due to the difficulty of learning abstract concepts and relations between objects [4,6,3]. Instead, more recent work applied models based on NMN to solve Visual Reasoning problems [3,7,8,9,10,11,12].

Neural Module Network (NMN) is a machine learning model proposed in 2016 [1,2]. Generally, the overall architecture of NMN can be considered as a controller and a set of modules. Given the question and the image, the controller of NMN first takes the question as input and outputs a program indicating the required modules and their layout. Then, the specified modules are concatenated with each other to construct a complete network. Finally, the image is fed to the assembled network and the answer is acquired from the root module.
As far as we are concerned, the advantage of NMN can be attributed to its ability to break down complicated questions into basic logical units and its ability to reuse previous knowledge efficiently.

By the architecture of their modules, NMNs can further be categorized into three subclasses: feature-based, attention-based, and object-based NMNs.
For feature-based NMNs, the modules apply CNNs and their calculations are conducted directly on feature maps. Feature-based NMNs are the most concise implementation of NMN and were utilized most in early work [3].

For attention-based NMNs, the modules also apply neural networks, but their calculations are conducted on attention maps. Compared to feature-based NMNs, attention-based NMNs better retain the original information within images, so they achieved higher reasoning precision and accuracy [1,2,7,9].

Object-based NMNs regard the information in an image as a set of discrete representations of objects instead of a continuous feature map. Correspondingly, their modules conduct pre-defined discrete calculations. Compared to feature-based and attention-based NMNs, object-based NMNs achieved the highest precision in reasoning [10,11]. However, their discrete design usually requires more prior knowledge and pre-defined attributes on objects.
Monte Carlo Method is the general name of a group of algorithms that make use of random sampling to get an approximate estimation for a numerical computation [13]. These methods are broadly applied to tasks that are impossible or too time-consuming to solve exactly through deterministic algorithms. Monte Carlo Tree Search (MCTS) is an algorithm that applies the Monte Carlo Method to decision making in game playing, such as computer Go [14,15]. Generally, this algorithm arranges the possible state space of a game into a tree structure, and then applies Monte Carlo estimation to determine the action to take at each round of the game. In recent years, there have also appeared approaches that establish collaborations between Deep Learning and MCTS. These works, represented by AlphaGo, have beaten top-level human players at Go, which is considered one of the most challenging games for computer programs [16,17].
The general architecture of our learning framework is shown in Fig. 2. As stated above, the training of the whole model can be divided into two parts: a. the Program Predictor and b. the modules. The main difficulty of training comes from the side of the Program Predictor because of the lack of expected programs as training labels. To overcome this difficulty, we propose the algorithm named Graph-based Heuristic Search to enable the model to find the optimal program by itself through a heuristic search on the data structure Program Graph. After this searching process, the most appropriate program that was found is utilized as the program label so that the Program Predictor can be trained in a supervised manner. In other words, this searching process can be considered a procedure targeted to provide training labels for the Program Predictor.

The abstract of the total training workflow is presented as Algorithm 1. Note that here q denotes the question, p denotes the program, {module} denotes the set of modules available in the current task, {img} denotes the set of images that the question is asking on, and {ans} denotes the set of answers paired with the images. Details about the Sample function are provided in Appendix A.

Fig. 2. Our Graph-based Heuristic Search algorithm assists the learning of the Program Predictor.
Algorithm 1 Total Training Workflow

1: function Train()
2:     Program Predictor, {module} ← Initialize()
3:     for loop in range(Max_loop) do
4:         q, {img}, {ans} ← Sample(Dataset)
5:         p ← Graph-based Heuristic Search(q, {img}, {ans}, {module})
6:         Program Predictor.train(q, p)
7:     end for
8: end function

To start with, we first give a precise definition of the program we use. Note that each of the available modules in the model has a unique name, a fixed number of inputs, and one output. Therefore, a program can be defined as a tree meeting the following rules:

i) Each of the non-leaf nodes stands for a possible module; each of the leaf nodes holds an ⟨END⟩ flag.
ii) The number of children of a node equals the number of inputs of the module that the node represents.

For the convenience of representation in prediction, a program can also be transformed into a sequence of modules together with ⟨END⟩ flags via pre-order tree traversal. Considering that the number of inputs of each module is fixed, the tree form can be rebuilt uniquely from such a sequence.

Then, as for the Program Graph: the Program Graph is the data structure we use to represent the relation between all programs that have been reached throughout the searching process, and it is also the data structure that our algorithm Graph-based Heuristic Search works on. A Program Graph is built meeting the following rules:

i) Each graph node represents a unique program that has been reached.
Fig. 3. Illustration of part of a Program Graph.

ii) There is an edge between two nodes if and only if the edit distance between their programs is one. Here, insertion, deletion, and substitution are the three basic edit operations whose edit distance is defined as one. Note that the edit distance between programs is judged on their tree form.
iii) Each node in the graph maintains a score. This score is initialized as the output probability of the node's program according to the Program Predictor when the node is created, and can be updated when the node's program is executed.

Fig. 3 is an illustration of a Program Graph consisting of several program nodes together with their program trees as examples. To distinguish the nodes in the tree of a program from the nodes in the Program Graph, the former will be referred to as mn for "module node" and the latter as pn for "program node" in the following discussion. Details about the initialization of the Program Graph are provided in Appendix B.
Graph-based Heuristic Search is the core algorithm in our proposed learning framework. Its basic workflow is presented as the Main function in line 1 of Algorithm 2. After the Program Graph g is initialized, the basic workflow can be described as a recurrent exploration of the Program Graph consisting of the following four steps:

i) Collecting all the program nodes in Program Graph g that have not been fully explored yet as the set of candidate nodes {pn}c.
ii) Calculating the Expectation for all the candidate nodes.
iii) Selecting the node with the highest Expectation value among all the candidate nodes.
iv) Expanding on the selected node to generate new program nodes and update the Program Graph.

The details about the calculation of Expectation and the expansion strategy are as follows.

Algorithm 2 Graph-based Heuristic Search

 1: function Main(q, {img}, {ans}, {module})
 2:     g ← InitializeGraph(q)
 3:     for step in range(Max_step) do
 4:         {pn}c ← pn for pn in g if pn.fully_explored == False
 5:         pni.Exp ← FindExpectation(pni, g) for pni in {pn}c
 6:         pne ← pni s.t. pni.Exp = max{pni.Exp for pni in {pn}c}
 7:         Expand(pne, g, {img}, {ans}, {module})
 8:     end for
 9:     pnbest ← pni s.t. pni.score = max{pni.score for pni in g}
10:     return pnbest.program
11: end function
12: function Expand(pne, g, {img}, {ans}, {module})
13:     pne.visit_count ← pne.visit_count + 1
14:     if pne.visited == False then
15:         pne.score ← accuracy(pne.program, {img}, {ans}, {module})
16:         pne.visited ← True
17:     end if
18:     {mn}c ← mn for mn in pne.program if mn.expanded == False
19:     mnm ← Sample({mn}c)
20:     {program}new ← Mutate(pne.program, mnm, {module})
21:     for programi in {program}new do
22:         if LegalityCheck(programi) == True then
23:             g.update(programi)
24:         end if
25:     end for
26:     mnm.expanded ← True
27:     if {mn}c.remove(mnm) == ∅ then pne.fully_explored ← True
28: end function
Expectation is a grade defined on each program node to determine which node should be selected for the following expansion step. This Expectation is calculated through Equation 1:

Exp(pn_i) = Σ_{d=0}^{D} w_d · max{ pn_j.score | pn_j in g, distance(pn_i, pn_j) ≤ d } + α / (pn_i.visit_count + 1)    (1)

Intuitively, this equation measures how desirable a program is to guide the modules to answer a given question reasonably. Here, D, w_d, and α are hyperparameters indicating the maximum distance in consideration, a sequence of weight coefficients for summing the best scores at the different distances d, and the scale coefficient to encourage visiting unexplored nodes, respectively.
In this equation, the first term observes the nearby nodes and finds the highest score at each distance d from 0 to D. These scores are then weighted by w_d and summed up. Note that the distance here is measured on the Program Graph, which also equals the edit distance between the two programs. The second term is a balance term negatively correlated with the number of times a node has been visited and expanded on. This term raises the grades of unexplored or less explored nodes.
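Equation 1 can be sketched under these definitions as follows. The node attributes (`.score`, `.visit_count`, `.neighbors`) are our own naming for the Program Graph fields described above, and the breadth-first search reflects the fact that graph distance equals edit distance between programs.

```python
from collections import deque

def expectation(node, D, w, alpha):
    """Sketch of Equation 1: weighted best scores within graph distance d
    for d = 0..D, plus an exploration bonus decaying with visit count."""
    # BFS up to depth D, recording the best score found at each exact distance.
    best_at = [float("-inf")] * (D + 1)
    seen = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        d = seen[cur]
        best_at[d] = max(best_at[d], cur.score)
        if d < D:
            for nb in cur.neighbors:
                if nb not in seen:
                    seen[nb] = d + 1
                    queue.append(nb)
    # max over distance <= d is the running maximum over exact distances 0..d
    exp, running = 0.0, float("-inf")
    for d in range(D + 1):
        running = max(running, best_at[d])
        exp += w[d] * running
    return exp + alpha / (node.visit_count + 1)
```

For two neighboring nodes with scores 0.2 and 0.8, D = 1, weights (0.5, 0.5), and α = 0, the first node's Expectation is 0.5·0.2 + 0.5·0.8 = 0.5.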
Expansion is another important procedure in our proposed algorithm, as shown in line 12 of Algorithm 2. The main objective of this procedure is to generate new program nodes and update the Program Graph. To realize this, the five main steps are as follows:

i) If the node pne in the Program Graph is visited for the first time, try its program by building the model with the specified modules to answer the question, then update the score of the node with the accuracy. If there are modules with neural architecture, these modules should also be trained here, but the updated parameters are retained only if the new accuracy exceeds the previous one.
ii) Collect the module nodes within the program that have not been expanded on yet, then sample one from them as the module node mnm to expand on.
iii) Mutate the program at module mnm to generate a new set of programs {program}new with three edit operations: insertion, deletion, and substitution.
iv) For each new program judged to be legal, if there is not yet a node representing the same program in the Program Graph g, create a new program node representing this program and add it to g. The related edge should also be added to g if it does not exist yet.
v) If all of the module nodes have been expanded on, mark this program node pne as fully explored.

For the Mutation in step iii), the three edit operations are illustrated by Fig. 4. Here, insertion adds a new module node between the node mnm and its parent node. The new module can be any of the available modules in the model. If the new module has more than one input, mnm should be set as one of its children, and the rest of the children are set to leaf nodes with ⟨END⟩ flags.

Deletion deletes the node mnm and sets its child as the new child of mnm's parent. If mnm has more than one child, only one of them is retained and the others are abandoned.

Substitution replaces the module of mnm with another module. The new module can be any of the modules that have the same number of inputs as mnm.

For insertion and deletion, if there are multiple possible mutations because the related node has more than one child, as shown in Fig. 4, all of them are retained.

These rules ensure that newly generated programs necessarily have legal structures, but there are still cases where these programs are not legal in the sense of semantics, e.g., the output data type of a module does not match the input data type of its parent. A legality check is conducted to determine whether a program is legal and should be added to the Program Graph; more details about this function are provided in Appendix C.
Fig. 4.
Example of the mutations generated by the three operations insertion, deletion, and substitution.
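As one concrete example, the substitution operation on the tree form can be sketched as below: every module node may be swapped for another module of the same arity. The ARITY table is a hypothetical module set, not the paper's; tree nodes are (name, children) tuples.

```python
# Hypothetical module table: name -> number of inputs (arity).
ARITY = {"Count": 1, "Exist": 1, "Filter_shape": 1, "And": 2}

def module_paths(tree, path=()):
    """Pre-order traversal yielding (path, name) for every module node."""
    name, children = tree
    if name != "<END>":
        yield path, name
    for i, child in enumerate(children):
        yield from module_paths(child, path + (i,))

def replace_at(tree, path, new_name):
    """Return a copy of the tree with the module at `path` renamed."""
    name, children = tree
    if not path:
        return (new_name, children)
    i, rest = path[0], path[1:]
    children = tuple(
        replace_at(c, rest, new_name) if j == i else c
        for j, c in enumerate(children)
    )
    return (name, children)

def substitutions(tree):
    """Yield every program reachable by one substitution of equal arity."""
    for path, name in module_paths(tree):
        for alt, arity in ARITY.items():
            if alt != name and arity == ARITY[name]:
                yield replace_at(tree, path, alt)
```

For the one-module program Count(⟨END⟩), this yields exactly the two substitutions Exist(⟨END⟩) and Filter_shape(⟨END⟩); each result would still have to pass the legality check before being added to the Program Graph.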
The learning framework presented above is already a complete framework for the training of the NMN. However, in practice we found that with the growth of the length of programs and the number of modules, the size of the search space explodes exponentially. This brings trouble to the search. To overcome this problem, we further propose the Candidate Selection Mechanism (CSM), an optional component within our learning framework. Generally speaking, if CSM is activated, it selects only a subset of modules from all the available modules. Then, only these selected modules are used in the following Graph-based Heuristic Search. The abstract of the training workflow with CSM is presented as Algorithm 3.

Here, we include another model named the Necessity Predictor in the learning framework. This model takes the question as input and predicts an N_m-dimensional vector as shown in Fig. 5. Here, N_m indicates the total number of modules.

Algorithm 3 Training Workflow with Candidate Selection Mechanism

 1: function Train()
 2:     Program Predictor, Necessity Predictor, {module} ← Initialize()
 3:     for loop in range(Max_loop) do
 4:         q, {img}, {ans} ← Sample(Dataset)
 5:         {module}candidate ← Necessity Predictor(q, {module})
 6:         p ← Graph-based Heuristic Search(q, {img}, {ans}, {module}candidate)
 7:         Necessity Predictor.train(q, p)
 8:         Program Predictor.train(q, p)
 9:     end for
10: end function

Each value in the output vector is a real number in the range [0, 1] indicating the possibility that the corresponding module is necessary for the solution of the given question. N_p and N_r are the two hyperparameters for the candidate module selection procedure. N_p indicates the number of modules that are selected according to the predicted possibility values, i.e., the N_p modules with the top prediction values are selected. N_r indicates the number of modules that are selected randomly besides the N_p ones. Then, the union of these two selections, with N_p + N_r modules, becomes the candidate module set for the following search.

For the training of this Necessity Predictor, the best program found in the search is transformed into an N_m-dimensional boolean vector indicating whether each module appears in the program. Then, this boolean vector is set as the training label so that the Necessity Predictor can be trained in a supervised manner, as the Program Predictor is.
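The candidate selection step itself reduces to a top-N_p pick plus N_r random extras. A sketch follows, where `necessity` stands in for the Necessity Predictor's output (a module-to-probability mapping; the names and function signature are our assumptions).

```python
import random

def select_candidates(necessity, n_p, n_r, rng=random):
    """Pick the n_p modules with the highest predicted necessity, then add
    n_r more sampled uniformly (without replacement) from the rest."""
    ranked = sorted(necessity, key=necessity.get, reverse=True)
    top, rest = ranked[:n_p], ranked[n_p:]
    extra = rng.sample(rest, min(n_r, len(rest)))
    return top + extra
```

With N_p = 15 and N_r = 5, as used in the CLEVR experiment later, this returns 20 candidate modules for the following search.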
Fig. 5.
The process to select the N_p + N_r candidate modules.

Our experiments are conducted on the FigureQA and CLEVR datasets. Their settings and results are presented in the following subsections, respectively.
The main purpose of the experiment on FigureQA is to certify that our learning framework can realize the training of NMN on a dataset without ground-truth program annotations and outperform the existing methods with models other than NMN.

An overview of how our methods work on this dataset is shown in Fig. 6. Considering that the size of the search space of the programs used in FigureQA is relatively small, the CSM introduced in Section 3.4 is not activated.

Generally, the workflow consists of three main parts. Firstly, the techniques of object detection [18] and optical character recognition [19] are applied to transform the raw image into discrete element representations as shown in Fig. 6.a. For this part, we applied Faster R-CNN [20,21] with ResNet-101 as the backbone for object detection and the Tesseract open source OCR engine [22,23] for text recognition. All the images are resized to 256 by 256 pixels before the following calculations.

Secondly, for the part of program prediction shown in Fig. 6.b, we applied our Graph-based Heuristic Search algorithm for the training. The settings of the hyperparameters for this part are shown in Table 1. The type of figure is treated as an additional token appended to the question.

Table 1. Setting of hyperparameters in our experiment

Max_loop   Max_step   D   w_d                      α
100        1000       4   (0.5, 0.25, 0.15, 0.1)   0.05

Thirdly, for the part of modules shown in Fig. 6.c, we designed pre-defined modules with discrete calculations on objects. Their functions correspond to the reasoning abilities required by FigureQA. These pre-defined modules are used together with modules with neural architecture. Details of all these modules are provided in Appendix D.

Table 2 shows the results of our methods compared with baseline and existing methods. "Ours" is the primitive result from the experiment settings presented above. Besides, we also provide the result named "Ours + GE", where "GE" stands for ground-truth elements. In this case, element annotations are obtained directly from the ground-truth plotting annotations provided by FigureQA instead of the object detection results. We applied this experiment setting to measure the influence of the noise in the object detection results.

From the results, it can first be noticed that both our method and our method with GE outperform all the existing methods. In our consideration, the superiority of our method mainly comes from the successful application of NMN. As stated in Section 2.2, NMN has shown outstanding capacity in solving logical
Fig. 6.
An example of the inference process on FigureQA.
Table 2. Comparison of accuracy with previous methods on the FigureQA dataset. Cells whose values were not reported are marked "–".

Method                    Val. Set 1   Val. Set 2   Test Set 1   Test Set 2
Text only [6]             –            –            50.01%       50.01%
CNN+LSTM [6]              –            –            56.16%       56.00%
Relation Network [6,24]   –            –            72.54%       72.40%
Human [6]                 –            –            –            91.21%
FigureNet [25]            84.29%       –            –            –
PTGRN [26]                86.25%       –            86.23%       –
PReFIL [27]               94.84%       93.26%       94.88%       93.16%
Ours                      95.74%       95.55%       –            –
Ours + GE                 –            –            –            –

problems. However, limited by the non-differentiable module selection procedure, the application of NMN can hardly be extended to tasks without ground-truth program annotations like FigureQA. In our work, the proposed learning framework realizes the training of NMN without ground-truth programs, so we succeeded in applying NMN to FigureQA. This observation can also be certified through the comparison between our results and PReFIL.

Compared to PReFIL, considering that we applied a nearly identical 40-layer DenseNet to process the image, the main difference in our model is the application of modules. The modules besides the final Discriminator ensure that the inputs fed to the Discriminator are more closely related to what the question is asking about.

Another interesting fact shown by the results is the difference between the accuracies reached on set 1 and set 2 of both the validation and test sets. Note that in FigureQA, validation set 1 and test set 1 adopt the same color scheme as the training set, while validation set 2 and test set 2 adopt an alternated color scheme. This difference leads to difficulty in generalizing from the training set to the two set 2s. As a result, for PReFIL the accuracy on each set 2 drops by more than 1.5% from the corresponding set 1. For our method with NMN, this decrease is less than 0.4%, which shows the better generalization capacity brought by the successful application of NMN.

Also, Appendix E reports the accuracies achieved on test set 2 by question type and figure type.
It is worth mentioning that our work is the first to exceed human performance on every question type and figure type.

The main purpose of the experiment on CLEVR is to certify that our learning framework can achieve superior searching efficiency compared to the classic reinforcement learning method.

For this experiment, we created a subset of CLEVR containing only the training data whose questions appear at least two times among all training questions.
There are 31,252 different questions together with their corresponding programs in this subset. The reason for applying such a subset is that the size of the whole space of possible programs is approximately a power of 10 so large that no existing method can realize the search in it without any prior knowledge or simplification of programs. Considering that the training of modules is highly time-consuming, we only activate the program prediction part of our learning framework, shown as Fig. 6.b. With this setting, the modules specified by the program are not actually trained. Instead, a boolean value indicating whether the program is correct or not is returned to the model as a substitute for the question answering accuracy. Here, only the programs that are exactly the same as the ground-truth programs paired with the given questions are considered correct.

In this experiment, comparative experiments were made both with and without activating the CSM. The structures of the models used as the Program Predictor and the Necessity Predictor are as follows. For the Program Predictor, we applied a 2-layer bidirectional LSTM with a hidden state size of 256 as the encoder, and a 2-layer LSTM with a hidden state size of 512 as the decoder. The input embedding sizes of both the encoder and decoder are 300. The settings of the hyperparameters are the same as for FigureQA, as shown in Table 1, except that Max_loop is not limited. For the Necessity Predictor, we applied a 4-layer MLP. The input of the MLP is a boolean vector indicating whether each word in the dictionary appears in the question; the output of the MLP is a 39-dimensional vector, for there are 39 modules in CLEVR; and the size of all hidden layers is 256. The hyperparameters N_p and N_r are set to 15 and 5, respectively.
For the sentence embedding model utilized in the initialization of the Program Graph, we applied the GenSen model with pre-trained weights [28,29].

For the baseline, we applied REINFORCE [30], as most of the existing work did [3,12], to train the same Program Predictor model.

The searching processes of our method, our method without CSM, and REINFORCE are shown in Fig. 7. Note that in this figure, the horizontal axis indicates the number of searches, and the vertical axis indicates the number of correct programs found. The experiments on our method and our method without CSM were repeated four times each, and the experiment on REINFORCE was repeated eight times. We also show the average results as the thick solid lines in this figure, indicating the average number of searches used to find specific numbers of correct programs. Although in this subset of CLEVR the numbers of correct programs that can finally be found are quite similar for the three methods, their searching processes show great differences. From this result, three main conclusions can be drawn.

Firstly, in terms of the average case, our method shows significantly higher efficiency in searching for appropriate programs.

Secondly, the searching process of our method is much more stable, while the best and worst cases of REINFORCE differ greatly.

Thirdly, the comparison between the results of our method and our method without CSM certifies the effectiveness of the CSM.

Fig. 7.
Relation between the number of searches and the number of correct programs found within the searching processes of the three methods.
In this work, to overcome the difficulty of training NMN caused by its non-differentiable module selection procedure, we proposed a new learning framework for the training of the NMN. Our main contributions in this framework can be summarized as follows.

Firstly, we proposed the data structure named Program Graph to represent the search space of programs more reasonably.

Secondly, and most importantly, we proposed the Graph-based Heuristic Search algorithm to enable the model to find the most appropriate program by itself, getting rid of the dependency on ground-truth programs in training.

Thirdly, we proposed the Candidate Selection Mechanism to improve the performance of the learning framework when the search space is huge.

The experiment on FigureQA certified that our learning framework can realize the training of NMN on a dataset without ground-truth program annotations and outperform the existing methods with models other than NMN. The experiment on CLEVR certified that our learning framework can achieve superior efficiency in searching for programs compared to the classic reinforcement learning method. In view of this evidence, we conclude that our proposed learning framework is a valid and advanced approach to realize the training of NMN.

Nevertheless, our learning framework still cannot deal with extremely huge search spaces, e.g., the whole space of possible programs in CLEVR. We leave further study on methods that can realize the search in such enormous search spaces as future work.
Acknowledgment
This work was supported by JSPS KAKENHI Grant Number JP19K22861.
References
1. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 39–48
2. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. (2016) 1545–1554
3. Johnson, J., Hariharan, B., van der Maaten, L., Hoffman, J., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: Inferring and executing programs for visual reasoning. In: Proceedings of the IEEE International Conference on Computer Vision. (2017) 2989–2998
4. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 2425–2433
5. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 2901–2910
6. Kahou, S.E., Michalski, V., Atkinson, A., Kádár, Á., Trischler, A., Bengio, Y.: FigureQA: An annotated figure dataset for visual reasoning. In: International Conference on Learning Representations. (2018)
7. Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko, K.: Learning to reason: End-to-end module networks for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. (2017) 804–813
8. Hu, R., Andreas, J., Darrell, T., Saenko, K.: Explainable neural computation via stack neural module networks. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 53–69
9. Mascharka, D., Tran, P., Soklaski, R., Majumdar, A.: Transparency by design: Closing the gap between performance and interpretability in visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 4942–4950
10. Shi, J., Zhang, H., Li, J.: Explainable and explicit visual reasoning over scene graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 8376–8384
11. Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.: Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In: Advances in Neural Information Processing Systems. (2018) 1031–1042
12. Mao, J., Gan, C., Kohli, P., Tenenbaum, J.B., Wu, J.: The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In: International Conference on Learning Representations. (2019)
13. Metropolis, N., Ulam, S.: The Monte Carlo method. Journal of the American Statistical Association (1949) 335–341
14. Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: European Conference on Machine Learning, Springer (2006) 282–293
15. Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: International Conference on Computers and Games, Springer (2006) 72–83
16. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature (2016) 484
17. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of Go without human knowledge. Nature (2017) 354
18. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 580–587
19. Singh, S.: Optical character recognition techniques: A survey. Journal of Emerging Trends in Computing and Information Sciences (2013) 545–550
20. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS). (2015)
21. Yang, J., Lu, J., Batra, D., Parikh, D.: A faster PyTorch implementation of Faster R-CNN. https://github.com/jwyang/faster-rcnn.pytorch (2017)
22. Smith, R.: An overview of the Tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007). Volume 2, IEEE (2007) 629–633
23. Smith, R.: Tesseract open source OCR engine. https://github.com/tesseract-ocr/tesseract (2019)
24. Santoro, A., Raposo, D., Barrett, D.G., Malinowski, M., Pascanu, R., Battaglia, P., Lillicrap, T.: A simple neural network module for relational reasoning. In: Advances in Neural Information Processing Systems. (2017) 4967–4976
25. Reddy, R., Ramesh, R., Deshpande, A., Khapra, M.M.: FigureNet: A deep learning model for question-answering on scientific plots. In: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE (2019) 1–8
26. Cao, Q., Liang, X., Li, B., Lin, L.: Interpretable visual question answering by reasoning on dependency trees. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
27. Kafle, K., Shrestha, R., Price, B., Cohen, S., Kanan, C.: Answering questions about data visualizations using efficient bimodal fusion. arXiv preprint arXiv:1908.01801 (2019)
28. Subramanian, S., Trischler, A., Bengio, Y., Pal, C.J.: Learning general purpose distributed sentence representations via large scale multi-task learning. In: International Conference on Learning Representations. (2018)
29. Subramanian, S., Trischler, A., Bengio, Y., Pal, C.J.: GenSen. https://github.com/Maluuba/gensen (2018)
30. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning (1992) 229–256

Appendix A Training Data Sampling
The basic sampling unit of training data is a triplet (q, {img}, {ans}). Generally, as shown in Fig. 8, we maintain three sets {Unmet}, {Unsolved}, and {Solved} to distinguish training data in different statuses.

Intuitively, {Unmet} contains training data that have not yet been met and used. At the beginning of learning, all the training data are stored in {Unmet}. {Unsolved} contains training data that have been sampled from {Unmet} but on which the final accuracy achieved in the following search did not reach a hyperparameter named Acceptable Boundary. {Solved} contains training data that have been sampled from {Unmet} or {Unsolved} and whose final accuracy reached the Acceptable Boundary.

We denote the numbers of training data triplets in these three sets as N_um, N_us, and N_s, respectively.
Fig. 8.
Data sampling strategy
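In code, this sampling strategy can be sketched as follows. The probabilities implement Equation 2, under the assumption that P_um falls to 0 once {Unmet} is empty; the function names (`sampling_probs`, `choose_source`) are illustrative, not from the paper.

```python
import math
import random

def sampling_probs(n_um, n_us, n_s):
    """P_um and P_us of Equation 2: the larger the backlog of unsolved
    data relative to solved data, the more often {Unsolved} is revisited.
    Once {Unmet} is empty, everything is drawn from {Unsolved}."""
    p_um = math.exp(-n_us / (n_s + 1)) if n_um > 0 else 0.0
    return p_um, 1.0 - p_um

def choose_source(unmet, unsolved, solved, rng=random):
    """Pick which set the next training triplet is sampled from."""
    p_um, _ = sampling_probs(len(unmet), len(unsolved), len(solved))
    return "Unmet" if rng.random() < p_um else "Unsolved"
```

The +1 in the denominator keeps the expression defined when {Solved} is still empty at the start of training.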
For each Sample step in each training loop, the training data can be sampled from either {Unmet} or {Unsolved}, with probabilities P_um and P_us given by Equation 2:

\[
P_{um} = \begin{cases} e^{-\frac{N_{us}}{N_{s}+1}}, & \text{if } N_{um} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{2a}
\]
\[
P_{us} = 1 - P_{um} \tag{2b}
\]

Appendix B Program Graph Initialization
To initialize the Program Graph, at most three initial program nodes are created as the starting points for the following search. Their programs are:

i) The program predicted by the Program Predictor model.
ii) The program found for the question within {Solved} that is closest to the current given question.
iii) The shortest legal program.

Specifically for ii), this term only works when {Solved} is not empty. In that case, a pre-trained sentence embedding model SE(·) is used to judge the semantic distance between questions and to find the question q_c within {Solved} that is semantically closest to the current given question q. This process can be expressed as Equation 3, where SE(·) takes a question sentence as input and outputs a fixed-length vector, and \|SE(q) − SE(q_s)\| is the L2 distance between SE(q) and SE(q_s). The program found for q_c in previous searches then becomes an initial program of the Program Graph.

\[
q_c = \arg\min_{q_s \in \{Solved\}} \left\| SE(q) - SE(q_s) \right\| \tag{3}
\]

Appendix C Legality Check for Programs
As stated in Section 3.3, our rules for generating mutations on programs ensure the legality of structure, but not necessarily the legality of semantics. The illegality of semantics mainly comes from the type system of modules: within NMN, the inputs and outputs passed between modules are restricted to types such as feature map, number, object, or set of objects. The calculation of NMN fails if the intermediate data fed to a module does not match the data type that the module requires.

Generally, there are two solutions to this problem. One is to add the illegal programs to the Program Graph anyway, mark them as non-executable, and skip the step of evaluating their accuracies. However, excessive illegal programs within the Program Graph waste plenty of search steps, so the efficiency of the search drops noticeably. The other solution is to simply refuse to add these illegal programs to the Program Graph. However, the Program Graph may then become disconnected, so some sub-graphs may never be reached from others.

In consideration of this, we applied a compromise between these two solutions. We use a hyperparameter named Tolerance to restrict the maximum number of data type mismatches that can be tolerated. Programs whose count of data type mismatches is not greater than Tolerance are still added to the Program Graph, although they cannot be executed to obtain an accuracy. This setting balances the efficiency and the coverage of the search.
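A sketch of this check, assuming a program is given in execution order as (module, child indices) pairs and each module declares the types it consumes and produces. The signatures used below are illustrative stand-ins, not the paper's full type system.

```python
def count_type_mismatches(program, signatures):
    """Count inputs whose incoming data type does not match the type the
    module requires.  `program` lists (module name, child indices) in
    execution order; `signatures` maps name -> (input types, output type)."""
    mismatches = 0
    out_types = []
    for name, children in program:
        in_types, out_type = signatures[name]
        for expected, child in zip(in_types, children):
            if out_types[child] != expected:
                mismatches += 1
        out_types.append(out_type)
    return mismatches

def admissible(program, signatures, tolerance):
    """A program joins the Program Graph iff its mismatch count
    does not exceed Tolerance (such programs are kept for connectivity
    but are never executed)."""
    return count_type_mismatches(program, signatures) <= tolerance
```

With tolerance = 0 this degenerates into the strict second solution; a larger tolerance trades search efficiency for graph connectivity.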
Appendix D Modules Used in FigureQA Dataset
The modules used in the experiment on the FigureQA dataset are shown in Table 3. The column “Shape” indicates the number and types of a module's inputs and its output. The column “Architecture” indicates whether the module is pre-defined with rule-based calculation or is a trainable neural network.

Regarding the behavior of each module, “Find Element” finds an element that matches the given keyword among all the detected elements. Here, the keywords are the names of colors extracted from the questions. Because there are at most two keywords within a question, two of these modules are required, and each corresponds to one of the keywords.
Table 3. Modules used in the experiment on the FigureQA dataset.

Name           Shape                                  Architecture    Number
Find Element   (None) → Element                       pre-defined     2
Look Up        (Element) → Element                    pre-defined     1
Look Down      (Element) → Element                    pre-defined     1
Look Left      (Element) → Element                    pre-defined     1
Look Right     (Element) → Element                    pre-defined     1
Find Same      (Element) → Elements                   pre-defined     1
Discriminator  (Element/Elements/None) * 2 → Answer   neural network  N

“Look Up” finds the closest element in the area from 45° top-left to 45° top-right of the given element. “Look Down”, “Look Left”, and “Look Right” behave similarly to “Look Up”. “Find Same” finds the set of elements with the same attribute as the given element; in this experiment, we specify this attribute to be color.

“Discriminator” has two inputs. For each input, it masks the original image with the bounding boxes of the given element or set of elements, and the masked image is fed to a neural network to infer the answer. An input can also be empty, in which case the original image is fed to the network directly. To compare our method with existing work fairly, we use a 40-layer DenseNet, similar to the one applied in PReFIL, as the backbone of the Discriminator. Its architecture is shown in Fig. 9. The number of filters in the first convolutional layer of the DenseNet is 64. Considering that the two inputs of the Discriminator are parallel and most of their features are similar, the first convolutional layer processes them independently with shared weights. All three following dense blocks have 12 layers, and their growth rate is set to 12. The number of final classes is 2, representing the answers “Yes” and “No” in FigureQA.

For training, we used cross-entropy loss and an SGD optimizer with learning rate decay. The batch size is set to 64. The learning rate is initialized to 0.1 and drops to 0.01, 0.001, 0.0001, and 0.00001 at epochs 8, 12, 16, and 20, respectively. The maximum number of training epochs is 24; however, because training a 40-layer DenseNet for all 24 epochs is highly time-consuming, only the first 4 epochs are run during the search, and the validation accuracy is returned at that point. After the search on each question is completed, the Discriminator specified by the optimal program is trained again for the full 24 epochs.

Fig. 9. Architecture of our Discriminator, with a 40-layer DenseNet as the backbone.
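As an illustration of the pre-defined spatial modules, the cone test used by “Look Up” can be sketched over element centres. This is a simplification: the paper's modules operate on detected bounding boxes, and the points below are illustrative image coordinates with y increasing downwards.

```python
import math

def look_up(anchor, elements):
    """Return the element whose centre is closest to `anchor` among those
    lying in the cone from 45° top-left to 45° top-right of it.
    Image coordinates: y grows downwards, so "above" means smaller y."""
    ax, ay = anchor
    best, best_dist = None, float("inf")
    for ex, ey in elements:
        dx, dy = ex - ax, ay - ey      # dy > 0 means the element is above
        if dy > 0 and abs(dx) <= dy:   # inside the 45-degree cone
            dist = math.hypot(dx, dy)
            if dist < best_dist:
                best, best_dist = (ex, ey), dist
    return best
```

“Look Down”, “Look Left”, and “Look Right” would follow by rotating the cone condition; returning None when no element falls in the cone is one possible convention.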
Appendix E Results by Question Type and Figure Type in FigureQA Dataset
Table 4. Accuracy on Test Set 2 by different question types.

Question Template                              RN     Human  PReFIL  Ours
Is X the minimum?                              76.78  97.06    –    97.20
Is X the maximum?                              83.47  97.18    –    98.07
Is X the low median?                           66.69  86.39    –    93.07
Is X the high median?                          66.50  86.91    –    93.00
Is X less than Y?                              80.49  96.15    –    98.20
Is X greater than Y?                           81.00  96.15    –    98.07
Does X have the minimum area under the curve?  69.57  94.22    –    94.00
Does X have the maximum area under the curve?  78.45  95.36    –    96.91
Is X the smoothest?                            58.57  78.02    –    71.87
Is X the roughest?                             56.28  79.52    –    74.67
Does X have the lowest value?                  69.65  90.33    –    92.17
Does X have the highest value?                 76.23  93.11    –    94.83
Is X less than Y?                              67.75  90.12    –    92.38
Is X greater than Y?                           67.12  89.88    –    92.00
Does X intersect Y?                            68.75  89.62    –    91.25
Overall                                        72.18  91.21    –    92.79
Table 5. Accuracy on Test Set 2 by different figure types.

Figure Type     RN     Human  PReFIL  Ours
Vertical Bar    77.13  95.90    –    98.25
Horizontal Bar  77.02  96.03    –    97.98
Pie             73.26  88.26    –    92.84
Line            66.69  90.55    –    87.79
Dot Line        69.22  87.20    –    89.57