Blocksworld Revisited: Learning and Reasoning to Generate Event-Sequences from Image Pairs
Tejas Gokhale, Shailaja Sampat, Zhiyuan Fang, Yezhou Yang, Chitta Baral
Arizona State University {tgokhale, ssampa17, zfang29, yz.yang, chitta}@asu.edu
Abstract
The process of identifying changes or transformations in a scene, along with the ability to reason about their causes and effects, is a key aspect of intelligence. In this work we go beyond recent advances in computational perception and introduce a more challenging task, Image-based Event-Sequencing (IES). In IES, the task is to predict a sequence of actions required to rearrange objects from the configuration in an input source image to the one in the target image. IES also requires systems to possess inductive generalizability. Motivated by evidence in cognitive development, we compile the first IES dataset, the Blocksworld Image Reasoning Dataset (BIRD), which contains images of wooden blocks in different configurations and the sequences of moves that rearrange one configuration into the other. We first explore the use of existing deep learning architectures and show that these end-to-end methods under-perform in inferring temporal event-sequences and fail at inductive generalization. We then propose a modular two-step approach, Visual Perception followed by Event-Sequencing, and demonstrate improved performance by combining learning and reasoning. Finally, by showing an extension of our approach on natural images, we seek to pave the way for future research on event sequencing for real-world scenes.

Introduction

Deep neural networks trained in an end-to-end fashion have resulted in exceptional advances in computational perception, especially in object detection [9, 21], semantic segmentation [4, 26], and action recognition [2]. Given this capability, a next step is to enable vision modules to reason about perceived visual entities such as objects and actions. Some works [27] approach this paradigm by inferring spatial, temporal, and semantic relationships between the entities. Other works deal with identifying changes in these relationships (spatial [12] or temporal [18]). Spatial reasoning has been explored in the context of Visual Question Answering (VQA) via the CLEVR dataset [14]. Relation Networks (RN) proposed in [22] augment image feature extractors and language embedding modules with a composite and differentiable relational reasoning module to answer questions about attributes and relative locations of blocks.

In this work, we go beyond and present a new task, Image-based Event Sequencing (IES). Given a pair of images, the goal in IES is to predict a temporal sequence of events or moves needed to rearrange the object-configuration in the first image to that in the second. An important requirement for potential IES solvers is inductive generalizability, the ability to predict an event-sequence of any length, even when trained only on samples with shorter lengths. A simple analogy can be found in the process of sorting a list; a correct program should be able to sort irrespective of the number of swaps required. Inductive generalizability is a characteristic possessed by humans; a person who knows how to drive, but has never driven more than 20 miles, can drive to any farther destination reachable by road, provided with the correct directions. (BIRD is available publicly at https://asu-active-perception-group.github.io/bird_dataset_web/.)

To validate IES systems, we need a testbed, and to the best of our knowledge, no public testbed exists with detailed annotations about spatial configurations and event-sequences.
While CLEVR [14] and Sort-of-CLEVR [22] also contain images of block-configurations, they are artificially generated and, more importantly, do not include detailed sequences between pairs of images. Moreover, the blocks in these datasets are never stacked or in contact, so there are no constraints on the movement of these blocks. The creators of these datasets force the blocks to be at a minimum margin from each other, and thus any block can be picked up and moved without affecting the other blocks in the configuration. However, in real-world scenes, objects do impose constraints on one another; for instance, a book which has a cup on top of it cannot be moved without disturbing the cup. Thus, we compile the Blocksworld Image Reasoning Dataset (BIRD), which includes 1 million samples containing a source image and a target image (each containing wooden blocks arranged in different configurations), and all possible sequences of moves to rearrange the source configuration into the target configuration.

To tackle the IES challenge, we propose a modular approach and decompose the problem into two stages, Visual Perception and Event-Sequencing. Stage-I is an encoder network that converts each input image into a vector representing the spatial and object-level configuration of the image. Stage-II uses these vectors to generate event-sequences. This decomposition of the system into two modules makes the sequencing module standalone and reproducible. While the encoder can change based on domain, the sequencing module, once learned on the blocksworld domain, can be reused on more complex domains, such as real-world scenes. We compare this two-stage approach with several existing end-to-end baselines and show significant improvement.

To test for inductive generalization, we train our models on data containing true sequences with an upper bound on length, and test them on samples that require sequences of longer lengths. We observe that end-to-end methods fail to generalize while two-stage methods exhibit inductive capabilities. Inductive Logic Programming [17], which combines learning and reasoning by using background knowledge, performs the best under this setting, and can be used to learn event-sequences with unbounded lengths.

Thus, our contributions are fourfold; we:
1. introduce the first IES challenge and compile the BIRD dataset as a testbed,
2. show that end-to-end training fails at event-sequence generation and inductive generalization,
3. show the benefits of a two-stage approach, and
4. show that a sequencing module learned on the BIRD data can be re-used on natural images, yielding a capability towards human-level intelligence [23].

Related Work

We identify three tasks that are most relevant to the IES task: "Spot-the-Difference", Reasoning in Visual Question Answering (VQA), and Visual Relationship Extraction.
Change detection between a pair of images has been explored previously with image differencing techniques using unsupervised [1] or semi-supervised [7] methods. However, these are pixel-level techniques and are not designed to compute the semantic differences between two images. The "Spot-the-Difference" task introduced in [12] leverages natural language annotations to generate multi-sentence descriptions of differences between two images. The existing work which comes closest to our IES task is the Viewpoint Invariant Change Captioning (VICC) task [18], where the aim is to generate a textual description of the changes between objects in two images (before and after). However, the VICC model only predicts which object in the before-image changed position, but does not specify its position in the after-image or how that change might have taken place. Since the VICC model is built on the CLEVR dataset, in which blocks are never in contact with each other, there are no constraints on movement, making the reasoning aspect of VICC simpler than IES.
Spatial reasoning has been explored extensively in the context of Visual Question Answering (VQA) via the CLEVR dataset. Given an image, the task is to answer questions that require reasoning about attributes such as shapes, textures, colors, and relative locations of objects in the image. Relation Networks (RN) proposed in [22] seek to solve this problem by augmenting image feature extractors and language embedding modules with a relational reasoning module. The RN is an end-to-end differentiable and composite function that computes relations between the question embedding and all possible combinations of image features. In comparison, the IES task requires not only understanding attributes of objects, but also inferring a sequence of actions or events that could lead to a desired configuration of the objects. In VQA, the input is an image-question pair and the output is a single word or class label, whereas in IES, the input is an image-image pair and the output is a temporal sequence of events. The IES task can be thought of as the question: "How would you navigate from the source image to the target image in blocksworld?". Although the outcomes in VQA and IES are of different types, the capabilities required to perform inference involve building a certain level of reasoning to perform spatial and relational tasks.

Figure 1: Illustration of two event-sequences between an image-pair, with intermediate configurations shown for clarity. The sequences shown involve moves such as Move(Y, table), Move(G, O), and Move(P, out).
Another category related to the IES task is visual relationship extraction, in which the aim is to embed objects and their relationships into <subject, relation, object> triplets, given an image and its caption. [20] consider each triplet as a separate class and learn to predict each relationship as a triplet as well as localize it as a bounding box in image-space, while [25] allow a continuous output space for objects and relations.

The IES Task

In this section, we formulate the IES task in terms of inputs, outputs, and desired properties of the systems that attempt the task.
The input to the IES task is a pair of images (source I^S and target I^T) that contain objects appearing in different configurations. The goal of the IES task is to find an event sequence M = [m_1, ..., m_L], such that performing M on I^S leads to the configuration in I^T. Here L is the length of sequence M and m_t is the move at time t ∈ {1, ..., L}. Figure 1 shows an example. Note that a pair of images can have multiple, unique, or no permissible event-sequences.

Under this problem setting, we define the concept of inductive generalization. Given a training dataset S with n samples, let L_max be the maximum length of sequences found in the dataset:

    S = {X_1, ..., X_n}, where X_i = (I_i^S, I_i^T, M_i)  ∀ i ∈ {1, ..., n}    (1)

    L_max = max_{i ∈ {1, ..., n}} |M_i|    (2)

Then, a system is said to possess inductive generalizability if it is able to predict event-sequences accurately for any sample X̂ = (Î^S, Î^T, M̂) where |M̂| > L_max.
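To make the formulation concrete, the following is a minimal Python sketch of an IES sample and of the train/test split used to probe inductive generalizability; the type names and the split function are illustrative, not part of the released BIRD tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

Move = Tuple[str, str]  # m_t = (block X, destination Y), e.g. ("R", "table")

@dataclass
class IESSample:
    source: np.ndarray      # I^S
    target: np.ndarray      # I^T
    sequence: List[Move]    # M = [m_1, ..., m_L]

def inductive_split(samples: List[IESSample], l_train: int):
    """Train only on sequences of length <= l_train and test on strictly
    longer ones; a solver generalizes inductively if it stays accurate
    on the test half, where |M| exceeds the training L_max."""
    train = [s for s in samples if len(s.sequence) <= l_train]
    test = [s for s in samples if len(s.sequence) > l_train]
    return train, test
```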
Figure 2: Distribution of block images in BIRD: (a) based on number of stacks; (b) based on number of blocks in the image.
The BIRD Dataset

In this work, we focus on the "Blocksworld" setting, where every image contains blocks of different colors arranged in various configurations.
What's so special about blocks? Our motivation for constructing a curated dataset of blocksworld images comes from literature in cognitive development. Extensive studies such as [19, 13, 3] show that playing with wooden blocks benefits the early stages of a child's development. These works show how block-play aids in the development of a child's sensorimotor, symbolic, logical, and mathematical abilities, as well as abstract and causal reasoning. [23] have argued that building with blocks enables children to mathematize the world around them in terms of physics, geometry, visual attributes, and abstract semantics or meanings assigned to blocks.

The crucial insight from these works is that the task of reasoning about a complex visual scene benefits from abstractions in terms of blocks; when every object in a scene is treated as a block, the entire scene can be re-imagined in the blocksworld framework. Correctly generating event-sequences from images requires perceiving objects, colors, and textures, and reasoning about spatial relationships in order to come up with a plan to build towards the goal. [8] use an "Interpretation-by-Synthesis" approach to progressively build up representations of images. We propose a similar construct for visual perception that could aid in reasoning tasks such as the one in IES.

With the claim that the IES task can be learned on the Blocksworld domain, and extended and reused on other domains without re-training, we introduce a new dataset, the
Blocksworld Image Reasoning Dataset (BIRD). BIRD consists of 7267 images of blocks arranged in different configurations, captured against a white background under uniform lighting conditions. We use wooden blocks from a set of six colors C and arrange them in all possible permutations. In doing so, we follow two constraints: an image contains no more than five blocks, and no two blocks of the same color. Figure 2 shows the distribution of the dataset based on the number of blocks and stacks of blocks in the image. Our intention is to distinguish our dataset from CLEVR [14] in that our dataset contains blocks that are in contact or stacked on top of each other, and also that we use real images (as opposed to rendered images in CLEVR).

We annotate each image with two vectors that uniquely represent the configuration of blocks, as shown in Figure 3. The "color-blind arrangement vector" represents the locations of blocks in a grid. The "color vector" represents the colors of the blocks from bottom-to-top and left-to-right, with each color represented as a 3-bit binary vector. For every pair of source and target images, we assign all possible minimal-length event-sequences M, with each move m_t in the sequence given by:

    move(X, Y, t);  t ∈ {1, 2, ..., 8},  X ≠ Y,  where X ∈ C, Y ∈ D = C ∪ {"table"} ∪ {"out"}.    (3)

For example, move(R, G, 2) implies that a red block is moved on top of a green block at the second time-step.
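As an illustration of the annotation scheme, the sketch below builds the two vectors for a configuration, assuming the 5 × 5 arrangement grid implied by the five-block limit; the specific 3-bit color codes and the exact traversal order are hypothetical stand-ins for BIRD's actual encoding.

```python
import numpy as np

# Hypothetical 3-bit color codes; BIRD's actual bit assignment may differ.
COLOR_BITS = {"red": (0, 0, 1), "green": (0, 1, 0), "blue": (0, 1, 1),
              "yellow": (1, 0, 0), "orange": (1, 0, 1), "purple": (1, 1, 0)}

def annotate(stacks):
    """stacks: columns left-to-right, each a bottom-to-top list of colors,
    e.g. [["green", "red"], ["yellow"]] for a two-stack configuration."""
    arrangement = np.zeros((5, 5), dtype=np.uint8)  # color-blind arrangement grid
    colors = []                                     # bottom-to-top, left-to-right
    for col, stack in enumerate(stacks):
        for height, color in enumerate(stack):
            arrangement[height, col] = 1
            colors.append(COLOR_BITS[color])
    color_vector = np.zeros((5, 3), dtype=np.uint8)  # at most five blocks
    color_vector[:len(colors)] = colors
    return arrangement, color_vector
```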
Figure 3: Images with their arrangement and color vectors.

We pair every image in the dataset with every other image and use the CLINGO [6] Answer Set Programming solver to generate a dataset of ⟨image, image, sequence⟩ triplets as shown in Figure 1, uniformly sampled across all sequence lengths ℓ ∈ {no-sequence, 1, ..., 8}. The maximum length of minimal-length sequences in our dataset is 8.

To reason about the configurations, we use the following background knowledge to delineate the conditions under which each move is legal:
Exogeneity: Block A can be moved at time t ⇔ it exists in the configuration at all times t̂ < t.

Freedom of Blocks:
1. Block A is free at time t ⇔ ∀B, ¬on(B, A, t).
2. Block A can be moved ⇔ A is free.
3. Block B can be placed on block A ⇔ A is free.
4. A block that is "out" of the table cannot be moved.

Inertia: A block, unless moved, does not change location.

Sequentialism: At most one move can be performed at each time instance.
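The rules above translate directly into a legality check over symbolic configurations. The following is a minimal sketch; the dict representation of a configuration is ours, not the paper's ASP encoding.

```python
def free(config, block):
    """Freedom rule 1: a block is free iff no other block rests on it.
    config maps each block to what it rests on: "table", "out", or a block."""
    return all(support != block for support in config.values())

def legal_move(config, x, y):
    """Check move(x, y) against the background knowledge."""
    if x not in config or x == y:
        return False                # Exogeneity: x must exist; x != y
    if config[x] == "out":
        return False                # a block that is "out" cannot be moved
    if not free(config, x):
        return False                # Freedom rule 2: x itself must be free
    if y not in ("table", "out") and (y not in config or not free(config, y)):
        return False                # Freedom rule 3: destination block free
    return True

def apply_move(config, x, y):
    """Inertia: only x changes its support; Sequentialism is implicit,
    since exactly one move is applied per call."""
    assert legal_move(config, x, y)
    new_config = dict(config)
    new_config[x] = y
    return new_config

# e.g. legal_move({"R": "table", "G": "R"}, "R", "table") is False: G rests on R.
```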
Methods

Armed with our novel dataset, we test two approaches to the Image-based Event Sequencing (IES) task: End-to-End Learning and Modular Two-Stage Methods.
End-to-End Learning

In BIRD, each move is represented according to Equation 3. Since |D| = 8, we represent each X or Y with an 8-bit one-hot vector, and therefore get a 16-bit representation for each move m_t. The maximum number of moves for any image-pair in our dataset is 8; therefore our ground-truth event-sequence is a 128-bit binary vector. Our input is a pair of RGB images (I^S, I^T), i.e. a 6-channel input. Thus our end-to-end modules are given by:

    f_E : ℝ^{H×W×6} → {0, 1}^{128},    (4)

where H × W are the image dimensions. We train deep neural network architectures that can leverage spatial context, such as ResNet-50 [10], PSPNet [26], and Relational Networks (RN) [22], to directly generate event-sequences from image pairs. We use the Pyramid Scene Parsing Network (PSPNet) as a baseline since it uses pyramidal pooling as a global contextual prior for extracting spatial relations. It is worth exploring if spatial relationships captured by PSPNet for semantic segmentation can be useful in the IES task. Relational Networks have been shown to work for relational reasoning in Visual Question Answering and take an image-question pair as input. An RN extracts image features using a Convolutional Neural Network (CNN) [15] and text features using a Long Short-Term Memory (LSTM) [11] embedding, and uses these features as inputs to the relational module. In our case, we have an image-image pair instead; thus, we replace the LSTM with another CNN feature extractor and train the RN end-to-end.
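For reference, a minimal sketch of this target encoding; the color names standing in for C are illustrative.

```python
import numpy as np

DOMAIN = ["red", "green", "blue", "yellow", "orange", "purple", "table", "out"]
MAX_MOVES = 8

def encode_sequence(moves):
    """Encode a move sequence as the 128-bit ground-truth vector:
    8 slots, each an 8-bit one-hot for X followed by an 8-bit one-hot
    for Y; unused trailing slots stay all-zero."""
    vec = np.zeros(MAX_MOVES * 16, dtype=np.uint8)
    for t, (x, y) in enumerate(moves):
        vec[16 * t + DOMAIN.index(x)] = 1
        vec[16 * t + 8 + DOMAIN.index(y)] = 1
    return vec

# e.g. encode_sequence([("red", "green"), ("yellow", "table")])
```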
Modular Two-Stage Approach

We decompose the task into Stage-I (Visual Perception) and Stage-II (Event Sequencing).

Stage-I is trained to encode input images into an interpretable representation. Spatial localization of blocks with respect to one another requires knowing the relative location of each block, given by an arrangement vector, along with the characteristics of each block, given by a color vector. We train an 8-layer convolutional network (f_A) to encode this arrangement vector. In our dataset, the maximum number of blocks is 5, so the arrangement can be expressed as a 5 × 5 grid. Then we train a ResNet-50 based color grounding module (f_C) as in [5], and use it along with the predicted arrangement vector to obtain the color vector, which represents the color of each block as a 3-bit binary vector, in a bottom-to-top, left-to-right order. Thus our visual perception is given by the two encoders, expressed as:

    f_A : ℝ^{H×W×3} → ℝ^{5×5},    f_C : ℝ^{H×W×3} → ℝ^{5×3}.    (5)

Stage-II is trained to use the encoded representation of images to generate minimal-length sequences of moves to reach the target from the source configuration. We compare the efficacy of Fully Connected neural networks (FC), reinforcement learning using the Q-Learning algorithm (QL), and rule-based Inductive Logic Programming (ILP). The worst-case sequence length (8) serves as an upper bound for sequence generation using QL and ILP. Given an action m_t and a configuration z_t, we also develop a Logic Engine that can deterministically generate the next configuration z_{t+1}. The logic engine (g_l) can be expressed as:

    z_{t+1} = g_l(z_t, m_t),    (6)

    where z_0 = [f_A(I^S), f_C(I^S)].    (7)
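Reusing legal_move and apply_move from the background-knowledge sketch above, the logic engine of Equation 6 can be sketched as follows, with the symbolic dict configuration standing in for the encoded [f_A, f_C] vectors:

```python
def logic_engine(z_t, move):
    """g_l (Eq. 6): deterministically compute the next configuration
    z_{t+1} from z_t and the move m_t = (x, y)."""
    x, y = move
    if not legal_move(z_t, x, y):      # background knowledge filters moves
        raise ValueError(f"illegal move({x}, {y})")
    return apply_move(z_t, x, y)

def rollout(z_0, moves):
    """Apply a candidate event-sequence M to the source configuration;
    the result can be compared against the target configuration."""
    z = z_0
    for m in moves:
        z = logic_engine(z, m)
    return z
```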
Deep Neural Networks.
We explore whether conventional neural networks can be used to generate discrete event-sequences in the IES task by using Fully Connected (FC) networks as one of the baselines to predict event-sequences. We train an FC network with five layers under a multi-label classification paradigm with a binary cross-entropy loss and the Adam optimizer.
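A minimal PyTorch sketch of this baseline follows; the hidden widths are illustrative, and the input dimension assumes the flattened per-image encoding of Equation 5 (a 5 × 5 arrangement grid plus a 5 × 3 color vector, i.e. 40 values per image):

```python
import torch
import torch.nn as nn

class FCSequencer(nn.Module):
    """Five-layer fully connected sequencer: two 40-dim encodings in,
    128 move-bit logits out."""
    def __init__(self, in_dim=80, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, z_source, z_target):
        return self.net(torch.cat([z_source, z_target], dim=-1))

model = FCSequencer()
criterion = nn.BCEWithLogitsLoss()  # multi-label binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```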
Reinforcement Learning.
Q-learning (QL) [24] is a widely used model-free reinforcement learning algorithm that models the world as a finite Markov Decision Process (MDP) in which agents receive a reward based on the action they perform at every time-step. The QL algorithm finds an optimal policy by maximizing the total discounted expected reward. We view our event-sequencing problem as a finite MDP between the start image and target image, with moves in the temporal sequence being analogous to "actions". The policy that we learn in this reinforcement learning framework is designed such that it is consistent with the background knowledge for event-sequencing in BIRD.
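A minimal tabular sketch of this setup follows; the reward shaping, hyperparameters, and state representation (hashable symbolic configurations, e.g. sorted tuples of (block, support) pairs) are illustrative choices, not the paper's exact design:

```python
import random
from collections import defaultdict

def q_learning_episode(q, start, goal, actions_fn, step_fn,
                       alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=8):
    """One episode of tabular Q-learning from a start to a goal
    configuration. actions_fn returns only moves that are legal under
    the background knowledge; step_fn is the deterministic transition."""
    state = start
    for _ in range(max_steps):
        actions = actions_fn(state)
        if not actions:
            break
        if random.random() < epsilon:           # epsilon-greedy exploration
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        nxt = step_fn(state, action)
        reward = 1.0 if nxt == goal else -0.01  # reach target; small step cost
        best_next = max((q[(nxt, a)] for a in actions_fn(nxt)), default=0.0)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        if nxt == goal:
            break
        state = nxt

q_table = defaultdict(float)  # (state, action) -> estimated return
```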
Inductive Logic Programming.
Inductive Logic Programming (ILP) [17] is a subclass of machine learning algorithms that aims to learn logic programs. Given the encoded background knowledge B of the domain and a set of positive and negative examples represented as structured facts E+ and E−, the ILP system learns a logic program that entails E+ but not E−.

[16] have shown that the addition of a formal reasoning layer to standard statistical machine learning approaches significantly increases the reasoning capability of an agent. With that motivation, we use ILP to learn Answer Set Programs for our event-sequencing task. Using examples from BIRD represented in a structured ASP format, we learn the effects of the action move(X, Y, t) on the relative positions of X and Y, equivalent to the rule:

    on(X, Y, t + 1) :- move(X, Y, t).    (8)

Table 1: Comparison of all methods for the IES task on BIRD, with respect to the FSA and SLA metrics. Note that PR refers to Stage-I with perfect recognition.

    Approach              Method     FSA     SLA
    Human                            100     100
    End-to-End Learning   ResNet50   30.52   36.26
                          PSPNet     35.04   56.69
                          RN         34.37   52.09
    PR + Stage-II         FC         68.87   72.58
                          QL         84.10   87.83
                          ILP        100     100
    Stage-I + Stage-II    FC         56.25   60.24
                          QL         68.98   71.17
                          ILP

Figure 4: Inductive capability of each method (End-to-End, PR/Enc + FC/QL/ILP), shown in terms of FSA (%) against ℓ, the maximum sequence-length in the training set, on a test set containing sequences longer than those used for training. (Best when viewed in color.)
Experiments and Results

We define two metrics for our experiments. If y and ŷ are the ground-truth and the predicted sequences, then Full Sequence Accuracy (FSA) is the percentage of exact matches, and Step Level Accuracy (SLA) is the percentage of common moves between y and ŷ:

    FSA = (1/N) Σ_{i=1}^{N} 𝟙{y_i = ŷ_i}    (9)

    SLA = (1/N) Σ_{i=1}^{N} (1/L) Σ_{ℓ=1}^{L} 𝟙{y_{iℓ} = ŷ_{iℓ}}    (10)
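Both metrics are straightforward to compute; a minimal sketch, assuming each sequence is padded to a fixed number L of move slots:

```python
import numpy as np

def fsa(y_true, y_pred):
    """Full Sequence Accuracy (Eq. 9): fraction of exact sequence matches."""
    return np.mean([np.array_equal(y, p) for y, p in zip(y_true, y_pred)])

def sla(y_true, y_pred):
    """Step Level Accuracy (Eq. 10): per-sample fraction of matching
    move slots, averaged over the dataset."""
    scores = []
    for y, p in zip(y_true, y_pred):
        scores.append(sum(yt == pt for yt, pt in zip(y, p)) / len(y))
    return np.mean(scores)
```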
We evaluate and compare end-to-end and two-stage methods in Table 1. Two-stage methods significantly outperform all end-to-end methods, even with imperfect Stage-I encoders (Enc). Since the output space is exponentially large, we postulate that end-to-end networks lack the ability to map from pixel-space to this large sequence-space.

Figure 5: Experiments on natural images: given a source and a target image, we get object detections using a Mask-RCNN. These detections are re-imagined in the blocksworld framework, on which we perform event-sequencing using models trained on BIRD to get output moves (e.g. Move(TV, Suitcase), Move(Ball, out), Move(Skateboard, out), Move(Backpack, out)).

Table 2: Results of using the BIRD sequencing module for natural images (with Perfect Recognition or Mask-RCNN as Stage-I).

    Approach    PR + Stage-II           Stage-I + Stage-II
                FC      QL      ILP     FC      QL      ILP
    FSA (%)     55.34   92.20   100     47.47   64.26   75.55
    SLA (%)     61.06   96.42   100     51.71   69.16   80.57
If an image-pair requires more moves than present in the training data, our system should inductively infer this longer sequence of steps. We test this inductive generalizability with an ablation study; we create datasets such that the training set has samples with maximum length ℓ and the test set has samples with minimum length ℓ + 1. Figure 4 illustrates that end-to-end methods do not possess this ability, while two-stage methods generalize well to some degree; as ℓ increases, the inductive capability of QL and FC increases. Inductive Logic Programming with perfect recognition (PR) is able to generalize irrespective of the value of ℓ.

We collected a set of 30 images which contain the object classes "Person", "TV", "Suitcase", "Table", "Backpack", and "Ball" as a prototype to test the hypothesis that the sequencing module trained on BIRD can be reused for natural-image inputs. We used a pre-trained Mask-RCNN [9] network to produce object detections and re-imagined each image in the blocksworld setting by using a one-to-one mapping from each object to a block-type in BIRD. Thus, for a pair of natural images, we can test various sequencing modules trained on BIRD by directly using the corresponding blocksworld re-imaginations to generate event-sequences, as shown in Figure 5. Table 2 shows a comparison of our Stage-II baselines.
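A sketch of the re-imagination step is given below; the class-to-block mapping and the geometric support test are hypothetical stand-ins for the actual pipeline:

```python
# Hypothetical mapping from detected classes to BIRD block identities.
OBJECT_TO_BLOCK = {"Person": "red", "TV": "green", "Suitcase": "blue",
                   "Backpack": "yellow", "Ball": "orange", "Table": "table"}

def reimagine_as_blocks(detections):
    """detections: list of (class_name, (x1, y1, x2, y2)) boxes from a
    detector such as Mask-RCNN. Returns a blocksworld configuration by
    mapping objects to blocks and using a crude support heuristic:
    A rests on B if A's bottom edge is near B's top edge (image
    coordinates grow downward) and the boxes overlap horizontally."""
    config = {}
    for name_a, (ax1, ay1, ax2, ay2) in detections:
        if name_a == "Table":
            continue                                  # the table is the ground plane
        support = "table"
        for name_b, (bx1, by1, bx2, by2) in detections:
            if name_a == name_b:
                continue
            h_overlap = min(ax2, bx2) - max(ax1, bx1)
            if h_overlap > 0 and abs(ay2 - by1) < 10:  # threshold is illustrative
                support = OBJECT_TO_BLOCK[name_b]
        config[OBJECT_TO_BLOCK[name_a]] = support
    return config
```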
Discussion

Table 1 shows that all three end-to-end methods are significantly outperformed by two-stage methods, even when using imperfect encoders from Stage-I. Our output space consists of 8 moves with each move having 48 possibilities, making the number of possible outputs 48^8 ≈ 2.8 × 10^13. We postulate that end-to-end networks are incapable of handling an output space as large as in IES, and as a result fail to identify the semantic correspondence between the pixel-space and the sequence-space. Since the two-stage approach is guided by the perception module to encode an interpretable latent vector, it aids the sequencing module in inferring sequences. We argue that encoding scenes from the pixel-domain into semantic and interpretable representations and then using these for reasoning has an edge over learning to reason directly from pixels. ILP with background knowledge outperforms all the other baselines, as shown in Table 1. We note that while Q-Learning also achieves good accuracies on the IES task, it is not able to generalize as well as ILP in terms of inductive reasoning capabilities, as can be seen from Figure 4.

Conclusion
In this paper, we introduced the Image-based Event Sequencing (IES) challenge along with the Blocksworld Image Reasoning Dataset (BIRD), which we believe has the potential to open new research avenues in cognition-based learning and reasoning, as a step towards combining learning and reasoning in computer vision. Our experiments show that end-to-end deep neural networks fail to reliably generate event-sequences and do not exhibit inductive generalization. We argue that encoding scenes from the pixel-domain into interpretable representations and then using these for reasoning has an edge over learning to reason directly from pixels. By decomposing the task into two modules, perception and sequencing, we propose a two-stage approach that has multiple advantages. First, the sequencing benefits from a perception module that encodes images into meaningful spatial representations. Next, we show that the sequencing module trained on BIRD can be reused in the natural-image domain by simply replacing the perception module with object detectors. Finally, our experiments show that modular methods possess inductive generalizability, opening up promising avenues for visual reasoning. Our future work will deal with expanding BIRD into a more generic dataset by relaxing its constraints: we plan to allow a larger variety of actions and a larger set of block characteristics, and to extend this approach to other complex real-world environments.
Acknowledgement
The authors are grateful to the National Science Foundation for Grant 1816039 under the NSF Robust Intelligence Program.
References

[1] Lorenzo Bruzzone and Diego F. Prieto. Automatic analysis of the difference image for unsupervised change detection. IEEE Transactions on Geoscience and Remote Sensing, 38(3):1171–1182, 2000.
[2] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[3] Sally S. Cartwright. Play can be the building blocks of learning. Young Children, 43(5):44–47, July 1988.
[4] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[5] Zhiyuan Fang, Shu Kong, Charless Fowlkes, and Yezhou Yang. Modularized textual grounding for counterfactual resilience. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[6] Martin Gebser, Benjamin Kaufmann, Roland Kaminski, Max Ostrowski, Torsten Schaub, and Marius Schneider. Potassco: The Potsdam answer set solving collection. AI Communications, 24(2):107–124, 2011.
[7] Lionel Gueguen and Raffay Hamid. Large-scale damage detection using satellite imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1321–1328, 2015.
[8] Abhinav Gupta, Alexei A. Efros, and Martial Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In European Conference on Computer Vision, pages 482–496. Springer, 2010.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[12] Harsh Jhamtani and Taylor Berg-Kirkpatrick. Learning to describe differences between pairs of similar images. arXiv preprint arXiv:1808.10584, 2018.
[13] Harriet Johnson. The Art of Block Building. The John Day Company, New York, 1983.
[14] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017.
[15] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] Arindam Mitra and Chitta Baral. Addressing a question answering challenge by combining statistical methods with inductive rule learning and reasoning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[17] Stephen Muggleton. Inductive logic programming. New Generation Computing, 8(4):295–318, 1991.
[18] Dong Huk Park, Trevor Darrell, and Anna Rohrbach. Viewpoint invariant change captioning. arXiv preprint arXiv:1901.02527, 2019.
[19] Jean Piaget. Play, Dreams, and Imitation in Childhood. W.W. Norton and Co., New York, 1962.
[20] Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In Proceedings of the IEEE International Conference on Computer Vision, pages 1928–1937, 2017.
[21] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[22] Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
[23] Julie S. Sarama and Douglas H. Clements. Building blocks and cognitive building blocks: Playing to know the world mathematically. American Journal of Play, 1(3):313–337, Winter 2001.
[24] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[25] Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed M. Elgammal, and Mohamed Elhoseiny. Large-scale visual relationship understanding. CoRR, abs/1804.10660, 2018.
[26] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[27] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.