A Neural-Symbolic Architecture for Inverse Graphics Improved by Lifelong Meta-Learning
Michael Kissner and Helmut Mayer
Institute for Applied Computer Science, Bundeswehr University Munich, Germany
{michael.kissner, helmut.mayer}@unibw.de

Abstract.
We follow the idea of formulating vision as inverse graphics and propose a new type of element for this task, a neural-symbolic capsule. It is capable of de-rendering a scene into semantic information feed-forward, as well as rendering it feed-backward. An initial set of capsules for graphical primitives is obtained from a generative grammar and connected into a full capsule network. Lifelong meta-learning continuously improves this network's detection capabilities by adding capsules for new and more complex objects it detects in a scene using few-shot learning. Preliminary results demonstrate the potential of our novel approach.
The idea of inverting grammar parse-trees to generate neural networks is not new [24,25], but has been largely abandoned. We revisit this idea and invert a generative grammar into a network of neural-symbolic capsules. Instead of labels, this capsule network outputs an entire scene-graph of the image, a structure that is commonplace in modern game engines like Godot [22]. Our approach is an inverse graphics pipeline for the prospective idea of an inverse game-engine [2,27]. We begin by introducing the generative grammar and how to invert symbols and rules to obtain neural-symbolic capsules (Section 3). They are internally different from the ones proposed by Hinton et al. [4], as they essentially act as containers for regression models. Next, we present a modified routing-by-agreement and training protocol (Section 4), coupled to a lifelong meta-learning pipeline (Section 5). Through meta-learning, the capsule network continuously grows and trains individual capsules. We finally demonstrate the potential of the approach by presenting some results based on an "Asteroids"-like environment (Section 6) before ending with a conclusion.
Capsule Networks.
In [4] Hinton et al. introduced capsules, extending the idea of classical neurons by allowing them to output vectors instead of scalars. These vectors can be interpreted as attributes of an object and aim to reduce the information-loss between layers of convolutional neural networks (CNNs) [6,8]. Capsules require specialized routing protocols, such as routing-by-agreement [20], where the activation probability of a capsule is dependent on the agreement of its real inputs with their expected values. Further extensions for capsules have been proposed, such as using matrices internally [5], using 3D input [32] and improving equivariance [9].
Neural-Symbolic Methods.
There has been a strong effort to make a purely neural approach to vision more interpretable [10,13,16,19,21,31]. An alternative approach to interpretability has been to deeply intertwine symbolic methods with connectionist methods. For computer vision, this is proving fruitful for de-rendering and scene decomposition [7,11,12,23,26,28,34]. Many of these approaches use shape programs, a decomposition of the scene into a set of rendering instructions. The scene-graphs we construct can be viewed as such shape programs, but in a different representation. This symbolic information is well suited for more complex tasks, such as visual question answering (VQA) [14,30] and scene manipulation [29]. We follow many of the presented ideas in our work.
The neural-symbolic capsules are derived from a modified [15] generative attributed context-free grammar for image generation. We require that our grammar is non-recursive and has a finite number of symbols to avoid infinite productions. The following notation is used for our grammar G:

G = (S, V, Σ, R, A, C)   (1)

S: The axiom (starting) symbol S of our grammar is some object name (e.g., [car] or [house]) for which we want to generate further, more detailed symbols.

V: The finite set of non-terminal symbols V represents parts of the axiom symbol (e.g., [car door]) or parts of parts (e.g., [door handle]). The further down the grammar parse-tree a symbol resides, the more primitive it becomes.

Σ: The set of terminal symbols Σ consists of graphical primitives. These may include elements such as [edge] or [sphere]. Whichever terminal symbols are chosen will determine the possible complexity that can be represented by the non-terminal symbols.

R: A production rule r ∈ R is of the form

Ω → λ   with   Ω ∈ V,  λ ∈ ⋃_{i ∈ ℕ∖{0}} (V ∪ Σ)^i.   (2)

The right-hand-side (RHS) of a rule r has the form λ = λ_1 ⋯ λ_|λ|, where λ_i is either terminal or non-terminal and by |λ| we denote the total number of produced symbols. There may be multiple rules in R that have the same left-hand-side (LHS) symbol. We also introduce a special function called draw that applies to terminal symbols, i.e., primitives, forcing them to produce a set of pixels corresponding to their graphical representation (cf. Figure 1):

draw: λ → [pixel-layer],   λ ∈ Σ.   (3)

A: Every terminal and non-terminal symbol λ_j with rule r, as well as each [pixel-layer], is associated with an attribute vector α_j = (α_j^1, ⋯, α_j^n).

C: For r to produce meaningful attributes, they must be constrained by a set of realistic laws that allow for a wide spectrum of results. Particularly, we introduce a set of non-linear equations which constrain the attributes. Each attribute α_j^i of a symbol λ_j produced by rule r: Ω → λ is associated with a constraint g_j^i and calculated using α_j^i = g_j^i(α_Ω). By A(r) we denote the set of attributes {α_j} and by C(r) the set of all constraints {g_j^i} for r. The draw function is also considered to be such a constraint g.

Fig. 1.
Draw: A grammar producing an image from a [ house ] symbol.
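To make the grammar concrete, the following is a minimal sketch of an attributed context-free grammar as defined above: rules carry per-symbol constraints g that compute a part's attributes from the parent's. All class and attribute names here are illustrative, not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

@dataclass
class Rule:
    lhs: str                                   # non-terminal symbol Omega
    rhs: List[str]                             # produced symbols lambda_1 ... lambda_|lambda|
    # C(r): one constraint per produced symbol, computing its attributes from alpha_Omega
    constraints: Dict[str, Callable[[dict], dict]] = field(default_factory=dict)

@dataclass
class Grammar:
    axiom: str                                 # S, e.g. "house"
    nonterminals: Set[str]                     # V
    terminals: Set[str]                        # Sigma, graphical primitives
    rules: List[Rule]                          # R

    def produce(self, symbol: str, attrs: dict) -> List[Tuple[str, dict]]:
        """Apply the first matching rule; terminals pass through unchanged (ready for draw)."""
        for r in self.rules:
            if r.lhs == symbol:
                return [(s, r.constraints[s](attrs)) for s in r.rhs]
        return [(symbol, attrs)]

# Toy grammar: a [house] produces a depth-sorted [triangle] roof and [square] body.
g = Grammar(
    axiom="house",
    nonterminals={"house"},
    terminals={"triangle", "square"},
    rules=[Rule(
        lhs="house",
        rhs=["triangle", "square"],
        constraints={
            "triangle": lambda a: {"pos": (a["x"], a["y"] + a["h"]), "size": a["w"]},
            "square":   lambda a: {"pos": (a["x"], a["y"]), "size": a["w"]},
        },
    )],
)

parts = g.produce("house", {"x": 0, "y": 0, "w": 2, "h": 2})
```

The RHS order [triangle] [square] encodes the depth-sorting described below: the triangle is drawn first, then the square.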
The order of the symbols produced by a rule represents depth-sorting. For example, [triangle] [square] first draws the triangle, then the square. We interpret two rules with the same LHS symbol to be equivalent to either drawing different unique viewpoints (from the front, from the back) of the same object, drawing a primitive using different part configurations (chair with padding, chair without padding) or drawing it in different styles (sketch, photo).

Fig. 2.
Inversion of a symbol of the grammar parse-tree results in a capsule. We illustrate both the symbol and the capsule using a hexagon to avoid confusion with neurons.
In this section we introduce our neural-symbolic capsules. We "invert" a symbol of our grammar to create a capsule and connect it in reverse to the order in the parse-tree to form a capsule network. Our approach to the capsule's internals is different from [20]. The idea of routing-by-agreement and vectorized outputs remains unchanged; however, we replace the internal algorithm by a regression model and output an additional activation probability.
Terminal Symbols → Primitive Capsules.
Each terminal symbol represents a renderable graphical primitive that is connected directly to a layer of pixels and we refer to its inversion as a primitive capsule. These capsules perform detection based on pixel inputs.
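The paper's implementation renders primitives from signed distance fields (Section on implementation, [18]). As a minimal 2D sketch of such a draw function, the following rasterizes an axis-aligned box; the attribute layout (center, half-extents) is chosen here for illustration only.

```python
import numpy as np

def sdf_box(px, py, cx, cy, hw, hh):
    """Signed distance from points (px, py) to a box centered at (cx, cy)
    with half-extents (hw, hh); negative inside, positive outside."""
    dx, dy = abs(px - cx) - hw, abs(py - cy) - hh
    outside = np.hypot(np.maximum(dx, 0.0), np.maximum(dy, 0.0))
    inside = np.minimum(np.maximum(dx, dy), 0.0)
    return outside + inside

def draw(attrs, res=32):
    """Toy draw function g: rasterize the primitive over the unit square."""
    y, x = np.mgrid[0:res, 0:res] / res
    d = sdf_box(x, y, *attrs)
    return (d <= 0.0).astype(np.float32)   # pixel-layer: 1 inside, 0 outside

img = draw((0.5, 0.5, 0.25, 0.25))         # centered square, half the image width
```

A primitive capsule's detection is then the approximate inversion of this map from attributes to pixels, as formalized below.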
Non-Terminal Symbols → Semantic Capsules.
We invert non-terminal symbols to form semantic capsules. By Ω we henceforth interchangeably refer to both the capsule and its corresponding symbol.
Rules → Routes.
The constraints g of a rule r take the attributes α_Ω of symbol Ω and produce new attributes α_1, ⋯, α_|λ|. After inversion, the capsule Ω takes those same attributes α_1, ⋯, α_|λ| as input to generate α_Ω using g^{-1}. However, g is not invertible in most cases and we instead introduce γ as our best approximation, such that

‖g(γ(α)) − α‖   (4)

is minimized (cf. Figure 2). We refer to the inverted rule as a route and depending on context we denote by r both the rule and the route. This also holds true for primitive capsules, where detection means the inversion of its draw function.

We use a modified routing-by-agreement protocol to find the best fitting route and attributes. Ω may appear on the RHS of multiple routes (i.e., LHS of a rule), but as only one route leads to the activation of the capsule, we introduce an activation probability p_r for each. Our goal is attribute equivariance and activation probability invariance under feature-preserving transformations. We propose the following internals of our capsule (cf. Figure 3):

1. The output α_Ω^r for a route r is calculated using

α_Ω^r = γ_r(α_{1, ⋯, |λ|}).   (5)

2. For each input α_j^r for a route r, we estimate the expected input value α̃_j^r as if α_j^r were unknown, using the following equation:

α̃_j = g_{r,j}(α_Ω^r).   (6)

3. The activation probability p_Ω^r of a route r is calculated as

p_Ω^r = (1 / |(λ)_r|) Σ_{(λ)_r} (‖Z(α_i, α̃_i)‖_1 / |Z|) · w(p_i / p̄_i − 1),   (7)

Fig. 3.
The inner structure of a capsule Ω with inputs α_i, p_i and outputs α_Ω, p_Ω, representing our routing-by-agreement protocol (Equations 5 to 9). The outputs are stored in an observation table and individual routes are highlighted in yellow.

where (λ)_r denotes the set of all inputs that contribute to a route r, p_i the route's input capsule's probability of activation, Z an agreement-function with output vector of size |Z|, ‖·‖_1 the l_1-norm, w some window function with w(0) = 1 and sup{w} = 1, and p̄_i the past mean probability for that input.

4. Steps 1-3 are repeated for each r ∈ R(Ω).

5. Find the route that was most likely used,

r_final = argmax_r {p_Ω^r},   (8)

and set the final output as

p_Ω = p_Ω^{r_final},   (9)
α_Ω = α_Ω^{r_final}.   (10)

Steps 1 and 2 correspond to an architecture equivalent to a de-rendering autoencoder, g(γ(α)) = α̃, but with known interpretation for the latent variables (attributes). Here, γ acts as encoder and g as decoder.

For now assume that p̄_i is known in step 3. The agreement-function Z measures how well the inputs of γ correspond to the outputs of g. For semantic capsules, we choose the agreement-function

Z(α_i, α̃_i) = max{w(α̃_i − α̂_i) : α̂_i ∈ R_{α_i}},   (11)

where w describes an n-dimensional window function and R_{α_i} the set of rotationally equivalent α_i. For primitive capsules, finding an appropriate Z depends very much on the design and symmetries of the decoder g (draw-function).
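Steps 1-5 above can be sketched as follows. The route's regression models γ_r and g_r are stand-ins (here: toy callables), the window is a Gaussian, and the agreement-function ignores rotational equivalents for brevity; all of this is an illustrative reduction of Equations 5-10, not the paper's implementation.

```python
import numpy as np

def route_probability(alphas, probs, gamma_r, g_r, p_bar, sigma=1.0):
    """Steps 1-3: route output (Eq. 5), expected inputs (Eq. 6), activation (Eq. 7)."""
    alpha_out = gamma_r(alphas)                        # Eq. 5: encode parts -> whole
    w = lambda x: np.exp(-0.5 * (x / sigma) ** 2)      # window function, w(0) = 1
    p_terms = []
    for j, (a_j, p_j) in enumerate(zip(alphas, probs)):
        a_tilde = g_r(alpha_out)[j]                    # Eq. 6: decode expected input
        Z = w(a_j - a_tilde)                           # per-attribute agreement (Eq. 11, sans rotations)
        p_terms.append(np.linalg.norm(Z, 1) / Z.size * w(p_j / p_bar[j] - 1.0))
    return alpha_out, sum(p_terms) / len(p_terms)      # Eq. 7

def forward(routes, alphas, probs, p_bar):
    """Steps 4-5: evaluate every route, keep the most likely one (Eqs. 8-10)."""
    results = [route_probability(alphas, probs, gr, g, p_bar) for gr, g in routes]
    return max(results, key=lambda t: t[1])

# Toy route: gamma averages the part attributes, g predicts each part back as the mean.
gamma = lambda xs: sum(xs) / len(xs)
g = lambda out: [out, out]
alphas = [np.array([1.0, 2.0]), np.array([1.0, 2.0])]
alpha_out, p = forward([(gamma, g)], alphas, probs=[1.0, 1.0], p_bar=[1.0, 1.0])
```

With two identical parts the expected and real inputs agree exactly, so the route activates with probability 1.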
Individual capsules are connected as shown in Figure 2. A full capsule network is constructed from multiple grammars with different axioms ([table], [chair], ...). To avoid multiple capsules for the same symbol in the network, we merge them and introduce an observation table Λ that stores all occurrences on a per-image basis (cf. Figure 3). For instance, a [table] capsule does not need to connect to four [table-leg] capsules, but only to one with four entries in its observation table. As these observations are reset at the beginning of each pass, we assume that all past entries in Λ are stored in permanent memory elsewhere as

((p_λ)^(i), (α_λ)^(i), (α_Ω)^(i))_r,   (12)

allowing us to calculate the mean value p̄_λ, as well as to perform meta-learning.

During a forward pass the entries in all the observation tables form one or many tree structures, which we call the observed parse-trees. Their topmost symbol is not necessarily one of the axioms of the grammars the capsule network is based on (cf. left of Figure 4). Each observed parse-tree, thus, induces its own observed grammar with the topmost symbol being its observed axiom. For the multitude of observed grammars in the capsule network, we postulate that we can always define a higher-level grammar by simply taking their union and defining a new axiom with a rule that produces the previous axioms (cf. Figure 4). For example, the grammars for [table] and [chair] allow us to define a meaningful higher-level grammar with [dining-room] as the axiom.

Fig. 4.
A capsule network with all activated capsules in blue (left). Here, the topmost activated capsules (dark blue) do not have a common parent capsule that activated. In this case the meta-learning agent adds a common parent / axiom (right).
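The observation table Λ with per-pass reset and permanent memory (Equation 12) can be sketched as a small container class; the method and field names are hypothetical.

```python
from collections import defaultdict

class ObservationTable:
    """Per-image observations for merged capsules, with permanent memory
    for the past mean activation probability p-bar used in Eq. 7."""

    def __init__(self):
        self._rows = defaultdict(list)     # symbol -> [(p, alpha_parts, alpha_out)]
        self._memory = defaultdict(list)   # permanent store of past activations

    def observe(self, symbol, p, alpha_parts, alpha_out):
        self._rows[symbol].append((p, alpha_parts, alpha_out))
        self._memory[symbol].append(p)

    def mean_p(self, symbol):
        past = self._memory[symbol]
        return sum(past) / len(past) if past else 1.0

    def reset(self):
        """Called at the start of each forward pass; memory persists."""
        self._rows.clear()

table = ObservationTable()
for _ in range(4):                         # one [table-leg] capsule, four entries
    table.observe("table-leg", 0.9, {"size": 0.1}, {"pos": 0.0})
```

This mirrors the [table] example above: one [table-leg] capsule holds four rows rather than the network holding four leg capsules.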
Training Primitive Capsules. We assume the decoder g (draw) for primitive capsules is known. Finding an analytical solution to γ is out of reach. Instead, we use a regression model for γ and synthesize training sets with g. We define the inputs to g using quantile functions Q_j(p) = p for each attribute α_Ω^j, which we may refine according to our prior knowledge. Next, with χ_{i,j} some uniform random variable, χ_{i,j} ∼ U([0, 1]), and f some function that applies random backgrounds, occlusions and special effects, we generate γ's virtually infinite training set using

((f(g(Q_j(χ_{i,j}))))^(i), (Q_j(χ_{i,j}))^(i)).   (13)

Training Semantic Capsules.
If only γ of a route is known and g unknown, we use a similar method to the case above. Ideally, we calculate α_Ω using γ and train g using the training sets

((γ(α_λ))^(i), (α_λ)^(i)).   (14)

We must, however, first find a suitable γ. The initial output attributes of our semantic capsules consist of the distinct set of all input attributes ⋃_i A(r_i). We are free in our choice for γ, which is non-injective in most cases. It is expected that there will be collisions, i.e., different sets of inputs leading to the same output. These collisions are the main focus of our meta-learning pipeline. To minimize these collisions, we choose a γ that calculates the mean of inputs of the same type, weighted by their size (width, height, depth). This weighting by size is to ensure that, for example, a wooden chair with many metallic screws is still considered wooden instead of metallic. For a general k-th attribute we have:

α_Ω^k = γ^k(α_λ) = (1 / Σ_λ ‖α_λ^size‖) Σ_λ α_λ^k · ‖α_λ^size‖.   (15)

However, we use special functions for size and position,

α_Ω^size = γ^size(α_λ) = max_{λ,i}(R_Ω^{-1} · (α_λ^pos + R_λ · B_{λ,i})) − min_{λ,i}(R_Ω^{-1} · (α_λ^pos + R_λ · B_{λ,i}))   (16)

α_Ω^pos = γ^pos(α_λ) = R_Ω · (1/2)[max_{λ,i}(R_Ω^{-1} · (α_λ^pos + R_λ · B_{λ,i})) + min_{λ,i}(R_Ω^{-1} · (α_λ^pos + R_λ · B_{λ,i}))],   (17)

to ensure that they are in the correct reference frame.
Here, α^rot, α^size, α^pos are the vectorized subsets of the attribute vector α for rotation, size and position, R_λ and R_Ω indicate the Euler rotation matrix calculated from the rotation attributes α_λ^rot and α_Ω^rot, and B_{λ,i} the i-th corner position vector of the bounding box of λ (i.e., pairwise permutations of ±α_λ^size / 2). For α_Ω^pos, α_Ω^rot and α_Ω^size, T_i(·) denote transformation functions that rotate (acting on α_λ^pos, α_λ^rot), translate (acting on α_λ^pos) and scale (acting on α_λ^pos, α_λ^size) all parts, while leaving the relative rotation, position and size to each other unchanged.

For the other attributes α_λ^k we have little knowledge on how to perform feature-preserving transformations. However, if our original training set (cf. Equation 14) contains an output attribute α_Ω^k for which all entries are smaller than some ε, we can safely assume that we have never encountered an object with this attribute and are free to "invent" possible values for this attribute type by simply setting all input α_λ^k uniformly to some constant. For example, if our training set is filled with real apples, for which the stem is brown and the body red or green, we can invent a metallic apple by simply assuming that both the stem and body are metallic, as we have no idea what it really would look like. By U_i(·) we denote a linear "style" transformation that sets a constant value in the range [0, 1] to all unused attributes of the same type and either activates or deactivates it. We finally have our fully augmented set for training g:

((γ(T_i(U_i(α_λ))))^(i), (T_i(U_i(α_λ)))^(i)).   (18)

A single example is sufficient for the above training regime to begin augmentation by translating, rotating and resizing the object (T_i), as well as trying out different styles (U_i).

New Attributes and Re-Training. For semantic capsules, the set of attributes is not static and can grow. We differentiate between adding attributes due to inheritance and due to a trigger from the meta-learning agent.

Inheritance occurs automatically when one of the input capsules is extended by an attribute that is unknown to the current capsule. We simply expand the capsule's internals by said attribute and retrain it. This inheritance propagates down the network, forcing subsequent capsules to inherit them as well.

The more interesting case arises when the capsule is triggered by meta-learning to expand its attributes by adding α_Ω^new. First, we expand every attribute vector in memory for this capsule by the new attribute, but set to α_Ω^new = 0. Next, the internal attribute vector of the capsule and its γ and g functions are extended. We begin by replacing g with a new regression model of increased input width. The problem here is that we require γ to train it, which at this point has not been extended yet. Instead we split α and γ into two parts, one containing the newly added attribute α_Ω^new = γ^new(α_λ) as an output and one with the previous attributes as output α_Ω^old = γ^old(α_λ):

γ(α_λ) = (γ^old(α_λ) ⊕ γ^new(α_λ)),   (19)

where by a ⊕ b we mean the concatenation of two vectors. At this point γ^old, α_Ω^new and α_λ are known and this suffices to start training g using

((γ^old(T_i(U_i(α_λ))) ⊕ α_Ω^new)^(i), (T_i(U_i(α_λ)))^(i)).   (20)

Finally we need to determine γ^new. We add a regression model with one output that runs in parallel to γ^old and train it using the new decoder g:

((g(α_Ω^old ⊕ α_Ω^new))^(i), (α_Ω^new)^(i)).   (21)

It is far too difficult to define the entire grammar with all rules and constraints from scratch to generate a complete capsule network.
Instead, our approach works bottom-up and we only define the terminal symbols (primitive capsules), letting the meta-learning agent learn all semantic capsules and routes. This means that our initial set of primitive capsules limits what the network can eventually analyze and learn. Ideally, we would define primitive capsules for the most basic set of primitives from which we are able to construct every kind of object. We can, however, refine this set later on. A grammar with [square] as terminal symbol produces the same results, even if it is refined by [edge] terminal symbols with the rule [square] → [edge] [edge] [edge] [edge]. For the draw functions, we rely on the current state of computer graphics. Here we have access to a near endless supply of parameterizable primitives [18,33] and graphics pipelines capable of physically-based rendering [17]. By Equation 4, if we can render, we can de-render it to some set of valid attributes.

We postulated above that there always exists a higher-level grammar with an axiom that includes all the symbols of the observed grammars. We go a step further and view the capsule network as incomplete if there is more than one observed axiom (cf. Figure 4). There are four possible causes for this:

A.1 A non-activated parent capsule is lacking a route. "What existing symbol best describes these parts?"
A.2 A parent capsule is missing. "What new symbol best describes these parts?"
B.1 An attribute is lacking training data. "What existing attribute best describes this style or pose?"
B.2 An attribute is missing. "What new attribute best describes this style or pose?"
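Choosing among these four causes via the decision matrix of Table 1 can be sketched as follows: sum the rows of all features that fire on the current evidence and take the column with the maximum total, updating the counts from oracle answers during training. The feature set here is a small illustrative subset.

```python
import numpy as np

CAUSES = ["A.1", "A.2", "B.1", "B.2"]

class DecisionMatrix:
    def __init__(self, features):
        self.features = features                       # name -> predicate on evidence
        self.counts = {f: np.zeros(4) for f in features}

    def decide(self, evidence):
        """Sum the rows of active features; argmax column is the chosen cause."""
        active = [f for f, pred in self.features.items() if pred(evidence)]
        totals = sum((self.counts[f] for f in active), np.zeros(4))
        return CAUSES[int(np.argmax(totals))]

    def learn(self, evidence, oracle_cause):
        """Record one oracle decision for every feature that was active."""
        for f, pred in self.features.items():
            if pred(evidence):
                self.counts[f][CAUSES.index(oracle_cause)] += 1

dm = DecisionMatrix({
    "same_parent": lambda e: e["same_parent"],   # observed axioms share a parent
    "tracked": lambda e: e["tracked"],           # parts tracked from previous scenes
})
dm.learn({"same_parent": True, "tracked": False}, "B.1")
dm.learn({"same_parent": True, "tracked": False}, "B.1")
dm.learn({"same_parent": False, "tracked": True}, "A.2")
```

Once enough counts accumulate, the oracle can be removed and the matrix keeps making human-like decisions on its own.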
We may remedy these causes using one of two methods: either triggering the creation of a new route in an existing or new capsule ((A.1), (A.2)) or triggering the training of an existing or new attribute ((B.1), (B.2)).

However, deciding which of the four causes is responsible for the multiple observed axioms in the current forward pass is subjective even for humans. For example, consider a capsule network that has [leg], [panel] and [chair] capsules. It encounters a new scene and the observed parse-tree contains four [leg] activations and one [panel] activation. [chair], however, did not activate, triggering the meta-learning pipeline due to multiple observed axioms. Is this just a [chair] with a previously unknown style (B.2)? Or is this a completely new capsule such as [stool] (A.2)?

We, thus, introduce a decision matrix (cf. Table 1). Akin to child development, we train this matrix by querying an oracle in the early stages of the capsule network's training process and update the entries. Decisions are made by summing up all rows of features that evaluate to true and finding the column with maximum value. We may remove the oracle at any point in time, as it does not impair the learning capability of the network itself, only the ability of the agent to make human-like decisions and learn the correct names.

Feature | A.1 | A.2 | B.1 | B.2
Observed axioms have same Ω as parent | 4 | 3 | 14 | 12
Observed axioms don't have same Ω as parent | 5 | 19 | 1 | 0
Parts tracked from previous scenes | 14 | 1 | 17 | 12
Ω: Z(α, α̃) indicates one attribute mismatch with no entry in memory α_i > ε | | | |
Ω: Z(α, α̃) indicates attribute mismatch for (position, rotation, size) only | 4 | 3 | 13 | 10
Ω: Z(α, α̃) indicates attribute mismatch for more than half of all attributes | 12 | 14 | 4 | 4
...

Table 1. Example of a trained decision matrix with an excerpt of features derived from the observed parse-trees and what cause they indicate (number of past oracle decisions). Here Ω is the capsule with the highest p_Ω that did not activate.

Lexical Interpretation.
We interpret our grammar lexically. It is easy to see that each symbol represents a compound noun ([chair] or [dining-room]). For attributes, this analysis is more involved. Note that we have three attributes which we treated differently in Equations 15-17: α^rot, α^size and α^pos. We interpret these as prepositions. This becomes obvious once we have multiple objects in a scene and are able to refer to their spatial relationship using words such as "on" or "near", based purely on these attributes.

For static scenes, we interpret all remaining attributes as adjectives, such as "wooden" or "red". Their magnitude is then related to adverbs, such as "very". However, in dynamic scenes, some attributes of an object change over time and describe new poses for the parts. Thus, we interpret these time-dependent attributes as verbs, such as "walk". Their value is equivalent to the normalized time evolution of an animation.

These interpretations are both interesting semantically, as well as for querying the oracle. Instead of presenting a choice between (A.1) - (B.2) and some values, an actual question can be formed from the activated features (cf. Table 1). Consider a capsule network that has thus far only seen a modern [chair], made out of a blend of metal and wood. We now show it a chair made of the same parts, but with less metal and in a classical design. Even though the capsule was trained with basic style transformations U, the design is still too complex to grasp. Instead, meta-learning is triggered by cause (B.2), because of a mismatch of attributes "metallic", "wooden" and "modern" in Z(α, α̃). As we have access to a lexical interpretation, we can make these abstract pieces of information easier to understand for a human oracle, by letting the meta-learning agent pose an actual question: "This object looks similar to a modern chair, but is very wooden instead. What adjective best describes this style?".
The answer "classical" then adds a new attribute to the capsule.

Implementation.
We implement the renderer g of primitive capsules using signed distance fields [18] and their encoder γ using an AlexNet-like convolutional neural network [6] for regression. For semantic capsules we use Equations 15-17 for γ and a 4-layer deep dense regression neural network with tanh activation functions for g. The training data is generated synthetically using the process described in this paper and all hyperparameters, such as learning rate, are fine-tuned by hand. Our implementation, called VividNet, is found on GitHub at https://github.com/Kayzaks/VividNet.

Results.
In the initial phase, our capsule network has three primitive capsules: [square], [triangle] and [circle]. We begin by showing it an image of a spaceship (LHS of Figure 5), upon which it detects all the relevant graphical primitives, such as three triangles, one circle and one square, but has no semantic understanding of their relation. As this constellation leads to five activated capsules with no common parent, the meta-learning agent is called into action. In this case, it is obvious that (A.2) is triggered, as there are no semantic capsules yet. The exact split, however, is subjective and up to the oracle. We may treat these primitives as one space-ship (top row of Figure 5) or group them into two independent parts, booster and shuttle, which make up the space-ship (bottom row of Figure 5). In either case, the capsule network is extended by new capsules and trained using only this one example.

Now, the capsule network is shown a new scene, which includes an asteroid made up of three circles. The routes of the [ship], [booster] and [shuttle] capsules find no agreement, as none of them have three circles as their parts. Again, the meta-learning agent queries the oracle, which concludes that a new [asteroid] capsule is required (A.2). The asteroids, however, vary quite a bit. In a new scene with a different asteroid, three circles are detected. This time, however, a parent capsule ([asteroid]) does exist that admits all three circles as its parts, but due to the different configuration did not activate. This leads to a different set of activated features in our decision matrix and we find, after querying the oracle, that the [asteroid] is merely missing a route (A.1).
Alternatively, the agent could have concluded that the capsule is missing a style attribute (B.2). In Figure 5 we show two of the many possible timelines the training process could have taken, depending on the choice of features in the decision matrix, as well as the response by the oracle during the meta-learning process. It was sufficient to show the capsule network one spaceship (or its parts) and a few asteroids to construct the entire network and correctly identify these objects and all their attributes in a new scene.
Fig. 5.
Two of many possible capsule network configurations the meta-learning agent might end up with, depending on the oracle and decision matrix.
Comparison.
Our approach differs too much from current classification methods to make a direct numeric comparison. The neural-symbolic capsule network can only express confidence, but has no notion of accuracy, as any inaccuracy is remedied by the meta-learning pipeline and its oracle. This does not mean it has perfect accuracy, but rather that it continues to learn forever. Further, the initial choice of primitive capsules is very important for the overall performance of the network. Any comparison would, thus, need to fixate the capsule network in a subjective configuration, eliminating the benefit of lifelong meta-learning.
In this work we showed the internal workings of our neural-symbolic capsule network and how it extends itself through lifelong meta-learning. The proposed network is bi-directional: feed-forward (i.e., the capsule network) it is a pattern recognition algorithm and feed-backward (i.e., the generative grammar) it is a procedural graphics engine. The ability to render allows us to generate a segmentation mask for all detected objects and their components. However, our reliance on rendering for primitive capsules as the underlying mechanism for inverse graphics also comes with the downside that we are limited by the current state of computer graphics for detection.

We also showed how the network is capable of learning to detect new objects using a few-shot approach and that the training process is very human-like. This allows it to grow indefinitely with less training data, but requires the presence of an oracle to provide nouns, adjectives or verbs for new discoveries, replacing the large amounts of hand-labeled data found in the classical approach.

We believe that by next focusing on video data as input and coupling the system with intuitive physics [1,3], we may extend the inverse-graphics capabilities to inverse-simulation.
References
1. Battaglia, P., Pascanu, R., Lai, M., Rezende, D.J., Kavukcuoglu, K.: Interaction networks for learning about objects, relations and physics. NIPS (2016)
2. Battaglia, P.W., Hamrick, J.B., Tenenbaum, J.B.: Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences (45), 18327-18332 (2013)
3. Hamrick, J.B., Ballard, A.J., Pascanu, R., Vinyals, O., Heess, N., Battaglia, P.W.: Metacontrol for adaptive imagination-based optimization. ICLR (2017)
4. Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. International Conference on Artificial Neural Networks, pp. 44-51 (2011)
5. Hinton, G.E., Sabour, S., Frosst, N.: Matrix capsules with EM routing. ICLR (2018)
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. NIPS, pp. 1097-1105 (2012)
7. Kulkarni, T.D., Whitney, W.F., Kohli, P., Tenenbaum, J.B.: Deep convolutional inverse graphics network. NIPS (2015)
8. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (11), 2278-2324 (1998)
9. Lenssen, J.E., Fey, M., Libuschewski, P.: Group equivariant capsule networks. NIPS (2018)
10. Lipton, Z.C.: The mythos of model interpretability. CoRR abs/1606.03490 (2017)
11. Liu, Y., Wu, Z., Ritchie, D., Freeman, W.T., Tenenbaum, J.B., Wu, J.: Learning to describe scenes with programs. ICLR (2019)
12. Liu, Z., Freeman, W.T., Tenenbaum, J.B., Wu, J.: Physical primitive decomposition. ECCV (2018)
13. Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. CVPR, pp. 5188-5196 (2015)
14. Mao, J., Gan, C., Kohli, P., Tenenbaum, J.B., Wu, J.: The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. ICLR (2019)
15. Martinovic, A., Gool, L.V.: Bayesian grammar learning for inverse procedural modeling. CVPR (2013)
16.
Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 1-15 (2018)
17. Pharr, M., Humphreys, G., Jakob, W.: Physically Based Rendering, 3rd Edition. Morgan Kaufmann (2016)
18. Quílez, I.: Rendering signed distance fields (2017),
19. Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?" Explaining the predictions of any classifier. Knowledge Discovery and Data Mining, pp. 1135-1144 (2016)
20. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. NIPS (2017)
21. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034 (2014)
22. Team, G.E.: Godot engine (2019), https://godotengine.org
23. Tian, Y., Luo, A., Sun, X., Ellis, K., Freeman, W.T., Tenenbaum, J.B., Wu, J.: Learning to infer and execute 3d shape programs. ICLR (2019)
24. Towell, G.G., Shavlik, J.W.: Extracting refined rules from knowledge-based neural networks. Machine Learning (1), 71-101 (1993)
25. Towell, G.G., Shavlik, J.W.: Knowledge-based artificial neural networks. Artificial Intelligence (1), 119-165 (1994)
26. Tulsiani, S., Su, H., Guibas, L.J., Efros, A.A., Malik, J.: Learning shape abstractions by assembling volumetric primitives. CVPR (2017)
27. Ullman, T.D., Spelke, E., Battaglia, P., Tenenbaum, J.B.: Mind games: Game engines as an architecture for intuitive physics. Trends in Cognitive Sciences 21