Concepts, Properties and an Approach for Compositional Generalization
Yuanpeng Li
Abstract
Compositional generalization is the capacity to recognize and imagine a large number of novel combinations from known components. It is a key aspect of human intelligence, but current neural networks generally lack such ability. This report connects a series of our work on compositional generalization and summarizes an approach. The first part covers concepts and properties. The second part looks into a machine learning approach. The approach uses architecture design and regularization to regulate the information of representations. This report focuses on basic ideas with intuitive and illustrative explanations. We hope this work is helpful in clarifying the fundamentals of compositional generalization and advancing artificial intelligence.
1 Introduction

Humans leverage compositional generalization to recombine familiar concepts to understand and create new things. We have used this ability since early civilization. For example, the Sphinx has the face of a human and the body of a lion (Figure 1a). No such living animal exists, but ancient people could create it and we can recognize it. This shows we are able to recombine different parts of seen objects into an unseen object. The Sphinx also has the wings of an eagle, and the type of wings can be another part to combine. This means the number of combinations grows exponentially with the number of parts. Compositional generalization therefore helps humans learn efficiently from little training data and generalize to many unseen combinations. We hope machines can have this ability as well.

Different compositional generalization approaches have been investigated, such as architecture design [1, 2], independence assumption [3, 4], data augmentation [5, 6], causality [7, 8], reinforcement learning [9], group theory [10] and meta-learning [11]. There are also general discussions [12, 13]. Compositional generalization has been applied in many areas, such as instruction learning [11], grounding [14], continual learning [15], question answering [16, 17, 18], reasoning [19], zero-shot learning [20] and language inference [21]. In this report, we focus on summarizing a series of our work on theoretical discussions [22, 23, 24]. Please refer to these papers for concrete examples [25, 26] and applications [27]. Please also find broader related work in those papers.

In this report, we first discuss concepts and properties related to compositional generalization. Based on them, we clarify the setting in our scope (Section 3.1). We then propose an approach covering architecture design, training and inference. We also share conjectures that partially explain some human behaviors, such as system 1 and system 2 cognition. This report has three main key points. First, what is compositional generalization? Second, what is the conditional independence property, and how does it help compositional generalization? Third, how can we control random variable information, and how does it enable the conditional independence property? We explain them in the following sections and summarize them in the conclusion.
2 Concepts and properties

In this section, we first introduce compositional generalization and disentangled representation. We then discuss two questions about disentangled representation: subjectivity of components and the conditional independence property.

Figure 1: Recombining parts of bodies to create an imagination. (a) Sphinx. (b) Centaur.
2.1 Types of generalization

We compare different types of generalization to describe compositional generalization. The images in Figure 2 only contain input distributions, for explanation purposes. Conventionally, much machine learning research assumes that training and test distributions are identical (Figure 2 left). A main problem of conventional generalization is therefore to learn a model working on the correct underlying distribution and to use it in the test. In this case, the smoothness assumption is important for learning the model, so there exist general purpose types of regularization, such as L2 regularization and dropout. Also, when we have more training data, we are likely to have better test performance.

Figure 2: Types of generalization. Gray areas are input distributions (manifolds).

Out-of-distribution (o.o.d.) generalization [12], however, has different training and test distributions (Figure 2 middle). We focus on the part of the test distribution manifold not in the training manifold, because the overlapping part is similar to conventional generalization. The difference between the distributions needs both training and test distributions to define, so the training distribution alone carries no information about the difference. This means the distribution difference can only be given as prior knowledge during training, which is not general but specific to the test distribution. So, in this case, more training data or general regularization does not directly help learning the distribution difference. We are familiar with arguments in conventional generalization, but some of them may not directly apply to o.o.d. generalization.

Compositional generalization, a.k.a. systematic generalization, is a type of o.o.d. generalization. It has multiple components, and the generalization requires recombining values of different components in a novel way. The values in each component appear in the training. In Figure 2 (right), a test sample is not in the training distribution, but when we decompose it into horizontal and vertical directions, the values of each component are in the training distribution, and we can combine these values for the test sample. However, the components might be mixed together, and it is not straightforward to separate them. This means we do not know the horizontal and the vertical directions in such cases. When the representation has these orthogonal directions, we say it is a disentangled representation.

2.2 Disentangled representation

A disentangled representation [28] contains several separate component representations. Each component representation corresponds to an underlying component, or generative factor. When a representation is not disentangled, it is an entangled representation. In the examples in Figure 3, we suppose we know that the components are color and shape. The upper images are entangled representations, where color and shape are in the same image. The lower vectors are disentangled representations, where the left vector is for color and the right vector is for shape.

Figure 3: Entangled (upper) and disentangled (lower) representations.

Then we have several questions. Where do the types of components come from? In this case, how do we know the components are color and shape? Another question is what the relation is between two representations, such as between the entangled and the disentangled representations.

2.3 Subjectivity of components

We first discuss where the types of components come from. This part is still controversial and perhaps not straightforward to agree on, but we would like to share the idea.
The idea is that the components can be subjectively defined by humans. Sometimes they are a common agreement among humans. This also enables discussing different components in the same machine learning framework. We study a general way to encode humans' understanding of components into models.

Components can be subjective, but not arbitrary. They are defined according to how humans perceive the world. This means some components may be factors in real world physics, such as position and rotation. They influence human perceptions, but humans decide the components. In the example in Figure 1, we have the Sphinx and a Centaur. Though they are both created by compositional generalization, they have different components, one for the face, and the other for the upper body. Another example is color. We often use the primary colors red, green and blue as components for colors. However, their essential difference is the light wavelength. There are three colors because humans generally have three types of photopsin proteins in their eyes, each absorbing a primary color [29]. This means that if an animal or a machine has four types of proteins, then it may have four primary colors. So primary colors are neither completely objective nor completely subjective. They depend on the objective biological mechanism of humans.

These are examples of the subjectivity of components. Since machines are not humans, they do not know what subjective components humans have. This means we need a general way to encode humans' understanding into models as prior knowledge.

2.4 Conditional independence property
Let's look at the relation between representations. The key idea is the conditional independence property. We may first look at a question in Figure 4a. What is in the right hand, when we see there is a fork in the left hand? We do not know the exact answer, but the fork tells us something about it. We may guess the right hand has a knife or a spoon.
Figure 4: Conditional independence property. What is in the right hand? (a) Left only. (b) Both. (c) Right only.

Let's ask the question again when we also have an observation of the right hand (Figure 4b). Given the observation of a spoon, we can tell the right hand has a spoon. Then, let's hide the left hand (Figure 4c). In this case, the answer is the same: hiding the left hand does not influence the answer. This means the answer depends only on the observation of the right hand, though the left hand is related. In other words, given the observation of the right hand, the answer is conditionally independent of everything else. This property is called the conditional independence property [22].

We formalize this property and show how it helps compositional generalization. We consider two representations X = (X_1, ..., X_K) and Y = (Y_1, ..., Y_K). They both have K components, and each pair of components is aligned. The conditional independence property can be summarized as "Y_i depends only on X_i". This can be written in probability:

∀i: P(Y_i | X_1, ..., X_K, Y_1, ..., Y_{i-1}, Y_{i+1}, ..., Y_K) = P(Y_i | X_i).

We also formalize compositional generalization (Figure 2 right). We consider a particular test sample with values of X and Y. In the training, each component value of X_i appears, but the value of X does not appear. Note that this means the components are not marginally independent. When the value of X_i appears, the value of Y_i has a high probability. In the test, the value of X appears, and we hope the predicted conditional probability of Y given X is high. For example in Figure 3, a test sample can be a yellow heart, which does not appear in training. However, yellow appears in a yellow moon, and heart appears in a red heart in training. X is the image, and Y is the label pair.

In training: ∀i: P(X_i) > 0; P(X_1, ..., X_K) = 0; ∀i: P(Y_i | X_i) is high.
In test: P(X_1, ..., X_K) > 0; P(Y_1, ..., Y_K | X_1, ..., X_K) is predicted to be high.

The conditional independence property bridges training and test distributions. We first apply the chain rule, and then use the conditional independence property:

P(Y_1, ..., Y_K | X_1, ..., X_K) = ∏_{i=1}^{K} P(Y_i | X_1, ..., X_K, Y_1, ..., Y_{i-1}) = ∏_{i=1}^{K} P(Y_i | X_i).

When all P(Y_i | X_i) are high, their product is high, so P(Y_1, ..., Y_K | X_1, ..., X_K) is high. Therefore, a model satisfying the conditional independence property addresses compositional generalization.
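To make the factorization concrete, here is a toy numeric sketch in Python (ours, not from the cited papers). The conditional tables and probability values are hypothetical stand-ins for what a model would learn; the point is only that per-component conditionals estimated in training can score a combination that never appeared jointly.

```python
# Toy illustration of the conditional independence property with K = 2
# components (color, shape). The per-component conditionals P(Y_i | X_i)
# below are hypothetical values standing in for learned quantities.

p_color = {"yellow": {"yellow": 0.95, "red": 0.05},
           "red":    {"yellow": 0.05, "red": 0.95}}   # P(Y_1 | X_1)
p_shape = {"heart":  {"heart": 0.9, "moon": 0.1},
           "moon":   {"heart": 0.1, "moon": 0.9}}     # P(Y_2 | X_2)

def joint_conditional(x, y):
    """P(Y_1, Y_2 | X_1, X_2) under conditional independence:
    the chain rule collapses to a product of per-component terms."""
    return p_color[x[0]][y[0]] * p_shape[x[1]][y[1]]

# Suppose training contained (yellow, moon) and (red, heart); the combination
# (yellow, heart) never appeared jointly, yet its score is high because each
# per-component conditional is high.
print(joint_conditional(("yellow", "heart"), ("yellow", "heart")))  # ~0.855
```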
3 An approach for compositional generalization

In this section, we introduce our setting and describe an approach for compositional generalization. We mainly discuss how to encode the prior knowledge and enable the conditional independence property.

3.1 Settings
Figure 5: Problem setting. Both input X and output Y are entangled. We use a disentangled hidden representation H. The model has encoding and decoding modules.

We focus on a general setting for compositional generalization. We consider a problem with both entangled input X and entangled output Y, where components are aligned. For example, in language translation, both the input and output languages entangle grammar and lexicon. The input grammar decides the output grammar, and the input lexicon decides the output lexicon.

Compositional generalization requires recombining values of different components in a novel way. As we consider component types to be subjective (Section 2.3), it requires knowing what the types of components are, such as shape and color. There are different ways to add this prior knowledge. In some cases, the prior knowledge is in the design of the data structure (e.g., position in an image). Some approaches design the training data distribution to make the components statistically marginally independent. We attend to using the prior knowledge in model architecture design with particular regularization.

We focus on using disentangled representation, because it is conceptually straightforward for compositional generalization. We do not assume statistical independence between components or use annotations on components. In such a setting, we have encoding and decoding modules (Figure 5). The encoder converts the entangled input X to a disentangled hidden representation H, and the decoder converts H to the entangled output Y. We can set Y = X for unsupervised representation learning.

3.2 Controlling random variable information

We hope to enable the conditional independence property (Section 2.4) by encoding prior knowledge for components. This means we expect a component representation H_i to have exactly the information of the corresponding component. For example, we hope a component representation contains the color information. This requires controlling the information of a random variable (the component representation). To achieve it, we hope to design a loss function that has its minimum value when the information is as expected. So we study the relation between optimization loss and the entropy of a component representation. Note that entropy measures the amount of information, not its contents, but we use entropy for intuitive explanation. Also note that we consider a (multi-dimensional) representation as a random variable. We discuss the distribution of this random variable, and its entropy, over all the samples in a dataset. This means that for one dataset, we have only one distribution for the component representation and only one entropy for that distribution.

The strategy is to design a convex loss with the minimum at the target entropy (Figure 6c; note that the horizontal axis is entropy instead of parameters). This requires techniques to increase entropy, decrease entropy, and enable local turning of the loss at the target entropy (locality). We look at related techniques in machine learning (Table 1). Prediction loss increases entropy, because when we train a model to make correct predictions, an intermediate representation should contain more information to do so. While increasing entropy, we can encode the local turning point by architecture design, as we will discuss in Section 3.3. Regularization can reduce entropy, as we discuss in Section 3.4.
However, it is not clear how to encode locality while reducing entropy.

Table 1: Machine learning techniques.

              Increase entropy       Decrease entropy
    Loss      Prediction loss        Regularization
    Locality  Architecture design    Not clear

With the above availability, we design two losses. For the loss to increase entropy (Figure 6a), we use prediction loss and architecture design. The loss decreases rapidly as entropy increases while below the target value, and is constant after that. For the loss to decrease entropy (Figure 6b), we use regularization. The loss increases steadily as entropy increases. Together they form the expected curve (Figure 6c). This approach has two advantages. First, the target position is encoded in only one loss. Second, it does not need specific values for the losses.

Figure 6: Loss and entropy. Horizontal axis is component entropy H(H_i). Vertical axis is loss. (a) Increase entropy. (b) Decrease entropy. (c) Combined.

3.3 Architecture design

Figure 7: Architecture design. Avoiding connections from H_j (j ≠ i) to Ŷ_i, so only H_i can influence Ŷ_i. The prediction loss makes H_i at least contain the information of Y_i.

We first discuss how to encode locality when increasing information. In Figure 6a, the loss decreases when entropy is not enough, and the loss is constant when entropy is enough. This means we hope to make a component representation have at least certain information. We achieve this by architecture design combined with prediction loss.

When each output component Ŷ_i is connected only to the corresponding hidden component representation H_i, the information of Ŷ_i can come only from this hidden component representation (Figure 7). Note that this also means H_i is connected forward only to Ŷ_i. We consider that if the output Ŷ is correct, all of its components Ŷ_i should be correct. Since this component needs to be correct when reducing prediction loss, the hidden component representation should contain at least the information of the component. Please also refer to Appendix A for extended discussions.

This is the way we encode component prior knowledge in the architecture design, i.e., we describe how it generates the output. Note that this generating process might be different from the real generating process. For example, an apple has generative factors of shape, size and color. For a machine, a generating process can first choose a shape, then adjust the size and paint the color. However, a real apple grows with these three components changing together.

This technique works for the decoding process, because the output needs to be compared with the ground truth to compute the loss. Also, for humans, describing the decoding process is easier than describing the encoding process. For example, computer graphics is easier than computer vision. Computer graphics, in many cases, does not need machine learning, such as when developing 3D games. However, computer vision is hard without machine learning.

Figure 8: Example of the architecture design effect. (a) Hidden components. (b) Entangled outputs. The upper row is for the shape component, and the lower row is for the color component. One output component changes its value only when the corresponding hidden component representation changes its value. To make the output correct, each hidden representation should at least contain the information for the component.

Let's look at an example in Figure 8. There are two component representations, one for shape and the other for color.
We can design an architecture to achieve the following effects. The output shape changes only when the first component representation changes its value. The output color changes only when the second component representation changes its value. With such a design, to produce correct output, the first component representation should at least contain the shape information, and the second component representation should at least contain the color information, because the other representation is not able to provide the information. Please refer to [24] for more analysis.
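As an illustration of this wiring, here is a minimal PyTorch sketch (our own toy construction, not the exact architecture analyzed in [24]). The module names, dimensions, and the two-component setup are hypothetical; what matters is that each prediction head reads exactly one hidden component, so no path exists from H_j (j ≠ i) to Ŷ_i.

```python
import torch
import torch.nn as nn

class ComponentWiseDecoder(nn.Module):
    """Decoder where output component i is connected only to H_i.

    Hypothetical toy setup: hidden components H_1 (shape) and H_2 (color)
    are separate tensors, and each head sees exactly one of them, so making
    predictions correct forces H_i to carry at least Y_i's information.
    """

    def __init__(self, dim_h: int, n_shapes: int, n_colors: int):
        super().__init__()
        self.shape_head = nn.Linear(dim_h, n_shapes)  # reads H_1 only
        self.color_head = nn.Linear(dim_h, n_colors)  # reads H_2 only

    def forward(self, h1: torch.Tensor, h2: torch.Tensor):
        # No cross connections: gradients of the shape loss never reach h2,
        # and gradients of the color loss never reach h1.
        return self.shape_head(h1), self.color_head(h2)

decoder = ComponentWiseDecoder(dim_h=16, n_shapes=3, n_colors=3)
h1, h2 = torch.randn(4, 16), torch.randn(4, 16)
shape_logits, color_logits = decoder(h1, h2)  # shapes: (4, 3) and (4, 3)
```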
3.4 Entropy regularization

Figure 9: Entropy regularization. Green is for noise. Orange is for norm regularization.

We now talk about reducing the information of a component representation. This does not require component specific prior knowledge. The entropy of a random variable can be roughly understood as the number of its possible values.

Entropy regularization [25] aims at reducing the entropy of a component representation. Given a representation x, we compute its L2 norm and add normal noise to each element of the representation. This decreases the channel capacity, so the entropy of the representation reduces. We then feed the noised representation to the next layer, and add the norm to the loss function:

L = L_original + λ L_2(x),  EntReg(x) = x + α N(0, I),

where α is the weight of the noise, positive for training and zero for inference, and λ is a coefficient.

Please see Figure 9 for an intuitive illustration. The noise pushes different values far from each other in the vector space: if they are close, the noise makes them indistinguishable, so the prediction would be wrong. At the same time, the norm regularization pulls different values close to each other to reduce the region of the manifold. These two forces squash the values, so unnecessary values are merged. With fewer possible values, the entropy reduces. Please also refer to Appendix B.

We have discussed two losses to increase and decrease entropy. However, during the optimization of neural networks, there are other influences acting like losses, and we treat them as losses for simple explanation. These influences come from stochastic gradient descent, a widely used optimization algorithm with many variations, and the following arguments apply to them.

One loss is from stochastic sampling. It reduces entropy because it adds noise. The effect is similar to entropy regularization, but weaker, and it appears mainly during the later stage of training. Occasionally, it enables learning compositionality without entropy regularization. For more details, please refer to [30].

Another loss comes from gradient descent. It increases the entropy of a component representation. The optimization process imposes a bias toward non-compositional solutions, because gradient descent seeks the steepest direction, so it uses all available and redundant input information. This happens mainly during the early stage of training. This effect can be canceled by entropy regularization, which is why entropy regularization is important. Note that it is not prominent when there is only one solution, e.g., with a linear model. Please refer to [22] for more details.
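Returning to the regularizer of Section 3.4, here is a minimal sketch of one way to implement it in PyTorch (ours; please consult [25] for the exact formulation). The hyperparameter values are placeholders.

```python
import torch
import torch.nn as nn

class EntropyRegularizer(nn.Module):
    """Sketch of EntReg(x) = x + alpha * N(0, I) with an L2 norm penalty.

    Noise (training only) limits channel capacity, so nearby codes must
    spread apart to stay distinguishable, while the norm term pulls codes
    together; jointly they merge unnecessary values and reduce entropy.
    """

    def __init__(self, alpha: float = 0.1):
        super().__init__()
        self.alpha = alpha
        self.penalty = torch.tensor(0.0)  # read by the training loop

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.penalty = x.norm(p=2, dim=-1).mean()  # L_2(x), added to the loss
        if self.training:
            return x + self.alpha * torch.randn_like(x)
        return x  # alpha is zero at inference

# In the training loop (lam is the coefficient lambda from the text):
#   h_i = ent_reg(h_i)
#   loss = prediction_loss + lam * ent_reg.penalty
```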
Let's summarize the four losses during optimization. Please see Figure 10 for intuitions. The first loss is from prediction loss with architecture design. It decreases rapidly when entropy is small, and is constant after the entropy is above the target value. The second loss is from stochastic sampling [30]. It exists naturally but is weak. The third loss is from gradient descent [22]. The fourth loss is from entropy regularization [25], and it counteracts the effect from gradient descent. This loss should be less steep than the prediction loss, so that the summed loss has its lowest point near the expected value.

Figure 10: Four losses. Horizontal axis is entropy H(H_i) for a component representation on the training distribution. Vertical axis is loss L. (a) Prediction loss (blue). (b) Stochastic sampling (red). (c) Gradient descent (cyan). (d) Entropy regularization (orange). In each pair of figures, the left is the individual loss, and the right is the summed loss (green). The summed loss of all influences has its minimum close to the expected point.
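As a purely illustrative toy (not from the cited papers), we can model the four influences as simple functions of component entropy and check that their sum bottoms out near the target. All functional forms and constants below are invented for the picture; only the signs of the slopes and their relative steepness follow the text.

```python
import numpy as np

target = 3.0                     # target component entropy (made-up units)
h = np.linspace(0.0, 6.0, 601)   # candidate entropy values

prediction = np.where(h < target, 5.0 * (target - h), 0.0)  # steep, then flat
sampling   = 0.1 * h             # stochastic sampling: weak entropy decrease
gradient   = -0.5 * h            # gradient descent: pushes entropy up
ent_reg    = 1.0 * h             # entropy regularization: counteracts it,
                                 # but less steep than the prediction loss

total = prediction + sampling + gradient + ent_reg
print("minimum near target:", h[np.argmin(total)])  # prints 3.0
```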
4 Inference

In this section, we look at problems during inference. We also discuss conjectures for human behaviors (Appendix C). We then analyze why language tasks are less likely to suffer from the problems in inference (Appendix D).

So far, we have discussed learning compositionality during training. However, our goal is compositional generalization, and we hope for high performance in the test. So a question is whether the model still works on the test distribution. We have both encoding and decoding parts (Figure 5).

Decoding still works if encoding is correct. This is because of the architecture design, where only the corresponding component representation produces a component in the output. Since the component representation is correct, and it is in the same manifold as in training by the definition of compositional generalization, the network produces a correct output component.

However, the encoding part may not work on the test distribution. It extracts the disentangled representation from the entangled input representation. This extraction network can be a general network, and we have no special treatment for it. By the definition of compositional generalization, the input manifold changes, and a general network does not work well in such cases. Therefore, the encoding network may not produce a correct disentangled representation. Please also refer to [23].
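The remedy developed in the next paragraphs converts encoding into decoding plus test-time search: optimize the hidden representation so a trained decoder reconstructs the input, while keeping each component near its training manifold. As a preview, here is a minimal sketch of that inference loop (our own toy rendering of Figure 11; the module interfaces g and h and the stored-hiddens manifold penalty are hypothetical).

```python
import torch

def infer_hidden(x, g, h, train_hiddens, steps=100, lr=0.05, mu=0.1):
    """Sketch of optimization-based inference (toy; cf. Figure 11).

    g: encoder giving initial hidden component representations for input x.
    h: additional decoder reconstructing x from the hidden representation.
    train_hiddens: stored training hiddens per component, used to keep each
    test H_i near its training manifold via a nearest-sample penalty.
    All interfaces here are assumptions for illustration.
    """
    hid = [hi.detach().clone().requires_grad_(True) for hi in g(x)]
    opt = torch.optim.Adam(hid, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = h(hid)                              # reconstruct the input
        loss = (recon - x).pow(2).mean()            # reconstruction term
        for hi, bank in zip(hid, train_hiddens):    # manifold term: distance
            d = (bank - hi).pow(2).sum(dim=-1)      # to stored training
            loss = loss + mu * d.min()              # hiddens per component
        loss.backward()
        opt.step()
    return [hi.detach() for hi in hid]

# Afterwards, the original decoder gives the prediction:
#   y = f(infer_hidden(x, g, h, train_hiddens))
```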
One idea to address this problem is to convert the encoding problem into a decoding problem by reversing input and output and applying the architecture design to an additional decoding network h (Figure 11). Similar to the other decoding network f, h works when each component is in its training manifold. Since the input and the output are reversed, we cannot get the hidden representation H with a forward pass. So we use optimization to find the input H that best produces the output X.

To regularize each test H_i to lie in its training manifold, we may keep the manifold information and use it in the test. A straightforward way to keep this information is to store some training samples. They may be stored as input representations or hidden representations. In the test, we make each test H_i close to the corresponding training ones. The encoding network g provides the initial hidden representations.

In summary (Figure 11), we jointly train the three modules g, h, f in training. In inference, we first use the encoder g to get an initial hidden representation. Then we use the additional decoder h to optimize the hidden representation to reconstruct the input, with manifold regularization. We then use the original decoder f to convert the optimized hidden representation to the output.

Figure 11: Flowcharts of the proposed approach. X is input, Y is output, and H is the hidden representation. The architecture has three modules: g(X; φ), h(H; ψ), f(H; θ). (a) Training flowchart: the three modules are trained with end-to-end optimization. (b) Inference flowchart: (left) initial hidden representation extraction; (middle) optimization of hidden representations as module input; (right) output prediction.

5 Conclusion

This report introduces compositional generalization and an approach to it, with pointers to a series of corresponding papers. It has three key points. First, what is compositional generalization: it is out-of-distribution generalization that recombines seen component values in a novel way. Second, the conditional independence property: this is the core property of compositional generalization, meaning an output component depends only on the corresponding input component. The last point is controlling random variable information, which enables the conditional independence property. We achieve it by squeezing entropy from above and below. We hope this report helps in understanding compositional generalization and advancing artificial intelligence.
Acknowledgments
We thank Mohamed Elhoseiny, Liang Zhao, Wei Xu, Kenneth Church, Joel Hestness, Jianyu Wang, Yi Yang and Zhuoyuan Chen for helpful suggestions and discussions.
References

[1] Jake Russin, Jason Jo, and Randall C. O'Reilly. Compositional generalization in a deep seq2seq model by separating syntax and semantics. arXiv preprint arXiv:1904.09708, 2019.
[2] Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. Recurrent independent mechanisms. arXiv preprint arXiv:1909.10893, 2019.
[3] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017.
[4] Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
[5] Jacob Andreas. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566, Online, July 2020. Association for Computational Linguistics.
[6] Ekin Akyürek, Afra Feyza Akyürek, and Jacob Andreas. Learning to recombine and resample data for compositional generalization. arXiv preprint arXiv:2010.03706, 2020.
[7] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
[8] Yuanpeng Li et al. Efficiently disentangle causal representations. OpenReview, 2020. https://openreview.net/pdf?id=Sva-fwURywB.
[9] Qian Liu, Shengnan An, Jian-Guang Lou, Bei Chen, Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, and Dongmei Zhang. Compositional generalization by learning analytical expressions. arXiv preprint arXiv:2006.10627, 2020.
[10] Jonathan Gordon, David Lopez-Paz, Marco Baroni, and Diane Bouchacourt. Permutation equivariant models for compositional generalization in language. In International Conference on Learning Representations, 2020.
[11] Brenden M. Lake. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pages 9788–9798, 2019.
[12] Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
[13] Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition. arXiv preprint arXiv:2011.15091, 2020.
[14] Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M. Lake. A benchmark for systematic generalization in grounded language understanding. arXiv preprint arXiv:2003.05161, 2020.
[15] Xisen Jin, Junyi Du, and Xiang Ren. Visually grounded continual learning of compositional semantics. arXiv preprint arXiv:2005.00785, 2020.
[16] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[17] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[18] Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations, 2020.
[19] Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. Teaching pre-trained models to systematically reason over implicit knowledge. arXiv preprint arXiv:2006.06609, 2020.
[20] Tristan Sylvain, Linda Petrini, and Devon Hjelm. Locality and compositionality in zero-shot learning. In International Conference on Learning Representations, 2020.
[21] Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. Posing fair generalization tasks for natural language inference. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4485–4495, Hong Kong, China, November 2019. Association for Computational Linguistics.
[22] Yuanpeng Li et al. Gradient descent resists compositionality. OpenReview, 2020. https://openreview.net/pdf?id=VMAesov3dfU.
[23] Yuanpeng Li et al. Transferability of compositionality. OpenReview, 2020. https://openreview.net/pdf?id=GHCu1utcBvX.
[24] Yuanpeng Li. Necessary and sufficient conditions for compositional representations. OpenReview, 2020. https://openreview.net/pdf?id=r6I3EvB9eDO.
[25] Yuanpeng Li, Liang Zhao, Jianyu Wang, and Joel Hestness. Compositional generalization for primitive substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4284–4293, 2019.
[26] Yuanpeng Li. Grounded compositional generalization with environment interactions. OpenReview, 2020. https://openreview.net/pdf?id=b6BdrqTnFs7.
[27] Yuanpeng Li, Liang Zhao, Kenneth Church, and Mohamed Elhoseiny. Compositional language continual learning. In International Conference on Learning Representations, 2020.
[28] Yoshua Bengio. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pages 1–37. Springer, 2013.
[29] Samuel G. Solomon and Peter Lennie. The machinery of colour vision. Nature Reviews Neuroscience, 8(4):276–286, 2007.
[30] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[31] Daniel Kahneman. Thinking, Fast and Slow. Macmillan, 2011.
A Partial observation of output combinations
In Section 3.3, we discussed an architecture design that enables a component representation to contain at least the information of a component. One condition there is that when the prediction Ŷ equals the ground truth Y, all the component outputs are correct: Ŷ_i = Y_i. As an extended topic, in some complicated cases, this condition may not be met: the same output Y can ambiguously correspond to different combinations of component values if it discards a part of the information about the combinations.

However, even in such cases, the arguments still hold with disambiguation. Broadly speaking, how to disambiguate is another type of prior knowledge. For example, reducing the entropy of each component representation makes the ambiguity disappear in some tasks. Then the entropy regularization also performs disambiguation. This mechanism is used in [25], where a combination is a syntax tree together with the words on its nodes, but Y contains only the words.

B Entropy regularization in language learning
We would like to share a joke about the "law of entropy increase" in human language learning. The law of entropy increase originally says that in an isolated system, entropy increases over time. Here, a beginner learning a second language is likely to "overuse" compositional generalization and create unnatural phrases. As one becomes more fluent over time, the problem lessens (the phrases become less aggressively compositional). Since entropy reduction helps compositional generalization, this means entropy increases over time.
C Conjectures for system 1 and system 2 cognition
We would like to share some conjectures about system 1 and system 2 cognition [31]. System 1 is a fast and unconscious cognition process. System 2 is a slow and conscious cognition process. Figure 12 shows a borrowed example (https://youtu.be/4KpZBiKda0k). System 1 is driving on a familiar road: the driver is relaxed and can drive while chatting. System 2 is driving on an unfamiliar road: the driver needs to focus on driving.
Figure 12: Examples of human cognition. (a) System 1: fast. (b) System 2: slow.

In system 2, humans need more time and attention. What do we do with these resources? The conjecture is that we are doing optimization. More precisely, when the input is familiar, the encoding network works well, so the optimization is simple and fast (perhaps fewer optimization steps). When the input is unfamiliar, the encoding network does not provide a good initial hidden representation, so the optimization is difficult and slow.

Another conjecture is that our long-term memory is used for manifold regularization, which requires storing training samples.
D Inference for language
Figure 13: The word "composition" in a dictionary.