Visually Grounded Continual Learning of Compositional Phrases
Xisen Jin Junyi Du Xiang Ren
Department of Computer Science, University of Southern California
{xisenjin, junyidu, xiangren}@usc.edu

Abstract
Children's language acquisition from the visual world is a real-world example of continual learning from dynamic and evolving environments; yet we lack a realistic setup for studying neural networks' capability at human-like language acquisition. In this paper, we propose such a realistic setup by simulating children's language acquisition process. We formulate language acquisition as a masked language modeling task in which the model visits a stream of data with a continuously shifting distribution. Our training and evaluation encode two important challenges in human language learning, namely continual learning and compositionality. We show that the performance of existing continual learning algorithms is far from satisfactory. We also study the interaction between memory-based continual learning algorithms and compositional generalization, and conclude that overcoming overfitting and compositional overfitting may be crucial for good performance in our problem setup.

Our code and data can be found at https://github.com/INK-USC/VisCOLL

1 Introduction

Children's language acquisition from the visual world is a real-world example of learning complicated natural language processing tasks. Simulating children's language learning process with neural networks helps researchers understand the capability and the limits of neural networks in modeling complicated tasks (Surís et al., 2019), and inspires researchers to push those limits by addressing the issues found (Lu et al., 2018; Lake and Baroni, 2017).

However, no prior work encodes an important challenge for simulating language acquisition from the visual world: the ability to learn in an evolving environment, also known as continual learning.
[Figure 1: Visually grounded Continual Compositional Language Learning (VisCOLL). We highlight the noun phrase to be masked in each caption, e.g., "a hand holding a half eaten [MASK] [MASK]" for the caption "a hand holding a half eaten red apple". Given the image and the masked caption, the model is required to predict the masked noun phrase.]
While continual learning itself has been a popular topic for decades, these algorithms are usually studied in the context of simple image classification tasks. Such setups are far from real environments, where the end task can be much more complex, such as language acquisition. In this paper, we propose to incorporate continual learning into the framework of simulating children's language acquisition process.

Another challenge that we consider is compositionality in language. Compositionality allows atomic words to be combined under certain rules to represent complicated semantics (Zadrozny, 1992). For humans, compositionality is a demonstration of productivity that emerges as early as three years old (Pinker et al., 1987): toddlers may learn a nonsense stem, e.g., wug, to refer to an object; then, if there are two of them, they can report that by saying there are two wugs (Berko, 1958).

To test models' ability to learn compositionality, we formulate the language acquisition problem as a visually grounded masked language modeling task, which requires the model to predict multiple masked words; specifically, we expect the models to compose atomic words to generate novel compositions of words.
[Figure 2: Training and testing examples in our problem formulation. At training, the model visits a stream of image-caption pairs (e.g., "people and workers standing around boxes of apples", "a red pick up truck with a large blue object in its back"); we highlight the words that are masked for prediction. The distribution of the training data stream, identified by task labels such as Apples, Light, and Truck, changes continuously over time; see Figure 3 for an illustration of the continuous shift. At testing, the model is asked to predict either a seen composition of words or a novel composition of seen words.]

Technically, the task also introduces an exponentially large output space, which brings extra challenges for most continual learning algorithms. For example, memory-based continual learning algorithms, which identify and store important examples for each class in a fixed-size memory for future replay, can never expect to store an example for every word combination visited. This implies that learning to compositionally generalize, i.e., to identify atomic concepts in an example and combine them (Keysers et al., 2020), is crucial for performance. For example, after storing examples for "red cars", the model ideally does not further need to store examples for "red apples" to alleviate forgetting when predicting "red" in "red apples"; at the same time, we do not want the model to overfit the stored examples for "red cars" by predicting all cars as red, or by predicting only "red" for apples. However, no prior work studies this interaction between memory-based continual learning and compositional generalization.

In this paper, we propose the Visually grounded Continual cOmpositional Language Learning (VisCOLL) task, aiming to simulate children's language learning process. We create two datasets, namely COCO-shift and Flickr-shift, that encode the challenges of compositionality and continual learning for VisCOLL. We conduct systematic evaluations on the VisCOLL datasets to study the difficulties and characteristics of the task.
2 Related Work

In this section, we introduce related work on continual learning as well as compositional language learning.

Continual learning aims to alleviate catastrophic forgetting (Robins, 1995), i.e., significant performance degradation on earlier data when models are trained on a non-stationary data stream. Existing continual learning algorithms can be summarized into memory-based approaches (Lopez-Paz and Ranzato, 2017; Aljundi et al., 2019b), pseudo-replay based approaches (Shin et al., 2017), regularization based approaches (Kirkpatrick et al., 2017; Zenke et al., 2017; Nguyen et al., 2018), and architecture based approaches. Evaluation benchmarks are usually constructed manually from classification datasets, by "splitting" the training examples into several disjoint subsets by label, or by applying a fixed transformation to each subset of training examples, and letting the model visit these subsets one by one (Lopez-Paz and Ranzato, 2017). The most commonly used datasets are Split MNIST, Permuted MNIST (Kirkpatrick et al., 2017), and Split CIFAR (Rebuffi et al., 2017). However, the training and testing environments in these benchmarks are far from complicated real environments, where the end task is much more complex and the data stream is less structured (e.g., it has no strict task boundaries).

On the other hand, recent work in language learning tries to understand and explicitly model compositional semantics, i.e., the ability to compose the meanings of atomic words into higher-level meanings in neural networks, but outside the context of continual learning. Lake and Baroni (2017) study compositional generalization in language generation with synthetic instruction-following tasks. Yuan et al. (2019) study compositional language acquisition with text-based games. Some works further incorporate visual inputs for studying compositional language understanding and generation, taking visual navigation (Anderson et al., 2018), visual question answering (Bahdanau et al., 2019), or visually grounded masked word prediction (Surís et al., 2019) as end tasks.

Few works have applied compositional language learning as an end task for studying continual learning. Li et al. (2020) is closely related work that studies challenges in continual learning of sequential prediction tasks, but it focuses on synthetic instruction-following tasks; moreover, its analysis and its technique of separating semantics and syntax are restricted to cases where both inputs and outputs are text, and do not apply to visual inputs. Nguyen et al. (2019) study continual learning of image captioning, but they do not analyze the challenges of sequential prediction, and they still make strong assumptions about the structure of the data stream.
3 Problem Formulation

In this section, we introduce our problem formulation for Visually grounded Continual cOmpositional Language Learning (VisCOLL). Our formulation encodes two main challenges, namely compositionality and continual learning. We choose visually grounded masked language modeling as a proxy for evaluating models' capabilities in learning compositional semantics: it requires the model to describe complicated and unseen visual scenes by composing atomic words. We then construct a training environment in which the training data comes as a non-stationary stream without clear "task" boundaries, to simulate a realistic environment. Figure 2 illustrates the training and testing examples in our formulation. In the rest of the section, we introduce the details of our task setup.
Task Definition. We employ masked language modeling with visual inputs as the end task: training and testing examples consist of image-caption pairs $(x_{img}, x_{text})$, where a text span in $x_{text}$ is masked with MASK tokens and needs to be predicted by the model. The masked text span $x_{label}$ always includes a noun and optionally includes verbs or adjectives. To study whether the model learns compositionality in language, we define each noun, verb, and adjective as an atom, and study whether the model can predict both seen and novel compositions of nouns and verbs/adjectives. For example, we may test whether the model successfully predicts "red apples" (a combination of an adjective and a noun) when it has seen examples involving "red" and "apples" only separately.
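To make the data format concrete, here is a minimal sketch of a single training instance; the field names and the masking helper below are illustrative assumptions, not the released data format.

```python
# A minimal sketch of a VisCOLL training instance. The field names and the
# masking helper are illustrative assumptions, not the released data format.
from dataclasses import dataclass
from typing import Any, List, Tuple

MASK = "[MASK]"

@dataclass
class VisCOLLInstance:
    image: Any                 # x_img: raw image or precomputed features
    masked_caption: List[str]  # x_text with the target span masked
    label_span: List[str]      # x_label: the masked words to predict

def mask_span(image: Any, caption: List[str],
              span: Tuple[int, int]) -> VisCOLLInstance:
    """Replace the tokens in [start, end) with MASK tokens."""
    start, end = span
    masked = caption[:start] + [MASK] * (end - start) + caption[end:]
    return VisCOLLInstance(image, masked, caption[start:end])

# Example: mask the noun phrase "red apple".
caption = "a hand holding a half eaten red apple".split()
inst = mask_span(None, caption, (6, 8))
assert inst.masked_caption[-2:] == [MASK, MASK]
assert inst.label_span == ["red", "apple"]
```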
Continuously Shifting Data Distribution. Unlike traditional offline training setups, where the model is allowed to visit the training examples repeatedly over multiple passes, we study an online continual learning setup, where the training examples come as a non-stationary stream and are visited only for a single pass. Importantly, for a realistic simulation of the real-world scenario of what a child may see and learn from, we assume the data distribution changes gradually: for example, the model may see more "apples" in the beginning and fewer of them later. Unlike most prior continual learning benchmarks, we do not assume strict task boundaries, after which the model would never see any apples again. Formally, at each time step $t$, the model receives a small mini-batch of examples $\{(x^{0}_{img}, x^{0}_{text}, x^{0}_{label}), \ldots, (x^{B-1}_{img}, x^{B-1}_{text}, x^{B-1}_{label})\}$. The distribution $p(x_{img}, x_{text}, x_{label})$ is non-stationary, i.e., it changes over time. Note that our formulation rules out continual learning algorithms that make use of information about task boundaries. In Section 4 below, we introduce how we construct data streams that encode our challenges.

[Figure 3: Probability of the first 50 tasks at different time steps in the constructed stream on Flickr-shift. Each curve corresponds to a task; the x-axis shows the time step, and the y-axis shows the probability of the task.]

4 Dataset Construction

In this section, we introduce how we construct non-stationary data streams from the MS COCO (Lin et al., 2014) and Flickr30k (Plummer et al., 2015) datasets for our VisCOLL setup. We name our datasets COCO-shift and Flickr-shift, respectively.
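As an illustration of how a stream with a gradually shifting task distribution (cf. Figure 3) could be sampled, here is a hedged sketch; the Gaussian scheduling function is our own choice and not necessarily the procedure used to build COCO-shift and Flickr-shift.

```python
# Illustrative sketch of sampling a stream whose task distribution shifts
# gradually, as in Figure 3. The Gaussian schedule is our assumption; the
# paper only states that task shifts happen gradually.
import math
import random
from typing import Dict, List

def task_weight(task_idx: int, t: int, total_steps: int, n_tasks: int,
                width: float = 0.05) -> float:
    """Each task peaks at its own point in the stream and decays smoothly."""
    peak = (task_idx + 0.5) / n_tasks * total_steps
    return math.exp(-((t - peak) / (width * total_steps)) ** 2)

def sample_stream(examples_by_task: Dict[str, List], total_steps: int,
                  batch_size: int):
    tasks = sorted(examples_by_task)
    for t in range(total_steps):
        weights = [task_weight(i, t, total_steps, len(tasks))
                   for i in range(len(tasks))]
        chosen = random.choices(tasks, weights=weights, k=batch_size)
        # One small mini-batch per step; no task boundary is ever signaled.
        yield [random.choice(examples_by_task[k]) for k in chosen]

# Toy usage: task ids are lemmatized head nouns, as in our datasets.
toy = {"apple": ["ex_a1", "ex_a2"], "light": ["ex_l1"], "truck": ["ex_t1"]}
for step, batch in enumerate(sample_stream(toy, total_steps=3, batch_size=2)):
    print(step, batch)
```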
Both the COCO and Flickr datasets provide images associated with several captions. We use the part-of-speech (POS) tagger in the stanfordnlp package (https://stanfordnlp.github.io/stanfordnlp/) to perform POS tagging. Each training instance is an image-caption pair with a text span masked. In the Flickr dataset, we mask the noun phrase in each caption, which is information included in the dataset. In the COCO dataset, we identify text spans with a regular expression chunker (see the sketch after Table 1); each span always includes a noun, and optionally includes an adjective before it or a verb after it.

To construct a non-stationary data stream, we define a "task" as the lemmatized noun in the masked text span in the Flickr dataset. On the COCO dataset, we map the lemmatized nouns to the 80 provided object categories via a synonym table from Lu et al. (2018). Note that the "task" is only used as an identifier of the data distribution when constructing the dataset; the task identities are not revealed to the models, and we construct the data streams so that there are no clear task boundaries. Specifically, we construct the data streams so that task shifts happen gradually. Figure 3 illustrates the task distribution in our constructed data streams. Table 1 shows statistics of the datasets.

Dataset       COCO-shift   Flickr-shift
Training
Test
Tasks             80          1,000

Table 1: Statistics of the constructed data streams.
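For concreteness, the sketch below shows the kind of regular-expression chunker described above, using nltk's RegexpParser; the exact grammar is our assumption, as the paper does not publish it.

```python
# Hedged sketch of the regular-expression chunker: each span centers on a
# noun, with an optional adjective before it or a verb after it. The exact
# grammar used for COCO-shift is not published; this pattern is our guess.
import nltk

grammar = r"SPAN: {<JJ>?<NN.*>+<VB.*>?}"
chunker = nltk.RegexpParser(grammar)

tagged = [("a", "DT"), ("hand", "NN"), ("holding", "VBG"),
          ("a", "DT"), ("half", "RB"), ("eaten", "VBN"),
          ("red", "JJ"), ("apple", "NN")]
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "SPAN"):
    print(subtree.leaves())
# -> [('hand', 'NN'), ('holding', 'VBG')] and [('red', 'JJ'), ('apple', 'NN')]
```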
5 Experimental Setup

In this section, we introduce the models, continual learning algorithm baselines, and metrics for VisCOLL. We also propose metrics to address the following research questions: (1) whether existing continual learning algorithms effectively alleviate forgetting in our problem setup, and (2) how memory-based continual learning algorithms may influence compositional generalization.
5.1 Base Model

We modify VLBERT (Su et al., 2020; Surís et al., 2019) as our base model. We first encode the image with a ResNet-34 (He et al., 2015) to obtain an image embedding. We then feed the image embedding, together with the word embeddings of the masked caption, into a 4-layer Transformer with a hidden size of 384. The outputs of the Transformer at the masked positions are fed into a linear layer to produce the word predictions. We use a cross-entropy loss and the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 0.0002 throughout the experiments.
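Below is a hedged PyTorch sketch of this architecture; details that the text leaves unspecified, such as the image-feature projection and the positional embeddings, are our assumptions.

```python
# Hedged PyTorch sketch of the base model: ResNet-34 image features and word
# embeddings fed into a 4-layer Transformer with hidden size 384. Projection
# and positional-embedding details are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torchvision

class VisCOLLModel(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 384, layers: int = 4,
                 heads: int = 6, max_len: int = 64):
        super().__init__()
        resnet = torchvision.models.resnet34(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # (B,512,1,1)
        self.img_proj = nn.Linear(512, hidden)
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len + 1, hidden)  # +1 for the image slot
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); tokens: (B, T), with MASK ids at masked positions
        img = self.img_proj(self.encoder(image).flatten(1)).unsqueeze(1)
        seq = torch.cat([img, self.word_emb(tokens)], dim=1)
        seq = seq + self.pos_emb(torch.arange(seq.size(1), device=seq.device))
        return self.out(self.transformer(seq))[:, 1:]  # logits at text positions

model = VisCOLLModel(vocab_size=10000)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)
# Cross-entropy on masked positions only; unmasked targets are set to -100.
criterion = nn.CrossEntropyLoss(ignore_index=-100)
```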
5.2 Continual Learning Algorithms

We focus on memory-based continual learning algorithms, as most of them are scalable and naturally applicable to scenarios where no task identifiers or task boundaries are available. We use the Experience Replay (ER) (Robins, 1995; Rolnick et al., 2019) algorithm with reservoir sampling as a strong baseline (see the sketch after Table 2); the algorithm randomly stores visited examples in a fixed-size memory. We use memory sizes of 1,000, 10,000, and 100,000, which correspond to roughly 0.2%, 2%, and 20% of the data for the two datasets. In addition, we experiment with the recently proposed Experience Replay with Maximally Interfering Retrieval (MIR) (Aljundi et al., 2019a) algorithm with a memory size of 10,000. We also compare against the scenario where no continual learning algorithm is applied (denoted Vanilla Online), as well as the scenario where the underlying data stream is shuffled and visited for a single pass (denoted Single-pass Offline).

Dataset                         COCO-shift                        Flickr-shift
Method / Metric       PPL    Noun acc.  Verb acc.  Adj. acc.   PPL    Noun acc.  Adj. acc.
Vanilla Online       6.055     0.51      20.93      1.58      5.965     1.72       6.96
Single-pass Offline  1.923    51.55      47.00     25.11      2.978    26.44      14.70
ER, |M| = 1,000
ER, |M| = 10,000
ER, |M| = 100,000
MIR, |M| = 10,000

Table 2: Overall performance of methods on the MS COCO and Flickr30k data streams.
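To make the ER baseline concrete, here is a minimal sketch of experience replay with reservoir sampling in a single-pass online loop; the replay batch size and the way replayed examples are mixed into the update are assumptions.

```python
# Minimal sketch of Experience Replay (ER) with reservoir sampling in a
# single-pass online loop. The replay batch size and the mixing of replayed
# examples into the gradient step are assumptions, not the paper's settings.
import random

class ReservoirMemory:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0

    def add(self, example) -> None:
        """Keep each of the n visited examples with probability capacity/n."""
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k: int) -> list:
        return random.sample(self.data, min(k, len(self.data)))

def train_online(stream, update_fn, memory: ReservoirMemory, replay_k: int = 10):
    """One pass over the non-stationary stream, replaying stored examples."""
    for batch in stream:
        replayed = memory.sample(replay_k)
        update_fn(batch + replayed)  # one gradient step on current + replayed
        for example in batch:
            memory.add(example)
```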
5.3 Metrics

To address the first research question, whether existing continual learning algorithms are effective in our setting, we employ perplexity (PPL) as the main metric for the overall performance of training methods. Throughout the paper, we report perplexity on a log scale. We also evaluate the prediction accuracies of nouns, verbs, and adjectives separately. On the Flickr-shift dataset, we only report the accuracies of nouns and adjectives, as the masked phrases in the Flickr dataset are noun phrases.

To address the second research question, how replay memories influence compositional generalization, we start by proposing a measure of compositional overfitting. Given a reference set of compositions $S$, the compositional overfitting of an atomic word $w$ to the set $S$ is measured as the difference between the average perplexity of $w$ when it appears in a composition $(w, x)$ in the test set $D_{test}$ that does not exist in $S$, and its average perplexity when it appears in a composition in $D_{test}$ that does exist in $S$. Formally, the compositional overfitting is defined as

$$f_{ovt}(w, S) = \frac{1}{N_1} \sum_{(w,x) \in D_{test} \setminus S} PPL(w) \;-\; \frac{1}{N_2} \sum_{(w,x) \in D_{test} \cap S} PPL(w), \qquad (1)$$

where $N_1$ and $N_2$ are the numbers of test compositions in the two sums, respectively. We can thus compute the compositional overfitting of a word $w$ with respect to the replay memory $M$, denoted $f_{ovt}(w, M)$. A large $f_{ovt}(w, M)$ implies that the perplexity of $w$ is much larger when $w$ appears in compositions that are not stored in the replay memory. We also compute the compositional overfitting of $w$ with respect to the training set, denoted $f_{ovt}(w, D_{tr})$. We then compare $f_{ovt}(w, M)$ to $f_{ovt}(w, D_{tr})$ to evaluate whether the model is more inclined to overfit combinations stored in the memory than random examples in the training set. We denote the difference between the two as $\Delta f_{ovt}(w)$:

$$\Delta f_{ovt}(w) = f_{ovt}(w, M) - f_{ovt}(w, D_{tr}). \qquad (2)$$
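A hedged sketch of how $f_{ovt}(w, S)$ and $\Delta f_{ovt}(w)$ could be computed from per-word perplexities follows; the data structures are our own illustrative choices.

```python
# Illustrative computation of Eqs. (1) and (2). Representing a composition as
# a (word, partner) pair and the PPL table as a dict are our assumptions.
from typing import Dict, Set, Tuple

Comp = Tuple[str, str]  # e.g., ("red", "apple")

def f_ovt(word: str, reference: Set[Comp],
          test_ppl: Dict[Comp, float]) -> float:
    """Avg PPL of `word` in test compositions absent from `reference`,
    minus its avg PPL in test compositions present in `reference`."""
    novel = [p for c, p in test_ppl.items() if word in c and c not in reference]
    seen = [p for c, p in test_ppl.items() if word in c and c in reference]
    if not novel or not seen:
        return float("nan")  # undefined when either side is empty
    return sum(novel) / len(novel) - sum(seen) / len(seen)

def delta_f_ovt(word: str, memory: Set[Comp], train: Set[Comp],
                test_ppl: Dict[Comp, float]) -> float:
    return f_ovt(word, memory, test_ppl) - f_ovt(word, train, test_ppl)

# Toy usage with three noun-adjective compositions.
ppl = {("red", "apple"): 6.0, ("red", "car"): 2.0, ("red", "truck"): 3.0}
print(delta_f_ovt("red", memory={("red", "car")},
                  train={("red", "car"), ("red", "truck")}, test_ppl=ppl))
# -> -1.0
```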
6 Results and Analysis

In this section, we first show the overall performance of continual learning algorithms in our VisCOLL task setup. We then measure the compositional generalization achieved by the algorithms and analyze how memory-based continual learning algorithms may affect it.

6.1 Overall Performance

Table 2 shows the overall performance achieved by vanilla online training, single-pass offline training, ER, and MIR. We see a clear performance gap between vanilla online training and the offline method. The largest gap is in the prediction accuracy of nouns, which are bound to task identities under our stream construction. We also see that ER alleviates forgetting, but its performance is close to offline training only when the replay buffer is very large (|M| = 100,000, about 20% of the training examples). This contradicts the behavior of ER on popular benchmark datasets, where storing only a few examples is believed to be sufficient for good performance (Chaudhry et al., 2019). MIR, a state-of-the-art continual learning algorithm, improves perplexity and noun prediction accuracy on both datasets at the same memory cost as ER. However, there is still a large space in which performance can be improved.

[Figure 4: Perplexity curves of nouns (N), verbs (V), and adjectives (J) in compositions seen or novel with respect to the training set, comparing Vanilla, ER with |M| = 1,000 / 10,000 / 100,000, and Offline (y-axis: log PPL). Panels: (a) COCO N-V composition, nouns; (b) COCO N-J composition, nouns; (c) Flickr N-J composition, nouns; (d) COCO N-V composition, verbs; (e) COCO N-J composition, adjectives; (f) Flickr N-J composition, adjectives. We consider noun-verb and noun-adjective compositions for COCO-shift and only noun-adjective compositions for Flickr-shift.]
6.2 Compositional Generalization

We measure compositional generalization, i.e., how well the models predict a word when it appears in a combination that is novel with respect to the training set. We measure it with the compositional overfitting to the training set, $f_{ovt}(w, D_{tr})$, introduced in Section 5.3. We consider noun-verb combinations $D^{nv}_{tr}$ and noun-adjective combinations $D^{na}_{tr}$ separately, and average $f_{ovt}(w, D_{tr})$ over nouns, verbs, and adjectives.

Figure 4 shows the perplexity of nouns, verbs, and adjectives in seen and novel contexts. There are clear gaps between the perplexity of words in seen contexts and in novel contexts for almost all methods, which implies the models suffer from compositional overfitting.

6.3 Compositional Overfitting to the Memory

In addition to the compositional overfitting to the training set, we also measure the compositional overfitting to the examples stored in the memory. We measure the difference between the two overfitting statistics, $\Delta f_{ovt}(w)$, as introduced in Section 5.3. A positive $\Delta f_{ovt}(w)$ indicates the model predicts a word poorly when the composition is not stored in the memory, which implies the model may overfit the compositions stored in the memory and that the replay memory may potentially hurt compositional generalization. A negative $\Delta f_{ovt}(w)$ indicates the model predicts a word poorly when the composition is stored in the memory, which is not a good sign either, as it implies the model overfits the specific instances of the composition stored in the memory.

Memory size |M|        1k       10k      100k
N-V composition
  Nouns              -0.635   -0.137    0.143
  Verbs              -0.744    0.462    0.345
N-J composition
  Nouns               0.311    0.171   -0.155
  Adjectives         -0.523    0.016   -0.184

Table 3: Differences between the compositional overfitting to the memory and to the training set, measured on COCO-shift. A positive difference implies the model tends to overfit the compositions stored in the memory more than those in the training set.

We report the results in Table 3. The results show that both compositional overfitting and ordinary overfitting occur. When the replay memory is small (|M| = 1,000), we see a clear overfit to the memory for noun and verb prediction in noun-verb compositions and for adjective prediction in noun-adjective compositions. This overfitting is expected, because the model may have visited the examples stored in the memory hundreds of times more often than other examples. When the memory size increases to 10,000 and 100,000, we see a compositional overfit to the memory for verb prediction in noun-verb compositions, but the other statistics move closer to zero.

Overall, the results indicate that both ordinary overfitting and compositional overfitting occur in memory-based continual learning algorithms, and it is not certain which one dominates. These results motivate deeper study of overfitting and compositional overfitting in memory-based continual learning algorithms, and the development of algorithms that can mitigate both.

7 Conclusion

In this paper, we propose VisCOLL, a problem setup for simulating children's language acquisition from the visual world. We construct two datasets, namely COCO-shift and Flickr-shift, and propose evaluations that encode the challenges of continual learning and compositionality. Our analysis shows there is large room for improving the performance of continual learning algorithms, and further shows that addressing overfitting and compositional overfitting issues may be crucial for better performance in our problem setup.
References
Rahaf Aljundi, Lucas Caccia, Eugene Belilovsky, Massimo Caccia, Min Lin, Laurent Charlin, and Tinne Tuytelaars. 2019a. Online continual learning with maximally interfered retrieval. In NeurIPS.

Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. 2019b. Gradient based sample selection for online continual learning. In NeurIPS.

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674-3683.

Dzmitry Bahdanau, Harm de Vries, Timothy J. O'Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, and Aaron Courville. 2019. CLOSURE: Assessing systematic generalization of CLEVR models. arXiv preprint arXiv:1912.05783.

Jean Berko. 1958. The child's learning of English morphology. Word, 14(2-3):150-177.

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc'Aurelio Ranzato. 2019. On tiny episodic memories in continual learning.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. 2020. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526.

Brenden M. Lake and Marco Baroni. 2017. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350.

Yuanpeng Li, Liang Zhao, Kenneth Church, and Mohamed Elhoseiny. 2020. Compositional language continual learning. In International Conference on Learning Representations.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer.

David Lopez-Paz and Marc'Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In NIPS.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2018. Neural baby talk. In CVPR.

Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. 2018. Variational continual learning. In International Conference on Learning Representations.

Giang Nguyen, Tae Joon Jun, Trung Tran, and Daeyoung Kim. 2019. ContCap: A comprehensive framework for continual image captioning. arXiv preprint arXiv:1909.08745.

Steven Pinker, David S. Lebeaux, and Loren Ann Frost. 1987. Productivity and constraints in the acquisition of the passive. Cognition, 26(3):195-267.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV.

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001-2010.

Anthony V. Robins. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7:123-146.

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. 2019. Experience replay for continual learning. In Advances in Neural Information Processing Systems, pages 348-358.

Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. 2017. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990-2999.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.

Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. 2019. Learning to learn words from visual scenes. arXiv preprint arXiv:1911.11237.

Xingdi Yuan, Marc-Alexandre Côté, Jie Fu, Zhouhan Lin, Christopher Pal, Yoshua Bengio, and Adam Trischler. 2019. Interactive language learning by question answering. arXiv preprint arXiv:1908.10909.

Wlodek Zadrozny. 1992. On compositional semantics. In Proceedings of the 14th Conference on Computational Linguistics, pages 260-266. Association for Computational Linguistics.

Friedemann Zenke, Ben Poole, and Surya Ganguli. 2017. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning.