Naïve Artificial Intelligence
Tomer Barak
The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem. [email protected]
Yehonatan Avidan
The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem. [email protected]
Yonatan Loewenstein
Department of Cognitive Sciences, The Federmann Center for the Study of Rationality, The Alexander Silberman Institute of Life Sciences, The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem. [email protected]
Abstract
In the cognitive sciences, it is common to distinguish between crystal intelligence, the ability to utilize knowledge acquired through past learning or experience, and fluid intelligence, the ability to solve novel problems without relying on prior knowledge. By this cognitive distinction, extensively trained deep networks that can play chess or Go exhibit crystal but not fluid intelligence. In humans, fluid intelligence is typically studied and quantified using intelligence tests. Previous studies have shown that deep networks can solve some forms of intelligence tests, but only after extensive training. Here we present a computational model that solves intelligence tests without any prior training. This ability is based on continual inductive reasoning, and is implemented by deep unsupervised latent-prediction networks. Our work demonstrates the potential fluid intelligence of deep networks. Finally, we propose that the computational principles underlying our approach can be used to model fluid intelligence in the cognitive sciences.
Consider the intelligence test depicted in Fig. 1: five ordered tiles are presented to the agent in a one-dimensional Raven's Progressive Matrices (RPM) test. The tiles are characterized by features: the number of objects, their color, shape, size, and positions. One of the features changes in accordance with a predefined rule. The objective of the agent is to select, out of a selection of four alternative tiles, the sixth tile that adheres to that rule. To complicate the task, randomly changing features, which we refer to as distractors, also characterize the tiles. The task is challenging because the relevant feature and the rule must be inferred simultaneously from the examples. Solving RPMs requires inductive reasoning, loosely defined as the ability to derive general rules from specific observations [1–3]. Inductive reasoning is arguably the most important component of fluid intelligence, a cognitive faculty that is correlated with skills such as problem solving and comprehension. Indeed, RPMs are commonly used to quantify fluid intelligence [5].

Preprint. Under review.

Incorporating inductive reasoning in machines has been challenging. Traditional computational models that solved RPMs utilized either a set of predefined features [6], a set of predefined rules [7], or both [4]. With the development of modern machine learning, the use of predefined features or rules became unnecessary, as they can be learned by deep artificial neural networks. However, such learning requires extensive supervised training [8, 9]. By contrast, humans effectively perform inductive reasoning as an unsupervised, continual process [3] using a small number of examples [10].

Figure 1: Intelligence tests for measuring inductive reasoning. In each test, a sequence of t = 5 tiles is presented, and the objective is to choose the next tile from a set of n = 4 alternatives. The tiles are 100 × 100 pixel images that are characterized by the features {Color, Number, Shape, Size, Positions}. One of these features follows a rule; in this example, the color intensity increases along the sequence. The other features can be either constant or change randomly. When a feature changes at random we refer to it as a distractor, and the difficulty of a test is defined by the number of distractors. In this example, the number, shape, size, and positions of the objects are all distractors. The tests were constructed according to [11, 8], focusing on the settings that are best for measuring inductive reasoning (see supplementary materials for more details).

Here we present a prediction model that performs inductive reasoning without prior training. The model is based on an unsupervised latent-prediction network [12–14], meaning that it looks for a predictable latent representation rather than attempting to make predictions in pixel space. Thus, the model can find a latent representation that corresponds to the predictable feature and its underlying rule. We show that this enables the model to solve RPMs, and hence to perform inductive reasoning.
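To make the test structure concrete, it can be sketched in feature space. The sketch below is illustrative only: the feature names, value ranges, and the linear rule are assumptions, and in the paper each feature vector is additionally rendered into a 100 × 100 pixel image by a generative function.

```python
import numpy as np

rng = np.random.default_rng(0)
FEATURES = ["color", "number", "shape", "size", "positions"]

def make_test(predictable="color", distractors=("number", "size"), t=5, n=4):
    """Build the feature vectors of a one-dimensional RPM-style test:
    one feature follows a rule (here, a monotonic increase), distractor
    features change i.i.d., and the remaining features stay constant."""
    seq = {}
    for f in FEATURES:
        if f == predictable:
            start, step = rng.uniform(0.1, 0.3), rng.uniform(0.05, 0.1)
            seq[f] = start + step * np.arange(t + 1)        # the rule
        elif f in distractors:
            seq[f] = rng.uniform(0.0, 1.0, size=t + 1)      # i.i.d. distractor
        else:
            seq[f] = np.full(t + 1, rng.uniform(0.0, 1.0))  # constant feature
    context = [{f: seq[f][j] for f in FEATURES} for j in range(t)]
    correct = {f: seq[f][t] for f in FEATURES}
    # incorrect alternatives: the predictable feature is re-drawn at random
    options = [correct] + [
        {**correct, predictable: rng.uniform(0.0, 1.0)} for _ in range(n - 1)
    ]
    return context, options
```

The difficulty of a test in this sketch is controlled by the length of the `distractors` tuple, mirroring the definition of difficulty above.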
The challenge.
In the test depicted in Fig. 1, the world generates a sequence of grayscale images $x^j$ in the following way: each image is characterized by a low-dimensional vector of features $f^j$, where $f^j_i$ denotes the value of feature $i$ in image $j$. The image $x^j$ is constructed by applying a non-linear and complex generative function from the low-dimensional feature space to the high-dimensional pixel space, $x^j = g(f^j)$. Importantly, while all features but one are either constant over the images or i.i.d., one of the features, $f^j_p$, changes predictably according to a specific rule. After observing a sequence of $t$ images, the agent's task is to select the correct $(t+1)$-th image from a set of $n$ images that were generated using the same generative model over the feature space. In the correct image, $f^{t+1}_p$ follows the predictable rule, whereas it is randomly chosen for the incorrect images. (In many cases, tiles in RPM tests are arranged in a 3 × 3 matrix.) This task is difficult because neither the features (or even the set of possible features) nor the rule (or even the set of possible rules) are given to the agent; they have to be inferred from the sequence of $t$ images.

Predictable representations.
In the cognitive sciences literature, it has been shown that humans solve intelligence tests by concurrently identifying the features in the sequence of images and the rules that underlie their change [2]. The relationships between the image $x^t$ and the $n$ alternative tiles are then considered in view of the identified features and rules, and the selected image is the one that is most congruent with the rule [2, 4]. Motivated by this solution, we sought to construct a network that concurrently identifies the predictable feature and the rule. Rather than attempting to predict the $(t+1)$-th image, it predicts its lower-dimensional representation in the feature space. The disadvantage of this approach is that the feature needs to be inferred from the images. However, the advantages of making predictions in the latent feature space are that (1) its dimensionality is much smaller than that of the pixel space, which implies that fewer images are needed for making such a prediction; and (2) some of the irrelevant features may be stochastic, so the images in pixel space may not be predictable.

By construction, there exists a function $Z^*$ from the image space to a scalar that extracts the relevant feature from an image, $f^j_p = Z^*(x^j)$, and a function $T^*$ that describes the rule, such that $T^*(f^j_p) = f^{j+1}_p$. $Z^*$ and $T^*$ are solutions to the equation

$$T(Z(x^j)) = Z(x^{j+1}) \qquad (1)$$

From a cognitive point of view, the function $Z$ is an encoder that projects the image to a one-dimensional variable, and the function $T$ is a predictor that predicts the value of the projection of the image in the next tile based on that projection in the current tile.

Dynamic representations.
Naively, by solving equation (1) an agent can extract the solutions $Z^*$ and $T^*$ and use them to make predictions. However, there is no unique solution to equation (1), and not all solutions to this equation are useful for solving the task. One trivial solution to equation (1) is $Z(x^j) = Z(x^{j+1})\ \forall j$ and $T(Z) = Z$. This solution is clearly not useful for selecting the $(t+1)$-th image. Therefore, in order to make predictions we should seek a dynamic solution that satisfies the inequality

$$Z(x^j) \neq Z(x^{j+1}) \qquad (2)$$

Bounded representations.
Finally, for every solution $Z$ and $T$ to equations (1) and (2), there is a continuum of other solutions that are given by stretching and/or shifting the function $Z$, with a corresponding compensation in the function $T$. It is possible to set the scale of the solutions by bounding the representations, e.g., to be between $-1$ and $1$:

$$\max_j \left| Z(x^j) \right| \leq 1 \qquad (3)$$

From here on, $Z^*$ and $T^*$ will denote the set of all stretched and shifted $Z$ and $T$ such that the $Z$-s are in the homeomorphism class of the ground-truth $Z^*$ and the $T$-s are the corresponding predictors, $T(Z(x)) = T^*(Z^*(x))\ \forall x$.

Decision making.
Given $Z = Z^*$ and $T = T^*$, choosing the correct tile is trivial. The correct image $o_{\mathrm{correct}}$ satisfies $T^*(Z^*(x^t)) = Z^*(o_{\mathrm{correct}})$ and therefore

$$\text{Correct option} = \arg\min_k \left( T(Z(x^t)) - Z(o_k) \right)^2 \qquad (4)$$

where the alternatives are denoted by $\{o_k\}$ (Fig. 2).

Figure 2: Inductive reasoning model for solving intelligence tests. The model is composed of two functions: an encoder $Z$ that encodes the relevant feature, and a predictor $T$ that predicts in latent space. A decision is made by encoding the image of the last test tile $x^t$ and determining which of the options $o_k$ is best predicted in the latent space. In this example, a good encoder will encode the size of the squares, and the predictor will predict that they monotonically increase, together determining that Option 1 best completes the sequence.

Inductive reasoning as an optimization problem.
The exact solution, $Z^*$ and $T^*$, satisfies two conditions, Eqs. (1) and (2); Eq. (3) sets the scale of the solution. We propose that a good encoder and predictor can be found by minimizing three loss functions that correspond to the three equations (adapted from [15]):

$$\mathcal{L}_{\mathrm{pred}} = \left( T(Z(x^j)) - Z(x^{j+1}) \right)^2$$
$$\mathcal{L}_{\mathrm{dis}} = \exp\left( -\frac{\left| Z(x^j) - Z(x^{j+1}) \right|}{\sigma} \right)$$
$$\mathcal{L}_{\mathrm{bound}} = \max\left( \max_j \left( \left| Z(x^j) \right| - 1 \right),\ 0 \right) \qquad (5)$$

The loss function $\mathcal{L}_{\mathrm{pred}}$ is minimized when equation (1) is satisfied, i.e., when a predictable low-dimensional representation is found. The second loss function, $\mathcal{L}_{\mathrm{dis}}$, decreases the more dynamic the representations are. The parameter $\sigma$ determines the scale of the difference between consecutive representations. Note that because of the exponential shape of the loss function, if the representations are sufficiently different (relative to $\sigma$) then further separating them will have only a small effect on the loss function. In our simulations we used a fixed value of $\sigma$. This parameter becomes meaningful in view of $\mathcal{L}_{\mathrm{bound}}$, which acts to maintain the representations in the $[-1, 1]$ range.

The network model.
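Written out, the three losses of Eq. (5) above are compact. The NumPy sketch below is an illustrative rendering only; in the paper, Z and T are the deep networks described in this section, trained with RMSprop, and the default values of `sigma` and `bound` here are assumptions.

```python
import numpy as np

def rpm_losses(Z, T, xs, sigma=0.1, bound=1.0):
    """Evaluate the three objectives of Eq. (5) on a tile sequence xs.
    Z maps a tile to a scalar latent; T predicts the next latent.
    sigma and bound are illustrative default values."""
    z = np.array([Z(x) for x in xs])
    z_next_pred = np.array([T(zj) for zj in z[:-1]])
    L_pred = np.mean((z_next_pred - z[1:]) ** 2)              # Eq. (1): predictable
    L_dis = np.mean(np.exp(-np.abs(z[1:] - z[:-1]) / sigma))  # Eq. (2): dynamic
    L_bound = max(np.max(np.abs(z)) - bound, 0.0)             # Eq. (3): bounded
    return L_pred, L_dis, L_bound
```

A perfectly predictable, dynamic, and bounded representation drives the first loss to zero, the second toward zero, and keeps the third at zero.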
For the encoder $Z(x)$ we used an 8-layer convolutional neural network from the 100 × 100 pixel space to a single neuron. The predictor $T(Z)$ calculates the representation transition $T(Z) = Z + \Delta T(Z)$, where $\Delta T$ is a 5-layer fully-connected network. To learn the parameters of the two networks, we concurrently minimized the loss functions, Eq. (5), each with its own RMSprop optimizer [15]. Given a sequence of $t$ tiles $\{x^j\}_{j=1}^t$, each optimization step optimizes a minibatch that consists of the $t - 1$ consecutive pairs of tiles (see supplementary materials for more information).

3.1 The expressivity of the model - extensive training

(a) Training set, rule: color (easy); (b) training set, rule: size (easy); (c) test set, rule: color (difficult); (d) test set, rule: size (difficult); (e) performance, rule: color (difficult); (f) performance, rule: size (difficult).

Figure 3:
Extensive training. We tested the model's ability to solve our most difficult tests in two test conditions: when the predictable feature is the color of the objects (left side of the figure) versus the size of the objects (right side of the figure). (a), (b) Training. In each condition, 10 networks were extensively trained on easy sequences in which either the color (a) or the size (b) was predictable. The networks performed two optimization steps per training sequence, and then moved on to the next sequence. (c), (d) Test set. While the networks were training, we measured their performance on 100 difficult tests complying with matching rules (color (c), size (d)). Note that because the network is trained on sequences other than the test sequence, and because prediction in the test sequence relies on the last tile, the network's performance is independent of all the test tiles but the last; hence these tiles are bleached in the figure. (e), (f) Performance. The networks' performance in the difficult trials improved with training, reaching success rates that exceeded 90% in the most difficult tests after thousands of training sequences. Note the fast and substantial improvement after only a few training sequences. The dark lines denote the mean performance and the shades are the standard error of the means (SEMs). The dashed red lines denote the chance levels (25%).

The networks' ability to solve difficult intelligence tests such as the one depicted in Fig. 1 depends on their being able to find accurate approximations of $Z^*$ and $T^*$. Specifically, the networks should be sufficiently expressive to approximate $Z^*$ and $T^*$ well enough, and the SGD-based optimization process on the loss functions should converge to such a solution. To test these requirements, we extensively trained the networks on easy sequences (Fig. 3a-3b), in which one feature is monotonically increasing whereas the other features remain constant, and tested them on the difficult intelligence tests, in which the predictable feature of the training set followed the same rule but all other features were distractors (i.e., randomly changed) (Fig. 3c-3d). Training on easy sequences and testing on difficult ones minimized the possibility that a consecutive pair of tiles appearing in the training set would reappear in the test set, thus minimizing the possibility that overfitting underlay our results. Interestingly, we found that training on difficult problems resulted in slower and less robust learning, a further indication that the performance of the networks after learning did not reflect overfitting.

The results of this training procedure are depicted in Fig. 3e-3f. Within thousands of training sequences, the model achieved success rates that exceeded 90% (compared with 25% chance performance), demonstrating that our network is expressive enough to solve the intelligence tests of Fig. 1. The success of the training also indicates that the training procedure, i.e.,
minimization of the loss functions of equation (5) with three RMSprop optimizers, can lead to a good approximation of $Z^*$ and $T^*$. This result is consistent with previous studies that demonstrated that unsupervised latent-prediction models are capable of learning good abstract representations when extensively trained [12–14].

A fundamental difference between the performance of the networks in the previous section and human intelligence is that humans do not seem to require any extensive training in order to solve RPMs (although training does improve performance [16]). In fact, a hallmark of fluid intelligence is the ability to infer a rule from a very small number of examples. This observation motivated us to study the extent to which our networks can solve intelligence tests in the absence of any prior training. To that goal, rather than training using a large number of different sequences of tiles, we used the test sequence itself as our training sequence. Fig. 4 depicts an example of a naïve network that solves an intelligence test of intermediate difficulty (the test from Fig. 2). We used exactly the same training procedure as in section 3.1, with one important difference: we used identical copies of the same single test sequence as our training set (Fig. 4a). The parameters of the encoder $Z$ and predictor $T$ are learned by minimizing the three loss functions over the sequence (Fig. 4b). The decision was based on the best-predicted option, Eq. (4) (Fig. 4c). The prediction errors, $(T(Z(x^t)) - Z(o_k))^2$, for the correct option $o_{\mathrm{correct}}$ (black) and the incorrect options (orange) as a function of optimization steps are depicted in Fig. 4d. Within 10 optimization steps, the prediction error associated with the correct tile was already substantially smaller than that of the incorrect tiles, directing choice towards the correct answer.

Figure 4: The solving of an intelligence test by a naïve network.
(a) The network trains on the t = 5 tiles by minimizing the three loss functions. (b) The loss functions as a function of training steps. Note that $\mathcal{L}_{\mathrm{bound}}$ occasionally sets the scale of the representations. (c) The prediction errors between the fifth test tile and the n = 4 options are measured and used for decision-making. (d) The prediction error singles out the correct option (black) from the incorrect options (orange) after fewer than 10 optimization steps.

In order to quantify the naïve networks' performance, we tested the model in multiple test conditions (Fig. 5). Each test condition contained 100 intelligence tests, each solved by a different randomly-initialized network. Remarkably, we found that training on the t = 5 tiles of the test is sufficient for solving the easy tests, as well as for achieving a level of performance that is substantially higher than chance in the difficult tests. All this was achieved without any prior learning and knowledge, using networks whose weights were randomly chosen. We posit that the success of the model in solving intelligence tests without any training indicates that the architecture and optimization process can also be used as models for inductive reasoning in the cognitive sciences.

Figure 5: Naïve networks' performance. The model's ability to solve tests without training was evaluated on multiple test conditions ((a) rule: color; (b) rule: size). The test conditions differed in the predictable feature and in the number and types of distractors: the markers' type denotes the type of distractors used (see middle legend). Each test condition was composed of 100 tests, and each test was solved by a different randomly-initialized network, trained with 200 optimization steps. The error bars are the standard error of the mean for the corresponding test condition. The black lines denote the mean performance and the shades are the standard deviation over the different conditions.
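The naïve procedure, optimizing on the test tiles themselves and then applying Eq. (4), can be illustrated end-to-end on a deliberately simplified toy problem. This is a sketch under strong assumptions: tiles are scalars rather than 100 × 100 images, the encoder and predictor are affine rather than deep networks, and gradients are taken by finite differences rather than RMSprop.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy test: scalar "tiles" increase by 0.15 per step.
context = 0.1 + 0.15 * np.arange(5)     # tiles x^1..x^5
options = [0.85, 0.20, 0.45, 0.60]      # option 0 continues the rule

def total_loss(p, xs, sigma=0.1):
    w, c, b = p                         # Z(x) = w*x + c, T(z) = z + b
    z = w * xs + c
    L_pred = np.mean((z[:-1] + b - z[1:]) ** 2)        # predictability
    L_dis = np.mean(np.exp(-np.abs(np.diff(z)) / sigma))  # dynamic latents
    L_bound = max(np.max(np.abs(z)) - 1.0, 0.0)        # bounded scale
    return L_pred + L_dis + L_bound

p = rng.normal(0.0, 0.1, 3)             # naive: random initial parameters
for _ in range(500):                    # optimize on the test sequence itself
    grad = np.array([
        (total_loss(p + 1e-5 * e, context)
         - total_loss(p - 1e-5 * e, context)) / 2e-5
        for e in np.eye(3)
    ])
    p -= 0.05 * grad

w, c, b = p
prediction = (w * context[-1] + c) + b              # T(Z(x^5))
errors = [(prediction - (w * o + c)) ** 2 for o in options]
choice = int(np.argmin(errors))                     # Eq. (4)
```

In this toy setting, the dissimilarity loss drives the latent steps away from zero, the prediction loss aligns the predictor's offset with the rule's step, the bound loss fixes the scale, and the rule-following option then has the smallest prediction error.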
Machine learning and RPMs.
A previous study has shown that, with extensive supervised learning, RPMs can be solved by deep networks (WReN [8]). It is worth noting that these networks can solve RPMs characterized by rules that cannot be learned by our network, e.g., logical rules that require "working memory". Moreover, learning is faster and performance is improved if latent representations are first learned in an unsupervised way and then used as input to a supervised-trained network [17, 18]. The two fundamental differences between these approaches and our naïve network are that (1) our learning is fully unsupervised, and (2) our network does not utilize any prior information beyond the test tiles.
Unsupervised latent prediction models.
Several studies have proposed to use the predictive information between the past and the future for dimensionality reduction. Dynamical Components Analysis (DCA) [19] finds predictable latent representations $Z(x)$ by maximizing the predictive information between past and future, $I(Z(x_{\mathrm{past}}); Z(x_{\mathrm{future}}))$, using a linear approximation. Another related linear method is Slow Feature Analysis (SFA) [20], which finds slowly varying features of the data. Contrastive predictive coding [12] is an unsupervised optimization problem that finds such representations using a deep neural network, and has been successfully used to find useful latent representations of ATARI games [13] and deformable objects [14]. Conceptually, our approach is similar to these previous studies. The challenge of finding a solution to the equation $T(Z(x_{\mathrm{past}})) = Z(x_{\mathrm{future}})$ such that $Z(x_{\mathrm{past}})$ is as dissimilar as possible from $Z(x_{\mathrm{future}})$ can be viewed as an approximation of the challenge of maximizing the predictive information, with the advantage of separating the information into encoding and prediction functions, which is useful for making actual predictions. The ability to make predictions in latent space (world models) has proven useful for planning in RL, resulting in improved overall performance [15, 21]. Our latent-prediction model is based on these studies, but is used for a very different purpose: as a model of fluid intelligence.

Discussion
We identified an analogy between data-efficient latent-prediction models and inductive reasoning, the core cognitive ability of fluid intelligence. We used this analogy to build a computational model that can solve fluid-intelligence tests without prior training or knowledge.
Data efficiency.
Deep neural networks are expressive enough to overfit large random datasets, and are certainly capable of overfitting a small number of examples. However, a remarkable feat of deep neural networks is that they can generalize even when the number of examples in the training set is substantially smaller than the number of parameters, a result that is still not fully understood [22]. The ability of our networks to approximate a rule by observing only five tiles in the "naïve" experiments takes this ability to the extreme. It has been argued that the remarkable capability of the human brain to learn from a small number of examples ([10]), in comparison to artificial networks, results from priors that are acquired before the experiment, or even on evolutionary time-scales [23–25]. Our results indicate that, in fact, much can be achieved without priors even when the training set is limited.
The limitations of the model.
There are rules that, by the construction of the model, cannot be learned. Specifically, rules that require memory, e.g., logical operations and long-term relations between the tiles, cannot be learned. Incorporating such rules into the repertoire of the network can be done by defining the input to the encoder to be a set of several consecutive tiles. Alternatively, working memory can be incorporated into the model by replacing the feed-forward networks $Z$ and/or $T$ with recurrent networks [12]. Another limitation of the model is that the solutions $\hat{Z}$ and $\hat{T}$ are likely to differ from the true solutions $Z^*$ and $T^*$, even when the network makes the correct choice. Our measure of success is not the learning of the rule, i.e., the similarity between $\hat{Z}, \hat{T}$ and $Z^*, T^*$; rather, it relies on the ability of the network to choose the correct tile from a finite set of $n = 4$ alternatives. The results indicate that the networks found solutions that were correlated with the ground truth, a correlation that enabled them to solve the task.

Fluid vs. crystal intelligence.
We studied the networks' performance in two regimes. In cognitive terminology, the extensive-training experiment was a test of crystal intelligence, in which performance improved with the accumulation of knowledge. By contrast, in the naïve experiment setting, performance relied on the fluid intelligence of the model: the predefined model architecture, its loss functions, and the optimization process. It could be interesting to combine the two types of intelligence by considering learning on multiple time-scales, the shorter ones corresponding to improving crystal intelligence and the longer ones to improving the hyperparameters of the network, and hence its fluid intelligence. Improving humans' fluid intelligence via training is a hard challenge in psychology, with no existing method showing definite success [26–28]. Our model puts us in a position to study the computational requirements for improving fluid intelligence.
Conclusion.
We showed that deep neural networks can solve intelligence tests and exhibit fluid intelligence. Our model demonstrates the potential fluid intelligence of artificial networks and helps us identify the computational challenges of fluid intelligence in humans and animals.
Acknowledgments and Disclosure of Funding
This work was supported by the Israel Science Foundation (Grant No. 757/16) and the Gatsby Charitable Foundation.
References

[1] Lenore Blum and Manuel Blum. Toward a mathematical theory of inductive inference. Information and Control, 28(2):125–155, 1975.
[2] Robert J. Sternberg. Components of human intelligence. Cognition, 15(1-3):1–48, 1983.
[3] Michael Siebers, David L. Dowe, Ute Schmid, José Hernández-Orallo, and Fernando Martínez-Plumed. Computer models solving intelligence test problems: Progress and implications. Artificial Intelligence, 230:74–107, 2015.
[4] Patricia A. Carpenter, Marcel Adam Just, and Peter Shell. What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices test. Psychological Review, 97(3):404–431, 1990.
[5] Robert M. Kaplan and Dennis P. Saccuzzo. Psychological Testing: Principles, Applications, and Issues. 2009.
[6] Daniel Rasmussen and Chris Eliasmith. A neural model of rule generation in inductive reasoning. Topics in Cognitive Science, 3(1):140–153, 2011.
[7] Ron Sun and David Yun Dai. Deep learning of Raven's matrices. Advances in Cognitive Systems, pages 1–6, 2018.
[8] David G. T. Barrett, Felix Hill, Adam Santoro, Ari S. Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. 2018. URL http://arxiv.org/abs/1807.04225.
[9] Felix Hill, Adam Santoro, David G. T. Barrett, Ari S. Morcos, and Timothy Lillicrap. Learning to make analogies by contrasting abstract relational structure. 2019. URL http://arxiv.org/abs/1902.00120.
[10] Nicolas Barascud, Marcus T. Pearce, Timothy D. Griffiths, Karl J. Friston, and Maria Chait. Brain responses in humans reveal ideal observer-like sensitivity to complex acoustic patterns. Proceedings of the National Academy of Sciences, 113(5):E616–E625, 2016.
[11] Ke Wang and Zhendong Su. Automatic generation of Raven's Progressive Matrices. IJCAI International Joint Conference on Artificial Intelligence, pages 903–909, 2015.
[12] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. 2018. URL http://arxiv.org/abs/1807.03748.
[13] Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised state representation learning in Atari. NeurIPS, 2019. URL http://arxiv.org/abs/1906.08226.
[14] Wilson Yan, Ashwin Vangipuram, Pieter Abbeel, and Lerrel Pinto. Learning predictive representations for deformable objects using contrastive estimation. 2020. URL http://arxiv.org/abs/2003.05436.
[15] Vincent Francois-Lavet, Yoshua Bengio, Doina Precup, and Joelle Pineau. Combined reinforcement learning via abstract representations. Proceedings of the AAAI Conference on Artificial Intelligence, 33:3582–3589, 2019. URL https://github.com/VinF/deer/.
[16] N. W. Denney and S. M. Heidrich. Training effects on Raven's Progressive Matrices in young, middle-aged, and elderly adults. Psychology and Aging, 5(1):144–145, 1990.
[17] Xander Steenbrugge, Sam Leroux, Tim Verbelen, and Bart Dhoedt. Improving generalization for abstract reasoning tasks using disentangled feature representations. 2018. URL http://arxiv.org/abs/1811.04784.
[18] Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. Are disentangled representations helpful for abstract visual reasoning? NeurIPS, 2019. URL http://arxiv.org/abs/1905.12506.
[19] David G. Clark, Jesse A. Livezey, and Kristofer E. Bouchard. Unsupervised discovery of temporal structure in noisy data with Dynamical Components Analysis. NeurIPS, 2019. URL http://arxiv.org/abs/1905.09944.
[20] Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[21] David Ha and Jürgen Schmidhuber. World models. 2018. URL https://arxiv.org/abs/1803.10122.
[22] Chiyuan Zhang, Benjamin Recht, Samy Bengio, Moritz Hardt, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2019.
[23] François Chollet. On the measure of intelligence. 2019. URL http://arxiv.org/abs/1911.01547.
[24] Gianluigi Mongillo, Hanan Shteingart, and Yonatan Loewenstein. The misbehavior of reinforcement learning. Proceedings of the IEEE, 102(4):528–541, 2014.
[25] Gianluigi Mongillo, Hanan Shteingart, and Yonatan Loewenstein. Race against the machine. Proceedings of the IEEE, 102(4):542–543, 2014.
[26] J. te Nijenhuis, A. E. M. van Vianen, and H. van der Flier. Score gains on g-loaded tests: No g. 2007.
[27] Jacky Au, Ellen Sheehan, Nancy Tsai, Greg J. Duncan, Martin Buschkuehl, and Susanne M. Jaeggi. Improving fluid intelligence with training on working memory: A meta-analysis. Psychonomic Bulletin and Review, 22(2):366–377, 2015.
[28] Taylor R. Hayes, Alexander A. Petrov, and Per B. Sederberg. Do we really become smarter when our fluid-intelligence test scores improve? Intelligence, 48:1–14, 2015.