Learning task-agnostic representation via toddler-inspired learning
Kwanyoung Park, Junseok Park, Hyunseok Oh, Byoung-Tak Zhang*, Youngki Lee*
Department of Computer Science and Engineering, AI Institute (AIIS), Seoul National University, Seoul 08826, South Korea
{william202, jspark227, ohsai}@snu.ac.kr, [email protected], [email protected]
Abstract
One of the inherent limitations of current AI systems, stemming from passive learning mechanisms (e.g., supervised learning), is that they perform well on labeled datasets but cannot deduce knowledge on their own. To tackle this problem, we draw inspiration from a highly intentional learning system that learns via action: the toddler. Inspired by the toddler's learning procedure, we design an interactive agent that learns and stores a task-agnostic visual representation while exploring and interacting with objects in a virtual environment. Experimental results show that the obtained representation transfers to various vision tasks such as image classification, object localization, and distance estimation. Specifically, the proposed model achieved 100% and 75.1% accuracy and 1.62% relative error, respectively, which is noticeably better than an autoencoder-based model (99.7%, 66.1%, 1.95%) and comparable with supervised models (100%, 87.3%, 0.71%).
Although recent deep learning methods show overwhelming performance in the computer vision domain [1], such data-driven learning has two major limitations: (i) a well-distributed, large labeled dataset is needed to learn features properly [2], and (ii) models are task-specific, in the sense that adapting to multiple different tasks or transfer learning is difficult [3].

Several learning frameworks have been suggested to overcome each limitation, but none of them solves both, due to their drawbacks. Semi-supervised learning trains the model using both labeled data and extra unlabeled data to reduce the labeling cost. However, it reduces the sample complexity only by a constant factor compared to supervised learning, absent strong assumptions on the unlabeled data [4]. Meta-learning trains on a set of well-known tasks and leverages the acquired knowledge when learning a similar task. Still, there must be a sufficient number of well-labeled prior tasks that are closely related to the target task [5]. Multi-task learning effectively adapts to a set of related tasks simultaneously by forming an inductive bias from the intrinsic dependencies of the tasks, but it is challenging to optimize shared parameters under each task's competing objectives [6].

These challenges stem from the difference in how humans and data-driven AI model the learning process [3]. Data-driven AI takes a statistical pattern-recognition-based approach, so knowledge accumulation relies on the observed data. In contrast, humans actively inspect the environment to collect data and learn from only a few samples by building a generalized world model. Thus, a solution could be to integrate the human learning process into AI techniques.

*Corresponding authors.
NeurIPS 2020 Workshop on BabyMind, Vancouver, Canada.
AI researchers are recently taking an interest in how children learn, seeking technical advances [7, 8, 9]. Learning properly at the child stage is crucial for learning-to-learn capabilities like goal-setting and self-control [10]. Life experiences and prior knowledge acquired in childhood are known to influence learning as a grown-up. The latest works in deep learning, such as visual object learning [7], seek advances from how children learn. By understanding how the child learns, we can understand how learning-to-learn capabilities and task-agnostic knowledge of objects are nurtured.

In this work, we propose a new learning framework to deal with these challenges, inspired by a highly intentional learning system via action: the toddler. Toddlers unconsciously learn through interaction and play with their surrounding environment, rather than the self-directed, task-specific learning of an adult [11]. Large-scale studies suggest that a general understanding of objects develops at an early stage of toddlerhood without any supervision, through interactions with objects like mouthing, chewing, and rotating [12, 13]. Furthermore, teaching formal subjects too early to a child is counterproductive in the long run [14], and children learn cognitive and self-regulatory abilities through playing [13, 10]. These studies motivated us to organize unsupervised or weakly supervised learning through play in an interactive, playful environment that simulates how toddlers learn.

We formulate a toddler-inspired learning framework and simulate exploration- and interaction-based learning in an interactive environment. In particular, we first designed a virtual environment where the agent can freely roam, interact with objects, and receive feedback (reward). Second, we designed the agent's network architecture to extract visual knowledge of objects into an embedding called interaction feature maps.
Interaction feature maps are designed to have only one feature image per interaction, forcing the agent to learn a compact interaction-based representation. Finally, we transferred the visual knowledge to downstream computer vision tasks by using the interaction feature map as a prior. Learning downstream tasks with the interaction feature map achieved 99.7% and 62.8% accuracy and 3.0% relative error in image classification, object localization, and distance estimation, which is 0.3%, 16.9%, and 13.6% better than autoencoder-based unsupervised transfer learning. Moreover, the number of images needed to develop the embedding prior was notably smaller than for the unsupervised counterpart. This shows that the toddler-inspired learning framework can efficiently gain transferable knowledge of objects with active, interaction-based data collection.
Interactive Virtual Environment.
Motivated by [12], we designed an environment supporting human-like visual observation and active physical interaction with objects, to train the visual knowledge prior without any explicit labels. We used VECA [15], a virtual environment generation toolkit for human-like agents, to implement the environment. The environment's reward structure provides a sparse positive reward signal when the agent is touching or playing with the prop objects, and a near-zero negative reward to aid navigation toward the prop object. This reward structure motivates the agent to visually locate distant objects while freely exploring, and to observe objects in depth when nearby. The agent collects data without any labels and establishes a more profound visual understanding of an object compared to unsupervised learning on object images without any context.
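The reward structure above can be sketched as follows. The paper does not give the exact constants or shaping function, so `interaction_bonus`, `step_penalty`, and the `tanh` distance scaling are illustrative assumptions:

```python
import numpy as np

def reward(distance_to_prop, is_interacting,
           interaction_bonus=1.0, step_penalty=0.01):
    """Hypothetical sketch of the described reward: a sparse positive
    reward while the agent touches/plays with a prop object, and a
    near-zero negative reward that nudges the agent toward the object."""
    if is_interacting:
        return interaction_bonus  # sparse positive signal on interaction
    # near-zero negative reward, growing in magnitude with distance
    return -step_penalty * float(np.tanh(distance_to_prop))
```

Any shaping with these two regimes (sparse positive on contact, small negative elsewhere) would produce the described exploration incentive.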
Toddler-inspired learning.
With the interactive environment, we formulate the toddler-inspired learning framework, which aims to acquire a general understanding of objects without any supervision, by exploring the environment and interacting with the objects, as a toddler does. We assume that the agent can only visually observe and interact with the object during the sparsely rewarded reinforcement learning task. Through reinforcement learning, we want the agent to learn a transferable representation embedding f_θ(x), parameterized by θ, through sufficient observation and interaction, without any supervision from the downstream tasks. The representation mapping f_θ(x) will serve as a general prior for supervised downstream tasks T = {T_i}_{i=1,...,n} with datasets D_T = {D_{T_i}} collected from the environment.

The framework consists of two phases. First, the transferable representation f_θ is pretrained in an interactable environment by solving the reinforcement learning problem in Eq. 1, where o_t and a_t denote the observation and action at time t, and r(·), π_ψ(·|·), and γ are the reward function, policy, and discount factor, respectively. Second, we use the representation mapping f_θ(x) as an embedding prior and transfer it to the downstream tasks T. We evaluate the generality of the representation by its transferability, which amounts to maximizing the sum of the per-task objectives in Eq. 2, where {J_{T_i}} denotes the set of objective functions for each task {T_i}. Note that we cannot directly optimize Eq. 2, since the task distribution is unknown during training.

$$\hat{\theta}, \hat{\psi} = \arg\max_{\theta,\psi} J_{\mathrm{train}}(f_\theta, \pi_\psi) = \arg\max_{\theta,\psi} \sum_{t=0}^{\infty} \gamma^t\, r\big(s_t, \pi_\psi(a_t \mid f_\theta(o_t))\big) \quad (1)$$

to indirectly maximize

$$J_{\mathrm{test}}(\hat{\theta}) = \sum_{T_i \in T} \sum_{(x,y) \in \mathcal{D}_{T_i}} J_{T_i}\big(f_{\hat{\theta}}(x), y\big) \quad (2)$$

Figure 1: Overview of toddler-inspired learning framework and network architecture.

Fig. 1 shows the overview of the procedure in our toddler-inspired learning framework.
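The two-phase procedure can be sketched schematically. Here `env`, `f_theta`, `pi_psi`, `rl_update`, and `fit_linear` are stand-ins for the actual environment, encoder, policy, RL update rule (SAC in the paper), and per-task linear head training; none of these names come from the paper's code:

```python
def pretrain(env, f_theta, pi_psi, rl_update, num_steps):
    """Phase 1 (Eq. 1): optimize theta and psi jointly on the RL objective."""
    obs = env.reset()
    for _ in range(num_steps):
        action = pi_psi(f_theta(obs))   # policy acts on the embedding
        obs, rew, done = env.step(action)
        rl_update(obs, action, rew)     # e.g., an SAC update step
        if done:
            obs = env.reset()
    return f_theta

def transfer(f_theta, tasks):
    """Phase 2 (Eq. 2): freeze f_theta and fit one linear head per task."""
    heads = {}
    for name, (inputs, labels, fit_linear) in tasks.items():
        feats = [f_theta(x) for x in inputs]  # frozen embedding as prior
        heads[name] = fit_linear(feats, labels)
    return heads
```

The key property captured here is that `f_theta` is updated only in phase 1; phase 2 treats it as a fixed feature extractor.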
In the first phase, the agent trains under the reinforcement learning framework and learns an efficient transferable representation, which we name interaction feature maps, by exploring the environment and interacting with the object. In the second phase, the representation embedding becomes a feature extractor for the data points of the downstream tasks. In this figure, the representation embedding's generality is evaluated by its transferability to three vision tasks: image classification, distance estimation, and object localization.

Network Architecture.
To learn and store transferable knowledge, we designed the agent's architecture as in Fig. 1. The agent's visual observation is encoded with a CNN and an MLP, resulting in interaction feature maps. These feature maps are masked with a linearly embedded intention, and the masked features determine the agent's action. Since the agent's movement depends only on the masked features, the agent must learn to represent abstract features of the object corresponding to its interaction.
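A minimal numerical sketch of this intention gating follows. A random projection stands in for the CNN+MLP encoder, and the dimensions are toy values (the paper uses a DQN-style CNN [16] and 512-dimensional features); all weights here are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, FEAT_DIM, N_INTERACTIONS, N_ACTIONS = 32, 16, 3, 5

W_enc = rng.normal(size=(FEAT_DIM, OBS_DIM)) * 0.1   # stand-in encoder
W_int = rng.normal(size=(FEAT_DIM, N_INTERACTIONS))  # intention embedding
W_act = rng.normal(size=(N_ACTIONS, FEAT_DIM))       # policy head

def act(obs, intention_id):
    feat = np.tanh(W_enc @ obs)                      # interaction feature maps
    onehot = np.eye(N_INTERACTIONS)[intention_id]
    mask = 1.0 / (1.0 + np.exp(-(W_int @ onehot)))   # sigmoid gate per feature
    masked = feat * mask                             # intention-masked features
    return int(np.argmax(W_act @ masked))            # greedy action
```

Because the action depends only on `masked`, features irrelevant to the current intention can be suppressed by the gate, which is the pressure toward interaction-specific abstractions described above.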
To show that the agent can acquire transferable knowledge through the toddler-inspired learning framework, we evaluated the interaction feature map's transfer performance on three supervised visual downstream tasks: image classification, distance estimation, and object localization. Specifically, we fixed the parameters of the interaction feature map f_θ after pretraining and attached linear layers for transfer learning. We used the VECA [15] toolkit to implement an interactive 3D environment that includes three prop objects (toy pyramid, ball, doll) and a baby agent. The agent receives 84x84 RGB binocular vision and has two kinds of actions: movement and interaction (hold, kick, press). The reward signal differs for each object-interaction pair. It can be positive (if the baby presses the doll, the doll makes a sound and the baby is joyful) or negative (if the baby kicks the toy pyramid, the baby feels pain).

Baselines.
Since the agent's performance is related to its architecture, we compare the agent's performance to baseline agents with the same architecture but different training methods:

• Random: Randomly initialize the agent's network parameters and only train the attached layer. This shows the naive performance of the architecture.
• Autoencoder: The agent's network is trained as an autoencoder. We use its performance as a baseline for representation learning without explicitly labeled data.
• Supervised: Both the agent's network and the attached network are trained with supervision (without transferring) for a specific downstream task. We interpret its performance as the optimal achievable performance with this architecture.

Figure 2: Learning curves of transfer learning. For classification and recognition, higher is better. For distance estimation, lower is better. Best viewed in color.
Task (Metric, %)                       Random   Autoencoder   Proposed   Supervised
Classification (Accuracy)              90.0     99.7          100        100
Object localization (Accuracy)         -        66.1          75.1       87.3
Distance estimation (Relative error)   -        1.95          1.62       0.71

Inspired by how toddlers learn, we proposed a toddler-inspired learning framework to gain transferable visual knowledge of objects by exploring and interacting with the environment. We evaluated its transfer performance on several supervised downstream vision tasks. Evaluation results show that the agent can gain transferable knowledge of objects by exploring and interacting with the environment.

However, our method is still far from how a toddler learns. We used a hand-crafted reward and applied a conventional reinforcement learning algorithm to train the agent. We suggest that substituting the hand-crafted reward with an intrinsic reward and developing a human-like, fast, and adaptive learning algorithm would be a promising future direction.

Acknowledgement
This work was supported by the Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01367, Infant-Mimic Neurocognitive Developmental Machine Learning from Interaction Experience with Real World (BabyMind)).
References

[1] Voulodimos, A., N. Doulamis, A. Doulamis, et al. Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, 2018, 2018.
[2] Mastorakis, G. Human-like machine learning: limitations and suggestions. CoRR, abs/1811.06052, 2018.
[3] Lake, B. M., T. D. Ullman, J. B. Tenenbaum, et al. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
[4] Lu, T. T. Fundamental limitations of semi-supervised learning. Master's thesis, University of Waterloo, 2009.
[5] Kang, B., J. Feng. Transferable meta learning across domains. In UAI, pages 177-187, 2018.
[6] Liu, Y., B. Zhuang, C. Shen, et al. Training compact neural networks via auxiliary overparameterization. arXiv preprint arXiv:1909.02214, 2019.
[7] Bambach, S., D. Crandall, L. Smith, et al. Toddler-inspired visual object learning. In Advances in Neural Information Processing Systems, pages 1201-1210, 2018.
[8] Schank, R. C. Conceptual dependency: A theory of natural language understanding. Cognitive Psychology, 3(4):552-631, 1972.
[9] Turing, A. M. Computing machinery and intelligence. In Parsing the Turing Test, pages 23-65. Springer, 2009.
[10] Zosh, J. N., E. J. Hopkins, H. Jensen, et al. Learning through play: a review of the evidence. 2017.
[11] McDonough, D., et al. Similarities and differences between adult and child learners as participants in the natural learning process. Psychology, 4(03):345, 2013.
[12] Gibson, E. J. Exploratory behavior in the development of perceiving, acting, and the acquiring of knowledge. Annual Review of Psychology, 39(1):1-42, 1988.
[13] Piaget, J., M. Cook. The origins of intelligence in children, vol. 8. International Universities Press, New York, 1952.
[14] Suggate, S. P., E. A. Schaughency, E. Reese. Children learning to read later catch up to children reading earlier. Early Childhood Research Quarterly, 28(1):33-48, 2013.
[15] Park, K., J. Heo, Y. Lee. VECA: A VR toolkit for training and testing cognitive agents. https://github.com/GGOSinon/VECA, 2020.
[16] Mnih, V., K. Kavukcuoglu, D. Silver, et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[17] Kingma, D. P., J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Haarnoja, T., A. Zhou, P. Abbeel, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
At the start of each episode, the agent has an intention to perform a certain interaction (e.g., in this episode, the baby wants to kick something). Initially, the agent randomly explores and interacts with the objects, receiving both positive and negative rewards. During training, the agent learns to find the object matching its intention to maximize the reward.
We collected 2400 binocular RGB images by randomly rotating and positioning the objects used in the environment. For all experiments, we split the data into 2100/300 images for training/testing.

Technical details

Architecture.
We used the CNN architecture introduced in [16] and attached two layers to produce 512-dimensional interaction feature maps. These features are transferred with a single linear layer.
Image classification.
As in standard image classification tasks, the agent has to classify images by the object they contain. We used cross-entropy loss with softmax activation to train the model.
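For reference, the softmax cross-entropy loss used here can be written in a numerically stable log-sum-exp form; this is the standard formulation, not code from the paper:

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax distribution against an integer class
    label, computed stably by subtracting the max logit before exp."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                          # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log softmax
    return float(-log_probs[label])
```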
Distance estimation.
The agent has to estimate the distance between the camera and the object. The distances are log-normalized to have zero mean and unit variance. We used mean squared error loss to train the model.
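The target preprocessing can be sketched as below. Standardizing with training-set statistics is an assumption about the exact procedure; the paper only states that distances are log-normalized to zero mean and unit variance:

```python
import numpy as np

def fit_log_norm(train_distances):
    """Compute log-space mean and std from the training distances."""
    logs = np.log(np.asarray(train_distances, dtype=float))
    return logs.mean(), logs.std()

def log_normalize(distances, mu, sigma):
    """Log-transform, then standardize to zero mean / unit variance."""
    return (np.log(np.asarray(distances, dtype=float)) - mu) / sigma
```

Predictions made in this normalized log space are mapped back to metric distances with `exp(pred * sigma + mu)`.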
Object localization.
The agent has to localize the object with a bounding box, whose vertex coordinates lie within the range (0, 1). We designed the network to output four values: the center coordinates, width, and height of the bounding box.
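One way to realize such a head is to squash the four raw network outputs into (0, 1) with a sigmoid and convert center/size to corner coordinates. Both the sigmoid squashing and the corner conversion are illustrative assumptions; the paper only specifies the four output quantities:

```python
import math

def decode_box(raw_outputs):
    """Map four unbounded network outputs (center x, center y, width,
    height) to a (0, 1)-normalized box (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = (1.0 / (1.0 + math.exp(-v)) for v in raw_outputs)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```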
Training.
While training in the virtual environment, we used the Adam [17] optimizer with a learning rate of . and trained the agent using the SAC algorithm [18] for 3.2M frames. For transferring the agent, we used the same optimizer with a learning rate of 0.001.