OpenLORIS-Object: A Robotic Vision Dataset and Benchmark for Lifelong Deep Learning
Qi She, Fan Feng, Xinyue Hao, Qihan Yang, Chuanlin Lan, Vincenzo Lomonaco, Xuesong Shi, Zhengwei Wang, Yao Guo, Yimin Zhang, Fei Qiao, Rosa H. M. Chan
Abstract — The recent breakthroughs in computer vision have benefited from the availability of large representative datasets (e.g., ImageNet and COCO) for training. Yet, robotic vision poses unique challenges for applying visual algorithms developed from these standard computer vision datasets, due to their implicit assumption of non-varying distributions for a fixed set of tasks. Fully retraining models each time a new task becomes available is infeasible due to computational, storage, and sometimes privacy issues, while naïve incremental strategies have been shown to suffer from catastrophic forgetting. It is crucial for robots to operate continuously under open-set and detrimental conditions with adaptive visual perceptual systems, where lifelong learning is a fundamental capability. However, very few datasets and benchmarks are available to evaluate and compare emerging techniques. To fill this gap, we provide a new lifelong robotic vision dataset ("OpenLORIS-Object") collected via RGB-D cameras. The dataset embeds the challenges faced by a robot in real-life applications and provides new benchmarks for validating lifelong object recognition algorithms. Moreover, we provide a testbed of state-of-the-art lifelong learning algorithms, each evaluated with recognition tasks and evaluation metrics over the OpenLORIS-Object dataset. The results demonstrate that the object recognition task in environments of ever-changing difficulty is far from being solved, and that the bottlenecks lie in the forward/backward transfer designs. Our dataset and benchmark are publicly available at https://lifelong-robotic-vision.github.io/dataset/object.

I. INTRODUCTION

Humans have the remarkable ability to learn continuously from external environments and inner experiences.
One of the grand goals of robotics is building an artificial "lifelong learning" agent that can shape a cultivated understanding of the world from the current scene and its previous knowledge via autonomous lifelong development. Recent advances in computer vision and deep learning have been achieved through the emergence of large-scale datasets, such as ImageNet [1] and COCO [2]. The breakthroughs in object classification, detection, and segmentation heavily depend on the availability of these large representative datasets for training. However, robotic vision poses new challenges for applying visual algorithms developed from computer vision datasets in real-world applications, due to their implicit assumptions of non-varying distributions for a fixed set of categories and tasks. In practice, the deployed model cannot autonomously learn and adapt as new data comes in.

Affiliations: Robot Innovation Lab, Intel Labs, Beijing, China; Department of Electrical Engineering, City University of Hong Kong, China; Department of Electronic Engineering, Tsinghua University, China; Beijing University of Posts and Telecommunications, China; School of Electronic Information, Wuhan University, China; Department of Computer Science and Engineering, University of Bologna, Bologna, Italy; Insight Centre for Data Analytics, Dublin City University, Ireland; The Hamlyn Centre for Robotic Surgery, Imperial College London, UK. Corresponding author: [email protected], [email protected]

Fig. 1: OpenLORIS robotic platform (left) mounted with multiple sensors (right). In the OpenLORIS-Object dataset, the RGB-D data is collected from the depth camera.

The semantic concepts of the real environment change dynamically over time. In real scenarios, robots should be able to operate continuously under open-set and sometimes detrimental conditions, which requires lifelong learning capability with reliable uncertainty estimates and robust algorithm designs [3].
Providing a robotic vision dataset collected from time-varying environments can accelerate both research and applications of visual models for robotics. The ideal dataset should contain objects recorded in high-variance environments, e.g., with variations of illumination and clutter. The presence of temporally coherent sessions (i.e., videos where the robot, mounted with the camera, gently moves around the objects) is another key feature, since temporal smoothness can be used to simplify object detection, improve classification accuracy, and address unsupervised scenarios [4]. In this work, we utilize a real robot mounted with multiple high-resolution sensors (e.g., depth and IMU, see Fig. 1) to actively collect data from real-world objects in several typical scenarios, such as homes, offices, and malls. We consider the variations of illumination, occlusion of the objects, object size, camera-object distance/angle, and clutter. These real-world factors encountered by robots after deployment are explored and quantified at different difficulty levels. Specifically, we quantify the learning capability of the robotic vision system when faced with objects appearing in dynamic environments.
When evaluating the existing algorithms designed for real-world lifelong learning, utilizing the whole historical data for retraining the model (i.e., the cumulative approach) is impractical for application engineering, because this kind of method needs to 1) store all data streams; and 2) retrain the model with the whole dataset once a new datum is available. In a real-world deployment, it is required to update the trained model with only the new data, due to computation [5], memory constraints [6], or privacy issues (no access to, or storage of, previous data) [7]. The vanilla learning approach is the transfer learning or fine-tuning method, which applies previously learned knowledge to quite similar domain tasks [8]. Yet, these methods cannot solve the concept drift problem when encountering a dissimilar domain dataset [9], and their performance on previous tasks degrades after learning a new task due to the catastrophic forgetting problem [10]–[12].

For quantifying the lifelong learning capability of robotic vision systems, three-fold contributions are made in this paper:

• We provide a novel RGB-D object dataset for (L)ifel(O)ng (R)obotic V(IS)ion research, called OpenLORIS-Object. The dataset is collected via depth cameras under dynamic environments with diverse illumination, occlusion, object size, camera-object distance/angle, and clutter.

• We release benchmarks for evaluating the lifelong learning capability of robotic vision under ever-changing difficulty.

• We conduct a comprehensive analysis of state-of-the-art lifelong learning algorithms, each evaluated with recognition tasks on OpenLORIS-Object, which demonstrates the bottlenecks of the SOTAs for learning continuously in a real-life scenario.

II. RELATED WORK
A. Lifelong Learning Algorithms
The ultimate lifelong robotic vision system should embed five capabilities: 1) learn new knowledge and further summarize patterns from the data; 2) avoid catastrophic forgetting and keep the memory of old knowledge, especially for the widely-used deep learning techniques; 3) generalize and adapt well to future time-varying distributions. Traditionally, supervised classifiers are trained on one distribution and often fail when faced with a quite different distribution, and most current adaptation algorithms may suffer from performance degradation; 4) be equipped with few-shot/zero-shot learning capability to learn from small datasets; and 5) be able to learn from a theoretically-infinite stream of examples using limited time and memory. In this work, we focus on evaluating the existing algorithms for enabling the first three capabilities on the lifelong object recognition task. We note that the first three points can be viewed as evaluating the performance of the algorithms over current, previous, and future tasks, respectively. However, a well-known constraint called the stability-plasticity dilemma [13] impedes this continual adaptation of learning systems. Learning in a parallel and distributed system needs plasticity for integrating new knowledge, but also stability to prevent forgetting of previous knowledge.
Too much plasticity leads to previously learned patterns being forgotten, whereas too much stability prevents efficient encoding of new knowledge. Thus, recent deep neural network-based algorithms try to achieve a trade-off between stability and plasticity when learning continuously from a high-dimensional data space (e.g., images).

Conceptually, these approaches can be divided into: 1) methods that retrain the whole network while regularizing the model parameters learned from previous tasks, e.g., Learning without Forgetting (LwF) [14], Elastic Weight Consolidation (EWC) [11], and Synaptic Intelligence (SI) [15]; 2) methods that dynamically expand/adjust the network architecture when learning new tasks, e.g., Context-dependent Gating (XdG) [16] and Dynamic Expandable Network (DEN) [17]; 3) rehearsal approaches, which save raw samples as memory of past tasks. These samples are used to maintain knowledge about the past in the model and are replayed together with samples drawn from the new task during training, e.g., Incremental Classifier and Representation Learning (iCaRL) [18]; and generative replay approaches, which train generative models on the data distribution and can afterwards sample data from experience when learning new data, e.g., Deep Generative Replay (DGR) [19], DGR with dual memory [20], and DGR with feedback [21]. Most current generative models are based on the standard Generative Adversarial Network and its extensions [22], [23]. For robotic vision, ideally, lifelong learning should be triggered by the availability of short videos of single objects and performed online on the hardware with fine-grained updates, while the mainstream methods we study are limited to much lower temporal precision, as are our previous sequential learning models [24], [25].

For evaluating these algorithms, standard datasets like MNIST [26] and CUB-200 [27] are normally utilized.
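The regularization family above shares a common mechanism: a quadratic penalty that discourages parameters important to earlier tasks from drifting. Below is a minimal NumPy sketch of such an EWC-style penalty; the function name and toy values are ours, and a real implementation would estimate the per-parameter importance (diagonal Fisher information) from gradients of the previous task's loss:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC-style quadratic penalty: lam/2 * sum_i F_i * (theta_i - theta*_i)^2.

    theta      -- current parameters (flat array)
    theta_star -- parameters saved after the previous task
    fisher     -- per-parameter importance (diagonal Fisher estimate)
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

# Toy usage: parameters with high Fisher values are penalized more for drifting.
theta_star = np.array([1.0, -2.0])
fisher = np.array([10.0, 0.1])   # first parameter matters for the old task
drift_important = ewc_penalty(np.array([1.5, -2.0]), theta_star, fisher)
drift_unimportant = ewc_penalty(np.array([1.0, -1.5]), theta_star, fisher)
assert drift_important > drift_unimportant
```

In training, this term is simply added to the new task's loss, so plasticity (the task loss) and stability (the penalty) are traded off via lam.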
However, the visual algorithms developed from these computer vision datasets have not been concerned with learning under ever-changing difficulty. They simplify the lifelong learning problem to recognizing new object instances or classes in very constrained environments. Moreover, current robotic vision datasets also have limitations for evaluating lifelong learning algorithms, because they either neglect some crucial challenges in the environment or have not explicitly quantified the difficulty levels of these challenges. Thus, to push the boundary of practical lifelong object recognition algorithms, a dataset is required that considers the real-world challenges the robot encounters and formulates the benchmark as a testbed of object recognition algorithms.
B. Related Datasets
One of the most relevant robotic vision datasets is the RGB-D Object Dataset (ROD) [33], which has become a standard benchmark in the robotics community for the object recognition task. Although the dataset is well organized and contains many everyday household objects, it has been acquired under a very constrained setting and neglects some crucial challenges that a robot faces in real-world deployment. Another recently proposed dataset is the Autonomous Robot Indoor Dataset (ARID) [35], whose data is collected by a robot patrolling in a defined human environment. Analogously to ROD, the object instances in ARID are organized into various categories. The dataset is manually designed to include real-world characteristics such as variation in lighting conditions, object scale, and background, as well as occlusion and clutter. ARID seems similar to our dataset, as both consider the real-world challenges (e.g., illumination, occlusion) that robots may naturally encounter; however, two main differences exist. First, OpenLORIS-Object rigorously isolates each characteristic/environment factor of the dataset, such as illumination, occlusion, object pixel size, and clutter, and defines the difficulty levels of each factor explicitly, whereas ARID contains these challenges in an implicit manner. Although these challenges do occur implicitly in real-life deployment, during system design and development a dataset with only implicit variants cannot benefit the system as much as ours, which provides the difficulty levels of each challenge explicitly. Second, OpenLORIS-Object is designed for evaluating the lifelong learning capability of the robotic vision system.

Dataset | Illumination | Occlusion | Dimension (pixel) | Clutter | Context | Quantifiable | Acquisition
COIL-100 [28] | normal | no | 30-200 | simple | home | ✗ | turntable
NORB [29] | weak/normal/strong | no | <30, 30-200 | simple | outdoor | ✗ | turntable
Oxford Flowers [30] | weak/normal | no | 200 | simple | outdoor | ✗ | websites
CIFAR-100 [31] | normal | no | >200 | simple | outdoor | ✗ | websites
CUB-200 [32] | normal | few | 200 | simple | outdoor | ✗ | websites
ROD [33] | normal | few | <30, 30-200 | regular | home | ✗ | turntable
CORe50 [34] | normal | no | 30-200 | simple | home/outdoor | ✗ | hand hold
ARID [35] | weak/normal | few | 30-200 | regular/complex | home | ✗ | robot
Ours: OpenLORIS-Object | weak/normal/strong | no/25%/50% | <30, 30-200, >200 | simple/regular/complex | home/office/mall | ✓ | robot

TABLE I: OpenLORIS-Object compared with other object recognition datasets. This summary focuses on the variations in illumination, occlusion, object dimension (pixel size) in the image, clutter, and context information, and on whether these characteristics are provided in an explicit way (so that they can be quantified) or an implicit way (the characteristics cannot be isolated and the difficulty levels cannot be defined explicitly; thus one cannot rigorously identify how lifelong object recognition algorithms perform w.r.t. the real-world challenges).
Thus, we have provided benchmarks for several "ever-changing difficulty" scenarios. The Continual Object Recognition Dataset (CORe50) [34] is a collection of domestic objects for evaluating the continual learning capability of models. Different from OpenLORIS-Object, it focuses on incrementally recognizing new instances or new classes, whereas we focus on how to learn objects under varying environmental conditions, which is essential for enabling robots to perform continuously and robustly in dynamic environments. Moreover, objects in CORe50 are handheld by the operator, and the camera point-of-view is that of the operator's eyes, while OpenLORIS-Object is more suitable for autonomous system development because the data is acquired by real robots mounted with depth cameras, in an active vision manner.

The Non-I.I.D. Image dataset with Contexts (NICO) [42] supports testing lifelong learning algorithms on non-I.I.D. data and focuses on the context variants of the same object. We highlight that NICO mixes the effects of the factors we consider, whereas we decompose OpenLORIS-Object into orthogonal contexts, e.g., illumination, occlusion, object pixel size, and clutter. Note that the context in our dataset is classified into home, office, and shopping mall scenarios, and we admit that these also contain mixed factors as in NICO (we tried to keep the other factors at the normal level). More robotic vision datasets exist in lifelong learning research [36], but in this work the compared datasets focus on the object recognition problem.

We briefly summarize the features of OpenLORIS-Object compared with others in Table I. It demonstrates that ours is quantifiable and more complete w.r.t. the real-life challenges for robotic object recognition.

III. OPENLORIS-OBJECT DATASET
A. Dataset Collection
Several ground robots mounted with depth cameras and other sensors are used for the data collection. These robots move in offices, homes, and malls, where the scenes are diverse and changing all the time. In the OpenLORIS-Object dataset, we provide RGB-D videos of the objects.
B. Dataset Details
We include the common challenges that a robot usually faces, such as illumination, occlusion, camera-object distance, etc. Furthermore, we explicitly decompose these factors from real-life environments and quantify their difficulty levels. In summary, to better understand which characteristics of robotic data negatively influence lifelong object recognition, we independently consider: 1) illumination, 2) occlusion, 3) object size, 4) camera-object distance, 5) camera-object angle, and 6) clutter.

1) Illumination. The illumination can vary significantly across time, e.g., day and night. We repeat the data collection under weak, normal, and strong lighting conditions, respectively. The task becomes challenging when the light is very weak.

2) Occlusion. Occlusion happens when a part of an object is hidden by other objects, or only a portion of the object is visible in the field of view. Occlusion significantly increases the difficulty of recognition.

3) Object size. Small objects make the task challenging, like dry batteries or glue sticks.

4) Camera-object distance. It affects the actual number of pixels the object occupies in the image.

5) Camera-object angle. The angle between the camera and the object affects the attributes detected from the object.

6) Clutter. The presence of other objects in the vicinity of the considered object may interfere with the classification task.

Level | Illumination | Occlusion (percentage) | Object Pixel Size (pixels) | Clutter | Context
1 | Strong | 0% | >200×200 | Simple | Home/office/mall
2 | Normal | 25% | 30×30 - 200×200 | Normal | Home/office/mall
3 | Weak | 50% | <30×30 | Complex | Home/office/mall

TABLE II: Details of each level of the real-life robotic vision challenges.

Metric | Definition
Accuracy | Σ_{i≥j} R_{i,j} / (N(N+1)/2)
BWT | Σ_{i>j} R_{i,j} / (N(N−1)/2)
FWT | Σ_{i<j} R_{i,j} / (N(N−1)/2)
Over-all accuracy | Σ_{i,j} R_{i,j} / N²

TABLE III: Evaluation metrics computed from the train-test accuracy matrix R (see Table IV), where N is the number of tasks.

The first version of OpenLORIS-Object is a collection of everyday-object instances spanning multiple categories of daily necessities, recorded under several scenes. For each instance, a video of tens of seconds has been recorded with a depth camera, delivering several hundred frames, from which distinguishable object views are manually picked and provided in the dataset. Four environmental factors, each with three level changes, are considered explicitly: illumination variants during recording, occlusion percentage of the objects, object pixel size in each frame, and the clutter of the scene. Note that the variables 3) object size and 4) camera-object distance are combined, because in real-world scenarios it is hard to distinguish the effects of these two factors in the actual data collected from mobile robots, but we can roughly identify their joint effect on the actual pixel sizes of the objects in the frames. Variable 5) is treated as the different recorded views of the objects. The three difficulty levels defined for each factor are shown in Table II (in total we have twelve levels w.r.t. the environment factors across all instances). Levels 1, 2, and 3 are ranked with increasing difficulty.

For each instance at each level, we provide a set of samples, each with an RGB and a depth image. The total number of images is thus 2 (RGB and depth) × (mean samples per instance) × (number of instances) × (factors per level) × (difficulty levels). We also provide bounding boxes and masks for each RGB image.
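For illustration, the Table II boundaries can be turned into small labeling helpers. This is a hypothetical sketch, not part of the dataset tooling: the function names, the threshold-based mapping of occlusion percentages, and the use of a single side length for object pixel size are our assumptions.

```python
def occlusion_level(pct):
    """Map an occlusion percentage to the Table II difficulty level (1-3)."""
    if pct < 25:
        return 1      # level 1: no occlusion (0%)
    if pct < 50:
        return 2      # level 2: ~25% occluded
    return 3          # level 3: ~50% occluded

def pixel_size_level(side):
    """Map the object's pixel side length to the Table II difficulty level."""
    if side > 200:
        return 1      # level 1: > 200x200 pixels (easy)
    if side >= 30:
        return 2      # level 2: 30x30 - 200x200 pixels
    return 3          # level 3: < 30x30 pixels (hard)

assert occlusion_level(0) == 1 and occlusion_level(25) == 2 and occlusion_level(50) == 3
assert pixel_size_level(250) == 1 and pixel_size_level(100) == 2 and pixel_size_level(20) == 3
```

Larger level numbers correspond to harder recognition conditions, consistent with the ranking in Table II.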
An example of two RGB-D frames of simple and complex clutter with 2D bounding box and mask annotations is shown in Fig. 2. Images under the illumination, occlusion, and clutter factors share a fixed resolution, while images under the object pixel size factor are provided at three different resolutions. Picked samples are shown in Fig. 3.

Fig. 2: Example of two RGB-D frames of simple clutter (left) and complex clutter (right) from the OpenLORIS-Object dataset, with 2D bounding box and mask annotations.

IV. EXPERIMENTS: LIFELONG OBJECT RECOGNITION WITH EVER-CHANGING DIFFICULTY

A. State-of-the-art Methods

We evaluate nine lifelong learning algorithms over the OpenLORIS-Object dataset. These methods can be classified into three categories: 1) transfer and multi-task learning: naïve and cumulative methods [34], [37]; 2) regularization approaches: Learning without Forgetting (LwF) [14], Elastic Weight Consolidation (EWC) [11], Online EWC [12], and Synaptic Intelligence (SI) [15]; and 3) generative replay approaches: Deep Generative Replay (DGR) [19], DGR with distillation [38], [39], and DGR with feedback [21]. More details of these methods can be found in the recent review [40].

B. Evaluation Metrics

Finding the metrics that are most useful for reporting lifelong learning performance is non-trivial, much like evaluating other deep learning methods [41]. In this paper, for a comprehensive analysis of SOTAs over the OpenLORIS-Object dataset, and quite differently from ROD [33], CORe50 [34], ARID [35], and NICO [42], we utilize four metrics for evaluating the performances [43]: Accuracy, Backward transfer (BWT), Forward transfer (FWT), and Over-all accuracy, as shown in Table III. Note that these metrics still focus on the accuracy aspect of learning, while ignoring the key indicators of computational efficiency and memory storage.

During the lifelong learning process, the data D is a potentially infinite sequence of unknown distributions D = {D_1, ..., D_N}.
For dataset D_n, we denote the training set Tr_n and the testing set Te_n, and define the task T_n as recognizing the object categories in this dataset. A train-test accuracy matrix R ∈ R^{N×N} contains in each entry R_ij the testing classification accuracy on dataset Te_j after training the model on dataset Tr_i, as shown in Table IV.

Fig. 3: Picked samples of objects (rows) under multiple-level environment conditions (columns). The variants from left to right are illumination (weak, normal, and strong); occlusion (0%, 25%, and 50%); object pixel size (<30×30, 30×30-200×200, and >200×200); clutter (simple, normal, and complex); and multi-views of the objects. (Note that we use different views as training samples for each difficulty level in each factor.)

R | Te_1 | Te_2 | ... | Te_N
Tr_1 | R_11 | R_12 | ... | R_1N
Tr_2 | R_21 | R_22 | ... | R_2N
... | ... | ... | ... | ...
Tr_N | R_N1 | R_N2 | ... | R_NN

TABLE IV: Train-test accuracy matrix R, where Tr = training data, Te = testing data, and R_ij = classification accuracy of the model trained on Tr_i and tested on Te_j. The number of tasks is N, and a fixed train/test split is used.

The Accuracy metric considers the performance of the model at every timestep i, which better characterizes the dynamics of the learning algorithms (average of the white and gray elements in Table IV); BWT evaluates the memorization capability of the algorithms, measuring the accuracy over previously encountered tasks (average of the gray elements in Table IV); FWT measures the influence of learning the current task on the performance of future tasks (average of the cyan elements in Table IV); and Over-all accuracy summarizes the performance on all previous, current, and future tasks, which can be viewed as an overall metric for a specific model.
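The four metrics in Table III can be computed directly from the train-test accuracy matrix R. A short NumPy sketch (the function name is ours):

```python
import numpy as np

def lifelong_metrics(R):
    """Compute the four evaluation metrics from the N x N train-test matrix R,
    where R[i, j] is the accuracy on test set j after training on task i."""
    N = R.shape[0]
    lower = np.tril_indices(N)                  # i >= j: current + previous tasks
    strictly_lower = np.tril_indices(N, k=-1)   # i > j: previous tasks (BWT)
    strictly_upper = np.triu_indices(N, k=1)    # i < j: future tasks (FWT)
    return {
        "accuracy": R[lower].sum() / (N * (N + 1) / 2),
        "bwt": R[strictly_lower].sum() / (N * (N - 1) / 2),
        "fwt": R[strictly_upper].sum() / (N * (N - 1) / 2),
        "overall": R.sum() / (N * N),
    }

# Toy example with N = 2: some forgetting on task 1, no forward transfer.
R = np.array([[1.0, 0.0],
              [0.5, 1.0]])
m = lifelong_metrics(R)
assert abs(m["bwt"] - 0.5) < 1e-9 and abs(m["fwt"] - 0.0) < 1e-9
```

A cumulative learner would keep the lower triangle high (large BWT), while a method with good generalization would also raise the upper triangle (large FWT).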
C. Benchmarks of Lifelong Object Recognition

a) Single factor analysis with ever-changing difficulty: The experiments are conducted on the four robotic vision challenges, each of which has three difficulty levels that can be explored under the sequential learning setting. We first investigate each individual factor and change its difficulty level continuously. Note that we keep the other factors at level 1 as in Table II when investigating each factor; e.g., while the illumination varies across weak, normal, and strong, the occlusion is kept at 0%, the object pixel size is larger than 200×200, and the clutter is simple.

Fig. 4 shows the experimental details of each factor analysis. For example, in Fig. 4(a), under illumination variants (shown in yellow bars, "factor"), the model is updated with data from difficulty levels 1, 2, and 3 (shown in blue bars, "level") for nine tasks in total (shown in green bars, "task"). We separate each difficulty level into three tasks with different object views. The same experiment is conducted on the occlusion, object pixel size, and clutter factors. For each task, the images are divided into training, testing, and validation sets.

Fig. 4: Four-factor analysis (illumination, occlusion, object pixel size, and clutter) under the sequential learning setting. Yellow bars indicate the factor encountered ("1": illumination, "2": occlusion, "3": object pixel size, and "4": clutter); blue bars highlight the difficulty levels within each factor; and green bars represent the task ID. Within each difficulty level, three tasks are provided w.r.t. their variants in object views.

The performances of all tasks (4 factors × 9 tasks/factor) have been evaluated with the four metrics (Accuracy, BWT, FWT, and Over-all accuracy) obtained from the train-test accuracy matrix as in Table IV. In detail, for each factor (comprising nine sequential tasks), we obtain one 9×9 train-test accuracy matrix; we thus obtain four matrices (one per factor) and derive the evaluation metrics from them. The results are shown in Fig. 5. Nine commonly-used lifelong learning algorithms are implemented to test the overall performances on the OpenLORIS-Object dataset. Compared with the other factors, the illumination factor is more challenging, with smaller areas over all metrics. The results also consistently convey that the low forward-transfer (FWT) accuracy across all methods and factors leads to a lower Over-all accuracy. As is known, most existing lifelong learning algorithms are designed to overcome the catastrophic forgetting problem. They focus on keeping the backward transfer (BWT, blue lines) as large as possible, approaching the Accuracy (grey lines), but ignore the forward-transfer capability in the model design. Furthermore, the cumulative approach (retraining with both previous and current data) performs best; however, it needs much more memory storage (growing linearly with the number of tasks encountered), which is impractical for a real-world deployment scenario.

Fig. 5: The spider chart of evaluation metrics: Accuracy (grey), BWT (blue), FWT (red), and Over-all accuracy (yellow) of the lifelong learning algorithms, evaluated on the illumination, occlusion, object pixel size, and clutter factors. A larger area is better. The maximum value of each evaluation metric is 1.
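The task ordering of the single-factor protocol can be sketched as follows; the three-digit ID encoding (factor digit, level digit, running task index) is our reading of the IDs shown in Fig. 4, not an official naming scheme.

```python
def task_sequence(factor):
    """Task IDs for one factor as in Fig. 4: three difficulty levels x three
    view-based tasks, learned in order of increasing difficulty.

    The ID encodes (factor, level, running task index),
    e.g. 124 = factor 1 (illumination), level 2, task 4.
    """
    ids = []
    for t in range(1, 10):            # nine sequential tasks per factor
        level = (t - 1) // 3 + 1      # tasks 1-3 -> level 1, 4-6 -> 2, 7-9 -> 3
        ids.append(factor * 100 + level * 10 + t)
    return ids

# Factor "1" is illumination; this matches the IDs shown in Fig. 4(a).
assert task_sequence(1) == [111, 112, 113, 124, 125, 126, 137, 138, 139]
```

The same generator covers the other three factors ("2": occlusion, "3": object pixel size, "4": clutter), yielding the 4 × 9 tasks evaluated above.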
More benchmarks testing the robustness of lifelong learning algorithms against randomly encountered challenges can be found in assistive robotics [44], a more realistic setting that robots may face.

b) Sequential factors analysis with ever-changing difficulty: We further explore the task learning capabilities when encountering the four factors sequentially. As shown in Fig. 6, the data from the four factors with three difficulty levels each (twelve tasks in total) are learned sequentially. We test the robustness and adaptation capabilities of the lifelong learning algorithms over this long sequence of tasks with more variants encountered.

Fig. 6: Sequential factors analysis. The models are trained on the four factors with three difficulty levels continuously.

The evaluation results of the sequential tasks combining all factors are shown in Fig. 7. The Accuracy, BWT, FWT, and Over-all accuracy degrade significantly compared with the single-factor analysis as more high-variant tasks are learned. This conveys that the existing methods are not stable and robust enough to deal with long sequential and high-variant tasks.

We note that the current experiments focus on lifelong/continual learning capability; we have not designed more specific algorithms/modules for the illumination, occlusion, small-object detection, and complex-clutter recognition problems themselves.
Fig. 7: Evaluation results of sequential task learning.

V. CONCLUSION

In order to enable the robotic vision system with lifelong learning capability, we provide a novel lifelong object recognition dataset ("OpenLORIS-Object"). The dataset embeds the real-world challenges (e.g., illumination, occlusion, object pixel size, clutter) faced by a robot after deployment and provides novel benchmarks for validating state-of-the-art lifelong learning algorithms. With intensive experiments (nine algorithms, each evaluated on the sequential learning tasks above), we have found that the object recognition task in environments of ever-changing difficulty (dynamic environments) is far from being solved, which is another challenge besides object recognition in static scenes. Two main conclusions can be drawn:

• The bottlenecks of developing robotic perception systems in real-world scenarios are the forward and backward transfer model designs, which determine how knowledge can transfer across different scenes.

• Under ever-changing difficulty environments, the SOTAs degrade sharply with more tasks, which demonstrates that the current learning algorithms are not robust enough, far from being deployed to the real world.

VI. ACKNOWLEDGEMENT

The work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 11215618). The authors would like to thank Hong Pong Ho from the Intel RealSense Team for the technical support of RealSense cameras for recording the high-quality RGB-D data sequences. We thank Yang Peng, Kelvin Yu, and Dion Gavin Mascarenhas for data collection and labeling.

REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 248–255.
[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
[3] N. Sünderhauf, O. Brock, W. Scheirer, R. Hadsell, D. Fox, J. Leitner, B. Upcroft, P. Abbeel, W. Burgard, M. Milford et al.
, "The limits and potentials of deep learning for robotics," The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 405–420, 2018.
[4] D. Maltoni and V. Lomonaco, "Semi-supervised tuning from temporal coherence," in . IEEE, 2016, pp. 2509–2514.
[5] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, "Efficient machine learning for big data: A review," Big Data Research, vol. 2, no. 3, pp. 87–93, 2015.
[6] D. Lopez-Paz et al., "Gradient episodic memory for continual learning," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 6467–6476.
[7] P. Mohassel and Y. Zhang, "SecureML: A system for scalable privacy-preserving machine learning," in . IEEE, 2017, pp. 19–38.
[8] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[9] I. Khamassi, M. Sayed-Mouchaweh, M. Hammami, and K. Ghédira, "Discussion and review on evolving data streams and concept drift adapting," Evolving Systems, vol. 9, no. 1, pp. 1–23, 2018.
[10] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, "An empirical investigation of catastrophic forgetting in gradient-based neural networks," arXiv preprint arXiv:1312.6211, 2013.
[11] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences (PNAS), pp. 3521–3526, 2017.
[12] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell, "Progress & compress: A scalable framework for continual learning," in Proceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 4535–4544.
[13] M. Mermillod, A. Bugaiska, and P.
Bonin, “The stability-plasticitydilemma: Investigating the continuum from catastrophic forgetting toage-limited learning effects,” Frontiers in Psychology , vol. 4, p. 504,2013.[14] Z. Li and D. Hoiem, “Learning without forgetting,” IEEE Transactionson Pattern Analysis and Machine Intelligence , vol. 40, no. 12, pp.2935–2947, 2017.[15] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synap-tic intelligence,” in Proceedings of the 34th International Conferenceon Machine Learning (ICML) , vol. 70, 2017, pp. 3987–3995.[16] N. Y. Masse, G. D. Grant, and D. J. Freedman, “Alleviating catastrophicforgetting using context-dependent gating and synaptic stabilization,” Proceedings of the National Academy of Sciences (PNAS) , vol. 115,no. 44, pp. 467–475, 2018.[17] J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong learning withdynamically expandable networks,” arXiv preprint arXiv:1708.01547 ,2017.[18] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl:Incremental classifier and representation learning,” in Proceedings ofthe IEEE conference on Computer Vision and Pattern Recognition(CVPR) , 2017, pp. 2001–2010.[19] H. Shin, J. K. Lee, J. Kim, and J. Kim, “Continual learning withdeep generative replay,” in Advances in Neural Information ProcessingSystems (NIPS) , 2017, pp. 2990–2999.[20] N. Kamra, U. Gupta, and Y. Liu, “Deep generative dual memorynetwork for continual learning,” arXiv preprint arXiv:1710.10368 ,2017.[21] G. M. van de Ven and A. S. Tolias, “Generative replay with feedbackconnections as a general strategy for continual learning,” arXiv preprintarXiv:1809.10635 , 2018. [22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”in Advances in Neural Information Processing Systems (NIPS) , 2014,pp. 2672–2680.[23] Z. Wang, Q. She, and T. E. Ward, “Generative adversarial networks:A survey and taxonomy,” arXiv preprint arXiv:1906.01529 , 2019.[24] Q. 
She, Y. Gao, K. Xu, and R. H. Chan, “Reduced-rank linear dynamicalsystems,” in Thirty-Second AAAI Conference on Artificial Intelligence(AAAI) , 2018.[25] Q. She and A. Wu, “Neural dynamics discovery via gaussian processrecurrent neural networks,” arXiv preprint arXiv:1907.00650 , 2019.[26] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al. , “Gradient-basedlearning applied to document recognition,” Proceedings of the IEEE ,vol. 86, no. 11, pp. 2278–2324, 1998.[27] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, andP. Perona, “Caltech-ucsd birds 200,” 2010.[28] S. A. Nene, S. K. Nayar, H. Murase et al. , “Columbia object imagelibrary (coil-20),” 1996.[29] Y. LeCun, F. J. Huang, L. Bottou et al. , “Learning methods for genericobject recognition with invariance to pose and lighting,” in Proceedingsof the IEEE conference on Computer Vision and Pattern Recognition(CVPR) . Citeseer, 2004, pp. 97–104.[30] M.-E. Nilsback and A. Zisserman, “Automated flower classificationover a large number of classes,” in . IEEE,2008, pp. 722–729.[31] A. Krizhevsky, G. Hinton et al. , “Learning multiple layers of featuresfrom tiny images,” Citeseer, Tech. Rep., 2009.[32] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie,“The Caltech-UCSD Birds-200-2011 Dataset,” California Institute ofTechnology, Tech. Rep. CNS-TR-2011-001, 2011.[33] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in IEEE International Conference onRobotics and Automation (ICRA) . IEEE, 2011, pp. 1817–1824.[34] V. Lomonaco and D. Maltoni, “Core50: a new dataset and benchmarkfor continuous object recognition,” in Conference on Robot Learning(CoRL) , 2017, pp. 17–26.[35] M. R. Loghmani, B. Caputo, and M. Vincze, “Recognizing objectsin-the-wild: Where do we stand?” in IEEE International Conferenceon Robotics and Automation (ICRA) . IEEE, 2018, pp. 2170–2177.[36] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song,F. Qiao, L. 
Song et al. , “Are we ready for service robots? TheOpenLORIS-Scene datasets for lifelong SLAM,” arXiv preprintarXiv:1911.05603 , 2019.[37] Y. Gao, J. Ma, M. Zhao, W. Liu, and A. L. Yuille, “Nddr-cnn: Layerwisefeature fusing in multi-task cnns by neural discriminative dimensionalityreduction,” in Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2019, pp. 3205–3214.[38] R. Venkatesan, H. Venkateswara, S. Panchanathan, and B. Li, “Astrategy for an uncompromising incremental learner,” arXiv preprintarXiv:1705.00744 , 2017.[39] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, Z. Zhang, and Y. Fu,“Incremental classifier learning with generative adversarial networks,” arXiv preprint arXiv:1802.00853 , 2018.[40] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continuallifelong learning with neural networks: A review,” Neural Networks ,2019.[41] Z. Wang, Q. She, A. F. Smeaton, T. E. Ward, and G. Healy, “Neuroscore:A brain-inspired evaluation metric for generative adversarial networks,” arXiv preprint arXiv:1905.04243 , 2019.[42] Y. He, Z. Shen, and P. Cui, “Nico: A dataset towards non-iid imageclassification,” arXiv preprint arXiv:1906.02899 , 2019.[43] N. D´ıaz-Rodr´ıguez, V. Lomonaco, D. Filliat, and D. Maltoni, “Don’tforget, there is more than forgetting: new metrics for continual learning,” arXiv preprint arXiv:1810.13166 , 2018.[44] F. Feng, R. H. Chan, X. Shi, Y. Zhang, and Q. She, “Challenges intask incremental learning for assistive robotics,”
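The forward/backward transfer identified above as the bottleneck can be measured with the standard metrics introduced for Gradient Episodic Memory (Lopez-Paz et al., NIPS 2017). The sketch below is illustrative, not the paper's evaluation code: `R[i, j]` is the test accuracy on task `j` after training on tasks `0..i`, and all numbers in the example matrix are made up.

```python
# Hedged sketch of GEM-style transfer metrics (Lopez-Paz et al., 2017).
# Not the OpenLORIS-Object benchmark code; the accuracy matrix below
# is an illustrative, made-up example.
import numpy as np

def backward_transfer(R):
    """R[i, j] = test accuracy on task j after training tasks 0..i.
    BWT < 0 indicates forgetting of earlier tasks."""
    T = R.shape[0]
    return float(np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)]))

def forward_transfer(R, b):
    """b[i] = accuracy of an untrained model on task i.
    FWT > 0 means earlier tasks help on later, unseen ones."""
    T = R.shape[0]
    return float(np.mean([R[i - 1, i] - b[i] for i in range(1, T)]))

# Illustrative accuracy matrix for 3 sequential tasks (made-up numbers):
# the diagonal is accuracy right after learning each task, and the last
# row shows how much of it survives once all tasks have been learned.
R = np.array([[0.90, 0.20, 0.15],
              [0.70, 0.85, 0.25],
              [0.50, 0.60, 0.80]])
b = np.array([0.10, 0.10, 0.10])  # random-init baseline per task

print(round(backward_transfer(R), 3))  # mean of (0.50-0.90, 0.60-0.85)
print(round(forward_transfer(R, b), 3))  # mean of (0.20-0.10, 0.25-0.10)
```

A strongly negative BWT on a growing task sequence is exactly the "sharp degradation with more tasks" noted in the conclusion, which is why the benchmark evaluates transfer rather than final accuracy alone.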