A Multisensory Learning Architecture for Rotation-invariant Object Recognition
Murat Kirtay ([email protected]), Adaptive Systems Group, Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
Guido Schillaci ([email protected]), The BioRobotics Institute, Scuola Superiore Sant'Anna, Pontedera (PI), Italy
Verena V. Hafner ([email protected]), Adaptive Systems Group, Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
ABSTRACT
This study presents a multisensory machine learning architecture for object recognition, using a novel dataset constructed with the iCub robot, which is equipped with three cameras and a depth sensor. The proposed architecture combines convolutional neural networks that form representations (i.e., features) of grayscaled color images with a multilayer perceptron that processes depth data. In this way, we learn joint representations of the different modalities (e.g., color and depth) and employ them to recognize objects. We evaluate the performance of the proposed architecture by benchmarking it against models trained separately on the inputs of the individual sensors and against a state-of-the-art data fusion technique, namely decision level fusion. The results show that our architecture improves recognition accuracy compared with the models that use input from a single modality and with the decision level multimodal fusion method.
KEYWORDS
multisensory learning architecture, object recognition, representation learning
ACM Reference Format:
Murat Kirtay, Guido Schillaci, and Verena V. Hafner. 2020. A Multisensory Learning Architecture for Rotation-invariant Object Recognition. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Vision-based robotics tasks (e.g., object recognition, vision-guided manipulation) rely on extracting useful features of the perceived object by processing different sensory readings and robustly combining them as inputs of a machine learning algorithm [2], [17]. In this study, we address the single-axis rotation-invariant object recognition task by employing a multimodal machine learning architecture on a novel multimodal dataset, which was constructed with the iCub robot using three cameras to obtain color images and a depth sensor.
The proposed architecture, i.e., intermediate representation fusion, leverages convolutional neural networks (CNNs) and a multilayer perceptron (MLP) to process grayscaled color and depth inputs, respectively. To test the recognition performance of the architecture, we benchmarked the results against different models that were trained using color and depth data. Additionally, we adopt a state-of-the-art fusion technique to benchmark the obtained results in a multimodal setting.
To further test the performance of our architecture and the decision level fusion technique, we replaced one of the camera readings with randomly constructed data of the same size as the images. Since hardware defects (e.g., miscalibrated sensors and unexpected noise sources) are common in robot experiments, we aim to assess our architecture in a setting where an employed sensor provides noisy readings for object recognition.
In this work, we use the term modality for sensory data associated with different aspects (e.g., the color and depth information) of the observed phenomena [10], [9]. In our setting, we used preprocessed color images and depth data as two different modalities, though in biological agents color and depth information emanate from the same sensory organ (i.e., the eye), to perform object recognition.
A broad range of multimodal (i.e., multisensory) learning studies has been reported in the fields of machine learning and robotics. Here, we merely introduce representative studies; more detailed information can be found in [9], [16], [4]. Since the employed dataset was constructed recently and there are no benchmarked studies yet, we present the literature in terms of the performed methods and fusion techniques. To be concrete, we leverage the intermediate representations (i.e., features) of the color information provided by three cameras to construct in-class shared representations. We use the term in-class to underline that these representations are formed from images obtained by different cameras. We then fuse these representations with depth information processed by the multilayer perceptron.
The authors of the study in [3], and similarly in [21], implemented an architecture based on a convolutional neural network (CNN) to process depth and color modalities. Two streams of CNNs for depth and preprocessed color data were merged in a fully connected layer before performing object recognition. The authors reported that recognition performance improved by applying a late fusion technique. In our architecture, we employed CNNs to process grayscaled color images that were obtained from different cameras. Since the employed dataset is modality-wise imbalanced, that is, the number of images is higher than the number of corresponding depth matrices, we first aimed at extracting the shared representations of the color modality. Then, we combined the shared representations constructed by the CNNs with the last-layer activations of an MLP operating on depth data to build a joint representation of the different sensory data.
Unlike the study in [3], we did not preprocess the depth data (i.e., colorization of the depth); instead, we test our architecture in a setting where the depth-only modality does not perform well in object recognition.
In [22], the authors first trained two deep CNN streams separately on color and depth data. Then, the activations of the fully connected layers were combined by learning shared multimodal representations. The study shows that combining depth and color information via CNNs enhances object recognition performance. The authors in [14] applied the decision level fusion technique by training different CNNs in a multisensor setting (i.e., lidar for distance and a charge-coupled device for color images) to detect and classify objects. The authors build a unary classifier for each sensory input, then fuse the decisions of the classifiers to perform object classification. Our proposed architecture shares similarities with the model introduced in [22]; however, in our setting, we did not train on the different sensory data separately. Instead, the components of our architecture (i.e., the CNN streams for images and the MLP for depth data) were trained together. Moreover, to show that the proposed architecture can achieve high accuracy, we benchmarked our results against the decision level fusion method, using a technique similar to the one introduced in [14].
The contributions of our study are as follows. First, we propose and implement a multisensory machine learning architecture that fuses intermediate representations to create in-class shared representations and combines them with depth data to perform object recognition. We show that this architecture improves recognition accuracy in a setting in which depth data alone performs poorly. Second, the results show that the proposed architecture utilizes the multimodal information better than a state-of-the-art fusion technique (that is, decision level fusion) and the models trained with single sensory data. Third, to further test the decision level fusion method and the proposed architecture, we used randomly generated images instead of the iCub's left camera inputs. The results obtained in this setting show that the performance of the decision level fusion method decreased considerably (the recognition rate dropped by roughly 4%), whereas the proposed architecture was affected less.

2 EXPERIMENTAL SETUP AND DATA ACQUISITION
This section presents the experimental setup and the data acquisition procedure. The setup consists of a multisensory-equipped iCub robot and a motorized turntable, which was positioned in front of the robot. As shown in Figure 1, two Dragonfly cameras were placed in the robot's eyes, and a RealSense d435i camera was mounted above the eyes.
Figure 1: The iCub robot equipped with three color cameras and one depth sensor.
The data acquisition procedure begins with putting an object on the turntable and rotating it by approximately five degrees at a time until a full rotation is completed. After each rotation, we recorded the color images captured by the three cameras (i.e., the iCub left and right cameras and the RealSense camera) and the depth data captured by the depth sensor (i.e., the depth channel of the RealSense camera). The same acquisition steps were repeated for all the objects in the dataset. Note that the data acquisition procedure is similar to those described in [13], [7]. However, in this dataset the number of objects, 210, is higher, and the employed sensors differ from those of similar studies. Color images and colorized depth data for some of the objects are depicted in the first and second rows of Figure 2. Detailed information on this dataset, including hardware specifications, can be found in the technical report [8].
At the end of the acquisition procedure, 72 different views of each object were obtained; for each object, 72 depth data matrices and 216 color images were collected. In total, the whole dataset provides 60480 color and depth images. Due to the size of the dataset, in this study we only used the first 100 objects, resulting in a total of 28800 color images and depth data matrices. Here, the dataset was employed to perform multimodal representation learning for object recognition. However, the models trained with this dataset can also be used in further robotic applications: grasping type prediction for different views and learning sensorimotor schemes by mapping the images to the degree of rotation [19].
Figure 2: Color images and colorized depth images.
3 METHODS
Here, we introduce the methods for data processing and object recognition. In the first subsection, we describe the steps to preprocess the color images and depth data. In the second subsection, we present the recognition algorithms that use the preprocessed color and depth data as inputs. To benchmark the performance of the recognition algorithms, we first separately employed convolutional neural networks (CNNs) for the color data and a multilayer perceptron (MLP) for the depth data. We then provide the results for the multimodal learning methods: decision level fusion as a state-of-the-art fusion technique and intermediate representation fusion, which is our proposed method.
We note that the reasons why we use the MLP for the depth modality are twofold. First, instead of implementing hand-engineered preprocessing methods such as colorizing the depth data and deriving the object mask by combining pixel values, we leverage the actual (raw) distance information. In this way, we test the performance of the implemented multimodal methods without relying on intensive preprocessing procedures; in future studies we will compare the results presented in this paper with different preprocessing methods for color and depth data. Second, we assess the multisensory learning architecture's performance in a setting in which using depth-only data leads to poor object recognition performance.
3.1 Data Preprocessing
We performed separate preprocessing steps for the color and depth images. The color (RGB) images, whose original resolution was 640 × 480 pixels, were first grayscaled to obtain a single-channel matrix and then downsized to a 32 × 32 matrix. Since the employed dataset was constructed in an indoor environment (that is, the background of the images is the same for all objects), we deem that the information needed to identify the object is preserved after the grayscaling operation. As can be seen in Figure 3, the object can be identified in both the color and grayscale versions of the image.

Figure 3: Color and grayscale images of the same object. (a) Color image; (b) grayscaled image.

Since the depth sensor provides distance information (in millimeters) for each pixel position of the color matrices, the preprocessing steps for the depth data differ from those for the color images. The depth data were downsized from a 640 × 480 matrix to a 32 × 32 matrix and flattened into a vector. Each input was associated with an output consisting of the corresponding object identifier, which was labeled during the dataset recording phase. Here, we emphasize that flattening the depth data without further preprocessing (e.g., colorization) might lead to a loss of spatial information and poor object recognition performance. However, we aimed to show that the proposed architecture mitigates the drawbacks of this minimal preprocessing while combining color and depth information for object recognition.
At the end of preprocessing, we first normalized the input matrices and vectors. Then, to present cross-validated results, we randomly grouped the inputs, with the corresponding object ids as outputs, into training, validation, and testing sets using 50%, 25%, and 25% of the whole dataset, respectively.
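A minimal preprocessing sketch consistent with the steps above; OpenCV, NumPy, and scikit-learn are assumed, the function and variable names are illustrative, and the exact resizing and normalization details of the original pipeline may differ:

```python
# Hypothetical preprocessing sketch: grayscale/resize color images,
# downsample and flatten depth maps, normalize, and split 50/25/25.
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess_color(img_bgr):
    """640x480 BGR image -> normalized 32x32 grayscale matrix."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (32, 32), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def preprocess_depth(depth_mm):
    """640x480 depth matrix (mm) -> normalized, flattened 1024-d vector."""
    small = cv2.resize(depth_mm.astype(np.float32), (32, 32),
                       interpolation=cv2.INTER_AREA)
    return (small / max(small.max(), 1.0)).flatten()  # scale to [0, 1]

def split_dataset(inputs, labels, seed=0):
    """Random 50/25/25 split into training, validation, and test sets."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        inputs, labels, train_size=0.50, random_state=seed, shuffle=True)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, random_state=seed, shuffle=True)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```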
3.2 Recognition Algorithms
In this section, we introduce the implementation details of the convolutional neural networks for the color modality and the multilayer perceptron network for the depth modality. We note that the source code, data flow diagrams, and model parameters of the networks are shared in a public repository (see Section 6).
3.2.1 Convolutional Neural Networks for Color Images. To perform object recognition using the color images of each camera, we employed a convolutional neural network (CNN) stream to extract non-hand-crafted representations (i.e., features). The CNN stream consists of three consecutive convolutional layers, where the first layer has 32 filters and the remaining layers have 64 filters each, in which the ReLU activation function is applied after the convolution operation, followed by 2 × 2 pooling. The network was trained to minimize the cross-entropy cost function using the Adam optimization algorithm [6]; if the validation loss does not improve by at least 0.01 for 20 epochs, we terminate the training. We emphasize that the recognition results were obtained from the test set.
We note that the described CNN stream was applied to the grayscaled color images obtained from the iCub's left and right cameras and from the mounted RealSense d435i color camera.
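A minimal sketch of one such CNN stream in Keras; the framework, the 3 × 3 filter size, the dense-layer width, and the reconstructed early stopping values are assumptions rather than confirmed details of the original implementation:

```python
# Minimal sketch of one CNN stream for a 32x32 grayscale input.
import tensorflow as tf

def build_cnn_stream(num_classes=100):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),   # first layer: 32 filters
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),   # remaining layers: 64 filters
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),       # assumed width
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping with a minimum improvement of 0.01 over 20 epochs
# (values as reconstructed in the text above).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=0.01, patience=20,
    restore_best_weights=True)
```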
3.2.2 Multilayer Perceptron for Depth Data. To recognize objects using the depth data, we formed a perceptron network with three hidden layers, each consisting of 256 units with the Rectified Linear Unit (ReLU) as the activation function. The network was trained for 600 epochs to minimize the cross-entropy cost function by updating the weights via the Adam optimization algorithm. Note that we use the same early stopping procedure as with the CNNs, applied over 150 epochs, and the recognition results were reported on the test set data.
The reason why we implemented a multilayer perceptron to process the depth data instead of a convolutional neural network is twofold. On the one hand, we avoid computationally intensive preprocessing (e.g., colorization of the depth data). On the other hand, we test our proposed architecture in a simple setting that does not require hand-crafted feature processing such as colorizing depth images and extracting a depth mask of the object.
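A corresponding sketch of the depth MLP, assuming Keras and a flattened 1024-dimensional depth vector; reading the "early stopping over 150 epochs" as a patience value is also an assumption:

```python
# Sketch of the depth MLP: three hidden layers of 256 ReLU units over the
# flattened depth vector, trained with Adam on the cross-entropy loss.
import tensorflow as tf

def build_depth_mlp(input_dim=1024, num_classes=100):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage with the split from the preprocessing sketch:
# mlp = build_depth_mlp()
# mlp.fit(x_depth_train, y_train, validation_data=(x_depth_val, y_val),
#         epochs=600,
#         callbacks=[tf.keras.callbacks.EarlyStopping(
#             monitor="val_loss", patience=150, restore_best_weights=True)])
```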
3.2.3 Multimodal Data Fusion. In this subsection, we describe the implementation of the multimodal data fusion techniques as follows. First, we introduce a state-of-the-art fusion technique, namely decision (or late) level fusion. This method trains models that accept each sensory input separately and then fuses the decisions made by the individual models to form a final decision. Second, we describe the implementation of the proposed architecture, i.e., intermediate representation fusion. Unlike decision level fusion, our method extracts features from the different modalities and then combines them to perform object recognition. By doing so, we aim to benchmark the recognition results of the different models and to assess the performance of our proposed architecture.
To benchmark the results against a state-of-the-art multimodal fusion method, we employed decision (or late) level fusion. This fusion technique is applied to exploit the performance of the individual models (i.e., the CNNs and the MLP) while mitigating the effect of a poorly performing learner. Here, decision level fusion refers to collecting the decisions of the three CNN streams, each trained on the images of one camera, together with the decision of the MLP trained on the depth data. Figure 4 illustrates the data flow for constructing the fused decision from the decisions made by the different models. In this architecture, a decision vector refers to a discrete probability distribution over the objects, and the object with the highest probability is considered the recognized object. In our setting, we applied the CNNs to the color data and the MLP to the depth data to obtain the decision probabilities of each model.
After this step, the final decision for an object in the test set was formed by summing the decision probabilities and extracting the object id with the highest value. Note that normalizing the decision probabilities would yield the same result. Since we used the decisions generated by the models introduced in Sections 3.2.1 and 3.2.2, the training procedure for each model is the same as described in those subsections. Concretely, each model was trained separately, and then the decision vectors were fused on the test set objects.
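A minimal sketch of the decision level fusion step described above; the model and variable names are hypothetical:

```python
# Decision level fusion: each separately trained model outputs a probability
# vector over the 100 object ids; the vectors are summed and the argmax gives
# the final recognized object.
import numpy as np

def decision_level_fusion(models, inputs):
    """models: list of trained Keras classifiers with softmax outputs.
    inputs: list of test inputs, one per model (same sample ordering)."""
    summed = None
    for model, x in zip(models, inputs):
        probs = model.predict(x)            # shape: (n_samples, 100)
        summed = probs if summed is None else summed + probs
    return np.argmax(summed, axis=1)        # recognized object ids

# Hypothetical usage:
# predicted_ids = decision_level_fusion(
#     [cnn_left, cnn_right, cnn_realsense, depth_mlp],
#     [x_left_test, x_right_test, x_rs_test, x_depth_test])
```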
Figure 4: Multisensory decision level fusion. Decision vector lengths are visualized for 4 objects, but in our experiment each decision vector has a size of 1 × 100, where each element represents a probability value over the object ids. The values of the output vectors were obtained from the models trained with the cross-entropy loss.

To explore multimodal methods further, we structured the proposed architecture to derive two different representations: an in-class shared representation and a joint representation. By in-class shared representation, we refer to the representation formed by merging the last layers of the convolutional neural networks that accept color data as inputs. We extract the joint representation by combining the outputs of the three CNN streams with the last-layer activation of the MLP, which consists of three hidden layers.
Figure 5: The proposed intermediate representation fusion architecture. The output vector lengths are visualized for 4 objects, but in our experiment each output vector has a size of 1 × 100, where each element represents a probability value over the object ids. The values of the output vector were obtained from the model trained with the cross-entropy loss.

To be concrete, as shown in Figure 5, to form the joint representation we concatenated the last layer of the multilayer perceptron network with the shared representation formed by the three CNN streams.
We then transfer the joint representation to a densely connected layer to learn the object's representation from the different sensors, color and depth. As the last step, we feed the joint representation into a fully connected output layer and extract the object id with the highest probability to perform object recognition. We highlight that this architecture was trained for 600 epochs with an early stopping condition, the same as for the single CNN implementation, to minimize the cross-entropy cost function, employing the Adam optimization algorithm to update the network's parameters.
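A sketch of the intermediate representation fusion architecture under the same assumptions as the single-stream sketches above (Keras, assumed filter and layer widths); it is meant to illustrate the data flow in Figure 5 rather than reproduce the released implementation:

```python
# Three CNN streams (one per camera) are merged into an in-class shared
# representation, concatenated with the last MLP layer for depth, and fed
# to dense layers that predict the object id. All parts are trained jointly.
import tensorflow as tf

def conv_stream(name):
    inp = tf.keras.layers.Input(shape=(32, 32, 1), name=name)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inp)
    x = tf.keras.layers.MaxPooling2D(2)(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D(2)(x)
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    return inp, tf.keras.layers.Flatten()(x)

def build_intermediate_fusion(num_classes=100):
    (in_l, f_l), (in_r, f_r), (in_c, f_c) = (conv_stream("icub_left"),
                                             conv_stream("icub_right"),
                                             conv_stream("realsense_color"))
    # In-class shared representation of the color modality.
    shared = tf.keras.layers.Concatenate()([f_l, f_r, f_c])

    in_d = tf.keras.layers.Input(shape=(1024,), name="depth")
    d = in_d
    for _ in range(3):                      # three hidden layers of the MLP
        d = tf.keras.layers.Dense(256, activation="relu")(d)

    # Joint representation of color and depth, then classification.
    joint = tf.keras.layers.Concatenate()([shared, d])
    h = tf.keras.layers.Dense(256, activation="relu")(joint)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(h)

    model = tf.keras.Model([in_l, in_r, in_c, in_d], out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```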
4 RESULTS
This section presents the results obtained with the machine learning models that use single and multiple sensory inputs for object recognition. To evaluate the results, we first provide the numeric values of the recognition performance metrics: accuracy, precision, recall, and F1 score. The derivation of these metrics was adopted from [15]. Then, we illustrate confusion matrices as heatmaps to visually describe the recognition performance of the models.
We additionally compared the performance of the proposed architecture and decision level fusion in a setting where the iCub left camera inputs were converted to matrices whose elements were randomly assigned values between 0 and 255. In this way, we provide a use case for evaluating the performance of multimodal fusion methods when one sensor cannot provide useful information for object recognition.

                              Accuracy  Precision  Recall  F1-score
  iCub left camera (CNN)      0.9517    0.9560     0.9517  0.9520
  iCub right camera (CNN)     0.9317    0.9465     0.9317  0.9331
  RealSense color (CNN)       0.9422    0.9485     0.9422  0.9427
  RealSense depth (MLP)       0.4256    0.5511     0.4256  0.4425
  Decision level fusion       0.9689    0.9710     0.9689  0.9690
  Intermediate fusion         0.9822    0.9842     0.9822  0.9825
Table 1: Average values of the recognition metrics based on sensor type and fusion architecture.
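The weighted metrics in Table 1 follow the scikit-learn [15] definitions; a minimal sketch of how such values can be computed, with hypothetical variable names:

```python
# Weighted recognition metrics; y_true and y_pred are illustrative arrays of
# actual and predicted object ids on the test set.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def recognition_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }
```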
Table 1 shows the weighted average values of the recognition metrics for each model, grouped by sensor name and employed method for object recognition. Here, the number of correct recognitions in the test set was used as the support values (i.e., weights). The first column of Table 1 lists the sensor names for the single sensory inputs and the multimodal data fusion techniques. In particular, the row titled "Intermediate fusion" indicates our proposed architecture. The remaining columns show the values of accuracy, precision, recall, and F1 score, respectively. Although we evaluate the results based on the accuracy and F1 scores, the other metrics are provided for further benchmarking in our future studies. We note that the accuracy is derived as the number of correctly recognized objects in the test set divided by the number of all predictions, and the F1 score is calculated as the harmonic mean of the precision and recall values to express the effect of the two metrics in a single value.
Based on the Table 1 entries, using the preprocessed color data as inputs of the convolutional neural networks achieves similar performance across cameras, with accuracy ranging from 93% to 95%; the same trend can be observed for the F1 score. Employing the depth data as input of the multilayer perceptron yields 42% accuracy, which is lower than for the color data but substantially higher than the chance level of 0.01.
We emphasize that this accuracy rate can be further improved by applying preprocessing techniques such as extracting the region of interest and colorizing the depth data; methods for improving the recognition rate of the depth modality are discussed in Section 5. We conclude that object recognition can be achieved in a single sensory setting by using the color modality, with higher accuracy than with the depth data.
Although high recognition accuracy cannot be achieved using the depth modality alone, we exploit the effect of the depth data in the multimodal learning methods. The last two rows of Table 1 were obtained by applying decision level fusion and intermediate representation fusion, the latter presented in the last row of the table. From these entries, we conclude that our proposed multisensory learning architecture provides the highest accuracy, and the difference from decision level fusion, the closest competitor, is 1.33 percentage points.

Table 2: Multimodal fusion method performance metrics obtained by employing random values instead of the iCub left camera's color images.
Table 2 shows the recognition metrics obtained with the decision level fusion and intermediate fusion methods in this setting. The first column of Table 2 indicates the performed method, and the remaining columns show the derived values of accuracy, precision, recall, and F1 score, respectively. As can be seen in the corresponding rows of Table 1 and Table 2, using randomly generated inputs instead of the color images decreases the recognition rate of decision level fusion by roughly 4%, whereas the proposed architecture is more robust to the corrupted readings of one of the sensors (i.e., the iCub's left camera) while processing the multimodal data.
We hold that this setting can be seen as a realistic scenario for object recognition in robotic applications, in which a sensor can be improperly calibrated or generate noisy inputs. Here, we show that the proposed architecture can overcome this deficiency in one of the sensors while performing object recognition.

Figure 6: Confusion matrices for the unimodal and multimodal methods: (a) the iCub left camera, (b) the iCub right camera, (c) RealSense color, (d) RealSense depth, (e) decision level fusion, (f) intermediate representation fusion. The rows indicate the actual object ids and the columns show the predicted object ids.

To visually present the recognition results, we depict confusion matrices as heatmaps in Figure 6, using only the test set data. The rows and columns of these matrices indicate the actual object id and the recognized (or predicted) object id, respectively. The uniformity of the main diagonal of these matrices reflects the model's performance: the darker the main diagonal, the better the performance, whereas a sparsely populated confusion matrix corresponds to a lower recognition rate.
The confusion matrices for the color and depth inputs are shown in Figures 6(a), 6(b), 6(c), and 6(d). Figures 6(e) and 6(f) depict the confusion matrices of the multimodal methods, which use both the preprocessed color and depth inputs. Based on these heatmaps, the confusion matrix for the depth modality is sparse along the main diagonal, which corresponds to poor recognition performance. For the color modality, the matrices are densely populated along the main diagonal, which indicates better recognition performance than for the depth modality.
The matrices for decision level fusion and intermediate representation fusion have more uniform main diagonals, which is why these methods yield high recognition accuracy, as shown in Table 1. We emphasize that the characteristics of the main diagonals are in accord with the numeric accuracy values in Table 1. For instance, the confusion matrix of intermediate representation fusion shows almost no sparsity, whereas the confusion matrix of the depth modality is highly sparse.
We highlight the following conclusions based on the entries in Table 1 and Table 2 and the confusion matrices in Figure 6. Since the characteristics of the main diagonals differ among the color sensors, i.e., the color code of the main diagonal varies in distinct locations along the line, each CNN model recognizes different objects well. To exploit this observation, our intermediate representation fusion method combines the multisensor inputs in a way that captures the object's features and leads to better recognition performance. Based on the accuracy metric in Table 1, we observe that our proposed multimodal architecture leverages this observation better than the decision level fusion method and the other benchmarked models that use single sensory input.
Moreover, the recognition metrics in Table 2 show that our method's performance degraded less than that of the decision level fusion method in the setting where one sensory reading was manually contaminated with random values. This indicates that the proposed architecture is suitable for robotic applications (e.g., multisensory guided grasping) in environments where external noise and hardware deficiencies are unpredictable.
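As a rough illustration of how confusion-matrix heatmaps such as those in Figure 6 are typically produced (not necessarily the authors' plotting code), assuming scikit-learn and matplotlib with hypothetical variable names:

```python
# Confusion matrix heatmap: rows are actual object ids, columns are predicted
# object ids; a dense, dark main diagonal indicates good recognition.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_heatmap(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.imshow(cm, cmap="viridis")
    plt.title(title)
    plt.xlabel("Predicted object id")
    plt.ylabel("Actual object id")
    plt.colorbar()
    plt.show()
```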
5 DISCUSSION
In this section, we discuss steps to improve each model's performance in order to guide future studies. Recall that, in this paper, we aimed to present our multisensory learning architecture and to provide baseline results with models trained on single sensory and multisensory inputs.
As presented in Section 4, the recognition rates for the color data are higher (above 90%) than for the depth data. These accuracy rates indicate that the convolutional neural networks succeed in forming useful representations for object recognition. We envision that the results can be further improved, setting aside the computational cost of training, in the following ways. The structure of the CNNs can be modified by increasing the number of filters, applying dropout, and adding more fully connected layers. Transfer learning can also be performed for the color data by adapting pre-trained ResNet and VGGNet models to object recognition on the dataset employed in our study [20], [5]; a minimal sketch of this option is given at the end of this section.
The steps for reducing the shortcomings of using depth data can be listed as follows. The depth data can be colorized by using the JetColor map or deep depth colorization methods [1], [12]; then, a convolutional neural network can be employed to perform object recognition. Alternatively, hand-crafted depth encoding algorithms can be applied to the raw depth data to form feature vectors before using the multilayer perceptron. For instance, the background can be removed from the depth data, and the color images can be employed to extract a depth mask of the object; the depth mask can then be used as input of the MLP [11].
The multimodal learning techniques performed in this study result in better performance than the models that use single sensory inputs. Decision level fusion exploits the multimodal data by summing the discrete probability distributions over the classes obtained from the single-sensor models. Unlike decision level fusion, our proposed method achieves high accuracy by extracting robust features from the color images via the CNNs and combining these representations with those of the MLP network that uses the depth data.
We conclude that the suggested methods for improving the results obtained with the color and depth modalities will also enhance the performance of the multimodal learning techniques.
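As an illustration of the transfer learning suggestion above, a minimal sketch that adapts a pre-trained ResNet backbone [5] to the 100-object task; the choice of ResNet50, ImageNet weights, the input resizing, and the head layers are all assumptions and are not part of the presented experiments:

```python
# Illustrative transfer-learning sketch: freeze a pre-trained ResNet50 and add
# a small classification head. Color images would need to be resized to
# 224x224 and replicated to 3 channels before use.
import tensorflow as tf

def build_transfer_model(num_classes=100):
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet",
        input_shape=(224, 224, 3), pooling="avg")
    backbone.trainable = False  # keep pre-trained features fixed
    model = tf.keras.Sequential([
        backbone,
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```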
6 SOURCE CODE
To allow interested researchers to reproduce the presented results, we share the related data, including scripts, datasets, trained models, parameters, model diagrams, and preprocessed input data, in a public repository.

7 CONCLUSION
In this study, we proposed a multisensory learning architecture, which utilizes convolutional neural networks and a multilayer perceptron, for object recognition. We presented object recognition results for machine learning models trained with single sensory inputs and for a multimodal fusion method, decision level fusion. The results indicate that our proposed architecture improves accuracy compared with the benchmarked object recognition methods. The presented results can also serve as a baseline for other researchers who wish to leverage the dataset to develop multimodal machine learning models addressing the object recognition task on the iCub robot.
We envision that this study can be extended in the following ways. First, the scaling characteristics of the proposed architecture can be assessed by adding the remaining 110 objects of the dataset. Second, explainable artificial intelligence (XAI) methods can be applied in a multimodal way to interpret the role of each sensory modality and the contribution of each layer of the models to object recognition [18]. Lastly, since the color images and depth data are associated with rotation angles, the dataset can also be employed to learn multimodal sensorimotor schemes by mapping the sensory readings (e.g., images) to the degree of rotation [19].

ACKNOWLEDGMENTS
Murat Kirtay and Verena V. Hafner have received funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2002/1 "Science of Intelligence" - project number 390523135. Guido Schillaci has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 838861 (Predictive Robots).
REFERENCES
[1] F. M. Carlucci, P. Russo, and B. Caputo. 2018. (DE)2CO: Deep Depth Colorization. IEEE Robotics and Automation Letters 3, 3 (July 2018), 2386–2393. https://doi.org/10.1109/LRA.2018.2812225
[2] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard. 2015. Multimodal deep learning for robust RGB-D object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 681–687. https://doi.org/10.1109/IROS.2015.7353446
[3] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard. 2015. Multimodal deep learning for robust RGB-D object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 681–687.
[4] M. Gao, J. Jiang, G. Zou, V. John, and Z. Liu. 2019. RGB-D-Based Object Recognition Using Multimodal Convolutional Neural Networks: A Survey. IEEE Access 7 (2019).
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[6] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[7] M. Kirtay, U. Albanese, L. Vannucci, E. Falotico, and C. Laschi. 2017. A graspable object dataset constructed with the iCub humanoid robot's camera. Technical Report. https://doi.org/10.5281/zenodo.1170629
[8] Murat Kirtay, Ugo Albanese, Lorenzo Vannucci, Guido Schillaci, Cecilia Laschi, and Egidio Falotico. 2020. The iCub Multisensor Datasets for Robot and Computer Vision Applications.
[9] M. Kirtay, L. Vannucci, U. Albanese, E. Falotico, and C. Laschi. 2018. Multimodal Sensory Representation for Object Classification via Neocortically-inspired Algorithm. 245–250.
[10] Dana Lahat, Tülay Adali, and Christian Jutten. 2015. Multimodal data fusion: an overview of methods, challenges, and prospects. Proc. IEEE 103, 9 (2015).
[11] IEEE, 1817–1824.
[12] Lorand Madai-Tahy, Sebastian Otte, Richard Hanten, and Andreas Zell. 2016. Revisiting deep convolutional neural networks for RGB-D based object recognition. In International Conference on Artificial Neural Networks. Springer, 29–37.
[13] Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. 1996. Columbia Object Image Library (COIL-100). Technical Report.
[14] Sang-Il Oh and Hang-Bong Kang. 2017. Object detection and classification by decision-level fusion for intelligent vehicle systems. Sensors 17, 1 (2017), 207.
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[16] D. Ramachandram and G. W. Taylor. 2017. Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Processing Magazine 34, 6 (Nov 2017), 96–108. https://doi.org/10.1109/MSP.2017.2738401
[17] Javier Ruiz-del-Solar, Patricio Loncomilla, and Naiomi Soto. 2018. A survey on deep learning methods for robot vision. arXiv preprint arXiv:1803.10862 (2018).
[18] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller. 2017. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems 28, 11 (Nov 2017), 2660–2673. https://doi.org/10.1109/TNNLS.2016.2599820
[19] Guido Schillaci, Antonio Pico Villalpando, Verena V. Hafner, Peter Hanappe, David Colliaux, and Timothée Wintz. 2020. Intrinsic motivation and episodic memories for robot exploration of high-dimensional sensory spaces. Adaptive Behavior (2020), 1059712320922916.
[20] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[21] Richard Socher, Brody Huval, Bharath Bath, Christopher D. Manning, and Andrew Y. Ng. 2012. Convolutional-recursive deep learning for 3D object classification. In Advances in Neural Information Processing Systems. 656–664.
[22] Anran Wang, Jianfei Cai, Jiwen Lu, and Tat-Jen Cham. 2015. MMSS: Multi-modal sharable and specific feature learning for RGB-D object recognition. In Proceedings of the IEEE International Conference on Computer Vision.