Unsupervised Learning of Shape Concepts - From Real-World Objects to Mental Simulation
Christian A. Mueller and Andreas Birk∗

Abstract
An unsupervised shape analysis is proposed to learn concepts reflecting shape commonalities. Our approach is two-fold: i) a spatial topology analysis of point cloud segment constellations within objects, in which constellations are decomposed and described in a hierarchical and symbolic manner; ii) a topology analysis of the description space in which these segment decompositions are exposed. Inspired by Persistent Homology, groups of shape commonality are revealed. Experiments show that extracted persistent commonality groups can feature semantically meaningful shape concepts; the generalization of the proposed approach is evaluated on different real-world datasets. We extend this by not only learning shape concepts using real-world data, but also by using mental simulation of artificial abstract objects for training purposes. This extended approach is unsupervised in two respects: label-agnostic (no label information is used) and instance-agnostic (no instances preselected by human supervision are used for training). Experiments show that concepts generated with mental simulation generalize to and discriminate real object observations. Consequently, a robot may train and learn its own internal representation of concepts regarding shape appearance in a self-driven and machine-centric manner while omitting the tedious process of supervised dataset generation, including the ambiguity in instance labeling and selection.
Studies of early object perception in infants [1] suggested that objects can be characterized by a set of properties such as continuity, i.e., objects successively move along a path, or solidity, i.e., objects can only move through free-space. Furthermore, shape is a key visual cue as it fundamentally contributes to reasoning about and understanding of objects [2, 3]. Shape commonalities among objects allow to infer similar object (including semantic) properties. Shape is used in many robotic application areas ranging from household to industry, e.g., in object shape categorization tasks [4], in generation of grasping primitives for similar object appearances in manipulation [5], or in finding substitutes for currently absent objects [6, 7], to name just a few examples.

∗The authors are with the Robotics Group of the Computer Science & Electrical Engineering Department, Jacobs University Bremen, Germany; e-mail: {chr.mueller, a.birk}@jacobs-university.de.

Figure 1: Illustration of the proposed object shape conceptualization approach.

In traditional object perception in the form of object instance or category recognition, an association is formed between a label and a specific instance (e.g., John's mug) or a generic group of instances (e.g., mug), which share commonalities in appearance [8]. A group of instances can be denoted as a category and the description and abstraction of group commonalities as a concept. Learning concepts from objects by associating meaning to a system's percepts is often conducted through interaction [9] and supervision [10, 11]. Eventually, associations are generally human-made, individually and continuously evolved over lifetime experience [12] based on a set of modalities like tactual, auditory or visual sensations [13, 14]. The combination of those sensations allows us to reliably interpret perceived object information [15].
Humans are capable of incorporating further modalities including functional object knowledge to differentiate objects even though their visual percepts can be similar, e.g., mug, cup, vase or bowl. Consequently, such natural concepts are often not inferable from a machine-perspective due to the lack of dimensionality representing the perceived observations (e.g., only images or point clouds). From a machine-vision perspective, human-supervised learning methods are particularly vulnerable to incorporating such knowledge, e.g., the function or affordances of objects, which is not inferable from pure sensor data. This is often inevitable when a supervised labeling process is conducted by humans, which will ultimately lead to biases in the learning phase.

Our work in contrast focuses on object understanding from a machine-perspective, avoiding supervision. The work presented here builds upon our method from [16] that learns shape concepts in an unsupervised (label-agnostic) and data-driven manner from point clouds irrespective of human annotations, which may contain biases. Extracted segment constellations within object point clouds are used to learn patterns and eventually concepts of shape commonalities in a hierarchical manner. It is shown that concepts can be learned from real-world RGB-D snapshots of objects, or more precisely single-view point clouds omitting the color information, using well-known datasets like the Washington RGB-D Object Dataset [17] (WD) or the
Object Segmentation Database [18] (SD). From the machine perspective, the concepts learned are purely derived from the given data, i.e., they are not affected by variable biases that may be caused by individual human interpretations with respect to the instance label annotation. As will be seen, the concepts learned on one real-world dataset generalize well across other real-world datasets not seen before.

Nevertheless, biases can lie in the selection of the dataset instances used for training and validation. Moreover, the dataset generation process is cumbersome and generally requires effort in preparation, including object instance selection, defining the experimental setup, or labeling of object ground truth with regard to the background. Therefore, we further investigate in this article the capability of learning the essence of object appearance from artificially generated object observations in simulation and whether the learned concepts are applicable to discriminate real-world object concepts. This capability allows an artificial system like a robot to train its own internal representation of concepts regarding shape appearance in a self-driven and machine-centric manner without human bias. Thus, we present here an approach which is unsupervised in two respects: it is label-agnostic (no label information is used) as well as instance-agnostic (no instances preselected by human supervision are used).
Shape analysis relies on a robust description and representation [19, 20] of objects, particularly in real-world scenarios where snapshots of objects are affected by sensor noise and occlusions [21]. Theories of object perception from Cognitive Science and Psychology suggest a hierarchical and component-based representation of object information [22]. Inspired by this, an analysis of topological patterns is applied here to sensor data in the form of point clouds observed from single viewpoints with a Kinect-like camera. The analysis is two-fold (see Fig. 1): i) an analysis of the spatial topology in point cloud decompositions, ii) a topology analysis of these decompositions in description space.

Regarding i), point clouds are initially over-segmented [23] (Fig. 1 A) and further post-processed to segments that can reflect meaningful components of objects. Subsequently, a hierarchical decomposition of point clouds is generated in a bottom-up manner (Fig. 1 B). These segment compositions of objects allow to reason about shape characteristics and commonalities; commonalities observed within objects can be generalized to a shape concept.

Constellation models, which learn concepts from perceived feature (e.g., keypoint or segment) constellations, have been successfully used in recent years [4, 24, 25, 26, 27]. The inference is typically based on a local analysis of feature coherences with a priori learned constellation models, i.e., local evidences in a constrained spatial range with respect to the features using, e.g., Markov Networks [28, 29]. This inference is robust to the absence of features due to noise and partial object occlusion. Shape facets which become apparent on a global scale – especially in the case of complex structures – are in contrast insufficiently reflected when considering only local inferences. In the work presented here, a hierarchical constellation model (Fig.
1 C) is proposed in which segment constellations are decomposed over multiple topological levels that gradually (from local to global) reflect shape facets: from individual segment occurrences over segment groups to a single group of segments which represents an entire object. On each topological level, shape characteristics are observed and learned.

A related research field focuses on compositional hierarchies [30, 31, 32] in which general geometric building-blocks like edges or contours are hierarchically composed to unions of these building-blocks. Similarly, skeletonization methods [19, 33, 34] try to extract structure within objects from which regions and object components can be decomposed for reasoning purposes. Our work differs in several aspects, especially as here a) the building-blocks are represented as symbols which characterize underlying 3D point cloud segments, and b) their constellations are subsequently learned in a multi-hierarchical manner.

Regarding ii), i.e., the topology analysis of the decompositions in description space: observed decompositions over the topological levels are analyzed to gather distinctive insights and patterns that can be interpreted and related to concepts of specific shape appearances. Persistent Homology (PH) is a concept related to Topological Data Analysis that has been applied in various areas related to high-dimensional data visualization or to finding relations and coherencies in Big Data scenarios in general [35]. PH allows to extrapolate features from data by means of finding persistent (or stable) feature appearances through an iterative filtration of the data, in contrast to standard clustering approaches. Standard clustering algorithms (e.g., k-Means, Expectation-Maximization, tree-based algorithms, etc.) associate data points to groups of data which share similar properties, as measured by a metric or similarity function.
Inherent parameters of clustering algorithms are related to the number, size and variance of clusters, the neighborhood distance between data points or, in the case of tree-like clustering, a splitting criterion. The parameterization is often computationally costly and depends on the concrete data to which the clustering process is applied. Furthermore, partitioning the topology of a continuous description space with a static parameterization is often not a good solution due to over- and under-fitting effects. Soft-clustering approaches like probability-based Expectation-Maximization (EM) provide feedback on the actual fit of a query to the set of previously extracted clusters, but such approaches require additional post-processing in order to make a final decision about cluster membership.

PH in contrast allows to investigate the topological evolution of the data in a step-wise manner. The concept of PH has already shown its applicability in geometric shape analysis to detect persistent shape patterns when being directly applied on point cloud data [35, 36, 37]. But instead of directly applying PH on point cloud data, we use here the responses that are retrieved from our topological analysis of point cloud decompositions proposed in i). The PH-based analysis allows to detect persistent appearances of the responses during the filtration process, which reveal shape commonalities of instances that can form concepts (Fig. 1 D).

Consequently, we focus on shape reasoning with a symbolic representation of geometric information, which is further exploited to learn visual patterns from observed object point cloud compositions on multiple granularity levels, from which concepts can be learned. Our final goal is to investigate whether the visual patterns can be learned in a data-driven manner by encoding real object observations or on the basis of abstract artificial data from simulation.
The question arises whether artificial data from simulation allows to encode visual patterns that lead to concepts which can then be applied to discriminate real object observations (Fig. 1 E).

Building on our work from [38] as basis, an object point cloud is initially over-segmented into atomic patches and further processed to segments, also known as super patches, which can represent semantically meaningful shape components like the planar surfaces of a box or the cylindrical and planar surfaces of a can (see Fig. 2(a)). Subsequently, objects are represented as a set of point cloud segments. These segments can be interpreted as building blocks that constitute objects.
Figure 2: (a) A two-step segmentation [38] from atomic patch segments to super patches is used as basis – here illustrated by sample snapshots of 4 example objects (can, box, cordless drill, teddy). (b) An example hierarchical dictionary [4] D = {d_1, d_2, d_3} is shown that consists of 3 description levels using divisive clustering. For illustration, each visual word is depicted as a circle with a colored polygon. A segmented object is shown as a graph on the left of the dictionary; on its right, the visual words assigned for each segment according to the respective description level are shown.

Dealing with real-world data, object observations are imperfect, e.g., noisy and partially occluded, which leads to a degradation of the detection of these building blocks. This leads to failures in associating observed data to known building blocks, which is in general also known as the correspondence problem. Therefore the stability of the detection of such building blocks in real-world data is a major challenge. To mitigate the correspondence problem among imperfect segments, a symbolic representation of segments is chosen as an abstraction step to facilitate further shape reasoning. Segment appearances are quantized to a set of discrete visual words following the well-known bag-of-words methodology [4], i.e., the visual words constitute a dictionary. The idea is that similarly appearing segments are abstracted to the same symbol, respectively visual word. The level of quantization plays a crucial role, since too few words may lead to under-fitting, whereas too many words may lead to over-fitting symptoms. For an unbiased and purely data-driven word generation, a hierarchical divisive clustering procedure is applied as introduced in our previous work [4]. Therein, segments are initially described with a description vector that is generated by a point cloud descriptor like FPFH [39].
As a result of the clustering procedure, a hierarchical dictionary D is created that consists of multiple description levels {d_1, d_2, ...}, where level d_f consists of 2^f words (Fig. 2(b)). Each word represents a description vector whose position is inferred by the clustering procedure during the training phase using a set of segments captured from random scenes. Given an object segment, the extracted description vector of the segment is passed through the hierarchical dictionary D. For each description level, the propagated description vector is accordingly labeled with the visual word whose description vector is closest using the l_2-norm (Fig. 2(b)).

A segment composition of a captured object o is initially represented as a graph g_o in which each segment corresponds to a vertex and neighboring vertices are connected with an edge. Each vertex is augmented with the corresponding point cloud segment and the visual word that is inferred from the set of visual words on the respective description level in the dictionary D (see Sec. 3.1); the visual word inferences can hence differ according to the description level as illustrated on the right side of Fig. 2(b).

The spatial topology of segments is analyzed in an unsupervised manner and encoded in a hierarchical representation, which we denote as Shape Motif Hierarchy; an illustration of a hierarchy H is shown in Fig. 3(a). H is based on a graphical representation of visual word constellations which are denoted as motifs; note that these constellations can only contain visual words of a specific description level. Therefore, for a dictionary D = {d_1, d_2, ..., d_n} which contains n description levels, n hierarchies are created that constitute an ensemble HE = {H_1, H_2, ..., H_n}, see Fig. 3(c).

In the training phase for each hierarchy H, object observations are encoded in a bottom-up manner, beginning with single object segments over groups of segments until a single constellation of segments represents the entire object.
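As a minimal illustration of such a dictionary, the following sketch uses bisecting k-means as a stand-in for the divisive clustering of [4]; all function names (build_dictionary, assign_words) and the descriptor dimensionality are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def bisect(descriptors, iters=20, seed=0):
    """Split a set of segment descriptors into two groups with plain k-means."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), 2, replace=False)]
    labels = np.zeros(len(descriptors), dtype=int)
    for _ in range(iters):
        labels = np.argmin(
            ((descriptors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers, labels

def build_dictionary(descriptors, levels=3):
    """Return one array of visual-word centers per description level d_1..d_n."""
    dictionary, frontier = [], [descriptors]
    for _ in range(levels):
        next_frontier, words = [], []
        for group in frontier:
            if len(group) < 2:           # group too small to split further
                words.append(group.mean(axis=0))
                next_frontier.append(group)
                continue
            centers, labels = bisect(group)
            words.extend(centers)
            next_frontier.extend(
                group[labels == j] for j in (0, 1) if np.any(labels == j))
        dictionary.append(np.vstack(words))
        frontier = next_frontier
    return dictionary

def assign_words(descriptor, dictionary):
    """Label a segment descriptor with the closest word on every level (l2-norm)."""
    return [int(np.argmin(np.linalg.norm(level - descriptor, axis=1)))
            for level in dictionary]
```

In practice the descriptors would be FPFH vectors of real segments; here any fixed-length vectors suffice to show the level-wise quantization.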
A sample propagation of a box (consisting of three segments) through the hierarchy H is shown in Fig. 3(a). Object segments are propagated through the hierarchy using the corresponding visual words associated to the segments. Within the propagation process, newly observed visual word constellations (word motifs) are integrated into the hierarchy as motif vertices (see Fig. 3(b)). Each motif in the hierarchy is unique with respect to visual words, i.e., a newly observed word motif of an object leads to the creation of a motif vertex if the motif does not exist in the hierarchy. For further characterization of a motif vertex, a point cloud description is extracted from a propagated segment constellation and added as a motif prototype to the motif vertex that corresponds to the motif of the propagated constellation (Fig. 3(b)). As a result, each motif vertex represents a shape motif that can be exploited as a building block and that can constitute – even unknown – objects.

Figure 3: An example of a shape motif hierarchy H is shown in (a); it consists of multiple motif levels. Each node represents a specific motif vertex, whereas each smaller linked node represents a motif prototype. A sample propagation of a box (consisting of three segments) through H is shown in (a). Feasible propagations, which have been previously encoded in the hierarchy during the training phase but which are not affected by the box, are also depicted. Components of a motif level are illustrated in (b). In (c) the combined approach is illustrated: an example shape motif hierarchy ensemble HE based on three shape motif hierarchies {H_1, H_2, H_3} using the respective description levels {d_1, d_2, d_3} of D (see Fig. 2(b)).
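The bottom-up propagation of a segmented object through one hierarchy can be caricatured as follows. This is a strongly simplified sketch under my own assumptions: it only tracks which word constellations merge level by level (motifs as frozensets of (segment, word) pairs), ignoring motif prototypes and the hierarchy's persistent bookkeeping; all names are illustrative.

```python
from itertools import combinations

def propagate(segment_words, adjacency):
    """segment_words: {segment_id: word}; adjacency: set of (i, j) pairs.
    Returns the set of word motifs observed on each level, bottom-up."""
    # Level 1: one motif per segment, carrying its visual word.
    level = {frozenset([(s, w)]) for s, w in segment_words.items()}
    levels = [level]
    while len(level) > 1:
        merged = set()
        for a, b in combinations(level, 2):
            segs_a = {s for s, _ in a}
            segs_b = {s for s, _ in b}
            # Two motifs are linked if they share a segment or contain
            # neighboring segments; a link merges them into a larger motif.
            touching = (segs_a & segs_b) or any(
                (i, j) in adjacency or (j, i) in adjacency
                for i in segs_a for j in segs_b)
            if touching:
                merged.add(a | b)
        if not merged:        # disconnected constellation: stop early
            break
        level = merged
        levels.append(level)
    return levels
```

For a three-segment box with a chain adjacency, this yields single segments, then pairs, then one motif covering the whole object, mirroring the level structure of Fig. 3(a).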
Further, at motif level l=1, an edge between two motif vertices is created if the corresponding object segments are neighbors. For l>1, an edge is created if two motif vertices contain a visual word that corresponds to the same segment of the propagated object. In each propagation step from level l to l+1, the union of word motifs connected by an edge in level l forms a vertex in level l+1. Consequently, upper levels can consist of fewer edges or vertices, i.e., a single motif vertex can encompass a word constellation that represents an entire object; see, e.g., the box sample at motif level 4 in Fig. 3(a). In this manner, objects are decomposed into various motifs by the propagation through the hierarchy H.

As a result, the Shape Motif Hierarchy Ensemble HE = {H_1, H_2, ..., H_n} does not only take the structural appearance with respect to the variety of the segment constellations into account but also the symbolic appearance of constellations by using a specific dictionary description level for the respective hierarchy. For illustration purposes, the propagation process of a segmented teddy bear is shown in Fig. 4, from a set of primitive motif vertices to more complex motif vertices in which eventually a single motif represents the teddy bear.

Figure 4: Illustration of a segmented teddy sample propagated through H (see Fig. 3(a)) showing the activated motif vertices in each motif level (see Fig. 3(b)); the teddy segments in the point cloud are randomly colored.

In the training phase, object segment constellations represented by corresponding visual words are propagated through the hierarchy and are memorized as motif prototypes within motif vertices that match visual word constellations of the object. Inspired by the
Prototype Theory [40], each motif vertex is formed by these prototypes, which are used to generate stimuli for unknown objects as described in the following: given a graph of segments g_o of object o, the segments are annotated with the corresponding words and subsequently propagated through the hierarchy as in the training phase, see the box example in Fig. 3(a) – note that the hierarchy is not modified during the stimuli generation. Through the propagation of segments, motif vertices are activated that correspond to the words of the propagated segments. An activation of a vertex v is represented by the indicator function 1_v(g_o), which returns 1 in case of a match and 0 if no match is found. For an activated v, a stimulus α(v, g_o) is computed based on the point cloud descriptions of the memorized motif prototypes T_v of v and the respective description q of the object segments in g_o which activated v. By applying Probabilistic Neural Networks [41], the stimulus is computed with an adapted Gaussian kernel (bandwidth σ) in which the Jensen-Shannon divergence (JSD) [42] is used as distance measure, see Eq. 1.

α(v, g_o) = { (1/|T_v|) · Σ_{i=1}^{|T_v|} e^(−JSD(t_i, q)/σ),  if 1_v(g_o) = 1;  0, otherwise }   (1)

As a result, for each propagated object, the stimuli of motif vertices in H_i are accumulated and projected into vector form γ_o^i = [α(v_1, g_o), α(v_2, g_o), ...]. Given n description levels and correspondingly trained n shape motif hierarchies that form the ensemble HE = {H_1, H_2, ..., H_n}, the object graph g_o is propagated through each motif hierarchy. Subsequently, a final stimuli vector *γ_o = [γ_o^1, γ_o^2, ..., γ_o^n] is composed (||) of the stimuli retrieved from the n motif hierarchies, see Fig. 3(c).
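The stimulus of Eq. 1 can be sketched as follows; the exact kernel adaptation of [41] is not reproduced here, and the function names and the σ default are my own assumptions.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (base-2) of two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                      # 0 * log(0) terms contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def stimulus(prototypes, q, activated, sigma=0.1):
    """alpha(v, g_o): mean kernel response over the motif prototypes T_v
    of vertex v for query description q; 0 if v was not activated."""
    if not activated:
        return 0.0
    return float(np.mean([np.exp(-jsd(t, q) / sigma) for t in prototypes]))
```

A query identical to a stored prototype yields a stimulus of 1, and the response decays with the JSD to the prototypes; descriptions are assumed to be normalized histograms.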
Commonalities among shape appearances can vary from specific to generic shape facets: a concept generation process is hence used which, in a gradual manner, detects commonalities ranging from individual to common facets, i.e., from very specific to often re-occurring facets. Persistent Homology (PH) provides the computational model that allows to gradually reveal topologically persistent patterns in the generated stimuli *γ, which are interpreted as commonalities and eventually as shape concepts.

We briefly introduce terms from algebraic topology which are related to our shape concept learning approach. Comprehensive literature can be found in [35, 37, 43, 44, 45].
Given a continuous topological space X = {x_1, x_2, ..., x_m | x_i ∈ R^n, 1 ≤ i ≤ m} with m n-dimensional data points. A simplex π is a d-dimensional polytope, which is a graph consisting of the convex hull of d+1 affinely independent vertices where each vertex is a point in X. A composition of simplices is denoted as a simplicial complex K = {π_1, π_2, π_3, ...}. This composition is a union of vertices, edges, triangles or other higher-dimensional polytopes.

We focus on Vietoris-Rips complexes in which a complex K_i^vr is extracted from a subspace X_i ⊆ X with a given scale parameter ε > 0. K_i^vr consists of vertices that are only connected if the distance between the vertices is lower than the given parameter ε. The Vietoris-Rips complex K_i^vr can also be denoted as ε-complex, where ε is also denoted as radius or distance threshold.

Homology Groups

Homology is a concept in algebraic topology which allows to reveal specific characteristics or features in X. Characteristics are therein organized into homology groups HG = {H_0(X), H_1(X), H_2(X), ...}. Often, the first three homology groups are analyzed: in the context of geometry, H_0(X) is related to connected components or clusters of vertices, H_1(X) is related to complexes in the form of loops or holes, and H_2(X) is related to voids which represent fully connected complexes. Here, we focus on H_0(X) since it complies with our goal to extract topological groups from stimuli vectors (see Sec. 3.3), which can represent concepts.

The filtration of the topological space X is initiated by a subsequently nested application of a set of radii E = {ε_1, ε_2, ..., ε_j} where ε_{i−1} < ε_i < ε_{i+1}. For H_0(X), each point x_i ∈ X is represented at the beginning of the filtration process by a 0-simplex π_i of the Vietoris-Rips complex K_1^vr. These simplices are, so to say, born at radius 0. Note that K_i^vr is extracted using radius ε_i.
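For illustration, the 1-skeleton (vertices and edges) of such an ε-complex over a point set can be obtained by naive distance thresholding; this is an O(m²) sketch, not an efficient implementation.

```python
import numpy as np

def rips_edges(points, eps):
    """Edge set of the epsilon-complex: pairs (i, j), i < j, whose
    pairwise Euclidean distance is below eps."""
    pts = np.asarray(points, float)
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    return {(i, j) for i in range(len(pts)) for j in range(i + 1, len(pts))
            if d[i, j] < eps}
```

Growing ε monotonically enlarges this edge set, which is exactly the nesting exploited by the filtration below.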
While the filtration progresses, the Vietoris-Rips complex grows since the radius increases, which can cause fusions of simplices that form a larger simplex: a union is performed between simplices while one simplex enlarges and sustains by annexing the other, which dies. Eventually, a complex K^vr is filtered that contains a single high-dimensional simplex – see Eq. 2.

∅ ⊆ K_1^vr ⊆ K_2^vr ⊆ ... ⊆ K_j^vr = K^vr   (2)

Persistent Homology provides a way to analyze and track birth and death of simplices (also known as homology classes) along the filtration process: H_0(K_i^vr) → H_0(K_{i+1}^vr). The corresponding results can be represented in persistence or barcode diagrams (Fig. 9(a)). While considering the gradual evolution of the Vietoris-Rips complex K_i^vr, the extraction of homology classes (birth and death) is inherently robust to deformation due to the topological organization of the data in a graphical manner.

In the following, the shape concept extraction process is described – from topological space and concept generation to concept inference.
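For H_0, the birth/death bookkeeping along the filtration of Eq. 2 reduces to a union-find over edges sorted by length: every class is born at radius 0, and each fusion of two components records one death. A minimal sketch (my own, not the paper's implementation):

```python
import numpy as np

def h0_persistence(points):
    """Death radii of H_0 classes under a Vietoris-Rips filtration of points."""
    pts = np.asarray(points, float)
    n = len(pts)
    edges = sorted(
        (float(np.linalg.norm(pts[i] - pts[j])), i, j)
        for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    def find(x):                         # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    deaths = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                     # two components fuse: one class dies
            parent[ri] = rj
            deaths.append(eps)
    return deaths                        # n-1 deaths; one class persists forever
```

Long-lived classes (large death radii) correspond to the persistent features discussed above.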
Given a set of raw stimuli vector responses (Sec. 3.3), the responses are initially used to create a topological space in a graphical manner. Therein, a stimuli vector *γ can be interpreted as an independent point in the space, in which a distance metric can be used to measure the similarity to other stimuli vectors; these vectors serve as anchor points in a space of unknown topology. The goal is to interrelate these vectors in order to discover topological relationships among the anchor points. We make use of a graphical representation in which each anchor point represents a vertex. Initially, a complete graph is created, where each edge between vertices is augmented with the corresponding distance; distances are measured by the Jensen-Shannon divergence (JSD).

To minimize the search space and to initiate the construction of the topological space X, the Minimum Spanning Tree [46] is extracted using the respective JSD distances. Subsequently, a substantial number of edges perishes and a minimum number of edges remains, which reveals the structural and topological organization of the stimuli vectors. Fig. 5 shows an example based on the object instances from the Object Shape Category Dataset (Sec. 6), which consists of seven shape categories (sack, can, box, teddy, ball, amphora, plate).

Figure 5: The minimum spanning tree, which spans the topological space X of stimuli vectors extracted from instances of the Object Shape Category Dataset (OSCD), see Sec. 6. Note that each vertex represents a sample object of the dataset. Vertices are colored only for illustration purposes by their corresponding category label of the dataset, which is not used in our unsupervised learning phase.

From this point on, we focus on the topological similarity among stimuli in the form of the geodesic distance within X. Therefore, each edge is uniformly weighted by assigning a distance of 1.
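This construction can be sketched with SciPy as follows. Note two assumptions of my own: scipy.spatial.distance.jensenshannon returns the square root of the JS divergence (whether the paper uses the divergence or its root is not restated here), and the stimuli vectors are distinct (identical rows would yield zero-weight edges, which dense csgraph input treats as absent).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.sparse.csgraph import minimum_spanning_tree

def build_space(stimuli):
    """stimuli: (m, k) array of non-negative stimuli vectors.
    Returns the MST as a list of (i, j, distance) edges."""
    s = np.asarray(stimuli, float)
    s = s / s.sum(axis=1, keepdims=True)      # treat rows as distributions
    m = len(s)
    d = np.zeros((m, m))                      # upper-triangular distance matrix
    for i in range(m):
        for j in range(i + 1, m):
            d[i, j] = jensenshannon(s[i], s[j], base=2)
    mst = minimum_spanning_tree(d).tocoo()
    return [(int(i), int(j), float(v))
            for i, j, v in zip(mst.row, mst.col, mst.data)]
```

For m vertices the complete graph collapses to the m−1 MST edges, the sparse skeleton used in Fig. 5.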
Due to the inherent sparsity of edges in X, Johnson's all-pairs shortest path algorithm allows to efficiently generate a distance map which is used to infer a heat for each vertex x ∈ X. A vertex heat h(x) is inferred as the mean geodesic distance d_geo(·) to all other vertices in X, whereas the edge heat h(e_{j,k}) is determined by the mean heat of the connected vertices x_j and x_k, as shown in Eq. 3.

h(x) = (Σ_{i=1}^{|X|} d_geo(x, x_i ∈ X)) / |X|,   h(e_{j,k}) = (h(x_j) + h(x_k)) / 2   (3)

Henceforth, we use edge heats as edge distances between the respective vertices. By scaling the heat in X to the interval [0,1] and inverting the heat, vertices located at leaf regions of X come closer to each other whereas vertices in the inner region move farther away from each other. Furthermore, two observations can be made: a) the heat of exteriorly located edges is lower than that of the interiorly located ones; b) vertices which are interiorly located reflect more heterogeneity with respect to their neighbors, compared to vertices which are exteriorly located in X.

Given the topological space X, the filtration is applied over a range of radii E = {ε_1, ε_2, ..., ε_j}. The step size ε_i → ε_{i+1} is determined by the minimum edge distance in X, which also initializes the filtration at ε_1. The filtration is completed when the maximum edge distance in X is reached at ε_j. In practice, the number of steps |E| can reach a computationally intractable number. An upper bound limit for |E| can be applied by increasing the step size until the upper bound is met. Consequently, the filtration is initialized with 0-simplices where each simplex represents a stimuli vector, i.e., a vertex of the topological space X. This filtration is performed on X as described in Sec.
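Eq. 3 can be sketched with SciPy's Johnson all-pairs shortest paths on the uniformly weighted tree; function and variable names here are mine, not the paper's.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import johnson

def heats(n, mst_edges):
    """n vertices; mst_edges: iterable of (i, j) pairs with uniform weight 1.
    Returns the vertex heats h(x) and edge heats h(e_{j,k}) of Eq. 3."""
    rows, cols = zip(*mst_edges)
    g = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    d_geo = johnson(g, directed=False)          # geodesic distance map
    h_vertex = d_geo.mean(axis=1)               # h(x): mean geodesic distance
    h_edge = {(i, j): 0.5 * (h_vertex[i] + h_vertex[j])  # h(e_{j,k})
              for i, j in mst_edges}
    return h_vertex, h_edge
```

On a three-vertex path 0-1-2, the center vertex receives the lowest heat, matching the observation that interior vertices differ from exterior (leaf) ones.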
4.1.4; note that the equidistant filtration steps from ε_1 to ε_j are often denoted as time.

Persistent Homology allows to track the birth and death of simplices in K^vr of X during the filtration. Due to the nature of the evolving simplicial complexes (see Eq. 2), in each time step the complex changes its appearance after annexations of simplicial complexes of previous time steps. These changes during the filtration are encoded in a graph F, which is shown in Fig. 6(a). An edge represents an annexation of one simplicial complex by another during the filtration process – beginning with 0-simplices representing leaves in F. Each edge is augmented with the annexation time. So, outer simplices lived shorter since they have been annexed earlier in time compared to inner ones. As a result, F represents the filtration progression of X.

The lifetime of simplices can be interpreted as a feature indicator in X, i.e., persistent or long-living simplices tend to represent a significant feature, i.e., a shape property that is prominent for an object or even an object category. At the same time, short-living simplices can be interpreted as being insignificant. The goal is hence to detect persistent simplices. In order to ease the persistence analysis, the filtration time range is scaled to the interval [0,1], i.e., from 0 (start of filtration = ε_1) to 1 (end of filtration = ε_j). In the filtration process, trivial homology classes are obtained at time 0, where only 0-simplices exist, and at time 1, where a single simplex consists of all simplices in X. We are interested in finding persistent groups between these extrema.

Figure 6: (a) A filtration graph F showing annexations over time according to the given graph X (see Fig. 5). For illustration purposes, each vertex is colored with the corresponding label as shown in Fig. 5. (b) Connected components extracted from F in (a) that represent the concepts C (|C| = 36). For illustration purposes, each vertex (concept prototype) is colored with the corresponding label as in Fig. 5.

A group is a connected component of vertices, i.e., a d-simplex (d > 0). Due to the gradual filtration, each group consists of topologically similar vertices. Therefore, the groups can constitute shape concepts, where each vertex within a group is a representative concept prototype.

Given the entire time spectrum [0,1], Persistent Homology allows to access any state of detected concepts C in X at an arbitrary time in the spectrum; note that the filtration starts with |C| = |X| and ends with |C| = 1. Consequently, a distinctive time can be determined. An optimal time varies according to the topology that is reflected by the given stimuli vectors. We consider a time optimal when the global maximum of annexations (see Sec. 6.1) is reached; subsequently, edges in F that are augmented with a time older than the optimal time are removed. This optimal time leads to a set of connected components in F that can reflect useful shape concepts, as illustrated in Fig. 6(b). Note that edges which are created at a later time connect more heterogeneous groups and subsequently represent more generic, possibly too generic, concepts, in contrast to the more specific concepts which emerge when edges are created at an earlier time.

Given a stimuli vector *γ_o that is extracted from an unknown object o, a response is retrieved based on the similarity to previously learned shape concepts (see Fig. 6(b)). Each concept c ∈ C consists of a set of concept prototypes P_c = {p_1, p_2, ...}, which are used to derive the correspondence of unknown objects to concepts. In the spirit of Prototype Theory [40], unknown instances are classified based on the similarity to known instances, which are associated to the previously learned shape concepts.
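The cut at the optimal time can be sketched as replaying the time-stamped annexations only up to that time and reading off the connected components. The (time, i, j) triple format for annexation events is an illustrative assumption of mine, not the paper's data structure.

```python
def concepts_at(n, annexations, t_opt):
    """n vertices; annexations: list of (time, i, j) with times scaled to [0, 1].
    Returns the concepts C at time t_opt as frozensets of vertex ids."""
    parent = list(range(n))
    def find(x):                         # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for t, i, j in annexations:
        if t <= t_opt:                   # keep only edges up to the optimal time
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), set()).add(v)
    return [frozenset(g) for g in groups.values()]
```

At t_opt = 0 every vertex is its own concept (|C| = |X|), at t_opt = 1 a single concept remains, and intermediate times yield the persistent groups of interest.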
To demonstrate the discrimination capability of our shape representation, the similarity φ_c(·) to a concept c is determined by a (basic) mean similarity between ∗γ_o and the prototypes P_c of concept c (see Eq. 4); as distance measure, the Mahalanobis distance d_mah(·) is used.

φ_c(∗γ_o) = (1 / |P_c|) Σ_{i=1}^{|P_c|} d_mah(∗γ_o, p_i), p_i ∈ P_c    (4)

An interesting question is how the training data for learning the shape concepts is generated. One option is to use datasets of real-world objects. In contrast, we propose the use of mental simulation to generate abstract artificial objects for shape concept learning purposes. This approach generates concepts in a machine-centric manner, i.e., concepts are learned in an unsupervised fashion in two respects: a) label-agnostic (no label information given by supervision is used) and additionally b) instance-agnostic (no real-world instances preselected by human supervision are used for training). As will be shown in the experiments in Sec. 7, the shape concepts learned in this way generalize well when applied to objects from real-world datasets. The core idea of the mental simulation is described in the following. We start with primitive-shaped building blocks or prototypes, namely box, sphere, and cylinder. Multiple prototypes can be randomly combined into a prototype composition, which forms an abstract object. We denote the number of introduced prototypes of an abstract object as its prototype order. The Gazebo simulation environment [47] is then used to generate these artificial abstract objects in simulation and to capture samples of the generated objects with a virtual sensor. Fig. 7 shows samples of artificial abstract objects of different prototype orders, captured in simulation.

Algorithm 1
Artificial Sample Generation
Input: prototype order n, empty sample s
1: i ← 0
2: while i < n do
3:   p ← get_random_prototype({box, cylinder, sphere})
4:   p ← set_random_dimensions({length, width, height, radius})
5:   if i > 0 then
6:     p ← set_random_pose({position, orientation}) so that p intersects with s
7:   end if
8:   s ⇐ p (introduce prototype p to sample s)
9:   i ← i + 1
10: end while
Output: sample s representing a composition of prototypes.

Figure 7: Examples of randomly generated abstract objects. Objects can encompass up to five primitive-shaped prototypes. Only for illustration purposes, the primitive-shaped prototypes of each object are distinctively colored: box, can and sphere.

Figure 8: Examples of 2.5D scans from the OSCD dataset: sack (a), can (b), box (c), teddy (d), ball (e), amphora (f) and plate (g).

Each prototype of an artificial object sample is not only randomly generated with respect to its type (box, sphere, cylinder) but also with respect to its spatial dimensions (e.g., length, width, height, radius). Each prototype has to overlap with at least one other prototype in order to form a connected structure, which is considered a valid object (see Alg. 1). Using this random approach, the object samples are generated without any human bias.

The experimental evaluation is two-fold. This section deals with the performance of the label-agnostic concept generation. This means that the shape concepts are learned in an unsupervised manner; semantic object labels generated by humans are only used to evaluate how reasonable the generated concepts are. Among others, it will be shown that concepts learned on one real-world dataset also generalize well to other real-world datasets consisting of different objects. In the following Sec.
7, the focus is on the evaluation of machine-centric learning of the shape concepts, i.e., not real-world data but abstract artificial objects from mental simulation are used for training, which leads to concepts that also perform well on the real-world datasets. The
Object Shape Category Dataset (OSCD) [16] is used for the first part of the evaluation. It consists of 468 RGBD scans of real-world objects from 7 categories. A few examples are shown in Fig. 8. In the training phase, each training sample scan (OSCD dataset) is propagated through HE = {H_1, H_2, ..., H_n}, omitting any label-related information, i.e., each scan is applied in an unsupervised manner to the HE; in our evaluation, n = 4 has been heuristically selected, since a smaller n may not allow HE to sufficiently discriminate the observed range of object shape variety. Afterwards, the extracted stimuli vectors are fed to the filtration process (see Sec. 4). Fig. 6(a) illustrates the filtration result of the stimuli vectors; the visualization does not reflect metric differences, it visualizes topological similarities among samples. Already at this stage, topological similarity can be observed with respect to the category labels of the objects. Note that the category labels are only associated to the prototypes for visualization purposes; as mentioned, they were not used in the training. In Fig. 9(a), the barcode of the homology group 0 is shown.

Figure 9: (a) Barcode of the homology group 0 (H_0). (b) The number of annexations among homology classes. (c) The proportional distribution of prototypes per concept (ranked concepts from Fig. 6(b)). For visualization purposes, each proportion within a bar is colored with the corresponding label according to Fig. 5 and sorted in ascending order by rs(·), see Eq. 6.
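For reference in the evaluation below, the concept response of Eq. 4, i.e., the mean Mahalanobis distance between a stimuli vector and a concept's prototypes, can be sketched as follows. The covariance is here estimated from the prototypes themselves, which is an assumption; the paper does not specify how the covariance for d_mah is obtained.

```python
import numpy as np

def concept_response(gamma, prototypes, cov_inv):
    """Eq. 4: mean Mahalanobis distance d_mah between the stimuli vector
    gamma and the prototypes P_c of one concept (smaller = closer)."""
    diffs = prototypes - gamma
    d = [np.sqrt(v @ cov_inv @ v) for v in diffs]
    return float(np.mean(d))

# toy concept: five prototypes scattered around the origin
rng = np.random.default_rng(0)
P_c = rng.normal(0.0, 0.1, size=(5, 2))
cov_inv = np.linalg.inv(np.cov(P_c.T))  # assumed: covariance from P_c

near = concept_response(np.zeros(2), P_c, cov_inv)        # close to concept
far = concept_response(np.array([3.0, 3.0]), P_c, cov_inv)  # distant query
```

An unknown object's response vector ρ_o is then simply the collection of these values over all concepts c ∈ C.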
At time ε_0, all concept prototypes, depicted as bars, are born. While the filtration progresses, more and more prototypes form larger homology classes, which leads to the death (end of a bar) of the prototypes that have been annexed. As a result, only a single simplex at time ε_j survives the filtration (see Sec. 4.1.4). Moreover, Fig. 9(b) shows the number of annexations of homology classes over time. It can be observed that the filtration reaches a global maximum of annexations at ε_max, i.e., the annexation of classes decreases even as ε approaches its maximum value. This can be interpreted such that the homology classes extracted after ε_max are already discriminative by their persistence. The gradual filtration process as described in Sec. 4.1 allows to analyze the topological space at any filtration step. Each filtration step offers insights about the topology and the relation among concept prototypes. Note that the choice of a specific number of concepts and concept size depends on the objective of the application scenario. Using ε_max as indicator to stop the filtration process and subsequently selecting the homology classes existing at time ε_max as concepts, we receive in total 36 concepts C (see Fig. 6(b)), with a minimum concept size of two (a connected component consists of at least two vertices). To assess the quality of the extracted concepts, we can make use of the human-annotated category labels, which are associated to the prototypes (see Fig. 6(b)). Therein, the correlation between the concepts and the labels given a priori by a human can be interpreted as a quality measure for the concepts learned in an unsupervised manner. The amount of this correlation or purity pu(·) can be defined as the largest proportion in the distribution of category labels over the prototypes of a concept, see Eq. 5, where concept c ∈ C consists of a set of concept prototypes P_c = {p_1, p_2, ...} which are accordingly attributed with labels Y_c = {y_1, y_2, ...
}, i.e., y_i = retrieve_label(p_i), given the set of category labels Y of the dataset, where y_i ∈ Y.

pu(c) = max_{y ∈ Y} |{p_i ∈ P_c : y_i = y}| / |P_c|    (5)

Given the concepts inferred by ε_max as described in Sec. 4.2.3 and illustrated in Fig. 6(b), it can be observed that connected components of different sizes are extracted, which is caused by the shape heterogeneity of the prototypes in X. A large portion of the concepts is pure (see Eq. 5), i.e., there is a perfect correlation and only prototypes of a specific category y ∈ Y are assigned to a concept c ∈ C. In Fig. 9(c), the resulting distribution of prototypes within a concept is illustrated. Concepts are sorted in ascending order by the rank score rs(c), which weights the concept purity pu(c) with respect to the concept size |P_c|, see Eq. 6.

rs(c) = |P_c| − pu(c) + ε, where ε is a small constant (0 < ε ≪ 1)    (6)

While a large fraction of the concepts is pure, other concepts show a lower purity, i.e., samples of different categories are assigned to a particular concept. However, these

Table 1: Unsupervised concept selection: testing set (5 repetitions)
Label:          sack  can  box  teddy  ball  amphora  plate
Mean error (%):  4.2  6.5  2.5    8.8     0     10.4      0

categories show shape similarities, like sack and can, or plate and box. Furthermore, the mean concept purity is high. Given the concepts, responses are extracted for each sample of the dataset, i.e., each sample object o is represented by ρ_o = {φ_1(∗γ_o), φ_2(∗γ_o), ...} (|ρ_o| = |C| = 36) and labeled with the corresponding dataset label. Accordingly, a Support Vector Machine (SVM) is trained and evaluated, see Table 1. Discriminative results have been obtained, which allow to assess how reasonable the extracted concepts are: e.g., shapes like ball, plate or box show a low cross-validation error, whereas categories with a high appearance variety due to deformability or a strong viewpoint dependency, e.g., teddy or amphora, can appear more ambiguous.

The following experiment evaluates the generalization capability of the proposed approach. First, HE is trained once with the training set of the OSCD dataset. This training process is unsupervised, i.e., HE is solely trained with instances in a label-agnostic manner. Then, instances from the OSCD dataset are propagated through HE (see Sec. 3). Based on the resulting stimuli vectors of the propagation, concepts C are generated (see Sec. 4). Given the previously trained HE model and the generated concepts C, we evaluate in the following the discriminative power of the concepts with instances from different real-world datasets. In addition to the OSCD objects, additional datasets with completely different real-world objects are used, namely the Washington RGB-D Object Dataset [17] (WD) and the
Object Segmentation Database [18] (SD) (see Table 2); note that all three datasets are sampled from different distributions, as illustrated in Fig. 10(a)-(f). In order to analyze the spectrum of responses for these dataset objects, each object o is initially represented as a graph of segments g_o (see Sec. 3.1) and applied to the two-step procedure: first, propagate g_o through HE to generate a stimuli vector ∗γ_o (see Sec. 3.3); second, compute for each concept c ∈ C the response φ_c(∗γ_o) (see Eq. 4 in Sec. 4.3). As a result, an object o generates a set of concept responses ρ_o = {φ_1(∗γ_o), φ_2(∗γ_o), ...} (|ρ_o| = |C| = 36, see Fig. 6(b)). Consequently, a |C|-dimensional space of concept responses CR^|C| is created. The generalization capability can be assessed by CR^|C|, which allows to observe relations and similarities among sample objects. To visualize and reason about the |C|-dimensional CR^|C| space, the t-SNE [48] embedding technique is applied to reduce the dimensionality to two; we denote this 2D space as CR. The embedding is performed in an unsupervised manner, i.e., it is label-agnostic. Fig. 11 shows instances from the WD, SD and OSCD datasets projected to the two-dimensional

Figure 10: Examples of appearance variations of sample point clouds related to the concept can, respectively cylinder, from different distributions (datasets): (a), (b) show can instances of the OSCD training set, (c), (d) show food_can_1_1_1 and food_can_14_1_1 of WD, and (e), (f) show cylindrical instances from scenes learn 34 and test 42 of SD.

Table 2: Sample distribution of the CR space

Label    | WD [17]              | SD [18]      | OSCD [16]
sack     | food bag             |              | sack (tr. set), sack (te. set)
can      | food can, soda can   | learn, test  | can
box      | cereal box, food box | learn, test  | box
teddy    |                      |              | teddy 0-44 (tr. set, 45 scans), teddy 0-13 (te. set, 14 scans)
ball     | ball, lime, orange   |              | ball
amphora  |                      |              | amphora
plate    | plate                |              | plate
Σ scans  | 335                  | 154          | 468 (total: 957)

Note: for each instance of WD, a fixed range of point cloud scans is selected from the first video sequence. (tr. = training, te. = testing)

CR space. For illustration, regions in CR are colored according to their correlation with a certain label (see Fig. 11) by exploiting the projected instances as anchor points in space. A uniform grid is created in the 2D CR space; for each cell in the grid, the k-nearest instances are determined (e.g., k =
5% of the total number of instances); then the majority label of the k instances is determined and the cell is colored according to that majority label; additionally, each cell is weighted, and the weight is visually depicted in form of the cell opacity. The weight represents the observed proportion of the k instances associated with the majority label, depicted in the interval [0, 1] from low to high proportion [low: transparent (white) =
0, high: opaque (solid majority label color) = 1]. The CR space shown in Fig. 11 allows to observe regional characteristics and relations among locations in CR and instances of the three datasets. A main observation is that instances from different datasets are propagated through

Figure 11: CR with instances from the WD, SD and OSCD datasets (see Table 2). The instance annotations are scaled for better visibility.

the HE and the resulting concept responses show a strong coherency with respect to shape appearance: different instances from the different datasets that can be considered similar on a human semantic level form interrelated and coherent groups, as shown by the uniformly colored regions in Fig. 11. This is also reflected in Fig. 12(a) and (b), which illustrate the distribution of instances in the CR space. Instances labeled as can, box, ball, amphora, plate form distinct regions, whereas deformable instances like sack and teddy lead to more scatter. However, teddies are still represented as a connected region, and regions dedicated to sack are located at transitions to other labeled regions, e.g., can to plate, can to box or can to teddy. This observation can be explained by the fact that sacks can be interpreted as an intermediate shape in CR space, e.g., between a box and a can, due to their roundish, bulgy or cylindric appearance depending on viewpoint and deformation.

Figure 12: According to CR in Fig. 11, the distribution is shown of instances within a region (a) and the assignment of instances to particular regions (b).

In this section, the performance is evaluated when using the mental simulation for training (Sec. 5). To allow a comparison of our approach with other work, the concrete random samples that are used in this experiment are provided as an open dataset, which is denoted in the following as
Artificial Object Dataset (AOD). Examples of the artificial abstract objects from this dataset are shown in Fig. 7. The dataset contains 250 training samples, which were artificially generated with an equally distributed number of samples per prototype order (up to five). These artificially generated samples are used to generate the shape concepts, including the HE generation.

We start the evaluation with an illustrative example. In Fig. 13, three (very simple) simulated objects are shown with their respective extracted segment graphs (g_o). The simulated instances consist of noise-free point clouds; thus the segments are optimally segmented. When using simulation-based training sample generation, an open question is whether the perception system is able to transfer the knowledge observed in simulation to real object observations. To test this in this simple illustrative example, the three artificial instances in Fig. 13 are consecutively fed (from Fig. 13(a) to (c)) to HE and the learned motif prototypes are labeled with the respective label. In Fig. 14(b), the classification result of this illustrative example on a real-world scene is shown. More precisely, the label with the highest accumulated

(a) can instance (b) box instance (c) ball instance
Figure 13: Examples of simulated instances of (simple) shape categories and the corresponding extracted super patch graphs (top left in (a), (b), (c)).
(a) RGB image (b) Classification result
Figure 14: A sample classification result on real-world scene data using a model trained with simulated objects only (Fig. 13). Note that objects are segmented from the scene with our previous work [38] and then classified with the trained model.

stimulus considering the observed (labeled) motif prototypes (Sec. 3.3) is shown for each object. Note that HE has been trained with only a single artificial instance per label (can, box and ball). Several observations can be made from the classification results. Considering the correct classifications, one may interpret them as a knowledge transfer from simulated data to real, noisily observed data. Regarding sensor noise, segmented surfaces are distorted and may even contain holes (Fig. 14(b)). These distortions lead to segment constellations which have not been observed in the training phase. The simulated cylinder (can) in Fig. 13(a) naturally consists of an upper planar segment and a cylindric body, whereas the real can shown in Fig. 14(b) is over-segmented and subsequently consists of three segments caused by sensor noise. Although this segment constellation has not been observed in the training phase, it still led to a correct classification, as it is closest to the ideal cylinder concept. Furthermore, different viewpoints on objects can lead to different segment constellations due to self-occlusion effects. The viewpoint on the box in Fig. 13(b) results in two planar segments, whereas the viewpoint on the box shown in Fig. 14(b) leads to three segments and a hole (red colored segment) caused by sensor noise. Also in this case, the segment constellation has not been observed in the training phase, but it still leads to a correct and confident classification. Furthermore, note that the simulated instances used for training also have completely different spatial dimensions compared to the real objects shown in Fig. 14(b).
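The sample generation of Alg. 1 (Sec. 5) can be sketched as below. Function names, dimension ranges and the simplified overlap handling (placing each new prototype near an already placed one, instead of the exact intersection test) are illustrative assumptions, and the Gazebo spawning and virtual-sensor capture steps are omitted.

```python
import random

PROTOTYPE_TYPES = ("box", "cylinder", "sphere")

def generate_sample(n, rng=random):
    """Sketch of Alg. 1: compose n randomly parameterized primitive
    prototypes into one connected abstract object description."""
    sample = []
    for i in range(n):
        proto = {
            "type": rng.choice(PROTOTYPE_TYPES),
            # random spatial dimensions; the ranges are illustrative
            "dims": {k: rng.uniform(0.05, 0.5)
                     for k in ("length", "width", "height", "radius")},
        }
        if i == 0:
            proto["position"] = (0.0, 0.0, 0.0)
        else:
            # crude stand-in for the intersection constraint of Alg. 1:
            # place the new prototype close to a random earlier one
            anchor = rng.choice(sample)["position"]
            proto["position"] = tuple(a + rng.uniform(-0.05, 0.05)
                                      for a in anchor)
        proto["orientation"] = tuple(rng.uniform(0.0, 3.14159)
                                     for _ in range(3))
        sample.append(proto)
    return sample

obj = generate_sample(4)  # an abstract object of prototype order 4
```

In the actual pipeline, such a composition would be spawned in Gazebo and scanned with a virtual depth sensor to obtain a point cloud sample.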
This experiment evaluates the generalization ability using extensive artificial training data from mental simulation, i.e., the Artificial Object Dataset (AOD) with 250 simulated samples of abstract objects. Initially, HE is trained and concepts are generated once in an unsupervised and label-agnostic manner with the artificial samples of the AOD dataset. Given the HE model and the concepts C generated with the AOD, the generalization capability of C is evaluated with real-world instances from the Object Shape Category Dataset [16] (OSCD), the
Washington RGB-D Object Dataset [17] (WD) and the
Object Segmentation Database [18] (SD), see Table 2. Note that all three real-world datasets are sampled from different distributions (see Fig. 10), i.e., the datasets consist of various, very different objects and they differ with respect to the experimental setups for the sensor data generation. In order to analyze the spectrum of responses for these dataset objects, each object o is applied to the two-step procedure: first, propagate o through HE to generate a stimuli vector ∗γ_o (see Sec. 3.2); second, compute for each concept c ∈ C the response φ_c(∗γ_o). As a result, an object o is represented by the set of concept responses ρ_o = {φ_1(∗γ_o), φ_2(∗γ_o), ...} (|ρ_o| = |C| = 28). Consequently, in order to investigate the generalization capability, the approach as described in Sec. 6.3 is followed, i.e., a |C|-dimensional space of concept responses CR^|C| is created and the embedding is performed to reduce the dimensionality to two. As a result, instances from the WD, the SD and the OSCD datasets are projected to this two-dimensional CR space (Fig. 15). When looking at CR, an important observation is that, after propagating the instances of the three datasets through the HE, the resulting concept responses also show coherency regarding shape appearance, as in Sec. 6.3. Instances of all evaluated datasets together form interrelated and coherent groups; see the uniformly colored regions in Fig. 15, colored according to the labels of the real datasets. This is also reflected in Fig. 16, which illustrates the instance distribution in the CR space. By averaging the diagonals (bottom-left to top-right) of Fig. 16(a) and (b) versus Fig. 12(a) and (b), one can observe that a similar discrimination has been achieved with the artificially generated training set (Fig. 16) compared to the real object training set (Fig. 12). Note that, in an unsupervised manner, CR forms regions of various shapes and degrees of label-association in a continuous space (Fig.
15), compared to the hard-assigned discrete results w.r.t. labels in Fig. 12 and 16, which may also contain noise in the point clouds and in the labeling process. Thus, the discrete results may only partially reflect the underlying label-association strength

Figure 15: The projection of real-world samples from the OSCD, the WD and the SD dataset to the CR space that is generated with mental simulation (Fig. 7). A summary of the instances used is shown in Table 2.

of objects compared to the continuous CR space. Consequently, this indicates that randomly generated abstract instances, based on compositions of primitive shape prototypes from mental simulation, carry information about facets of shape appearance that allows to create shape concepts, which in turn facilitate the generation of an abstract space suited to discriminate and categorize real object observations in a reasonable way. From the perspective of Cognitive Science, specifically in the field of representation architectures, CR can be seen as a Conceptual Space [49, 50, 51] where points (prototypes) in the abstract space represent multidimensional vectors of stimuli and regions in space represent concepts. These stimuli are often denoted as
Quality Dimensions and can be interpreted as the concept responses ρ_o with respect to C given an object o. Another property can be observed that supports that concept responses of similar instances appear close in CR compared to dissimilar ones: the majority of

Figure 16: According to CR in Fig. 15, the distribution is shown of instances within a region (a) and the assignment of instances to particular regions (b).

instances of the respective label given by humans are closest to or within the same region and form groups (Fig. 15 and Fig. 11).

We presented an unsupervised abstraction process for machine learning of shape concepts: from 3D point clouds over hierarchically organized motifs to (semantically meaningful) concepts of shape commonalities. The proposed Shape Motif Hierarchy Ensemble encodes object segment compositions in a hierarchical, symbolic manner. Inspired by the concept of Persistent Homology, stimuli generated by the ensemble are filtered in a gradual manner to reveal topological structures. The filtration leads to stimuli groups which can be interpreted as shape concepts that reflect commonalities of shape appearances. An important question is how this unsupervised learning is trained. Even when no human labels are used, biases can lie in the selection of the dataset instances used for training. Moreover, the generation of real-world datasets is cumbersome and generally requires substantial effort. Therefore, the use of mental simulation is investigated in this article, i.e., the generation of virtual sensor data from artificial abstract objects.
This approach is unsupervised in two respects: it is label-agnostic (no label information is used) and instance-agnostic (no instances preselected by human supervision are used). In a first set of experiments, the shape concepts are learned in an unsupervised, label-agnostic fashion from a single real-world dataset and it is shown that a) semantically meaningful categories emerge, i.e., associations to shape categories linked to human-annotated labels appear, and that b) the concepts generalize to other real-world datasets, i.e., the concepts learned on one dataset lead to meaningful label associations when being applied to completely different real-world datasets. In a second set of experiments, these results are extended to mental simulation, i.e., the training is both label-agnostic and instance-agnostic. It is shown that training with virtual sensor data from artificial abstract objects leads to a semantically meaningful shape concept space which generalizes to real-world object datasets. That is, it leads to a shape concept space in which unknown objects from real-world sensor data are grouped (based on their commonalities) into regions in concept space that can, for instance, be linked to human-annotated labels.
References

[1] E. S. Spelke, "Principles of object perception," Cognitive Science, vol. 14, no. 1, pp. 29–56, 1990.
[2] L. B. Smith, "Learning to recognize objects," Psychological Science, vol. 14, no. 3, pp. 244–250, 2003.
[3] M. Graf, Categorization and Object Shape. Springer Berlin Heidelberg, 2010, pp. 73–101.
[4] C. A. Mueller, K. Pathak, and A. Birk, "Object shape categorization in rgbd images using hierarchical graph constellation models based on unsupervisedly learned shape parts described by a set of shape specificity levels," in International Conference on Intelligent Robots and Systems, 2014.
[5] C. Eppner and O. Brock, "Grasping unknown objects by exploiting shape adaptability and environmental constraints," in International Conference on Intelligent Robots and Systems, 2013.
[6] P. Abelha, F. Guerin, and M. Schoeler, "A model-based approach to finding substitute tools in 3d vision data," in International Conference on Robotics and Automation, 2016.
[7] M. Thosar, C. A. Mueller, and S. Zug, "What stands-in for a missing tool? a prototypical grounded knowledge-based approach to tool substitution," in International Cognitive Robotics Workshop on Principles of Knowledge Representation and Reasoning (KR), 2018, arXiv:1808.06423 [cs.RO].
[8] V. M. Sloutsky, "From perceptual categories to concepts: What develops?" Cognitive Science, vol. 34, no. 7, pp. 1244–1286, 2010.
[9] V. Högman, M. Björkman, A. Maki, and D. Kragic, "A sensorimotor learning framework for object categorization," IEEE Transactions on Cognitive and Developmental Systems, vol. 8, no. 1, pp. 15–25, 2016.
[10] T. Nakamura and T. Nagai, "Ensemble-of-concept models for unsupervised formation of multiple categories," IEEE Transactions on Cognitive and Developmental Systems, 2018.
[11] J. Nishihara, T. Nakamura, and T. Nagai, "Online algorithm for robots to learn object concepts and language model," IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 3, pp. 255–268, 2017.
[12] F. G. Ashby and E. M. Waldron, "On the nature of implicit categorization," Psychonomic Bulletin & Review, vol. 6, no. 3, pp. 363–378, 1999.
[13] A. M. S. Barry, Visual Intelligence: Perception, Image, and Manipulation in Visual Communication. State University of New York Press, 1997.
[14] T. J. Palmeri and I. Gauthier, "Visual object understanding," Nature Reviews Neuroscience, vol. 5, no. 4, pp. 291–303, 2004.
[15] S. Zmigrod and B. Hommel, "Feature integration across multimodal perception and action: A review," Multisensory Research, vol. 26, no. 1-2, pp. 143–157, 2013.
[16] C. A. Mueller and A. Birk, "Conceptualization of object compositions using persistent homology," in International Conference on Intelligent Robots and Systems, 2018.
[17] K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view rgb-d object dataset," in International Conference on Robotics and Automation, 2011.
[18] A. Richtsfeld, T. Morwald, J. Prankl, M. Zillich, and M. Vincze, "Segmentation of unknown objects in indoor environments," in International Conference on Intelligent Robots and Systems, 2012.
[19] S. Biasotti, L. De Floriani, B. Falcidieno, P. Frosini, D. Giorgi, C. Landi, L. Papaleo, and M. Spagnuolo, "Describing shapes by geometrical-topological properties of real functions," ACM Computing Surveys, vol. 40, no. 4, pp. 12:1–12:87, 2008.
[20] J. J. DiCarlo and D. D. Cox, "Untangling invariant object recognition," Trends in Cognitive Sciences, vol. 11, pp. 333–341, 2007.
[21] R. Jonschkowski, C. Eppner, S. Höfer, R. M. Martin, and O. Brock, "Probabilistic multi-class segmentation for the amazon picking challenge," in International Conference on Intelligent Robots and Systems, 2016.
[22] J. A. Fodor and Z. W. Pylyshyn, "Connectionism and cognitive architecture: a critical analysis," Cognition, vol. 28, pp. 3–71, 1988.
[23] J. Papon, A. Abramov, M. Schoeler, and F. Wörgötter, "Voxel cloud connectivity segmentation - supervoxels for point clouds," in Computer Vision and Pattern Recognition, 2013.
[24] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena, "Contextually guided semantic labeling and search for three-dimensional point clouds," The International Journal of Robotics Research, vol. 32, no. 1, pp. 19–34, 2013.
[25] B. Leibe, A. Leonardis, and B. Schiele, "Combined object categorization and segmentation with an implicit shape model," in European Conference on Computer Vision Workshop on Statistical Learning in Computer Vision, 2004.
[26] M. Prasad, J. Knopp, and L. Van Gool, "Class-specific 3D localization using constellations of object parts," in British Machine Vision Conference, 2011.
[27] U. Asif, M. Bennamoun, and F. Sohel, "Efficient rgb-d object categorization using cascaded ensembles of randomized decision trees," in International Conference on Robotics and Automation, 2015.
[28] R. Kindermann and J. L. Snell, Markov Random Fields and Their Applications, 1980.
[29] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," Pattern Analysis and Machine Intelligence, 1984.
[30] J. Utans, "Learning in compositional hierarchies: Inducing the structure of objects from data," in Advances in Neural Information Processing Systems 6, 1993, pp. 285–292.
[31] S. Fidler, M. Boben, and A. Leonardis, "Learning hierarchical compositional representations of object structure," in Object Categorization: Computer and Human Vision Perspectives, S. Dickinson, A. Leonardis, B. Schiele, and M. J. Tarr, Eds. Cambridge University Press, 2009.
[32] M. Ozay, U. R. Aktas, J. L. Wyatt, and A. Leonardis, "Compositional hierarchical representation of shape manifolds for classification of non-manifold shapes," in International Conference on Computer Vision, 2015, pp. 1662–1670.
[33] K. R. Jerripothula, J. Cai, J. Lu, and J. Yuan, "Object co-skeletonization with co-segmentation," in Conference on Computer Vision and Pattern Recognition, 2017.
[34] W. Shen, K. Zhao, Y. Jiang, Y. Wang, X. Bai, and A. Yuille, "Deepskeleton: Learning multi-task scale-associated deep side outputs for object skeleton extraction in natural images," IEEE Transactions on Image Processing, vol. 26, no. 11, pp. 5298–5311, 2017.
[35] G. Carlsson, "Topological pattern recognition for point cloud data," Acta Numerica, vol. 23, pp. 289–368, 2014.
[36] C. Li, M. Ovsjanikov, and F. Chazal, "Persistence-based structural recognition," in Conference on Computer Vision and Pattern Recognition, 2014.
[37] W. J. Beksi and N. Papanikolopoulos, "3d point cloud segmentation using topological persistence," in International Conference on Robotics and Automation, 2016.
[38] C. A. Mueller and A. Birk, "Hierarchical graph-based discovery of non-primitive-shaped objects in unstructured environments," in International Conference on Robotics and Automation, 2016.
[39] R. Rusu, N. Blodow, and M. Beetz, "Fast point feature histograms (FPFH) for 3D registration," in International Conference on Robotics and Automation, 2009.
[40] E. H. Rosch, "Natural categories," Cognitive Psychology, vol. 4, no. 3, pp. 328–350, 1973.
[41] C.-J. Huang and W.-C. Liao, "Application of probabilistic neural networks to the class prediction of leukemia and embryonal tumor of central nervous system," Neural Processing Letters, vol. 19, no. 3, pp. 211–226, 2004.
[42] J. Lin, "Divergence measures based on the shannon entropy," IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
[43] H. Edelsbrunner, D. Letscher, and A. Zomorodian, "Topological persistence and simplification," Discrete & Computational Geometry, vol. 28, no. 4, 2002.
[44] A. Zomorodian and G. Carlsson, "Computing persistent homology," Discrete & Computational Geometry, vol. 33, no. 2, 2005.
[45] X. Zhu, "Persistent homology: An introduction and a new text representation for natural language processing," in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
[46] J. B. Kruskal, "On the shortest spanning subtree of a graph and the traveling salesman problem," Proceedings of the American Mathematical Society, vol. 7, no. 1, pp. 48–50, 1956.
[47] N. Koenig and A. Howard, "Design and use paradigms for Gazebo, an open-source multi-robot simulator," in International Conference on Intelligent Robots and Systems, 2004.
[48] L. van der Maaten and G. E. Hinton, "Visualizing high-dimensional data using t-sne," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[49] P. Gärdenfors, Conceptual Spaces: The Geometry of Thought. MIT Press, 2000.
[50] F. Zenker and P. Gärdenfors, Applications of Conceptual Spaces: The Case for Geometric Knowledge Representation. Springer International Publishing, 2015.
[51] S. Rama Fiorini, P. Gärdenfors, and M. Abel, "Representing part–whole relations in conceptual spaces,"