Where is my forearm? Clustering of body parts from simultaneous tactile and linguistic input using sequential mapping
Karla Stepanova, Matej Hoffmann, Zdenek Straka, Frederico B. Klein, Angelo Cangelosi, Michal Vavrecka
(1) Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague
(2) Czech Institute of Informatics, Robotics, and Cybernetics, CTU in Prague
(3) iCub Facility, Istituto Italiano di Tecnologia
(4) School of Computing, Electronics and Mathematics, Plymouth University, Plymouth, UK
Abstract
Humans and animals are constantly exposed to a continuous stream of sensory information from different modalities. At the same time, they form more compressed representations like concepts or symbols. In species that use language, this process is further structured by linguistic interaction, where a mapping between the sensorimotor concepts and linguistic elements needs to be established. There is evidence that children might be learning language by simply disambiguating potential meanings based on multiple exposures to utterances in different contexts (cross-situational learning). In existing models, the mapping between modalities is usually found in a single step by directly using frequencies of referent and meaning co-occurrences. In this paper, we present an extension of this one-step mapping and introduce a newly proposed sequential mapping algorithm together with a publicly available Matlab implementation. For demonstration, we have chosen a less typical scenario: instead of learning to associate objects with their names, we focus on body representations. A humanoid robot receives tactile stimulations on its body while at the same time listening to utterances of the body part names (e.g., hand, forearm and torso). With the goal of arriving at the correct "body categories", we demonstrate how the sequential mapping algorithm outperforms one-step mapping. In addition, the effects of data set size and of noise in the linguistic input are studied.

1 Introduction

Body representation has been the topic of psychological, neuroanatomical and neurophysiological studies for many decades. Spurred by the account of Head and Holmes (1911) and their proposal of superficial and postural schema, a number of different concepts has been proposed since: body schema, body image, and corporeal schema being only some of them. Body schema is usually thought of as a more "low-level", sensorimotor representation of the body used for action. Body image is an umbrella term uniting higher-level representations, serving perception more than action, and accessible to consciousness. Schwoebel and Coslett (2005) amassed evidence for distinguishing between three types of body representations: body schema, body structural description, and body semantics, constituting a kind of hierarchy. The body structural description is a topological map of locations derived primarily from visual input that defines body part boundaries and proximity relationships. Finally, body semantics is a lexical-semantic representation of the body including body part names, functions, and relations with artifacts (e.g., shoes are used on the feet, and feet can be used to kick a football). While the details of every particular taxonomy or hierarchy can be discussed, there is clearly a trend from continuous, modality-specific representations (like the tactile homunculus) to multimodal, more aggregated representations. This may be first instantiated by increasing receptive field size and combining sensory modalities, as is apparent in somatosensory processing: starting from areas relatively specialized for proprioception or touch and with small receptive fields (like Brodmann areas 3a and 3b), touch and proprioception become increasingly combined in areas 1 and 2. Then, going from anterior to posterior parietal cortex, the receptive fields grow further and somatosensory information is combined with visual information. One can then ask whether this process of bottom-up integration or aggregation may give rise to discrete entities, or categories, similar to individual body parts.
Vignemont et al. (2009) focused on how body segmentation between hand and arm could appear based on combined tactile and visual perception. They explored the category boundary effect which appeared when two tactile stimuli were presented: these stimuli felt farther away when they were applied across the wrist than when they were applied within a single body part (palm or forearm). In conclusion, they suggest that the representation of the body is structured in categorical body parts delineated by joints, and that this categorical representation modulates tactile spatial perception.

Next to this essentially bottom-up clustering of multimodal body-related information, an additional "categorization" of body parts is imposed through language, such as when the infant hears her parents naming the body parts. Interestingly, recent research (Majid, 2010) showed that there is some cross-linguistic variability in naming body parts, and this may in turn override or influence the "bottom-up" multimodal (non-linguistic) body part categorization.

While the field is relatively rich in experimental observations, the mechanisms behind the development and operation of these representations are still not well understood. Here, computational and in particular robotic modeling ties in; see (Hoffmann et al., 2010; Schillaci et al., 2016) for surveys on body schema in robots. Petit and Demiris (2016) developed an algorithm for the iCub humanoid robot to associate labels for body parts and later proto-actions with their embodied counterparts. These could then be recombined in a hierarchical fashion (e.g., "close hand" consists of folding individual fingers). Mimura et al. (2017) used a Dirichlet process Gaussian mixture model with latent joints to provide a Bayesian body schema estimation based on tactile information. Their results suggest that kinematic structure could be estimated directly from tactile information provided by a moving fetus without any additional visual information, albeit with a lower accuracy. Our own work on the iCub humanoid robot has thus far focused on learning primary representations: tactile (Hoffmann et al., 2017) and proprioceptive (Hoffmann and Bednarova, 2016). Here, we use the former (the "tactile homunculus") as input for further processing and interaction with linguistic input.

In this work, we strive to find a segmentation of body parts based on simultaneous tactile and linguistic information. However, body part categorization and mapping to body part names is one instance of a more general problem: segmenting objects from the environment, learning compressed representations (loosely speaking: concepts, categories, symbols) to stand in for them, and associating them with words to which the infant is often exposed simultaneously. Borghi et al. (2004), for example, studied the interaction of object names with situated action on the same objects.

We made use of a newly proposed sequential mapping algorithm which extends the idea of one-step mapping (Smith et al., 2006) and compared its overall accuracy to one-step mapping as well as to the accuracies of segmenting individual body parts.
We further explore how the accuracy of the learned mapping is influenced by the level of noise in the linguistic domain and by the data set size. The sequential mapping strategy proved to be very robust, finding the correct mapping even under very noisy input, and clearly outperformed one-step mapping.

The complete source code used for generating the results in this article is publicly available at https://github.com/stepakar/sequential-mapping.

This article is structured as follows. The inputs, their preprocessing, and the mapping algorithms are described in Section 2. This is followed by Results (Section 3) and a Discussion and Conclusion.

2 Materials and Methods

In this section, we first present the inputs and their preprocessing pipelines: tactile input (Section 2.1) and linguistic input (Section 2.2). In total, 9 body parts of the right half of the robot's upper body were stimulated: torso/chest, upper arm, forearm, palm and 5 fingertips. Tactile stimulation coincided with an utterance of the body part's name. Then, the one-step and sequential mapping algorithms (Sections 2.3.1 and 2.4) are presented, followed by a description of the evaluation (Section 2.5).

2.1 Tactile input

To generate tactile stimulation pertaining to different body parts, we built on our previous work on the iCub humanoid robot, in particular the "tactile homunculus" (Hoffmann et al., 2017), a primary representation of the artificial sensitive skin the robot is covered with (see Fig. 1; one half of the robot's upper body). In the current work, the skin was not physically stimulated anymore; instead, the activations were emulated and then relayed to the "homunculus", as detailed below.

2.1.1 Virtual tactile stimulation

We created a YARP (Metta et al., 2006) software module to generate virtual skin contacts (available at https://github.com/robotology/peripersonal-space/tree/master/modules/virtualContactGeneration). A skin part was randomly selected and then stimulated. The number of pressure-sensitive elements (henceforth taxels) for the different skin parts was 440 for the torso, 380 for the upper arm, 230 for the forearm, and 104 for the hand (44 for the palm and 5 × 12 for the fingertips), giving 1154 taxels in total. Once the skin part was randomly selected, a small region was also randomly picked within that part for the tactile stimulation: 10 taxels at a time, corresponding to the triangular modules the skin is composed of. For the hand, the situation was slightly different: the entire hand was treated as one skin part. Then, within the hand, a random choice was made between 5 subregions on the palm skin (8 to 10 taxels) and the 5 fingertips (12 taxels each). Data was collected for 100 minutes, corresponding to approximately 2000 individual 3-second stimulations. For all skin parts, the stimulation lasted for 3 seconds and was sampled at 10 Hz. A label (the body part name) was saved along with the tactile data. These labels are used to generate the linguistic input and for performance evaluation later, but they do not directly take part in the clustering of tactile information. Please note that there were separate labels for the palm and individual fingers, while these were all treated as one "skin part" in the virtual touch generation; hence the number of samples per finger, for example, was lower than for other, non-hand body parts.
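As a rough illustration of the virtual contact generation described above, the following Python sketch draws a random skin part and activates a random contiguous group of 10 taxels in a binary activation vector. This is only a simplified stand-in for the YARP module (it ignores the triangular-module layout and the special handling of palm subregions and fingertips); the taxel counts are those reported in the text, and all names are illustrative.

    import numpy as np

    # Taxel counts per skin part as reported in the text (1154 taxels in total).
    TAXELS = {"torso": 440, "upper_arm": 380, "forearm": 230, "hand": 104}
    OFFSETS = {}            # starting index of each part in the full activation vector
    _start = 0
    for part, n in TAXELS.items():
        OFFSETS[part] = _start
        _start += n
    N_TAXELS = _start       # 1154

    def virtual_contact(rng, region_size=10):
        """Pick a random skin part and stimulate a random contiguous region of taxels.

        Returns a binary activation vector a(t) of length 1154 and the part label.
        Sketch only: the real module stimulates triangular modules and treats the
        palm subregions and fingertips separately.
        """
        part = rng.choice(list(TAXELS))
        start = OFFSETS[part] + rng.integers(0, TAXELS[part] - region_size + 1)
        a = np.zeros(N_TAXELS, dtype=np.uint8)
        a[start:start + region_size] = 1          # 10 taxels at a time
        return a, part

    rng = np.random.default_rng(0)
    samples = [virtual_contact(rng) for _ in range(2000)]   # roughly 2000 stimulations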
2.1.2 Tactile homunculus

The input layer of the "tactile homunculus" (Hoffmann et al., 2017) consists of a vector, a(t), of activations of the 1154 taxels at time t (the output of the previous section), which have binary values (1 when a taxel is stimulated, 0 otherwise). The output layer then forms a 7 × 24 grid (168 "neurons" in total); see Figure 1B. This layer is a compressed representation of the skin surface: the receptive fields of the neurons (the parts of the skin they respond to) are schematically color-coded. However, this code (and "clustering") is not available as part of the tactile input.

The output layer will be represented as a single vector x(t) = [x_1(t), ..., x_168(t)]. The activations of the output neurons, x_i(t), are calculated as dot products of the weight vector u_i corresponding to the i-th output neuron and the tactile activation vector a(t) as follows:

x_i(t) = u_i · a(t).    (1)
The output of the first layer, the vector x(t) (168 elements, continuous-valued), serves as input to the second tactile processing layer. This layer aims to cluster individual body parts and represent them as abstract models. The resulting models T_j are subsequently mapped in the multimodal layer to the clusters found in the language layer.

To process the outputs from the first layer, we used a Gaussian mixture model (GMM), which is a convex mixture of D-dimensional Gaussian densities l(x | θ_j). In this case, each tactile model T_j is described by a set of parameters θ_j. The posterior probabilities p(θ_j | x) are computed as follows:

p(θ_j | x) = Σ_{j=1}^{J} r_{kj} l(x | θ_j),    (2)

l(x | θ_j) = 1 / (√((2π)^D) √|S_j|) · exp[−(1/2) (x − m_j)^T S_j^{-1} (x − m_j)],    (3)

where x is a set of D-dimensional continuous-valued data vectors, r_{kj} are the mixture weights, J is the number of tactile models, and the parameters θ_j are the cluster centers m_j and covariance matrices S_j.

The mixture of Gaussians is trained by the EM algorithm (Dempster et al., 1977). The number of tactile models J is preset in this model based on the number of different linguistic labels. In the future, we plan to use an adaptive extension of the GMM algorithm such as gmGMM (Štěpánová and Vavrečka, 2016) to detect this number autonomously.

The output of this layer for each data point x(t) is the vector y(t) of J output parameters describing the data point (the likelihood that the data point belongs to each individual cluster in the mixture). This corresponds to fuzzy memberships (a distributed representation).
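The two tactile processing layers can be sketched in Python as follows. The homunculus weight matrix U below is a random stand-in for the SOM weights learned in (Hoffmann et al., 2017), the input vectors are synthetic, and scikit-learn's GaussianMixture (EM training) is used in place of the authors' Matlab GMM implementation; only the overall flow (Eq. 1 projection followed by fuzzy GMM memberships) is meant to be representative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)

    # Stand-in for the learned homunculus weights (168 output neurons x 1154 taxels).
    U = rng.random((168, 1154))

    def homunculus_output(a):
        """First layer: x_i(t) = u_i . a(t) for every output neuron (Eq. 1)."""
        return U @ a

    # A: synthetic binary taxel activation vectors, one row per stimulation.
    A = rng.integers(0, 2, size=(2000, 1154))
    X = np.array([homunculus_output(a) for a in A])   # shape (2000, 168)

    # Second layer: GMM with one component per expected body part (J = 9),
    # trained with EM; the posterior responsibilities play the role of y(t).
    gmm = GaussianMixture(n_components=9, covariance_type="full", random_state=0).fit(X)
    Y = gmm.predict_proba(X)                  # fuzzy memberships, shape (2000, 9)
    hard_tactile_cluster = Y.argmax(axis=1)   # most probable tactile cluster per data point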
2.2 Linguistic input

Tactile stimulation of a body part was accompanied by the corresponding utterance. In our case, where we have 9 separate body parts, these are 'torso', 'upper arm', 'forearm', 'palm', 'little finger', 'ring finger', 'middle finger', 'index finger' and 'thumb'. Linguistic and tactile inputs are processed simultaneously.

We conducted experiments with spoken language input: one-word utterances pronounced by a non-native English speaker. To process these data, we made use of CMU Sphinx (an open-source, flexible, Markov-model-based speech recognition system) (Lamere et al., 2003) and achieved 100% accuracy of word recognition. The word forms are extracted from the audio input and compared to prelearned language models by means of the log-scale scores p(w_nt | L_i) of the audio matching. Based on these data, the posterior probability can be computed.

However, in the current work, we employed a shortcut and used the labels (ground truth) directly. This allowed us to fully explore the effect of misclassification in the linguistic subdomain on the mapping accuracy. The noise was added to the language data subsequently and evenly over all classes (a given proportion of labels was randomly permuted).
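A possible reading of this label-corruption step is sketched below: a given proportion of the ground-truth labels is selected and the labels are permuted among the selected samples. Whether the selection is made per class or uniformly over all samples is not specified in detail, so the uniform selection here is an assumption.

    import numpy as np

    def add_label_noise(labels, noise_level, rng):
        """Randomly permute a given proportion of the linguistic labels.

        noise_level is the fraction of samples affected; the selection below is
        uniform over all samples (an assumption), which is even over classes only
        in expectation.
        """
        labels = np.asarray(labels).copy()
        n_noisy = int(round(noise_level * len(labels)))
        idx = rng.choice(len(labels), size=n_noisy, replace=False)
        labels[idx] = labels[rng.permutation(idx)]   # shuffle labels among chosen samples
        return labels

    rng = np.random.default_rng(2)
    noisy = add_label_noise(["torso"] * 5 + ["palm"] * 5, 0.4, rng)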
Fig. 1: iCub skin and tactile homunculus. (A) Photograph of the iCub robot with the artificial skin exposed on the right half of the upper body (1154 taxels in total). (B) Representation of tactile inputs learned using a Self-Organizing Map: a 24 × 7 grid (Hoffmann et al., 2017).

2.3 Mapping between linguistic and tactile models

One possible way to establish a mapping between sensorimotor concepts and linguistic elements is to use the frequencies of referent and meaning co-occurrences; that is, the pairs with the highest co-occurrence are mapped together (Smith et al., 2006; Xu and Tenenbaum, 2007). This method is usually called cross-situational learning and presupposes an ideal associative learner who can keep track of and store all co-occurrences in all trials, internally memorizing and representing the word-object co-occurrence matrix of the input. This allows the learner to subsequently choose the most strongly associated referent (Yu and Smith, 2012).

2.3.1 One-step mapping

The simplest one-step word-to-referent learning algorithm only accumulates word-referent pairs. This can be viewed as Hebbian learning: the connection between a word and an object is strengthened if the pair co-occurs in a trial. To extend this basic idea, we can also enable forgetting by introducing a parameter η, which can capture memory decay (Yu and Smith, 2012). Supposing that at each trial t we observe an object o_nt and hear a corresponding word w_nt (N_t possible associations), we can describe the update of the strength of the association between the word model L(i) and the object (in our case the tactile model T(j)) as follows:

A(i, j) = Σ_{t=1}^{R} η(t) Σ_{n=1}^{N_t} δ(w_nt, i) δ(o_nt, j),    (4)

where R is the number of trials, δ is the Kronecker delta function (equal to 1 when both arguments are identical and 0 otherwise), w_nt and o_nt indicate the n-th word-object association that the model attends to and attempts to learn in trial t, and η(t) is the parameter controlling the gain of the strength of association.

Now let us assume that the word w(i) is modeled by the model L_i in the language domain and the object (referent) o(j) is modeled by the model T_j in the tactile domain. Our goal is to find the corresponding model T_m(i) from the tactile subdomain for each model L_i from the language domain and to assign them to each other. The indices m(i) are found as follows:

∀i: m(i) = argmax_j A(i, j),    (5)

where A is the co-occurrence matrix computed in Eq. 4 (element A(i, j) captures the co-occurrence between the word w(i) and the object o(j)).

2.4 Sequential mapping

To capture dynamic competition among models, we extend the basic one-step mapping algorithm for cross-situational learning by the sequential addition of inhibitory connections. Inhibitory mechanisms and situation-time dynamics were already partially included in the model of cross-situational learning proposed by McMurray et al. (2012). Even though our model shares some similarities with the model proposed by McMurray, it stems from different computational mechanisms. After a reliable assignment between a language and a tactile model is found, inhibitory connections between this tactile model and all other language models are added. Thanks to this mechanism, the mutual exclusivity principle (the fact that children prefer mappings where an object has only one label rather than multiple labels (Markman, 1990)) is guaranteed.

The assignment between tactile models T_j and language models L_j is found using the following iterative procedure (a code sketch is given after the list):

1. Tactile and language data are clustered separately and the corresponding posterior probabilities are found.
2. For each data point the most probable tactile and language clusters are selected and the data point is assigned to these clusters.
3. The co-occurrence matrix with elements A(i, j) is computed and the best assignment is selected:

   [i_m, m(i_m)] = argmax_i argmax_j A(i, j).    (6)

   In this step, the tactile model T_m(i_m) is assigned to the language model L_i_m.
4. Inhibitory connections are added between the assigned tactile model T_m(i_m) and all language models L_i, where i ≠ i_m (mutual exclusivity).
5. Assigned data points (data points which belong to both T_m(i_m) and L_i_m) are deleted from the data set.
6. If the data set is not empty or not all tactile clusters are assigned to some language cluster, go to (1); else stop.
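The following Python sketch illustrates both mappings on already-clustered data (hard tactile cluster indices and language label indices per data point). The per-iteration re-clustering of step 1 is omitted, and the function names are illustrative; this is not the authors' Matlab implementation.

    import numpy as np

    def one_step_mapping(tactile, language, n_tactile, n_lang):
        """One-step mapping (Eqs. 4-5): co-occurrence counts, then argmax per language model."""
        A = np.zeros((n_lang, n_tactile))
        for j, i in zip(tactile, language):
            A[i, j] += 1                       # eta(t) = 1, i.e. no forgetting
        return A.argmax(axis=1)                # m(i) for every language model i

    def sequential_mapping(tactile, language, n_tactile, n_lang):
        """Sequential mapping: repeatedly pick the globally best pair (Eq. 6),
        enforce mutual exclusivity, delete the assigned data points, and repeat."""
        tactile = np.asarray(tactile)
        language = np.asarray(language)
        mapping = {}                                      # language model -> tactile model
        active = np.ones(len(tactile), dtype=bool)
        while active.any() and len(mapping) < min(n_tactile, n_lang):
            A = np.zeros((n_lang, n_tactile))
            for j, i in zip(tactile[active], language[active]):
                A[i, j] += 1
            # mutual exclusivity: already-assigned models are excluded from the search
            A[list(mapping.keys()), :] = -1
            A[:, list(mapping.values())] = -1
            i_m, j_m = np.unravel_index(A.argmax(), A.shape)
            if A[i_m, j_m] <= 0:
                break                                     # no informative co-occurrences left
            mapping[i_m] = j_m
            # delete data points assigned to both the chosen tactile and language model
            active &= ~((language == i_m) & (tactile == j_m))
        return mapping

    # Example: map 9 language labels (indices 0..8) to 9 tactile clusters, e.g.
    # mapping = sequential_mapping(hard_tactile_cluster, noisy_label_indices, 9, 9)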
2.5 Evaluation

The accuracy of the learned mapping is calculated in the following manner. We cluster the output activations from the tactile homunculus and assign each data point to the most probable cluster. Then, we find the indices m(i) for all clusters as defined in Eq. 5 for one-step mapping and Eq. 6 for sequential mapping. Based on this mapping we can assign each data point a language label. These language labels are subsequently compared to the ground truth (the body part name is equivalent to the language label prior to the application of noise). Accuracy is then computed as

acc = TP / N,    (7)

where TP (true positives) is the number of correctly assigned data points and N is the number of all data points.

3 Results

We studied the performance of the one-step vs. the sequential mapping algorithm on the ability to cluster individual body parts from simultaneous tactile and linguistic input. That is, all the skin regions on the same body part should "learn" that they belong together (to the forearm, say), thanks to the co-occurrences with the body part labels. In addition, the effects of data set size and of the level of noise in the linguistic domain are investigated (Section 3.1). A detailed analysis of the mapping accuracy for individual body parts and a backward projection onto the tactile homunculus are shown in Sections 3.2 and 3.3, respectively.

3.1 Overall mapping accuracy

The performance of the one-step and sequential mapping algorithms is shown in Fig. 2. The comparison is provided for different data set sizes (namely for 6 different data sets with the number of data points ranging from 64 to 63806) and noise levels. As can be seen, the accuracy of sequential mapping remains very stable and outperforms one-step mapping for all values of noise (in the linguistic domain) and all data set sizes. For smaller data sets, we can see a steeper drop in accuracy with increasing noise in the language data.

3.2 Mapping accuracy for individual body parts

The accuracy calculated in the previous section and shown in Fig. 2 is an overall accuracy; it does not take into account the number of data points per individual body part. To explore the performance in more detail, we also focused on the accuracy of sequential mapping for individual body parts. The results for the data sets with 3190 and 638 data points can be seen in Fig. 3, top and bottom panel, respectively. The accuracy for all body parts decreases with increasing noise in the linguistic input. The accuracy for the fingers is significantly lower; this is due to the lower number of samples per finger (see Section 2.1.1). Comparing the top and bottom panels in Fig. 3 demonstrates poorer performance with higher variance, especially for the fingers.
3.3 Projection back onto the tactile homunculus

After the tactile data from the homunculus are clustered and these clusters are mapped to the appropriate language clusters (representing body part utterances), we can project these labels back onto the original tactile homunculus. Considering that x_i(t) is the activation of neuron i in the homunculus, D is the whole data set consisting of the vector of homunculus activations for each data point, and LangLabel(d) is the language label assigned to a data point d based on the sequential mapping procedure described in Section 2.4, we can project the results of sequential mapping onto the homunculus in the following manner. First, we compute the strength of activation n_ki of each neuron i for a given language label k as follows:

n_ki = Σ_{x(t) ∈ D_k} x_i(t),  i ∈ {1, ..., 168},    (8)

where D_k = {d ∈ D | LangLabel(d) = k} and k ∈ {torso, upper arm, forearm, palm, little finger, ring finger, middle finger, index finger, thumb}.

Afterwards, we visualize for each neuron how much it is activated for the individual body parts. Results for data sets of differing size and level of noise in the linguistic domain can be seen in Fig. 4. Clearly, for large enough data sets and limited noise, the mapping from language to the tactile modality is successful in delineating the body part categories (the fingers, with fewer data points, being more challenging), as can be seen by comparing panels A and B.
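A short sketch of this back-projection: given the homunculus output activations and the language label assigned to each data point by the sequential mapping, the per-neuron activation strength n_ki of Eq. 8 is a sum over the data points carrying label k. Variable names are illustrative.

    import numpy as np

    def back_project(X, lang_labels, label_names):
        """Per-neuron activation strength for every language label (Eq. 8).

        X           : (n_samples, 168) homunculus output activations x(t)
        lang_labels : (n_samples,) language label index assigned to each data point
        Returns an array n of shape (n_labels, 168) with n[k, i] = sum of x_i(t) over D_k.
        """
        lang_labels = np.asarray(lang_labels)
        n = np.zeros((len(label_names), X.shape[1]))
        for k in range(len(label_names)):
            n[k] = X[lang_labels == k].sum(axis=0)
        return n

    # dominant label per neuron, e.g. for color-coding the 7 x 24 homunculus grid:
    # dominant = back_project(X, lang_labels, body_parts).argmax(axis=0).reshape(7, 24)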
Fig. 2: Accuracy of one-step vs. sequential mapping for different levels of noise in language. The number denotes the size of the data set; SM stands for sequential mapping and OM for one-step mapping. The mean and standard deviation from 20 repetitions are visualized.

Fig. 3: Accuracy of sequential mapping for individual body parts: visualization of the sequential mapping accuracy as a function of the noise in the linguistic data for 2 data set sizes, 3190 data points (upper) and 638 data points (lower); noise in the language data 0-100% (random). The mean and standard deviation from 40 repetitions are visualized.
4 Discussion and Conclusion

To study the problem of associating (mapping) between sensorimotor or multimodal information, concepts or categories, and language or symbols, we have chosen a specific but less studied instance of this problem: the segmentation and labeling of body parts. Perhaps, from a developmental perspective, this could be plausible, as the body may be the first "object" the infant is discovering.
Fig. 4: Projection of mapping results back onto the tactile homunculus; sample runs of the algorithm. The color code for individual body parts is the same as in Fig. 1. (A) Original homunculus with true labels. (B) Results from the data set with 6381 data points and 10% noise. (C) Results from 638 data points and 80% noise.

The self-exploration occurs in the sensorimotor domain, but at the same time or slightly later, the infant is exposed to utterances of body part names. In this work, we study the mapping between the tactile modality and body part labels from linguistic input.

We present a new algorithm for mapping language to sensory modalities (sequential mapping), compare it to one-step mapping, and test it on the body part categorization scenario. Our results suggest that this mapping procedure is robust and resistant to noise: sequential mapping shows better performance than one-step mapping for all data set sizes, and also slower performance degradation with increasing noise in the linguistic input. Furthermore, we explored the accuracy of the sequential mapping for individual body parts, revealing that body parts less represented in the data set (the fingers) were categorized less accurately. This problem might be mitigated by an increased overall data set size; yet, dealing with clusters with uneven numbers of data points is a common problem of clustering algorithms (in our case the GMM).

Projecting the labels or categories induced by language back onto the tactile homunculus showed that the body part categories are quite accurate. Given the nature of the tactile input (the skin is a continuous receptor surface) and the random-uniform tactile input generator used, the linguistic input was the only one that could facilitate cluster formation. However, more realistic, non-uniform touch and, in particular, the addition of further modalities (proprioception, vision) should enable bottom-up, non-linguistic body part category formation, as described by Vignemont et al. (2009), for example. These constitute possible directions of our future work: the "modal" cluster formation will interact with the labels imposed by language. Furthermore, thus far only one half of the body was considered, corresponding to the lateralized representations in the tactile homunculus, but one can imagine stimulating both the left and the right arm, for example, while always hearing the same utterance, 'upper arm'. Further study of the brain areas involved in this processing is needed in order to develop models more closely inspired by the functional cortical networks, as in (Caligiore et al., 2010), which models the experimental findings of (Borghi et al., 2004).

For our experiments we used artificially generated linguistic input (i.e., body part labels) with added noise (i.e., wrong labels with a certain probability). In the future, we are planning to use actual auditory input (spoken words) with real noise. This will also add the additional dimension of similarity in the auditory domain: 'arm' and 'forearm' are phonetically closer to each other than to, say, 'torso'. Thus, the linguistic modality will no longer constitute crisp, discrete labels; these will have to be extracted first, opening up further possibilities for bidirectional interaction with other modalities.
Acknowledgements

K.S. and M.H. were supported by the Czech Science Foundation under Project GA17-15697Y. M.H. was additionally supported by a Marie Curie Intra European Fellowship (iCub Body Schema 625727) within the 7th European Community Framework Programme. Z.S. was supported by The Grant Agency of the CTU Prague, project SGS16//OHK3//13. M.V. was supported by the European research project TRADR funded by the EU FP7 Programme, ICT: Cognitive systems, interaction, robotics (Project Nr. 609763).
References
Borghi, A. M., Glenberg, A. M. and Kaschak, M. P. (2004). Putting words in perspective. Memory & Cognition, 32(6):863-873.

Caligiore, D., Borghi, A. M., Parisi, D. and Baldassarre, G. (2010). TRoPICALS: A computational embodied neuroscience model of compatibility effects. Psychological Review, 117(4):1188.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38.

Head, H. and Holmes, H. G. (1911). Sensory disturbances from cerebral lesions. Brain, 34:102-254.

Hoffmann, M. and Bednarova, N. (2016). The encoding of proprioceptive inputs in the brain: knowns and unknowns from a robotic perspective. In Vavrecka, M., Becev, O., Hoffmann, M. and Stepanova, K. (eds.), Kognice a umělý život XVI [Cognition and Artificial Life XVI], pp. 55-66.

Hoffmann, M., Marques, H., Hernandez Arieta, A., Sumioka, H., Lungarella, M. and Pfeifer, R. (2010). Body schema in robotics: A review. IEEE Transactions on Autonomous Mental Development, 2(4):304-324.

Hoffmann, M., Straka, Z., Farkas, I., Vavrecka, M. and Metta, G. (2017). Robotic homunculus: Learning of artificial skin representation in a humanoid robot motivated by primary somatosensory cortex. IEEE Transactions on Cognitive and Developmental Systems.

Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., Warmuth, M. and Wolf, P. (2003). The CMU Sphinx-4 speech recognition system. In IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong, vol. 1, pp. 2-5. Citeseer.

Majid, A. (2010). Words for parts of the body. In Words and the Mind: How Words Capture Human Experience, pp. 58-71.

Markman, E. M. (1990). Constraints children place on word meanings. Cognitive Science, 14(1):57-77.

McMurray, B., Horst, J. S. and Samuelson, L. K. (2012). Word learning emerges from the interaction of online referent selection and slow associative learning. Psychological Review, 119(4):831.

Metta, G., Fitzpatrick, P. and Natale, L. (2006). YARP: yet another robot platform. International Journal of Advanced Robotic Systems, 3(1):43-48.

Mimura, T., Hagiwara, Y., Taniguchi, T. and Inamura, T. (2017). Bayesian body schema estimation using tactile information obtained through coordinated random movements. Advanced Robotics, 31(3):118-134.

Petit, M. and Demiris, Y. (2016). Hierarchical action learning by instruction through interactive grounding of body parts and proto-actions. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 3375-3382. IEEE.

Schillaci, G., Hafner, V. V. and Lara, B. (2016). Exploration behaviors, body representations, and simulation processes for the development of cognition in artificial agents. Frontiers in Robotics and AI, 3:39.

Schwoebel, J. and Coslett, H. B. (2005). Evidence for multiple, distinct representations of the human body. Journal of Cognitive Neuroscience, 17(4):543-553.

Smith, K., Smith, A. D., Blythe, R. A. and Vogt, P. (2006). Cross-situational learning: a mathematical approach. Lecture Notes in Computer Science, 4211:31-44.

Štěpánová, K. and Vavrečka, M. (2016). Estimating number of components in Gaussian mixture model using combination of greedy and merging algorithm. Pattern Analysis and Applications, pp. 1-12.

Vignemont, F. de, Majid, A., Jola, C. and Haggard, P. (2009). Segmenting the body into parts: evidence from biases in tactile perception. The Quarterly Journal of Experimental Psychology, 62(3):500-512.

Xu, F. and Tenenbaum, J. B. (2007). Word learning as Bayesian inference. Psychological Review, 114(2):245.

Yu, C. and Smith, L. B. (2012). Modeling cross-situational word-referent learning: Prior questions. Psychological Review, 119(1):21.