Toward Abstraction from Multi-modal Data: Empirical Studies on Multiple Time-scale Recurrent Models
Junpei Zhong∗†‡, Angelo Cangelosi† and Tetsuya Ogata∗‡
∗ National Institute of Advanced Industrial Science and Technology (AIST), Aomi 2-3-26, Tokyo, Japan. Email: [email protected]
† Centre for Robotics and Neural Systems, Plymouth University, Plymouth, UK
‡ Lab for Intelligent Dynamics and Representation, Waseda University, Tokyo, Japan
Abstract—Abstraction tasks are challenging for multi-modal sequences, as they require a deeper semantic understanding and novel text generation for the data. Although recurrent neural networks (RNNs) can be used to model the context of time sequences, in most cases the long-term dependencies of multi-modal data make the gradients of back-propagation through time vanish in the time domain. Recently, inspired by the Multiple Time-scale Recurrent Neural Network (MTRNN) [1], an extension of the Gated Recurrent Unit (GRU) called the Multiple Time-scale Gated Recurrent Unit (MTGRU) has been proposed [2] to learn long-term dependencies in natural language processing. In particular, it is also able to accomplish the abstraction task for paragraphs, provided that the time constants are well defined. In this paper, we compare the MTRNN and the MTGRU in terms of their learning performance as well as their abstraction representations at the higher level (with a slower neural activation). This was done by conducting two studies, one based on a smaller data-set (two-dimensional time sequences from non-linear functions) and one based on a relatively large data-set (higher-dimensional multi-modal time sequences from iCub manipulation tasks). We conclude that gated recurrent mechanisms may be necessary for learning long-term dependencies in large-dimension multi-modal data-sets (e.g. learning of robot manipulation), even when natural language commands are not involved. For smaller learning tasks with simple time sequences, however, generic recurrent models such as the MTRNN were sufficient to accomplish the abstraction task.

I. INTRODUCTION

The long-term dependencies of natural language sentences are difficult to learn [3] with vanilla recurrent networks because, in most cases, the gradients tend to vanish in time while back-propagation through time is being processed [4, 5]. This prevents most gradient-based learning methods for recurrent neural networks from forming long-term effects. The earliest attempt to solve this problem was the long short-term memory (LSTM) [6], which consists of various gating functions controlled by simple element-wise operations. Since it was designed, it has achieved satisfying results in competitions [7] as well as in tasks such as dialogue systems [8], sentiment analysis [9] and machine translation [10].

The Gated Recurrent Unit (GRU) model [11], a more efficient version of the LSTM [12], has recently been used widely for language processing, where it is also able to achieve state-of-the-art results with lower computational requirements than the LSTM, since a GRU has fewer control gates than an LSTM unit [12]. Despite the differences in their internal operations, both of them can efficiently alleviate the gradient vanishing problem thanks to the following common features:
• They can store previous activations in internal memories, which can later be refreshed or retrieved, depending on the context;
• The operations on these internal activations are controlled by different gates within the recurrent units;
• The control policies of the gates are learnt from the context of the training sequences; they form the composition operations which control the information flow into and out of the internal memory.

With all these features, the recurrent units have the ability to modify their internal weights (i.e. internal structures) based on the long-term dependencies existing in the temporal sequences to the cell states, given that the gated structures are well trained.
Furthermore, when the input and/or output are of variable length, gated-like units stacked in a hierarchical manner [13] are also able to extract a neural representation of fixed length based on the time dependencies in the temporal domain, in which the unpredictable inputs of the lower level of the RNN become inputs to the connected higher-level units, where a slower activation is updated [14]. This is also one of the theoretical foundations of state-of-the-art deep learning methods. In the context of language processing, the higher-level units of deep learning architectures can represent the extracted meaning of a phrase/sentence, while the inputs are (almost) raw data with one-hot/embedding word representations of this phrase/sentence. Furthermore, connecting and training two deep (recurrent or convolutional) networks with a shared higher-level representation, namely the encoder-decoder architecture, can abstract meanings from sentences, even if such “sentences” are in different languages or modalities. As a result, applications such as image captioning [15] (LSTM + CNN) and machine translation [10] (two LSTMs) have been developed based on this encoder-decoder architecture.

While previous architectures used the encoder-decoder architecture to connect visual images and language sequences, in this paper we propose that it is possible to perform abstraction tasks on multi-modal information by applying hierarchical RNN architectures, especially those with gated-like units, to sensorimotor information sequences obtained from robotic platforms and to language sequences. This is an extension of our previous experiments [16] based on the Multiple Time-scale Recurrent Neural Network (MTRNN) [1]. Inspired by the time-constant concept of the MTRNN, the Multiple Time-scale Gated Recurrent Unit (MTGRU) was recently proposed [2] to apply this idea to gated-like recurrent units for the text extraction task. Moreover, its dynamic representation on higher levels along time makes it an ideal architecture to connect natural language commands, the dynamic multi-modal environment and the motor actions of robotic systems. Therefore, we conducted experiments on robot manipulation based on the MTGRU network, and compared the performance of the MTRNN and the MTGRU. The organization of this paper is as follows: a brief introduction of the MTRNN and MTGRU models is presented in the next section. The empirical studies on algorithmic data and on multi-modal data from iCub manipulation are shown in the third section. In the last section, discussion and summaries are given.

II. MODELS
A recurrent neural network (RNN) is a feed-forward neural network with directed connecting weights. As the weights form a directed connection between neural units in the time domain, a neural unit that is connected by the recurrent weights depends on the neural activity at the previous time-step(s). With sufficient learning, it is able to model variable-length sequential input data.

More formally, given a sequence x = (x_1, x_2, · · · , x_t), the RNN updates its recurrent hidden state h_t by Eq. 1:

h_t = 0,                 t = 0    (1a)
h_t = Φ(h_{t-1}, x_t),   t > 0    (1b)

where Φ is a non-linear function. Ideally, given the hidden states of the network, the output y = (y_1, y_2, · · · , y_t) is computed as

p(y_1, y_2, · · · , y_t) = p(y_1) · p(y_2 | x_1) · p(y_3 | x_1, x_2) · · · p(y_t | x_1, x_2, · · · , x_{t-1})    (2)

In the case of the RNN, the last term p(y_t | x_1, x_2, · · · , x_{t-1}) can be represented by the activation of the hidden units at time t:

p(y_t | x_1, x_2, · · · , x_{t-1}) = g(h_t)    (3)

where h_t is from Eq. 1. The term h_t denotes the units we investigated in the empirical studies below, in which we could observe the abstract information from previous time-steps.
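To make the recursion of Eqs. 1–3 concrete, the following is a minimal NumPy sketch of the hidden-state update; the tanh instantiation of Φ and the weight names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Unroll Eq. 1 with Phi(h, x) = tanh(W_xh x + W_hh h + b_h)."""
    h = np.zeros(W_hh.shape[0])          # h_0 = 0, as in Eq. 1a
    hidden_states = []
    for x_t in x_seq:                    # Eq. 1b, one step per input frame
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)       # h_t for every t; y_t = g(h_t), Eq. 3
```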
A. Multiple Time-scale Recurrent Neural Network

In the MTRNN network [1], the learning of each neuron follows the updating rule of classical firing-rate models, in which the activity of a neuron is determined by the average firing rate of all the connected neurons. Additionally, the neuronal activity also decays over time, following the updating rule of a leaky integrator model.

Assuming the i-th MTRNN neuron has N connections, the current membrane potential of this neuron is defined by both the previous activation and the current synaptic inputs:

u_{i,t+1} = (1 − 1/τ_i) u_{i,t} + (1/τ_i) [ Σ_{j∈N} w_{i,j} x_{j,t} ],   if t > 0    (4)

where w_{i,j} represents the synaptic weight from the j-th neuron to the i-th neuron, x_{j,t} is the activity of the j-th neuron at the t-th time-step, and τ_i is the time-scale parameter which determines the decay rate of this neuron: the activities of neurons with a larger τ change more slowly over time than those of neurons with a smaller τ. As we can see, the MTRNN is essentially a continuous recurrent model. Therefore, it also suffers from the vanishing gradient problem.
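A minimal sketch of the leaky-integrator update of Eq. 4 for a vector of neurons; the tanh firing function is our assumption, since Eq. 4 only specifies the membrane update:

```python
import numpy as np

def mtrnn_step(u, x, W, tau):
    """One leaky-integrator update (Eq. 4).

    u   : membrane potentials u_{i,t}             (n,)
    x   : activities x_{j,t} of connected units   (m,)
    W   : synaptic weights w_{i,j}                (n, m)
    tau : per-neuron time-scale tau_i             (n,)  -- larger tau = slower decay
    """
    u_next = (1.0 - 1.0 / tau) * u + (1.0 / tau) * (W @ x)
    return u_next, np.tanh(u_next)   # new potential and firing activity
```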
B. Multiple Time-scale Gated Recurrent Units

Although both the GRU and the LSTM have gating mechanisms for the recurrent units, compared with the three gates of the LSTM, a GRU has only two gates: a reset gate r and an update gate z. As the names imply, the reset gate determines how to combine the current input with the previous status of the internal memory, and the update gate defines how much of the previous memory is to be preserved. The basic idea of using such a gating mechanism to learn long-term dependencies is similar to that of the LSTM, but it has been reported that a smaller number of gates leads to more efficient training [12].

When the concept of multiple time-scales (MT) is applied to the GRU, it has a similar meaning as in the MTRNN: it summarises the dynamics of the temporal sequences at different time scales. Compared with the GRU, the output of the multiple time-scale gated recurrent unit (MTGRU) contains a so-called “time-scale” constant, which controls how the output from previous time-steps influences the current output. Equivalently, this constant is multiplied into the output and modulates the mixture of the current and previous states.

Fig. 1: The MTGRU Unit

Fig. 1 shows the internal structure of the MTGRU, which demonstrates how the candidate activation h̃ is multiplied by the constant 1/τ towards the current output. Meanwhile, the reset gate r_t, the update gate z_t and the candidate activation u_t are computed similarly to those of the original GRU in [11]:

r_t = σ(W_xr x_t + W_hr h_{t−1})    (5)
z_t = σ(W_xz x_t + W_hz h_{t−1})    (6)
u_t = tanh(W_xu x_t + W_hu (r_t ⊙ h_{t−1}))    (7)
h_t = ((1 − z_t) ⊙ h_{t−1} + z_t ⊙ u_t) · (1/τ) + (1 − 1/τ) h_{t−1}    (8)

Similar to the MTRNN, the pre-defined time-scale τ is introduced into the activation term h_t in Eq. 8 to control the level of abstraction. The time constant controls the ratio in which the current and past outputs of the GRU cell are mixed: a larger τ indicates that past activations have a larger influence on the current activation, representing the long-term dynamic features of the temporal sequences.

In the original MTGRU paper [2], the learning formulas of the MTGRU and its performance in abstraction were presented. In this paper, we concentrate on the abstraction of multi-modal sequences obtained from a humanoid robot, and especially on the differences from the MTRNN. Starting from simple sequences, we conducted two empirical studies based on MTGRU units.
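Eqs. 5–8 map directly onto the following step function, a NumPy sketch with biases omitted, as in the equations above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mtgru_step(x_t, h_prev, W, tau):
    """One MTGRU update (Eqs. 5-8); W holds the six weight matrices."""
    r = sigmoid(W['xr'] @ x_t + W['hr'] @ h_prev)          # reset gate,  Eq. 5
    z = sigmoid(W['xz'] @ x_t + W['hz'] @ h_prev)          # update gate, Eq. 6
    u = np.tanh(W['xu'] @ x_t + W['hu'] @ (r * h_prev))    # candidate,   Eq. 7
    gru_out = (1.0 - z) * h_prev + z * u                   # plain GRU output
    return gru_out / tau + (1.0 - 1.0 / tau) * h_prev      # Eq. 8: time-scale mix
```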
III. EMPIRICAL STUDIES

In this section, we conducted two case studies: a simple time-sequence learning task and a more complicated multi-modal sequence learning task. In order to have a fair comparison, the same architecture with the same parameters was used for both the MTRNN and the MTGRU. As shown in Fig. 2, the architecture for these empirical studies contains three layers: an input-output layer (IO) and two context layers called Context fast (C_f) and Context slow (C_s). The values of the time constants were obtained from the experiments in [16]. The input-output neurons have full connections to the fast context layer, and the slow context layer only connects to the fast context layer, representing a slower feature extracted from the fast context dynamics.

Fig. 2: The same network architecture was chosen for both the MTGRU and the MTRNN

In the following text, we denote the indices of these neurons as:

I_all = I_IO ∪ I_Cf ∪ I_Cs    (9)

where I_IO contains the indices of the neurons in the input-output layer, I_Cf those of the neurons in the fast context layer, and I_Cs those of the neurons in the slow context layer. We adopted a tanh function on the IO layer, and the corresponding RNN functions on the context layers:

x_cf = tanh_k(i_t),    k ∈ I_IO    (10)
x_cs = y_cf = RNN_k(x_cf),    k ∈ I_Cf    (11)
y_cs = RNN_k(x_cs),    k ∈ I_Cs    (12)
i_{t+1} = o_t = RNN_k(x_cf + y_cs),    k ∈ I_Cf    (13)

where the RNN functions represent either the MTRNN or the MTGRU functions in the k-th neuron. Note that in the MTRNN, the neurons of one layer have full connectivity to all neurons within the same and adjacent layers. In the MTGRU, as introduced before, the internal activations also have full connectivity with the inputs and outputs. Therefore, the only difference between the two architectures lies in the neural activations within each unit.
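The layer equations can be read as the following per-time-step loop. This is one literal reading of Eqs. 10–13; in particular, whether the fast context layer keeps a single state across its two applications per step is our assumption:

```python
import numpy as np

def forward_step(i_t, s_cf, s_cs, rnn_cf, rnn_cs):
    """One time-step through the IO, C_f and C_s layers (Eqs. 10-13).

    rnn_cf / rnn_cs stand for the per-layer step functions (MTRNN or MTGRU),
    each returning (output, new_state).
    """
    x_cf = np.tanh(i_t)                      # IO layer,                  Eq. 10
    y_cf, s_cf = rnn_cf(x_cf, s_cf)          # fast context; x_cs = y_cf, Eq. 11
    y_cs, s_cs = rnn_cs(y_cf, s_cs)          # slow context,              Eq. 12
    o_t, s_cf = rnn_cf(x_cf + y_cs, s_cf)    # output; i_{t+1} = o_t,     Eq. 13
    return o_t, s_cf, s_cs
```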
While training multiple sequences, both the MTRNN and the MTGRU should balance the training epochs over the sequences, and over-fitting should be avoided. Therefore, one epoch was defined to include a number of iterations of stochastic gradient descent (SGD) [17, 18] for each of the sequences, as shown in Algorithm 1:

Algorithm 1 Multiple Time-sequences Training

procedure ONEEPOCH(data)    ⊲ data contains multiple time sequences
    for seq ∈ data do
        while error > threshold and iteration < max iteration do
            ⊲ repeat iterations for one sequence until the threshold is reached
            Run SGD(seq)
        end while
    end for    ⊲ choose the next sequence
end procedure
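One epoch of Algorithm 1 can be sketched in Python as follows; run_sgd stands in for the Theano training function and is an assumed interface returning the current sequence error:

```python
def one_epoch(data, run_sgd, threshold, max_iteration):
    """Algorithm 1: iterate SGD on each sequence until its error is low enough."""
    for seq in data:                      # data holds multiple time sequences
        iteration = 0
        error = float('inf')
        while error > threshold and iteration < max_iteration:
            error = run_sgd(seq)          # one SGD pass; returns current error
            iteration += 1
        # move on to the next sequence
```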
A. Case 1: Simple Non-linear Sequences Abstraction

In this case study, two two-dimensional time sequences were generated to examine the learning performance of the MTRNN and the MTGRU. The two dimensions X = [x_1, x_2] of the first sequence were defined as:

x_1 = sin t    (14a)
x_2 = sin 2t    (14b)

And the second sequence was defined as:

x_1 = sin t · cos t    (15a)
x_2 = sin(t/2) · sin(t/3) − 0.5    (15b)

In both cases, the same fixed number of uniformly spaced time-steps (multiples of π) was applied.

The parameters of the MTRNN and MTGRU experiments are shown in Tab. I. In the case of the MTRNN, the parameters n_Cf and n_Cs denote the numbers of neurons in the C_f and C_s layers, while in the case of the MTGRU, they denote the numbers of dimensions of the MTGRU units in the C_f and C_s layers. Note that, compared with our previous experiment [16], we did not employ the SOM pre-processing, because a fair comparison between the MTRNN and the MTGRU was needed.

TABLE I: MTRNN & MTGRU Parameters (Case 1)

Parameter        Description
η                Learning rate
n_Cf             Size of C_f
n_Cs             Size of C_s
τ_f              Time constant of C_f
τ_s              Time constant of C_s
max iteration    Max. iterations for training one sequence
threshold        Threshold for early stopping
α                Mixed ratio of prediction/real input

With the network implemented in Theano [19], the training was done on an AWS G2 (2x large) server equipped with a Grid K520 GPU. The training curves of the MTRNN and the MTGRU are depicted in Fig. 3. We can see that, with the same learning rate, the error of the MTRNN converged faster than that of the MTGRU. This is compatible with our intuition that the MTGRU converges more slowly because it has more weights than the MTRNN, although the two networks share the same parameters.
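The mixed ratio α in Tab. I controls how the next network input is assembled from the previous prediction and the recorded data. A minimal sketch, assuming the usual MTRNN open/closed-loop convention (the source only names the ratio, so the mixing direction is our assumption):

```python
def next_input(alpha, prediction, ground_truth):
    """Blend the network's own prediction with the real next frame.

    alpha = 0.0 feeds only real data (open loop, teacher forcing);
    alpha = 1.0 feeds only the network's own output (closed-loop generation).
    """
    return alpha * prediction + (1.0 - alpha) * ground_truth
```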
Fig. 3: Training curves of the MTRNN (a) and the MTGRU (b) (Case 1)

The quantitative results on the performance of the MTRNN and the MTGRU are shown in Tab. II, and the comparison between the MTRNN and MTGRU outputs and the real values of Seq. 1 is shown in Fig. 6.

TABLE II: MTRNN & MTGRU Performances (Case 1)

                                          MTRNN     MTGRU
Prediction error (RMS after 30 epochs)    1.6353    2.2255
Time per GD (ms)                          66        226

To further examine the internal dynamics of both networks, we selected the neural activities of the C_f and C_s layers while Seq. 1 (X) was used as the input. From the internal dynamics of the context layers, we could observe significant differences between the dynamics of the MTRNN (Fig. 4) and the MTGRU (Fig. 5):
• oscillations in the activation could be found in the MTGRU context units;
• the range of the neural dynamics in the MTRNN was significantly larger than in the MTGRU.

B. Case 2: Multi-modal Data Abstraction
To examine the network performance in more complicated tasks, such as abstraction from robot multi-modal data, we recorded multi-modal data from object manipulation experiments based on an iCub robot [20]. The iCub is a child-sized humanoid robot which was built as a testing platform for theories and models of cognitive science and neuroscience. Mimicking a two-year-old infant, this unique robotic platform has 53 degrees of freedom in total. Using the iCub, we set up a learning scenario in which a human instructor taught the robotic learner a set of language commands whilst providing kinaesthetic demonstrations of the named actions together with the corresponding visual inputs from the cameras. The target of this case study was to evaluate the performance of the MTRNN and the MTGRU in this complicated task with a large data-set, working towards natural language understanding for humanoid robots.

TABLE III: Dictionaries of verbs and nouns for the data-sets. The instructor showed the robot different combinations of the actions and objects. The actions and the objects are represented by two discretised values for the semantic command inputs, which range from 0.0 to 0.8. For instance, the command “lift [the] ball” is translated into the values [0.8, 0.2].

Actions       Slide Left    Slide Right    Touch    Reach    Push
Verb value    0.0           0.1            0.2      0.3      0.4
Actions       Pull          Point          Grasp    Lift
Verb value    0.5           0.6            0.7      0.8
Objects       Tractor       Hammer         Ball     Bus      Modi
Noun value    0.0           0.1            0.2      0.3      0.4
Objects       Car           Cup            Cubes    Spiky
Noun value    0.5           0.6            0.7      0.8

Fig. 4: Neural activity of the MTRNN (Case 1): (a) C_f activity; (b) C_s activity

Fig. 5: Neural activity of the MTGRU (Case 1): (a) C_f activity; (b) C_s activity

Fig. 6: Predicted and real values of the MTRNN and the MTGRU (Seq. 1)
1) Experimental Setup:
Fig. 7 shows the setup used in our manipulation experiments to collect the multi-modal data-set. It was obtained using the following steps:
1) Objects with significantly different colours and shapes were placed at different locations along a line on the table in front of the iCub.
2) A vocal command was spoken by an instructor according to the visual scene perceived by the iCub. A complete vocal command is a sentence composed of a verb and a noun. The verb and noun were recognised and then translated into two dedicated discrete values based on the verb and noun dictionaries, as in our previous experiment [16] (Tab. III).
3) The built-in vision tracker of the iCub searched for a ball-shaped object based on the dictionary-generated values.
4) Once the object was located, the iCub rotated its head and triggered the object tracking, which changed the encoder values of the neck and eyes.
5) The joint positions of the head and neck were recorded. The sequence recorder module of the iCub was used to record the sensorimotor trajectories while the instructor guided the robot, holding its arms to perform a certain action on each object.

Fig. 7: Data collection from the iCub robot

The whole experimental data-set for the iCub manipulation included combinations of the actions and objects. Each of the multi-dimensional temporal sequences includes the vocal command (i.e. a complete sentence consisting of a verb and a noun), the visual information (represented as the joint angles of the neck and eyes) and the changes of the torso angles (resulting in motor actions). We used such a large data-set to test how the MTRNN and the MTGRU perform in a complicated task. We also aimed at applying the power of recurrent networks [21] in natural language processing to robotic platforms, especially for employing humanoid robots in cognitive tasks such as multi-modal interaction and dialogue robots.
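Following Tab. III, a recognised command reduces to two scalar inputs; a small sketch of this lookup, with the dictionary values copied from the table (the function name is ours):

```python
VERB_VALUE = {'slide left': 0.0, 'slide right': 0.1, 'touch': 0.2,
              'reach': 0.3, 'push': 0.4, 'pull': 0.5, 'point': 0.6,
              'grasp': 0.7, 'lift': 0.8}
NOUN_VALUE = {'tractor': 0.0, 'hammer': 0.1, 'ball': 0.2, 'bus': 0.3,
              'modi': 0.4, 'car': 0.5, 'cup': 0.6, 'cubes': 0.7, 'spiky': 0.8}

def encode_command(verb, noun):
    """Map a recognised (verb, noun) pair to the two semantic input values."""
    return [VERB_VALUE[verb.lower()], NOUN_VALUE[noun.lower()]]

# encode_command('lift', 'ball') -> [0.8, 0.2], as in the Tab. III example
```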
2) Experiment Results:
In this case study, the parameters of both models were also kept the same, as shown in Tab. IV.

TABLE IV: MTRNN & MTGRU Parameters (Case 2)
Parameter        Description
η                Learning rate
n_Cf             Size of C_f
n_Cs             Size of C_s
τ_f              Time constant of C_f
τ_s              Time constant of C_s
max iteration    Max. iterations for training one sequence
threshold        Threshold for early stopping
α                Mixed ratio of prediction/real input

Fig. 8 shows the training curves of the MTRNN and the MTGRU. Quite differently from the previous study, although the MTGRU converged more slowly than the MTRNN, it converged in a steadier way. The training of the MTRNN, on the other hand, often converged to local minima and took more epochs to reach the same level of error as the MTGRU. An output example for one of the sequences in the data-set is shown in Fig. 9. The quantitative results of the training can be found in Tab. V. As we expected, the training of the MTGRU also took almost twice the computational time per iteration compared with the MTRNN. Interestingly, the computation time for one iteration in this case was less than in the previous case, which was probably due to the advantage of the GPU for parallel computing in neural networks.

Fig. 8: Training curves of the MTRNN (a) and the MTGRU (b) (Case 2)

In Figs. 10 and 11, the internal dynamics of the context neurons are also depicted. Since the number of dimensions was too large to examine in detail, we applied PCA to reduce the dimensionality before plotting the neural dynamics, from which we could observe that, also in the case of larger dimensions, oscillations in the dynamics can be found in the MTGRU.
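The dimensionality reduction applied before plotting can be reproduced with scikit-learn; a sketch assuming the context activations are stored as a (time-steps × units) array, with two components chosen for plotting since the target dimensionality was not preserved in the source:

```python
import numpy as np
from sklearn.decomposition import PCA

def project_dynamics(activations, n_components=2):
    """Project context-layer activations (T x N array) onto principal axes."""
    pca = PCA(n_components=n_components)
    return pca.fit_transform(activations)   # (T x n_components) trajectory
```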
TABLE V: MTRNN & MTGRU Performances (Case 2)

                                          MTRNN    MTGRU
Prediction error (RMS after 30 epochs)    –        –
Time per GD (ms)                          79       162

IV. DISCUSSIONS AND CONCLUSIONS

Inspired by the multiple time-scales which determine the updating rate of the membrane activities in continuous neurons, as in the MTRNN, the MTGRU was recently proposed as an extended version of the GRU model. In this paper, empirical studies comparing the MTRNN and the MTGRU were conducted in terms of their training performance and their feasibility for the abstraction of time sequences. Specifically, two cases were studied: 1) two-dimensional non-linear time sequences (Sec. III-A); 2) higher-dimensional multi-modal time sequences (Sec. III-B).

Fig. 9: Predicted and real values of the MTRNN (a) and the MTGRU (b) (one sequence of Case 2)

Fig. 10: Neural activity of the MTRNN (Case 2): (a) C_f activity; (b) C_s activity

Fig. 11: Neural activity of the MTGRU (Case 2): (a) C_f activity; (b) C_s activity

As expected, with the two data-sets we provided, the complexity of training the GRU (i.e. the gates inside the units) cost more computational effort than the MTRNN. We can conclude that for such relatively trivial tasks (without significant long-term dependencies in the sequences), the advantages of the GRU (and possibly of the LSTM as well) are hardly exhibited. However, we also noticed that the training of the MTGRU converged faster than that of the MTRNN for large-dimension data (Case 2). This is probably because the robot manipulation data we used actually exhibit long-term dependencies to some extent; for instance, the movement of the hands for grasping depends on the verb given in the command sentence. If we use data with more sophisticated time dependencies in the multi-modal experiments, the gated mechanisms may result in steadier training performance than ordinary RNNs. Furthermore, according to the previous literature on natural language modelling, gated-mechanism RNNs would be necessary to model the long-term dependencies in a multi-modal environment when language commands are involved.

In future work, we will further investigate the following two topics:
• We will investigate the internal dynamics of the MTGRU; for example, how the neural oscillations on the context layers arise is still unknown;
• We plan to use natural language as robot commands, using word2vec [22] as a pre-processed input, as done in [23], instead of the look-up table (Tab. III). The final target of this work is to achieve multi-modal understanding of both sensorimotor and language temporal sequences on robotic systems.

APPENDIX

The code of the MTGRU can be found on GitHub: https://github.com/jonizhong/mtgru.git

ACKNOWLEDGMENT

The research was supported by the Waseda SGU Program, the EU project POETICON++ under grant agreement 288382 and the UK EPSRC project BABEL. JZ would like to thank FH for the working space provided while the paper was being drafted.

REFERENCES
[1] Y. Yamashita and J. Tani. “Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment”. In: PLoS Comput. Biol. 4.11 (2008), e1000220.
[2] M. Kim, M. D. Singh, and M. Lee. “Towards Abstraction from Extraction: Multiple Timescale Gated Recurrent Unit for Summarization”. In: arXiv preprint arXiv:1607.00718 (2016).
[3] G. Mesnil et al. “Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding”. In: INTERSPEECH. 2013, pp. 3771–3775.
[4] Y. Bengio, P. Simard, and P. Frasconi. “Learning long-term dependencies with gradient descent is difficult”. In: IEEE Trans. Neural Networks 5.2 (1994), pp. 157–166.
[5] S. Hochreiter. “Untersuchungen zu dynamischen neuronalen Netzen”. Diploma, Technische Universität München (1991), p. 91.
[6] S. Hochreiter and J. Schmidhuber. “Long short-term memory”. In: Neural Comput. 9.8 (1997), pp. 1735–1780.
[7] A. Graves et al. “A novel connectionist system for unconstrained handwriting recognition”. In: IEEE Trans. Pattern Anal. Mach. Intell. 31.5 (2009), pp. 855–868.
[8] O. Vinyals and Q. V. Le. “A neural conversational model”. In: arXiv preprint arXiv:1506.05869 (2015).
[9] A. M. Dai and Q. V. Le. “Semi-supervised sequence learning”. In: Advances in Neural Information Processing Systems. 2015, pp. 3079–3087.
[10] I. Sutskever, O. Vinyals, and Q. V. Le. “Sequence to sequence learning with neural networks”. In: Advances in Neural Information Processing Systems. 2014, pp. 3104–3112.
[11] K. Cho et al. “On the properties of neural machine translation: Encoder-decoder approaches”. In: arXiv preprint arXiv:1409.1259 (2014).
[12] J. Chung et al. “Empirical evaluation of gated recurrent neural networks on sequence modeling”. In: arXiv preprint arXiv:1412.3555 (2014).
[13] J. Zhong. “Artificial Neural Models for Feedback Pathways for Sensorimotor Integration”. (2015).
[14] J. Schmidhuber. “Learning complex, extended sequences using the principle of history compression”. In: Neural Comput. 4.2 (1992), pp. 234–242.
[15] O. Vinyals et al. “Show and tell: A neural image caption generator”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3156–3164.
[16] J. Zhong et al. “Sensorimotor Input as a Language Generalisation Tool: A Neurorobotics Model for Generation and Generalisation of Noun-Verb Combinations with Sensorimotor Inputs”. In: arXiv preprint arXiv:1605.03261 (2016).
[17] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. “Advances in optimizing recurrent networks”. In: ICASSP. IEEE. 2013, pp. 8624–8628.
[18] R. Pascanu, T. Mikolov, and Y. Bengio. “On the difficulty of training recurrent neural networks”. In: ICML (3) 28 (2013), pp. 1310–1318.
[19] F. Bastien et al. “Theano: new features and speed improvements”. In: arXiv preprint arXiv:1211.5590 (2012).
[20] G. Metta et al. “The iCub humanoid robot: an open platform for research in embodied cognition”. In: Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems. ACM. 2008, pp. 50–56.
[21] A. Karpathy. “The unreasonable effectiveness of recurrent neural networks”. In: Andrej Karpathy blog (2015).
[22] T. Mikolov et al. “Distributed representations of words and phrases and their compositionality”. In: Advances in Neural Information Processing Systems. 2013, pp. 3111–3119.
[23] J. Zhong, A. Cangelosi, and T. Ogata. “Sentence Embeddings with Sensorimotor Embodiment”. In: