Human-like general language processing
Feng Qi 1,2
1. Intelligent Innovation Lab, Alibaba Group, Beijing, 100101 [email protected]
Guanjun Jiang 1
2. Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU [email protected]
Abstract
Using language makes human beings surpass animals in wisdom. To let machines understand, learn, and use language flexibly, we propose a human-like general language processing (HGLP) architecture, which contains sensorimotor, association, and cognitive systems. The HGLP network learns from easy to hard like a child, understands word meaning by coactivating multimodal neurons, comprehends and generates sentences by constructing a mental world model in real time, and can express the whole thinking process verbally. HGLP rapidly learned more than ten different tasks, including object recognition, sentence comprehension, imagination, attention control, query, inference, motion judgement, mixed arithmetic operation, digit tracing and writing, and a human-like iterative thinking process guided by language. Language in the HGLP framework is neither matching nor correlation statistics, but a script that can describe and control the imagination.
Future strong machine intelligence requires intelligent language processing techniques. We believe that humans have such ability but modern machines do not, which is reflected in, but not limited to, the following aspects. First, learning is a step-by-step process. The human brain can gradually assimilate and accumulate various concepts, knowledge and skills. But current natural language processing (NLP) machines do not seem to care about the order of learning, only about the quantity of corpus material that contributes to robust relational statistics among words [1]. Second, word meaning is perceived by virtue of multimodal neuronal activation. For example, when we hear the word ‘acid’, we can feel sour as saliva is secreted due to the activation of our gustatory neurons. However, if you ask an NLP machine what acid is, it ideally queries out the dictionary explanation of ‘[n] a chemical substance that neutralizes alkalis or [adj] having a pH value of less than 7’, but we know that the NLP system itself does not know what the words ‘neutralize’, ‘alkalis’ or ‘pH value’ mean. Third, humans comprehend and generate sentences by constructing a virtual world model in real time. For example, when I say ‘I have a gift in the box’, you may naturally think of a piece of chocolate or a ring being placed in the box. Based on the imagined scenario, you may ask ‘is it chocolate?’. On the contrary, NLP focuses on the embedding matching between question and answer, and the correlation score for output [2]. So, NLP knows ‘Donald Trump’ and ‘US President’ are strongly linked, but does not imagine it as ‘an old white man with shining blonde hair sits in the Oval Office’. Fourth, human thinking is a language-guided process that is consciously describable and self-controllable. For example, a child can distinguish monkeys from humans and explain the judgment by saying ‘monkeys have tails, but humans don’t’.
However, current NLP systems are only able to report classification results mechanically; they are never aware of their own thinking process, let alone able to control it [1-4]. Finally, humans can understand and apply a language command in one trial. For example, we can play Gomoku right after hearing the rule sentence ‘win if five in a row’. However, modern machines cannot understand the rule verbally and have to be trained billions of times with reinforcement learning merely for one such skill [4].
Preprint. Under review.

Figure 1: The architecture of the HGLP. (A) It consists of three hierarchies of sensorimotor, association and executive systems. The low-level sensorimotor cortices are made of visual and language autoencoders which are trained with unsupervised learning in the early stage. The post-trained visual and language autoencoders provide a visual vector (vv) of images viewed and a language vector (lv) of sound heard, respectively. In the association cortices, there are the middle temporal gyri (MTG) and the intraparietal lobe (IPL, BA39/40), functioning as lv-vv translator and lv-lv associator, respectively. The Wernicke area comprehends a sentence by decomposing it into words or phrases, while the Broca area constructs a sentence syntactically from various inputs of words and phrases. The high-level dorsolateral prefrontal cortex (dlPFC) acquires tasks and environmental states by receiving the lv and vv vectors. It keeps the information in working memory, generates task responses according to rules, and sends top-down control signals to lower-level modules to properly interact with the environment. (B) Anatomical connections of HGLP modules in the human brain. Red represents language paths, blue represents visual paths, purple represents both modalities; solid represents feedforward paths, and dashed represents feedback paths.

Solving these problems will bring us closer to strong machine intelligence. Based on research in human brain neuroscience, we propose a human-like general language processing (HGLP) architecture, which aims to build a new language processing architecture by mimicking the brain, with correct functional implementation of cortical modules and information interaction among these modules. Machine language processing could follow the human brain blueprint [5, 6] in order to achieve human-like general language processing skills.
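As a concrete (and simplified) reading of the Figure 1 wiring, the module graph can be written down as a plain adjacency list and checked for the two routes the caption describes. The edge set below is our own approximation for illustration, restricted to feedforward paths; it is not an exact specification of HGLP:

```python
# Approximate feedforward wiring of the HGLP modules (our reading of
# Fig. 1; the exact edge set is an illustrative assumption).
# Edges are (source, target, signal), where signal is 'lv' or 'vv'.
EDGES = [
    ("V1-V4", "MTG", "vv"), ("V1-V4", "dlPFC", "vv"),
    ("A1-SPT", "Wernicke", "lv"),
    ("Wernicke", "MTG", "lv"), ("Wernicke", "BA39/40", "lv"),
    ("Wernicke", "dlPFC", "lv"), ("Wernicke", "Broca", "lv"),
    ("MTG", "Broca", "lv"), ("MTG", "V4'-V1'", "vv"),   # imagination path
    ("BA39/40", "dlPFC", "lv"),
    ("dlPFC", "MTG", "lv"), ("dlPFC", "BA39/40", "lv"), ("dlPFC", "Broca", "lv"),
    ("Broca", "PM-M1", "lv"),                           # articulation path
]

def reachable(src, dst):
    """Simple graph search over the module adjacency list."""
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        if node == dst:
            return True
        for a, b, _ in EDGES:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return False

# A heard sentence can reach articulation, and can drive imagination:
assert reachable("A1-SPT", "PM-M1")
assert reachable("A1-SPT", "V4'-V1'")
```

Such a check is only a sanity test of connectivity; the actual modules exchange learned 32-byte vectors, not symbolic tokens.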
To construct such a language processing architecture with sufficient simplicity, it is necessary to guarantee that the connections among cortical modules are correct, the functionalities of each module are reasonable, and the training progress of each module follows a similar order to human development. Figure 1 demonstrates the scheme of the HGLP architecture. In general, we follow Baddeley’s model of working memory [7], which consists of a central executive system controlling two slave systems (sensorimotor cortices): the phonological loop (PL) and the visuospatial sketchpad (VSS). This is in line with people’s perception and cognition. However, between the central executive and the primary sensorimotor cortices, we add association cortices, such as the middle temporal gyri (MTG) as a visual-language translator, and the intraparietal lobe (IPL or BA39/40) as a substrate of abstract knowledge. Next, we will discuss the neural functionalities and AI implementation of each module.

HGLP’s sensorimotor system is implemented in the form of autoencoders that can process visual and language inputs and generate imagination and articulation outputs. The human visual system develops before the language system. Without language labels, the visual cortices could develop their neural network under an unsupervised learning mechanism, namely: imagine what has been viewed, and then adjust the neural network to make the imagination more consistent with the image viewed. For visual processing, we construct a visual autoencoder in Figure 1, with a visual encoder part (V1-V4) to extract a 32-byte visual vector (vv) from each image viewed, and a visual decoder part (V4’-V1’) to reconstruct an image from a given vv. The vv can be either the untangled representation of viewed images for higher-level processing [8] or top-down signals given by higher hierarchical modules to be imagined via the visual decoder. For example, MTG could pass the attention-modulated vv to the visual decoder for specific object imagination (Fig. 3F). For simplicity, we have not implemented the visual two-stream model yet, so the vv vector contains both object features and spatial information.

Figure 2: Object recognition task. (A) The HGLP visualized the image ‘3’ and heard ‘it is ?’, which were encoded into vv and lv by the visual and language encoders. After that, Wernicke decomposed the sentence into word-level lv, and also corrected the wrong pronunciation of ‘iu’ into ‘it’. (B) MTG responded to the language command by identifying the digit and outputting the verbal identity ‘three’. (C-D) BA39/40 and dlPFC did not respond to ‘it is ?’. (E) Broca could combine lv from various modules grammatically for sentence utterance. Broca combined ‘it is’ from Wernicke and ‘square’ or ‘three’ from MTG into a sentence lv for future articulation via PM-M1. In the lesion test, Broca could still generate readable sentences with only 1 out of 32 neurons lesioned, its performance degraded rapidly if more than 2 out of 32 neurons were lesioned, and it could only generate the utterance ‘u7 3’ if all Broca neurons were silenced. The Wernicke, BA39/40 and Broca modules only process language-related lv vectors, while the MTG and dlPFC process both lv and vv vectors. The top of each block shows the input image and language, and the bottom shows the reconstructed imagination and utterance according to the module output.

Babies learn to speak at one year old. After the auditory system processes the sound heard, the articulatory system can repeat it, and then the higher-level cortex can further process these language representations, for example by endowing them with meanings [9, 10]. HGLP also follows a similar architecture and processing work-flow.
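Both sensorimotor systems follow the same autoencoder recipe: encode the input into a compact vector, reconstruct it, and train on reconstruction error alone. A toy linear version is sketched below; it is our illustrative stand-in with shrunken dimensions and random data, whereas the actual HGLP encoders are deep networks producing 32-byte vectors from images and sound:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 64, 8, 128          # toy sizes; HGLP uses images and a 32-d vv
W_enc = rng.normal(0, 0.1, (K, D))
W_dec = rng.normal(0, 0.1, (D, K))
X = rng.normal(size=(N, D))   # stand-in "images"

def mse():
    recon = (X @ W_enc.T) @ W_dec.T
    return float(((recon - X) ** 2).mean())

loss_before = mse()
lr = 0.05
for _ in range(300):
    code = X @ W_enc.T                     # encode: V1-V4 -> vv (here 8-d)
    recon = code @ W_dec.T                 # decode: V4'-V1' -> imagined image
    err = recon - X
    # adjust the network so the "imagination" matches the image viewed
    W_dec -= lr * err.T @ code / N
    W_enc -= lr * (err @ W_dec).T @ X / N
loss_after = mse()
assert loss_after < loss_before            # reconstruction improves
```

The same pattern, with a sequence-to-sequence encoder and decoder, applies to the language side described next.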
For language processing, we construct a language autoencoder, with a language encoder part (primary auditory cortex A1 and Sylvian parietal-temporal area (SPT)) [11] to convert the physical sound of a sentence into a 32-byte language vector (lv), and a language decoder part (the vocal cord, laryngeal and tongue areas of premotor cortex (PM) and M1) to articulate the corresponding utterance from a given lv. Similarly, the lv can be either the language representation of a heard sentence, or top-down signals given by higher hierarchical modules such as Broca for language generation. Here, the A1 module converts physical sound into temporal spectrum signals, while M1 articulates sound from the temporal spectrum signals (Supplementary Methods). The SPT-PM network is implemented by a sequence-to-sequence model [12], where PM aims to accurately reconstruct a temporal spectrum signal from a 32-byte lv encoded by SPT.

The human association cortex takes up a wide-spread cortical area between the sensorimotor cortex and the executive cortex. First, it is a knowledge center. Skills such as object recognition and motion detection are processed in MTG [13, 14], arithmetic computing in IPL [15], language comprehension in the Wernicke area [16] and language generation in the Broca area [16, 17], etc. Second, it is an information hub that receives multimodal representations processed by the sensory cortices, and top-down query or control signals from the executive cortices. After association processing, it provides replies to high-level queries and signals to the visual and language decoders for articulation and imagination. We implement the modules of the association system with LSTMs [18] and use supervised learning to adjust the network parameters to acquire the corresponding functions.

The Wernicke module is responsible for understanding the sound heard, mainly including the task of decomposing the utterance heard into phrases or words, which allows future semantic endowment by coactivating visual, gustatory, somatosensory and other
neurons. To train Wernicke to have the sentence decomposition functionality, the language encoder A1-SPT is needed for training data preparation, generating the lv of both the input sentence and the expected output phrases and words. Wernicke can also filter out non-language sounds such as music and correct the external utterance according to pronunciation and grammar rules. Therefore, a lesion of Wernicke can influence language comprehension [16]. Figure 2A demonstrates one example of Wernicke processing, which not only decomposed the input sentence into word-level lv but also corrected the wrong external pronunciation of ‘iu’ into ‘it’.

Figure 3: Language and visual interaction via MTG. (A) Recognition task: MTG could understand language commands and generate the verbal identity according to the viewed object. (B) Imagination task: imagine a digit according to the language heard. (C) Imagined objects could be moved up, down, left and right according to language commands. (D) Language-guided digit shrinking or enlarging. (E) Language-guided digit rotating. (F) Language-guided attention and object identification: MTG highlighted the object according to the heard prepositional phrase, and then identified it. (G) Motion prediction: MTG responded to the command ‘separated ?’ and predicted whether the point would leave the digit at the next time step by outputting a T or F lv. (H) Information flow of the ‘rotate’ task. MTG converts vv [[6], [6]] into vv [[6], [9]] after hearing lv [[rotate], [.]]. Green indicates the activated path.

The Broca module is responsible for the syntactic synthesis of language (Fig. 2E), the opposite role to the sentence decomposition of the Wernicke module. In general, our brain determines the sentence to be expressed before articulating it word by word. Broca plays a key role in synthesizing the sentence lv syntactically for language generation. Accordingly, we give the Broca module in HGLP such functionalities via supervised training.
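At the symbol level, the complementary roles of the two language modules can be caricatured in a few lines. The edit-distance snap to a toy vocabulary is our stand-in for the learned pronunciation correction of ‘iu’ into ‘it’ (Fig. 2A), and the toy Broca simply concatenates fragments where the trained module also reorders them syntactically:

```python
import difflib

VOCAB = ["it", "is", "three", "square", "rotate", "shrink"]

def wernicke(utterance):
    """Decompose an utterance into word-level tokens, snapping each
    mispronounced token to the most similar vocabulary word (a toy
    stand-in for the learned correction of 'iu' -> 'it')."""
    return [max(VOCAB,
                key=lambda w: difflib.SequenceMatcher(None, t, w).ratio())
            for t in utterance.split()]

def broca(fragments):
    """Synthesize one sentence from fragments supplied by other modules,
    in the order received (the real Broca also reorders syntactically)."""
    return " ".join(w for frag in fragments for w in frag)

assert wernicke("iu is") == ["it", "is"]
assert broca([["it", "is"], ["square"]]) == "it is square"
```

In HGLP both operations are carried out on 32-byte lv vectors by trained LSTMs, not on word strings.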
Figure 2E demonstrates that Broca combines the ‘square’ from MTG and ‘it is’ from Wernicke into the sentence lv of ‘it is square’, which could further be articulated by PM-M1. Therefore, Broca not only synthesizes sentences from various modules but also rearranges them according to syntactic rules. A lesion to Broca does not affect language understanding but causes problems in language production, such as agrammatical and effortful speech, namely Broca aphasia [19]. Our lesion test in Figure 2E reveals a similar symptom. When a small number of neurons were silenced (1 out of 32 neurons set to zero activation), the lesioned lv could still be converted into a readable sentence; when a larger proportion of neurons (2/32) were lesioned, the language generation performance of Broca degraded rapidly; if most of the Broca neurons (32/32) were silenced, the articulating system (PM-M1) could only generate the utterance ‘u7 3’, no matter what Broca intended to say, which is quite similar to the symptoms of Broca’s patient ‘Tan’ [20]. Also, since Broca’s language generation works through content synthesis, there is no exposure bias nor gradient vanishing issue as in word-by-word language generation.

The human MTG is located between the language-related superior temporal cortex and the visual-related inferior temporal cortex. It receives both visually and verbally processed information and functions as an interface or translator between these two modalities. The anterior MTG can explain a verbal name by co-activating the representation of associated visual neurons; on the other hand, after viewing an object, MTG can elicit activation of associated verbal neurons for naming articulation. Semantic dementia often involves early neurodegeneration in this area [21]; patients can then neither tell names after seeing things, nor remember faces after hearing names.
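The lesion test of Figure 2E can be mimicked with random stand-in codes: silencing components of a 32-dimensional sentence lv and reading it out by nearest neighbor (our toy readout, not the trained PM-M1 pathway) shows the qualitative pattern that a small lesion leaves the sentence readable while total silencing collapses every intended sentence into the same degraded output:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy version of the Fig. 2E lesion test: sentences live in a 32-d lv
# space (random codes stand in for trained ones); "lesioning" zeroes
# neurons, and a nearest-neighbor readout stands in for articulation.
sentences = ["it is square", "it is three", "rotate .", "shrink ."]
codes = rng.normal(size=(len(sentences), 32))

def articulate(lv):
    dists = np.linalg.norm(codes - lv, axis=1)
    return sentences[int(np.argmin(dists))]

def lesion(lv, n_silenced):
    out = lv.copy()
    out[:n_silenced] = 0.0        # silence the first n "neurons"
    return out

# One silenced neuron: the intended sentence survives.
assert articulate(lesion(codes[0], 1)) == "it is square"
# All 32 silenced: every intended sentence yields the same utterance.
assert articulate(lesion(codes[0], 32)) == articulate(lesion(codes[1], 32))
```

The intermediate regime (a few lesioned neurons) degrades gradually in the trained model; with random codes here only the two endpoint behaviors are meaningful.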
The posterior MTG is adjacent to the parietal and occipital lobes, dealing with spatial and motion perception, which involves the association between visual scenario changes and verb understanding. We used supervised learning to give MTG such interactive functionalities between the visual and language modalities. In the digit recognition task (Fig. 3A), MTG could understand the language command, process the vv of the viewed object accordingly, and generate the identity lv that could finally be articulated via PM-M1. In the imagination task (Fig. 3B), MTG could imagine a digit according to the language heard. In other words, MTG gives meanings to verbal words by coactivating corresponding visual neurons. By correctly manipulating imagined objects (Fig. 3C-E), we can claim that HGLP understands verbs and prepositions such as move up/down/left/right, enlarge/shrink, rotate, etc. We also propose a language-guided attention mechanism, as shown in Fig. 3F. After hearing prepositional phrases such as ‘on left’ or ‘on bottom right’, MTG shifts its attention to the corresponding object for further processing, such as identity recognition. Such language-guided attention can also be given by the executive cortex via top-down control. Finally, Fig. 3G-H demonstrate that MTG could predict the visual motion of an object, and output whether point and digit will be separated. This ‘subjective’ judgment can be used in later tasks such as ‘write and trace’ in Fig. 4.

In addition, the human brain needs to understand abstract concepts and their relationships, such as the concept of ‘even number’, which cannot be explained by visual representation, but by the verbal representation ‘a number that can be divided by two is an even number’. Abstract information processing is approximately located at the conjunction areas between the parietal and temporal lobes, so we construct a BA39/40 module to implement the corresponding functions. Fig. 4A shows some of these functions, such as 8 is bigger than 4, 8 + 4 = 12 and F = ma, etc.
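As a data-structure caricature, the wired-in relations of BA39/40 behave like an associative store that answers known queries and stays silent otherwise. In HGLP these are learned associations between lv vectors, not a lookup table; the query keys below are our invention for illustration:

```python
# Toy rendering of BA39/40's wired-in abstract relations (illustrative;
# the real module associates learned lv vectors, not dictionary keys).
FACTS = {
    ("bigger", "8", "4"): "T",
    ("plus", "8", "4"): "12",
    ("force",): "F = ma",
}

def ba3940_query(*key):
    """Answer an abstract-knowledge query, or give the silent response
    of a non-involved module (cf. Fig. 2C-D)."""
    return FACTS.get(key, "")     # '' = silence

assert ba3940_query("plus", "8", "4") == "12"
assert ba3940_query("color", "sky") == ""   # unknown -> silent
```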
During learning, these abstract relations are wired into the cerebral cortex as knowledge. When BA39/40 receives queries about such abstract knowledge (Fig. 4B), it can provide the answers accordingly.

The executive functions of the human brain are located in the prefrontal area [22, 23], which is considered to orchestrate thoughts and actions to achieve internal goals. For simplicity, HGLP merely involves a dlPFC module with executive functionalities including task/rule recognition, attention, working memory, query, and inference. The human dlPFC is the neural substrate for task and rule identification and representation [24, 25]. In mixed arithmetic operation, the computing rule is ordered as parentheses first, then multiplication and division, finally addition and subtraction, with the left operation first at the same level. We constructed a dlPFC module trained on these arithmetic operations (four 1-digit numbers with addition, multiplication, and parentheses). Figure 4B shows that the dlPFC could understand the language input of a mathematical formula, correctly decompose the formula into single-step operations to be handled by BA39/40, save and retrieve the temporary results in working memory, and generalize the rule to more numbers (five 1-digit numbers). For working memory, Fig. 4C demonstrates that the dlPFC could distinguish ‘last’ from ‘this’, by successfully presenting the previous digit with correct identity and shape without being affected by the current distractor. Moreover, the dlPFC could convert the verbal question ‘what is this/last ?’ to the executable query command ‘it is ?’ for object identification, which can be handled in MTG. Conditional selection, ‘if then’, is the abstract expression of rule-based reasoning, which allows people to flexibly handle numerous tasks in real time, such as syllogistic reasoning. That is why all programming languages have an ‘if then’ statement.
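The stepwise decomposition of Figure 4B, with dlPFC applying the precedence rules, BA39/40 answering each single-step operation, and working memory holding the partial results, can be sketched as follows (our own routine, not the trained LSTM module; division is omitted as in the trained operation set):

```python
def ba3940(a, op, b):
    """BA39/40 answers one primitive single-step operation (Fig. 4B)."""
    return {"+": a + b, "-": a - b, "*": a * b}[op]

def parse(formula):
    # one character per token: the module is trained on 1-digit numbers
    return [int(c) if c.isdigit() else c for c in formula]

def dlpfc_eval(tokens, trace):
    """dlPFC-style control: resolve parentheses first, then *, then +/-,
    leftmost first; every reduction is one BA39/40 query, and the token
    list plays the role of working memory holding partial results."""
    tokens = list(tokens)
    while len(tokens) > 1:
        if ")" in tokens:                      # innermost parentheses first
            close = tokens.index(")")
            open_ = max(i for i in range(close) if tokens[i] == "(")
            inner = dlpfc_eval(tokens[open_ + 1:close], trace)
            tokens[open_:close + 1] = [inner]
            continue
        ops = "*" if "*" in tokens else "+-"   # current precedence level
        i = next(k for k, t in enumerate(tokens)
                 if isinstance(t, str) and t in ops)
        result = ba3940(tokens[i - 1], tokens[i], tokens[i + 1])
        trace.append((tokens[i - 1], tokens[i], tokens[i + 1], result))
        tokens[i - 1:i + 2] = [result]
    return tokens[0]

steps = []
assert dlpfc_eval(parse("2+3*4"), steps) == 14
assert steps[0] == (3, "*", 4, 12)             # multiplication answered first
assert dlpfc_eval(parse("(2+3)*4"), []) == 20
```

The `trace` list makes the single-step queries explicit, mirroring the dlPFC-to-BA39/40 dialogue the figure describes.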
We trained an LSTM to acquire such conditional selection (or abstract reasoning) capacity, with training data generated by the template ‘if condition, action A, else action B’, where condition, action A and action B were assigned one or two random words of the vocabulary (Supplementary Methods). As displayed in Fig. 4D, after hearing the statement, dlPFC started to output the query ‘condition ?’; if it received the answer ‘T’, dlPFC output action A, otherwise action B. After sufficient training, conditional selection in dlPFC could also be generalized to the word ‘sep’ beyond the training vocabulary.

Figure 4: BA39/40 and dlPFC. (A) BA39/40 learns and processes abstract knowledge such as arithmetic operations. (B) In mixed arithmetic operations, dlPFC is responsible for the recognition of operation rules, while BA39/40 answers each single-step operation. (C) dlPFC’s role in working memory and task identification. It can distinguish ‘this’ from ‘last’, and convert the question ‘what is this/last ?’ into the query sentence ‘it is ?’ to trigger object identification in MTG. (D) dlPFC learned the conditional selection rule ‘if condition, action A, else action B’, where the training data of condition, action A, and action B are one or two randomly selected words from the vocabulary. After hearing an ‘if then’ sentence, dlPFC outputs ‘condition ?’ and the association cortex gives an answer of True or False; then dlPFC outputs action A or B respectively. (E) Word tracing task. After hearing the conditional selection sentence ‘if sep, SAY turn, else .’, the HGLP guided the pen to trace the digit template. The left panel is the template and the right the traced result (the tracing process is displayed in supplementary movies). (F) The word writing task consists of imagining a digit and then tracing it. (G) Language-guided iterative thinking process. (1) After closing the eyes, HGLP will not get any visual information from the outside world. (2) The language ‘six .’ is translated by MTG and the corresponding imagined figure is reconstructed by the decoder V4’-V1’. (3) The imagined figure 6 is then fed into the visual encoder, and the language ‘rotate .’ could flip the figure upside down into 9. (4) ‘it is ?’ can elicit the identity of the manipulated figure via MTG, and articulate ‘nine’ via Broca-PM-M1. (5) The imagined nine can be further manipulated by other language commands such as ‘shrink’ iteratively. Red arrows indicate that V1 input comes from previously reconstructed imagination.

Figure 4D-F demonstrate how HGLP learns the ‘trace’ and ‘write’ skills in one trial guided by language. After telling HGLP ‘if sep, SAY turn, else .’, the sentence was first encoded into a 32-byte lv by the A1-SPT language encoder, then decomposed into word-level lv by the Wernicke module. BA39/40 and MTG did not respond, but dlPFC (Fig. 4D) knew how to handle this ‘if then’ command by outputting ‘sep ?’ to query whether pen and digit template would separate. Meanwhile, the digit template was converted into a 32-byte vv by the visual encoder V1-V4, which, together with ‘sep ?’, was received and processed by MTG. As shown in Fig. 3G-H, MTG predicted the separation by saying ‘T’ when the pen would leave the digit template, otherwise saying ‘F’, which was received by dlPFC. If MTG’s feedback was ‘T’, dlPFC would initiate the articulation of ‘turn’ with the help of Broca, Premotor and M1 (the current HGLP has not implemented sensorimotor modules yet), so as to let the external pen randomly turn its motion direction. In this way, the HGLP traced digit 5 like a child, according to the heard language command and the viewed digit template (Supplementary Movie S1). Finally, the abstract knowledge lv of ‘trace is “if sep, SAY turn”’ could be saved in BA39/40 for future use. The writing skill can be decomposed as ‘imagine a template, trace it’. MTG first imagines a digit (Fig.
3B), which could be taken as a template at V1 and traced by the verbally controlled external pen. Movie S2 and Fig. 4F show the writing progress and the final result, respectively. These tracing and writing processes demonstrate skill learning guided by language instruction; later repeated practice would turn it into a habitual skill, with reinforcement learning implemented in the basal ganglia [26-28].

The hippocampus and its surrounding structures are very important for navigation, episodic memory, and learning. The brain records daily attended events as episodic memory in the hippocampus [29], and then consolidates the abstract knowledge into the neocortex through the engram mechanism [30, 31]. We have not implemented this mechanism in HGLP yet, but simply organize the training data in the form of a hippocampus-like template, so as to train the above association and dlPFC modules with supervision. For example, to train MTG, the hippocampus-like template needs to prepare both input and expected output sequences of n lv and vv vectors, where n = 5 indicates that each sentence has five frames of words or phrases. Its input and output contents are generated according to tasks, which are scheduled from easy to hard, just like teaching kids: first train HGLP on simple tasks such as object recognition, and then on more complex and integrated tasks, which can flexibly make use of previously learned skills. For any task, we need to arrange the hippocampal template to provide the corresponding input and expected output for each module. Even those modules that are not directly related to the task should learn to give a silent response (Fig. 2C-D). The development of HGLP depends on the arrangement of tasks to be trained. Learning new tasks will inevitably affect the performance of old skills, so it is necessary to occasionally re-experience specific tasks in order to keep the skills and knowledge from being forgotten.
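One way such a hippocampus-like template could be organized is sketched below. The module names follow Figure 1, but the frame contents, target strings and silent-target convention are hypothetical, chosen only to show the shape of the curriculum:

```python
# Hypothetical layout of a hippocampus-like training template: for each
# task, every module gets an (input, expected-output) pair per 5-frame
# sentence, and modules not involved must learn the silent response "".
SILENT = ""
MODULES = ["Wernicke", "MTG", "BA39/40", "dlPFC", "Broca"]

def make_episode(task, frames, targets):
    episode = {"task": task}
    for module in MODULES:
        episode[module] = {
            "input": frames,
            "target": targets.get(module, [SILENT] * len(frames)),
        }
    return episode

# Curriculum scheduled from easy to hard, like teaching a child:
curriculum = [
    make_episode("recognition", ["it", "is", "?", ".", "."],
                 {"MTG": ["", "", "three", "", ""],
                  "Broca": ["", "", "", "it is three", ""]}),
    make_episode("mixed arithmetic", ["2", "+", "3", "*", "4"],
                 {"dlPFC": ["", "", "", "", "3*4 ?"]}),
]

# Uninvolved modules are trained toward silence (cf. Fig. 2C-D):
assert curriculum[0]["BA39/40"]["target"] == [SILENT] * 5
```

Replaying stored episodes from such a template is also what keeps old skills from being forgotten while new tasks are learned.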
So, the contents saved in the hippocampus determine the direction of knowledge development of the cortical modules.

HGLP has two routing directions (Fig. 1): the bottom-up route conveys the processed sensory information to high-level modules and the responses of association modules to dlPFC’s queries; the top-down route conveys the verbal attention and queries from dlPFC to the association modules, and the control signals to the sensorimotor decoders for visual imagination and verbal articulation. These enable the HGLP network to handle all kinds of information, and to interact with the outside world and even self-think flexibly and effectively. Note that, as there are a large number of interconnections among various brain regions, HGLP also supports such interconnections whenever module A’s output modality is the same as module B’s input. For example, MTG’s or BA39/40’s language output can be fed into Broca.

After learning the above skills, HGLP could perform the human-like thinking process guided by language in Fig. 4G. (1) HGLP first closed its eyes, namely, no external images were fed into V1 (all subsequent input images were generated through the imagination process). (2) Hearing ‘six .’, MTG translated it into the corresponding vv, which could be reconstructed into an imagined figure by V4’-V1’. (3) The imagined figure 6 was then fed into the visual encoder, and the language ‘rotate .’ could flip it upside down into 9, as in Fig. 3E. (4) Following the sentence ‘it is ?’, HGLP could consciously identify the manipulated figure and articulate ‘nine’ via Broca-PM-M1. (5) Finally, HGLP used ‘shrink .’ to make the virtual digit smaller; red arrows indicate that V1 input comes from the previously reconstructed imagination. This experiment demonstrates that HGLP can understand words and sentences by properly manipulating imagination and generating verbal output. HGLP thereby forms an iterative thinking process guided by language.

HGLP demonstrates the ability to solve the five problems raised at the beginning.
(1) Step-by-step learning is achieved by proper arrangement of the task learning schedule with hippocampal templates. Previously learned skills can be flexibly reused in subsequent tasks. For example, to understand a word’s meaning, HGLP needs to be able to repeat the heard pronunciation; and the word writing task (Fig. 4F) requires digit imagination and understanding of the ‘if then’ statement. (2) Word meaning is explained with multimodal neuronal activation. Previously, activating digit-2-related visual neurons required visualizing a digit 2 instance, but now (Fig. 3B), with the verbal word ‘two’ and the MTG translator, those 2-related visual neurons can be activated. Or we can say that the verbal ‘two’ is explained by these visual neurons. (3) A virtual world is used to understand and generate sentences. This virtual world is dynamically built in HGLP’s visual autoencoder, as Baddeley’s VSS, and language autoencoder, as Baddeley’s PL. The imagined visual output can be fed back to the visual encoder and the sentence to be articulated can be fed back to the language encoder, so the processed information is not lost after articulation or imagination. The visual autoencoder is a mental stage where objects can be created and manipulated, the PL can be viewed as a script to guide the development of a story on the stage, and dlPFC is the director. The mental virtual world is built and maintained in real time according to the language and visual input from the outside world and the control commands from dlPFC, so HGLP has the ability to interact with the environment through the virtual world model. HGLP can manipulate its mental world according to other people’s language to understand their intention; meanwhile, it can generate a verbal reply according to the evolution of the virtual world. (4) The thinking process of HGLP is expressed via vv and lv, which can be visualized and articulated via HGLP’s own visual and language decoders.
Like the human consciousness stream, it is hard to know what you are thinking via any brain measurement, but you can easily express it via your articulation system. The vv and lv streams have the same characteristics. (5) The ‘digit tracing’ experiment demonstrates that HGLP has the human-like ability of one-trial learning. It understands the rule conveyed by language and knows how to behave according to the rule. This is a goal-directed rapid learning system, a complementary system to the habitual learning system implemented within the reinforcement learning framework.

HGLP also demonstrates some other human-like features. It can identify a digit instance at a glance, instead of ranking every item in the vocabulary by softmax [32], just as people can immediately recognize a watermelon without evaluating the probability of it being a cherry or a car. In this way, HGLP does not need a fixed vocabulary size, and can dynamically remember novel items when learned, or forget items that are not experienced frequently. Moreover, HGLP is a human-like neural network because it shows similar symptoms when a specific module is lesioned. Fig. 2E demonstrated the Broca aphasia symptom when neurons in the Broca module were silenced. We also know that, by removing the hippocampus template, HGLP will behave like patient H.M. [33], who preserved sensorimotor functions, working memory, learned knowledge, etc., but could not form episodic memory nor learn new knowledge. In addition, HGLP also demonstrated a more human-like voluntary attention mechanism guided by language (Fig.
3F), which further allows the machine to act according to its own choice mediated by its inner language lv, to achieve ‘free will in machine’, which is beyond the capacity of the current self-attention mechanism [1-3].

In the future, we will add other human-like brain modules, such as the sensorimotor system and the basal ganglia system, so that robots can have both goal-directed action output under HGLP language control and non-describable habitual behavior under reinforcement learning control. We may also add a human-like value system including the insula, amygdala and OFC, etc., to detect interoception and voluntarily generate language to guide its own thinking and behavior [34, 35]. We will also divide the visual autoencoder into a ventral stream and a dorsal stream to process object features and spatial information respectively [36]. Some other cortices can also be implemented under the HGLP framework to make it more intelligent and human-like.

References

[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[2] Zhou, X., Li, L., Dong, D., Liu, Y., Chen, Y., Zhao, W. X., ... & Wu, H. (2018). Multi-turn response selection for chatbots with deep attention matching network. ACL 2018.
[3] Sun, C., Myers, A., Vondrick, C., Murphy, K., & Schmid, C. (2019). VideoBERT: A joint model for video and language representation learning.
In Proceedings of the IEEE International Conference on Computer Vision (pp. 7464-7473).
[4] Xie, Z., Fu, X., & Yu, J. (2018). AlphaGomoku: An AlphaGo-based Gomoku artificial intelligence using curriculum learning. arXiv preprint arXiv:1809.10595.
[5] Friederici, A. D. (2012). The cortical language circuit: from auditory perception to sentence comprehension.
Trends in Cognitive Sciences, (5), 262-268.
[6] Pulvermüller, F., & Fadiga, L. (2010). Active perception: sensorimotor circuits as a cortical basis for language. Nature Reviews Neuroscience, (5), 351-360.
[7] Baddeley, A. (2010). Working memory. Current Biology, (4), R136-R140.
[8] DiCarlo, J. J., Zoccolan, D., & Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron, (3), 415-434.
[9] Friederici, A. D., Chomsky, N., Berwick, R. C., Moro, A., & Bolhuis, J. J. (2017). Language, mind and brain. Nature Human Behaviour, (10), 713-722.
[10] Skeide, M. A., & Friederici, A. D. (2016). The ontogeny of the cortical language network. Nature Reviews Neuroscience, (5), 323.
[11] Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, (5), 393-402.
[12] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).
[13] Whitney, C., Kirk, M., O’Sullivan, J., Lambon Ralph, M. A., & Jefferies, E. (2011). The neural organization of semantic control: TMS evidence for a distributed network in left inferior frontal and posterior middle temporal gyrus.
Cerebral Cortex, (5), 1066-1075.
[14] Ralph, M. A. L., Jefferies, E., Patterson, K., & Rogers, T. T. (2017). The neural and computational bases of semantic cognition. Nature Reviews Neuroscience, (1), 42.
[15] Nieder, A. (2016). The neuronal code for number. Nature Reviews Neuroscience, (6), 366.
[16] Shapiro, L. P., Gordon, B., Hack, N., & Killackey, J. (1993). Verb argument structure processing in complex sentences in Broca's and Wernicke's aphasia. Brain and Language, (3), 423-447.
[17] Hagoort, P. (2005). On Broca, brain, and binding: a new framework. Trends in Cognitive Sciences, (9), 416-423.
[18] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, (8), 1735-1780.
[19] Mohr, J. P., Pessin, M. S., Finkelstein, S., Funkenstein, H. H., Duncan, G. W., & Davis, K. R. (1978). Broca aphasia: pathologic and clinical. Neurology, (4), 311-311.
[20] Broca, P. (1861). Remarks on the seat of the faculty of articulated language, following an observation of aphemia (loss of speech). Bulletin de la Société Anatomique, 330-57.
[21] Mummery, C. J., Patterson, K., Price, C. J., Ashburner, J., Frackowiak, R. S., & Hodges, J. R. (2000). A voxel-based morphometry study of semantic dementia: relationship between temporal lobe atrophy and semantic memory. Annals of Neurology, (1), 36-45.
[22] Goldman-Rakic, P. S. (1995). Cellular basis of working memory. Neuron, (3), 477-485.
[23] Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, (1), 167-202.
[24] Sakai, K. (2008). Task set and prefrontal cortex. Annual Review of Neuroscience, 219-245.
[25] Menon, V., Mackenzie, K., Rivera, S. M., & Reiss, A. L. (2002). Prefrontal cortex involvement in processing incorrect arithmetic equations: Evidence from event-related fMRI. Human Brain Mapping, (2), 119-130.
[26] Yin, H. H., & Knowlton, B. J. (2006). The role of the basal ganglia in habit formation.
Nature Reviews Neuroscience, (6), 464-476.
[27] Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, (5306), 1593-1599.
[28] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, (7540), 529-533.
[29] Leutgeb, S., Leutgeb, J. K., Barnes, C. A., Moser, E. I., McNaughton, B. L., & Moser, M. B. (2005). Independent codes for spatial and episodic memory in hippocampal neuronal ensembles. Science, (5734), 619-623.
[30] Tonegawa, S., Morrissey, M. D., & Kitamura, T. (2018). The role of engram cells in the systems consolidation of memory. Nature Reviews Neuroscience, (8), 485-498.
[31] Josselyn, S. A., & Tonegawa, S. (2020). Memory engrams: Recalling the past and imagining the future. Science, (6473).
[32] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[33] Corkin, S. (2002). What's new with the amnesic patient H.M.?
Nature Reviews Neuroscience, (2), 153-160.
[34] Mayer, E. A. (2011). Gut feelings: the emerging biology of gut-brain communication. Nature Reviews Neuroscience, (8), 453-466.
[35] Murray, E. A., & Rudebeck, P. H. (2018). Specializations for reward-guided decision-making in the primate ventral prefrontal cortex. Nature Reviews Neuroscience, (7), 404-417.
[36] Van Essen, D. C., Anderson, C. H., & Felleman, D. J. (1992). Information processing in the primate visual system: an integrated systems perspective. Science, 255