A hybrid classical-quantum workflow for natural language processing
Lee J. O'Riordan, Myles Doyle, Fabio Baruffa, Venkatesh Kannan
Irish Centre for High-End Computing, Dublin, Ireland; National University of Ireland, Galway, Ireland; Intel Deutschland GmbH, Feldkirchen, Germany.
(Dated: April 16, 2020)
∗ Corresponding author: [email protected]

Natural language processing (NLP) problems are ubiquitous in classical computing, where they often require significant computational resources to infer sentence meanings. With the appearance of quantum computing hardware and simulators, it is worth developing methods to examine such problems on these platforms. In this manuscript we demonstrate the use of quantum computing models to perform NLP tasks, where we represent corpus meanings, and perform comparisons between sentences of a given structure. We develop a hybrid workflow for representing small and large scale corpus data sets to be encoded, processed, and decoded using a quantum circuit model. In addition, we provide our results showing the efficacy of the method, and release our developed toolkit as an open software suite.
I. INTRODUCTION
Natural language processing (NLP) is an active area of both theoretical and applied research, and covers a wide variety of topics from computer science, software engineering, and linguistics, amongst others. NLP is often used to perform tasks such as machine translation, sentiment analysis, relationship extraction, word sense disambiguation and automatic summary generation [1]. Most traditional NLP algorithms for these problems are defined to operate over strings of words, and are commonly referred to as the "bag of words" approach [2]. The challenge, and thus limitation, of this approach is that the algorithms analyse sentences in a corpus based on the meanings of the component words, and lack information from the grammatical rules and nuances of the language. Consequently, the quality of results from these traditional algorithms is often unsatisfactory when the complexity of the problem increases.

On the other hand, an alternative approach called "compositional semantics" incorporates the grammatical structure of sentences from a given language into the analysis algorithms. Compositional semantics algorithms include the information flows between words in a sentence to determine the meaning of the whole sentence [3]. One such model in this class is "(categorical) distributional compositional semantics", known as DisCoCat [4-6], which is based on tensor product composition to give a grammatically informed algorithm that computes the meaning of sentences and phrases. This algorithm has been noted to potentially offer improvements to the quality of results, particularly for more complex sentences, in terms of memory and computational requirements. However, the main challenge in its implementation is the need for large classical computational resources.

With the advent of quantum computer programming environments, both simulated and physical, a question may be whether one can exploit the available Hilbert space of such systems to carry out NLP tasks. The DisCoCat methods have a natural extension to a quantum mechanical representation, allowing a problem to be mapped directly to this formalism [5]. Using an oracle-based access pattern, one can bound the number of accesses required to create the appropriate states for use by the DisCoCat methods [7]. This, however, requires the use of a quantum random access memory, or qRAM [8, 9]. Currently, qRAM remains unrealised, and expectations are that the resources necessary to realise it are as challenging as a fault-tolerant quantum computer [10]. As such, it can be useful to examine scenarios where qRAM is not part of the architectural design of the quantum circuit. This allows us to examine proof-of-concept methods, and to explore and develop use-cases later improved by its existence.

In this paper we examine the process for mapping a corpus to a quantum circuit model, and use the encoded meaning-space of the corpus to represent fundamental sentence meanings. With this representation we can examine the mapping of sentences to the encoding space, and additionally compare sentences with overlapping meaning-spaces. We follow a DisCoCat-inspired formalism to define sentence meaning and similarity based upon a given compositional sentence structure, and relationships between sentence tokens determined using a distributional method of token adjacency.

This paper is laid out as follows: Section II will give an introduction to NLP, the application of quantum models to NLP, and discuss the encoding strategy for a quantum circuit model.
Section III will discuss the preparation methods required to enable quantum-assisted encoding and processing of the text data. Section IV will demonstrate the proposed methods using our quantum NLP software toolkit [11], sitting atop the Intel Quantum Simulator (IQS) [12]. For this we showcase the methods, and compare results for corpora of different sizes and complexity. Finally, we conclude in Section V.

II. NLP METHODS
One of the main concerns of NLP methods is the extraction of information from a body of text, wherein the data is not explicitly structured; generally, the text is meant for human, rather than machine, consumption [13]. As such, explicit methods to infer meaning and understand a body of text are required to encode such data in a computational model.

Word embedding models, such as word2vec, have grown in popularity due to their success in representing and comparing data using vectors of real numbers [14]. Additionally, libraries and toolkits such as NLTK [15] and spaCy [16] offer community-developed models and generally incorporate the latest research methods for NLP. The use of quantum mechanical effects for embedding and retrieving information in NLP has seen much interest in recent years [17-23].

An approach that aims to overcome the ambiguity of traditional NLP methods, such as the bag-of-words model, is the categorical distributional compositional (DisCoCat) model [4, 5]. This method incorporates semantic structure, where sentences are constructed through a natural tensoring of individual component words following a set of rules determined from category theory. These rule-sets, from which sentence structures may be composed, are largely based on the framework of pregroup grammars [24].

The DisCoCat approach offers a means to combine the grammatical structure of sentences with the relationships between the tokens in those sentences. Words that appear closer together in texts are more likely to be related, and sentence structures can be determined using pregroup methods. These methods can easily be represented in a diagrammatic form, and allow for a natural extension to a quantum state representation [6]. This diagrammatic form, akin to a tensor network, allows for calculating the similarity between sentences. This similarity measure assumes an encoded quantum state representing the structure of the given corpus, and an appropriately prepared test state to compare with, alluding to a tensor-contraction approach to perform the evaluation.

While this approach has advantages in terms of accuracy and generalisation to complex sentence structures, state preparation is something we must consider. Given the current lack of qRAM, the specified access bounds are unrealised [7], and so it is worth considering state preparation as part of the process. Ensuring an efficient preparation approach will also be important to enable processing on a scale to rival that of traditional high-performance computing NLP methods.

As such, we aim to provide a simplified model, framework and hybrid workflow for representing textual data using a quantum circuit model. We draw inspiration from the DisCoCat model to preprocess our data into a structure easily implementable on a quantum computer. We consider simple sentences of the form "noun-verb-noun" to demonstrate this approach. All quantum circuit simulations and preprocessing are performed by our quantum NLP toolkit (QNLP), sitting atop the Intel Quantum Simulator (formerly qHiPSTER) to handle the distributed high-performance quantum circuit workloads [12, 25]. We release our QNLP toolkit as an open source (Apache 2.0) project, and have made it available on GitHub [11].
III. METHODS

A. Representing meaning in quantum states
In this section, we discuss the implementation of the algorithms required to enable encoding, processing, and decoding of our data. We consider a simplified, restricted example of the sentence structure "noun-verb-noun" as the representative encoding format. To represent sentence meanings using this workflow, we must first consider several steps to prepare our corpus data set for analysis:

1. Data must be pre-processed to tag tokens with the appropriate grammatical type; stop-words (e.g. "the", "a", "at", etc.) and problematic (e.g. non-alphanumeric) characters should be cleaned from the text to ensure accurate tagging, wherein type information is associated with each word.
2. The pre-processed data must be represented in an accessible/addressable (classical) memory medium.
3. There must be a bijective mapping between the pre-processed data and the quantum circuit representation to allow both encoding and decoding.

Assuming an appropriately prepared dataset, the encoding of classical data into a quantum system can be mapped to two different approaches: state (digital) or amplitude (analogue) encoding [26, 27]. We aim to operate in a mixed-mode approach: encoding and representing corpus data using state methods, then representing and comparing test sentence data through amplitude adjustment, measurement, and overlap.

Our approach to encoding data starts with defining a fundamental language (basis) token set for each representative token meaning space (subject nouns, verbs, object nouns). The notion of similarity, and hence orthogonality, with language can be a difficult problem. Do we consider the words "stand" and "sit" to be completely opposite, or are they similar because of the type of action taken? For this work, we let the degree of 'closeness' be determined by the distributional nature of the terms in the corpus; words further apart in the corpus are more likely to be opposite.

To efficiently encode the corpus data, we decide to represent the corpus in terms of the n most fundamentally common tokens in each meaning space. This draws similarity with the use of a word embedding model to represent a larger space of tokens in terms of related meanings in a smaller space [28-30]. This is necessary as representing each token in the corpus matching the sentence structure type can create a much larger meaning space than is currently representable, given realistic simulation constraints. However, one can note that as we increase the limit of fundamental tokens in our basis, we tend to the full representative meaning model.

Taking inspiration from the above methods, we implement an encoding strategy that, given the basis tokens, maps the remaining non-basis tokens to these, given some distance cut-off in the corpus. A generalised representation of each token t_i in its respective meaning space m is defined as

t_i = Σ_{j}^{n} d_{i,j} m_j ,  (1)

where d_{i,j} defines the distance between the base token m_j and the non-base token t_i. As such, we obtain a linear combination of the base tokens with representative weights to describe the mapped tokens.

We have identified the following key steps to effectively pre-process data for encoding:

1. Tokenise the corpus and record the position of occurrence of each token in the text.
2. Tag tokens with the appropriate meaning space type (e.g. noun, verb, stop-word, etc.).
3. Separate tokens into noun and verb datasets.
4. Define basis tokens in each set as the N_nouns and N_verbs most frequently occurring tokens.
5. Map basis tokens in each respective space to a fully connected graph, with edge weights defined by the minimum distance between each pair of basis tokens.
6. Calculate the shortest Hamiltonian cycle for the above graph. The token order within the cycle is reflective of the tokens' separation within the text, and a measure of their similarity.
7. Map the basis tokens to binary strings, using a given encoding scheme.
8. Project composite tokens (i.e. non-basis tokens) onto the basis token set using representation cut-off distances for similarity, W_nouns and W_verbs.
9. Form sentences by matching composite noun-verb-noun tokens using relative distances and a noun-verb distance cut-off, W_nv.

After conducting the pre-processing steps, the corpus is represented as a series of binary strings of basis tokens. At this stage the corpus is considered prepared and can be encoded into a quantum register. A classical sketch of this pre-processing pipeline is given below.
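To make the above steps concrete, the following is a minimal, illustrative sketch of the classical pre-processing stage using NLTK for tokenisation and tagging. It is not the QNLP toolkit API; the corpus string, parameter values and helper names are placeholders.

```python
# Illustrative pre-processing sketch (steps 1-4 and 8 above), assuming NLTK and its
# 'punkt' and 'averaged_perceptron_tagger' data are installed.
from collections import Counter
import nltk

def preprocess(corpus, num_basis_nouns=8, noun_dist_cutoff=5):
    tokens = [t.lower() for t in nltk.word_tokenize(corpus) if t.isalnum()]
    tagged = nltk.pos_tag(tokens)              # tag each token with its type
    positions, nouns = {}, []
    for pos, (tok, tag) in enumerate(tagged):  # record positions of occurrence
        positions.setdefault(tok, []).append(pos)
        if tag.startswith("NN"):
            nouns.append(tok)
    # basis tokens: the most frequently occurring nouns
    basis = [t for t, _ in Counter(nouns).most_common(num_basis_nouns)]
    # project composite (non-basis) nouns onto basis tokens within the cut-off
    mapping = {}
    for tok in set(nouns) - set(basis):
        near = [b for b in basis
                if min(abs(p - q) for p in positions[tok] for q in positions[b])
                <= noun_dist_cutoff]
        mapping[tok] = near                    # empty list corresponds to the empty set
    return basis, mapping

basis, mapping = preprocess("Alice spoke to the queen while the hatter poured the tea.")
print(basis)
print(mapping)
```

The same routine applied to the verb tags yields the verb basis and mappings; sentence formation (step 9) then pairs mapped nouns and verbs that fall within W_nv of one another.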
B. Token encoding

To ensure that the mapping of the basis words to an encoding pattern is reflective of the underlying distributional relationships between words in the corpus, it is necessary to choose an encoding scheme such that the inter-token relationships are preserved. While many more complex schemes can give insightful relationships, we choose a cyclical encoding scheme where the Hamming distance, d_H, between each bit-string is equal to the distance between the bit-strings in the data set. For 2- and 4-qubit registers respectively, this would equate to the patterns

p_2 = [00, 01, 11, 10],
p_4 = [0000, 0001, 0011, 0111, 1111, 1110, 1100, 1000],

of which the i = [1, 2n] range of indexed (base-10) elements of p_n can be generated iteratively by

p_n(i+1) = 2 p_n(i) + 1,              if i ≤ n,
p_n(i+1) = p_n(n+1) − p_n(i−n+1),     if n < i ≤ 2n−1,
p_n(i+1) = undefined,                 otherwise,   (2)

given p_n(1) = 0. For the simple 2-bit pattern, this equates to a Gray code mapping, but it differs for larger register sizes. With this encoding scheme, we can show that the Hamming distances between each pattern and the others in the set have a well-defined position-to-distance relationship.

As an example, let us consider a 4-element basis of tokens given by b = {up, down, left, right}. We define up and down as opposites, and so should preserve the largest Hamming distance between them. This requires mapping the tokens to either 00, 11, or to 10, 01, for this pair. Similarly, we follow the same procedure with the remaining tokens. In this instance, we have mapped up and down to the pair 00 and 11, and left and right to the pair 01 and 10, as shown in Fig. 1. One may argue that the up-down relationship has more similarities than, say, a left-down relationship, but for the purpose of our example this definition is sufficient. Care ought to be taken in defining inter-token relationships, requiring some domain expertise of the problem being investigated. The choice of inter-token relationship taken during preparation will influence the subsequent token mappings determined later in the process.

For our work we have deemed it sufficient to define these similarities by the distance between the tokens in a text; larger distances between tokens define a larger respective Hamming distance, and smaller distances a smaller one. We can similarly extend this method to larger datasets, though the ordering problem requires a more automated approach.

FIG. 1. Graph showing the mapping of tokens (blue) to bit-strings (white). Edge weights between the bit-strings represent the Hamming distances, d_H, between connected nodes. By mapping the tokens to the appropriate basis bit-string we can use the Hamming distances to represent differences between tokens.

For the 4-qubit encoding scheme, we must define a strategy to map the tokens to a fully connected graph, where again the respective positions of the bit-strings reflect the Hamming distance between them, as shown in Fig. 1. To effectively map tokens to these bit-strings, we use the following procedure:

1. Given the chosen basis tokens, and their positions in the text, create a graph where each basis token is a single node.
2. Calculate the distances between all token positions for each pairing of tokens, over all (n² − n)/2 unique pairs, and use these distances as the edge weights of the graph.

We use the networkx package for finding the minimum Hamiltonian cycle of the resulting graph [31].
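A short, self-contained sketch of the cyclic pattern generation described by eq. (2), together with a check of the resulting Hamming distances (pure Python, no quantum simulator required):

```python
# Generate the 2n cyclic bit-patterns of length n following eq. (2), and verify that
# the Hamming distance from the first pattern reflects the position in the cycle.
def cyclic_patterns(n):
    p = [0]
    for i in range(1, n + 1):          # first half: 0, 1, 3, 7, 15, ...
        p.append(2 * p[i - 1] + 1)
    for i in range(n + 1, 2 * n):      # second half: reflect back towards zero
        p.append(p[n] - p[i - n])
    return [format(v, f"0{n}b") for v in p]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

pats = cyclic_patterns(4)
print(pats)                                  # ['0000', '0001', '0011', '0111',
                                             #  '1111', '1110', '1100', '1000']
print([hamming(pats[0], q) for q in pats])   # [0, 1, 2, 3, 4, 3, 2, 1]
```

For n = 2 the same routine reproduces the Gray-code ordering [00, 01, 11, 10].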
C. Methods for quantum state encoding

To simplify our encoding procedure, we can assume a binary representation of distance for eq. (1), wherein all tokens within the given cut-off are equally weighted. This allows us to encode the states as an equal-weighted superposition, which is easily implemented as a quantum circuit [32, 33].

For notational simplicity, we define the following mappings:

X_a : |a⟩ → |¬a⟩,
CX_{a,b} : |a⟩|b⟩ → |a⟩|a ⊕ b⟩,
nCX_{a_1,…,a_n,b} : |a_1⟩…|a_n⟩|b⟩ → |a_1⟩…|a_n⟩|b ⊕ (a_1 ∧ ⋯ ∧ a_n)⟩,

where |a⟩ and |b⟩ are computational basis states, X is the Pauli-X (σ_x) gate, and CX and nCX are the controlled-X and n-controlled NOT (nCX) operations, respectively. Additionally, we may define controlled operations of any arbitrary unitary gate using a similar construction to the above.

The goal of this algorithm is to encode a set of bit-strings representing our token meaning-space as an equal-weighted superposition state. For a set of N unique binary patterns p^{(i)} = {p^{(i)}_1, …, p^{(i)}_n}, each of length n for i = 1, …, N, we require three registers of qubits: a memory register |m⟩ of length n, an auxiliary register |a⟩ of length n, and a two-qubit control register |u⟩ initialised as |01⟩. |m⟩ and |a⟩ are initialised as |m⟩ = |a⟩ = |0⟩^⊗n, with the full quantum register initialised as

|ψ_0⟩ = |a⟩|u⟩|m⟩ = |0_1 … 0_n⟩|01⟩|0_1 … 0_n⟩.  (3)

Each of the binary vectors is encoded sequentially. For each iteration of the encoding algorithm, a new state is generated in the superposition (excluding the final iteration). The state generated is termed the active state of the next iteration; all other states are said to be inactive. Note that, in each iteration of the algorithm, the active state is the branch selected with |u⟩ = |01⟩, while fully encoded (inactive) states are selected with |u⟩ = |00⟩.

During a single iteration, a binary vector is stored in integer format, which is then serially encoded bit-wise into the auxiliary register |a⟩, resulting in the state

|ψ_1⟩ = |a^{(1)}_1 … a^{(1)}_n⟩|01⟩|0_1 … 0_n⟩.  (4)

This binary representation is then copied into the memory register |m⟩ of the active state by applying a 2CX gate on |ψ_1⟩:

|ψ_2⟩ = ∏_{j=1}^{n} 2CX_{a^{(i)}_j, u_2, m_j} |ψ_1⟩.  (5)

Next, we apply a CX followed by an X gate to all qubits in |m⟩, using the corresponding qubits in |a⟩ as controls:

|ψ_3⟩ = ∏_{j=1}^{n} X_{m_j} CX_{a^{(i)}_j, m_j} |ψ_2⟩.  (6)

This sets a qubit in |m⟩ to 1 if the respective qubit index in both |m⟩ and |a⟩ match, and to 0 otherwise. Thus, the state whose register |m⟩ matches the pattern stored in |a⟩ will be set to all 1's, while the other states will have at least one occurrence of 0 in |m⟩.

Now that the state being encoded has been selected, an nCX operation is applied to the first control qubit, u_1, using the qubits in |m⟩ as the controls:

|ψ_4⟩ = nCX_{m_1 … m_n, u_1} |ψ_3⟩.  (7)

The target qubit, whose initial value is 0, will be set to 1 if |m⟩ consists of only 1's. This is the case when the pattern in |m⟩ is identical to the pattern being encoded, i.e. the pattern stored in |a⟩.
In order to populate a new state into the superposition, it is required to effectively 'carve off' some amplitude from the existing states so that the new state has a non-zero coefficient. To do this, we apply a controlled unitary CS^{(N+1−i)} to the second control qubit, u_2, using the first control qubit, u_1, as the control:

|ψ_5⟩ = CS^{(N+1−i)}_{u_1, u_2} |ψ_4⟩,  (8)

where

S^{(i)} = [ √((i−1)/i)   1/√i ;  −1/√i   √((i−1)/i) ] = R_y(φ(i)),  (9)

with i ∈ Z^+ and φ(i) = −2 cos^{−1}(√((i−1)/i)). Following this rotation, the amplitude of the active state is split between the branches selected with |u⟩ = |11⟩ and |u⟩ = |10⟩, while all other (inactive) states remain selected with |u⟩ = |00⟩.

To apply the next iteration of the algorithm we uncompute the steps from equations (5)-(7) as:

|ψ_6⟩ = nCX_{m_1 … m_n, u_1} |ψ_5⟩,  (10)
|ψ_7⟩ = ∏_{j=n}^{1} CX_{a^{(i)}_j, m_j} X_{m_j} |ψ_6⟩,  (11)
|ψ_8⟩ = ∏_{j=n}^{1} 2CX_{a^{(i)}_j, u_2, m_j} |ψ_7⟩.  (12)

This results in one branch, selected with |u⟩ = |00⟩, whose memory register now contains the pattern {a^{(i)}_1, …, a^{(i)}_n}, and a second branch, selected with |u⟩ = |01⟩, whose memory register is reset to all zeroes; the latter is identified as the active state of the next iteration. Finally, the register |a⟩ of every state must be set to all zeroes by sequentially applying X gates to each qubit in |a⟩ according to the pattern that was just encoded. The quantum register is then ready for the next iteration to encode another pattern. Following the encoding of all patterns, our state will be

|ψ⟩ = |a⟩|u⟩|m⟩ = |0_1 … 0_n⟩|00⟩ ⊗ ( (1/√N) Σ_{i=1}^{N} |p^{(i)}_1 … p^{(i)}_n⟩ ).  (13)

Note that this algorithm assumes that the number of patterns to be encoded is known beforehand, which is required to generate the set of S^{(i)} matrices and apply them in the correct order. The total number of qubits used in this algorithm is 2n + 2, of which n + 2 are reusable after the implementation, since the qubits in |a⟩ and |u⟩ are all reset to |0⟩ upon completion.

These additional n + 2 qubits can also be used as intermediate scratch to enable the large n-controlled operations during the encoding stages. This ensures that we can perform the nCX operations with a linear, rather than polynomial, number of two-qubit gate calls [34].
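To see why the S^{(i)} rotations of eqs. (8)-(9) yield an equal-weighted superposition, the following bookkeeping sketch tracks the amplitudes classically; it assumes the N+1−i ordering of the rotations given above and involves no quantum simulation.

```python
# At iteration i of N, the active branch is rotated by S^(N+1-i): the newly stored
# pattern receives 1/sqrt(N+1-i) of the remaining active amplitude, which works out
# to 1/sqrt(N) for every pattern, leaving no active amplitude after the last step.
from math import sqrt

def carve_amplitudes(num_patterns):
    stored, active = [], 1.0
    for i in range(1, num_patterns + 1):
        k = num_patterns + 1 - i          # index of the S matrix used this iteration
        stored.append(active / sqrt(k))   # amplitude carved off for the stored pattern
        active *= sqrt((k - 1) / k)       # amplitude retained by the active branch
    return stored, active

amps, leftover = carve_amplitudes(8)
print([round(a, 4) for a in amps])  # eight equal amplitudes, 1/sqrt(8) ~ 0.3536
print(round(leftover, 4))           # 0.0 -- consistent with eq. (13)
```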
D. Representing patterns using encoded data

The purpose of this methodology is to represent a single test pattern using the previously encoded meaning-space. The relative distance between each meaning-space state pattern and the single test pattern x = {x_1, …, x_n} is then encoded into the amplitude of each respective meaning-space state pattern. Thus, each represented state will have a coefficient determined by the Hamming distance between itself and the test pattern. The method we present below calculates the binary difference between the target state's bit-string and the test pattern, denoted by d_H.

The algorithm assumes that we already have N states of length n encoded into the memory register |m⟩. The subsequent encoding requires 2n + 1 qubits: n qubits to store the test pattern, a single-qubit register on which the rotations will act, and n qubits for the memory register. As our previously used encoding stage required 2n + 2 qubits, we can repurpose the |a⟩ and |u⟩ registers as the test pattern and rotation registers, respectively. Our meaning-space patterns are encoded in the memory register |m⟩, with registers |a⟩ and |u⟩ initialised as all 0's. Hence, our initial state is given by eq. (13).

Next, the test pattern x = {x_1, …, x_n} is encoded into the register |a⟩ sequentially by applying an X gate to each qubit whose corresponding classical bit x_i is set:

|ψ'⟩ = |x_1 … x_n⟩|0⟩ ⊗ ( (1/√N) Σ_{i=1}^{N} |p^{(i)}_1 … p^{(i)}_n⟩ ).  (14)

Rather than overwriting register |a⟩ with the differing bit-values, a two-qubit controlled R_y(θ) (2CR_y) gate is applied, such that θ = π/n. This is done by iteratively applying the 2-controlled R_y gate with a_j and m_j as control qubits to rotate |u⟩ if both control qubits are set, for j = 1, …, n. The operation is performed twice: once for a_j = 1, m_j = 1, and, by appropriately flipping the bits prior to use, once for a_j = 0, m_j = 0. Finally, the test pattern stored in register |a⟩ is reset to all 0's by applying an X gate to each qubit in |a⟩ whose corresponding classical bit is set to 1.

The above process can be written as

|ψ''⟩ = ∏_{j=1}^{n} X_{a_j} X_{m_j} 2CR_y(θ)_{a_j, m_j, u} X_{a_j} X_{m_j} 2CR_y(θ)_{a_j, m_j, u} |ψ'⟩,  (15)

where the state after application is given by

|ψ''⟩ = |0⟩^⊗(n+1) ⊗ (1/√k) Σ_{j=1}^{k} [ cos(φ_j)|0⟩ + sin(φ_j)|1⟩ ] ⊗ |p^{(j)}⟩,  (16)

with

φ_j = d_H(p^{(j)}, x) π/n = (π/n) Σ_{l=1}^{n} p^{(j)}_l ⊕ x_l.  (17)

Applying the linear map

P = 1^⊗(n+1) ⊗ |1⟩⟨1| ⊗ 1^⊗n,  (18)

we represent the meaning-space states weighted by the Hamming distance to the test pattern x. The state following this is given by

|ψ'_x⟩ = ( 1/√( k ⟨ψ''|P|ψ''⟩ ) ) Σ_{j}^{k} sin(φ_j) |p^{(j)}⟩,  (19)

where the qubit registers |a⟩ and |u⟩ are left out for brevity.

With the above method we can examine the similarity between patterns mediated via the meaning space. While one may directly calculate the Hamming distance between both register states as a measure of similarity, doing so loses the distributional meaning discussed in Section III B.
As such, we aim to represent both patterns in the meaning-space, and examine their resulting similarity using the state overlap, with the result defined by

F(x^{(0)}, x^{(1)}) = | ( 1 / ( k √( ⟨P^{(0)}⟩⟨P^{(1)}⟩ ) ) ) Σ_{j=1}^{k} sin(φ^{(0)}_j) sin(φ^{(1)}_j) |,  (20)

with ⟨P^{(i)}⟩ = ⟨x^{(i)}|P|x^{(i)}⟩, and x^{(i)} denoting test pattern i.
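A classical sketch of the weighting and comparison of eqs. (17), (19) and (20), as written above; the stored patterns and test patterns used here are arbitrary placeholders.

```python
# Weight each stored pattern by sin(d_H * pi / n) against a test pattern (eq. (17)),
# normalise as in eq. (19), and compare two test patterns via the overlap of eq. (20).
from math import sin, pi, sqrt

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def weights(patterns, test):
    n = len(test)
    return [sin(hamming(p, test) * pi / n) for p in patterns]

def overlap(patterns, test0, test1):
    w0, w1 = weights(patterns, test0), weights(patterns, test1)
    norm = sqrt(sum(a * a for a in w0) * sum(b * b for b in w1))
    return abs(sum(a * b for a, b in zip(w0, w1))) / norm

memory = ["0000", "0011", "0111", "1111", "1110", "1100"]  # placeholder meaning-space
print(overlap(memory, "0000", "0101"))
```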
IV. RESULTS

A. Small-scale example

We now demonstrate an example of the method outlined in Sec. III for a sample representation and sentence comparison problem.

We opt for the simplified noun-verb-noun sentence structure, and define sets of words within each of these spaces, through which we can construct our full meaning space, following an approach outlined in [4]. For nouns, we have: (i) subjects, n_s = {adult, child, smith, surgeon}; and (ii) objects, n_o = {outside, inside}. For verbs, we have v = {stand, sit, move, sleep}. With these sets, we can represent the full meaning-space as

(adult, child, smith, surgeon)^T ⊗ (stand, sit, move, sleep)^T ⊗ (outside, inside)^T.  (21)

Whilst all combinations may exist, subject to a given training corpus only certain patterns will be observed, allowing us to restrict the information in our meaning-space. For simplicity, we can choose our corpus to be a simple pair of sentences: "John rests inside. Mary walks outside." To represent these sentences using the bases given by eq. (21), we must determine a mapping between each token in the sentences and the bases. In this instance, we manually define the mapping by taking the following meanings:

• John is an adult, and a smith. The state is then given as |John⟩ = (1/√2)(|adult⟩ + |smith⟩), which is a superposition of the matched entities from the basis set.
• Mary is a child, and a surgeon. Similarly, the state is given as |Mary⟩ = (1/√2)(|child⟩ + |surgeon⟩), following the same procedure as above.

We also require meanings for rests and walks. If we examine synonyms for rests and cross-compare with our chosen vocabulary, we can find sit and sleep. Similarly, for walks we can have stand and move. We can define the states of these words as |rest⟩ = (1/√2)(|sit⟩ + |sleep⟩) and |walk⟩ = (1/√2)(|stand⟩ + |move⟩). Now that we have a means to define the states in terms of our vocabulary, we can begin constructing states to encode the data.

We begin by tokenising the respective sentences into the 3 different categories: subject nouns, verbs, and object nouns. With the sentences tokenised, we next represent them as binary integers, and encode them using the processes of Sec. III. The basis tokens are defined in Table I. We define the mapping of "John rests inside, Mary walks outside" to this basis in Table II.

Dataset  Token    Bin. index
n_s      adult    00
n_s      child    11
n_s      smith    10
n_s      surgeon  01
v        stand    00
v        move     01
v        sit      11
v        sleep    10
n_o      inside   0
n_o      outside  1

TABLE I. Basis data.

Dataset  Token    State
n_s      John     (|00⟩ + |10⟩)/√2
n_s      Mary     (|11⟩ + |01⟩)/√2
v        walk     (|00⟩ + |01⟩)/√2
v        rest     (|11⟩ + |10⟩)/√2
n_o      inside   |0⟩
n_o      outside  |1⟩

TABLE II. Sentence data encoding using the basis from Table I.
If we consider the John and Mary sentences separately for the moment, they are respectively given by the states (1/2)|0⟩ ⊗ (|11⟩ + |10⟩) ⊗ (|00⟩ + |10⟩) for John, and (1/2)|1⟩ ⊗ (|00⟩ + |01⟩) ⊗ (|11⟩ + |01⟩) for Mary. Note that we choose a little-endian encoding scheme, wherein the subject nouns are encoded to the right of the register and object nouns to the left. Tidying these states up yields

John rests inside → |J⟩ = (1/2)(|01100⟩ + |01110⟩ + |01000⟩ + |01010⟩),
Mary walks outside → |M⟩ = (1/2)(|10011⟩ + |10001⟩ + |10111⟩ + |10101⟩),

where the full meaning is given by |m⟩ = (|J⟩ + |M⟩)/√2, which is a superposition of the 8 unique encodings defined by our meaning-space and sentences.

From here we next encode a test state to be stored in register |a⟩ for representation using the encoded meaning-space. We use the pattern denoted by "Adult(s) stand inside", which is encoded as |a⟩ = |00000⟩. Constructing our full state in the format of eq. (14), we get

|ψ⟩ = (1/(2√2)) |00000⟩ ⊗ |0⟩ ⊗ (|01100⟩ + |01110⟩ + |01000⟩ + |01010⟩ + |10011⟩ + |10001⟩ + |10111⟩ + |10101⟩).

By following the steps outlined in Sec. III D, rotating a single qubit from the control register |u⟩ based on the Hamming distance between both registers, and applying the map from eq. (18), the state of register |m⟩ encodes a representation of the test pattern in the amplitude of each unique meaning-space state.

Through repeated preparation and measurement of the |m⟩ register we can observe the patterns closest to the test. Figure 2 shows the observed distribution using two different test patterns: adult, stand, inside (00000, orange) and child, move, inside (00111, green), compared with the encoded meaning-space patterns following eq. (13) (blue).

Given this ability to represent patterns, we can extend this approach to examine the similarity of different patterns using eq. (20). One can create an additional memory register |m'⟩, and perform a series of SWAP tests between both encoded patterns, to determine a measure of similarity. For the above example, the overlap F(00000, 00111) between the two encoded test patterns is obtained in this manner.

FIG. 2. Sentence encoding state distribution taken by multi-shot preparation and measurement of |m⟩ prior to, and post, the encoding of test patterns. Two distinct patterns are used: 00000 → (adult, stand, inside) (orange) and 00111 → (child, move, inside) (green). The distribution, built from repeated sampling, shows how the Hamming distance weighting modifies the distribution relative to the k = 8 unweighted meaning-space states (blue).
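The example above can be reproduced classically. The following sketch composes the eight meaning-space patterns from Table II by concatenating (object noun, verb, subject noun) bit-strings, and weights them against a test pattern following eqs. (17) and (19) as written; the printed values are relative probabilities, not toolkit output.

```python
# Build the 'John rests inside' / 'Mary walks outside' patterns and weight them
# against a test pattern, mimicking the distribution sampled in Fig. 2.
from itertools import product
from math import sin, pi

john = {"subject": ["00", "10"], "verb": ["11", "10"], "object": ["0"]}  # adult/smith, sit/sleep, inside
mary = {"subject": ["11", "01"], "verb": ["00", "01"], "object": ["1"]}  # child/surgeon, stand/move, outside

def sentence_patterns(s):
    # little-endian layout: object noun | verb | subject noun
    return ["".join(bits) for bits in product(s["object"], s["verb"], s["subject"])]

memory = sentence_patterns(john) + sentence_patterns(mary)   # 8 unique patterns

def weight(pattern, test):
    d_h = sum(a != b for a, b in zip(pattern, test))
    return sin(d_h * pi / len(test)) ** 2                    # relative probability

test = "00000"                                               # (adult, stand, inside)
probs = {p: weight(p, test) for p in memory}
total = sum(probs.values())
for p, w in sorted(probs.items()):
    print(p, round(w / total, 3))
```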
B. Automated large-scale encoding

As the previous example was artificially constructed to showcase the method, an automated workflow that determines the basis and mapped tokens, and performs the subsequent experiment, is beneficial. Here we perform the same analysis, but using Lewis Carroll's "Alice in Wonderland" in an end-to-end simulation.

To showcase the basis choice, we will consider the noun basis set. We define a maximum basis set of 8 nouns (N_nouns = 8), taken by their frequency of occurrence. Following the process outlined in Sec. III, we define a graph from these tokens, and use their inter-token distances to determine an ordering following a minimum Hamiltonian cycle calculation. The resulting graph is shown in Fig. 3. From here we map the tokens to an appropriate set of encoding bit-strings for quantum state representation, making use of eq. (2). The resulting set of mappings is:

head → |0000⟩, turtle → |0001⟩, hatter → |0011⟩, king → |0111⟩,
queen → |1111⟩, time → |1110⟩, thing → |1100⟩, alice → |1000⟩.  (22)

FIG. 3. Relative ordering of an 8-element basis set chosen for the noun dataset in "Alice in Wonderland", using the encodings and ordering given by eq. (22). The edge weight between each token shows the Hamming distance between the respective encoding patterns.
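The ordering above comes from a minimum Hamiltonian cycle over the basis-token distance graph. A brute-force sketch of that step is shown below; it is adequate for small basis sets, whereas the toolkit itself relies on networkx [31], and the token distance matrix here is an arbitrary placeholder.

```python
# Order basis tokens by the minimum-weight Hamiltonian cycle of their distance graph
# (step 6 of Sec. III A). Brute force over permutations; fine for ~8 tokens.
from itertools import permutations

def min_hamiltonian_cycle(tokens, dist):
    first, rest = tokens[0], tokens[1:]
    best_cost, best_order = float("inf"), None
    for perm in permutations(rest):
        order = (first,) + perm
        cost = sum(dist[order[i]][order[(i + 1) % len(order)]]
                   for i in range(len(order)))
        if cost < best_cost:
            best_cost, best_order = cost, order
    return list(best_order)

tokens = ["alice", "queen", "king", "hatter"]
dist = {"alice":  {"alice": 0, "queen": 7, "king": 9, "hatter": 3},
        "queen":  {"alice": 7, "queen": 0, "king": 2, "hatter": 8},
        "king":   {"alice": 9, "queen": 2, "king": 0, "hatter": 6},
        "hatter": {"alice": 3, "queen": 8, "king": 6, "hatter": 0}}
print(min_hamiltonian_cycle(tokens, dist))  # ['alice', 'queen', 'king', 'hatter']
```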
We can now map the composite tokens onto the chosen basis encoding using a distance cut-off, W_nouns. Following the inter-word distance calculation approach used to determine the basis order, we calculate the distance between the other corpus tokens and the respective basis set. Taking the set of all nouns in the corpus as s_n, and the noun basis set as b_n ⊂ s_n, for every token t_n in s_n we perform

t_n : s_n ↦ b_n.  (23)

Tokens that fall outside W_nouns are mapped to the empty set, ∅. This approach is then repeated for verbs, and lastly for the inter-dataset distances between noun-verb pairings, W_nv, which are used to discover viable sentences. The mapped composite tokens may then be used to create a compositional sentence structure by tensoring the respective token states.

Following the previous example, we may examine the automatic encoding and representation of the string "Hatter say queen" against the meaning-space patterns. Given that representing the text in its entirety would be a substantial challenge, we limit the amount of information to be encoded by controlling the pre-processing steps as N_nouns = 8, N_verbs = 4, W_nouns = 5, W_verbs = 5 and W_nv = 4. Here N_nouns is again the number of basis nouns in both subject and object datasets, N_verbs the number of basis verbs, W_nouns and W_verbs the cut-off distances for mapping other nouns and verbs in the corpus to the basis tokens, and W_nv is the cut-off distance to relate noun and verb tokens.

For the above parameters, the method finds a subset of 75 unique patterns to represent the corpus. Following Section IV A, one obtains the associated similarity of encoded elements from the resulting likelihood of occurrence, as indicated by Fig. 4, where we have prepared and sampled the |m⟩ register 5 × 10^4 times to build the distribution. Clear step-wise distinctions can be observed between the different categories of Hamming-weighted states, with the full list presented in Appendix C, Table III. Given the basis encoding tokens from eq. (22), the string "Hatter say queen" can be mapped to the value 995 (1111100011 in binary).

FIG. 4. The result of measuring |m⟩ states following the encoding of the string pattern "Hatter say Queen". This example uses a 75 unique pattern basis set, taking 5 × 10^4 samples to build the distribution. The Hamming distances of the labels are indicated on the x-axis, and differentiated by colour, where we can see clear distinction between the patterns in each Hamming category. The data mapping tokens to patterns in each category can be viewed in Appendix C.

As before, we can also compare patterns mediated via the meaning-space. For the pattern "Hatter say Queen", the most similar patterns are "Hatter say King" (0111100011), "Hatter go Queen" (1111110011) and "Turtle say Queen" (1111100001), with overlaps of 0.974, 0.974 and 0.973 respectively. We include a variety of other encoded comparisons in the Appendix as Table IV to showcase the method.
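Given the noun encodings of eq. (22), the mapping of the comparison string to its binary value can be checked directly. In the sketch below, the verb codes are read off the patterns of Table III (Appendix C) rather than stated in the main text, so they should be treated as inferred.

```python
# Compose "hatter say queen" into its meaning-space bit-string and integer value.
nouns = {"head": "0000", "turtle": "0001", "hatter": "0011", "king": "0111",
         "queen": "1111", "time": "1110", "thing": "1100", "alice": "1000"}
verbs = {"would": "00", "think": "01", "go": "11", "say": "10"}  # inferred from Table III

def encode(subject, verb, obj):
    # little-endian layout used throughout: object noun | verb | subject noun
    return nouns[obj] + verbs[verb] + nouns[subject]

bits = encode("hatter", "say", "queen")
print(bits, int(bits, 2))  # 1111100011 995
```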
V. CONCLUSIONS
In this paper we have demonstrated methods for encoding corpus data as quantum states. Taking elements from the categorical distributional compositional semantic formalism, we developed a proof-of-concept workflow for preparing small and large scale data sets to be encoded, processed, and decoded using a given quantum register. We showed the preparation, encoding, comparison, and decoding of small and large datasets using the presented methods.

Recent works have shown the importance of the reduction in classical data to be represented on a quantum system [35]. The approach defined above follows an analogous procedure, representing the important elements of the corpus data using a fundamental subset of the full corpus data. Using this subset, we have shown how to represent meanings, and subsequently the calculation of similarity between different meaning representations. We have additionally released all of this work as part of an Apache licensed open-source toolkit [11].

For completeness, it is worth mentioning the circuit depths required to realise the above procedures. Taking the large scale example, we obtain single- and two-qubit gate call counts of 2413 and 33175 respectively to encode the meaning space. This may be difficult to realise on current NISQ generation quantum systems, where the use of simulators instead allows us to make gains in understanding of applying these methods to real datasets. The potential for circuit optimisation through the use of ZX calculus [36], or circuit compilation through tools such as CQC's t|ket⟩, may offer more realistic circuit depths, especially when considering mapping to physical qubit register topologies [37].

Very recent works on the implementation of the DisCoCat formalism on physical devices, without the need for qRAM, have also emerged [38]. These methods may provide a more generalised approach to investigate quantum machine learning models in NLP and beyond, and have the potential to overcome the limitations discussed earlier with data encoding. We imagine the merging of this generalised approach [39] with the hybrid quantum-classical methods we have devised to allow interesting results and further development of this field. We leave this to future work.

ACKNOWLEDGMENTS
We would like to thank Prof. Bob Coecke and Dr. Ross Duncan for discussions and suggestions during the early stages of this work. The work leading to this publication has received funding from Enterprise Ireland and the European Union's Regional Development Fund. The opinions, findings and conclusions or recommendations expressed in this material are those of the authors and neither Enterprise Ireland nor the European Union are liable for any use that may be made of information contained herein. The authors also acknowledge funding and support from Intel during the duration of this project.

[1] E. Cambria and B. White. Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine, 9(2):48-57, 2014.
[2] Zellig S. Harris. Distributional structure. WORD, 10(2-3):146-162, 1954.
[3] Wlodek Zadrozny. On compositional semantics. In COLING 1992 Volume 1: The 15th International Conference on Computational Linguistics, 1992.
[4] Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. Mathematical foundations for a compositional distributional model of meaning. arXiv:1003.4394 [cs, math], Mar 2010.
[5] William Zeng and Bob Coecke. Quantum algorithms for compositional natural language processing. Proceedings of SLPCS, 221:67-75, 2016.
[6] Bob Coecke. The mathematics of text structure. arXiv e-prints, arXiv:1904.03478, April 2019.
[7] Nathan Wiebe, Ashish Kapoor, and Krysta Svore. Quantum algorithms for nearest-neighbor methods for supervised and unsupervised learning. arXiv e-prints, arXiv:1401.2142, January 2014.
[8] Vittorio Giovannetti, Seth Lloyd, and Lorenzo Maccone. Quantum random access memory. Phys. Rev. Lett., 100:160501, Apr 2008.
[9] Srinivasan Arunachalam, Vlad Gheorghiu, Tomas Jochym-O'Connor, Michele Mosca, and Priyaa Varshinee Srinivasan. On the robustness of bucket brigade quantum RAM. New Journal of Physics, 17(12):123010, Dec 2015.
[10] O. D. Matteo, V. Gheorghiu, and M. Mosca. Fault-tolerant resource estimation of quantum random-access memories. IEEE Transactions on Quantum Engineering, 1:1-13, 2020.
[11] Lee J. O'Riordan, Myles Doyle, Fabio Baruffa, and Venkatesh Kannan. QNLP: ICHEC quantum NLP toolkit, 2020. https://github.com/ICHEC/QNLP.
[12] Gian Giacomo Guerreschi, Justin Hogaboam, Fabio Baruffa, and Nicolas Sawaya. Intel Quantum Simulator: A cloud-ready high-performance simulator of quantum circuits, 2020.
[13] R. Nisbet, J. Elder, and G. Miner. Part II - the algorithms in data mining and text mining, the organization of the three most common data mining tools, and selected specialized areas using data mining. In Robert Nisbet, John Elder, and Gary Miner, editors, Handbook of Statistical Analysis and Data Mining Applications, pages 119-120. Academic Press, Boston, 2009.
[14] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[15] Steven Bird, Edward Loper, and Ewan Klein. Natural Language Processing with Python. O'Reilly Media Inc., 2009.
[16] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
[17] William Blacoe. Semantic composition inspired by quantum measurement. In Harald Atmanspacher, Claudia Bergomi, Thomas Filk, and Kirsty Kitto, editors, Quantum Interaction, pages 41-53, Cham, 2015. Springer International Publishing.
[18] Diederik Aerts, Jan Broekaert, Sandro Sozzo, and Tomas Veloz. Meaning-focused and quantum-inspired information retrieval. 8369, 2014.
[19] Benyou Wang. Dynamic content monitoring and exploration using vector spaces. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'19, pages 1444-1444, New York, NY, USA, 2019. ACM.
[20] Amit Kumar Jaiswal, Guilherme Holdack, Ingo Frommholz, and Haiming Liu. Quantum-like generalization of complex word embedding: A lightweight approach for textual classification. In LWDA, 2018.
[21] Prayag Tiwari and Massimo Melucci. Multi-class classification model inspired by quantum detection theory. 2018.
[22] Benyou Wang, Emanuele Di Buccio, and Massimo Melucci. Representing Words in Vector Space and Beyond, pages 83-113. Springer International Publishing, Cham, 2019.
[23] Nathan Wiebe, Alex Bocharov, Paul Smolensky, Matthias Troyer, and Krysta M. Svore. Quantum language processing. arXiv:1902.05162 [quant-ph], February 2019.
[24] J. Lambek. Pregroup grammars and Chomsky's earliest examples. Journal of Logic, Language and Information, 17(2):141-160, 2008.
[25] Mikhail Smelyanskiy, Nicolas P. D. Sawaya, and Alán Aspuru-Guzik. qHiPSTER: The quantum high performance software testing environment. arXiv:1601.07195.
[26] Maria Schuld. Quantum machine learning for supervised pattern recognition. PhD thesis, 2017.
[27] Kosuke Mitarai, Masahiro Kitagawa, and Keisuke Fujii. Quantum analog-digital conversion. Phys. Rev. A, 99:012301, Jan 2019.
[28] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171-180, Ann Arbor, Michigan, June 2014. Association for Computational Linguistics.
[29] Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455-465, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
[30] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111-3119, Red Hook, NY, USA, 2013. Curran Associates Inc.
[31] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using NetworkX. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science Conference (SciPy2008), August 2008.
[32] C. A. Trugenberger. Probabilistic quantum memories. Phys. Rev. Lett., 87:067901, Jul 2001.
[33] Carlo A. Trugenberger. Quantum pattern recognition. Quantum Information Processing, 1(6):471-493, Dec 2002.
[34] Adriano Barenco, Charles H. Bennett, Richard Cleve, David P. DiVincenzo, Norman Margolus, Peter Shor, Tycho Sleator, John A. Smolin, and Harald Weinfurter. Elementary gates for quantum computation. Phys. Rev. A, 52:3457-3467, Nov 1995.
[35] Aram W. Harrow. Small quantum computers and large classical data sets. arXiv e-prints, arXiv:2004.00026, March 2020.
[36] Bob Coecke and Ross Duncan. Interacting quantum observables: categorical algebra and diagrammatics. New Journal of Physics, 13(4):043016, Apr 2011.
[37] Alexander Cowtan, Silas Dilkes, Ross Duncan, Alexandre Krajenbrink, Will Simmons, and Seyon Sivarajah. On the qubit routing problem. In Wim van Dam and Laura Mancinska, editors, volume 135 of Leibniz International Proceedings in Informatics (LIPIcs), pages 5:1-5:32, Dagstuhl, Germany, 2019. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[38] B. Coecke, G. de Felice, K. Meichanetzidis, and A. Toumi. Quantum natural language processing. https://medium.com/cambridge-quantum-computing/quantum-natural-language-processing-748d6f27b31d, April 2020. [Online; posted 07-April-2020].
[39] A. Toumi and G. de Felice. DisCoPy - natural language processing with string diagrams. https://github.com/oxford-quantum-group/discopy, 2020.
[40] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. pybind11 - seamless operability between C++11 and Python, 2017. https://github.com/pybind/pybind11.
[41] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009.
[42] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90-95, 2007.
[43] Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22-30, 2011.
[44] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261-272, 2020.
[45] Wes McKinney. pandas: a foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing, 14, 2011.

Appendix A: Corpus Preparation
Our QNLP software solution [11] can target most corpora provided that adequate pre-processing is conducted prior to the main routines of the application, and follows the approach outlined in Sec. III.

This approach has several variables that can be adjusted to control the operation of the pre-processing stage. The limiting numbers of top nouns, N_nouns, and verbs, N_verbs, are defined with the run-time parameters NUM_BASIS_NOUN and NUM_BASIS_VERB, and are defined as environment variables. The numbers of neighbouring nouns, W_nouns, and verbs, W_verbs, to consider when mapping the corpus tokens to basis tokens are controlled by the run-time parameters BASIS_NOUN_DIST_CUTOFF and BASIS_VERB_DIST_CUTOFF respectively, again defined as environment variables.

Finally, for forming noun-verb-noun sentence structures, the number of neighbouring nouns to consider for determining the basis verbs, W_nv, is controlled through the environment variable VERB_NOUN_DIST_CUTOFF. Additionally, a sentence is only valid if the inter-noun distance of a noun-verb-noun structure is within 2W_nv.

To choose appropriate values for these parameters, one must consider the overall complexity of the corpus, the number of noun-verb-noun sentences, the available qubit resources, and the intended detail in representing the overall meaning. For the simplified example in Sec. IV A, we have a somewhat sparsely encoded set of patterns in the meaning space (8 patterns out of a possible 32), with a small number of qubits to represent the processing and assist with the encoding. A more complex text, with a larger basis set, will require substantially more resources. For example, choosing NUM_BASIS_NOUN=10 and NUM_BASIS_VERB=10 with the discussed simplified cyclic encoding from eq. (2) will require at least 32 qubits, as the sketch below illustrates. However, depending on the amount of information the pre-processing stage can extract, this may be an overestimate or underestimate of the required resources.
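A rough sketch of this resource estimate, assuming one qubit per two cyclic patterns (eq. (2)) and the 2n + 2 qubit total of Sec. III C; the per-register breakdown is our own bookkeeping rather than a toolkit function.

```python
# Estimate qubits needed for a noun-verb-noun meaning space of given basis sizes.
from math import ceil

def required_qubits(num_basis_noun, num_basis_verb):
    noun_q = ceil(num_basis_noun / 2)   # n qubits encode 2n cyclic patterns
    verb_q = ceil(num_basis_verb / 2)
    meaning_q = 2 * noun_q + verb_q     # subject noun + verb + object noun registers
    return 2 * meaning_q + 2            # memory + auxiliary + 2-qubit control

print(required_qubits(8, 4))    # 22 qubits for the 'Alice in Wonderland' example
print(required_qubits(10, 10))  # 32 qubits, as quoted above
```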
Appendix B: Software dependencies
All results in this manuscript were generated using our QNLP toolkit, which is available at [11]. Jupyter notebooks, packages and scripts exist for all operations described. We made use of the Intel Quantum Simulator [12] to perform all quantum gate-level simulations, running on Kay, the Irish national supercomputer. To integrate our C++ work with Python we have made use of the pybind11 suite [40]. All results were obtained through compilation with Intel Parallel Studio XE 2019 Update 5 for distributed workloads (Sec. IV B), and GCC 9.2 for shared-memory (OpenMP) workloads (Sec. IV A).

To analyse and prepare the corpus data for encoding into the quantum state-space, we have used the well-defined classical routines for corpus tokenisation and tagging from the NLTK [41] and spaCy [16] software suites. For plotting we explicitly used pgfplots/TikZ for Fig. 1, and Matplotlib for all others [42]. We additionally used the SciPy ecosystem and pandas during results analysis and during the preprocessing stages [43-45].
Appendix C: Encoded meaning-space data
Table III is used to generate Fig. 4. It encodes data from 'Alice in Wonderland' using the preprocessing control parameters:

• Number of basis elements for state encoding: NUM_BASIS_NOUN=8, NUM_BASIS_VERB=4
• Inter-token composite representation distance: BASIS_NOUN_DIST_CUTOFF=5, BASIS_VERB_DIST_CUTOFF=5
• Verb-noun distance cut-off for association: VERB_NOUN_DIST_CUTOFF=4
Label  Bin. pattern  d_H  Count
king,go,queen  1111110111  2  1260
king,say,time  1110100111  2  1217
time,say,queen  1111101110  3  1165
king,go,time  1110110111  3  1123
queen,go,queen  1111111111  3  1102
hatter,say,alice  1000100011  3  1094
king,would,time  1110000111  3  1088
king,say,hatter  0011100111  3  1087
king,go,king  0111110111  3  1080
head,go,queen  1111110000  3  1075
queen,say,time  1110101111  3  1069
alice,say,king  0111101000  4  940
queen,go,king  0111111111  4  925
alice,go,queen  1111111000  4  924
head,go,king  0111110000  4  922
king,go,hatter  0011110111  4  919
time,go,queen  1111111110  4  913
king,would,hatter  0011000111  4  908
queen,would,time  1110001111  4  908
alice,say,time  1110101000  4  899
time,would,queen  1111001110  4  899
time,say,king  0111101110  4  894
king,say,alice  1000100111  4  894
time,say,time  1110101110  4  893
queen,say,hatter  0011101111  4  889
hatter,would,alice  1000000011  4  885
queen,go,time  1110111111  4  872
king,think,time  1110010111  4  866
hatter,go,alice  1000110011  4  836
thing,say,time  1110101100  5  730
king,think,hatter  0011010111  5  729
queen,say,alice  1000101111  5  712
queen,think,time  1110011111  5  704
time,go,time  1110111110  5  702
time,would,king  0111001110  5  700
alice,go,king  0111111000  5  698
time,would,time  1110001110  5  696
alice,go,time  1110111000  5  689
alice,say,hatter  0011101000  5  681
hatter,think,alice  1000010011  5  681
queen,go,hatter  0011111111  5  680
time,go,king  0111111110  5  662
queen,would,hatter  0011001111  5  658
king,would,alice  1000000111  5  657
thing,go,queen  1111111100  5  657
king,go,alice  1000110111  5  650
alice,would,time  1110001000  5  642
queen,think,hatter  0011011111  6  504
alice,think,time  1110011000  6  490
head,go,alice  1000110000  6  485
thing,would,time  1110001100  6  479
alice,go,hatter  0011111000  6  472
queen,would,alice  1000001111  6  470
alice,say,alice  1000101000  6  456
time,say,alice  1000101110  6  455
thing,go,time  1110111100  6  455
king,think,alice  1000010111  6  449
queen,go,alice  1000111111  6  449
alice,would,hatter  0011001000  6  448
thing,go,king  0111111100  6  436
time,would,alice  1000001110  7  303
alice,go,alice  1000111000  7  297
alice,say,head  0000101000  7  283
time,go,alice  1000111110  7  269
alice,would,alice  1000001000  7  267
thing,say,alice  1000101100  7  266
queen,think,alice  1000011111  7  259
time,say,head  0000101110  7  254
alice,think,hatter  0011011000  7  239
thing,go,alice  1000111100  8  139
thing,would,alice  1000001100  8  125
alice,think,alice  1000011000  8  123
time,would,head  0000001110  8  117
time,go,head  0000111110  8  113
alice,think,head  0000011000  9  24
TABLE III. Results for Fig. 4, taking 5 × 10^4 samples, encoding AIW using the parameters NUM_BASIS_NOUN: 8, NUM_BASIS_VERB: 4, BASIS_NOUN_DIST_CUTOFF: 5, BASIS_VERB_DIST_CUTOFF: 5, VERB_NOUN_DIST_CUTOFF: 4, and comparing with the test pattern ('hatter,say,queen') with binary string 1111100011.

Appendix D: Overlap comparison data
Table IV presents comparison data for the basis-token composed sentence "Hatter say Queen" and a variety of other allowed sentence structures. Data is again encoded from 'Alice in Wonderland' using the preprocessing control parameters:

• Number of basis elements for state encoding: NUM_BASIS_NOUN=8, NUM_BASIS_VERB=4
• Inter-token composite representation distance: BASIS_NOUN_DIST_CUTOFF=5, BASIS_VERB_DIST_CUTOFF=5
• Verb-noun distance cut-off for association: VERB_NOUN_DIST_CUTOFF=4