Random vector generation of a semantic space
Jean-François Delpech, Sabine Ploux
Institut des Sciences Cognitives UMR5304, CNRS - Université de Lyon
67, boulevard Pinel, 69675 BRON cedex, France
[email protected], [email protected]
We show how random vectors and random projection can be implemented in the usual vector space model to construct a Euclidean semantic space from a French synonym dictionary. We evaluate theoretically the resulting noise and show the experimental distribution of the similarities of terms in a neighborhood according to the choice of parameters. We also show that the Schmidt orthogonalization process is applicable and can be used to separate homonyms with distinct semantic meanings. Neighboring terms are easily arranged into semantically significant clusters which are well suited to the generation of realistic lists of synonyms and to such applications as word selection for automatic text generation. This process, applicable to any language, can easily be extended to collocations, is extremely fast and can be updated in real time whenever new synonyms are proposed.
In their seminal work, Ploux and Victorri have used synonymy relations deduced from French electronic dictionaries to create semantic spaces around French words and their neighbors. Their definition of "synonymy" is fairly broad and includes hyponymy (moineau and oiseau), hyperonymy (arme and pistolet) or even non-synonymous but related terms (autocar and automobile); however, in their work, true synonyms (i.e. terms which are more or less interchangeable) form cliques of the graph of synonyms, i.e. maximally complete subgraphs. While this is very interesting from a theoretical standpoint, as it then becomes straightforward to evaluate an interclique distance (or degree of separation) between any two terms in the graph (as long as neither belongs to an island, such as lapereau and lapinot), it is not very useful in practice. For example, an author in search of the right term may well not be interested in strict synonyms; terms with related or even opposed meanings can often be preferable in rhetorical figures. Also, in many applications such as automatic text generation, a well-defined and mathematically well-behaved semantic distance between terms is often a prerequisite.

In this report, we show how a Euclidean semantic distance can quickly and easily be constructed from Ploux and Victorri's database (which contains 54,685 terms and 116,694 cliques).

Since the pioneering work of Salton, it is well understood that any combination of terms, such as a clique, can be seen as a vector in a space where each dimension represents a distinct term (or lemma):

$C_j = (t_{1,j}, t_{2,j}, \ldots, t_{t,j})$    (1)

This representation is extremely fruitful and forms the basis of numerous information retrieval systems; it suffers however from a severe limitation in that each term is orthogonal to every other. Of course, the dual equation of Equation 1,

$T_k = (c_{1,k}, c_{2,k}, \ldots, c_{s,k})$    (2)

may be used to compute term distances (or similarities), but the very high dimensionality of the subtending space makes such distances difficult to compute and to interpret: this is the "curse of dimensionality".

If $D_i$, with cardinality $d_i$, is the set of distinct terms occurring in all the cliques containing term $t_i$, we define the overlap similarity $d_{i,k}$ between two terms as the cardinality of the intersection $D_i \cap D_k$ (each word being counted only once). Obviously, $d_{i,k} = 0$ for most pairs $(i, k)$, since for any $i$ the total number of distinct terms in the database is much larger than $d_i$, whose average value is 8.5 in Ploux and Victorri's database.
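As a concrete illustration, here is a minimal Python sketch of the overlap similarity defined above; the three-clique toy database is of course hypothetical, not Ploux and Victorri's.

```python
# Minimal sketch (hypothetical data) of the overlap similarity d_ik:
# D[t] is the set of distinct terms of all cliques containing t,
# and d_ik is the cardinality of the intersection of D[t_i] and D[t_k].
from collections import defaultdict

cliques = [["maison", "demeure", "logis"], ["maison", "foyer"],
           ["foyer", "bercail"]]

D = defaultdict(set)                 # D[t] = terms co-cliqued with t
for clique in cliques:
    for t in clique:
        D[t].update(clique)

def overlap(t_i, t_k):
    return len(D[t_i] & D[t_k])      # each word counted only once

print(overlap("maison", "bercail"))  # 1: both sets contain 'foyer'
```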
"Contexonyms" are words which co-occur in a given context (for example in the same sentence of a corpus); while they are not synonyms, they are obviously closely related. Ji, Ploux and Wehrli have proposed an automatic contexonym organizing model (ACOM) which relies on counting co-occurrences and evaluating their probabilities to automatically produce and organize contexonyms for a target word. The test results, after training on an English corpus maintained by Project Gutenberg, show that the model is able to classify contexonyms as well as to reflect words' minute usage and nuance.

Dimensionality reduction can be achieved by a low-rank approximation of the term-document matrix. This can be done by Latent Semantic Indexing, which reduces dimensionality through a singular value decomposition (SVD) of the term-document matrix, retaining only a comparatively small number of the largest singular values. This method has been very successfully used for document indexing and retrieval. It suffers nevertheless from limitations:
- SVD is computationally intensive, even though the large term-document matrix is very sparse;
- there is no really satisfactory way to increment the results as new terms/documents become available.
More importantly, it is not well suited to generating a semantic space from cliques. The resulting, lower-dimensional space is the best approximation, in the least-squares sense, of the position of any term belonging to the whole set of cliques: the distance between any pair of terms will be optimal, while what is really of interest from the present perspective is the accurate determination of distances between semantic neighbors.

As a test, an SVD decomposition of the clique-term matrix (of which eq. 2 is a row) was performed. It reduced the matrix size from 54,685 x 116,694 to 54,685 x 250, meaning that each term was associated with a vector having 250 orthogonal components. The decomposition, which took 134 sec. on a desktop computer, was clearly unsatisfactory as the singular values decayed very slowly, from 63.31 for the first coordinate to 37.4 for the 250th one. According to this computation, the first few neighbors of rapsode would be rimailleur, rimeur, versificateur, métromane, fils d'Apollon, favori des Muses, favori du Parnasse, nourrisson des Muses, héros du Pinde, mâche-laurier, all with similarities extremely close to 1.0. While clearly in the right neighborhood, this seems to be of limited usefulness; note however that in practice a restriction to the first two or three largest singular values may often yield useful information.

Word order is not considered in this publication, but it should be mentioned for completeness that neural networks are often used in natural language processing to encode word sequences. In a recent publication, Mikolov et al. have introduced two novel model architectures for computing continuous vector representations of words from very large data sets. They report large improvements in accuracy at a computational cost which is still substantial, but that they claim is much lower than previous architectures. An interesting consideration is that, according to Mikolov et al., the learned vectors explicitly encode many linguistic regularities and patterns.

2.6. Random vectors and random projection

It has been pointed out that intuitions valid in a low-dimensionality space may be totally misleading in a high-dimensionality space. For example, a set of points picked at random from the unit ball

$\{\, x \in V : \|x\| < 1 \,\}$    (3)

will have some significant fraction near the origin, say within distance 1/2 if $d = 3$, but this fraction becomes rapidly vanishingly small as the dimension $d$ becomes large; for example for $d = 250$, $2^{-d} \approx 5.5 \times 10^{-76}$.

Another useful remark is that while obviously one cannot create more than $d$ orthogonal vectors in a space of dimension $d$, one can create an exponentially large number $\exp(O(d))$ of vectors quasi-orthogonal to each other; in other words, a set of vectors picked at random will with high probability be quasi-orthogonal, i.e. have angles of $90° \pm \epsilon$ with each other. The seed vectors referred to below will be selected from such a set $S_d$.

While an orthogonal projection will in general reduce the average distance between points, it is also known, as shown by Johnson and Lindenstrauss in an often cited paper, that distances may be almost perfectly preserved for any $n$ points in an arbitrary number of dimensions when projected to a random subspace of $O(\log n)$ dimensions.
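The concentration of angles around 90° is easy to verify numerically. The following short sketch is illustrative only (it is not part of the original experiments): it samples random unit vectors at several dimensions and prints the spread of pairwise angles.

```python
# Minimal illustrative sketch: the angle between two random unit vectors
# concentrates around 90 degrees as the dimension d grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 250, 2500):
    u = rng.standard_normal((1000, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)           # random unit vectors
    cos = np.clip((u[:500] * u[500:]).sum(axis=1), -1, 1)   # 500 random pairs
    angles = np.degrees(np.arccos(cos))
    print(f"d={d}: {angles.mean():.1f} +/- {angles.std():.1f} degrees")
```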
2.6.2. Building random vectors

The comparatively recent method of random projection is based on these three preceding remarks and proceeds as follows:
1. Uniquely associate with each term $t_i$ a random seed vector $s_i \in S_d$ having $d$ independent coordinates;
2. Associate with each clique $c_k$ the vector $C_k = \sum_{i \in c_k} \rho_{ki}\, s_i$, where $i \in c_k$ ranges over the terms found in clique $c_k$ and $\rho_{ki}$ is a function of the number of occurrences in $c_k$ of term $t_i$ and of the weights associated with $t_i$;
3. Finally, associate with each term $t_i$ the (suitably weighted) sum $T_i = \sum_{k \ni t_i} C_k$ of the vectors of the cliques in which $t_i$ appears, where $k \ni t_i$ ranges over the cliques containing $t_i$.

In what follows, we shall assume without loss of generality that the term vectors $T_i$ are normalized to unity. Obviously, each term vector $T_i$ is now embedded in a $d$-dimensional Euclidean semantic space and the similarity $\sigma_{ij}$ between terms $t_i$ and $t_j$ is the scalar product of the associated term vectors:

$\sigma_{ij} = \langle T_i \,|\, T_j \rangle$    (4)

It is easy to see that $\sigma_{ij}$ ranges from -1 to 1. It is sometimes more convenient to consider the distance $D_{ij}$, which is related to the similarity by $D_{ij} = \sqrt{2(1 - \sigma_{ij})}$ and ranges from 0 (same term, $\sigma_{ij} = 1$) to 2 (exactly opposite terms, $\sigma_{ij} = -1$; note however that owing to the extreme sparsity of a high-dimensional space, the neighborhood exactly opposite a term is in practice always empty).
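The three steps above translate almost directly into code. The sketch below is a minimal, assumed implementation under simplifications of our own: uniform weights ($\rho_{ki} = 1$), a two-clique toy input, and names such as term_vecs that are hypothetical, not the authors'.

```python
# Minimal sketch of steps 1-3: sparse ternary seed vectors per term,
# clique vectors as sums of seeds, term vectors as sums of clique vectors.
import numpy as np
from collections import defaultdict

d, m = 250, 4
rng = np.random.default_rng(0)
cliques = [["maison", "demeure", "logis"], ["maison", "foyer", "logis"]]

def seed(term, _cache={}):
    if term not in _cache:                        # step 1: one seed per term
        v = np.zeros(d)
        idx = rng.choice(d, size=2 * m, replace=False)
        v[idx[:m]] = 1 / np.sqrt(2 * m)           # m positive coordinates
        v[idx[m:]] = -1 / np.sqrt(2 * m)          # m negative coordinates
        _cache[term] = v
    return _cache[term]

clique_vecs = [sum(seed(t) for t in c) for c in cliques]   # step 2, rho = 1

term_vecs = defaultdict(lambda: np.zeros(d))               # step 3
for c, cv in zip(cliques, clique_vecs):
    for t in c:
        term_vecs[t] += cv
for t in term_vecs:                                        # normalize to unity
    term_vecs[t] /= np.linalg.norm(term_vecs[t])

sigma = term_vecs["maison"] @ term_vecs["logis"]           # equation 4
print(f"similarity(maison, logis) = {sigma:.3f}")
```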
2.6.3. Locality property

Building a term vector $T_i$ by the process described above involves only the terms pertaining to the set $D_i$ defined in section 2.2. It is thus a purely local process: updating the semantic space requires only a few tens or hundreds of operations, orders of magnitude less than its initial generation (provided small changes to the weights are neglected, which is usually acceptable as they are logarithmic in term frequency and inverse document frequency).

This does not imply that the similarity of term $t_i$ with term $t_k$ is zero when $t_k \notin D_i$ (note that $t_k \notin D_i \implies t_i \notin D_k$), since $D_i$ and $D_k$ may well have neighbors in common. For example carotte and fraude have a degree of separation of 2 but a similarity of 0.364. Moreover, the seed vectors are not quite orthogonal and the scalar product $\langle s_i \,|\, s_j \rangle$ will usually be small, but not zero. Thus, even for uncorrelated term vectors $T_i$ and $T_j$, their similarity $\sigma_{ij}$ will usually be small but non-zero. This induces an unavoidable noise which is studied below in some detail.

A normalized seed vector embedded in a $d$-dimensional space has $d - 2m$ coordinates equal to 0, $m$ equal to $+1/\sqrt{2m}$ and $m$ equal to $-1/\sqrt{2m}$. Having the same number of positive and negative coordinates ensures that the scalar product of two seed vectors is 0 on average. As seed vectors need to be very close to orthogonal with each other, the number $2m$ of non-zero coefficients must be substantially smaller than the dimension $d$.

The number of available, distinct seed vectors is the product of the number of combinations of $2m$ non-zero coordinates amongst $d$ coordinates, times the number of ways of distributing $m$ positive and $m$ negative coordinates amongst these $2m$ non-zero coordinates:

$N_{seed}(d, m) = \binom{d}{2m} \times \binom{2m}{m}$    (5)

In practice, $N_{seed}$ should be much larger than the number of distinct terms to guarantee a negligible collision probability (i.e. two distinct terms having the same seed vector). This condition is already amply met with $m \geq 4$, as $N_{seed}(250, 4) \approx 2.4 \times 10^{16}$ when a dimension $d = 250$ is selected.

Given $d$ and $m$, and noting for simplicity $p = 2m/d$, the probability of an overlap of $v$ non-zero coordinates between two randomly selected seed vectors is

$P_{overlap}(v, d) = \binom{2m}{v} \times p^v \times (1 - p)^{2m - v}$    (6)

When two randomly selected, normalized seed vectors have an overlap of $v$ non-zero coordinates, their scalar products will be arranged symmetrically around zero and vary in discrete steps of $1/2m$. An overlap of 1 will generate the two scalars $+1/2m$ and $-1/2m$ with probabilities 1/2; an overlap of 2 will generate $-1/m$ with probability 1/4, 0 with probability 1/2, and $+1/m$ with probability 1/4; more generally, an overlap $v$ will generate the scalar $s/2m$ with the probability

$P_{scalar}(v, s) = \binom{v}{q} \times 2^{-v}, \quad q = (v + s)/2$    (7)

where $q$ is restricted to integer values $0 \leq q \leq v$; these probabilities sum to one by virtue of the identity $\sum_{q=0}^{v} \binom{v}{q} = 2^v$.

It can be seen from equations 6 and 7 that the noise decreases more or less as $1/\sqrt{d}$. Theoretical and experimental scalar products of two seed vectors, as computed from equations 6 and 7, are plotted in figure 1 (next page), where:
- the light vertical lines are increments of 0.01, the heavier vertical lines are at 0, -0.1 and +0.1;
- the two horizontal lines are at 1.0 and 0.136;
- the red dots are computed by taking the scalar products of 1,000,000 'term vectors', each synthesized by the addition of 5 random seed vectors. If instead we do the statistics directly on seed vectors, the result is unchanged except that the dots now occur only at multiples of 0.01 and the total is accordingly 5 times larger;
- the black dots are statistics over 40,000 points, starting with the 10,000th, taken from the tail of the neighbors of an arbitrary term (here rapsode);
- the purple dots are a Gaussian with a standard deviation of $\sigma$.

Figure 1 - Slice Noise
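The theoretical distribution of equations 6 and 7 can be checked against a direct simulation. The sketch below is ours (with $d = 250$ and $m = 4$ for speed, not the paper's parameters); it compares the Monte-Carlo frequency of a strictly positive scalar product between two random seed vectors with the value predicted by the two equations.

```python
# Minimal sketch: theoretical seed-vector noise (equations 6 and 7)
# versus a Monte-Carlo simulation. Assumes d = 250, m = 4 for speed.
import numpy as np
from math import comb

d, m = 250, 4
p = 2 * m / d

def p_overlap(v):
    # Equation 6: probability that two seed vectors share v non-zero slots
    return comb(2 * m, v) * p**v * (1 - p) ** (2 * m - v)

def p_scalar(v, s):
    # Equation 7: probability of scalar product s/(2m) given an overlap v
    q, r = divmod(v + s, 2)
    return comb(v, q) / 2**v if r == 0 and 0 <= q <= v else 0.0

def seed_vector(rng):
    vec = np.zeros(d)
    idx = rng.choice(d, size=2 * m, replace=False)
    vec[idx[:m]] = 1 / np.sqrt(2 * m)    # m positive coordinates
    vec[idx[m:]] = -1 / np.sqrt(2 * m)   # m negative coordinates
    return vec

rng = np.random.default_rng(0)
dots = [seed_vector(rng) @ seed_vector(rng) for _ in range(100_000)]
theory = sum(p_overlap(v) * p_scalar(v, s)
             for v in range(1, 2 * m + 1) for s in range(1, v + 1))
print(np.mean(np.array(dots) > 1e-12), theory)  # both approx. P(scalar > 0)
```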
Obviously, the size of the database of the term vectors is itself linearly dependent on the dimension; for $d = 2500$, each vector occupies 10 kB if a single coordinate is represented by a 4-byte floating-point number. A database of 1,000,000 distinct terms would thus occupy 10 GB with this elementary data structure; however, for many applications, $d = 250$ will be sufficient and/or more sophisticated data structures may be implemented. Computation times will also increase more or less linearly with $d$, because they mostly involve stepping through all the dimensions.
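As one possible 'more sophisticated data structure', the sketch below (an assumption of ours, not the paper's implementation) keeps the term vectors in a memory-mapped float32 file so that the worst-case 10 GB never has to reside in RAM; the file name is hypothetical.

```python
# Minimal sketch: memory-mapped float32 storage for term vectors.
import numpy as np

d, n_terms = 2500, 54_685
per_vector = d * 4                        # 4-byte float -> 10 kB per vector
print(f"{per_vector} bytes/vector, {n_terms * per_vector / 1e9:.2f} GB total")

store = np.memmap("term_vectors.f32", dtype=np.float32, mode="w+",
                  shape=(n_terms, d))
store[42] = np.random.default_rng(0).standard_normal(d)   # write one vector
store.flush()                                             # persist to disk
```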
The following four figures have been constructed by compiling eight independent databases of 54,685 term vectors for each of the four indicated $(m, d)$ couples, namely (5, 250), (10, 2500), (40, 250) and (50, 2500); tf-idf statistical weights were used. Typically, on a small desktop computer, the compilation time is 3 to 4 seconds for $d = 250$ and 20 to 30 seconds for $d = 2500$.

Abscissas are proportional to the logarithm of the neighbor's rank (from 1 for maison in the upper right corner to 1000 in the lower left corner) and ordinates are the similarities to maison. For a given neighbor, the horizontally aligned red dots represent the eight scalars computed from the eight databases and the thicker black dot is their average value $\sigma_{avg}$. The neighbors are arranged in non-increasing order of their $\sigma_{avg}$ with maison.

Even though the diameter $d_{maison}$ of maison as defined in section 2.2 is only 98, there are several hundred significant neighbors: while most close neighbors belong to the set $D_{maison}$, many lie at more than one degree of separation from each other (their cliques are separated by more than one vertex). Also, it can be seen, as expected, that the noise is inversely proportional to $\sqrt{d}$ but not very dependent on $m$ (in fact, the only noticeable effect of a lower $m$ is that there are more outliers), and that the resulting standard deviation remains small for $d = 2500$.
Figure 2 - Neighbors of maison (scalar vs. rank)

Figure 3 - Neighbors of maison (scalar vs. rank)
Figure 4 - Neighbors of maison (scalar vs. rank)
Figure 5 - Neighbors of maison (scalar vs. rank)
In what follows, unless otherwise noted, we'll use $m = 50$ and $d = 2500$. The size of the file containing the 54,685 term vectors is then 547,724,964 bytes, including some overhead. The number of available seed vectors $N_{seed}(2500, 50)$ given by equation 5 being astronomically large (far in excess of $10^{100}$), the risk of collision is totally negligible.

The 100 first neighbors of maison are listed by decreasing similarity in table 1. It is clear that the proximity decreases as $\sigma$ decreases, but those neighboring words are all reasonably close to maison in its various meanings.
Table 1 ‐ First 100 neighbors of maison
From 1 to 20 | From 21 to 40 | From 41 to 60 | From 61 to 80 | From 81 to 100
1 1.000 maison | 21 0.499 chez-soi | 41 0.309 ménage | 61 0.207 ermitage | 81 0.163 domestique
2 0.843 demeure | 22 0.498 cassine | 42 0.307 bâtiment | 62 0.204 appartement | 82 0.162 plaque_de_blindage
3 0.820 habitation | 23 0.491 cabane | 43 0.306 case | 63 0.198 cagna | 83 0.159 mas
4 0.810 logis | 24 0.486 gourbi | 44 0.301 taudis | 64 0.191 reposée | 84 0.155 tanière
5 0.767 domicile | 25 0.480 gîte | 45 0.300 chalet | 65 0.190 chartreuse | 85 0.154 cache
6 0.762 pénates | 26 0.477 bercail | 46 0.292 pavillon | 66 0.188 domesticité | 86 0.152 niche
7 0.677 home | 27 0.460 masure | 47 0.292 villa | 67 0.187 garde-meubles | 87 0.151 rendez-vous_de_chasse
8 0.669 mesnil | 28 0.436 asile | 48 0.287 chaumière | 68 0.186 gabionnade | 88 0.151 repaire
9 0.661 chacunière | 29 0.428 château | 49 0.278 manse | 69 0.182 lignée | 89 0.149 clinique
10 0.659 foyer | 30 0.404 immeuble | 50 0.275 hôtel_particulier | 70 0.180 caponnière | 90 0.148 gloriette
11 0.653 train_de_maison | 31 0.403 hutte | 51 0.265 isba | 71 0.172 fermette | 91 0.147 havre
12 0.603 logement | 32 0.385 maisonnette | 52 0.254 bas-lieu | 72 0.171 grand_ensemble | 92 0.147 lapinière
13 0.596 maisonnée | 33 0.382 ménil | 53 0.249 manoir | 73 0.171 lieu | 93 0.146 kiosque
14 0.593 nid | 34 0.374 abri | 54 0.244 hôtel | 74 0.171 famille | 94 0.146 intérieur
15 0.576 résidence | 35 0.369 galetas | 55 0.241 standing | 75 0.167 garde-meuble | 95 0.144 bouverie
16 0.572 toit | 36 0.368 lares | 56 0.240 H.L.M. | 76 0.167 deck-house | 96 0.144 parents
17 0.545 bicoque | 37 0.350 lare | 57 0.233 habitacle | 77 0.166 habitat | 97 0.143 tourelle
18 0.537 bâtisse | 38 0.338 clapier | 58 0.223 retraite | 78 0.166 édifice | 98 0.143 mantelet
19 0.532 cahute | 39 0.331 palais | 59 0.211 carbet | 79 0.164 firme | 99 0.142 hangar
20 0.502 baraque | 40 0.314 train_de_vie | 60 0.208 séjour | 80 0.163 tranchée-abri | 100 0.141 pigeonnier

Similarity matrices and clusterization

It is also straightforward to build a similarity matrix (see table 2) and to use such matrices to group terms by clusters, i.e. lists of terms which do not all belong to the same clique, but which are closely related semantically. We use nearest-neighbor clustering in this work.
Table 2 - Similarity matrix (lower triangular; rows and columns in the same order)

chacunière        1.000
mesnil            0.665 1.000
train_de_maison   0.657 0.659 1.000
maisonnée         0.546 0.554 0.560 1.000
demeure           0.456 0.470 0.449 0.394 1.000
habitation        0.414 0.433 0.403 0.344 0.852 1.000
maison            0.661 0.669 0.653 0.596 0.843 0.820 1.000
pénates           0.481 0.496 0.481 0.417 0.837 0.605 0.762 1.000
logement          0.257 0.278 0.250 0.223 0.796 0.724 0.603 0.616 1.000
domicile          0.457 0.464 0.462 0.394 0.780 0.670 0.767 0.620 0.588 1.000
logis             0.507 0.514 0.506 0.447 0.795 0.666 0.810 0.706 0.623 0.892 1.000
résidence         0.305 0.310 0.308 0.262 0.727 0.609 0.576 0.511 0.554 0.840 0.627 1.000
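Such a matrix, and a simple single-link (nearest-neighbor) grouping, can be sketched as follows. This assumes the hypothetical term_vecs mapping from the earlier sketch, and the 0.6 threshold is an arbitrary illustrative choice, not a value from the paper.

```python
# Minimal sketch: similarity matrix as in table 2, then single-link
# (nearest-neighbor) agglomeration of sufficiently similar terms.
import numpy as np

def similarity_matrix(terms, term_vecs):
    M = np.array([term_vecs[t] for t in terms])
    return M @ M.T                        # all pairwise scalar products

def nn_clusters(terms, S, threshold=0.6):
    clusters = [{t} for t in terms]       # start from singletons
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: merge if any cross pair is similar enough
                if any(S[terms.index(x)][terms.index(y)] > threshold
                       for x in clusters[a] for y in clusters[b]):
                    clusters[a] |= clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters
```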
In table 3, the headers are the members of the original cliques including maison, grouped in semantically homogeneous clusters, and the associated lists are terms similar, with $\sigma > 0.25$, to the center of mass of their header. Terms in blue are from the original cliques, terms in gray are repeats from a previous cluster, and the others could reasonably be aggregated to their head cluster, especially at similarities above 0.35.
Table 3 - Clusters around maison and their cohorts

chacunière, mesnil, train_de_maison, maisonnée, demeure, habitation, pénates, maison, logement, domicile, logis, résidence
abri, clapier, gîte, nid, asile, retraite, bercail, toit, foyer, habitacle
baraque, bicoque, cahute, cabane, hutte, gourbi, masure, case, cassine, chaumière, maisonnette
bas-lieu, naissance, origine, descendance, famille, lignée, race, parents, chez-soi, home, intérieur, ménil, lare, lares, ménage, standing, train_de_vie
appartement, bouge, taudis, galetas, chalet, pavillon, villa, château, manoir, palais, réduit
building, édifice, bâtiment, immeuble, construction, bâtisse, hôtel, campagne, propriété, ferme
boîte, entreprise, firme, établissement, prison, commerce, temple, institut, institution, branche, couvert, domesticité, serviteur, domestique, gens, monde, suite
clinique, hôpital, nom, couronne, trône, pigeonnier, lieu, place, séjour, feu
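A cohort such as those of table 3 can be gathered as sketched below, again assuming the hypothetical term_vecs mapping; the 0.25 threshold is the one quoted above.

```python
# Minimal sketch: terms whose similarity to the normalized center of mass
# of a cluster header exceeds the threshold, sorted by decreasing similarity.
import numpy as np

def cohort(header_terms, term_vecs, threshold=0.25):
    center = sum(term_vecs[t] for t in header_terms)
    center /= np.linalg.norm(center)           # center of mass, normalized
    return sorted((t for t, v in term_vecs.items()
                   if v @ center > threshold and t not in header_terms),
                  key=lambda t: -(term_vecs[t] @ center))
```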
Things get more complicated when two homonyms are semantically disjoint, as is the case with le barde and la barde:

Table 4 - First 100 neighbors of barde

From 1 to 20 | From 21 to 40 | From 41 to 60 | From 61 to 80 | From 81 to 100
1 1.000 barde | 21 0.298 versificateur | 41 0.167 croque-notes | 61 0.097 victimaire | 81 0.080 injurié
2 0.839 aède | 22 0.297 mâche-laurier | 42 0.157 choriste | 62 0.096 flamine | 82 0.079 septemvir
3 0.717 tranche_de_lard | 23 0.290 héros_du_Pinde | 43 0.148 harnais | 63 0.096 prestolet | 83 0.079 lama
4 0.625 chantre | 24 0.290 favori_des_Muses | 44 0.147 prêtre | 64 0.094 iman | 84 0.078 salien
5 0.533 poète | 25 0.289 amant_du_Parnasse | 45 0.146 cigale | 65 0.094 mufti | 85 0.078 brachyne
6 0.521 chanteur | 26 0.286 favori_du_Parnasse | 46 0.124 coryphée | 66 0.094 ratichon | 86 0.077 ménestrier
7 0.503 rhapsode | 27 0.285 nourrisson_du_Parnasse | 47 0.119 trouveur | 67 0.093 ovate | 87 0.076 curé
8 0.493 bardit | 28 0.285 poétereau | 48 0.114 corybante | 68 0.093 utopiste | 88 0.076 talapoin
9 0.468 trouvère | 29 0.284 métromane | 49 0.110 luperque | 69 0.091 ministre_du_culte | 89 0.075 épulon
10 0.450 scalde | 30 0.284 citharède | 50 0.110 muezzin | 70 0.090 pope | 90 0.074 mettre_dans_le_même_sac
11 0.427 troubadour | 31 0.282 crooner | 51 0.109 druide | 71 0.090 mystagogue | 91 0.074 immodérément
12 0.421 minnesinger | 32 0.276 félibre | 52 0.105 eubage | 72 0.090 cantatrice | 92 0.074 sacrificateur
13 0.328 nourrisson_du_Pinde | 33 0.274 duettiste | 53 0.105 parolier | 73 0.089 archiprêtre | 93 0.074 bombardier
14 0.323 amant_des_Muses | 34 0.267 rapsode | 54 0.103 quindecemvir | 74 0.086 sous-ventrière | 94 0.073 lévite
15 0.315 lamelle | 35 0.256 ménestrel | 55 0.103 mollah | 75 0.085 abbé | 95 0.073 englober
16 0.313 favori_d'Apollon | 36 0.240 rimeur | 56 0.103 padre | 76 0.084 curète | 96 0.072 chiennerie
17 0.313 fils_d'Apollon | 37 0.233 rimailleur | 57 0.102 hiérogrammate | 77 0.084 papas | 97 0.072 rabbin
18 0.308 nourrisson_des_Muses | 38 0.216 panne | 58 0.100 saronide | 78 0.082 avarice | 98 0.072 passivité
19 0.302 enfant_d'Apollon | 39 0.207 choreute | 59 0.097 chansonnier | 79 0.081 quindécemvir | 99 0.072 capelan
20 0.300 maître_du_Pinde | 40 0.176 vocaliste | 60 0.097 chapelain | 80 0.080 officiant | 100 0.072 eschatologique
If we meant barde as aède, the third neighbor, tranche_de_lard, is clearly not appropriate, and conversely. However, in a Euclidean space, the Schmidt orthogonalization procedure does remove this kind of interference. Since term vectors are normalized to unity, one needs simply to subtract from the vector $|barde\rangle$ the collinear component of the vector $|tranche\_de\_lard\rangle$:

$|barde\rangle_{\perp tranche\_de\_lard} = |barde\rangle - \langle barde \,|\, tranche\_de\_lard \rangle \times |tranche\_de\_lard\rangle$    (8)

with the following result, where the perturbation due to tranche_de_lard is totally eliminated:

Table 5 - First 100 neighbors of barde orthogonalized w.r.t. tranche_de_lard
From 1 to 20 | From 21 to 40 | From 41 to 60 | From 61 to 80 | From 81 to 100
1 0.744 aède | 21 0.396 poétereau | 41 0.217 croque-notes | 61 0.162 harnais | 81 0.130 lama
2 0.697 barde | 22 0.396 maître_du_Pinde | 42 0.215 lamelle | 62 0.161 prestolet | 82 0.130 papas
3 0.634 chantre | 23 0.394 mâche-laurier | 43 0.205 coryphée | 63 0.161 curète | 83 0.126 soliste
4 0.634 scalde | 24 0.393 favori_des_Muses | 44 0.188 cantatrice | 64 0.157 ministre_du_culte | 84 0.124 talapoin
5 0.629 poète | 25 0.389 nourrisson_du_Parnasse | 45 0.185 quindecemvir | 65 0.157 chapelain | 85 0.121 directeur_de_conscience
6 0.622 chanteur | 26 0.389 amant_du_Parnasse | 46 0.183 mystagogue | 66 0.157 eubage | 86 0.120 utopiste
7 0.582 minnesinger | 27 0.387 héros_du_Pinde | 47 0.178 luperque | 67 0.156 mufti | 87 0.120 prêtraille
8 0.548 trouvère | 28 0.380 félibre | 48 0.177 padre | 68 0.155 saronide | 88 0.115 ceinture_de_sécurité
9 0.525 rhapsode | 29 0.380 rapsode | 49 0.177 mollah | 69 0.153 trouveur | 89 0.113 curé
10 0.523 troubadour | 30 0.364 versificateur | 50 0.175 muezzin | 70 0.152 ovate | 90 0.113 sacrificateur
11 0.429 nourrisson_du_Pinde | 31 0.337 ménestrel | 51 0.173 hiérogrammate | 71 0.144 parolier | 91 0.109 rabbin
12 0.415 crooner | 32 0.333 métromane | 52 0.172 ratichon | 72 0.140 chansonnier | 92 0.106 salien
13 0.414 amant_des_Muses | 33 0.319 choreute | 53 0.172 quindécemvir | 73 0.139 hiérophante | 93 0.105 aumônier
14 0.414 nourrisson_des_Muses | 34 0.292 rimeur | 54 0.171 corybante | 74 0.139 pope | 94 0.105 sous-ventrière
15 0.409 favori_d'Apollon | 35 0.287 rimailleur | 55 0.170 archiprêtre | 75 0.137 diva | 95 0.105 capelan
16 0.408 citharède | 36 0.282 bardit | 56 0.169 druide | 76 0.137 épulon | 96 0.105 affublement
17 0.405 duettiste | 37 0.242 choriste | 57 0.168 victimaire | 77 0.135 abbé | 97 0.104 virtuose
18 0.404 favori_du_Parnasse | 38 0.231 vocaliste | 58 0.168 cigale | 78 0.134 septemvir | 98 0.104 ménestrier
19 0.400 fils_d'Apollon | 39 0.224 panne | 59 0.165 flamine | 79 0.133 musicien | 99 0.102 exécutant
20 0.397 enfant_d'Apollon | 40 0.218 prêtre | 60 0.165 iman | 80 0.131 officiant | 100 0.096 ténor
The number of terms which can be subtracted is only limited by the noise. The corresponding clusters associated with $|barde\rangle_{\perp tranche\_de\_lard}$ now are:

Table 6 - Clusters around barde and their cohorts (orthogonalized w.r.t. tranche_de_lard)

aède, barde, poète, chanteur, chantre
bardit, harnais, prêtre, lamelle, panne, rhapsode, troubadour, trouvère

to be compared to the non-orthogonalized result:

Table 7 - Clusters around barde and their cohorts

aède, barde, chanteur, chantre, poète
bardit, tranche_de_lard, lamelle, rhapsode, troubadour, trouvère
harnais, panne, prêtre
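Equation 8 amounts to a single Gram-Schmidt step; applied sequentially it generalizes to several subtracted terms (modified Gram-Schmidt). A minimal sketch, assuming the hypothetical term_vecs mapping used in the earlier sketches:

```python
# Minimal sketch of equation 8: remove the components collinear with one or
# more subtracted terms, then renormalize to unity.
import numpy as np

def orthogonalize(target, *subtracted, term_vecs):
    v = term_vecs[target].copy()
    for s in subtracted:                 # the number of subtractable terms
        u = term_vecs[s]                 # is only limited by the noise
        v -= (v @ u) * u                 # remove the collinear component
    return v / np.linalg.norm(v)

barde_perp = orthogonalize("barde", "tranche_de_lard", term_vecs=term_vecs)
```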
Conclusion and future work