Continual Learning for Sentence Representations Using Conceptors
Tianlin Liu
Department of Computer Science and Electrical Engineering
Jacobs University Bremen
28759 Bremen, Germany
[email protected]
Lyle Ungar and João Sedoc
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
{ungar, joao}@cis.upenn.edu

Abstract
Distributed representations of sentences have become ubiquitous in natural language processing tasks. In this paper, we consider a continual learning scenario for sentence representations: Given a sequence of corpora, we aim to optimize the sentence encoder with respect to the new corpus while maintaining its accuracy on the old corpora. To address this problem, we propose to initialize sentence encoders with the help of corpus-independent features, and then sequentially update sentence encoders using Boolean operations of conceptor matrices to learn corpus-dependent features. We evaluate our approach on semantic textual similarity tasks and show that our proposed sentence encoder can continually learn features from new corpora while retaining its competence on previously encountered corpora.
Distributed representations of sentences are essential for a wide variety of natural language processing (NLP) tasks. Although recently proposed sentence encoders have achieved remarkable results (e.g., (Yin and Schütze, 2015; Arora et al., 2017; Cer et al., 2018; Pagliardini et al., 2018)), most, if not all, of them are trained on a priori fixed corpora. However, in open-domain NLP systems such as conversational agents, we often face a dynamic environment, where training data are accumulated sequentially over time and the distributions of training data vary with respect to external input (Lee, 2017; Mathur and Singh, 2018). To effectively use sentence encoders in such systems, we propose to consider the following continual sentence representation learning task: Given a sequence of corpora, we aim to train sentence encoders such that they can continually learn features from new corpora while retaining strong performance on previously encountered corpora.

Toward addressing the continual sentence representation learning task, we propose a simple sentence encoder that is based on the summation and linear transform of a sequence of word vectors aided by matrix conceptors. Conceptors have their origin in reservoir computing (Jaeger, 2014) and have recently been used to perform continual learning in deep neural networks (He and Jaeger, 2018). Here we employ Boolean operations of conceptor matrices to update sentence encoders over time to meet the following desiderata:

1. Zero-shot learning. The initialized sentence encoder (no training corpus used) can effectively produce sentence embeddings.

2. Resistant to catastrophic forgetting. When the sentence encoder is adapted on a new training corpus, it retains strong performance on old ones.

The rest of the paper is organized as follows. We first briefly review a family of linear sentence encoders. Then we explain how to build upon such sentence encoders for continual sentence representation learning tasks, which leads to our proposed algorithm. Finally, we demonstrate the effectiveness of the proposed method using semantic textual similarity tasks.

Notation
We assume each word w from a vocabulary set V has a real-valued word vector v_w ∈ R^n. Let p(w) be the monogram probability of a word w. A corpus D is a collection of sentences, where each sentence s ∈ D is a multiset of words (word order is ignored here). For a collection of vectors Y = {y_i}_{i∈I}, where y_i ∈ R^l for i in an index set I with cardinality |I|, we let [y_i]_{i∈I} ∈ R^{l×|I|} be the matrix whose columns are the vectors y_1, ..., y_{|I|}. An identity matrix is denoted by I. Our code is available on GitHub: https://github.com/liutianlin0121/contSentEmbed

We briefly overview "linear sentence encoders" that are based on linear algebraic operations over a sequence of word vectors. Among different linear sentence encoders, the smoothed inverse frequency (SIF) approach (Arora et al., 2017) is a prominent example: it outperforms many neural-network-based sentence encoders on a battery of NLP tasks (Arora et al., 2017).

Derived from a generative model for sentences, the SIF encoder (presented in Algorithm 1) transforms a sequence of word vectors into a sentence vector in three steps. First, for each sentence in the training corpus, SIF computes a weighted average of word vectors (lines 1-3 of Algorithm 1); next, it estimates a "common discourse direction" of the training corpus (line 4 of Algorithm 1); thirdly, for each sentence in the testing corpus, it calculates the weighted average of the word vectors and projects the averaged result away from the learned common discourse direction (lines 5-8 of Algorithm 1). Note that this three-step paradigm is slightly more general than the original one presented in (Arora et al., 2017), where the training and the testing corpus are assumed to be the same.
Algorithm 1: SIF sentence encoder.
Input: A training corpus D; a testing corpus G; parameter a; monogram probabilities {p(w)}_{w∈V} of words.
1: for sentence s ∈ D do
2:    q_s ← (1/|s|) Σ_{w∈s} a/(p(w)+a) v_w
3: end
4: Let u be the first singular vector of [q_s]_{s∈D}.
5: for sentence s ∈ G do
6:    q_s ← (1/|s|) Σ_{w∈s} a/(p(w)+a) v_w
7:    f_s^SIF ← q_s − u u^⊤ q_s
8: end
Output: {f_s^SIF}_{s∈G}

Building upon SIF, recent studies have proposed further improved sentence encoders (Khodak et al., 2018; Pagliardini et al., 2018; Yang et al., 2018). These algorithms roughly share the core procedures of SIF, albeit using more refined methods (e.g., softly removing more than one common discourse direction).
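For concreteness, the following is a minimal NumPy sketch of Algorithm 1; the word_vec and word_prob lookups and all variable names are our own illustrative choices, not the authors' released code.

import numpy as np

def sif_embed(train_sents, test_sents, word_vec, word_prob, a=1e-3):
    # Sketch of Algorithm 1: weighted averaging plus common-direction removal.
    def weighted_avg(sent):
        # q_s = (1/|s|) * sum_{w in s} a / (p(w) + a) * v_w
        return np.mean([a / (word_prob[w] + a) * word_vec[w] for w in sent], axis=0)

    Q_train = np.stack([weighted_avg(s) for s in train_sents], axis=1)  # n x |D|
    # first left singular vector of [q_s]_{s in D}
    u = np.linalg.svd(Q_train, full_matrices=False)[0][:, 0]
    embeddings = []
    for s in test_sents:
        q = weighted_avg(s)
        embeddings.append(q - np.outer(u, u) @ q)  # project away the common discourse direction
    return embeddings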
In this section, we consider how to design a linear sentence encoder for continual sentence representation learning. We observe that the common discourse directions used by SIF-like encoders are estimated from the training corpus. However, incrementally estimating common discourse directions in continual sentence representation learning tasks might not be optimal. For example, suppose we are sequentially given training corpora of tweets and news articles. When the first corpus of tweets is presented, we can train a SIF sentence encoder using the tweets. When the second corpus of news articles is given, however, we face the problem of how to exploit the newly given corpus to improve the trained sentence encoder. A straightforward solution is to first combine the tweets and news-article corpora and then train a new encoder from scratch using the combined corpus. However, this paradigm is neither efficient nor effective. It is not efficient in the sense that we need to re-train the encoder from scratch every time a new corpus is added. Furthermore, it is not effective in the sense that the common direction estimated from scratch reflects a compromise between tweets and news articles, which might not be optimal for either standalone corpus. Indeed, it is possible that larger corpora will swamp smaller ones.

To make the common discourse learned from one corpus more generalizable to another, we propose to use the conceptor matrix (Jaeger, 2017) to characterize and update the common discourse features in a sequence of training corpora.
In this section, we briefly introduce matrix conceptors, drawing heavily on (Jaeger, 2017; He and Jaeger, 2018; Liu et al., 2019). Consider a set of vectors {x_1, ..., x_n}, x_i ∈ R^N for all i ∈ {1, ..., n}. A conceptor matrix is a regularized identity map that minimizes

(1/n) Σ_{i=1}^{n} ||x_i − C x_i||^2 + α^{-2} ||C||_F^2,    (1)

where ||·||_F is the Frobenius norm and α^{-2} is a scalar parameter called aperture. It can be shown that C has a closed-form solution:

C = (1/n) X X^⊤ ((1/n) X X^⊤ + α^{-2} I)^{-1},    (2)

where X = [x_i]_{i∈{1,...,n}} is a data collection matrix whose columns are the vectors from {x_1, ..., x_n}. In intuitive terms, C is a soft projection matrix onto the linear subspace where the typical components of the samples x_i lie. For convenience in notation, we may write C(X, α) to stress the dependence on X and α.

Conceptors are subject to most laws of Boolean logic, such as NOT (¬), AND (∧), and OR (∨). For two conceptors C and B, we define the following operations:

¬C := I − C,    (3)
C ∧ B := (C^{-1} + B^{-1} − I)^{-1},    (4)
C ∨ B := ¬(¬C ∧ ¬B).    (5)

Among these Boolean operations, the OR operation ∨ is particularly relevant for our continual sentence representation learning task. It can be shown that C ∨ B is the conceptor computed from the union of the two sets of sample points from which C and B are computed. Note, however, that to calculate C ∨ B we only need to know the two matrices C and B; we do not need access to the two sets of sample points from which C and B were computed.

We now show how to sequentially characterize and update the common discourse of corpora using the Boolean operations of conceptors. Suppose that we are sequentially given M training corpora D_1, ..., D_M, presented one after another. Without using any training corpus, we first initialize a conceptor which characterizes the corpus-independent common discourse features. More concretely, we compute C_0 := C([v_w]_{w∈Z}, α), where [v_w]_{w∈Z} is a matrix of column-wise stacked word vectors of words from a stop-word list Z and α is a hyper-parameter. After initialization, for each new training corpus D_i (i = 1, ..., M) coming in, we compute a new conceptor C_temp := C([q_s]_{s∈D_i}, α) to characterize the common discourse features of corpus D_i, where the q_s are defined as in the SIF Algorithm 1. We can then use Boolean operations of conceptors to compute C_i := C_temp ∨ C_{i−1}, which characterizes common discourse features from the new corpus as well as the old corpora. After all M corpora are presented, we follow the SIF paradigm and use C_M to remove common discourse features from (potentially unseen) sentences.
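To make the computation concrete, here is a minimal NumPy sketch of the conceptor of Equation (2) and of the Boolean operations (3)-(5) used in the update C_i := C_temp ∨ C_{i−1}; it is a simplified illustration under our own naming, not the authors' released implementation.

import numpy as np

def conceptor(X, alpha):
    # C(X, alpha) = R (R + alpha^{-2} I)^{-1} with R = (1/n) X X^T, Eq. (2);
    # the columns of X are the sample vectors x_1, ..., x_n.
    n = X.shape[1]
    R = X @ X.T / n
    return R @ np.linalg.inv(R + alpha ** (-2) * np.eye(R.shape[0]))

def NOT(C):
    # Eq. (3)
    return np.eye(C.shape[0]) - C

def AND(C, B):
    # Eq. (4); pinv coincides with the ordinary inverse whenever C and B are
    # invertible, which holds for the NOT-ed conceptors passed to it by OR below.
    I = np.eye(C.shape[0])
    return np.linalg.pinv(np.linalg.pinv(C) + np.linalg.pinv(B) - I)

def OR(C, B):
    # Eq. (5), used for the update C_i = C_temp OR C_{i-1}
    return NOT(AND(NOT(C), NOT(B)))

# Example update: start from stop-word vectors, then fold in a new corpus.
# C0 = conceptor(V_stopwords, alpha); C1 = OR(conceptor(Q_corpus1, alpha), C0)

Note that ¬C = I − C has eigenvalues bounded away from zero for any finite aperture, so the OR update remains well defined even when the individual conceptors are rank-deficient.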
The above outlined conceptor-aided (CA) continual sentence representation learning method is presented in Algorithm 2.

Algorithm 2: CA sentence encoder.
Input: A sequence of M training corpora D = {D_1, ..., D_M}; a testing corpus G; hyper-parameters a and α; word probabilities {p(w)}_{w∈V}; stop word list Z.
1: C_0 ← C([v_w]_{w∈Z}, α)
2: for corpus index i = 1, ..., M do
3:    for sentence s ∈ D_i do
4:       q_s ← (1/|s|) Σ_{w∈s} a/(p(w)+a) v_w
5:    end
6:    C_temp ← C([q_s]_{s∈D_i}, α)
7:    C_i ← C_temp ∨ C_{i−1}
8: end
9: for s ∈ G do
10:   q_s ← (1/|s|) Σ_{w∈s} a/(p(w)+a) v_w
11:   f_s^CA ← q_s − C_M q_s
12: end
Output: {f_s^CA}_{s∈G}

A simple modification of Algorithm 2 yields a "zero-shot" sentence encoder that requires only pre-trained word embeddings and no training corpus: we can simply skip the corpus-dependent steps (lines 2-8) and use C_0 in place of C_M in line 11 of Algorithm 2 to embed sentences. This method will be referred to as "zero-shot CA."
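The test-time step (lines 9-12 of Algorithm 2) admits a very compact sketch; it takes the final conceptor as input, so passing the stop-word conceptor C_0 instead of C_M gives the zero-shot CA variant. As above, the variable names are our own illustrative choices, not the released code.

import numpy as np

def ca_embed(test_sents, C, word_vec, word_prob, a=1e-3):
    # f_s = q_s - C q_s, where C is C_M for the full CA encoder or C_0 for zero-shot CA.
    I = np.eye(C.shape[0])
    embeddings = []
    for s in test_sents:
        q = np.mean([a / (word_prob[w] + a) * word_vec[w] for w in s], axis=0)
        embeddings.append((I - C) @ q)
    return embeddings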
We evaluated our approach for continual sentence representation learning using semantic textual similarity (STS) datasets (Agirre et al., 2012, 2013, 2014, 2015, 2016). The evaluation criterion for such datasets is the Pearson correlation coefficient (PCC) between the predicted sentence similarities and the ground-truth sentence similarities. We split these datasets into five corpora by their genre: news, captions, wordnet, forums, tweets (for details see the appendix). Throughout this section, we use publicly available 300-dimensional GloVe vectors (trained on the 840-billion-token Common Crawl) (Pennington et al., 2014). Additional experiments with Word2Vec (Mikolov et al., 2013), Fasttext (Bojanowski et al., 2017), and Paragram-SL-999 (Wieting et al., 2015) are in the appendix.

[Figure 1: PCC results of STS datasets. Each panel shows the PCC results of a testing corpus (specified as a subtitle) as a function of the increasing number of training corpora used; lines show corpus-specialized SIF, train-from-scratch SIF, and CA. The setup of this experiment mimics (Zenke et al., 2017, section 5.1).]

                              News   Captions   WordNet   Forums   Tweets
av. train-from-scratch SIF    66.5   79.7       80.3      55.5     74.2
zero-shot CA                  65.6   79.8       82.5      61.5     75.2
av. CA

Table 1: Time-course averaged PCC of train-from-scratch SIF and conceptor-aided (CA) methods, together with the result of zero-shot CA. Best results are in boldface and the second best results are underscored.

We use a standard continual learning experiment setup (cf. (Zenke et al., 2017, section 5.1)) as follows. We sequentially present the five training datasets, in the order of news, captions, wordnet, forums, and tweets, to train sentence encoders. (The order can be arbitrary; here we ordered the corpora from the one with the largest size (news) to the smallest size (tweets). The results from reversely ordered corpora are reported in the appendix.) Whenever a new training corpus is presented, we train a SIF encoder from scratch (by combining all available training corpora which have already been presented) and then test it on each corpus. At the same time, we incrementally adapt a CA encoder using the newly presented corpus and test it on each corpus. We use the same weighting parameter a as in (Arora et al., 2017); the word frequencies are available at the GitHub repository of SIF. We used the hyper-parameter α = 1; other parameters are set to be the same as for SIF.

The lines of each panel of Figure 1 show the test results of SIF and CA on each testing corpus (specified as the panel subtitle) as a function of the number of training corpora used (the first n corpora of news, captions, wordnet, forums, and tweets for this experiment). To give a concrete example, consider the blue line in the first panel of Figure 1. This line shows the test PCC scores (y-axis) of the SIF encoder on the news corpus as the number of training corpora increases (x-axis). Specifically, the left-most blue dot indicates the test result of the SIF encoder on the news corpus when trained on the news corpus itself (that is, the first training corpus is used); the second point indicates the test result of the SIF encoder on the news corpus when trained on the news and captions corpora (i.e., the first two training corpora are used); the third point indicates the test result of the SIF encoder on the news corpus when trained on the news, captions, and wordnet corpora (that is, the first three training corpora are used); and so on. The dashed lines in the panels show the results of a corpus-specialized SIF, which is trained and tested on the same corpus, as done in (Arora et al., 2017, section 4.1). We see that the PCC results of CA are better and more "forgetting-resistant" than train-from-scratch SIF throughout the time course as more training data are incorporated. Consider, for example, the test results on the news corpus (first panel) again. As more and more training corpora are used, the performance of train-from-scratch SIF drops with a noticeable slope; by contrast, the performance of CA drops only slightly.
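For reference, the PCC numbers in Figure 1 and Table 1 are computed in the usual way for STS-style evaluations of such encoders: score each sentence pair and correlate the scores with the gold similarities. The cosine-similarity scoring in the sketch below is our assumption of the standard practice, not a detail spelled out above.

import numpy as np
from scipy.stats import pearsonr

def sts_pcc(embs_a, embs_b, gold):
    # cosine similarity per sentence pair, then Pearson correlation with gold scores
    cos = [a @ b / (np.linalg.norm(a) * np.linalg.norm(b)) for a, b in zip(embs_a, embs_b)]
    return pearsonr(cos, gold)[0]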
As remarked in Section 3.2, with a simple modification of CA we can perform zero-shot sentence representation learning without using any training corpus. The zero-shot learning results are presented in Table 1, together with the time-course averaged results of CA and train-from-scratch SIF (i.e., the averaged values of the CA or SIF scores in each panel of Figure 1). We see that the averaged results of our CA method are the best among these three methods. Somewhat surprisingly, the results yielded by zero-shot CA are better than the averaged results of train-from-scratch SIF in most cases. We defer additional experiments to the appendix, where we compare CA against more baseline methods and use word vectors other than GloVe to carry out the experiments.

In this paper, we formulated a continual sentence representation learning task: Given a consecutive sequence of corpora presented in a time-course manner, how can we extract useful sentence-level features from new corpora while retaining those from previously seen corpora? We identified that existing linear sentence encoders usually fall short at solving this task, as they rely on "common discourse" statistics estimated from a priori fixed corpora. We proposed two sentence encoders (the CA encoder and the zero-shot CA encoder) and demonstrated their effectiveness at the continual sentence representation learning task using STS datasets.

As the first paper considering the continual sentence representation learning task, this work has been limited in a few ways; it remains for future work to address these limitations. First, it is worthwhile to incorporate more benchmarks such as GLUE (Wang et al., 2019) and SentEval (Conneau and Kiela, 2018) into the continual sentence representation task. Second, this work only considers the case of linear sentence encoders, but future research can attempt to devise (potentially more powerful) non-linear sentence encoders to address the same task. Thirdly, the proposed CA encoder operates at a corpus level, which might be a limitation if the boundaries of training corpora are ill-defined. As a future direction, we expect to lift this assumption, for example, by updating the common direction statistics at a sentence level using autoconceptors (Jaeger, 2014, section 3.14). Finally, continual-learning-based sentence encoders should be applied to downstream applications in areas such as open-domain NLP systems.
Acknowledgement
The authors thank the anonymous reviewers for their helpful feedback. This work was partially supported by João Sedoc's Microsoft Research Dissertation Grant.
References
E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation, pages 252-263.

E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 81-91.

E. Agirre, C. Banea, D. Cer, M. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497-511, San Diego, California.

E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics, volume 1, pages 32-43.

E. Agirre, M. Diab, D. Cer, and A. Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, SemEval '12, pages 385-393, Stroudsburg, PA, USA. Association for Computational Linguistics.

S. Arora, Y. Liang, and T. Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.

D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

A. Conneau and D. Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.

X. He and H. Jaeger. 2018. Overcoming catastrophic interference using conceptor-aided backpropagation. In International Conference on Learning Representations.

H. Jaeger. 2014. Controlling recurrent neural networks by conceptors. Technical report, Jacobs University Bremen.

H. Jaeger. 2017. Using conceptors to manage neural long-term memories for temporal patterns. Journal of Machine Learning Research, 18(13):1-43.

M. Khodak, N. Saunshi, Y. Liang, T. Ma, B. Stewart, and S. Arora. 2018. A la carte embedding: Cheap but effective induction of semantic feature vectors. In Proceedings of ACL.

S. Lee. 2017. Toward continual learning for conversational agents. Technical report, Microsoft Research AI - Redmond.

T. Liu, L. Ungar, and J. Sedoc. 2019. Unsupervised post-processing of word vectors via conceptor negation. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-2019), Honolulu.

V. Mathur and A. Singh. 2018. The rapidly changing landscape of conversational agents.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.

M. Pagliardini, P. Gupta, and M. Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of NAACL 2018.

J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532-1543.

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

J. Wieting, M. Bansal, K. Gimpel, K. Livescu, and D. Roth. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics, 3:345-358.

Z. Yang, C. Zhu, and W. Chen. 2018. Zero-training sentence embedding via orthogonal basis.

W. Yin and H. Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of NAACL HLT 2015, pages 901-911.

F. Zenke, B. Poole, and S. Ganguli. 2017. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3987-3995, International Convention Centre, Sydney, Australia. PMLR.

Supplementary Information
The split STS datasets
In the main body of the paper, we reported that we used the STS datasets split by genre. A detailed list of such STS tasks can be found in Table 2; the data can be downloaded from http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

News: MSRpar 2012; headlines 2013-2016; deft-news 2014 (4299 sentence pairs)
Captions: MSRvid 2012; images 2014-2015; track5.en-en 2017 (3250 sentence pairs)
Forum: deft-forum 2014; answers-forums 2015; answer-answer 2016 (1079 sentence pairs)
Tweets: tweet-news 2014 (750 sentence pairs)
WN: OnWN 2012-2014 (2061 sentence pairs)
Table 2: STS datasets breakdown according to genres.
CA compared with incremental-deletion SIF
We compare the CA approach with the following variant of SIF. In the learning phase, for each corpus coming in, we learn and store a common direction (estimated based on the new corpus). In the testing phase, for a sentence in the testing corpora, we project it away from all common directions we have stored so far. We call this approach SIF with incremental deletions. The testing results are reported in Figure 2.

[Figure 2: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora; lines show corpus-specialized SIF, train-from-scratch SIF, CA, and SIF with incremental deletions. For explanation see text.]
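Under one natural reading of this variant (projecting away each stored direction in turn at test time), the embedding step could look like the following sketch; it is our own illustration, not the authors' code.

import numpy as np

def incremental_deletion_embed(q, stored_directions):
    # q: weighted-average sentence vector; stored_directions: one first singular
    # vector per training corpus seen so far. Project q away from each in turn.
    f = q.copy()
    for u in stored_directions:
        f = f - np.outer(u, u) @ f
    return f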
CA without stop word initialization
We also tested the performance of CA without initializing the conceptor C_0 with stop words; that is, we set C_0 to a zero matrix in our CA algorithm. The results are reported in Figure 3.

[Figure 3: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora; lines show corpus-specialized SIF, train-from-scratch SIF, CA, and CA without stop-word initialization. For explanation see text.]
We see that CA initialized with stop words performs better than CA without such initialization, especially for testing corpora that are unseen in the training data.
CA with the reverse-ordered sequence of training corpora
In the main body of the paper, we sequentially presented new training corpora to the sentence encoders, from the corpus of the largest size (news) to that of the smallest size (tweets). We remarked that this choice of ordering is essentially arbitrary. We now report the results for the reverse order (i.e., from the corpus of the smallest size to that of the largest size) in Figure 4. We see that the CA approach still outperforms train-from-scratch SIF throughout the time course.

[Figure 4: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora, with corpora presented in reverse order; lines show corpus-specialized SIF, train-from-scratch SIF, and CA. For explanation see text.]
Experiments using other word embeddings
We repeat the experiments with Word2Vec (Mikolov et al., 2013) (pre-trained on Google News; 3 million tokens; https://code.google.com/archive/p/word2vec/), Fasttext (Bojanowski et al., 2017) (pre-trained on Common Crawl; 2 million tokens; https://fasttext.cc/docs/en/english-vectors.html), and Paragram-SL-999 (fine-tuned based on GloVe; https://cogcomp.org/page/resource_view/106). The pipeline of these experiments echoes that of the main body of the paper.

Using Word2Vec

[Figure 5: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora; Word2Vec is used. Lines show corpus-specialized SIF, train-from-scratch SIF, and CA.]

Using Fasttext

[Figure 6: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora; Fasttext is used. Lines show corpus-specialized SIF, train-from-scratch SIF, and CA.]
Using Paragram-SL-999

[Figure: Pearson correlation coefficients (PCC) of the split STS datasets as a function of the number of training corpora; Paragram-SL-999 is used. Lines show corpus-specialized SIF, train-from-scratch SIF, and CA.]