Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks
Hyunjin Choi, Judong Kim, Seongho Joe, and Youngjune Gwon
Samsung SDS, Seoul, Korea
Abstract—Contextualized representations from a pre-trained language model are central to achieving high performance on downstream NLP tasks. The pre-trained BERT and A Lite BERT (ALBERT) models can be fine-tuned to give state-of-the-art results in sentence-pair regressions such as semantic textual similarity (STS) and natural language inference (NLI). Although BERT-based models yield the [CLS] token vector as a reasonable sentence embedding, the search for an optimal sentence embedding scheme remains an active research area in computational linguistics. This paper explores sentence embedding models for BERT and ALBERT. In particular, we take a modified BERT network with siamese and triplet network structures called Sentence-BERT (SBERT) and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT). We also experiment with an outer CNN sentence-embedding network for SBERT and SALBERT. We evaluate the performance of all sentence-embedding models considered using the STS and NLI datasets. The empirical results indicate that our CNN architecture improves ALBERT models substantially more than BERT models on the STS benchmark. Despite significantly fewer model parameters, ALBERT sentence embedding is highly competitive with BERT in downstream NLP evaluations.
I. INTRODUCTION
Pre-trained language models have impacted the way modern natural language processing (NLP) applications and systems are built. An important paradigm is to train a language model on large corpora to serve as a platform upon which an NLP application can be built and optimized. Such a platform is shareable and can be distributed. Self-supervised learning with large corpora provides an appropriate starting point: extra task-specific layers are optimized from scratch while the pre-trained model parameters are reused.

Transformer [1], a sequence transduction model based on the attention mechanism, has revolutionized the design of neural encoders for natural language sequences. By skipping any recurrent or convolutional structures, the transformer architecture learns the sequential information in an input solely via attention, thanks to the multi-head self-attention layers in each encoder block. Devlin et al. [2] have proposed Bidirectional Encoder Representations from Transformers (BERT) to improve on the predominantly unidirectional training of language models. By jointly conditioning on both left and right context in all layers, BERT uses the masked language modeling (MLM) loss to make the training of a deep bidirectional language encoder possible. BERT uses an additional pre-training loss known as next-sentence prediction (NSP). NSP is designed to learn high-level linguistic coherence by predicting whether two given text segments appear consecutively in the original text. NSP is expected to improve downstream NLP tasks such as semantic textual similarity (STS) and natural language inference (NLI) that require reasoning about inter-sentence relations.

A Lite BERT (ALBERT) [3] is proposed to scale up language representation learning via parameter reduction techniques. In ALBERT, cross-layer parameter sharing and factorization of the embedding parameters can be thought of as a regularization that helps stabilize training. Furthermore, ALBERT uses an updated self-supervised loss known as sentence-order prediction (SOP) that addresses the ineffectiveness of NSP, which conflates topic prediction and coherence prediction. SOP has been shown to consistently help downstream tasks with multi-sentence inputs.

The pre-training tasks are intrinsic compared to downstream tasks. A key disadvantage of BERT is that no independent sentence embeddings are computed. As a higher means of abstraction, sentence embeddings can play a central role in achieving good downstream performance on tasks like machine reading comprehension (MRC). The specifics of NLP applications are well abstracted by downstream tasks. For this reason, downstream performance is a good indicator of a language model's quality. When pre-trained language models are used for downstream task evaluations, they can generate additional feature representations in addition to serving as a platform for fine-tuning.

In this paper, we are interested in learning sentence representations using out-of-the-box BERT and ALBERT token embeddings. Sentence embedding models are essential for clustering and semantic search, where a sentence input is mapped into a high-dimensional semantic vector space such that sentence vectors with similar meanings are close in distance. NLP researchers have started to input an individual sentence into BERT to derive a fixed-size embedding. A commonly accepted sentence embedding for BERT-based models is the [CLS] token used for the sentence-level pre-training objective (i.e., NSP or SOP). Averaging the representations obtained from the BERT or ALBERT output layer (i.e., the token embeddings) gives an alternative. Using the [CLS] token, which is optimized by an intrinsic task of the pre-training, is considered suboptimal, while the average pooling of token embeddings has limitations of its own. Moreover, it can be time consuming to perform multi-sentence tasks associated with semantic search, summarization, and paraphrasing.

Computing sentence embeddings from contextualized language models is an active, ongoing research problem. In our exploration of more elaborate sentence embedding models, we first consider Sentence-BERT (SBERT) [4], a modified BERT network with siamese and triplet network structures that derives semantically meaningful sentence embeddings. SBERT is computationally efficient and can compare sentences using only cosine similarity at run-time. We then take the SBERT architecture and simply replace BERT with ALBERT to form Sentence-ALBERT (SALBERT). We also apply a convolutional neural net (CNN), instead of average pooling, that takes in the BERT or ALBERT token embedding outputs.

We have evaluated the empirical performance of all sentence embedding models using the STS and NLI datasets. We find that our CNN architecture improves ALBERT models by up to 8 points in Spearman's rank correlation on the STS benchmark, which is substantially larger than the improvement of only 1 point for BERT models. Despite significantly fewer model parameters, ALBERT sentence embedding is highly competitive with BERT in downstream NLP evaluations.

This paper is structured in the following manner. Section II presents related work. Section III describes all sentence embedding models under consideration. In Section IV, we empirically evaluate the sentence embedding models using the STS and NLI datasets. The paper concludes in Section V.

II. RELATED WORK
Language models provide core building blocks for downstream NLP tasks. Task-specific fine-tuning of a pre-trained language model is a contemporary approach to implementing an NLP system. BERT [2] is a pre-trained transformer encoder network [1] fine-tuned to give state-of-the-art results in question answering, sentence classification, and sentence-pair regression. A Lite BERT (ALBERT) [3] incorporates parameter reduction techniques to scale better than BERT. ALBERT is known to improve on inter-sentence coherence with a self-supervised loss from sentence-order prediction (SOP), compared to the next-sentence prediction (NSP) loss in the original BERT.

The BERT network structure contains a special classification token [CLS] as an aggregate sequence representation for NSP. (Similarly for ALBERT, [CLS] is used for SOP.) The [CLS] token therefore can serve as a sentence embedding. Because there are no other independently computed sentence embeddings for BERT and ALBERT, one can average-pool the token embedding outputs to form a fixed-length sentence vector.

Previously, sentence embedding research looked to convolutional and recurrent structures as building blocks. Kim [5] proposed a CNN with max pooling for sentence classification. In Conneau et al. [6], a bidirectional LSTM (BiLSTM) was used as a sentence embedding for natural language inference tasks. Among more complex neural nets, Socher et al. [7] introduced the recursive neural tensor network (RNTN) over parse trees to compute sentence embeddings for sentiment analysis. Zhu et al. [8] and Tai et al. [9] proposed tree-LSTMs, while Munkhdalai & Yu [10] suggested the neural semantic encoder (NSE) based on a memory-augmented neural net.

Recently, sentence embedding research has explored attention mechanisms. Vaswani et al. [1] have proposed Transformer, a self-attention network for the neural sequence-to-sequence task. A self-attention network uses multi-head scaled dot-product attention to represent each word as a weighted sum of all words in the sentence. The idea of self-attention pooling existed before the self-attention network, as in Liu et al. [11], who utilized inner-attention within a sentence to apply pooling for sentence embedding. Choi et al. [12] have developed a fine-grained attention mechanism for neural machine translation, extending scalar attention to vectors.

Complex contextualized sentence encoders are usually pre-trained like language models, but they can be improved by supervised transfer tasks such as natural language inference (NLI). InferSent by Conneau et al. [6] has consistently outperformed unsupervised methods like SkipThought. Universal Sentence Encoder [13] trains a transformer network and augments unsupervised learning with training on the Stanford NLI (SNLI) dataset. Hill et al. [14] show that the task on which sentence embeddings are trained significantly impacts their quality. According to Conneau et al. [6] and Cer et al. [13], the SNLI datasets are suitable for training sentence embeddings. Yang et al. [15] present a method to train a siamese deep averaging network (DAN) and transformer, using conversations from Reddit, yielding good results on the STS benchmark.

In Sentence-BERT (SBERT) [4], a comprehensive evaluation of the pre-trained BERT combined with siamese and triplet network structures is presented.
To alleviate the run-time overhead, SBERT's more elaborate fine-tuning mechanisms, such as softmax on augmented sentence representations and triplet loss, are replaced by cosine similarity at inference. The simple SBERT inference reduces the effort of finding the most similar pair from 65 hours with BERT to about 5 seconds, while hardly impacting accuracy.

III. MODELS
The output of BERT or ALBERT constitutes token embeddings for a given text input. With a large output size (e.g., up to 512 token vectors of 768 dimensions each), the contextualized word embeddings can be fine-tuned for any downstream task. To do sentence-level regressions such as semantic textual similarity (STS), fixed-size sentence embeddings are necessary. In this section, we describe the sentence embedding models for BERT and ALBERT.
A. The [CLS] token embedding
The most straightforward sentence embedding model is the [CLS] vector used to predict sentence-level context (i.e., BERT NSP, ALBERT SOP) during the pre-training. The [CLS] token summarizes the information from the other tokens via a self-attention mechanism that facilitates the intrinsic tasks of the pre-training. By similar reasoning, the [CLS] token can be further optimized while fine-tuning the downstream task. After fine-tuning, the [CLS] token is expected to capture more semantically relevant sentence-level context specific to the downstream task.
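As an illustration, the [CLS] vector can be read directly off the encoder output. The following is a minimal sketch, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint; it is not the exact evaluation code used in this paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("A man is playing a guitar.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, tokens, hidden);
# the [CLS] vector sits at token position 0.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape (1, 768)
```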
B. Pooled token embeddings
Averaging the token embedding output gives our next model. The model works like a pooling layer in a convolutional neural net: average pooling turns the token embeddings into a fixed-length sentence vector. An alternative would use max pooling instead, although max pooling tends to select the most salient features rather than taking a representative summary. In this paper, we choose to go with the average-pooling model.
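A minimal sketch of the average-pooling model follows. Excluding padding positions from the average via the attention mask is our assumption of a sensible implementation detail; the paper does not spell it out.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token embeddings into one fixed-length sentence vector,
    ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, tokens, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid divide-by-zero
    return summed / counts
```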
C. Sentence-BERT (SBERT)
Reimers & Gurevych [4] propose SBERT that modifies apre-trained BERT with siamese and triplet network structuresto derive semantically meaningful sentence embeddings com-parable using only cosine similarity. The siamese architectureis computationally efficient. Note that using a single copy ofpre-trained BERT would require to run all possible combina-tions of sentence pairs from a dataset to form a representationfor sentence pairs. SBERT first average-pools a pair of theBERT embeddings to fixed-size sentence embeddings. Usingthe two sentence embeddings and an element-wise differencebetween them, SBERT can run a softmax layer configured forclassification and regression tasks.
D. Sentence-ALBERT (SALBERT)
Based on ALBERT, SALBERT has the same siamese and triplet networks as SBERT. The siamese network structure used in SBERT and SALBERT is illustrated in Fig. 1.
Fig. 1. Siamese network structure used in SBERT and SALBERT. Each sentence in the pair is encoded by a weight-sharing BERT/ALBERT network followed by average pooling or a CNN, and the resulting sentence embeddings are compared by cosine similarity under an MSE loss.
E. CNN-SBERT
In SBERT, average pooling turns the BERT embeddings into fixed-length sentence vectors. CNN-SBERT instead employs a CNN architecture that takes in the token embeddings and computes a fixed-size sentence embedding through convolutional layers with the hyperbolic tangent activation function, interlaced with pooling layers. In CNN-SBERT, all the pooling layers use max pooling except the final average pooling. The CNN architecture used in CNN-SBERT is described in Fig. 2.
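A sketch of such a CNN head is given below. The kernel sizes, channel count, and the final average pooling follow Fig. 2 where recoverable; the remaining details (padding, pooling strides, number of stacked blocks) are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CNNSentenceHead(nn.Module):
    """Maps token embeddings (B, T, H) to a sentence embedding (B, H):
    3x1 convolutions over the token axis with tanh, interlaced with
    max pooling, then a 1x1 convolution and a final average pooling."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(3, 1), padding=(1, 0)),
            nn.Tanh(),
            nn.MaxPool2d(kernel_size=(2, 1)),  # pool over tokens only
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
            nn.Tanh(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            nn.Conv2d(channels, 1, kernel_size=(1, 1)),  # back to 1 channel
            nn.Tanh(),
        )

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        x = token_embeddings.unsqueeze(1)  # (B, 1, T, H)
        x = self.net(x)                    # (B, 1, T', H)
        return x.mean(dim=2).squeeze(1)    # average over tokens -> (B, H)
```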
F. CNN-SALBERT
Similarly, CNN-SALBERT uses the same CNN architecture as CNN-SBERT.

Fig. 2. CNN architecture used in CNN-SBERT and CNN-SALBERT: 3x1 convolutions (128 channels, tanh) interlaced with max-pooling layers, followed by a 1x1 convolution (1 channel, tanh) and a final average pooling, mapping an input of shape (B, 1, T, H) to a sentence embedding of shape (B, H). B, T, and H denote the mini-batch size, the number of tokens, and the transformer hidden size.
IV. EXPERIMENTS
We evaluate the performance of the sentence embedding models on Semantic Textual Similarity (STS) and Natural Language Inference (NLI) benchmarks. Following the methodology of Reimers & Gurevych [4], we use cosine similarity as the main metric to evaluate the similarity between two sentence embeddings. We compute both the Pearson and Spearman's rank correlation coefficients to indicate how our cosine-similarity estimate and the ground-truth label provided by the datasets are correlated. We use pre-trained BERT and ALBERT models from Hugging Face [16] (https://github.com/huggingface).
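Concretely, the evaluation amounts to correlating cosine similarities with the gold labels. A minimal sketch using NumPy and SciPy (function and variable names are ours):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_sts(emb1: np.ndarray, emb2: np.ndarray, gold: np.ndarray):
    """emb1, emb2: (n_pairs, hidden) sentence embeddings of each pair;
    gold: (n_pairs,) human similarity scores."""
    cos = (emb1 * emb2).sum(axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    spearman, _ = spearmanr(cos, gold)
    pearson, _ = pearsonr(cos, gold)
    return 100 * spearman, 100 * pearson  # reported as rho x 100
```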
A. Datasets and tasks

We fine-tune the BERT and ALBERT sentence embedding models on the Semantic Textual Similarity benchmark (STSb) [17], Multi-Genre Natural Language Inference (MultiNLI) [18], and Stanford Natural Language Inference (SNLI) [19] datasets.

1) Semantic Textual Similarity benchmark: STSb gives a set of English data used for the STS tasks organized in the International Workshop on Semantic Evaluation (SemEval) [20] between 2012 and 2017. The dataset includes 8,628 sentence pairs from image captions, news headlines, and user forums, partitioned into train (5,749), dev (1,500), and test (1,379) sets. The pairs are annotated with a score from 0 to 5 indicating how similar the two sentences are in terms of semantic relatedness.
2) Multi-Genre Natural Language Inference:
The MultiNLI corpus [18] is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The dataset is used to evaluate the entailment classification task. MultiNLI is modeled on the SNLI corpus but differs in its coverage of genres of spoken and written text, which supports a distinctive cross-genre generalization evaluation. Each sentence pair in MultiNLI has a label indicating whether the two sentences stand in a contradiction, entailment, or neutral relation.
3) Stanford Natural Language Inference:
The SNLI corpus [19] contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral for natural language inference (NLI), also known as recognizing textual entailment (RTE). The General Language Understanding Evaluation (GLUE) benchmark [21] recommends the SNLI dataset as auxiliary training data for the MultiNLI task. Conneau et al. [6] and Cer et al. [13] find SNLI suitable for training sentence embeddings to assert reasoning about the semantic relationship between sentences.
B. Training
In our evaluation, we consider only the BERT and ALBERT base models (i.e., multi-head attention over 12 layers) from the transformers package downloaded from Hugging Face [16]. We use the GLUE benchmark to fine-tune the [CLS] token embedding and average-pooled token embedding models. We train all of our models using the Adam optimizer with a linear learning rate warm-up over the first 10% of the training data. We use a learning rate of 2 × 10⁻⁵ for SBERT and SALBERT, as suggested for the original SBERT architecture, and a separate learning rate for CNN-SBERT and CNN-SALBERT. Using the MultiNLI and SNLI data, we optimize SBERT and SALBERT with the 3-way softmax loss.
1) STSb:
To train the STS benchmark task, we use the siamese network shown in Fig. 1. We run 10 training epochs with a batch size of 32.
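The regression objective in Fig. 1 compares the cosine similarity of the two sentence embeddings against the gold score under an MSE loss. A sketch, assuming the gold STSb scores are rescaled from [0, 5] to [0, 1]:

```python
import torch
import torch.nn.functional as F

def stsb_cosine_mse_loss(u: torch.Tensor, v: torch.Tensor,
                         gold: torch.Tensor) -> torch.Tensor:
    """u, v: (batch, hidden) sentence embeddings of each pair;
    gold: (batch,) similarity labels rescaled to [0, 1]."""
    cos = F.cosine_similarity(u, v, dim=-1)  # (batch,)
    return F.mse_loss(cos, gold)
```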
2) NLI (MultiNLI + SNLI):
To train the NLI tasks, we adopt the siamese architecture in Fig. 1, but we use a softmax classifier with a cross-entropy loss instead of cosine similarity. We train for 1 epoch because the NLI train set is much bigger than STSb, and we use a batch size of 16.
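One NLI training step then looks as follows; `encoder` (a shared siamese encoder returning pooled sentence embeddings) and `head` (the pair classifier sketched in Section III-C) are hypothetical names for illustration:

```python
import torch
import torch.nn.functional as F

def nli_training_step(encoder, head, batch) -> torch.Tensor:
    """Encode both sentences with the shared encoder, classify the pair,
    and score it against the 3-way label (entailment / contradiction /
    neutral) with cross-entropy."""
    u = encoder(batch["premise"])     # (batch, hidden)
    v = encoder(batch["hypothesis"])  # (batch, hidden)
    logits = head(u, v)               # (batch, 3)
    return F.cross_entropy(logits, batch["label"])
```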
3) NLI + STSb:
After fine-tuning on the NLI dataset, we train on the STS benchmark with a batch size of 32.
TABLE I
EVALUATION ON THE STSb BY FINE-TUNING SENTENCE EMBEDDINGS ON STS, NLI, AND BOTH

Model                                   Spearman (Pearson)
Not fine-tuned
  BERT [CLS]-token embedding            6.43 (1.70)
  BERT avg. pooled token embedding      47.29 (47.91)
  ALBERT [CLS]-token embedding          0.86 (4.57)
  ALBERT avg. pooled token embedding    47.84 (46.57)
Fine-tuned on STSb
  BERT [CLS]-token embedding            12.96 (7.49)
  BERT avg. pooled token embedding      55.76 (54.90)
  SBERT                                 84.66 (84.86)
  CNN-SBERT
  ALBERT [CLS]-token embedding          37.98 (27.89)
  ALBERT avg. pooled token embedding    61.06 (60.41)
  SALBERT                               74.33 (75.26)
  CNN-SALBERT
Fine-tuned on NLI (MultiNLI + SNLI)
  BERT [CLS]-token embedding            32.72 (26.88)
  BERT avg. pooled token embedding      69.57 (68.49)
  SBERT
  CNN-SBERT                             76.77 (75.31)
  ALBERT [CLS]-token embedding          24.87 (4.11)
  ALBERT avg. pooled token embedding    54.21 (53.58)
  SALBERT
  CNN-SALBERT                           73.70 (72.24)
Fine-tuned on NLI (MultiNLI + SNLI) and STSb
  BERT [CLS]-token embedding            44.77 (38.74)
  BERT avg. pooled token embedding      67.61 (65.30)
  SBERT                                 85.32 (84.51)
  CNN-SBERT
  ALBERT [CLS]-token embedding          40.35 (33.46)
  ALBERT avg. pooled token embedding    60.24 (59.98)
  SALBERT                               77.59 (77.82)
  CNN-SALBERT
TABLE II
EVALUATION ON THE GLUE STSb TASK

Model     Spearman (Pearson)
BERT      88.58 (88.89)
ALBERT    90.13 (90.46)
C. Results

1) Effect of fine-tuning:
Table I presents the STS benchmark results. Note that the performance we report is ρ × 100, where ρ is the Spearman's rank or Pearson correlation coefficient. In general, fine-tuning results in better performance than no fine-tuning. Without fine-tuning, the [CLS] token as a sentence embedding gives poor downstream task performance. The quality of sentence embedding reflected in the STSb performance seems to be affected by how related the train sets used for fine-tuning are to the task. We consider the STSb train set, which is directly related to the STSb task. We also consider the NLI (i.e., MultiNLI and SNLI) train sets, which are not directly related to STSb. We have experimented with the following: i) fine-tuning with only the STSb train set, ii) fine-tuning with only the NLI train sets, and iii) fine-tuning with both the NLI and STSb train sets. Fine-tuning with only the STSb train set gives a reasonably good performance, whereas fine-tuning with only the less related NLI train sets yields a suboptimal performance, as expected. Our best STSb results are obtained by fine-tuning with both the STSb and NLI train sets.

TABLE III
EVALUATION ON VARIOUS STS TASKS. NUMBERS REPRESENT SPEARMAN (PEARSON).

Model          STS12           STS13   STS14   STS15   STS16   STSb    Avg.
SBERT                                                                  84.49
CNN-SBERT      69.80 (75.04)                                           84.34
SALBERT
CNN-SALBERT
2) Model comparison:
We expect a more elaborate sentence embedding model to give better performance on STSb. We have found that pooling token embeddings forms a better sentence representation than [CLS]. We have also found that the siamese structure further helps sentence embeddings. Generally, our CNN-based sentence embedding models give the best performance among all the sentence embedding models.
3) Performance of ALBERT:
ALBERT-based sentence embedding models generally achieve lower performance than their BERT counterparts in the STSb evaluations. Before fine-tuning, there is no significant difference between ALBERT and BERT. The gap, however, increases after fine-tuning. Only for the [CLS] token embedding and average-pooled embeddings does ALBERT perform better than BERT when fine-tuned on STSb. SALBERT has much lower performance than SBERT even though they both have the same siamese architecture. This is surprising because ALBERT has a higher score than BERT when evaluated on STSb using GLUE, as shown in Table II. The performance of SALBERT catches up with SBERT when the CNN architecture is applied, but CNN-SALBERT is still slightly inferior to CNN-SBERT.
4) Effect of CNN:
In Table I, we find that the best scores are from the CNN-based models trained on NLI and STSb. According to these scores, the CNN architecture seems to have a positive impact on sentence embedding performance. The CNN architecture, however, improves the ALBERT-based sentence embedding models more than the BERT-based ones. We have found that the improvement by CNN on ALBERT models can be as high as 8 points, compared to 1 point for the BERT models. We have empirically observed that ALBERT exhibits more instability (due to parameter sharing) than BERT. Such instability can be alleviated by the CNN, which is a possible explanation for why adding the CNN improves ALBERT more than BERT.
5) Evaluation of STS12–STS16 Tasks:
In Table III, we present a comprehensive evaluation on the STS tasks from 2012 to 2016 [22]–[26] after fine-tuning with both the NLI and STSb train sets. We show the results of our best two models (i.e., SBERT/SALBERT and CNN-SBERT/CNN-SALBERT). The STSb result is also presented for comparison. The purpose of this evaluation is to verify the improvement by CNN beyond STSb. In general, we see a similar trend in which the CNN architecture improves ALBERT-based sentence embedding models substantially more than BERT-based ones. On average, SBERT embeddings achieve a Spearman's rank correlation of 84.49, while the average for CNN-SBERT is 84.34; the CNN architecture seems to have almost no effect on BERT-based sentence embedding models. On the other hand, the average correlation score of CNN-SALBERT improves by 2 points.

V. CONCLUSION AND FUTURE WORK
In this paper, we have presented an evaluation of BERT and ALBERT sentence embedding models on Semantic Textual Similarity (STS). Knowing the limitations of the [CLS] sentence vector, we facilitate the STS sentence-pair regression task with the siamese and triplet network architecture by Reimers & Gurevych for BERT and ALBERT. We have additionally developed a CNN architecture that takes in the token embeddings to compute a fixed-size sentence vector. Our CNN architecture improves ALBERT models by up to 0.08 (8 points in our ρ × 100 reporting) in Spearman's rank correlation on the STS benchmark, which is substantially larger than the improvement of only 0.01 (1 point) for BERT models. Despite significantly fewer model parameters, ALBERT sentence embedding is highly competitive with BERT in downstream NLP evaluations.

For our future work, we plan to evaluate sentence embeddings with larger ALBERT models, i.e., ALBERT-large and ALBERT-xlarge. (Note that the total number of parameters in ALBERT-xlarge is still smaller than that of BERT-base.) The ALBERT results in this paper are obtained with the number of groups for the hidden layers (num_hidden_groups) set to 1. We also plan to optimize the num_hidden_groups hyperparameter for better performance.

REFERENCES
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is All you Need," in Advances in Neural Information Processing Systems 30, 2017.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv preprint arXiv:1810.04805, 2018.
[3] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations," arXiv preprint arXiv:1909.11942, 2019.
[4] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[5] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, "Supervised learning of universal sentence representations from natural language inference data," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
[7] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
[8] X. Zhu, P. Sobhani, and H. Guo, "Long Short-Term Memory over Recursive Structures," in Proceedings of the 32nd International Conference on Machine Learning, 2015.
[9] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.
[10] T. Munkhdalai and H. Yu, "Neural semantic encoders," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017.
[11] Y. Liu, C. Sun, L. Lin, and X. Wang, "Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention," CoRR, vol. abs/1605.09090, 2016.
[12] H. Choi, K. Cho, and Y. Bengio, "Fine-Grained Attention Mechanism for Neural Machine Translation," CoRR, vol. abs/1803.11407, 2018.
[13] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil, "Universal sentence encoder for English," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018.
[14] F. Hill, K. Cho, and A. Korhonen, "Learning distributed representations of sentences from unlabelled data," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
[15] Y. Yang, S. Yuan, D. Cer, S.-y. Kong, N. Constant, P. Pilar, H. Ge, Y.-H. Sung, B. Strope, and R. Kurzweil, "Learning semantic textual similarity from conversations," in Proceedings of The Third Workshop on Representation Learning for NLP, 2018.
[16] Hugging Face, "Open Source NLP," https://huggingface.co.
[17] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation," in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017.
[18] A. Williams, N. Nangia, and S. Bowman, "A broad-coverage challenge corpus for sentence understanding through inference," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
[19] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, "A large annotated corpus for learning natural language inference," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[20] Special Interest Group on the Lexicon of the Association for Computational Linguistics, "SemEval: International Workshop on Semantic Evaluation," https://semeval.github.io/.
[21] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.
[22] E. Agirre, D. Cer, M. Diab, and A. Gonzalez-Agirre, "SemEval-2012 task 6: A pilot on semantic textual similarity," in *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), 2012.
[23] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo, "*SEM 2013 shared task: Semantic textual similarity," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, 2013.
[24] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2014 task 10: Multilingual semantic textual similarity," in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
[25] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe, "SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability," in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015.
[26] E. Agirre, C. Banea, D. Cer, M. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation," in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016.