Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks
Hyunjin Choi, Judong Kim, Seongho Joe, and Youngjune Gwon
Samsung SDS, Seoul, Korea
Abstract—Contextualized representations from a pre-trained language model are central to achieving high performance on downstream NLP tasks. The pre-trained BERT and A Lite BERT (ALBERT) models can be fine-tuned to give state-of-the-art results in sentence-pair regressions such as semantic textual similarity (STS) and natural language inference (NLI). Although BERT-based models yield the [CLS] token vector as a reasonable sentence embedding, the search for an optimal sentence embedding scheme remains an active research area in computational linguistics. This paper explores sentence embedding models for BERT and ALBERT. In particular, we take a modified BERT network with siamese and triplet network structures called Sentence-BERT (SBERT) and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT). We also experiment with an outer CNN sentence-embedding network for SBERT and SALBERT. We evaluate the performance of all sentence-embedding models considered using the STS and NLI datasets. The empirical results indicate that our CNN architecture improves ALBERT models substantially more than BERT models on the STS benchmark. Despite significantly fewer model parameters, ALBERT sentence embedding is highly competitive with BERT in downstream NLP evaluations.
I. INTRODUCTION
Pre-trained language models have impacted the way modern natural language processing (NLP) applications and systems are built. An important paradigm is to train a language model on large corpora to serve as a platform upon which an NLP application can be built and optimized. Such a platform is shareable and can be distributed. Self-supervised learning with large corpora provides an appropriate starting point: extra task-specific layers are optimized from scratch while the pre-trained model parameters are reused.

Transformer [1], a sequence transduction model based on the attention mechanism, has revolutionized the design of neural encoders for natural language sequences. By skipping any recurrent or convolutional structures, the transformer architecture learns the sequential information in an input solely via attention, thanks to the multi-head self-attention layers in each encoder block. Devlin et al. [2] have proposed Bidirectional Encoder Representations from Transformers (BERT) to improve on the predominantly unidirectional training of language models. By jointly conditioning on both left and right context in all layers, BERT uses the masked language modeling (MLM) loss to make the training of a deep bidirectional language encoder possible. BERT uses an additional pre-training loss known as next-sentence prediction (NSP). NSP is designed to learn high-level linguistic coherence by predicting whether two given text segments appear consecutively in the original text. NSP is expected to improve downstream NLP tasks such as semantic textual similarity (STS) and natural language inference (NLI) that require reasoning about inter-sentence relations.

A Lite BERT (ALBERT) [3] is proposed to scale up language representation learning via parameter reduction techniques. In ALBERT, cross-layer parameter sharing and factorization of the embedding parameters can be thought of as a regularization that helps stabilize training. Furthermore, ALBERT uses an updated self-supervised loss known as sentence-order prediction (SOP) that addresses the ineffectiveness of NSP, which conflates topic prediction and coherence prediction. SOP has been shown to consistently help downstream tasks with multi-sentence inputs.

The pre-training tasks are intrinsic compared to downstream tasks. A key disadvantage of BERT is that no independent sentence embeddings are computed. As a higher means of abstraction, sentence embeddings can play a central role in achieving good downstream performance on tasks like machine reading comprehension (MRC). The specifics of NLP applications are well abstracted by downstream tasks. For this reason, downstream performance is a good indicator of a language model's quality. When pre-trained language models are used for downstream task evaluations, they can generate additional feature representations in addition to serving as a platform for fine-tuning.

In this paper, we are interested in learning sentence representations using out-of-the-box BERT and ALBERT token embeddings. Sentence embedding models are essential for clustering and semantic search, where a sentence input is mapped into a high-dimensional semantic vector space such that sentence vectors with similar meanings are close in distance. NLP researchers have started to input an individual sentence into BERT to derive a fixed-size embedding. A commonly accepted sentence embedding for BERT-based models is the [CLS] token used for the sentence-level pre-training objective (i.e., NSP or SOP). Averaging the representations obtained from the BERT or ALBERT output layer (i.e., the token embeddings) gives an alternative. Using the [CLS] token, which is optimized by an intrinsic task of the pre-training, is considered suboptimal, while the average pooling of token embeddings has limitations of its own. Moreover, it can be time consuming to perform multi-sentence tasks associated with semantic search, summarization, and paraphrasing.

Computing sentence embeddings from contextualized language models is an active, ongoing research problem. In our exploration of more elaborate sentence embedding models, we first consider Sentence-BERT (SBERT) [4], a modified BERT network with siamese and triplet network structures that derives semantically meaningful sentence embeddings. SBERT is computationally efficient and can compare sentences using only cosine similarity at run-time. We then take the SBERT architecture and simply replace BERT with ALBERT to form Sentence-ALBERT (SALBERT). We also apply a convolutional neural net (CNN), instead of average pooling, that takes in the BERT or ALBERT token embedding outputs.

We have evaluated the empirical performance of all sentence embedding models using the STS and NLI datasets. We find that our CNN architecture improves ALBERT models by up to 8 points in Spearman's rank correlation on the STS benchmark, which is substantially larger than the improvement of only 1 point for BERT models. Despite significantly fewer model parameters, ALBERT sentence embedding is highly competitive with BERT in downstream NLP evaluations.

This paper is structured in the following manner. Section II presents related work. Section III describes all sentence embedding models under consideration. In Section IV, we empirically evaluate the sentence embedding models using the STS and NLI datasets. The paper concludes in Section V.

II. RELATED WORK
Language models provide core building blocks for downstream NLP tasks. Task-specific fine-tuning of a pre-trained language model is a contemporary approach to implementing an NLP system. BERT [2] is a pre-trained transformer encoder network [1] fine-tuned to give state-of-the-art results in question answering, sentence classification, and sentence-pair regression. A Lite BERT (ALBERT) [3] incorporates parameter reduction techniques to scale better than BERT. ALBERT is known to improve on inter-sentence coherence with a self-supervised loss from sentence-order prediction (SOP), compared to the next-sentence prediction (NSP) loss in the original BERT.

The BERT network structure contains a special classification token [CLS] as an aggregate sequence representation for NSP. (Similarly for ALBERT, [CLS] is used for SOP.) The [CLS] token therefore can serve as a sentence embedding. Because there are no other independently computed sentence embeddings for BERT and ALBERT, one can average-pool the token embedding outputs to form a fixed-length sentence vector.

Previously, sentence embedding research looked to convolutional and recurrent structures as building blocks. Kim [5] proposed a CNN with max pooling for sentence classification. In Conneau et al. [6], a bidirectional LSTM (BiLSTM) was used as a sentence embedding for natural language inference tasks. Among more complex neural nets, Socher et al. [7] introduced the recursive neural tensor network (RNTN) over parse trees to compute sentence embeddings for sentiment analysis. Zhu et al. [8] and Tai et al. [9] proposed tree-LSTMs, while Munkhdalai & Yu [10] suggested the neural semantic encoder (NSE) based on a memory-augmented neural net.

Recently, sentence embedding research has explored attention mechanisms. Vaswani et al. [1] have proposed Transformer, a self-attention network for the neural sequence-to-sequence task. A self-attention network uses multi-head scaled dot-product attention to represent each word as a weighted sum of all words in the sentence. The idea of self-attention pooling existed before the self-attention network, as in Liu et al. [11], who utilized inner-attention within a sentence to apply pooling for sentence embedding. Choi et al. [12] have developed a fine-grained attention mechanism for neural machine translation, extending scalar attention to vectors.

Complex contextualized sentence encoders are usually pre-trained like language models, but they can be improved by supervised transfer tasks such as natural language inference (NLI). InferSent by Conneau et al. [6] has consistently outperformed unsupervised methods like SkipThought. Universal Sentence Encoder [13] trains a transformer network and augments unsupervised learning with training on the Stanford NLI (SNLI) dataset. Hill et al. [14] show that the task on which sentence embeddings are trained significantly impacts their quality. According to Conneau et al. [6] and Cer et al. [13], the SNLI datasets are suitable for training sentence embeddings. Yang et al. [15] present a method to train a siamese deep averaging network (DAN) and transformer, using conversations from Reddit, yielding good results on the STS benchmark.

In Sentence-BERT (SBERT) [4], a comprehensive evaluation of the pre-trained BERT combined with siamese and triplet network structures is presented.
To alleviate the run-time overhead, SBERT's more elaborate fine-tuning mechanisms, such as softmax on augmented sentence representations and triplet loss, are replaced by cosine similarity at inference. The simple SBERT inference reduces the effort of finding the most similar pair from 65 hours with BERT to about 5 seconds, while hardly impacting accuracy.

III. MODELS
The output of BERT or ALBERT constitutes token embeddings for a given text input. With a large output size (e.g., up to 512 token vectors of 768 dimensions each), the contextualized word embeddings can be fine-tuned for any downstream task. To do sentence-level regressions such as semantic textual similarity (STS), fixed-size sentence embeddings are necessary. In this section, we describe the sentence embedding models for BERT and ALBERT.
A. The [CLS] token embedding
The most straightforward sentence embedding model is the [CLS] vector used to predict sentence-level context (i.e., BERT NSP, ALBERT SOP) during the pre-training. The [CLS] token summarizes the information from the other tokens via a self-attention mechanism that facilitates the intrinsic tasks of the pre-training. By similar reasoning, the [CLS] token can be further optimized while fine-tuning the downstream task. After fine-tuning, the [CLS] token is expected to capture more semantically relevant sentence-level context specific to the downstream task.
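As an illustration, the [CLS] vector can be read directly off the encoder output. The following is a minimal sketch, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint; it is not the exact evaluation code used in this paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("A man is playing a guitar.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, tokens, hidden);
# the [CLS] vector sits at token position 0.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape (1, 768)
```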
B. Pooled token embeddings
Averaging the token embedding output gives our next model. The model works like a pooling layer in a convolutional neural net: average pooling turns the token embeddings into a fixed-length sentence vector. An alternative would use max pooling instead, although max pooling tends to select the most salient features rather than taking a representative summary. In this paper, we choose to go with the average-pooling model.
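A minimal sketch of the average-pooling model follows. Excluding padding positions from the average via the attention mask is our assumption of a sensible implementation detail; the paper does not spell it out.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token embeddings into one fixed-length sentence vector,
    ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, tokens, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid divide-by-zero
    return summed / counts
```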
C. Sentence-BERT (SBERT)
Reimers & Gurevych [4] propose SBERT that modifies apre-trained BERT with siamese and triplet network structuresto derive semantically meaningful sentence embeddings com-parable using only cosine similarity. The siamese architectureis computationally efficient. Note that using a single copy ofpre-trained BERT would require to run all possible combina-tions of sentence pairs from a dataset to form a representationfor sentence pairs. SBERT first average-pools a pair of theBERT embeddings to fixed-size sentence embeddings. Usingthe two sentence embeddings and an element-wise differencebetween them, SBERT can run a softmax layer configured forclassification and regression tasks.
D. Sentence-ALBERT (SALBERT)
Based on ALBERT, SALBERT has the same siamese and triplet networks as SBERT. The siamese network structure used in SBERT and SALBERT is illustrated in Fig. 1.
Fig. 1. Siamese network structure used in SBERT and SALBERT. Each sentence in the pair is encoded by a weight-sharing BERT/ALBERT network followed by average pooling or a CNN, and the resulting sentence embeddings are compared by cosine similarity under an MSE loss.
E. CNN-SBERT
In SBERT, average pooling turns the BERT embeddings into fixed-length sentence vectors. CNN-SBERT instead employs a CNN architecture that takes in the token embeddings and computes a fixed-size sentence embedding through convolutional layers with the hyperbolic tangent activation function, interlaced with pooling layers. In CNN-SBERT, all the pooling layers use max pooling except the final average pooling. The CNN architecture used in CNN-SBERT is described in Fig. 2.
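A sketch of such a CNN head is given below. The kernel sizes, channel count, and the final average pooling follow Fig. 2 where recoverable; the remaining details (padding, pooling strides, number of stacked blocks) are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CNNSentenceHead(nn.Module):
    """Maps token embeddings (B, T, H) to a sentence embedding (B, H):
    3x1 convolutions over the token axis with tanh, interlaced with
    max pooling, then a 1x1 convolution and a final average pooling."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(3, 1), padding=(1, 0)),
            nn.Tanh(),
            nn.MaxPool2d(kernel_size=(2, 1)),  # pool over tokens only
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
            nn.Tanh(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            nn.Conv2d(channels, 1, kernel_size=(1, 1)),  # back to 1 channel
            nn.Tanh(),
        )

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        x = token_embeddings.unsqueeze(1)  # (B, 1, T, H)
        x = self.net(x)                    # (B, 1, T', H)
        return x.mean(dim=2).squeeze(1)    # average over tokens -> (B, H)
```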
F. CNN-SALBERT
Similarly, CNN-SALBERT uses the same CNN architecture as CNN-SBERT.

Fig. 2. CNN architecture used in CNN-SBERT and CNN-SALBERT: 3x1 convolutions (128 channels, tanh) interlaced with max-pooling layers, followed by a 1x1 convolution (1 channel, tanh) and a final average pooling, mapping an input of shape (B, 1, T, H) to a sentence embedding of shape (B, H). B, T, and H denote the mini-batch size, the number of tokens, and the transformer hidden size.
IV. EXPERIMENTS
We evaluate the performance of the sentence embedding models on Semantic Textual Similarity (STS) and Natural Language Inference (NLI) benchmarks. Following the methodology of Reimers & Gurevych [4], we use cosine similarity as the main metric to evaluate the similarity between two sentence embeddings. We compute both the Pearson and Spearman's rank correlation coefficients to indicate how our cosine-similarity estimate and the ground-truth label provided by the datasets are correlated. We use pre-trained BERT and ALBERT models from Hugging Face [16] (https://github.com/huggingface).
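Concretely, the evaluation amounts to correlating cosine similarities with the gold labels. A minimal sketch using NumPy and SciPy (function and variable names are ours):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_sts(emb1: np.ndarray, emb2: np.ndarray, gold: np.ndarray):
    """emb1, emb2: (n_pairs, hidden) sentence embeddings of each pair;
    gold: (n_pairs,) human similarity scores."""
    cos = (emb1 * emb2).sum(axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    spearman, _ = spearmanr(cos, gold)
    pearson, _ = pearsonr(cos, gold)
    return 100 * spearman, 100 * pearson  # reported as rho x 100
```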
A. Datasets and tasks

We fine-tune the BERT and ALBERT sentence embedding models on the Semantic Textual Similarity benchmark (STSb) [17], Multi-Genre Natural Language Inference (MultiNLI) [18], and Stanford Natural Language Inference (SNLI) [19] datasets.

1) Semantic Textual Similarity benchmark: STSb gives a set of English data used for the STS tasks organized in the International Workshop on Semantic Evaluation (SemEval) [20] between 2012 and 2017. The dataset includes 8,628 sentence pairs from image captions, news headlines, and user forums, partitioned into train (5,749), dev (1,500), and test (1,379) sets. The pairs are annotated with a score from 0 to 5 indicating how similar the two sentences are in terms of semantic relatedness.
2) Multi-Genre Natural Language Inference:
The MultiNLI corpus [18] is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The dataset is used to evaluate the entailment classification task. MultiNLI is modeled on the SNLI corpus but differs in its coverage of genres of spoken and written text, which supports a distinctive cross-genre generalization evaluation. Each sentence pair in MultiNLI has a label indicating whether the two sentences stand in a contradiction, entailment, or neutral relation.
3) Stanford Natural Language Inference:
The SNLI corpus [19] contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral for natural language inference (NLI), also known as recognizing textual entailment (RTE). The General Language Understanding Evaluation (GLUE) benchmark [21] recommends the SNLI dataset as auxiliary training data for the MultiNLI task. Conneau et al. [6] and Cer et al. [13] find SNLI suitable for training sentence embeddings to assert reasoning about the semantic relationship between sentences.
B. Training
In our evaluation, we consider only the BERT and ALBERT base models (i.e., multi-head attention over 12 layers) from the transformers package downloaded from Hugging Face [16]. We use the GLUE benchmark to fine-tune the [CLS] token embedding and average-pooled token embedding models. We train all of our models using the Adam optimizer with a linear learning rate warm-up over the first 10% of the training data. We use a learning rate of 2 × 10⁻⁵ for SBERT and SALBERT, as suggested for the original SBERT architecture, and a separate learning rate for CNN-SBERT and CNN-SALBERT. Using the MultiNLI and SNLI data, we optimize SBERT and SALBERT with the 3-way softmax loss.
1) STSb:
To train the STS benchmark task, we use the siamese network shown in Fig. 1. We run 10 training epochs with a batch size of 32.
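The regression objective in Fig. 1 compares the cosine similarity of the two sentence embeddings against the gold score under an MSE loss. A sketch, assuming the gold STSb scores are rescaled from [0, 5] to [0, 1]:

```python
import torch
import torch.nn.functional as F

def stsb_cosine_mse_loss(u: torch.Tensor, v: torch.Tensor,
                         gold: torch.Tensor) -> torch.Tensor:
    """u, v: (batch, hidden) sentence embeddings of each pair;
    gold: (batch,) similarity labels rescaled to [0, 1]."""
    cos = F.cosine_similarity(u, v, dim=-1)  # (batch,)
    return F.mse_loss(cos, gold)
```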
2) NLI (MultiNLI + SNLI):
To train the NLI tasks, we adopt the siamese architecture in Fig. 1, but we use a softmax classifier with a cross-entropy loss instead of cosine similarity. We train for 1 epoch because the NLI train set is much bigger than STSb, and we use a batch size of 16.
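One NLI training step then looks as follows; `encoder` (a shared siamese encoder returning pooled sentence embeddings) and `head` (the pair classifier sketched in Section III-C) are hypothetical names for illustration:

```python
import torch
import torch.nn.functional as F

def nli_training_step(encoder, head, batch) -> torch.Tensor:
    """Encode both sentences with the shared encoder, classify the pair,
    and score it against the 3-way label (entailment / contradiction /
    neutral) with cross-entropy."""
    u = encoder(batch["premise"])     # (batch, hidden)
    v = encoder(batch["hypothesis"])  # (batch, hidden)
    logits = head(u, v)               # (batch, 3)
    return F.cross_entropy(logits, batch["label"])
```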
3) NLI + STSb:
After fine-tuning on the NLI dataset, we train on the STS benchmark with a batch size of 32.
TABLE I
EVALUATION ON THE STSb BY FINE-TUNING SENTENCE EMBEDDINGS ON STS, NLI, AND BOTH

Model                                   Spearman (Pearson)
Not fine-tuned
  BERT [CLS]-token embedding            6.43 (1.70)
  BERT avg. pooled token embedding      47.29 (47.91)
  ALBERT [CLS]-token embedding          0.86 (4.57)
  ALBERT avg. pooled token embedding    47.84 (46.57)
Fine-tuned on STSb
  BERT [CLS]-token embedding            12.96 (7.49)
  BERT avg. pooled token embedding      55.76 (54.90)
  SBERT                                 84.66 (84.86)
  CNN-SBERT
  ALBERT [CLS]-token embedding          37.98 (27.89)
  ALBERT avg. pooled token embedding    61.06 (60.41)
  SALBERT                               74.33 (75.26)
  CNN-SALBERT
Fine-tuned on NLI (MultiNLI + SNLI)
  BERT [CLS]-token embedding            32.72 (26.88)
  BERT avg. pooled token embedding      69.57 (68.49)
  SBERT
  CNN-SBERT                             76.77 (75.31)
  ALBERT [CLS]-token embedding          24.87 (4.11)
  ALBERT avg. pooled token embedding    54.21 (53.58)
  SALBERT
  CNN-SALBERT                           73.70 (72.24)
Fine-tuned on NLI (MultiNLI + SNLI) and STSb
  BERT [CLS]-token embedding            44.77 (38.74)
  BERT avg. pooled token embedding      67.61 (65.30)
  SBERT                                 85.32 (84.51)
  CNN-SBERT
  ALBERT [CLS]-token embedding          40.35 (33.46)
  ALBERT avg. pooled token embedding    60.24 (59.98)
  SALBERT                               77.59 (77.82)
  CNN-SALBERT
TABLE II
EVALUATION ON THE GLUE STSb TASK

Model     Spearman (Pearson)
BERT      88.58 (88.89)
ALBERT    90.13 (90.46)
C. Results

1) Effect of fine-tuning:
Table I presents the STS benchmark results. Note that the performance we report is ρ × 100, where ρ is the Spearman's rank or Pearson correlation coefficient. In general, fine-tuning results in better performance than no fine-tuning. Without fine-tuning, the [CLS] token as a sentence embedding gives poor downstream task performance. The quality of sentence embedding reflected in the STSb performance seems to be affected by how related the train sets used for fine-tuning are to the task. We consider the STSb train set, which is directly related to the STSb task. We also consider the NLI (i.e., MultiNLI and SNLI) train sets, which are not directly related to STSb. We have experimented with the following: i) fine-tuning with only the STSb train set, ii) fine-tuning with only the NLI train sets, and iii) fine-tuning with both the NLI and STSb train sets. Fine-tuning with only the STSb train set gives a reasonably good performance, whereas fine-tuning with only the less related NLI train sets yields a suboptimal performance, as expected. Our best STSb results are obtained by fine-tuning with both the STSb and NLI train sets.

TABLE III
EVALUATION ON VARIOUS STS TASKS. NUMBERS REPRESENT SPEARMAN (PEARSON).

Model          STS12           STS13   STS14   STS15   STS16   STSb    Avg.
SBERT                                                                  84.49
CNN-SBERT      69.80 (75.04)                                           84.34
SALBERT
CNN-SALBERT
2) Model comparison:
We expect a more elaborate sentence embedding model to give better performance on STSb. We have found that pooling token embeddings forms a better sentence representation than [CLS]. We have also found that the siamese structure further helps sentence embeddings. Generally, our CNN-based sentence embedding models give the best performance among all the sentence embedding models.
3) Performance of ALBERT:
ALBERT-based sentence embedding models generally achieve lower performance than their BERT counterparts in the STSb evaluations. Before fine-tuning, there is no significant difference between ALBERT and BERT. The gap, however, increases after fine-tuning. Only for the [CLS] token embedding and average-pooled embeddings does ALBERT perform better than BERT when fine-tuned on STSb. SALBERT has much lower performance than SBERT even though they both have the same siamese architecture. This is surprising because ALBERT has a higher score than BERT when evaluated on STSb using GLUE, as shown in Table II. The performance of SALBERT catches up with SBERT when the CNN architecture is applied, but CNN-SALBERT is still slightly inferior to CNN-SBERT.
4) Effect of CNN:
In Table I, we find that the best scores are from the CNN-based models trained on NLI and STSb. According to these scores, the CNN architecture seems to have a positive impact on sentence embedding performance. The CNN architecture, however, improves the ALBERT-based sentence embedding models more than the BERT-based ones. We have found that the improvement by CNN on ALBERT models can be as high as 8 points, compared to 1 point for the BERT models. We have empirically observed that ALBERT exhibits more instability (due to parameter sharing) than BERT. Such instability can be alleviated by the CNN, which is a possible explanation for why adding the CNN improves ALBERT more than BERT.
5) Evaluation of STS12–STS16 Tasks:
In Table III, we present a comprehensive evaluation on the STS tasks from 2012 to 2016 [22]–[26] after fine-tuning with both the NLI and STSb train sets. We show the results of our best two models (i.e., SBERT/SALBERT and CNN-SBERT/CNN-SALBERT). The STSb result is also presented for comparison. The purpose of this evaluation is to verify the improvement by CNN beyond STSb. In general, we see a similar trend in which the CNN architecture improves ALBERT-based sentence embedding models substantially more than BERT-based ones. On average, SBERT embeddings achieve a Spearman's rank correlation of 84.49, while the average for CNN-SBERT is 84.34; the CNN architecture seems to have almost no effect on BERT-based sentence embedding models. On the other hand, the average correlation score of CNN-SALBERT improves by 2 points.

V. CONCLUSION AND FUTURE WORK
In this paper, we have presented an evaluation of BERT and ALBERT sentence embedding models on Semantic Textual Similarity (STS). Knowing the limitations of the [CLS] sentence vector, we facilitate the STS sentence-pair regression task with the siamese and triplet network architecture by Reimers & Gurevych for BERT and ALBERT. We have additionally developed a CNN architecture that takes in the token embeddings to compute a fixed-size sentence vector. Our CNN architecture improves ALBERT models by up to 0.08 (8 points in our ρ × 100 reporting) in Spearman's rank correlation on the STS benchmark, which is substantially larger than the improvement of only 0.01 (1 point) for BERT models. Despite significantly fewer model parameters, ALBERT sentence embedding is highly competitive with BERT in downstream NLP evaluations.

For our future work, we plan to evaluate sentence embeddings with larger ALBERT models, i.e., ALBERT-large and ALBERT-xlarge. (Note that the total number of parameters in ALBERT-xlarge is still smaller than that of BERT-base.) The ALBERT results in this paper are obtained with the number of groups for the hidden layers (num_hidden_groups) set to 1. We also plan to optimize the num_hidden_groups hyperparameter for better performance.

REFERENCES
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is All you Need," in Advances in Neural Information Processing Systems 30, 2017.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv preprint arXiv:1810.04805, 2018.
[3] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations," arXiv preprint arXiv:1909.11942, 2019.
[4] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
[5] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, "Supervised learning of universal sentence representations from natural language inference data," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
[7] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
[8] X. Zhu, P. Sobhani, and H. Guo, "Long Short-Term Memory over Recursive Structures," in Proceedings of the 32nd International Conference on Machine Learning, 2015.
[9] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.
[10] T. Munkhdalai and H. Yu, "Neural semantic encoders," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017.
[11] Y. Liu, C. Sun, L. Lin, and X. Wang, "Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention," CoRR, vol. abs/1605.09090, 2016.
[12] H. Choi, K. Cho, and Y. Bengio, "Fine-Grained Attention Mechanism for Neural Machine Translation," CoRR, vol. abs/1803.11407, 2018.
[13] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil, "Universal sentence encoder for English," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018.
[14] F. Hill, K. Cho, and A. Korhonen, "Learning distributed representations of sentences from unlabelled data," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
[15] Y. Yang, S. Yuan, D. Cer, S.-y. Kong, N. Constant, P. Pilar, H. Ge, Y.-H. Sung, B. Strope, and R. Kurzweil, "Learning semantic textual similarity from conversations," in Proceedings of The Third Workshop on Representation Learning for NLP, 2018.
[16] Hugging Face, "Open Source NLP," https://huggingface.co.
[17] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation," in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017.
[18] A. Williams, N. Nangia, and S. Bowman, "A broad-coverage challenge corpus for sentence understanding through inference," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
[19] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, "A large annotated corpus for learning natural language inference," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[20] Special Interest Group on the Lexicon of the Association for Computational Linguistics, "SemEval: International Workshop on Semantic Evaluation," https://semeval.github.io/.
[21] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.
[22] E. Agirre, D. Cer, M. Diab, and A. Gonzalez-Agirre, "SemEval-2012 task 6: A pilot on semantic textual similarity," in *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), 2012.
[23] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo, "*SEM 2013 shared task: Semantic textual similarity," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, 2013.
[24] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2014 task 10: Multilingual semantic textual similarity," in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
[25] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe, "SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability," in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015.
[26] E. Agirre, C. Banea, D. Cer, M. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe, "SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation," in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016.