Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
Abstract
BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, they require that both sentences are fed into the network, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.

In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.

We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods.

In this publication, we present Sentence-BERT (SBERT), a modification of the BERT network using siamese and triplet networks that is able to derive semantically meaningful sentence embeddings. (Code available: https://github.com/UKPLab/sentence-transformers. With "semantically meaningful" we mean that semantically similar sentences are close in vector space.) This enables BERT to be used for certain new tasks which, up to now, were not applicable for BERT. These tasks include large-scale semantic similarity comparison, clustering, and information retrieval via semantic search.

BERT set new state-of-the-art performance on various sentence classification and sentence-pair regression tasks. BERT uses a cross-encoder: two sentences are passed to the transformer network and the target value is predicted. However, this setup is unsuitable for various pair regression tasks due to the large number of possible combinations. Finding the pair with the highest similarity in a collection of n = 10,000 sentences requires n · (n − 1)/2 = 49,995,000 inference computations with BERT. On a modern V100 GPU, this requires about 65 hours. Similarly, finding which of the over 40 million existing questions on Quora is the most similar to a new question could be modeled as a pair-wise comparison with BERT; however, answering a single query would require over 50 hours.

A common method to address clustering and semantic search is to map each sentence to a vector space such that semantically similar sentences are close. Researchers have started to input individual sentences into BERT and to derive fixed-size sentence embeddings. The most commonly used approaches are to average the BERT output layer (known as BERT embeddings) or to use the output of the first token (the [CLS] token). As we will show, this common practice yields rather bad sentence embeddings, often worse than averaging GloVe embeddings (Pennington et al., 2014).

To alleviate this issue, we developed SBERT. The siamese network architecture enables fixed-sized vectors for input sentences to be derived. Using a similarity measure like cosine-similarity or Manhattan / Euclidean distance, semantically similar sentences can be found. These similarity measures can be computed extremely efficiently on modern hardware, allowing SBERT to be used for semantic similarity search as well as for clustering. The complexity for finding the
most similar sentence pair in a collection of 10,000 sentences is reduced from 65 hours with BERT to the computation of 10,000 sentence embeddings (~5 seconds with SBERT) and computing cosine-similarity (~0.01 seconds). By using optimized index structures, finding the most similar Quora question can be reduced from 50 hours to a few milliseconds (Johnson et al., 2017).

We fine-tune SBERT on NLI data, which creates sentence embeddings that significantly outperform other state-of-the-art sentence embedding methods like InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018). On seven Semantic Textual Similarity (STS) tasks, SBERT achieves an improvement of 11.7 points compared to InferSent and 5.5 points compared to Universal Sentence Encoder. On SentEval (Conneau and Kiela, 2018), an evaluation toolkit for sentence embeddings, we achieve an improvement of 2.1 and 2.6 points, respectively.

SBERT can be adapted to a specific task. It sets new state-of-the-art performance on a challenging argument similarity dataset (Misra et al., 2016) and on a triplet dataset to distinguish sentences from different sections of a Wikipedia article (Dor et al., 2018).

The paper is structured in the following way: Section 3 presents SBERT, Section 4 evaluates SBERT on common STS tasks and on the challenging Argument Facet Similarity (AFS) corpus (Misra et al., 2016). Section 5 evaluates SBERT on SentEval. In Section 6, we perform an ablation study to test some design aspects of SBERT. In Section 7, we compare the computational efficiency of SBERT sentence embeddings to other state-of-the-art sentence embedding methods.

We first introduce BERT; then, we discuss state-of-the-art sentence embedding methods.

BERT (Devlin et al., 2018) is a pre-trained transformer network (Vaswani et al., 2017), which set new state-of-the-art results for various NLP tasks, including question answering, sentence classification, and sentence-pair regression. The input for BERT for sentence-pair regression consists of the two sentences, separated by a special [SEP] token. Multi-head attention over 12 (base-model) or 24 layers (large-model) is applied and the output is passed to a simple regression function to derive the final label. Using this setup, BERT set a new state-of-the-art performance on the Semantic Textual Similarity (STS) benchmark (Cer et al., 2017). RoBERTa (Liu et al., 2019) showed that the performance of BERT can be further improved by small adaptations to the pre-training process. We also tested XLNet (Yang et al., 2019), but it led in general to worse results than BERT.

A large disadvantage of the BERT network structure is that no independent sentence embeddings are computed, which makes it difficult to derive sentence embeddings from BERT. To bypass this limitation, researchers have passed single sentences through BERT and then derived a fixed-sized vector by either averaging the outputs (similar to average word embeddings) or by using the output of the special
CLS token (for example: May et al. (2019); Zhang et al. (2019); Qiao et al. (2019)). These two options are also provided by the popular bert-as-a-service repository (https://github.com/hanxiao/bert-as-service/). To the best of our knowledge, there is so far no evaluation of whether these methods lead to useful sentence embeddings.

Sentence embeddings are a well studied area with dozens of proposed methods. Skip-Thought (Kiros et al., 2015) trains an encoder-decoder architecture to predict the surrounding sentences. InferSent (Conneau et al., 2017) uses labeled data of the Stanford Natural Language Inference dataset (Bowman et al., 2015) and the Multi-Genre NLI dataset (Williams et al., 2018) to train a siamese BiLSTM network with max-pooling over the output. Conneau et al. showed that InferSent consistently outperforms unsupervised methods like SkipThought. Universal Sentence Encoder (Cer et al., 2018) trains a transformer network and augments unsupervised learning with training on SNLI. Hill et al. (2016) showed that the task on which sentence embeddings are trained significantly impacts their quality. Previous work (Conneau et al., 2017; Cer et al., 2018) found that the SNLI datasets are suitable for training sentence embeddings. Yang et al. (2018) presented a method to train on conversations from Reddit using siamese DAN and siamese transformer networks, which yielded good results on the STS benchmark dataset.

Humeau et al. (2019) address the run-time overhead of the cross-encoder from BERT and present a method (poly-encoders) to compute a score between m context vectors and pre-
Figure 1: SBERT architecture with classification objective function, e.g., for fine-tuning on SNLI dataset. The two BERT networks have tied weights (siamese network structure).

computed candidate embeddings using attention. This idea works for finding the highest scoring sentence in a larger collection. However, poly-encoders have the drawback that the score function is not symmetric and the computational overhead is too large for use-cases like clustering, which would require O(n²) score computations.

Previous neural sentence embedding methods started the training from a random initialization. In this publication, we use the pre-trained BERT and RoBERTa networks and only fine-tune them to yield useful sentence embeddings. This significantly reduces the needed training time: SBERT can be tuned in less than 20 minutes, while yielding better results than comparable sentence embedding methods.

SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-sized sentence embedding. We experiment with three pooling strategies: using the output of the
CLS-token, computing the mean of all output vectors (MEAN-strategy), and computing a max-over-time of the output vectors (MAX-strategy). The default configuration is MEAN.

In order to fine-tune BERT / RoBERTa, we create siamese and triplet networks (Schroff et al., 2015) to update the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine-similarity.

The network structure depends on the available
Figure 2: SBERT architecture at inference, for example, to compute similarity scores. This architecture is also used with the regression objective function.

training data. We experiment with the following structures and objective functions.
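Before turning to the objective functions, the pooling step described above can be illustrated with a minimal sketch. This is not the implementation from the released repository; the function name and tensor shapes are our own assumptions:

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    """MEAN pooling: average the token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden) output of BERT / RoBERTa
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)     # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens
    return summed / counts                            # (batch, hidden)
```

A MAX strategy would instead take a masked maximum over the sequence dimension, and the CLS strategy would simply return the embedding of the first token.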
Classification Objective Function.
We concatenate the sentence embeddings u and v with the element-wise difference |u − v| and multiply it with the trainable weight W_t ∈ R^(3n × k):

o = softmax(W_t(u, v, |u − v|))

where n is the dimension of the sentence embeddings and k the number of labels. We optimize cross-entropy loss. This structure is depicted in Figure 1.
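As a rough, non-authoritative sketch of this objective (the module and variable names are ours; W_t corresponds to the linear layer below):

```python
import torch
import torch.nn as nn

class SoftmaxHead(nn.Module):
    """Classification objective: softmax over W_t applied to (u, v, |u - v|)."""

    def __init__(self, embedding_dim, num_labels):
        super().__init__()
        # W_t in R^(3n x k): the features are the concatenation (u, v, |u - v|)
        self.W_t = nn.Linear(3 * embedding_dim, num_labels, bias=False)
        self.loss_fn = nn.CrossEntropyLoss()  # cross-entropy over the softmax output

    def forward(self, u, v, labels):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        logits = self.W_t(features)
        return self.loss_fn(logits, labels)
```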
Regression Objective Function.
The cosine-similarity between the two sentence embeddings u and v is computed (Figure 2). We use mean-squared-error loss as the objective function.
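A minimal sketch of this objective, under the assumption that u, v, and the gold similarity scores arrive as batched tensors (the function name is ours):

```python
import torch.nn.functional as F

def regression_loss(u, v, gold_similarity):
    """Regression objective: mean-squared error between cos(u, v) and the gold score."""
    cosine = F.cosine_similarity(u, v, dim=-1)   # values in [-1, 1]
    return F.mse_loss(cosine, gold_similarity)
```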
Triplet Objective Function.
Given an anchor sentence a, a positive sentence p, and a negative sentence n, triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n. Mathematically, we minimize the following loss function:

max(||s_a − s_p|| − ||s_a − s_n|| + ε, 0)

with s_x the sentence embedding for a / n / p, ||·|| a distance metric, and margin ε. Margin ε ensures that s_p is at least ε closer to s_a than s_n. As metric we use Euclidean distance and we set ε = 1 in our experiments.

We train SBERT on the combination of the SNLI (Bowman et al., 2015) and the Multi-Genre NLI
Model                        STS12  STS13  STS14  STS15  STS16  STSb   SICK-R  Avg.
Avg. GloVe embeddings        55.14  70.66  59.73  68.25  63.66  58.02  53.76   61.32
Avg. BERT embeddings         38.78  57.98  57.98  63.15  61.06  46.35  58.40   54.81
BERT CLS-vector              20.16  30.01  20.09  36.88  38.08  16.50  42.63   29.19
InferSent - GloVe            52.86  66.75  62.15  72.77  66.87  68.03  65.65   65.01
Universal Sentence Encoder   64.49  67.80  64.61  76.83  73.18  74.92
Table 1: Spearman rank correlation ρ between the cosine similarity of sentence representations and the gold labels for various Textual Similarity (STS) tasks. Performance is reported by convention as ρ × 100. STS12-STS16: SemEval 2012-2016, STSb: STSbenchmark, SICK-R: SICK relatedness dataset.

(Williams et al., 2018) dataset. The SNLI is a collection of 570,000 sentence pairs annotated with the labels contradiction, entailment, and neutral. MultiNLI contains 430,000 sentence pairs and covers a range of genres of spoken and written text. We fine-tune SBERT with a 3-way softmax-classifier objective function for one epoch. We used a batch-size of 16, Adam optimizer with learning rate 2e−5, and a linear learning rate warm-up over 10% of the training data. Our default pooling strategy is MEAN.

We evaluate the performance of SBERT for common Semantic Textual Similarity (STS) tasks. State-of-the-art methods often learn a (complex) regression function that maps sentence embeddings to a similarity score. However, these regression functions work pair-wise and, due to the combinatorial explosion, they are often not scalable if the collection of sentences reaches a certain size. Instead, we always use cosine-similarity to compare the similarity between two sentence embeddings. We also ran our experiments with negative Manhattan and negative Euclidean distances as similarity measures, but the results for all approaches remained roughly the same.
We evaluate the performance of SBERT for STS without using any STS specific training data. We use the STS tasks 2012 - 2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS benchmark (Cer et al., 2017), and the SICK-Relatedness dataset (Marelli et al., 2014). These datasets provide labels between 0 and 5 on the semantic relatedness of sentence pairs. We showed in (Reimers et al., 2016) that Pearson correlation is badly suited for STS. Instead, we compute the Spearman's rank correlation between the cosine-similarity of the sentence embeddings and the gold labels. The setup for the other sentence embedding methods is equivalent: the similarity is computed by cosine-similarity. The results are depicted in Table 1.

The results show that directly using the output of BERT leads to rather poor performances. Averaging the BERT embeddings achieves an average correlation of only 54.81, and using the
CLS-token output only achieves an average correlation of 29.19. Both are worse than computing average GloVe embeddings.

Using the described siamese network structure and fine-tuning mechanism substantially improves the correlation, outperforming both InferSent and Universal Sentence Encoder. The only dataset where SBERT performs worse than Universal Sentence Encoder is SICK-R. Universal Sentence Encoder was trained on various datasets, including news, question-answer pages and discussion forums, which appears to be more suitable to the data of SICK-R. In contrast, SBERT was pre-trained only on Wikipedia (via BERT) and on NLI data.

While RoBERTa was able to improve the performance for several supervised tasks, we only observe minor differences between SBERT and SRoBERTa for generating sentence embeddings.
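For concreteness, the evaluation protocol behind Table 1 (cosine-similarity of embeddings scored against the gold labels with Spearman's ρ) can be sketched as follows; the helper name and array shapes are assumptions, not code from the paper:

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb_a, emb_b, gold_scores):
    """Spearman's rank correlation between cosine similarities and gold labels.

    emb_a, emb_b: (num_pairs, dim) embeddings of the left / right sentences
    gold_scores:  (num_pairs,) human similarity judgements (e.g. 0 to 5)
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cosine = (a * b).sum(axis=1)
    rho, _ = spearmanr(cosine, gold_scores)
    return 100 * rho   # reported as rho x 100, as in Table 1
```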
The STS benchmark (STSb) (Cer et al., 2017) is a popular dataset to evaluate supervised STS systems. The data includes 8,628 sentence pairs from the three categories captions, news, and forums. It is divided into train (5,749), dev (1,500) and test (1,379). BERT set a new state-of-the-art performance on this dataset by passing both sentences to the network and using a simple regression method for the output.
Model                            Spearman
Not trained for STS
Avg. GloVe embeddings            58.02
Avg. BERT embeddings             46.35
InferSent - GloVe                68.03
Universal Sentence Encoder       74.92
SBERT-NLI-base                   77.03
SBERT-NLI-large                  79.23
Trained on STS benchmark dataset
BERT-STSb-base                   84.30
Trained on NLI data + STS benchmark data
BERT-NLI-STSb-base

Table 2: Evaluation on the STS benchmark test set. BERT systems were trained with 10 random seeds and 4 epochs. SBERT was fine-tuned on the STSb dataset, SBERT-NLI was pretrained on the NLI datasets, then fine-tuned on the STSb dataset.
We use the training set to fine-tune SBERT using the regression objective function. At prediction time, we compute the cosine-similarity between the sentence embeddings. All systems are trained with 10 random seeds to counter variances (Reimers and Gurevych, 2018).

The results are depicted in Table 2. We experimented with two setups: only training on STSb, and first training on NLI, then training on STSb. We observe that the latter strategy leads to a slight improvement of 1-2 points. This two-step approach had an especially large impact for the BERT cross-encoder, which improved the performance by 3-4 points. We do not observe a significant difference between BERT and RoBERTa.
We evaluate SBERT on the Argument Facet Similarity (AFS) corpus by Misra et al. (2016). The AFS corpus contains 6,000 sentential argument pairs from social media dialogs on three controversial topics: gun control, gay marriage, and death penalty. The data was annotated on a scale from 0 ("different topic") to 5 ("completely equivalent"). The similarity notion in the AFS corpus is fairly different from the similarity notion in the STS datasets from SemEval. STS data is usually descriptive, while AFS data are argumentative excerpts from dialogs. To be considered similar, arguments must not only make similar claims, but also provide a similar reasoning. Further, the lexical gap between the sentences in AFS is much larger. Hence, simple unsupervised methods as well as state-of-the-art STS systems perform badly on this dataset (Reimers et al., 2019).

We evaluate SBERT on this dataset in two scenarios: 1) As proposed by Misra et al., we evaluate SBERT using 10-fold cross-validation. A drawback of this evaluation setup is that it is not clear how well approaches generalize to different topics. Hence, 2) we evaluate SBERT in a cross-topic setup. Two topics serve for training and the approach is evaluated on the left-out topic. We repeat this for all three topics and average the results.

SBERT is fine-tuned using the Regression Objective Function. The similarity score is computed using cosine-similarity based on the sentence embeddings. We also provide the Pearson correlation r to make the results comparable to Misra et al. However, we showed (Reimers et al., 2016) that Pearson correlation has some serious drawbacks and should be avoided for comparing STS systems. The results are depicted in Table 3.

Unsupervised methods like tf-idf, average GloVe embeddings or InferSent perform rather badly on this dataset with low scores. Training SBERT in the 10-fold cross-validation setup gives a performance that is nearly on-par with BERT. However, in the cross-topic evaluation, we observe a performance drop of SBERT by about 7 points Spearman correlation. To be considered similar, arguments should address the same claims and provide the same reasoning. BERT is able to use attention to compare both sentences directly (e.g. word-by-word comparison), while SBERT must map individual sentences from an unseen topic to a vector space such that arguments with similar claims and reasons are close. This is a much more challenging task, which appears to require more than just two topics for training to work on-par with BERT.

Dor et al. (2018) use Wikipedia to create a thematically fine-grained train, dev and test set for sentence embedding methods. Wikipedia articles are separated into distinct sections focusing on certain aspects. Dor et al. assume that sentences
Model                      r      ρ
Unsupervised methods
tf-idf                     46.77  42.95
Avg. GloVe embeddings      32.40  34.00
InferSent - GloVe          27.08  26.63
10-fold Cross-Validation
SVR (Misra et al., 2016)   63.33  -
BERT-AFS-base              77.20  74.84
SBERT-AFS-base             76.57  74.13
BERT-AFS-large             78.68  76.38
SBERT-AFS-large            77.85  75.93
Cross-Topic Evaluation
BERT-AFS-base              58.49  57.23
SBERT-AFS-base             52.34  50.65
BERT-AFS-large             62.02  60.34
SBERT-AFS-large            53.82  53.10

Table 3: Average Pearson correlation r and average Spearman's rank correlation ρ on the Argument Facet Similarity (AFS) corpus (Misra et al., 2016). Misra et al. propose 10-fold cross-validation. We additionally evaluate in a cross-topic scenario: methods are trained on two topics, and are evaluated on the third topic.

in the same section are thematically closer than sentences in different sections. They use this to create a large dataset of weakly labeled sentence triplets: the anchor and the positive example come from the same section, while the negative example comes from a different section of the same article. For example, from the Alice Arnold article: Anchor: Arnold joined the BBC Radio Drama Company in 1988., positive:
Arnold gained media attention in May 2012., negative:
Balding and Arnold are keen amateur golfers.
We use the dataset from Dor et al. We use the Triplet Objective, train SBERT for one epoch on about 1.8 million training triplets and evaluate it on the 222,957 test triplets. Test triplets are from a distinct set of Wikipedia articles. As evaluation metric, we use accuracy: is the positive example closer to the anchor than the negative example?

Results are presented in Table 4. Dor et al. fine-tuned a BiLSTM architecture with triplet loss to derive sentence embeddings for this dataset. As the table shows, SBERT clearly outperforms the BiLSTM approach by Dor et al.
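A minimal sketch of the triplet objective from Section 3 and the accuracy metric used here (Euclidean distance, margin ε = 1; the function names are ours, not the released code):

```python
import torch
import torch.nn.functional as F

def triplet_loss(s_a, s_p, s_n, margin=1.0):
    """max(||s_a - s_p|| - ||s_a - s_n|| + margin, 0) with Euclidean distance."""
    d_pos = F.pairwise_distance(s_a, s_p, p=2)
    d_neg = F.pairwise_distance(s_a, s_n, p=2)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

def triplet_accuracy(s_a, s_p, s_n):
    """Fraction of triplets where the positive is closer to the anchor than the negative."""
    d_pos = F.pairwise_distance(s_a, s_p, p=2)
    d_neg = F.pairwise_distance(s_a, s_n, p=2)
    return (d_pos < d_neg).float().mean().item()
```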
SentEval (Conneau and Kiela, 2018) is a popular toolkit to evaluate the quality of sentence embeddings. Sentence embeddings are used as features for a logistic regression classifier. The logistic regression classifier is trained on various tasks in a 10-fold cross-validation setup and the prediction accuracy is computed for the test-fold.
Model                     Accuracy
mean-vectors              0.65
skip-thoughts-CS          0.62
Dor et al.                0.74
SBERT-WikiSec-base        0.8042
SBERT-WikiSec-large
SRoBERTa-WikiSec-base     0.7945
SRoBERTa-WikiSec-large    0.7973

Table 4: Evaluation on the Wikipedia section triplets dataset (Dor et al., 2018). SBERT trained with triplet loss for one epoch.
The purpose of SBERT sentence embeddings is not to be used for transfer learning for other tasks. Here, we think fine-tuning BERT as described by Devlin et al. (2018) for new tasks is the more suitable method, as it updates all layers of the BERT network. However, SentEval can still give an impression of the quality of our sentence embeddings for various tasks.

We compare the SBERT sentence embeddings to other sentence embedding methods on the following seven SentEval transfer tasks:

• MR: Sentiment prediction for movie review snippets on a five-star scale (Pang and Lee, 2005).
• CR: Sentiment prediction of customer product reviews (Hu and Liu, 2004).
• SUBJ: Subjectivity prediction of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
• MPQA: Phrase-level opinion polarity classification from newswire (Wiebe et al., 2005).
• SST: Stanford Sentiment Treebank with binary labels (Socher et al., 2013).
• TREC: Fine-grained question-type classification from TREC (Li and Roth, 2002).
• MRPC: Microsoft Research Paraphrase Corpus from parallel news sources (Dolan et al., 2004).

The results can be found in Table 5. SBERT is able to achieve the best performance in 5 out of 7 tasks. The average performance increases by about 2 percentage points compared to InferSent as well as the Universal Sentence Encoder. Even though transfer learning is not the purpose of SBERT, it outperforms other state-of-the-art sentence embedding methods on this task.
Model                      MR     CR     SUBJ   MPQA   SST    TREC  MRPC   Avg.
Avg. GloVe embeddings      77.25  78.30  91.17  87.85  80.18  83.0  72.87  81.52
Avg. fast-text embeddings  77.96  79.23  91.68  87.81  82.15  83.6  74.49  82.42
Avg. BERT embeddings       78.66  86.25  94.37  88.66  84.40  92.8  69.45  84.94
BERT CLS-vector            78.68  84.85  94.21  88.23  84.13  91.4  71.13  84.66
InferSent - GloVe          81.57  86.54  92.50
Table 5: Evaluation of SBERT sentence embeddings using the SentEval toolkit. SentEval evaluates sentence embeddings on different sentence classification tasks by training a logistic regression classifier using the sentence embeddings as features. Scores are based on a 10-fold cross-validation.
It appears that the sentence embeddings from SBERT capture sentiment information well: we observe large improvements for all sentiment tasks (MR, CR, and SST) from SentEval in comparison to InferSent and Universal Sentence Encoder.

The only dataset where SBERT is significantly worse than Universal Sentence Encoder is the TREC dataset. Universal Sentence Encoder was pre-trained on question-answering data, which appears to be beneficial for the question-type classification task of the TREC dataset.

Average BERT embeddings or using the
CLS-token output from a BERT network achieved bad results for various STS tasks (Table 1), worse than average GloVe embeddings. However, for SentEval, average BERT embeddings and the BERT
CLS-token output achieve decent results (Table 5), outperforming average GloVe embeddings. The reason for this lies in the different setups. For the STS tasks, we used cosine-similarity to estimate the similarities between sentence embeddings. Cosine-similarity treats all dimensions equally. In contrast, SentEval fits a logistic regression classifier to the sentence embeddings. This allows certain dimensions to have a higher or lower impact on the classification result.

We conclude that average BERT embeddings /
CLS-token output from BERT yield sentence embeddings that are unsuitable to be used with cosine-similarity or with Manhattan / Euclidean distance. For transfer learning, they yield slightly worse results than InferSent or Universal Sentence Encoder. However, using the described fine-tuning setup with a siamese network structure on NLI datasets yields sentence embeddings that achieve a new state-of-the-art for the SentEval toolkit.
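Conceptually, the SentEval protocol can be approximated by the following sketch. This is not the SentEval code itself; the function name and the use of scikit-learn are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def senteval_style_accuracy(embeddings, labels):
    """Mean 10-fold cross-validation accuracy of a logistic regression classifier
    trained on frozen sentence embeddings.

    embeddings: (num_sentences, dim) array of features from any sentence encoder
    labels:     (num_sentences,) array of task labels, e.g. sentiment classes
    """
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, embeddings, labels, cv=10, scoring="accuracy")
    return scores.mean()
```

Because the classifier can re-weight individual dimensions, this setup is far more forgiving of poorly scaled embedding spaces than the cosine-similarity evaluation used for the STS tasks.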
We have demonstrated strong empirical results for the quality of SBERT sentence embeddings. In this section, we perform an ablation study of different aspects of SBERT in order to get a better understanding of their relative importance.

We evaluated different pooling strategies (MEAN, MAX, and CLS). For the classification objective function, we evaluate different concatenation methods. For each possible configuration, we train SBERT with 10 different random seeds and average the performances.

The objective function (classification vs. regression) depends on the annotated dataset. For the classification objective function, we train SBERT-base on the SNLI and the Multi-NLI dataset. For the regression objective function, we train on the training set of the STS benchmark dataset. Performances are measured on the development split of the STS benchmark dataset. Results are shown in Table 6.
                         NLI    STSb
Pooling Strategy
MEAN
MAX
CLS
Concatenation
(u, v)
(|u − v|)
(u ∗ v)
(|u − v|, u ∗ v)
(u, v, u ∗ v)
(u, v, |u − v|)
(u, v, |u − v|, u ∗ v)

Table 6: SBERT trained on NLI data with the classification objective function, on the STS benchmark (STSb) with the regression objective function. Configurations are evaluated on the development set of the STSb using cosine-similarity and Spearman's rank correlation. For the concatenation methods, we only report scores with MEAN pooling strategy.
When trained with the classification objective function on NLI data, the pooling strategy has a rather minor impact. The impact of the concatenation mode is much larger. InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018) both use (u, v, |u − v|, u ∗ v) as input for a softmax classifier. However, in our architecture, adding the element-wise u ∗ v decreased the performance.

The most important component is the element-wise difference |u − v|. Note that the concatenation mode is only relevant for training the softmax classifier. At inference, when predicting similarities for the STS benchmark dataset, only the sentence embeddings u and v are used in combination with cosine-similarity. The element-wise difference measures the distance between the dimensions of the two sentence embeddings, ensuring that similar pairs are closer and dissimilar pairs are further apart.

When trained with the regression objective function, we observe that the pooling strategy has a large impact. There, the MAX strategy performs significantly worse than MEAN or CLS-token strategy. This is in contrast to (Conneau et al., 2017), who found it beneficial for the BiLSTM-layer of InferSent to use MAX instead of MEAN pooling.
Sentence embeddings potentially need to be computed for millions of sentences, hence a high computation speed is desired. In this section, we compare SBERT to average GloVe embeddings, InferSent (Conneau et al., 2017), and Universal Sentence Encoder (Cer et al., 2018).

For our comparison we use the sentences from the STS benchmark (Cer et al., 2017). We compute average GloVe embeddings using a simple for-loop with python dictionary lookups and NumPy. InferSent (https://github.com/facebookresearch/InferSent) is based on PyTorch. For Universal Sentence Encoder, we use the TensorFlow Hub version (https://tfhub.dev/google/universal-sentence-encoder-large/3), which is based on TensorFlow. SBERT is based on PyTorch. For improved computation of sentence embeddings, we implemented a smart batching strategy: sentences with similar lengths are grouped together and are only padded to the longest element in a mini-batch. This drastically reduces computational overhead from padding tokens.

Performances were measured on a server with Intel i7-5820K CPU @ 3.30GHz, Nvidia Tesla V100 GPU, CUDA 9.2 and cuDNN. The results are depicted in Table 7.
Model                         CPU    GPU
Avg. GloVe embeddings         6469   -
InferSent                     137    1876
Universal Sentence Encoder    67     1318
SBERT-base                    44     1378
SBERT-base - smart batching   83     2042

Table 7: Computation speed (sentences per second) of sentence embedding methods. Higher is better.
On CPU, InferSent is about 65% faster than SBERT. This is due to the much simpler network architecture. InferSent uses a single BiLSTM layer, while BERT uses 12 stacked transformer layers. However, an advantage of transformer networks is the computational efficiency on GPUs. There, SBERT with smart batching is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. Smart batching achieves a speed-up of 89% on CPU and 48% on GPU. Average GloVe embeddings is obviously by a large margin the fastest method to compute sentence embeddings.
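The smart batching strategy described above could look roughly like the following sketch. This is a simplified illustration, not the repository's implementation; the tokenizer interface, padding id, and batch size are assumptions:

```python
def smart_batches(sentences, tokenize, batch_size=32, pad_id=0):
    """Group sentences of similar length so each mini-batch is only padded
    to the longest sequence it actually contains."""
    token_ids = [tokenize(s) for s in sentences]
    # Sort indices by sequence length so neighbouring sentences have similar lengths.
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        max_len = max(len(token_ids[i]) for i in batch_idx)
        # Pad every sequence only up to the longest element of this batch.
        batch = [token_ids[i] + [pad_id] * (max_len - len(token_ids[i]))
                 for i in batch_idx]
        yield batch_idx, batch
```

Because padding tokens still pass through every transformer layer, shortening the padded length of most batches directly reduces wasted computation.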
We showed that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity. The performance for seven STS tasks was below the performance of average GloVe embeddings.

To overcome this shortcoming, we presented Sentence-BERT (SBERT). SBERT fine-tunes BERT in a siamese / triplet network architecture. We evaluated the quality on various common benchmarks, where it could achieve a significant improvement over state-of-the-art sentence embedding methods. Replacing BERT with RoBERTa did not yield a significant improvement in our experiments.

SBERT is computationally efficient. On a GPU, it is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. SBERT can be used for tasks which are computationally not feasible to be modeled with BERT. For example, clustering of 10,000 sentences with hierarchical clustering requires about 65 hours with BERT, as around 50 million sentence combinations must be computed. With SBERT, we were able to reduce the effort to about 5 seconds.

Acknowledgments
This work has been supported by the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1 and grant GU 798/17-1). It has been co-funded by the German Federal Ministry of Education and Research (BMBF) under the promotional references 03VP02540 (ArgumenText).
References
Eneko Agirre, Carmen Banea, Claire Cardie, DanielCer, Mona Diab, Aitor Gonzalez-Agirre, WeiweiGuo, Inigo Lopez-Gazpio, Montse Maritxalar, RadaMihalcea, German Rigau, Larraitz Uria, and JanyceWiebe. 2015. SemEval-2015 Task 2: Semantic Tex-tual Similarity, English, Spanish and Pilot on Inter-pretability. In
Proceedings of the 9th InternationalWorkshop on Semantic Evaluation (SemEval 2015) ,pages 252–263, Denver, Colorado. Association forComputational Linguistics.Eneko Agirre, Carmen Banea, Claire Cardie, DanielCer, Mona Diab, Aitor Gonzalez-Agirre, WeiweiGuo, Rada Mihalcea, German Rigau, and JanyceWiebe. 2014. SemEval-2014 Task 10: MultilingualSemantic Textual Similarity. In
Proceedings of the8th International Workshop on Semantic Evaluation(SemEval 2014) , pages 81–91, Dublin, Ireland. As-sociation for Computational Linguistics.Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T.Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, Ger-man Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Mono-lingual and Cross-Lingual Evaluation. In
Proceed-ings of the 10th International Workshop on Seman-tic Evaluation, SemEval@NAACL-HLT 2016, SanDiego, CA, USA, June 16-17, 2016 , pages 497–511.Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 sharedtask: Semantic Textual Similarity. In
Second JointConference on Lexical and Computational Seman-tics (*SEM), Volume 1: Proceedings of the MainConference and the Shared Task: Semantic TextualSimilarity , pages 32–43, Atlanta, Georgia, USA. As-sociation for Computational Linguistics.Eneko Agirre, Mona Diab, Daniel Cer, and AitorGonzalez-Agirre. 2012. SemEval-2012 Task 6: APilot on Semantic Textual Similarity. In
Proceed-ings of the First Joint Conference on Lexical andComputational Semantics - Volume 1: Proceedingsof the Main Conference and the Shared Task, andVolume 2: Proceedings of the Sixth InternationalWorkshop on Semantic Evaluation , SemEval ’12,pages 385–393, Stroudsburg, PA, USA. Associationfor Computational Linguistics. Samuel R. Bowman, Gabor Angeli, Christopher Potts,and Christopher D. Manning. 2015. A large anno-tated corpus for learning natural language inference.In
Proceedings of the 2015 Conference on Empiri-cal Methods in Natural Language Processing , pages632–642, Lisbon, Portugal. Association for Compu-tational Linguistics.Daniel Cer, Mona Diab, Eneko Agirre, Iigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017Task 1: Semantic Textual Similarity Multilingualand Crosslingual Focused Evaluation. In
Proceed-ings of the 11th International Workshop on SemanticEvaluation (SemEval-2017) , pages 1–14, Vancou-ver, Canada.Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua,Nicole Limtiaco, Rhomni St. John, Noah Constant,Mario Guajardo-Cespedes, Steve Yuan, Chris Tar,Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil.2018. Universal Sentence Encoder. arXiv preprintarXiv:1803.11175 .Alexis Conneau and Douwe Kiela. 2018. SentEval: AnEvaluation Toolkit for Universal Sentence Represen-tations. arXiv preprint arXiv:1803.05449 .Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo¨ıcBarrault, and Antoine Bordes. 2017. SupervisedLearning of Universal Sentence Representationsfrom Natural Language Inference Data. In
Proceed-ings of the 2017 Conference on Empirical Methodsin Natural Language Processing , pages 670–680,Copenhagen, Denmark. Association for Computa-tional Linguistics.Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2018. BERT: Pre-training ofDeep Bidirectional Transformers for Language Un-derstanding. arXiv preprint arXiv:1810.04805 .Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Un-supervised Construction of Large Paraphrase Cor-pora: Exploiting Massively Parallel News Sources.In
Proceedings of the 20th International Confer-ence on Computational Linguistics , COLING ’04,Stroudsburg, PA, USA. Association for Computa-tional Linguistics.Liat Ein Dor, Yosi Mass, Alon Halfon, Elad Venezian,Ilya Shnayderman, Ranit Aharonov, and NoamSlonim. 2018. Learning Thematic Similarity Metricfrom Article Sections Using Triplet Networks. In
Proceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 2:Short Papers) , pages 49–54, Melbourne, Australia.Association for Computational Linguistics.Felix Hill, Kyunghyun Cho, and Anna Korhonen.2016. Learning Distributed Representations of Sen-tences from Unlabelled Data. In
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377, San Diego, California. Association for Computational Linguistics. Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In
Proceedings of theTenth ACM SIGKDD International Conference onKnowledge Discovery and Data Mining , KDD ’04,pages 168–177, New York, NY, USA. ACM.Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux,and Jason Weston. 2019. Real-time Inferencein Multi-sentence Tasks with Deep PretrainedTransformers. arXiv preprint arXiv:1905.01969 ,abs/1905.01969.Jeff Johnson, Matthijs Douze, and Herv´e J´egou. 2017.Billion-scale similarity search with GPUs. arXivpreprint arXiv:1702.08734 .Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov,Richard Zemel, Raquel Urtasun, Antonio Torralba,and Sanja Fidler. 2015. Skip-Thought Vectors. InC. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama,and R. Garnett, editors,
Advances in Neural Infor-mation Processing Systems 28 , pages 3294–3302.Curran Associates, Inc.Xin Li and Dan Roth. 2002. Learning Question Classi-fiers. In
Proceedings of the 19th International Con-ference on Computational Linguistics - Volume 1 ,COLING ’02, pages 1–7, Stroudsburg, PA, USA.Association for Computational Linguistics.Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.RoBERTa: A Robustly Optimized BERT Pretrain-ing Approach. arXiv preprint arXiv:1907.11692 .Marco Marelli, Stefano Menini, Marco Baroni, LuisaBentivogli, Raffaella Bernardi, and Roberto Zam-parelli. 2014. A SICK cure for the evaluation ofcompositional distributional semantic models. In
Proceedings of the Ninth International Conferenceon Language Resources and Evaluation (LREC’14) ,pages 216–223, Reykjavik, Iceland. European Lan-guage Resources Association (ELRA).Chandler May, Alex Wang, Shikha Bordia, Samuel R.Bowman, and Rachel Rudinger. 2019. On Mea-suring Social Biases in Sentence Encoders. arXivpreprint arXiv:1903.10561 .Amita Misra, Brian Ecker, and Marilyn A. Walker.2016. Measuring the Similarity of Sentential Ar-guments in Dialogue. In
Proceedings of the SIG-DIAL 2016 Conference, The 17th Annual Meetingof the Special Interest Group on Discourse and Di-alogue, 13-15 September 2016, Los Angeles, CA,USA , pages 276–287.Bo Pang and Lillian Lee. 2004. A Sentimental Educa-tion: Sentiment Analysis Using Subjectivity Sum-marization Based on Minimum Cuts. In
Proceed-ings of the 42nd Meeting of the Association forComputational Linguistics (ACL’04), Main Volume ,pages 271–278, Barcelona, Spain. Bo Pang and Lillian Lee. 2005. Seeing Stars: Exploit-ing Class Relationships for Sentiment Categoriza-tion with Respect to Rating Scales. In
Proceedingsof the 43rd Annual Meeting of the Association forComputational Linguistics (ACL’05) , pages 115–124, Ann Arbor, Michigan. Association for Compu-tational Linguistics.Jeffrey Pennington, Richard Socher, and Christo-pher D. Manning. 2014. GloVe: Global Vectors forWord Representation. In
Empirical Methods in Nat-ural Language Processing (EMNLP) , pages 1532–1543.Yifan Qiao, Chenyan Xiong, Zheng-Hao Liu, andZhiyuan Liu. 2019. Understanding the Be-haviors of BERT in Ranking. arXiv preprintarXiv:1904.07531 .Nils Reimers, Philip Beyer, and Iryna Gurevych. 2016.Task-Oriented Intrinsic Evaluation of Semantic Tex-tual Similarity. In
Proceedings of the 26th Inter-national Conference on Computational Linguistics(COLING) , pages 87–96.Nils Reimers and Iryna Gurevych. 2018. Why Com-paring Single Performance Scores Does Not Al-low to Draw Conclusions About Machine Learn-ing Approaches. arXiv preprint arXiv:1803.09578 ,abs/1803.09578.Nils Reimers, Benjamin Schiller, Tilman Beck, Jo-hannes Daxenberger, Christian Stab, and IrynaGurevych. 2019. Classification and Clustering ofArguments with Contextualized Word Embeddings.In
Proceedings of the 57th Annual Meeting of the As-sociation for Computational Linguistics , pages 567–578, Florence, Italy. Association for ComputationalLinguistics.Florian Schroff, Dmitry Kalenichenko, and JamesPhilbin. 2015. FaceNet: A Unified Embedding forFace Recognition and Clustering. arXiv preprintarXiv:1503.03832 , abs/1503.03832.Richard Socher, Alex Perelygin, Jean Wu, JasonChuang, Christopher D. Manning, Andrew Ng, andChristopher Potts. 2013. Recursive Deep Models forSemantic Compositionality Over a Sentiment Tree-bank. In
Proceedings of the 2013 Conference onEmpirical Methods in Natural Language Process-ing , pages 1631–1642, Seattle, Washington, USA.Association for Computational Linguistics.Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is Allyou Need. In I. Guyon, U. V. Luxburg, S. Bengio,H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-nett, editors,
Advances in Neural Information Pro-cessing Systems 30 , pages 5998–6008.Janyce Wiebe, Theresa Wilson, and Claire Cardie.2005. Annotating Expressions of Opinions andEmotions in Language.
Language Resources andEvaluation , 39(2):165–210.dina Williams, Nikita Nangia, and Samuel Bowman.2018. A Broad-Coverage Challenge Corpus forSentence Understanding through Inference. In
Pro-ceedings of the 2018 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies, Vol-ume 1 (Long Papers) , pages 1112–1122. Associationfor Computational Linguistics.Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-Yi Kong,Noah Constant, Petr Pilar, Heming Ge, Yun-hsuanSung, Brian Strope, and Ray Kurzweil. 2018.Learning Semantic Textual Similarity from Conver-sations. In
Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational Linguistics. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.