Traceability Transformed: Generating more Accurate Links with Pre-Trained BERT Models
Jinfeng Lin, Yalin Liu, Qingkai Zeng, Meng Jiang, Jane Cleland-Huang
Computer Science and Engineering, University of Notre Dame
Notre Dame, IN, USA
jlin6, yliu26, qzeng, mjiang2, [email protected]
Abstract—Software traceability establishes and leverages associations between diverse development artifacts. Researchers have proposed the use of deep learning trace models to link natural language artifacts, such as requirements and issue descriptions, to source code; however, their effectiveness has been restricted by the availability of labeled data and efficiency at runtime. In this study, we propose a novel framework called Trace BERT (T-BERT) to generate trace links between source code and natural language artifacts. To address data sparsity, we leverage a three-step training strategy to enable trace models to transfer knowledge from a closely related Software Engineering challenge, which has a rich dataset, to produce trace links with much higher accuracy than has previously been achieved. We then apply the T-BERT framework to recover links between issues and commits in Open Source Projects. We comparatively evaluated the accuracy and efficiency of three BERT architectures. Results show that a Single-BERT architecture generated the most accurate links, while a Siamese-BERT architecture produced comparable results with significantly less execution time. Furthermore, by learning and transferring knowledge, all three models in the framework outperform classical IR trace models. On the three evaluated real-world OSS projects, the best T-BERT stably outperformed the VSM model with an average improvement of 60.31% measured using Mean Average Precision (MAP). RNN severely underperformed on these projects due to insufficient training data, while T-BERT overcame this problem by using pretrained language models and transfer learning.
Index Terms—Software traceability, deep learning, language models
I. INTRODUCTION
Software and systems traceability is the ability to create and maintain relations between software artifacts and to leverage the resulting network of links to support queries about the product and its development process. Traceability is deemed essential in safety-critical systems, where it is prescribed by certifying bodies such as the USA Federal Aviation Administration (FAA) and the USA Food and Drug Administration (FDA) [1]. When present, trace links support diverse software engineering activities such as impact analysis, compliance validation, and safety assurance. Unfortunately, in practice, the cost and effort of manually creating and maintaining trace links can be inhibitive, and therefore trace links are typically incomplete and inaccurate [2]. As a result, traceability data is often not trusted by developers and is often greatly underutilized.

Software artifacts, such as requirements, design definitions, code, and test cases, all include natural language text, and therefore over the past decades researchers have explored a wide variety of automated approaches for generating and evolving links automatically. Techniques have included probabilistic techniques [3], the Vector Space Model (VSM) [4], Latent Semantic Indexing [5], [6], Latent Dirichlet Allocation (LDA) [7], [8], AI swarm techniques [9], recurrent neural networks [10] to integrate semantics, heuristic approaches [11]–[13], combinations of techniques [7], [14], [15], and the use of decision trees and support vector machines [16] to integrate temporal dependencies and other process-related information into the tracing task. Despite all of these efforts, the accuracy of generated trace links has been unacceptably low, and therefore industry has been reticent to integrate automated tracing solutions into their development life-cycles. The primary impedance is a semantic one, as most existing techniques rely upon word matching – either direct matches (e.g., VSM), topic-based matches (e.g., using LSI or LDA), or indirect matches based on building a domain-specific ontology to bridge the terminology gap [17]. Results have been mixed, especially when applied to industrial-sized datasets, where acceptable recall levels above 90% can often only be achieved at extremely low levels of precision [18].

One of the primary reasons that automated approaches have underperformed is the semantic gap that often exists between related artifacts [10]. Techniques that are unable to reason about semantic associations and bridge this gap fail to establish accurate and relatively complete trace links. Recent work has proposed deep learning (DL) techniques [19], [20] for traceability, but without providing effective solutions. For example, Guo et al. [10] proposed an architecture based on a Recurrent Neural Network (RNN), and evaluated two types of RNN tracing models (LSTM [21] and GRU [22]) for generating links between subsystem requirements and design definitions against a small dataset from an industrial project. While their results showed that accuracy improved as the size of the training set increased, their approach was not trained on large training sets, and therefore was not shown to generalize across larger or more diverse projects. We include both LSTM and GRU approaches for comparison purposes and refer to them collectively as TraceNN (TNN) in this paper.

Two primary factors impede the advancement of DL traceability solutions. The first is the sparsity of training data, given that DL techniques require large volumes of training data.
Manually created trace links (i.e., golden answer sets) available in individual software projects are usually not sufficient for training a DL model. The second impedance is the practicality of applying multi-layer neural networks in a large industrial project, as training and utilizing deep neural networks is significantly slower than more traditional information retrieval or machine learning techniques.

Fig. 1: An example commit message and code change set, where green lines have been added and red ones removed. The commit was tagged by the committer to the depicted issue.

The work reported in this paper addresses these two critical impedances in order to deliver fast and accurate automated traceability solutions for solving industrial problems. More specifically, our proposed Language Model (LM) approach to traceability is designed to (1) deliver accurate and therefore more trustworthy trace links, (2) be applicable for projects with limited training data, and (3) scale up to support large industrial projects with low time complexity. Our approach leverages BERT (Bidirectional Encoder Representations from Transformers) as its underlying language model. BERT, which was introduced by Google in 2018 [23], has delivered marked improvements in diverse NLP tasks, primarily because its bidirectional approach provides deeper contextual information than single-direction language models. In this paper, we explore the use of BERT in the traceability domain – introducing what we refer to as Trace BERT (T-BERT).

T-BERT is a framework for training a BERT-based relationship classifier to generate trace links. Three types of relation classification architectures are particularly well suited for traceability. These are the single, twin, and siamese architectures, which we describe in more depth later in the paper. We compare the effectiveness of these three architectures for generating trace links between Natural Language Artifacts (NLAs) and Programming Language Artifacts (PLAs). NLAs are artifacts such as feature requests, bug reports, requirements, and design definitions, which are written primarily using natural language but may also include code snippets. In contrast, PLAs are primarily programming language artifacts, such as code files, code snippets, function definitions, and code change sets, which also contain natural language comments and descriptors. We evaluated T-BERT by generating trace links from issues to code (represented by change sets), which we refer to as an
NLA-PLA traceability challenge.

The remainder of this paper is laid out as follows. Sec. II outlines the concrete research questions we address in this paper. Sec. III and Sec. IV provide a detailed description of our approach for achieving NLA-PLA traceability, while Sec. V describes the experiments we conducted to evaluate the effectiveness of our approach. Based on the results obtained from these experiments, we derive answers for our research questions in Sec. VI. Finally, Sec. VII to Sec. IX discuss related work, threats to validity, and conclusions.
II. PROBLEM STATEMENT
Researchers have addressed the data sparsity problem and the performance issues of training large models through the use of pre-trained DL models for various NLP problems. This approach divides the training stage into pre-training and fine-tuning phases. In the pre-training phase, DL models are constructed using a huge amount of unlabeled data and self-supervised training tasks. Then in the fine-tuning phase, the models are trained on smaller, labeled datasets in order to perform more specialized 'downstream' tasks. The underlying notion is that knowledge learned from pre-training a model on a larger and more generalized dataset can be effectively transferred to the downstream tasks, which have limited labels for supervised training. Furthermore, a pre-trained model provides a better starting point for model optimization than a randomly seeded one. It therefore reduces the likelihood of local optimization traps and improves overall performance. Fine-tuning a pre-trained model on a smaller dataset takes significantly less time than training a deep learning model from scratch. While pre-training a general model is extremely expensive, the pre-training phase only needs to be performed once and can then be reused for various downstream tasks.

BERT-based language models make use of transformers [24] to learn contextual information from corpora in the pre-training stage and then transfer learned knowledge to downstream NLP tasks, such as question answering, document classification, and sentiment recognition [23], [25]. To our knowledge, this is the first study that has applied BERT or other transformer-based methods to the software traceability task. We pose a series of research questions to evaluate whether T-BERT can effectively address the traceability problem. Our first question is defined as follows:
RQ1: Given three variants of T-BERT models, based on single, twin, and siamese BERT relation classifiers, which is the best architecture for addressing NLA-PLA traceability with respect to both accuracy and efficiency?

In addition to investigating the DL model architecture, we also explore different training techniques for improving model accuracy. As discussed by Guo et al. [10], the DL trace model may hit a 'performance glass ceiling' and converge at relatively low accuracy. We therefore define our second research question as:

RQ2: Which training technique improves accuracy without suffering from the previously observed glass ceiling?

Gururangan et al., in their study of Domain-Adaptive Pre-Training (DAPT), claim that a second phase of pre-training using a domain corpus leads to performance gains. This finding motivated us to explore the third and most important research question:

RQ3: Can T-BERT transfer knowledge from a resource-rich retrieval task to enhance the accuracy and performance of the downstream NLA-PLA tracing challenge?

Feng et al. [26] demonstrated that a BERT Language Model, pre-trained using large numbers of function definitions, can effectively address the downstream code search problem. In that study, researchers provided doc-strings (i.e., Python comments) as user queries, and leveraged a BERT model to retrieve related functions. Since doc-strings and functions are always paired in the code base, ample training data for the code search problem is available. Our RQ3 explores whether the code search problem can be leveraged as a training task to improve T-BERT for the software traceability challenge. Because this step occurs between pre-training and fine-tuning, we refer to it as intermediate-training.
III. APPROACH
Trace retrieval algorithms dynamically generate trace links between artifacts [27], for example, by linking a source artifact (e.g., a Python file) to a target artifact (e.g., an issue or requirement). The traceability algorithm computes the relevance between pairs of source and target artifacts and proposes the most related pairs as trace links. In this section, we first introduce the fundamental architecture of the BERT-based model and its variants, and then introduce T-BERT with three specific relation classifiers that are well suited for addressing this traceability problem.
A. Introduction to BERT and Language Models
A language model represents a probability distribution over a word sequence [28], which with proper training can effectively capture the semantics of individual words based on their surrounding context. Given the importance of general context, DL models built upon pre-trained language models usually achieve better results than those trained on task-specific datasets directly. The architecture of the BERT-based model is based on transformers, in which each layer in the model is a transformer layer. The transformer layers allow the BERT model to focus on terms at any position in a sentence, and training a BERT model is accomplished through a novel technique called Masked Language Modeling (MLM). In an MLM training task, BERT randomly masks words in the input text and then optimizes itself to predict the masked terms based on the contextual information. In this pre-training step, a massive amount of corpora is fed to the BERT-based model, and the resulting model is leveraged to address different downstream tasks by fine-tuning on task-specific datasets. A distinctive feature of BERT is its unified architecture across different tasks [23], as the architectures for LM pre-training and task-specific fine-tuning are almost identical, with only the last layer of the model customized according to the targeted downstream task. This layer is usually referred to as a task header in a BERT-based model.
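To make the MLM objective concrete, the following minimal Python sketch shows the masking step only: it hides 15% of the tokens and records the original tokens as prediction targets. The token list and the single [MASK] substitution are simplifying assumptions; the full BERT recipe also replaces some selected tokens with random tokens or leaves them unchanged.

import random

MASK_RATE = 0.15  # fraction of tokens hidden from the model, as described above

def mask_tokens(tokens, mask_token="[MASK]"):
    """Return (masked_tokens, labels): labels hold the original token at masked
    positions and None elsewhere, so the model is only scored on masked slots."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < MASK_RATE:
            masked.append(mask_token)
            labels.append(tok)       # the model must recover this token from context
        else:
            masked.append(tok)
            labels.append(None)      # position ignored by the MLM loss
    return masked, labels

tokens = "def load_config ( path ) : return json . load ( open ( path ) )".split()
print(mask_tokens(tokens))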
B. BERT For Software Traceability
The solution we propose represents a three-fold procedure of pre-training, intermediate-training, and fine-tuning, as summarized in Fig. 2. In the pre-training phase, a dedicated language model is trained on source code and then utilized to construct the T-BERT models. In the intermediate-training phase, T-BERT is then trained to address the code search problem. In this phase, we provide adequate labeled training examples to T-BERT and expect it to learn general NL-PL classification knowledge that can ultimately be transferred to the traceability challenge. Finally, in the fine-tuning phase, the intermediate-trained T-BERT model is applied to the issue-commit tracing challenge in real-world open-source projects.

Fig. 2: A three-step workflow applies T-BERT to NLA-PLA traceability. 1) Pre-training data are functions collected from Github projects; 2) a BERT model is trained as a language model for code with these functions and composed with a relation classifier as the T-BERT model; 3) functions are split into specifications and doc-strings and used as intermediate-training data; 4) the T-BERT model is intermediate-trained using code search data; 5) OSS datasets are collected from Github repositories; 6) the T-BERT model is fine-tuned as a trace model using the transferred knowledge.
C. T-BERT Architectures
The three variants of the T-BERT architecture that we investigate for software traceability have previously been applied to similar text-based problems. These variants are:

• TWIN:
The Twin BERT architecture is shown in Fig. 3a. It leverages two BERT models to encode the NL and PL artifacts separately. The two artifacts are then transformed into two independent hidden state matrices, in which tokens are represented by fixed-length vectors. We applied a pooling technique on these hidden state matrices to formulate feature vectors representing the artifacts. Finally, we concatenated these two feature vectors for classification tasks.
• SIAMESE:
The siamese BERT architecture is shown in Fig. 3b. It is a hybrid of the single and twin architectures. It only uses one BERT model; however, instead of creating a concatenated token sequence for an NL-PL pair like single BERT, it passes each artifact sequentially to the BERT model and creates separate hidden state matrices for each of the two artifact types (i.e., NL and PL). The two generated hidden state matrices are then pooled and concatenated to produce a joint feature vector as in the Twin BERT architecture. This joint feature vector is then sent to classification headers to accomplish the prediction task.

Both the Siamese and Twin T-BERT architectures concatenate artifact feature vectors to create a joint feature vector. Reimers et al. [29] explored the impact of different concatenation approaches for the siamese BERT architecture and showed that, given two pooled feature vectors u and v, siamese BERT with a joint feature vector (u, v, |u − v|) achieved the best performance on a sentence classification task. We therefore apply this type of concatenation method to fuse the NL and PL feature vectors to create a joint feature vector.
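As an illustration of this concatenation scheme, the sketch below mean-pools two hidden-state matrices and forms the (u, v, |u − v|) joint vector. The tensor shapes and the randomly generated hidden states are assumptions for illustration only; in T-BERT the matrices come from the shared BERT encoder.

import torch

def mean_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    """Average token vectors (seq_len x dim) into one fixed-length feature vector."""
    return hidden_states.mean(dim=0)

# Hidden states produced by passing the NL and PL artifacts through the same
# (shared) BERT encoder, one after the other -- shapes are illustrative.
nl_hidden = torch.randn(128, 768)   # issue text tokens
pl_hidden = torch.randn(256, 768)   # code tokens

u, v = mean_pool(nl_hidden), mean_pool(pl_hidden)
joint = torch.cat([u, v, torch.abs(u - v)])   # (u, v, |u - v|) joint feature vector
print(joint.shape)                            # torch.Size([2304]) -> fed to the classification header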
• SINGLE: The single BERT architecture is shown in Fig. 3c. NL and PL text are annotated with special tokens and then concatenated into a single sequence. For example, the token [CLS]/[SEP] is used to annotate the start/end of a sentence. A sentence S with tokens s_1, s_2, s_3, ..., s_N, and a piece of code C with tokens c_1, c_2, c_3, ..., c_N, will be transformed into an input format of [CLS] s_1, s_2, s_3, ..., s_N [SEP] c_1, c_2, c_3, ..., c_N [SEP]. The annotated and concatenated sequence is fed to the single BERT to generate a single hidden state matrix. A subsequent pooling layer then reduces the dimension of the matrix to create a fused feature vector, which is the counterpart of the joint feature vector in SIAMESE and TWIN. This feature vector is used by the classification header to predict whether the input NL-PL pair is related or not.
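The [CLS]/[SEP] packing described above corresponds to the standard text-pair encoding of BERT-style tokenizers. The sketch below is a minimal illustration assuming a recent HuggingFace transformers version; the checkpoint name, example texts, and truncation length are illustrative and not the exact experimental settings.

from transformers import AutoTokenizer

# Any BERT-family tokenizer can pack a text pair; the checkpoint name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

issue_text = "Fix crash when the config file is missing"
code_text = "def load_config(path):\n    return json.load(open(path))"

# Passing a text pair yields: [CLS] issue tokens [SEP] code tokens [SEP]
encoded = tokenizer(issue_text, code_text, truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))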
IV. MODEL TRAINING

In this section, we describe the training strategies used for the pre-training, intermediate-training, and fine-tuning phases. The dataset supporting the pre-training and intermediate-training phases is provided by Husain et al. [30] from their study of the code search problem. It includes function definitions and their associated doc-strings scraped from numerous Github projects, and covers the Go, Java, JavaScript, PHP, Python, and Ruby programming languages.

The dataset used in the fine-tuning phase was retrieved from OSS by our team. We extracted issues and commits through Github's APIs and mined ground-truth trace links from the commit messages. We show the data format in Fig. 1, and explain details of the data collection process in Sec. V-A. In this study we selected Python as our target language for both training and evaluation due to the large number of active projects; however, our approach is not language dependent. Given sufficient time, the same post-training process could be applied to other programming languages.
A. Three Step Training

• Pre-training Code Language Model:
In the pre-training step, we leveraged the BERT model to learn the word distribution among NL and PL documents, and refer to this BERT model as 'code BERT' to distinguish it from the plain BERT model that handles only NL text. In the plain BERT model, masked LM (MLM) tasks were used to pre-train BERT as a language model. As previously explained, in MLM tasks, 15% of tokens are selected and masked, and then BERT is trained to recover the masked tokens based on their surrounding context. Given that pre-training a language model is very expensive, three commercial organizations have released their own pre-trained code BERT models (Hugging Face [31], CodistAI [32], and Microsoft [26]), all of which were trained on the CodeSearchNet dataset. Of these, we leverage Microsoft's model (referred to as MS-CodeBert) directly as the source code language model for the T-BERT relation classification models depicted in Fig. 3, as it has been shown to deliver improved language comprehension for diverse downstream software engineering tasks. These improvements in MS-CodeBert can be attributed to its 'Replaced Token Detection' training task, which has been shown to be a more effective way of training LMs [33]. This training task replaces a small portion of tokens in the corpus with random tokens and then requires the BERT model to identify which tokens have been replaced.

Fig. 3: The architectures of the three T-BERT models proposed and evaluated in our experiments: (a) TWIN, (b) SIAMESE, (c) SINGLE.
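A minimal sketch of how the released pre-trained code language model can be loaded and paired with a relation classification header; the checkpoint name microsoft/codebert-base refers to the publicly released MS-CodeBert model, while the header dimensions are illustrative assumptions rather than the exact T-BERT configuration.

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RelationClassifierHead(nn.Module):
    """Illustrative task header: maps a pooled feature vector to related/unrelated."""
    def __init__(self, in_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, features):
        return self.net(features)

# Reuse the released pre-trained code language model instead of pre-training from scratch.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")
classifier = RelationClassifierHead(in_dim=encoder.config.hidden_size)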
• Intermediate-training: For intermediate-training, we trained T-BERT models to perform the code search problem, as this problem is inherently similar to the NLA-PLA traceability challenge. In both cases, we used T-BERT to retrieve related source code based on an NL description of code functionality. The CodeSearchNet dataset (https://github.com/github/CodeSearchNet) provides a benchmark for the code search problem, as each function in the dataset is paired with a doc-string. For Python, the dataset includes 824,342 functions for training, 46,213 functions for development, and 22,176 functions for testing. This dataset is ideal for intermediate-training purposes because 1) it is large in size, so the T-BERT model has adequate labeled data to learn general rules for identifying NL-PL relevance, 2) the relationships between doc-strings and functions are definitive, meaning that there is minimal noise in the ground truth, and 3) the function definitions use only part of the Python grammar, which makes this task easier to handle than NLA-PLA traceability.

We formulated intermediate-training as a binary classification task in which T-BERT was asked to identify whether a given doc-string properly describes its paired function or not. The loss function used in this intermediate-training step was Cross Entropy Loss, and the Adam Optimizer [34] was used to update the parameters and optimize for the loss function. The code search problem creates an unbalanced distribution of positive and negative docstring-to-code pairs; we therefore created a balanced training dataset with an equal number of positive and negative samples. Guo et al. [10] adopted a dynamic under-sampling strategy to avoid inflating training data size while continually exposing the model to previously unseen negative samples. We adopted a similar technique to construct our training samples. In each epoch, a balanced training set was constructed by including all function and doc-string pairs from the CodeSearchNet dataset, as well as a randomly selected equal number of non-related pairs. We updated the training set at the beginning of each epoch, so that the T-BERT model could learn from previously unseen negative examples. We refer to this training strategy as Dynamic Random Negative Sampling (DRNS), and compare it to other training strategies described in Sec. IV-B.
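The sketch below illustrates the DRNS idea under simplifying assumptions about the data structures: at the start of every epoch, all true doc-string/function pairs are kept and an equal number of mismatched pairs is drawn at random.

import random

def build_drns_epoch(pairs):
    """pairs: list of (doc_string, function) ground-truth pairs.
    Returns a balanced, shuffled list of (doc_string, function, label) examples,
    re-sampled at the start of every epoch so new negatives keep appearing."""
    examples = [(doc, code, 1) for doc, code in pairs]          # all positives
    functions = [code for _, code in pairs]
    for doc, code in pairs:                                     # one random negative per positive
        negative = random.choice([c for c in functions if c is not code])
        examples.append((doc, negative, 0))
    random.shuffle(examples)
    return examples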
• Fine-tuning: In fine-tuning, we utilized a similar training technique to that discussed in the previous step, but addressed the traceability challenge of tracing issues to code commits using real-world OSS datasets. Although the input data is formatted differently from the intermediate-training format, T-BERT uses the same architecture for both tasks. As shown in Fig. 1, the issues are comprised of a short issue summary and a long issue description, while the commit is composed of a commit message and a code change set. For each type of artifact, we concatenated the text to formulate input sequences for the T-BERT model (i.e., issue summary + issue description, and commit message + code change set). In contrast to the intermediate-training step, the dataset utilized in this step is limited and fuzzy. As reported by Rath et al. [16], link sets mined from OSS projects are unlikely to be complete and entirely accurate, as engineers may forget to tag issues, or may commit multiple changes against multiple tags. Furthermore, the number of links in the project-specific fine-tuning phase is significantly smaller than the number of function to doc-string pairs used in intermediate-training. As reported in Table I, we have approximately six hundred true links for fine-tuning compared to 824,342 pairs of functions and doc-strings. Furthermore, the code change set in commits has a more flexible and complex format than the short and succinct function definitions.
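A minimal sketch of how issue and commit artifacts can be flattened into the two input sequences described above; the dictionary field names are assumptions based on Fig. 1 rather than the exact dataset schema.

def issue_to_text(issue: dict) -> str:
    # issue summary + issue description form the NL sequence
    return f"{issue['summary']} {issue['description']}"

def commit_to_text(commit: dict) -> str:
    # commit message + code change set (diff) form the PL sequence
    return f"{commit['message']} {commit['diff']}"

issue = {"summary": "Crash on missing config", "description": "App fails when config.json is absent."}
commit = {"message": "Handle missing config file", "diff": "+ if not os.path.exists(path): return {}"}
print(issue_to_text(issue))
print(commit_to_text(commit))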
B. Negative Sampling
Guo et al. observed a glass ceiling in terms of achieved trace link performance, in which the accuracy of their neural trace model increased at the beginning as the training epochs increased; however, it then reached a peak value and started to decline with further training [10]. Our hypothesis for this phenomenon is as follows. At earlier stages of training, the trace model can effectively learn rules for distinguishing positive and negative examples, and the neural trace model was easily able to rule out many unrelated examples (i.e., cases where the source and target artifacts have little common vocabulary). Since those types of negative examples constitute the majority of negative examples, the random negative sampling strategy experiences few challenging examples and therefore starts overfitting based on naive rules, because these naive rules are applicable for the majority of simple cases. This overfitting causes accuracy to decline after a few training epochs.

Our approach adds high quality negative samples to alleviate this problem through our proposed Online Negative Sampling (ONS) as an alternative to Dynamic Random Negative Sampling (DRNS). The principal idea of ONS is that, instead of creating a training dataset at the beginning of each epoch, the trace model generates negative examples dynamically at the batch level. For illustrative purposes, imagine we want to create a batch of size 8 containing 4 positive links. If we have 4 NL artifacts and 4 PL artifacts (i.e., a total of 16 candidate links), we include the 4 positive links, and then select 4 negative links from the 12 candidates in order to create a balanced batch. By evaluating these negative links and ranking them by predicted score, we can identify the false links which are more likely to be mistakenly classified. Then, by incorporating the top-scoring negative examples into the training set, we improve the quality of negative examples and avoid early over-fitting. This approach is inspired by applications in the Face Recognition domain, where face recognition models need to distinguish between people with similar appearances [35]. This negative example mining strategy is usually combined with Triplet Loss [36] in the contrastive learning [37] framework. Here we adopt it for our classification task and combine it with the widely used Cross Entropy Loss.
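A sketch of the batch-level ONS idea under simplifying assumptions: all cross-pairings of the artifacts in a batch are scored with the current model and the highest-scoring false pairs are kept as hard negatives. The score_pairs callable stands in for a forward pass of the T-BERT classifier and is hypothetical.

def online_negative_sampling(nl_batch, pl_batch, true_pairs, score_pairs):
    """nl_batch, pl_batch: artifacts whose true pairings are listed in true_pairs.
    score_pairs: callable scoring candidate (nl, pl) pairs with the current model.
    Returns a balanced batch of positives plus the hardest negatives."""
    positives = [(nl, pl, 1) for nl, pl in true_pairs]
    candidates = [(nl, pl) for nl in nl_batch for pl in pl_batch
                  if (nl, pl) not in true_pairs]
    scores = score_pairs(candidates)                       # current-model relevance scores
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    hard_negatives = [(nl, pl, 0) for (nl, pl), _ in ranked[:len(positives)]]
    return positives + hard_negatives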
V. EXPERIMENTAL EVALUATION
A. Data Collection
In this study, we train T-BERT and evaluate it against two types of datasets. The first dataset is CodeSearchNet, which is publicly available [30]. It includes functions and their associated doc-strings for six different programming languages. As previously stated, we focused on Python functions.

The other datasets we leveraged were mined from three OSS projects in Github, namely Pgcli, Flask, and Keras, as described in Table I (the dataset is available at https://zenodo.org/record/4511291). We selected these projects because they are popular, actively maintained Python projects, with developers actively tagging commits with issue IDs. We retrieved issues and their discussions as the source artifacts and commits as target artifacts. For each issue, we included both the short issue summary and the longer issue description. We automatically removed stack traces from issue discussions if highlighted as a Code block in markdown, as we wanted to train our approach to perform the harder job of generating links between issues and code without the more explicit information provided by stack traces. For each commit, we included the commit message and change set. However, we removed very small commits (with less than 5 LOC) from our target artifact set. Finally, due to the Github API rate limit, we retrieved a maximum of 5000 issues for each project.

After retrieving commits and issues, we mined a 'golden link set' from the commit messages by using issue tags embedded into commit messages by committers. In addition, we leveraged pull requests, as an accepted pull request automatically creates both an issue and a commit in the OSS project, and connects them through an issue ID embedded into the commit message. Tags were mined using regular expressions in order to build a link set connecting issues and commits (see the sketch after Table I). One risk of mining links from commit messages is that the link set may be incomplete. Liu et al. partially addressed this problem by pruning the dataset and only retaining artifacts appearing in the link set [38]. We adopted this process to construct our dataset and report the results in Table I.

TABLE I: The size of the software projects leveraged in the traceability experiment. We applied the cleaning procedures, described in Sec. V-A, to clean artifacts and links.
                  Commits  Issues  Links   Type
Pgcli  original    2191     1197    645    database command line
       cleaned      531      522    530
Flask  original    4011     3711   1159    web framework
       cleaned      752      739    753
Keras  original    5348     4810   9375    neural-network library
       cleaned      551      550    551
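The following sketch illustrates the regex-based tag mining referenced above; the tag pattern ('#<issue id>' in the commit message) is an assumed format, and the patterns used in the actual study may differ.

import re

ISSUE_TAG = re.compile(r"#(\d+)")   # assumed tag format: '#<issue id>' in the commit message

def mine_links(commits):
    """commits: list of (commit_id, message). Returns (commit_id, issue_id) link candidates."""
    links = []
    for commit_id, message in commits:
        for issue_id in ISSUE_TAG.findall(message):
            links.append((commit_id, issue_id))
    return links

print(mine_links([("a1b2c3", "Fix crash when config missing, closes #42")]))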
B. Experiment Setup
We conducted our experiment on a Supermicro SYS-7048GR-TR SuperServer with dual twelve-core 2.2GHz Intel Xeon processors and 128GB RAM. We utilized 1 NVIDIA GeForce GTX 1080 Ti GPU with 10 GB memory to train and evaluate our models. The T-BERT model was implemented with PyTorch V.1.6.0 and the HuggingFace Transformer library V.2.8.0. We trained models for 8 and 400 epochs in intermediate-training and fine-tuning, respectively. For each task, we used a batch size of 8 and a batch accumulation step of 8. We set the initial learning rate to 1E-05 and applied a linear scheduler to control the learning rate at run time. Regarding model selection, we split the dataset into training (train), development (dev), and test sets. We trained the model using the training dataset and tested its performance on the dev dataset. We then selected the best performing model based on the dev dataset and created an output model. We finally evaluated and compared the performance of the output models on the test dataset. In the intermediate-training stage, the dataset was already split by the data provider. In the fine-tuning stage, we split the dataset into ten folds, of which eight were used for training, one for development, and one for testing.
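The optimization settings above can be wired together roughly as in the sketch below, which uses a stand-in model and synthetic data so the loop runs end to end; the learning rate, batch size, and accumulation steps follow the values reported in this section, while the warm-up step count and the toy model are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

# A stand-in model and data loader so the training-loop skeleton runs end to end;
# in the real setup these would be the T-BERT model and the issue/commit data loader.
model = nn.Linear(10, 2)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

ACCUMULATION_STEPS = 8            # batch size 8 accumulated 8 times
total_steps = len(loader) // ACCUMULATION_STEPS

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)      # initial learning rate 1E-05
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=total_steps)
loss_fn = nn.CrossEntropyLoss()

for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / ACCUMULATION_STEPS
    loss.backward()
    if (step + 1) % ACCUMULATION_STEPS == 0:   # update only after accumulating gradients
        optimizer.step()
        scheduler.step()                       # linear decay of the learning rate
        optimizer.zero_grad()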
C. Evaluation Metrics
The metrics for our experiments include F-scores, Mean Average Precision (MAP@3), Mean Reciprocal Rank (MRR), and Precision@K.

• F-scores:
F-scores are composite metrics calculated from precision and recall, and are frequently used to evaluate traceability results. The F-1 score assigns equal weights to precision and recall, while the F-2 score favors recall over precision. Although both precision and recall are important, F-2 is usually preferred for evaluating trace results, where recall is considered more important than precision. We report the best F-scores in our experiments by enumerating the thresholds.

F_\beta = \frac{(1+\beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}   (1)

• Mean Average Precision:
MAP evaluates the ranking of relevant artifacts over retrieved ones. Each source artifact is regarded as a query Q for retrieving artifacts. After ranking the retrieved target artifacts, an Average Precision (AveP or AP) score is obtained based on the position of all relevant target artifacts in the ranking. The mean of the AveP scores is then computed to return the MAP. In our study, we apply a stricter metric known as MAP@3, in which only artifacts ranked in the top 3 positions contribute to the AveP score. The formulas for this metric are shown below, where k represents the total number of relevant target artifacts in a query and rank_i refers to the ranking of a target artifact:

AveP@3 = \frac{\sum_{i}^{k} X_i}{k}, \quad X_i = \begin{cases} P@i & \text{if } rank_i \le 3 \\ 0 & \text{otherwise} \end{cases}   (2)

MAP@3 = \frac{\sum_{j}^{Q} AveP_j@3}{Q}   (3)

• Mean Reciprocal Rank:
MRR is another measurement of the result ranking. In each query, the first related target artifact, with a rank of N, provides a Reciprocal Rank of 1/N. MRR accumulates by averaging the Reciprocal Rank over all the queries Q. This focuses on the first effective result for a query, while ignoring the overall ranking. This is the standard metric used for the CodeSearchNet benchmark. While MAP is more typical for trace retrieval tasks, we include this metric to compare our intermediate-trained model against the approaches used in other studies of the code search problem.
MRR = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{1stRank_i}   (4)

• Precision@K:
Precision@K evaluates how many related artifacts are retrieved and ranked in the top K. The formula for this metric is shown in Eq. 5. We provide results with K values of 1 to 3. A trace model with high Precision@K means users are more likely to find at least one related target artifact in the top K results.
Precision@K = \frac{\sum_{i}^{Q} |Rel_i@K|}{|Rel|}   (5)

As we can see, MRR and Precision@K ignore recall and focus on evaluating whether the search result can find interesting results for a user. They are ideal for the code search problem but not for traceability, where recall is particularly important. Therefore, we apply only the F-score and MAP@3 to evaluate traceability results. As the majority of our queries have fewer than three correct links, a perfect MAP@3 score represents close to 100% recall.
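For concreteness, a minimal sketch of MRR (Eq. 4) and Precision@K (Eq. 5) computed over per-query rankings; the input format (one ranked list of candidate IDs plus a set of relevant IDs per query) is an assumption for illustration.

def mrr(queries):
    """queries: list of (ranked_ids, relevant_ids). Reciprocal rank of the first hit, averaged."""
    total = 0.0
    for ranked, relevant in queries:
        for pos, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / pos
                break
    return total / len(queries)

def precision_at_k(queries, k):
    """Fraction of relevant artifacts retrieved within the top k, following Eq. 5."""
    hits = sum(len(set(ranked[:k]) & relevant) for ranked, relevant in queries)
    total_relevant = sum(len(relevant) for _, relevant in queries)
    return hits / total_relevant

queries = [(["c3", "c1", "c7"], {"c1"}), (["c2", "c9", "c4"], {"c9", "c4"})]
print(mrr(queries), precision_at_k(queries, 3))   # 0.5 1.0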
VI. RESULTS AND DISCUSSION

We report performance results of the T-BERT models for the code search problem and the NLA-PLA traceability problem, and address the RQs defined in Sec. II.
A. Evaluating the Code Search Problem
Our first evaluation explores how well T-BERT models perform when datasets have adequate labeled examples. We trained T-BERT models for the three architectures introduced in Sec. III-C, using the training part of the CodeSearchNet dataset. For the T-BERT models which are trained with the ONS technique, we add a star sign after the model name to distinguish them from the T-BERT models trained with DRNS. For example, SINGLE* refers to the model with the single BERT architecture trained with online negative sampling. The performance of these six types of models is reported in Table II. In addition, we compare T-BERT models against three classical tracing techniques, VSM, LDA, and LSI, and also a deep learning trace model, TraceNN [10], for this dataset. Other researchers, such as Husain et al. [30] and Feng et al. [26], leveraged the same dataset to conduct code search studies, and so we select the methods which achieved the best MRR scores in their studies as a comparison to T-BERT, and created our evaluation dataset for CodeSearchNet in the same way. For each doc-string, we combined the related function with 999 unrelated ones, and charged the retrieval models with finding the correct function among one thousand candidates. However, the SINGLE models were not able to efficiently process the entire dataset, and so in this case we evaluated only 100 out of the total 22,176 queries. The comparison of MRR scores is shown in Table II. To observe the learning process for each model, we visualized the learning curve in Fig. 4 for the first 5,000 steps of optimization. We evaluated the performance of each model at intervals of 1000 steps during training by applying the intermediate model against small testing sets composed of 200 development examples.

TABLE II: Evaluation of T-BERT models on the CodeSearchNet Challenge dataset
            F1     F2     MAP    MRR     Pr@1   Pr@2   Pr@3
TWIN        0.497  0.563  0.735  0.756   0.646  0.787  0.842
SIAMESE     0.604  0.668  0.814  0.825   0.729  0.866  0.915
SINGLE      0.482  0.572  0.825  0.839   0.730  0.900  0.930
TWIN*       0.559  0.626  0.794  0.809   0.712  0.846  0.890
SIAMESE*    0.594  0.655  0.817  0.829   0.738  0.866  0.910
SINGLE*     0.612  0.678  0.837  0.851   0.750  0.910  0.930
VSM         0.219  0.255  0.314  0.351   0.251  0.341  0.397
LDA         0.005  0.010  0.012  0.021   0.008  0.013  0.017
LSI         0.003  0.007  0.014  0.025   0.009  0.015  0.020
TNN-LSTM    0.179  0.245  0.351  0.400   0.269  0.386  0.457
TNN-BiGRU   0.221  0.290  0.392  0.438   0.304  0.432  0.504
JV-biRNN    -      -      -      *0.321  -      -      -
JV-SelfAtt  -      -      -      *0.692  -      -      -
MSC         -      -      -      *0.860  -      -      -
JV = Joint Vector [30]; MSC = MS-CodeBERT [26]; TNN = TraceNN [10]; *Previously reported results against the same CodeSearchNet challenge dataset.
TABLE III: Training and testing time for T-BERT models on the code search problem. The test time is recorded for a test set with 100 queries.
             Strategy   TWIN     SIAMESE   SINGLE
Train (hr)   DRNS       156h     138h      164h
Test (sec)   DRNS       3254s    3264s     183357s
Train (hr)   ONS        146h     142h      283h
Test (sec)   ONS        3211s    3265s     193667s
B. Evaluate NLA-PLA Traceability
We then evaluated the T-BERT models on the NLA-PLA traceability problem. As previously described, we used 8 folds of trace links for training, 1 fold for development, and 1 fold for testing. To explore our RQ3, we conducted a controlled experiment in which we trained two groups of T-BERT models. In the first group, we continued training the T-BERT models which had been intermediate-trained in our previous experiment. In the second group, we trained the T-BERT models without applying the transferred knowledge learned from intermediate-training. When we conducted model training, we applied the same training dataset and ONS techniques to the two groups. To maintain consistency of abbreviations, we name the models in the first group, for example, as SINGLE*+T, and the models in the second group as SINGLE*. The results of this experiment are shown in Table IV, while the learning curves of the T-BERT models during fine-tuning on the traceability task are shown in Fig. 5. We show only the learning curve for the Pgcli data due to space constraints.
C. RQ2: How does ONS alleviate the glass ceiling problem?
In this section, we discuss how ONS alleviates the glass ceiling problem for T-BERT models. This question helps us to identify the best approach for training T-BERT models. To answer this question, we apply both ONS and DRNS to the same T-BERT architecture to create test and control groups. Fig. 4 shows the learning curves of the T-BERT models on the CodeSearchNet dataset during training. The orange lines represent the models trained with ONS, while the blue lines represent those trained with DRNS. We find that for the TWIN and SINGLE models, the orange line is always above the blue line, meaning that ONS not only accelerates learning for the T-BERT models but also lets them converge at a higher value. For SIAMESE, we find the orange line is above the blue line in the early steps, but soon converges at a similar level. This result indicates that ONS benefits T-BERT model training by introducing harder negative examples. The evaluation results shown in the first six rows of Table II also support this finding, as the TWIN* and SINGLE* models (lines 4 and 6) achieve better results than the corresponding T-BERT models (lines 1 and 3) with respect to all metrics. SIAMESE* (line 5) and SIAMESE (line 2) have very close results, where the difference across all metrics is within 0.5% except for the F2 score (1.3%).

We report the training time for DRNS and ONS in Table III. ONS introduces initial overheads in constructing each batch, but then has fewer candidates to evaluate and sort. In contrast, DRNS has no upfront construction costs, but must sample data from a large list, creating a performance bottleneck. The use of ONS significantly increased training time only for the SINGLE model, which is particularly slow at evaluation. We conclude that ONS delivers better accuracy than DRNS, and only increases training times for models (e.g., SINGLE) which have slow evaluation processes.
D. RQ1: Which T-BERT architecture is better?
This RQ focuses on comparing the accuracy and efficiency of T-BERT when used for trace retrieval tasks. Performance comparisons were conducted on the T-BERT* models, as we have shown in RQ2 that T-BERT* models returned better accuracy than the T-BERT ones. To answer this question, we further divided it into three sub-RQs as follows.

• RQ1.1:
Are T-BERT models capable of resolving the CodeSearchNet and NLA-PLA traceability problems?

Table II shows that SINGLE*, TWIN*, and SIAMESE* (lines 4-6) can achieve F-scores around 0.6 and MAP scores around 0.8. The Precision@3 scores for the three models are around 0.9, which means T-BERT* models can return related functions in around 9 of 10 user queries, and in 75% to 80% of cases the correct function definition is ranked at the first position. This result shows that all three models are effective for the CodeSearchNet challenge. Among these three models, SINGLE* achieves the best performance with respect to all metrics. However, the gap between these three models is small, and all three models clearly outperform the baseline created by the three IR models of VSM, LSI, and LDA.

In Table IV, the first three rows show that the T-BERT* models applied to the traceability challenge and trained without the benefit of transferred knowledge are ranked in the same way as for the code search problem. However, the performance gap between these three models increases. This suggests that the size of training data has different impacts on the three types of architecture. Since the TWIN* model includes two inner BERT models, the parameters in this architecture are doubled, and it therefore requires more training examples to tune the models. Nevertheless, all T-BERT* models achieved better results than the IR model baselines. Especially on the Keras dataset, SIAMESE* and SINGLE* (lines 2 and 3) have an F-score above 0.95 and a MAP of 0.99, indicating that T-BERT can provide near-perfect tracing results in some scenarios.

• RQ1.2:
Which T-BERT model most effectively addresses the two problems of data sparsity and performance?

We need to take both accuracy and efficiency into consideration when selecting a model for use in production. As discussed in RQ1.1, the SINGLE* model achieves the best performance; however, it is very slow at processing large-scale datasets. As shown in Table III, TWIN and SIAMESE need around 3000 seconds to evaluate 100 queries, while SINGLE needs around 20000 seconds. We estimate that it would take around 6000 hours for SINGLE to evaluate the whole CodeSearchNet test set with our current experiment setup, whereas for TWIN and SIAMESE it took us only around 20 hours to evaluate the whole test set in practice. In the traceability challenge, the test set is relatively tiny. Taking Pgcli for example, it contains 2704 candidate links composed of 52 source artifacts and 52 target artifacts. TWIN and SIAMESE both take around 160 seconds to finish the task, while the SINGLE model takes around one hour.

TABLE IV: Evaluation of models on NLA-PLA traceability
              Pgcli                  Flask                  Keras
              F1     F2     MAP     F1     F2     MAP     F1     F2     MAP
TWIN*         0.450  0.491  0.574   0.524  0.577  0.683   0.450  0.491  0.574
SIAMESE*      0.621  0.654  0.728   0.681  0.731  0.801   0.962  0.962  0.990
SINGLE*       0.707  0.745  0.785   0.841  0.873  0.952   0.931  0.925  0.971
TWIN*+T       0.686  0.709  0.766   0.750  0.781  0.869   0.953  0.970  0.978
SIAMESE*+T    0.729  0.748  0.779   0.820  0.830  0.920   0.971  0.977  0.990
SINGLE*+T     0.730  0.789  0.859   0.884  0.862  0.92    0.972  0.989  0.990
VSM           0.376  0.424  0.506   0.509  0.474  0.540   0.532  0.512  0.703
LDA           0.121  0.226  0.208   0.182  0.241  0.227   0.290  0.367  0.333
LSI           0.085  0.145  0.147   0.127  0.164  0.142   0.072  0.126  0.109
TNN-LSTM      0.138  0.179  0.128   0.106  0.126  0.080   0.053  0.087  0.034
TNN-BiGRU     0.062  0.116  0.006   0.066  0.100  0.044   0.063  0.119  0.073
T = Transfer learning; TNN = TraceNN [10]
Fig. 4: The learning curves for the T-BERT and T-BERT* models on the code search challenge, with panels (a) TWIN, (b) SIAMESE, and (c) SINGLE. This figure shows the MAP scores (Y-axis) over the first 35K (X-axis) Adam optimization steps.

SIAMESE and TWIN architectures accelerate the process by decoupling the feature vector creation steps. In the SINGLE architecture, NL and PL document pairs are fed to BERT to create joint feature vectors. This step is extremely expensive and creates the main performance bottleneck for the SINGLE model. Assuming we have N source and N target artifacts, SINGLE has a time complexity of O(N^2 * K) for creating feature vectors for all the candidate links, where K refers to the time consumed by the BERT model to convert an input token sequence into a feature vector. TWIN and SIAMESE only need O(N * K) to convert artifacts into feature vectors and then O(N^2) time to concatenate the feature vectors together. The time complexity of TWIN and SIAMESE is thus one order of magnitude lower than that of the SINGLE model, and therefore more scalable to projects with massive numbers of artifacts. We argue that SIAMESE is the most appropriate model for addressing NLA-PLA traceability when taking both accuracy and efficiency into consideration, because it can achieve an accuracy close to the SINGLE architecture for the traceability challenge while maintaining the low time complexity of the TWIN architecture. However, in cases where accuracy is the primary concern, e.g., traceability for safety-critical projects, users should adopt the SINGLE model supported by high-performance hardware.

TABLE V: Model performance on the Pgcli dataset for NLA-PLA.

             TWIN*   SIAMESE*   SINGLE*
Train (hr)   12h     12h        13h
Test (sec)   170s    163s       5395s

• RQ1.3:
How do T-BERT models compare to other approaches?

For the CodeSearchNet challenge, we compared the performance of T-BERT models to Joint Vector Embedding (JVE) and MS-CodeBERT. JVE's architecture is similar to TWIN, and leverages two encoders to create feature vectors for a classification network. Previous studies have reported 60% as the highest MRR achieved by JVE on the same dataset, which is lower than the T-BERT models. MS-CodeBERT, provided by Microsoft, used the same architecture as SINGLE in our experiment. However, MS-CodeBERT was trained with a batch size of 256 on a cluster with 16 Tesla GPUs, and no special techniques were applied during training. Our machine only allows one small batch due to memory limitations, but the SINGLE* model's MRR results were only 0.9% lower than MS-CodeBERT, indicating that our training techniques partially alleviate the limitations introduced by less powerful hardware.

TNN [10] is an RNN-based trace model proposed by Guo et al. and designed for generating NLA-NLA links. We reconstructed the model according to the authors' specifications and applied it to our NLA-PLA problem for comparison purposes. TNN utilizes Word2Vec embeddings to transform tokens into vectors. It uses two alternate RNN networks, LSTM or Bi-directional GRU (BiGRU), to generate semantic representations of the NLA and PLA, and feeds these semantic hidden states to the integration layer to generate a new hidden state representing the correlations between the NLA and PLA, from which links are generated. Our embedding layer was constructed by unsupervised training of a skip-gram Word2Vec model using artifacts from the three OSS projects reported in Table I. We evaluated both LSTM and BiGRU for the RNN layer in this study. TNN results are shown at the bottom of Table IV, and show that it underperformed all BERT models and VSM on all three OSS projects. We provide an illustrated example in Fig. 6, showing T-BERT and VSM results for a commit-issue pair tagged by the committer as related.

Fig. 5: Learning curves of the T-BERT* models on the Pgcli dataset for the NLA-PLA trace challenge. This figure shows the MAP scores (Y-axis) over 5k Adam optimization steps (X-axis).

Although Guo et al. reported improvements over the VSM model for their NLA-NLA dataset, we were unable to replicate these for our NLA-PLA problem. An inspection of the TNN learning curve indicated that TNN effectively reduced the loss and improved the link prediction accuracy on all three training datasets, but converged early on the validation datasets and then decreased in accuracy, indicating an overfitting problem. There are several possible explanations for these results. First, the dataset used by Guo et al. contained 1,387 positive links versus our 530-739 links, which could be insufficient for RNN training. Second, programming languages have an open vocabulary in which new terms can be created as variable and function names, and TNN may therefore need a larger training set to generate NLA-PLA links versus NLA-NLA ones. Our hypotheses are supported by the observation that TNN does not overfit when applied to CodeSearchNet, where larger numbers of training examples are provided. T-BERT models leverage transferred knowledge from pretrained language models and adjacent problems to reduce the required training dataset size, and are therefore able to handle tracing challenges which cannot easily be addressed by classical deep learning trace models. This characteristic makes T-BERT more practical for industrial applications.
E. RQ3: To what extent can T-BERT leverage transferred knowledge from code search for software traceability?
Table IV compares the T-BERT models trained with and without transferred knowledge from the intermediate-trained model. The results show that intermediate-training T-BERT models on the code search task can significantly improve their performance on the traceability problem. Taking SIAMESE on Pgcli for example, the F2 score increased from 0.654 to 0.748, while the MAP score increased from 0.728 to 0.779. Similar results are observed for the other datasets with different T-BERT models. This suggests that the knowledge learned from tracing text to structured code (function definitions) can be effectively transferred to cases where 1) code formats are more fuzzy and 2) training data has limited labels. Intermediate-training improvements were observed to different extents across the three architecture types. As shown in Fig. 5, the blue line (SINGLE) converged at a very early stage, showing that SINGLE needed only relatively few epochs on the smaller task-specific dataset to localize its transferred knowledge. SIAMESE converged more slowly, while TWIN converged slowest of all, indicating that each architecture has a different capacity for transferring knowledge.

Fig. 6: In this example, a link is tagged by developers, retrieved by the T-BERT model (with a high score of 0.965 due to semantic similarity and context) and missed by VSM, because the key terms 'request' and 'json' are common terms.
VII. RELATED WORK
Our study constructs T-BERT models using three different architectures, all of which have previously been used to address related problems in other domains. Lu et al. leveraged a TWIN-like model, named TwinBERT, as a search engine to deliver ads alongside organic search results [39]. They used reinforcement training techniques and found that TwinBERT could return results with high accuracy and low latency. Reimers et al. proposed a SIAMESE architecture to address problems such as Semantic Textual Similarity [29]. They trained their model to determine whether two sentences were related through contradiction or entailment, and utilized the SNLI [40] and Multi-Genre NLI [41] datasets for training and evaluation. Their results showed that SIAMESE BERT could achieve a high Spearman Rank Correlation score of around 0.76. They also found that the use of average pooling was more effective than max pooling and first-token (the [CLS] token) pooling, and that concatenating source and target hidden states as (u, v, |u − v|) achieved the best results. We adopted these findings when we built the T-BERT models for this study. However, no study has yet been conducted comparing the TWIN, SIAMESE, and SINGLE architectures.

To address the NLA-PLA challenge, we adopted the code search problem as our intermediate solution. Several studies have addressed the code search problem using a recurrent neural network (RNN). We have already made comparisons to the work by Feng et al. [26] and Husain et al. [30] in Table II; however, in another study, Gu et al. [42] converted method specifications into API call sequences and then processed the sequences with an RNN. They reported achieving 0.6 MRR on a test set with 100 queries. However, we cannot directly adapt this method to the traceability challenge because, unlike API calls, the statements in code change sets are not structured. A related domain for addressing NLA-PLA is source code embedding. By converting both source code and documents into distributed representations, the relevance between these two types of artifacts can be effectively calculated through distance metrics such as Cosine and Euclidean Distance. Code2Vec [43] belongs to this type of approach. T-BERT models can adapt to this type of training by integrating Cosine Embedding Loss in the classification header. We leave this exploration for future work.

VIII. THREATS TO VALIDITY
There are several threats to validity in this study. First, our current experiments have only been conducted on Python projects, and results could differ when applied to other programming languages. Also, due to time constraints we fine-tuned and evaluated the T-BERT model performance on only three OSS projects, which may not be enough to draw generalized conclusions. Second, we constructed our experiment datasets from OSS projects by mining the issues and commits whose IDs are explicitly marked as related by project maintainers. Although this is a conventional way of leveraging OSS projects for traceability, true links may be missed. For example, a bug report may have hidden dependencies on several other issues, such as a feature request or another bug report, even though a commit addressing the parent bug report is not marked as 'related'. We alleviate the impact of this phenomenon by adopting the data processing suggested by Liu et al. [38]. Another important threat is that, while the SINGLE architecture trained for the code search problem does not outperform CodeBERT, further improvements could be achieved using hyper-parameter optimization. Our experiments were limited by hardware availability for conducting extensive hyper-parameter tuning. However, the performance comparison across T-BERT models should still be valid because all experiments were conducted with the same parameters. Finally, due to processing time constraints, we evaluated the SINGLE model on 100 queries whilst using the entire testing set for the other models (in Table II). Although not reported, we also evaluated TWIN and SIAMESE on 100 queries and observed that they achieved almost identical results to those obtained from the whole test set, indicating that 100 queries was a reasonable sample size for the SINGLE model.
IX. CONCLUSION AND FUTURE WORK
This study has explored several different BERT architectures for generating trace links between natural language artifacts and programming language artifacts. Our experimental results showed that the SINGLE architecture achieved the best accuracy but at long execution times, whilst the SIAMESE architecture achieved similar accuracy with faster execution times. Second, we showed that ONS training (based on online negative sampling) improved both performance and model convergence speed without incurring significant overheads when compared to DRNS. Third, we found that T-BERT was able to effectively transfer knowledge learned from the code search problem to NLA-PLA traceability, meaning that intermediate-trained T-BERT models can be effectively applied to software engineering projects with limited training examples, alleviating the data sparsity problem for deep neural trace models. Regarding training time, we showed that the same intermediate-trained T-BERT can be applied to OSS projects in three different domains. By avoiding the need for intermediate training on each individual project, our approach was able to efficiently adapt to new domains. In conclusion, our results show that T-BERT generates trace links at far higher degrees of accuracy than existing information retrieval and RNN techniques – bringing us closer to achieving the vision of practical and trustworthy traceability.

To support replication and reproducibility, we have provided links throughout this paper to the datasets that we used, and we provide a complete implementation of T-BERT and execution instructions on Github (https://github.com/jinfenglin/TraceBERT). In future work we will evaluate our approach across more diverse project domains and programming languages, and will explore its application to more diverse types of software artifacts such as requirements, design, and test cases.

ACKNOWLEDGMENT
This work has been partially funded under US National Science Foundation Grant SHF:1901059.
REFERENCES
[1] L. Rierson, Developing Safety-Critical Software: A Practical Guide for Aviation Software and DO-178C Compliance. CRC Press, 2013.
[2] A. Mahmoud, N. Niu, and S. Xu, "A semantic relatedness approach for traceability link recovery," IEEE, 2012, pp. 183–192.
[3] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo, "Recovering traceability links between code and documentation," IEEE Transactions on Software Engineering, vol. 28, no. 10, pp. 970–983, 2002.
[4] J. Huffman Hayes, A. Dekhtyar, and S. K. Sundaram, "Advancing candidate link generation for requirements tracing: The study of methods," IEEE Transactions on Software Engineering, vol. 32, no. 1, pp. 4–19, 2006.
[5] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora, "Enhancing an artefact management system with traceability recovery features," 2004, pp. 306–315.
[6] P. Rempel, P. Mäder, and T. Kuschke, "Towards feature-aware retrieval of refinement traces," N. Niu and P. Mäder, Eds. IEEE Computer Society, 2013, pp. 100–104. [Online]. Available: https://doi.org/10.1109/TEFSE.2013.6620163
[7] A. Dekhtyar, J. H. Hayes, S. K. Sundaram, E. A. Holbrook, and O. Dekhtyar, "Technique integration for requirements assessment," IEEE Computer Society, 2007, pp. 141–150. [Online]. Available: https://doi.org/10.1109/RE.2007.17
[8] H. U. Asuncion, A. U. Asuncion, and R. N. Taylor, "Software traceability with topic modeling," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE 2010, Cape Town, South Africa, 1-8 May 2010, J. Kramer, J. Bishop, P. T. Devanbu, and S. Uchitel, Eds. ACM, 2010, pp. 95–104. [Online]. Available: http://doi.acm.org/10.1145/1806799.1806817
[9] H. Sultanov, J. Huffman Hayes, and W.-K. Kong, "Application of swarm techniques to requirements tracing," Requirements Engineering, vol. 16, no. 3, pp. 209–226, 2011.
[10] J. Guo, J. Cheng, and J. Cleland-Huang, "Semantically enhanced software traceability using deep learning techniques," IEEE, 2017, pp. 3–14.
[11] G. Spanoudakis, A. Zisman, E. Pérez-Miñana, and P. Krause, "Rule-based generation of requirements traceability relations," Journal of Systems and Software, vol. 72, no. 2, pp. 105–127, 2004.
[12] J. Guo, J. Cleland-Huang, and B. Berenbach, "Foundations for an expert system in domain-specific traceability," 2013, pp. 42–51.
[13] J. Cleland-Huang, P. Mäder, M. Mirakhorli, and S. Amornborvornwong, "Breaking the big-bang practice of traceability: Pushing timely trace recommendations to project stakeholders," M. P. E. Heimdahl and P. Sawyer, Eds. IEEE Computer Society, 2012, pp. 231–240. [Online]. Available: https://doi.org/10.1109/RE.2012.6345809
[14] S. Lohar, S. Amornborvornwong, A. Zisman, and J. Cleland-Huang, "Improving trace accuracy through data-driven configuration and composition of tracing features," in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 2013, pp. 378–388.
[15] M. Gethers, R. Oliveto, D. Poshyvanyk, and A. De Lucia, "On integrating orthogonal information retrieval methods to improve traceability link recovery," 2011, pp. 133–142.
[16] M. Rath, J. Rendall, J. L. Guo, J. Cleland-Huang, and P. Mäder, "Traceability in the wild: automatically augmenting incomplete trace links," in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 834–845.
[17] Y. Liu, J. Lin, Q. Zeng, M. Jiang, and J. Cleland-Huang, "Towards semantically guided traceability," in International Conference on Requirements Engineering, vol. 2020, 2020.
[18] S. Lohar, S. Amornborvornwong, A. Zisman, and J. Cleland-Huang, "Improving trace accuracy through data-driven configuration and composition of tracing features," 2013, pp. 378–388.
[19] M. Borg, C. Englund, and B. Duran, "Traceability and deep learning - safety-critical systems with traces ending in deep neural networks," in Proc. of the Grand Challenges of Traceability: The Next Ten Years, pp. 48–49, 2017.
[20] Y. Zhao, T. S. Zaman, T. Yu, and J. H. Hayes, "Using deep learning to improve the accuracy of requirements to code traceability," Grand Challenges of Traceability: The Next Ten Years, p. 22, 2017.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[23] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[25] J. Liu, Y. Lin, Z. Liu, and M. Sun, "XQA: A cross-lingual open-domain question answering dataset," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2358–2368.
[26] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., "CodeBERT: A pre-trained model for programming and natural languages," arXiv preprint arXiv:2002.08155, 2020.
[27] J. Cleland-Huang, O. C. Gotel, J. Huffman Hayes, P. Mäder, and A. Zisman, "Software traceability: trends and future directions," in
Proceedings of the on Future of Software Engineering . ACM, 2014,pp. 55–69.[28] Wikipedia, “Language model — Wikipedia, the free encyclopedia,”2020, [Online; accessed 22-July-2020]. [Online]. Available: https://en.wikipedia.org/wiki/Language model Bidirectional[29] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings usingsiamese bert-networks,” arXiv preprint arXiv:1908.10084 , 2019.[30] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt,“CodeSearchNet Challenge: Evaluating the State of Semantic CodeSearch,” arXiv:1909.09436 [cs, stat] , Sep. 2019, arXiv: 1909.09436.[Online]. Available: http://arxiv.org/abs/1909.09436[31] [Online]. Available: https://huggingface.co/huggingface/CodeBERTa-small-v1[32] [Online]. Available: https://huggingface.co/codistai/codeBERT-small-v2[33] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” arXivpreprint arXiv:2003.10555 , 2020.[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 , 2014.[35] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embed-ding for face recognition and clustering,” in
Proceedings of the IEEEconference on computer vision and pattern recognition , 2015, pp. 815–823.[36] Wikipedia contributors, “Triplet loss,” 202-, [Online; accessed 22-Aug-2020]. [Online]. Available: https://en.wikipedia.org/wiki/Triplet loss[37] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple frame-work for contrastive learning of visual representations,” arXiv preprintarXiv:2002.05709 , 2020.[38] Y. Liu, J. Lin, and J. Cleland-Huang, “Traceability support for multi-lingual software projects,” arXiv preprint arXiv:2006.16940 , 2020.[39] W. Lu, J. Jiao, and R. Zhang, “Twinbert: Distilling knowledge totwin-structured bert models for efficient retrieval,” arXiv preprintarXiv:2002.06275 , 2020.[40] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large an-notated corpus for learning natural language inference,” in
Proceedingsof the 2015 Conference on Empirical Methods in Natural LanguageProcessing (EMNLP) . Association for Computational Linguistics, 2015.[41] A. Williams, N. Nangia, and S. Bowman, “A broad-coverage challengecorpus for sentence understanding through inference,” in
Proceedings ofthe 2018 Conference of the North American Chapter of the Associationfor Computational Linguistics: Human Language Technologies, Volume1 (Long Papers) . Association for Computational Linguistics, 2018, pp.1112–1122. [Online]. Available: http://aclweb.org/anthology/N18-1101[42] X. Gu, H. Zhang, and S. Kim, “Deep code search,” in
Proceedingsof the 40th International Conference on Software Engineering, ICSE2018, Gothenburg, Sweden, May 27 - June 03, 2018 , M. Chaudron,I. Crnkovic, M. Chechik, and M. Harman, Eds. ACM, 2018, pp.933–944. [Online]. Available: https://doi.org/10.1145/3180155.3180167[43] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learningdistributed representations of code,”