NaturalCC: A Toolkit to Naturalize the Source Code Corpus
Yao Wan, Yang He, Jian-Guo Zhang, Yulei Sui, Hai Jin, Guandong Xu, Caiming Xiong, Philip S. Yu
Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
School of Computer Science, University of Technology Sydney, NSW, Australia
Department of Computer Science, University of Illinois at Chicago, Illinois, USA
Salesforce Research, Palo Alto, USA
{wanyao,hjin}@hust.edu.cn, {yulei.sui,guandong.xu}@uts.edu.au, {jzhan51,psyu}@uic.edu, [email protected]

Abstract—We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and to facilitate research on big code analysis. Using NaturalCC, researchers from both the natural language and programming language communities can quickly and easily reproduce state-of-the-art baselines and implement their own approaches. NaturalCC is built upon Fairseq and PyTorch, providing (1) efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each model's performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code retrieval) for demonstration. The video of this demo is available at .

Index Terms—Natural language processing, programming language analysis, big code, toolkit.
I. INTRODUCTION

The rapid growth of machine learning (ML), especially of deep learning (DL) based natural language processing (NLP), brings great opportunities to explore and exploit NLP techniques for various tasks of software engineering (SE), e.g., code documentation [1], [2], code completion [3], and code retrieval [4], [5]. The underlying insight of learning-based code analysis is the naturalness hypothesis shared between natural languages and programming languages. Despite the flourishing research, many state-of-the-art approaches suffer from replicability and reproducibility issues. This is because the performance of deep learning approaches is sensitive to datasets and hyperparameters, and currently no unified open-source toolkit is available to our communities.

To fill this gap, this paper introduces NaturalCC, a comprehensive platform to facilitate research on NLP-based big code analysis. (The term NaturalCC also stands for natural code comprehension, a fundamental task that lies in the synergy between programming languages and NLP.) Researchers from both the SE and NLP communities can benefit from the toolkit for fast training and reproduction. NaturalCC features the following advantages:
• Efficient Data Preprocessing. We have cleaned and preprocessed three public datasets (i.e., CodeSearchNet [6], Py150 [7], and Python [1]) for different tasks. All the data loaders and processing in model training can be parallelized across multiple GPUs. Our toolkit also supports mixed-precision numerical computation.
• Extensibility and Modularity. Based on the registry mechanism implemented in Fairseq [8], our framework is well modularized and can be easily extended to various tasks. In particular, when implementing a new task, we only need to implement the task and models in the corresponding folders and then register them.
• Flexible Interface. We provide flexible APIs for developers to easily invoke the trained models from other applications.

Additionally, we demonstrate NaturalCC with a command line interface as well as a graphical user interface, using three application tasks, i.e., code completion, code comment generation, and code retrieval.

NaturalCC is an ongoing open-source toolkit maintained by the CodeMind team. We hope NaturalCC can facilitate the research of software engineering with natural language processing. We also encourage researchers to integrate their state-of-the-art approaches into NaturalCC, to promote research in both communities. All the source code and materials are publicly available via GitHub: http://github.com/CGCL-codes/naturalcc. We also maintain a website for our team and will post updates at http://xcodemind.github.io.

II. ARCHITECTURE DESIGN
Figure 1 shows the pipeline of NaturalCC. Given a dataset of code snippets, we first preprocess the data in the data preprocessing stage and then feed each mini-batch of samples into the code representation module, a fundamental component for several downstream tasks. (The open-source Fairseq toolkit has inspired us a lot, and our open-source project also follows the MIT license.)

Figure 1: The pipeline of NaturalCC: code data preprocessing (tokenizer, vocabulary building, feature extraction, data loading) in the data folder, code representation (RNN/LSTM/Tree-LSTM, GNN/GGNN, Transformer, BERT, ...) in the modules folder, and applications (code completion, code documentation, code retrieval, type inference, ...) in the tasks folder, all driven by a Trainer.

In the code representation module, we have implemented many state-of-the-art encoders (e.g., RNN [9], GNN [10], Transformer [11], and BERT [12]). Based on the code representation, NaturalCC supports various downstream tasks, e.g., code documentation, programming language modeling, code retrieval, and type inference. The designed Trainer controls model training for each task.
A. Data Preprocessing
In the data preprocessing stage, we first tokenize the source code with a tokenizer (e.g., a space tokenizer or a BPE [13] tokenizer) and then build a vocabulary for these tokens. In addition, we can also extract domain-specific features (e.g., ASTs [1], control-flow graphs [5], or data-flow graphs [14]). The goal of this process is to build a series of mini-batches for training. We put all the data-related processes in the data and dataset folders.
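As a minimal illustration of the tokenize-then-build-vocabulary step described above, consider the following sketch. It is not NaturalCC's actual pipeline (the function names here are hypothetical); it only shows the shape of the process with a trivial whitespace tokenizer.

```python
from collections import Counter

def space_tokenize(code):
    """A trivial whitespace tokenizer; NaturalCC also supports BPE."""
    return code.split()

def build_vocabulary(snippets, min_count=1):
    """Map each token seen at least `min_count` times to an integer id,
    reserving 0/1 for the usual padding and unknown-token symbols."""
    counts = Counter(tok for code in snippets for tok in space_tokenize(code))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(code, vocab):
    """Turn a snippet into a list of ids, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in space_tokenize(code)]
```

The encoded id sequences are then padded and grouped into mini-batches for training.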
B. Code Representation
Code representation/understanding, which aims to learn an embedding vector for source code, is one of the most critical components of big code analysis. In NaturalCC, we have included most state-of-the-art neural encoders to represent source code and its extracted features. For example, we have implemented RNN-based models to represent the sequential tokens or (linearized) AST of code. We implement graph neural networks (GNNs), such as gated graph neural networks (GGNNs), to represent the graph-structured features of code (e.g., control-flow and data-flow graphs). We have also included the Transformer network [11], which serves as a replacement for RNNs thanks to its fast computation and its ability to handle long-range dependencies. In addition, NaturalCC supports masked pre-trained models, e.g., BERT and RoBERTa [15]. We put all the code representation networks in the models and modules folders.
C. Applications
NaturalCC supports many different downstream tasks. We have currently implemented three tasks, i.e., code completion, code comment generation, and code retrieval, to validate the effectiveness of the proposed framework. All the referred baselines on each task have been carefully checked and evaluated against the source code released with the original papers. The implementations and tasks in this toolkit will serve as baselines for fair comparison in future research. We organize all the tasks in the tasks folder.
1) Code Completion:
Code completion, which predicts the next code element based on the previously written code, has become an essential tool in many IDEs and can boost developers' programming productivity. In this task, we have implemented the referred models SeqRNN [16] and TravTrans [3]. We train the models on the Py150 dataset and evaluate them using the MRR metric.
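For reference, MRR (mean reciprocal rank) averages, over all prediction points, the reciprocal of the rank at which the correct token appears in the model's ranked candidate list. A small self-contained sketch (not the toolkit's evaluation code):

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Each ranked list holds candidates ordered by model score; the
    reciprocal rank is 1/position of the target, or 0 if it is absent."""
    total = 0.0
    for candidates, target in zip(ranked_lists, targets):
        rr = 0.0
        for i, cand in enumerate(candidates, start=1):
            if cand == target:
                rr = 1.0 / i
                break
        total += rr
    return total / len(targets)
```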
2) Code Comment Generation:
Generating comments for code snippets is an effective way to support program understanding and to facilitate software development and maintenance. In this task, we have implemented the referred model NeuralTransformer [17]. We trained the model on the Python and Java datasets and evaluated it using the BLEU and ROUGE metrics.
3) Code Retrieval:
Searching for semantically similar code snippets given a natural language query can provide developers with a series of templates as references for rapid prototyping. In this task, we have implemented four benchmark baselines (i.e., NBOW, 1D-CNN, biRNN, and SelfAttn) and evaluated them using the MRR metric on the CodeSearchNet dataset [6]. Additional tasks with state-of-the-art models are still under development and will be released soon, including code clone detection [18], type inference [19], vulnerability detection [20], and masked language modeling for code pre-training [21].
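The retrieval setup above reduces to ranking code snippets by the similarity of their embeddings to the query embedding. A minimal sketch with cosine similarity (the embedding vectors here are given directly; in NaturalCC they would come from the encoders of the code representation module):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, code_vecs, top_k=3):
    """Return indices of the top-k code snippets by cosine similarity."""
    ranked = sorted(range(len(code_vecs)),
                    key=lambda i: cosine(query_vec, code_vecs[i]),
                    reverse=True)
    return ranked[:top_k]
```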
D. Trainer
We have designed a trainer module (ncc_trainer.py) to control the whole training process of models. Furthermore, we have designed and implemented a simpler, universal trainer (ncc_trainer_simple.py) for beginners who are not yet familiar with our framework.

III. IMPLEMENTATION
We have implemented NaturalCC based on Fairseq and PyTorch. Following the outstanding registry mechanism designed in Fairseq, NaturalCC achieves good extensibility through its modularized design.
A. Registry Mechanism

def register_task(name):
    def register_task_cls(cls):
        if name in TASK_REGISTRY:
            raise ValueError('Duplicate error...')
        TASK_REGISTRY[name] = cls
        TASK_CLASS_NAMES.add(cls.__name__)
        return cls
    return register_task_cls

Listing 1: The registry mechanism in __init__.py
We have implemented a register decorator at the entry point of building a task, model, or module (cf. __init__.py in each folder). Listing 1 shows the workflow of registering a new task. In brief, the registry mechanism uses a global variable to store each task or model object for off-the-shelf fetching. This mechanism provides extensibility, as we only need to apply this decorator when defining a new task/model/module in the corresponding folder.
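As a self-contained illustration of how such a decorator is used, the following mirrors Listing 1 in miniature and registers an invented toy task, then fetches it back by name the way the Trainer would:

```python
TASK_REGISTRY = {}

def register_task(name):
    """Store the decorated class in a global table under `name`."""
    def register_task_cls(cls):
        if name in TASK_REGISTRY:
            raise ValueError(f"Cannot register duplicate task: {name}")
        TASK_REGISTRY[name] = cls
        return cls
    return register_task_cls

@register_task("toy_completion")
class ToyCompletionTask:
    def run(self):
        return "completing code"

# Fetch the task off the shelf by its registered name.
task = TASK_REGISTRY["toy_completion"]()
```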
B. Multi-GPU Training
Following Fairseq, we use the NCCL library and torch.distributed to support model training on multiple GPUs. Every GPU stores a copy of the model parameters, and the global optimizer performs synchronous optimization across GPUs. Gradient accumulation is also supported to mitigate multi-GPU communication overhead.
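Gradient accumulation simply sums gradients over several micro-batches before taking one optimizer step, which yields the same update as one large batch. A PyTorch sketch under illustrative names (NaturalCC inherits the actual implementation from Fairseq):

```python
import torch

def accumulated_step(model, loss_fn, micro_batches, optimizer, accum_steps):
    """One optimizer update from `accum_steps` micro-batches: gradients
    simply add up across backward() calls until step() is taken."""
    optimizer.zero_grad()
    for inputs, targets in micro_batches:
        # Scale each piece so the accumulated gradient equals the full-batch mean.
        loss = loss_fn(model(inputs), targets) / accum_steps
        loss.backward()  # accumulates into .grad
    optimizer.step()
```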
C. Mixed-Precision

NaturalCC supports both full-precision (FP32) and half-precision (FP16) floating point for fast training and inference. In our experience, enabling the FP16 option largely reduces memory usage and further saves training time. To preserve model accuracy, the parameters are stored in FP32 while being updated by FP16 gradients.
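The FP32-master-weights idea can be sketched in a few lines: the half-precision gradient is cast back up to FP32 before updating the full-precision copy, and loss scaling keeps small gradient values from flushing to zero in FP16. This is illustrative only; in practice the logic is handled by Fairseq's FP16 optimizer.

```python
import torch

def fp16_update(master_param, fp16_grad, lr=0.1, loss_scale=128.0):
    """Update an FP32 master parameter with a (scaled) FP16 gradient."""
    grad32 = fp16_grad.float() / loss_scale  # cast up, undo the loss scale
    master_param -= lr * grad32              # the FP32 update preserves accuracy
    return master_param
```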
D. An Implementation Example
We take code completion as an example to show the pipeline of how to quickly build a new task in NaturalCC. Note that, in this section, we only describe the main steps in each file; for more details we refer the reader to the corresponding source files.
Building a task. In the first step, we create a CompletionTask in ncc/tasks/completion.py, decorated with register_task. Listing 2 shows the whole process of building a new task. This class provides a build_model function for building a model according to the arguments defined by the user.

@register_task('completion')
class CompletionTask(NccTask):
    def __init__(self, args, dictionary):
        super().__init__(args)
        self.dictionary = dictionary

    @classmethod
    def setup_task(cls, args, **kwargs):
        dictionary = cls.load_dictionary(args)
        return cls(args, dictionary)

    def build_model(self, args):
        model = super().build_model(args)
        return model

Listing 2: tasks/completion/completion.py
Building a model. Listing 3 shows the process of building an RNN model for code completion. We define a new class SeqRNNModel in ncc/models/completion/seqrnn.py, which inherits NccLanguageModel. In this class, we build a decoder network LSTMDecoder, which is implemented in the modules folder.

@register_model('seqrnn')
class SeqRNNModel(NccLanguageModel):
    def __init__(self, args, decoder):
        super().__init__(decoder)
        self.args = args

    @classmethod
    def build_model(cls, args, config, task):
        decoder = LSTMDecoder( ... )
        return cls(args, decoder)

Listing 3: models/completion/seqrnn.py
Model training. Listing 4 shows the construction of a Trainer and the pipeline of training steps. Core parameters are tracked in this class so that pre-trained models can be precisely restored during inference or fine-tuning.

task = tasks.setup_task(args)
model = task.build_model(args)
criterion = task.build_criterion(args)
trainer = Trainer(args, task, model, criterion)
while (lr > args['optimization']['min_lr']
       and epoch_itr.next_epoch_idx <= max_epoch
       and trainer.get_num_updates() < max_update):
    task.train_step(samples)

Listing 4: trainer/trainer.py
IV. DEMONSTRATION

A. Command Line Interface

NaturalCC provides a command line interface that enables researchers and developers to easily explore the included state-of-the-art models. For each code analysis task, users can try this command:

$ python -m cli.predictor -m

NaturalCC will automatically load the model parameters, process the user input, and return detailed inference information.
B. Graphical User Interface
We have also provided a graphical user interface for users to easily access and explore each trained model's results through a Web browser. The design of our Web interface is based on the open-source demo of AllenNLP [22]. We have deployed the graphical demo on an Nginx server and provide flexible APIs via the Flask engine.

As shown in Figure 2, we have integrated three popular software engineering tasks for demonstration, i.e., code completion, code comment generation, and code retrieval. Take code completion as an example. By default, we implement this task based on programming language modeling. Given a series of Python tokens written by the user, the predicted next tokens with their corresponding probabilities generated by our model appear as the user types. On this page, users can also select the trained model to use.

Figure 2: A screenshot of our graphical user interface for demonstration.

V. CONCLUSION AND FUTURE WORK
This paper presents NaturalCC, an efficient and extensible open-source toolkit to bridge the gap between programming language and natural language. Currently, NaturalCC implements several state-of-the-art models across three popular software engineering tasks. We provide a detailed walkthrough as an example of how to quickly implement a new task. For demonstration, we have provided a command line tool as well as a graphical user interface for other researchers to do quick prototyping. All the materials about this toolkit can be accessed from http://xcodemind.github.io.

We will extend this toolkit to more software engineering tasks in our future work, including code clone detection, vulnerability detection, and masked language modeling. We also encourage more researchers to join our team to promote the development of this toolkit as well as the whole research community.

ACKNOWLEDGEMENTS

Fairseq highly inspired NaturalCC. We appreciate the Fairseq team for their contribution and the high-quality backbone structure of the framework.

REFERENCES

[1] Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and P. S. Yu, "Improving automatic source code summarization via deep reinforcement learning," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 2018, pp. 397–407.
[2] U. Alon, S. Brody, O. Levy, and E. Yahav, "code2seq: Generating sequences from structured representations of code," arXiv preprint arXiv:1808.01400, 2018.
[3] S. Kim, J. Zhao, Y. Tian, and S. Chandra, "Code prediction by feeding trees to transformers," arXiv preprint arXiv:2003.13848, 2020.
[4] X. Gu, H. Zhang, and S. Kim, "Deep code search," IEEE, 2018, pp. 933–944.
[5] Y. Wan, J. Shu, Y. Sui, G. Xu, Z. Zhao, J. Wu, and P. Yu, "Multi-modal attention network learning for semantic source code retrieval," 2019, pp. 13–25.
[6] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," arXiv preprint arXiv:1909.09436, 2019.
[7] "150k Python dataset," https://eth-sri.github.io/py150, 2016, [Online; accessed 1-Nov-2020].
[8] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
[9] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[10] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[13] R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes, "Big code != big vocabulary: Open-vocabulary models for source code," arXiv preprint arXiv:2003.07914, 2020.
[14] M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," arXiv preprint arXiv:1711.00740, 2017.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[16] V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014, pp. 419–428.
[17] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, "A transformer-based approach for source code summarization," arXiv preprint arXiv:2005.00653, 2020.
[18] W. Hua, Y. Sui, Y. Wan, G. Liu, and G. Xu, "FCCA: Hybrid code representation for functional clone detection using attention networks," IEEE Transactions on Reliability, pp. 1–15, 2020.
[19] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, "Deep learning type inference," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 152–162.
[20] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, "VulDeePecker: A deep learning-based system for vulnerability detection," arXiv preprint arXiv:1801.01681, 2018.
[21] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., "CodeBERT: A pre-trained model for programming and natural languages," arXiv preprint arXiv:2002.08155, 2020.
[22] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer, "AllenNLP: A deep semantic natural language processing platform," arXiv preprint arXiv:1803.07640, 2018.