NaturalCC: A Toolkit to Naturalize the Source Code Corpus
Yao Wan, Yang He, Jian-Guo Zhang, Yulei Sui, Hai Jin, Guandong Xu, Caiming Xiong, Philip S. Yu
Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
School of Computer Science, University of Technology Sydney, NSW, Australia
Department of Computer Science, University of Illinois at Chicago, Illinois, USA
Salesforce Research, Palo Alto, USA
{wanyao,hjin}@hust.edu.cn, {yulei.sui,guandong.xu}@uts.edu.au, {jzhan51,psyu}@uic.edu, [email protected]

Abstract—We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and to facilitate research on big code analysis. Using NaturalCC, researchers from both the natural language and programming language communities can quickly and easily reproduce state-of-the-art baselines and implement their own approaches. NaturalCC is built upon Fairseq and PyTorch, providing (1) efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each model's performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code retrieval) for demonstration. The video of this demo is available at .

Index Terms—Natural language processing, programming language analysis, big code, toolkit.
I. INTRODUCTION

The rapid growth of machine learning (ML), especially of deep learning (DL) based natural language processing (NLP), brings great opportunities to explore and exploit NLP techniques for various tasks of software engineering (SE), e.g., code documentation [1], [2], code completion [3], and code retrieval [4], [5]. The underlying insight of learning-based code analysis is the naturalness hypothesis shared between natural languages and programming languages. Despite the flourishing research, many state-of-the-art approaches suffer from replicability and reproducibility issues. This is because the performance of deep learning approaches is sensitive to datasets and hyperparameters, and currently no unified open-source toolkit is available to our communities.

To fill this gap, this paper introduces NaturalCC, a comprehensive platform to facilitate research on NLP-based big code analysis. (The term NaturalCC also stands for natural code comprehension, a fundamental task that lies in the synergy between programming languages and NLP.) Researchers from both the SE and NLP communities can benefit from the toolkit for fast training and reproduction. NaturalCC features the following advantages:
• Efficient Data Preprocessing. We have cleaned and preprocessed three public datasets (i.e., CodeSearchNet [6], Py150 [7], and Python [1]) for different tasks. All the data loaders and processing in model training can be parallelized across multiple GPUs. Our toolkit also supports mixed-precision numerical computation.
• Extensibility and Modularity. Based on the registry mechanism implemented in Fairseq [8], our framework is well modularized and can be easily extended to various tasks. In particular, when implementing a new task, we only need to implement the task and models in the corresponding folders and then register them.
• Flexible Interface. We provide flexible APIs for developers to easily invoke the trained models from other applications.

Additionally, we demonstrate NaturalCC with a command line interface as well as a graphical user interface, using three application tasks, i.e., code completion, code comment generation, and code retrieval.

NaturalCC is an ongoing open-source toolkit maintained by the CodeMind team. We hope NaturalCC can facilitate the research of software engineering with natural language processing. We also encourage researchers to integrate their state-of-the-art approaches into NaturalCC, to promote research in both communities. All the source code and materials are publicly available via GitHub: http://github.com/CGCL-codes/naturalcc. We also maintain a website for our team and will post updates at http://xcodemind.github.io.

II. ARCHITECTURE DESIGN
Figure 1 shows the pipeline of NaturalCC. Given a dataset of code snippets, we first preprocess the data in the data preprocessing stage and then feed each mini-batch of samples into the code representation module, a fundamental component for several downstream tasks. (The open-source Fairseq toolkit has inspired us a lot, and our open-source project also follows the MIT license.)

Figure 1: The pipeline of NaturalCC: code data preprocessing (tokenizer, vocabulary building, feature extraction, data loading) in the data folder, code representation (RNN/LSTM/Tree-LSTM, GNN/GGNN, Transformer, BERT, ...) in the modules folder, and applications (code completion, code documentation, code retrieval, type inference, ...) in the tasks folder, all driven by a Trainer.

In the code representation module, we have implemented many state-of-the-art encoders (e.g., RNN [9], GNN [10], Transformer [11], and BERT [12]). Based on the code representation, NaturalCC supports various downstream tasks, e.g., code documentation, programming language modeling, code retrieval, and type inference. The designed Trainer controls model training for each task.
A. Data Preprocessing
In the data preprocessing stage, we first tokenize the source code with a tokenizer (e.g., a space tokenizer or a BPE [13] tokenizer) and then build a vocabulary for these tokens. In addition, we can also extract domain-specific features (e.g., ASTs [1], control-flow graphs [5], or data-flow graphs [14]). The goal of this process is to build a series of mini-batches for training. We put all the data-related processes in the data and dataset folders.
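As a minimal illustration of the tokenize-then-build-vocabulary step described above, consider the following sketch. It is not NaturalCC's actual pipeline (the function names here are hypothetical); it only shows the shape of the process with a trivial whitespace tokenizer.

```python
from collections import Counter

def space_tokenize(code):
    """A trivial whitespace tokenizer; NaturalCC also supports BPE."""
    return code.split()

def build_vocabulary(snippets, min_count=1):
    """Map each token seen at least `min_count` times to an integer id,
    reserving 0/1 for the usual padding and unknown-token symbols."""
    counts = Counter(tok for code in snippets for tok in space_tokenize(code))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(code, vocab):
    """Turn a snippet into a list of ids, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in space_tokenize(code)]
```

The encoded id sequences are then padded and grouped into mini-batches for training.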
B. Code Representation
Code representation/understanding, which aims to learn an embedding vector for source code, is one of the most critical components of big code analysis. In NaturalCC, we have included most state-of-the-art neural encoders to represent source code and its extracted features. For example, we have implemented RNN-based models to represent the sequential tokens or (linearized) AST of code. We implement graph neural networks (GNNs), such as gated graph neural networks (GGNNs), to represent the graph-structured features of code (e.g., control-flow and data-flow graphs). We have also included the Transformer network [11], which serves as a replacement for RNNs thanks to its fast computation and its ability to handle long-range dependencies. In addition, NaturalCC supports masked pre-trained models, e.g., BERT and RoBERTa [15]. We put all the code representation networks in the models and modules folders.
C. Applications
NaturalCC supports many different downstream tasks. We have currently implemented three tasks, i.e., code completion, code comment generation, and code retrieval, to validate the effectiveness of the proposed framework. All the referred baselines on each task have been carefully checked and evaluated against the source code released with the original papers. The implementations and tasks in this toolkit will serve as baselines for fair comparison in future research. We organize all the tasks in the tasks folder.
1) Code Completion:
Code completion, which predicts the next code element based on the previously written code, has become an essential tool in many IDEs and can boost developers' programming productivity. In this task, we have implemented the referred models SeqRNN [16] and TravTrans [3]. We train the models on the Py150 dataset and evaluate them using the MRR metric.
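For reference, MRR (mean reciprocal rank) averages, over all prediction points, the reciprocal of the rank at which the correct token appears in the model's ranked candidate list. A small self-contained sketch (not the toolkit's evaluation code):

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Each ranked list holds candidates ordered by model score; the
    reciprocal rank is 1/position of the target, or 0 if it is absent."""
    total = 0.0
    for candidates, target in zip(ranked_lists, targets):
        rr = 0.0
        for i, cand in enumerate(candidates, start=1):
            if cand == target:
                rr = 1.0 / i
                break
        total += rr
    return total / len(targets)
```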
2) Code Comment Generation:
Generating comments for code snippets is an effective way to support program understanding and to facilitate software development and maintenance. In this task, we have implemented the referred model NeuralTransformer [17]. We trained the model on the Python and Java datasets and evaluated it using the BLEU and ROUGE metrics.
3) Code Retrieval:
Searching for semantically similar code snippets given a natural language query can provide developers with a series of templates as references for rapid prototyping. In this task, we have implemented four benchmark baselines (i.e., NBOW, 1D-CNN, biRNN, and SelfAttn) and evaluated them using the MRR metric on the CodeSearchNet dataset [6]. Additional tasks with state-of-the-art models are still under development and will be released soon, including code clone detection [18], type inference [19], vulnerability detection [20], and masked language modeling for code pre-training [21].
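The retrieval setup above reduces to ranking code snippets by the similarity of their embeddings to the query embedding. A minimal sketch with cosine similarity (the embedding vectors here are given directly; in NaturalCC they would come from the encoders of the code representation module):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, code_vecs, top_k=3):
    """Return indices of the top-k code snippets by cosine similarity."""
    ranked = sorted(range(len(code_vecs)),
                    key=lambda i: cosine(query_vec, code_vecs[i]),
                    reverse=True)
    return ranked[:top_k]
```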
D. Trainer
We have designed a trainer module (ncc_trainer.py) to control the whole training process of models. Furthermore, we have designed and implemented a simpler, universal trainer (ncc_trainer_simple.py) for beginners who are not yet familiar with our framework.

III. IMPLEMENTATION
We have implemented NaturalCC based on Fairseq and PyTorch. Following the outstanding registry mechanism designed in Fairseq, NaturalCC achieves good extensibility through its modularized design.
A. Registry Mechanism

def register_task(name):
    def register_task_cls(cls):
        if name in TASK_REGISTRY:
            raise ValueError('Duplicate error...')
        TASK_REGISTRY[name] = cls
        TASK_CLASS_NAMES.add(cls.__name__)
        return cls
    return register_task_cls

Listing 1: The registry mechanism in __init__.py
We have implemented a register decorator at the entry point of building a task, model, or module (cf. __init__.py in each folder). Listing 1 shows the workflow of registering a new task. In brief, the registry mechanism uses a global variable to store each task or model object for off-the-shelf fetching. This mechanism provides extensibility, as we only need to apply this decorator when defining a new task/model/module in the corresponding folder.
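As a self-contained illustration of how such a decorator is used, the following mirrors Listing 1 in miniature and registers an invented toy task, then fetches it back by name the way the Trainer would:

```python
TASK_REGISTRY = {}

def register_task(name):
    """Store the decorated class in a global table under `name`."""
    def register_task_cls(cls):
        if name in TASK_REGISTRY:
            raise ValueError(f"Cannot register duplicate task: {name}")
        TASK_REGISTRY[name] = cls
        return cls
    return register_task_cls

@register_task("toy_completion")
class ToyCompletionTask:
    def run(self):
        return "completing code"

# Fetch the task off the shelf by its registered name.
task = TASK_REGISTRY["toy_completion"]()
```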
B. Multi-GPU Training
Following Fairseq, we use the NCCL library and torch.distributed to support model training on multiple GPUs. Every GPU stores a copy of the model parameters, and the global optimizer performs synchronous optimization across GPUs. Gradient accumulation is also supported to mitigate multi-GPU communication overhead.
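Gradient accumulation simply sums gradients over several micro-batches before taking one optimizer step, which yields the same update as one large batch. A PyTorch sketch under illustrative names (NaturalCC inherits the actual implementation from Fairseq):

```python
import torch

def accumulated_step(model, loss_fn, micro_batches, optimizer, accum_steps):
    """One optimizer update from `accum_steps` micro-batches: gradients
    simply add up across backward() calls until step() is taken."""
    optimizer.zero_grad()
    for inputs, targets in micro_batches:
        # Scale each piece so the accumulated gradient equals the full-batch mean.
        loss = loss_fn(model(inputs), targets) / accum_steps
        loss.backward()  # accumulates into .grad
    optimizer.step()
```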
C. Mixed-Precision

NaturalCC supports both full-precision (FP32) and half-precision (FP16) floating point for fast training and inference. In our experience, enabling the FP16 option largely reduces memory usage and further saves training time. To preserve model accuracy, the parameters are stored in FP32 while being updated by FP16 gradients.
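The FP32-master-weights idea can be sketched in a few lines: the half-precision gradient is cast back up to FP32 before updating the full-precision copy, and loss scaling keeps small gradient values from flushing to zero in FP16. This is illustrative only; in practice the logic is handled by Fairseq's FP16 optimizer.

```python
import torch

def fp16_update(master_param, fp16_grad, lr=0.1, loss_scale=128.0):
    """Update an FP32 master parameter with a (scaled) FP16 gradient."""
    grad32 = fp16_grad.float() / loss_scale  # cast up, undo the loss scale
    master_param -= lr * grad32              # the FP32 update preserves accuracy
    return master_param
```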
D. An Implementation Example
We take code completion as an example to show the pipeline of how to quickly build a new task in NaturalCC. Note that, in this section, we only describe the main steps in each file; for more details we refer the reader to the corresponding source files.
Building a task. In the first step, we create a CompletionTask in ncc/tasks/completion.py, decorated with register_task. Listing 2 shows the whole process of building a new task. This class provides a build_model function for building a model according to the arguments defined by the user.

@register_task('completion')
class CompletionTask(NccTask):
    def __init__(self, args, dictionary):
        super().__init__(args)
        self.dictionary = dictionary

    @classmethod
    def setup_task(cls, args, **kwargs):
        dictionary = cls.load_dictionary(args)
        return cls(args, dictionary)

    def build_model(self, args):
        model = super().build_model(args)
        return model

Listing 2: tasks/completion/completion.py
Building a model. Listing 3 shows the process of building an RNN model for code completion. We define a new class SeqRNNModel in ncc/models/completion/seqrnn.py, which inherits NccLanguageModel. In this class, we build a decoder network LSTMDecoder, which is implemented in the modules folder.

@register_model('seqrnn')
class SeqRNNModel(NccLanguageModel):
    def __init__(self, args, decoder):
        super().__init__(decoder)
        self.args = args

    @classmethod
    def build_model(cls, args, config, task):
        decoder = LSTMDecoder( ... )
        return cls(args, decoder)

Listing 3: models/completion/seqrnn.py
Model training. Listing 4 shows the construction of a Trainer and the pipeline of training steps. Core parameters are tracked in this class so that pre-trained models can be precisely restored during inference or fine-tuning.

task = tasks.setup_task(args)
model = task.build_model(args)
criterion = task.build_criterion(args)
trainer = Trainer(args, task, model, criterion)
while (lr > args['optimization']['min_lr']
       and epoch_itr.next_epoch_idx <= max_epoch
       and trainer.get_num_updates() < max_update):
    task.train_step(samples)

Listing 4: trainer/trainer.py
IV. DEMONSTRATION

A. Command Line Interface

NaturalCC provides a command line interface that enables researchers and developers to easily explore the included state-of-the-art models. For each code analysis task, users can try this command:

$ python -m cli.predictor -m

NaturalCC will automatically load the model parameters, process the user input, and return detailed inference information.
B. Graphical User Interface
We have also provided a graphical user interface for users to easily access and explore each trained model's results through a Web browser. The design of our Web interface is based on the open-source demo of AllenNLP [22]. We have deployed the graphical demo on an Nginx server and provide flexible APIs via the Flask engine.

As shown in Figure 2, we have integrated three popular software engineering tasks for demonstration, i.e., code completion, code comment generation, and code retrieval. Take code completion as an example. By default, we implement this task based on programming language modeling. Given a series of Python tokens written by the user, the predicted next tokens with their corresponding probabilities generated by our model appear as the user types. On this page, users can also select the trained model to use.

Figure 2: A screenshot of our graphical user interface for demonstration.

V. CONCLUSION AND FUTURE WORK
This paper presents NaturalCC, an efficient and extensible open-source toolkit to bridge the gap between programming language and natural language. Currently, NaturalCC implements several state-of-the-art models across three popular software engineering tasks. We provide a detailed walkthrough as an example of how to quickly implement a new task. For demonstration, we have provided a command line tool as well as a graphical user interface for other researchers to do quick prototyping. All the materials about this toolkit can be accessed from http://xcodemind.github.io.

We will extend this toolkit to more software engineering tasks in our future work, including code clone detection, vulnerability detection, and masked language modeling. We also encourage more researchers to join our team to promote the development of this toolkit as well as the whole research community.

ACKNOWLEDGEMENTS

Fairseq highly inspired NaturalCC. We appreciate the Fairseq team for their contribution and the high-quality backbone structure of the framework.

REFERENCES

[1] Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and P. S. Yu, "Improving automatic source code summarization via deep reinforcement learning," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 2018, pp. 397–407.
[2] U. Alon, S. Brody, O. Levy, and E. Yahav, "code2seq: Generating sequences from structured representations of code," arXiv preprint arXiv:1808.01400, 2018.
[3] S. Kim, J. Zhao, Y. Tian, and S. Chandra, "Code prediction by feeding trees to transformers," arXiv preprint arXiv:2003.13848, 2020.
[4] X. Gu, H. Zhang, and S. Kim, "Deep code search," IEEE, 2018, pp. 933–944.
[5] Y. Wan, J. Shu, Y. Sui, G. Xu, Z. Zhao, J. Wu, and P. Yu, "Multi-modal attention network learning for semantic source code retrieval," 2019, pp. 13–25.
[6] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," arXiv preprint arXiv:1909.09436, 2019.
[7] "150k Python dataset," https://eth-sri.github.io/py150, 2016, [Online; accessed 1-Nov-2020].
[8] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
[9] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[10] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[13] R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes, "Big code != big vocabulary: Open-vocabulary models for source code," arXiv preprint arXiv:2003.07914, 2020.
[14] M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," arXiv preprint arXiv:1711.00740, 2017.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[16] V. Raychev, M. Vechev, and E. Yahav, "Code completion with statistical language models," in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014, pp. 419–428.
[17] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, "A transformer-based approach for source code summarization," arXiv preprint arXiv:2005.00653, 2020.
[18] W. Hua, Y. Sui, Y. Wan, G. Liu, and G. Xu, "FCCA: Hybrid code representation for functional clone detection using attention networks," IEEE Transactions on Reliability, pp. 1–15, 2020.
[19] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, "Deep learning type inference," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 152–162.
[20] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, "VulDeePecker: A deep learning-based system for vulnerability detection," arXiv preprint arXiv:1801.01681, 2018.
[21] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., "CodeBERT: A pre-trained model for programming and natural languages," arXiv preprint arXiv:2002.08155, 2020.
[22] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer, "AllenNLP: A deep semantic natural language processing platform," arXiv preprint arXiv:1803.07640, 2018.