Understanding Challenges in Deploying Deep Learning Based Software: An Empirical Study

Zhenpeng Chen, Yanbin Cao, Yuanqiang Liu
Peking University, Beijing, China
{czp,caoyanbin,yuanqiangliu}@pku.edu.cn

Haoyu Wang
Beijing University of Posts and Telecommunications, Beijing, China
[email protected]

Tao Xie, Xuanzhe Liu∗
Peking University, Beijing, China
{taoxie,xzl}@pku.edu.cn

∗ Corresponding author: Xuanzhe Liu ([email protected]).
ABSTRACT
Deep learning (DL) is becoming increasingly pervasive, being used in a wide range of software applications. These software applications, termed DL based software (in short, DL software), integrate DL models trained using a large data corpus with DL programs written based on DL frameworks such as TensorFlow and Keras. A DL program encodes the network structure of a desirable DL model and the process by which the model is trained using the training data. To help developers of DL software meet the new challenges posed by DL, enormous research efforts in software engineering have been devoted. Existing studies focus on the development of DL software and extensively analyze faults in DL programs. However, the deployment of DL software has not been comprehensively studied. To fill this knowledge gap, this paper presents a comprehensive study on understanding challenges in deploying DL software. We mine and analyze 3,023 relevant posts from Stack Overflow, a popular Q&A website for developers, and show the increasing popularity and high difficulty of DL software deployment among developers. We build a taxonomy of the specific challenges encountered by developers in the process of DL software deployment through manual inspection of 769 sampled posts, and report a series of actionable implications for researchers, developers, and DL framework vendors.
KEYWORDS
deep learning, software deployment, Stack Overflow
1 INTRODUCTION
Deep learning (DL) has been used in a wide range of software applications from different domains, including natural language processing [76], speech recognition [89], image processing [79], disease diagnosis [80], autonomous driving [86], etc. These software applications, termed DL based software (in short, DL software), integrate DL models trained using a large data corpus with DL programs. To implement DL programs, developers rely on DL frameworks (e.g., TensorFlow [68] and Keras [62]), which encode the structure of desirable DL models and the process by which the models are trained using the training data.

The increasing dependence of current software applications on DL (as in DL software) makes it a crucial topic in the software engineering (SE) research community. Specifically, many research efforts [78, 82, 83, 103, 105] have been devoted to characterizing the new challenges that DL poses to software development. To characterize the challenges that developers encounter in this process, various studies [83, 103, 105] focus on analyzing faults in DL programs. For instance, Islam et al. [83] presented a comprehensive study of faults in DL programs written based on the TensorFlow (TF) [68], Keras [62], PyTorch [63], Theano [71], and Caffe [84] frameworks.

Recently, with the great demand of deploying DL software to different platforms for real usage [78, 88, 99], DL also poses new challenges to software deployment, i.e., deploying DL software on a specific platform. For example, a computation-intensive DL model in DL software can be executed efficiently on PC platforms with GPU support, but it cannot be directly deployed and executed on platforms with limited computing power, such as mobile devices. To facilitate this deployment process, DL frameworks such as TF Lite [65] and Core ML [60] have been rolled out by major vendors. Furthermore, SE researchers and practitioners have also begun to focus on DL software deployment. For example, Guo et al. [81] investigated the changes in prediction accuracy and performance when DL models trained on PC platforms are deployed to mobile devices and browsers, and unveiled that the deployment still suffers from compatibility and reliability issues. Additionally, DL software deployment also poses specific programming challenges to developers, such as converting DL models to the formats expected by the deployment platforms; these challenges are frequently asked about in developers' Q&A forums [1-4]. Despite these efforts, to the best of our knowledge, a fundamental question remains under-investigated: what specific challenges do developers face when deploying DL software?

To bridge the knowledge gap, this paper presents the first comprehensive empirical study on identifying challenges in deploying DL software. Given the surging interest in DL and the importance of DL software deployment, such a study can aid developers in avoiding common pitfalls and make researchers and DL framework vendors better positioned to help software engineers perform the deployment task in a more targeted way. Besides mobile devices and browsers, which have been considered in previous work [81], in this work we also take into account server/cloud platforms, where a large number of DL software applications are deployed [78, 104]. To understand the struggles faced by developers when they deploy DL software, we analyze the relevant posts from a variety of developers on Stack Overflow (SO), one of the most popular Q&A forums for developers. When developers have trouble solving the programming issues that they meet, they often seek technological advice from peers on SO [78]. Therefore, it has been a common practice for researchers to understand the challenges that developers encounter when dealing with different engineering tasks from SO posts, as shown in recent work [70, 72-74, 78, 93, 102, 103].

Unless explicitly stated, framework vendors in this paper refer to vendors of deployment related frameworks such as TF Lite and Core ML.
Figure 1: DL software development and deployment. (In development, training data are used to train DL models; in deployment, the trained models are exported for server/cloud platforms or converted for mobile and browser platforms.)
Our study collects and analyzes 3,023 SO posts regarding deploying DL software to server/cloud, mobile, and browser platforms. Based on these posts, we focus our study on the following research questions.
RQ1: Popularity trend.
Through quantitative analysis, we find that DL software deployment is gaining increasing attention, and we find evidence of the timeliness and urgency of our study.
RQ2: Difficulty.
We measure the difficulty of DL software deployment using well-adopted metrics in SE. Results show that the deployment of DL software is more challenging than other aspects related to DL software. This finding motivates us to further unveil the specific challenges encountered by developers in DL software deployment.
RQ3: Taxonomy of Challenges.
To identify specific challenges in DL software deployment, we randomly sample a set of 769 relevant SO posts for manual examination. For each post, we qualitatively extract the challenge behind it. Finally, we build a comprehensive taxonomy consisting of 72 categories, linked to challenges in deploying DL software to server/cloud, mobile, and browser platforms. The resulting taxonomy indicates that DL software deployment faces a wide spectrum of challenges.

Furthermore, we discuss actionable implications (derived from our results) for researchers, developers, and DL framework vendors.
2 BACKGROUND
We first briefly describe the current practice of DL software development and deployment. Figure 1 distinguishes the two processes.
DL software development.
To integrate DL capabilities into software applications, developers make use of state-of-the-art DL frameworks (e.g., TF and Keras) in the software development process. Specifically, they use these frameworks to create the architecture of DL models and specify run-time configurations (e.g., hyper-parameters). In a DL model, multiple layers of transformation functions are used to convert input to output, with each layer learning successively higher levels of abstraction in the data. Then large-scale data (i.e., the training data) are used to train (i.e., adjust the weights of) the multiple layers. Finally, validation data, which are different from the training data, are used to tune the model. Due to the space limit, we show only the model training phase in Figure 1.
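For illustration, the following is a minimal sketch of such a DL program written with Keras; the network structure, hyper-parameters, and randomly generated data are hypothetical placeholders rather than a setup used in this study.

```python
import numpy as np
import tensorflow as tf

# Hypothetical training/validation data standing in for a real corpus.
x_train, y_train = np.random.rand(100, 784), np.random.randint(0, 10, 100)
x_val, y_val = np.random.rand(20, 784), np.random.randint(0, 10, 20)

# The DL program encodes the network structure...
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# ...and the process by which the model is trained on the training data.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=3, validation_data=(x_val, y_val))
```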
DL software deployment.
After DL software has been well validated and tested, it is ready to be deployed to different platforms for real usage. The most popular way is to deploy DL software on server or cloud platforms [104]. Such deployment enables developers to invoke services powered by DL techniques by simply calling an API endpoint. Some frameworks (e.g., TF Serving [66]) and platforms (e.g., Google Cloud ML Engine [61]) can facilitate this deployment. In addition, there is a rising demand for deploying DL software to mobile devices [99] and browsers [88]. For mobile platforms, due to their limited computing power, memory size, and energy capacity, models that are trained on PC platforms and used in DL software cannot always be deployed directly to the mobile platforms. Therefore, some lightweight DL frameworks, such as TF Lite for Android and Core ML for iOS, are specifically designed for converting pre-trained DL models to the formats supported by mobile platforms. In addition, it is a common practice to perform model quantization before deploying DL models on mobile devices, in order to reduce memory cost and computing overhead [81, 99]. For model quantization, TF Lite supports only converting model weights from floating points to 8-bit integers, while Core ML allows flexible quantization modes, such as 32 bits to 16/8/4 bits [81]. For browsers, some solutions (e.g., TF.js [67]) have been proposed for deploying DL models in Web environments.
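To make the conversion step concrete, the following is a minimal sketch of converting a trained Keras model to the TF Lite format via the public TFLiteConverter API; the file paths are hypothetical.

```python
import tensorflow as tf

# Load a trained Keras model (hypothetical path) and convert it to the
# TF Lite flat-buffer format expected by mobile platforms.
model = tf.keras.models.load_model("model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```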
Scope.
We focus our analysis on DL software deployment. Specifically, we analyze the challenges in deploying DL software to different platforms, including server/cloud, mobile, and browser platforms. Any problems related to this process are within our scope. However, challenges related to DL software development (e.g., model training) are not considered in this study.
3 METHODOLOGY
To understand the challenges in deploying DL software, we analyze the relevant questions posted on Stack Overflow (SO), where developers seek technological advice about unresolved issues. We show an overview of the methodology of our study in Figure 2.
Figure 2: An overview of the methodology. (Five steps: (1) download Stack Overflow dataset; (2) identify relevant questions; (3) determine popularity trend (RQ1); (4) determine difficulty (RQ2); (5) construct taxonomy of challenges (RQ3).)
Step 1: Download Stack Overflow dataset. In the first step of this study, we download the SO dataset from the official Stack Exchange Data Dump [64] on December 2, 2019. The dataset covers the SO posts generated from July 31, 2008 to December 1, 2019. The metadata of each post includes its identifier, post type (i.e., question or answer), creation date, tags, title, body, identifier of the accepted answer if the post is a question, etc. Each question has one to five tags based on its topics. The developer who posted a question can mark an answer as an accepted answer to indicate that it works for the question. Among all the questions in the dataset (denoted as the set A), 52.33% have an accepted answer.

Step 2: Identify relevant questions. In this study, we select three representative deployment platforms of DL software for study, including server/cloud, mobile, and browser platforms. Since questions related to DL software deployment may be contained in DL related questions, we first identify the questions related to DL. Following previous work [82, 83], we extract questions tagged with at least one of the top five popular DL frameworks (i.e., TF, Keras, PyTorch, Theano, and Caffe) from A and denote the extracted 70,669 questions as the set B. Then we identify the relevant questions for each kind of platform, respectively.

Server/Cloud. We define a vocabulary of words related to server and cloud platforms (i.e., "cloud", "server", and "serving"). Then we perform a case-insensitive search of the three terms within the title and body (excluding code snippets) of questions in B and denote the questions that contain at least one of the terms as the set C. Since questions in C may contain some noise that is not related to deployment (e.g., questions about training DL models on the server), we filter out those that do not contain the word "deploy", after which 279 questions remain in C. To further complement C, we extract questions tagged with TF Serving, Google Cloud ML Engine, and Amazon SageMaker from A. TF Serving is a DL framework that is specifically designed for deploying DL software to servers; Google Cloud ML Engine and Amazon SageMaker [59] are two popular cloud platforms for training DL models and deploying DL software. Since the two platforms are rolled out by two major cloud service vendors, i.e., Google and Amazon, we believe that they are representative. For questions tagged with the two platforms, we filter out those that do not contain the word "deploy", as they also support model training. Then we add the remaining questions, as well as all questions tagged with TF Serving, into C and remove the duplicate questions. Finally, we have 1,325 questions about DL software deployment to server/cloud platforms in the set C.
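The following minimal sketch illustrates this tag- and keyword-based identification of the sets B and C, assuming a hypothetical posts.csv export of the data dump with tags, title, and body columns; it illustrates the filtering logic rather than the exact scripts used in the study.

```python
import re
import pandas as pd

# Hypothetical CSV export of the data dump (questions only).
posts = pd.read_csv("posts.csv").fillna("")

DL_TAGS = {"tensorflow", "keras", "pytorch", "theano", "caffe"}

def has_dl_tag(tags: str) -> bool:
    # Tags are stored as "<tag1><tag2>..."; keep questions tagged with at
    # least one of the five popular DL frameworks (the set B).
    return any(t in DL_TAGS for t in re.findall(r"<([^>]+)>", tags))

set_b = posts[posts["tags"].apply(has_dl_tag)]

# Case-insensitive search for server/cloud terms in title and body (the set C);
# for simplicity, this sketch does not strip code snippets from the body.
text = set_b["title"] + " " + set_b["body"]
set_c = set_b[text.str.contains("cloud|server|serving", case=False, regex=True)]

# Filter out noise unrelated to deployment (e.g., training on a server).
set_c = set_c[(set_c["title"] + " " + set_c["body"]).str.contains("deploy", case=False)]
```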
Mobile. We define a vocabulary of words related to mobile devices (i.e., "mobile", "android", and "ios") and extract the questions that contain at least one of the three words from B in a case-insensitive way. We denote the extracted 486 questions as the question set D. Then, following previous work [81], we also consider two DL frameworks specifically designed for DL software deployment to mobile platforms (i.e., TF Lite and Core ML). We extract the questions tagged with the two frameworks from A and then add them into D. Finally, we remove the duplicate questions and have 1,533 questions about DL software deployment to mobile devices in the set D.
Browser. We extract the questions that contain the word "browser" from B in a case-insensitive way and denote the extracted 89 questions as the set E. In addition, following previous work [81], we also take TF.js, which can be used for deploying DL models on browsers, into consideration. Different from TF Lite, which only supports deployment, TF.js also supports developing DL models. However, since DL on browsers is still at dawn [88], questions tagged with TF.js in A are few: only 535. If we employed the strict keyword matching method to filter out questions that do not contain "deploy" as above, only 10 of the 535 questions would remain. To keep as many relevant questions as possible, we employ manual inspection here instead of keyword matching. Specifically, we add all the 535 questions into E and exclude the duplicate questions. Then two authors examine the remaining 576 questions independently and determine whether or not each question is about DL software deployment. The inter-rater agreement measured as Cohen's Kappa (κ) [77] is 0.894, which indicates almost perfect agreement. The conflicts are then resolved through discussion, and the questions considered as non-deployment issues are excluded from E. Finally, we have 165 questions about DL software deployment to browsers in the set E.

Step 3: Determine popularity trend. To illustrate the popularity trend of DL software deployment, following previous work [72], we calculate the number of users and questions related to the topic per year. Specifically, the metrics are calculated based on the question sets C, D, and E, for each of the past five years (i.e., from 2015 to 2019). Step 3 answers the research question RQ1.

Step 4: Determine difficulty. We measure the difficulty of deploying DL software using two metrics widely adopted by previous work [70, 72, 73]: the percentage of questions with no accepted answer ("%no acc.") and the response time needed to receive an accepted answer. In this step, we use the questions related to other aspects of DL software (in short, non-deployment questions) as the baseline for comparison. To this end, we exclude the deployment related questions (i.e., questions in C, D, and E) from the DL related questions (i.e., questions in B), and use the remaining questions as the non-deployment questions. For the first metric, a proportion test [91] is employed to ensure the statistical significance of the comparison. For the second metric, we select the questions that have received accepted answers and then show the distribution and the median value of the response time needed to receive an accepted answer for both deployment and non-deployment questions. Step 4 answers the research question RQ2.

Step 5: Construct taxonomy of challenges. In this step, we manually analyze the questions related to DL software deployment, in order to construct the taxonomy of challenges. Following previous work [103], to ensure a 95% confidence level and a 5% confidence interval, we randomly sample 297 server/cloud related questions from C and 307 mobile related questions from D. Since there are not many browser related questions in E, we use all of its 165 questions for manual analysis. In total, we get a dataset of 769 questions that are used for taxonomy construction. The size of this dataset is comparable to, and even larger than, those used in existing studies [69, 75, 103, 105] that also require manual analysis of SO posts. Next, we present our procedures of taxonomy construction.
Pilot construction. First, we randomly sample 30% of the 769 questions for a pilot construction of the taxonomy. The taxonomy for each kind of platform is constructed individually based on its corresponding samples. We follow an open coding procedure [94] to inductively create the categories and sub-categories of our taxonomy in a bottom-up way by analyzing the sampled questions. The first two authors, who both have four years of DL experience, jointly participate in the pilot construction. The detailed procedure is described below.

They read and reread all the questions, in order to become familiar with them. In this process, they take all the elements of each question, including title, body, code snippets, comments, answers, tags, and even URLs mentioned by questioners and answerers, for careful inspection. Questions not related to DL software deployment are classified as False positives. For a relevant question, if the authors cannot identify the specific challenge behind it, they mark it as an Unclear question; neither False positives nor Unclear questions are included in the taxonomy. For the remaining questions, the authors assign short phrases as initial codes to indicate the challenges behind them.
Specifically, for those that are asked without attempts (mainly in the form of "how", e.g., "how to process raw data in tf-serving" [5]), the authors can often clearly identify the challenges from the question descriptions; for those that describe the faults or unexpected results developers encountered in practice, the authors identify their causes as the challenges. For example, if a developer reported an error that she encountered when making predictions, and the authors can find from the question descriptions, comments, or answers that the cause is the wrong format of input data, they consider setting the format of input data correctly as the challenge behind this question.

Then the authors proceed to group similar codes into categories and create a hierarchical taxonomy of challenges. The grouping process is iterative, in which they continuously go back and forth between categories and questions to refine the taxonomy. A question is assigned to all related categories if it is related to multiple challenges. All conflicts are discussed and resolved by introducing three arbitrators. The arbitrator for server/cloud deployment is a practitioner who has four years of experience in deploying DL software to server/cloud platforms. The arbitrators for mobile and browser deployment are both graduate students who have two years of experience in deploying DL software to mobile devices and browsers, respectively. Both of them have published papers related to DL software deployment in top-tier conferences. The arbitrators finally approve all categories in the taxonomy.
Reliability analysis and extended construction.
Based on the coding schema in the pilot construction, the first two authors then independently label the remaining 70% of questions for reliability analysis. Each question is labeled as a False positive, an Unclear question, or one of the identified leaf categories in the taxonomy. Questions that cannot be classified into the current taxonomy are added into a new category named Pending. The inter-rater agreement during the independent labeling is 0.816 measured by Cohen's Kappa (κ), which indicates almost perfect agreement and demonstrates the reliability of our coding schema and procedure (a sketch of this agreement computation is shown below). The labeling conflicts are then discussed and resolved by the aforementioned three arbitrators. For the questions classified as Pending, we also employ the arbitrators to help us further identify the challenges behind them and determine whether new categories need to be added. Finally, 8 new leaf categories are added and all questions in Pending are assigned into the taxonomy.

In summary, among the 769 sampled questions, 58 are marked as False positives and 130 as Unclear questions. In addition, two questions are assigned into two categories. The remaining 583 samples (i.e., 227 for server/cloud deployment, 231 for mobile deployment, and 125 for browser deployment) are all covered in the final taxonomy. The entire manual construction process takes about 450 man-hours. Step 5 answers the research question RQ3.
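For illustration, the following minimal sketch computes Cohen's Kappa over two raters' labels using scikit-learn; the labels are hypothetical examples, not the study's actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-question labels assigned independently by two raters.
rater1 = ["Model quantization", "False positive", "Procedure", "Unclear question"]
rater2 = ["Model quantization", "False positive", "Unclear question", "Unclear question"]

kappa = cohen_kappa_score(rater1, rater2)
print(f"Cohen's kappa: {kappa:.3f}")  # > 0.8 conventionally indicates almost perfect agreement
```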
4 RQ1: POPULARITY TREND
Figure 3 shows the popularity trend of deploying DL software in terms of the number of users and questions on SO. The figure indicates that this topic is gaining increasing attention, which demonstrates the timeliness and urgency of this study.

Figure 3: The popularity trend of deploying DL software. (Two panels plot (a) the number of users and (b) the number of questions per year for the server/cloud, mobile, and browser platforms.)

For deploying DL software on server/cloud platforms, we observe that users and questions increase in a steady trend. In 2017, most major vendors rolled out their DL frameworks for mobile devices [99]. As a result, we can observe that both the number of users and the number of questions related to mobile deployment in 2017 increased by more than 300% compared to 2016. For deploying DL software on browsers, questions start to appear in 2018, which can be explained by the release of TF.js in 2018. As found by Ma et al. [88], DL in browsers is still at dawn. Therefore, there are still not many users and questions related to it, as shown in Figure 3.
5 RQ2: DIFFICULTY
For deployment questions and the questions about other aspects of DL software (in short, non-deployment questions), the percentages of questions with no accepted answer are 70.7% and 62.7%, respectively. The significance of this difference is confirmed by a proportion test. Figure 4 compares the distribution of the response time needed to receive an accepted answer for the two kinds of questions.

Figure 4: Time needed to receive an accepted answer. (Box plots of the response time, in minutes, for deployment and non-deployment questions.)

In summary, we find that questions related to DL software deployment are difficult to resolve, which partly confirms the finding in previous work that model deployment is the most challenging phase in the life cycle of machine learning (ML) [72], and motivates us to further identify the specific challenges behind it.
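For illustration, a minimal sketch of such a proportion test with statsmodels; the question counts below are hypothetical placeholders chosen to match the reported percentages, not the study's exact set sizes.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: questions without an accepted answer out of the
# total deployment and non-deployment questions, respectively.
no_accepted = [2137, 42413]   # ~70.7% and ~62.7% of the totals below
totals = [3023, 67646]

stat, pvalue = proportions_ztest(no_accepted, totals)
print(f"z = {stat:.2f}, p = {pvalue:.2e}")  # a small p-value indicates a significant difference
```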
6 RQ3: TAXONOMY OF CHALLENGES
Figure 5 illustrates the hierarchical taxonomy of challenges in DL software deployment. From it, we can observe that developers have difficulty with a broad spectrum of problems. Note that although the identified challenges are about deploying DL software to specific platforms, not all relevant issues occur on the corresponding platforms. For example, to deploy DL software to mobile devices, the model conversion task can be done on PC platforms.

We group the full taxonomy into three sub-taxonomies that correspond to the challenges in deploying DL software to server/cloud, mobile, and browser platforms, respectively. Each sub-taxonomy is then organized into three levels of categories, including the root categories (e.g., Server/Cloud), the inner categories (e.g., Model Export), and the leaf categories (e.g., Model quantization). In total, we have 3 root categories, 25 inner categories, and 72 leaf categories. We show the percentage of questions related to each category in parentheses. Below, we describe and exemplify each inner category.
To avoid duplicate descriptions, we first present the inner categories common to Server/Cloud, Mobile, and Browser.

General Questions. This category shows general challenges that do not involve a specific step in the deployment process and contains several leaf categories, as follows.
Entire procedure of deployment.
This category refers to general questions about the entire procedure of deployment that are mainly asked without practical attempts. They are mainly in the form of "how", like "how can I use that model in android for image classification" [6]. In such questions, developers often complain about the documentation, e.g., "there is no documentation given for this model" [7]. Answerers mainly handle these questions by providing existing tutorials or documentation-like information that does not appear elsewhere, or by translating the jargon-heavy documentation into case-specific guidance phrased in a developer-friendly way. Compared to Server/Cloud (9.7%) and Mobile (13.4%), Browser contains relatively fewer such questions (3.2%). A possible explanation is that since DL in browsers is still in the early stage [88], developers are mainly stuck in its primary usage rather than being eager to explore how to apply it to various scenarios.
Conceptual questions.
This category presents questions about basic concepts or background knowledge related to DL software deployment, like "is there any difference between these Neural Network Classifier and Neural Network in machine learning model type used in iOS" [8]. This category of questions is also observed in previous work that analyzed challenges that developers face through SO questions [70, 73, 95]. For Server/Cloud and Mobile, this category accounts for 4.4% and 4.8% of questions, respectively, which indicates that developers find even the basics of DL software deployment challenging. For Browser, this category is missing. Since TF.js also supports model training, we filter out the conceptual questions about TF.js during manual inspection, as we cannot discern whether these questions occur during training or deployment. However, this does not mean that there are no conceptual problems about browser deployment. We discuss this deficiency in the threats to validity.
Limitations of platforms/frameworks.
This category presents limitations of relevant platforms or DL frameworks. For example, a senior software engineer working on the Google Cloud ML Platform team apologized for the failure that a developer encountered, admitting that the platform did not support batch prediction at the time [9]. Besides, some issues reported bugs in deployment related frameworks. For instance, one issue revealed a bug in the TocoConvert.from_keras_model_file method of TF Lite [10].
Model Export and Model Conversion. Both categories cover challenges in converting the DL models in DL software into the formats supported by deployment platforms. Model export directly saves the trained model into the expected formats, which is a common way of deploying DL models to server/cloud platforms. By comparison, model conversion always needs two steps: (i) saving the trained model into a format supported by the deployment frameworks; (ii) using these frameworks to convert the saved model into the format supported by mobile devices or browsers. Considering the similar functions of model export and model conversion, we describe them together. Model export represents 15.0% of questions in Server/Cloud, while model conversion is the most common challenge in Mobile and the third most common in Browser, accounting for 26.4% and 18.4% of questions, respectively. Below we present representative leaf categories under the two categories.
Procedure.
Different from Entire procedure of deployment, which asks about the entire deployment process, questions in Procedure are about the procedure of a specific step in the process. For example, questions in Procedure under Model Conversion are like "how can I convert this file into a .coreml file" [1]. Due to the page limit, we do not repeat the descriptions of Procedure in the other inner categories.
Export/conversion of unsupported models.
The support for DL on some platforms is still unfledged. Some standard operators and layers used in trained models are not supported by deployment frameworks. For example, developers reported that LSTM is not supported by TF Lite [2] and that GaussianNoise is not supported by TF.js [11]. Similarly, Guo et al. [81] reported that they could not deploy RNN models (i.e., LSTM and GRU) to mobile platforms due to the "unsupported operation" error. In addition, when developers attempt to export or convert models with custom operators or layers, they also encounter difficulties [3, 4].
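One workaround exposed by the public TF Lite converter API is to let unsupported operators fall back to TensorFlow kernels; the sketch below illustrates this option under a hypothetical model path, and is not a fix drawn from the cited posts.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # hypothetical path
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # the built-in TF Lite operators
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TensorFlow ops (at the cost of binary size)
]
tflite_model = converter.convert()
```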
Specification of model information.
When exporting or converting DL models to expected formats, developers always need to specify model information. For instance, TF Serving requires developers to construct a signature that specifies the names of the input and output tensors and the method of inference (i.e., regression, prediction, or classification) [12]. An incorrect specification results in errors [13]. Sometimes, developers directly deploy off-the-shelf models that have been well trained and released online, but they have no idea about the models' information (e.g., the names of the input and output tensors [14]), which also makes the model export/conversion task challenging.
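For illustration, a minimal sketch of exporting a model with an explicit serving signature for TF Serving; the model, tensor names, and export path are hypothetical.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])

# The signature fixes the input/output tensor names that clients must use.
@tf.function(input_signature=[tf.TensorSpec([None, 784], tf.float32, name="image")])
def serve(image):
    return {"scores": model(image)}

tf.saved_model.save(model, "/tmp/model/1", signatures={"serving_default": serve})
```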
Figure 5: Taxonomy of challenges in deploying DL software. (The figure shows the full three-level hierarchy: the root categories Server/Cloud, Mobile, and Browser; inner categories such as General Questions, Model Export, Model Conversion, Data Processing, Environment, Request, Serving, Model Update, Model Security, Data Extraction, Inference Speed, DL Library Compilation, DL Integration into Projects, and Model Loading; their leaf categories; and the percentage of questions in each category.)
Selection/usage of APIs.
There are many APIs provided by different frameworks for developers to export and convert models to various formats. Therefore, it is challenging for developers to select and use these APIs correctly according to their demand. For example, a developer was confused about the "relationship between tensorflow saver, exporter and save model" [15] and said frankly that she felt more confused after reading some tutorials. What's more, the addition, deprecation, and upgrade of APIs caused by the update of frameworks also make the selection and usage of APIs error-prone [16].
Model quantization.
Model quantization reduces the precision of model weight representations, in order to reduce the memory cost and computing overhead of DL models [17]. It is mainly used for deployment to mobile devices, due to their limited computing power, memory size, and energy capacity. For this technique, developers have difficulty configuring the relevant parameters [18]. In addition, developers call for support of more quantization options. For instance, TF Lite supports only 8-bit quantization (i.e., converting model weights from floating points to 8-bit integers), but developers may need other bit widths for quantization [19].
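For illustration, a minimal sketch of post-training quantization with the TF Lite converter (hypothetical paths):

```python
import tensorflow as tf

# Post-training quantization: the DEFAULT optimization quantizes model
# weights from 32-bit floats to 8-bit integers.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```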
Data Processing. This category covers challenges in converting raw data into the input format needed by the DL models in DL software (i.e., pre-processing) and converting the model output into expected formats (i.e., post-processing). It accounts for the most questions (i.e., 19.8%) in Server/Cloud. For Mobile and Browser, it represents 16.9% and 18.4% of questions, respectively. Below we describe the representative leaf categories under Data Processing.

Setting size/shape/format/datatype of input data.
It is a common challenge in data pre-processing to set the size/shape and format/datatype of data. A faulty behavior manifests when the input data have an unexpected size/shape (e.g., a 224×224 image instead of a 227×227 image [20]), format (e.g., encoding an image in the Base64 format instead of converting it to a list [21]), or datatype (e.g., float instead of int [22]).
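A minimal sketch of such pre-processing, assuming a hypothetical model that expects a 227×227 float32 RGB image with a leading batch dimension:

```python
import numpy as np
from PIL import Image

# Resize the raw image to the expected spatial size, cast to the expected
# datatype, and add the batch dimension the model requires.
image = Image.open("photo.jpg").convert("RGB").resize((227, 227))
batch = np.expand_dims(np.asarray(image, dtype=np.float32), axis=0)
print(batch.shape, batch.dtype)  # (1, 227, 227, 3) float32
```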
Migrating pre-processing.
When developing ML/DL models, data pre-processing is often considered an individual phase [72] and thus may not be included inside the model architecture. In this case, code for data pre-processing needs to be migrated during the deployment process, so as to keep the behavior of the software consistent before and after deployment. For instance, when developers deploy to an Android device a DL application whose pre-processing is implemented in Python and outside the DL models, they may need to re-implement the pre-processing in a new language (e.g., Java or C/C++). Forgetting to re-implement it [23] or re-implementing it incorrectly [24] can lead to faulty behaviors. In addition, an alternative way to keep data pre-processing consistent is to add it into the architecture of DL models. For this option, developers face challenges like "how to add layers before the input layer of model restored from a .pb file [...] to decode jpeg encoded strings and feed the result into the current input tensor" [25].
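For illustration, the following minimal sketch folds normalization into the model graph so that it need not be re-implemented on the target platform; the stand-in model and normalization step are hypothetical.

```python
import tensorflow as tf

base_model = tf.keras.applications.MobileNetV2(weights=None)  # stand-in trained model

# Wrap the model so that raw uint8 images are normalized inside the graph,
# instead of in separate Python pre-processing code.
inputs = tf.keras.Input(shape=(224, 224, 3), dtype=tf.uint8)
x = tf.cast(inputs, tf.float32) / 255.0
outputs = base_model(x)
deployable_model = tf.keras.Model(inputs, outputs)
```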
Parsing output.
This category includes challenges in converting the output of DL models to expected or human-readable results, such as parsing the output array [26] or tensor [27] to get the actual predicted class.
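A minimal sketch of this kind of post-processing (hypothetical labels and scores):

```python
import numpy as np

labels = ["cat", "dog", "horse"]    # hypothetical class labels
scores = np.array([0.1, 0.7, 0.2])  # hypothetical model output for one input

# Map the raw score vector to a human-readable prediction.
predicted = labels[int(np.argmax(scores))]
print(predicted)  # -> "dog"
```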
Model Update. Once DL software is deployed for real usage, it can receive feedback (e.g., bad cases) from users. The feedback can be used to update the weights of the models in DL software for further performance improvement. Many challenges, such as periodically automated model update on clouds [28] and model update (or re-training) on mobile devices [29], emerge from the efforts to achieve this goal. This category represents 2.6%, 3.0%, and 1.6% of questions in Server/Cloud, Mobile, and Browser, respectively.
Model Security. DL models in DL software are often stored in unencrypted formats, which results in a risk that competitors may disassemble and reuse the models. To avoid this risk and ensure model security, developers attempt multiple approaches, such as obfuscating code [30] or libraries [31]. Any challenges related to model security are included in this category. This category is observed only in Mobile and Browser, since models deployed to these platforms are easier to obtain. By comparison, models deployed on server/cloud platforms are hidden behind API calls.
Data Extraction. To deploy DL software successfully, developers need to consider any stage that may affect the final performance, including data extraction. This category is observed only in Mobile and Browser, accounting for 1.7% and 3.2% of questions, respectively. This indicates the difficulty of extracting data on mobile devices and in browsers.
Inference Speed. Compared to server or cloud platforms, mobile and browser platforms have weaker computing power. As a result, the inference speed of the deployed software has been a challenge on mobile devices (3.9%) and in browsers (7.2%).
Environment.
This category presents challenges in setting up the environment for DL software deployment, and accounts for 19.4% and 19.2% of issues in Server/Cloud and Browser, respectively. For Mobile, environment related issues are mainly distributed in the DL Library Compilation and DL Integration into Projects categories, which will be introduced later. When deploying DL software to servers or clouds, developers need to configure various environment variables, whose diverse options make the configuration task challenging. In addition, for server deployment, developers also need to install or build necessary frameworks such as TF Serving. Problems occurring in this phase are included in Installing/building frameworks. Similarly, when deploying DL software to browsers, some developers have difficulty in Importing libraries, e.g., "I am developing a chrome extension, where I use my trained keras model. For this I need to import a library tensorflow.js. How should I do that" [32]. Besides these, the rapid evolution of DL frameworks makes the version compatibility of frameworks/libraries challenging for developers. For instance, an error reported on SO is caused by the fact that the TF version used to train and save the model is incompatible with the TF Serving version used for deployment [33]. Similarly, Humbatova et al. [82] mentioned that version incompatibility between different libraries and frameworks is one of the main concerns of practitioners in developing DL software.
Request. This category covers challenges in making requests from the client and accounts for 13.7% of questions in Server/Cloud. For Request, developers have difficulty in configuring the request body [34], sending multiple requests at a single time (i.e., batching requests) [35], getting information of serving models via requests [36], etc.
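For illustration, a minimal sketch of a prediction request to a TF Serving REST endpoint; the host, model name, and input layout are hypothetical and must match the exported serving signature.

```python
import json
import requests

payload = {"instances": [[0.1] * 784]}  # one input row in the signature's expected shape

resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=json.dumps(payload),
)
print(resp.json())  # {"predictions": [...]} on success
```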
Serving. This category concerns challenges related to serving DL software on server/cloud platforms and accounts for 13.2% of questions. To make a DL model in DL software servable, developers first need to load it, where problems such as loading time [37] and memory usage [38] may emerge. In addition, many developers encounter difficulties in authenticating the client [39] and parsing the request [40]. Sometimes, developers need to serve multiple different models to provide diverse services or serve different versions of one model at the same time [41], but they find that the implementation is not that easy (accounting for 3.5% of questions). Similarly, Zhang et al. [104] demonstrated that maintaining multiple models is one of the main challenges in server-side DL software deployment and maintenance. Finally, we want to mention a specific configuration problem in this category, i.e., Configuration of batching. To process requests in batches, developers need to configure the relevant parameters manually. We observe this problem in 2.6% of questions, e.g., "I know that the batch.config file needs to be fine-tuned a bunch by hand, and I have messed with it a lot and tuned numbers around, but nothing seems to actually effect runtimes" [42].
DL Library Compilation. This category includes challenges in compiling DL libraries for target mobile devices and represents 7.8% of questions in Mobile. Since Core ML is well supported by iOS, developers can use it directly without installing or building it. For TF Lite, pre-built libraries are officially provided for developers' convenience. However, developers still need to compile TF Lite from source code by themselves in some cases (e.g., when deploying models containing unsupported operators). Since the operators supported by TF Lite are still insufficient to meet developers' demand [43], developers sometimes need to register unsupported operators manually to add them into the run-time library, which may be challenging for developers who are unfamiliar with TF Lite. In addition, for compilation, developers need to configure build command lines and edit configuration files (i.e., Build configuration). Wrong configurations [44] can result in build failures or library incompatibility with target platforms.
DL Integration into Projects. This category presents challenges in integrating DL libraries and models into mobile software projects. It accounts for 21.2% of questions in Mobile. To integrate DL libraries and build projects, developers need to edit build configuration files (i.e., Build configuration), which has been a common challenge (3.9%) for both Android and iOS developers. To integrate DL models into projects, developers face challenges in importing and loading models (4.3%). For example, in an Xcode project for iOS, developers can drag the models into the project navigator, and then Xcode can parse and import the model automatically [45]. However, some developers encountered errors during this process [46, 47]. When it comes to an Android project, the importing process is more complicated. For instance, if developers load a TF Lite model with C++ or Java, they need to set the information (e.g., datatype and size) of the input and output tensors manually (8.2%), but some developers fail in this configuration [48]. What's more, developers have difficulty in thread management (2.2%) when integrating DL models into projects, like "I am building an Android application that has three threads running three different models, would it be possible to still enable inter_op_parallelism_threads and set to 4 for a quad-core device" [49].
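As a minimal sketch of where this tensor information comes from, the Python TF Lite interpreter can report the input/output details that the Java or C++ code must mirror (hypothetical model path):

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Each entry reports name, shape, and dtype, e.g., shape [1, 224, 224, 3]
# with dtype float32; the Android code must configure matching buffers.
print(interpreter.get_input_details())
print(interpreter.get_output_details())
```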
Model Loading.
This category shows challenges in loading DL models in browsers. It is the most common challenge in browser deployment, accounting for 24.0% of questions. For browsers, TF.js provides the tf.loadLayersModel method to support loading models from local storage, HTTP endpoints, and IndexedDB. Among the three ways, we observe that the main challenge lies in loading from local storage (8.0%). In the official documentation of TF.js [50], "local storage" refers to the browser's local storage, which is explained in a hyperlink [51] contained in the documentation as meaning that "the stored data is saved across browser sessions." However, nearly all bad cases in Loading from local storage attempted to load models from local file systems. In fact, tf.loadLayersModel uses the fetch method [52] under the hood. Fetch is used to get a file served by a server and cannot be used directly with local files. To work with local files, developers first need to serve them on a server. What's more, many developers do not have a good grasp of asynchronous loading (5.6%). In one scenario, a developer loaded a DL model in Chrome and then used it to make predictions, but received a "loadedModel.predict is not a function" error because the model had not yet been successfully loaded [53]. Since model loading is an asynchronous process in TF.js, developers need to either use await or .then to wait for the model to be completely loaded before using it for further actions.

Analysis of unclear questions. Although unclear questions are not included in our taxonomy, we also manually examine them to seek insights. All unclear questions have no accepted answers and lack the informative discussions and question descriptions that would help us determine the challenges behind them. Among them, 53% reported unexpected results [54] or errors [55] when making predictions using the deployed models. However, no anomalies occurred at any stage before this phase, making it rather difficult to discover the challenges behind these questions. In fact, various problems can result in errors or unexpected results in this phase. Take server deployment as an example. During the manual inspection, we find that errors occurring when making predictions can be attributed to the improper handling of various challenges, such as version incompatibility between the libraries used for training and deployment [56] (i.e., Environment), wrong specification of model information [57] (i.e., Model Export), and mismatched format of input data [58] (i.e., Data Processing).
7 IMPLICATIONS
Based on the preceding findings, we now discuss our insights and some practical implications for researchers, developers, and DL framework vendors.
Researchers.
As demonstrated in our study, DL software deployment is gaining increasing attention from developers, but they encounter a spectrum of challenges and various unresolved issues. Such findings encourage researchers to develop techniques to help developers meet these deployment challenges. Here, we briefly discuss some potential opportunities for the research communities based on our results. (i) Automated fault localization. In Section 6.7, we find that 53% of unclear questions reported errors when making predictions and that various faults in different phases can result in such errors. This indicates the difficulty of manually localizing the faults and highlights the need for researchers to propose automated fault localization tools for DL software deployment. Similarly, pro-active alerting techniques can be proposed to inform developers about potential errors during the deployment process. However, it should be acknowledged that monitoring and troubleshooting the deployment process is quite difficult, because of the myriad potential problems, including hardware and software failures, misconfigurations, input data, and even simply unrealistic user expectations. Therefore, we encourage researchers to conduct a systematic study to characterize the major types and root causes of faults occurring in deploying DL software before developing the aforementioned automated tools. (ii) Automated configuration.
In our taxonomy, many challenges are related to configuration (e.g., Specification of model information and Configuration of environment variables). This observation motivates researchers to propose automated configuration techniques to simplify some deployment tasks for developers, especially non-experts. In addition, automated configuration checkers can be proposed to detect and diagnose misconfigurations, based on analyzing the configuration logic, requirements, and constraints. (iii) Implications for other communities. Our results reveal some emerging needs of developers, which can provide implications for other research communities, such as systems and AI. For example, some developers call for more quantization options (see Model quantization) in model conversion. Researchers from the AI community could propose more effective and efficient techniques for model quantization, so as to help improve current frameworks. In addition, to update models on mobile devices (see Model Update), systems researchers need to propose effective techniques to support model update (i.e., re-training) on devices with limited computing power.
Developers. (i) Targeted learning of required skills. DL software deployment lies at the intersection of DL and SE. Therefore, it requires developers to have a solid knowledge of both fields, which makes the task quite challenging. Our taxonomy can serve as a checklist for developers with varying backgrounds, motivating them to learn the necessary knowledge before actually deploying DL software. For instance, an Android developer needs to learn the necessary knowledge about DL before deploying DL software to mobile devices. Otherwise, she may fail in the specification of information about the DL models (see Specification of model information) trained by DL developers or data scientists. Similarly, when a DL developer who is not skillful in JavaScript deploys DL models on browsers, she may directly load models from local file systems due to a misunderstanding of "browsers' local storage" (see Section 6.6). (ii) Avoiding common pitfalls. Our study identifies some common pitfalls in DL software deployment. Developers should pay attention to these pitfalls and avoid them accordingly. For instance, when deploying DL software to target platforms, developers should remember to migrate the pre-processing code and pay attention to version compatibility. (iii) Better project management. Our taxonomy presents the distribution of the different categories, indicating which challenges developers encounter more often. In a project that involves DL software deployment, the project manager can use our taxonomy to assign a task in which developers commonly face challenges (e.g., model conversion) to a more knowledgeable developer.
Framework vendors. (i) Improving the usability of documentation. As shown in our results, many developers even have difficulty with the entire procedure of deployment (i.e., how to deploy DL software). For instance, such questions account for 13.4% of questions in mobile deployment. As described, developers often complain about poor documentation in these questions. This reveals that the usability [69] of the relevant documentation should be improved. Specifically, DL framework vendors can provide more detailed documentation and tutorials for developers' reference. In addition, confusing information organization, such as hiding explanations of important concepts behind hyperlinks (see Section 6.6), may result in developers' misuse and thus should be avoided. (ii) Improving the completeness of documentation. The prevalence of the Conceptual questions category suggests that framework vendors should improve the completeness [69, 106] of their documentation, especially considering that DL software deployment requires a wide set of background knowledge and skills. Indeed, basic information that might look clear from the vendors' perspective is not always easy to digest for the users (i.e., the developers) [69]. Vendors should involve the users in the review of the documentation, in order to supplement the necessary explanations of basic knowledge. This might help minimize developers' learning curve and avoid misunderstandings. (iii) Improving the design of APIs. The quality of APIs heavily influences developers' experience and even correlates with the success of the applications that make use of them [97]. Our study reveals some API issues that need the attention of DL framework vendors. For one function, framework vendors may provide similar APIs for various options (see Selection/usage of APIs), which confuses some developers in practice. To mitigate this issue, framework vendors should better distinguish these APIs and clarify their use cases more clearly. (iv) Improving the functions as needed. We observe that many developers suffer from the conversion and export of unsupported models in the deployment process. For instance, in mobile deployment, 6.1% of issues are about this challenge. Since it is impractical for framework vendors to support all possible operators at once, we suggest that they mine SO and GitHub to collect related issues reported by developers and then first support the most urgently needed operators and models.
8 THREATS TO VALIDITY
Construct validity.
Our automated identification of relevant questions is based on pre-selected tags and keyword-matching mechanisms. We mainly follow previous related work to determine the tags. Moreover, all tags that we use are about popular frameworks or platforms, which promises the representativeness of the questions used in this study. However, it is still possible that developers discuss relevant issues in other contexts that we do not capture. In addition, the keyword-matching identification may result in the retrieval of false positives and the loss of posts that do not contain explicit keywords. The false positives are discarded during our manual examination, so they do not affect the precision of our final taxonomy. Compared to the implicit posts, our identified posts with explicit keywords are more representative. Therefore, we believe that the loss of these implicit posts does not invalidate our distilled challenges. What's more, our identification of posts related to browser deployment is based on manual examination, during which some issues are discarded because we cannot discern whether they occur during training or deployment. As a result, categories such as Conceptual questions are missing in Browser. However, this does not mean that there are no basic conceptual problems in browser deployment. The implications derived from the taxonomy and results are general, not specific to server/cloud or mobile platforms.
Internal validity.
The manual analysis in this study presents threats to internal validity. To minimize this threat, two authors are involved in inspecting the cases and finally reach agreement with the help of three experienced arbitrators through discussions. The inter-rater agreement is relatively high, which evidences the reliability of the coding schema and procedure. What's more, we use 30% of the samples for a pilot construction and the remainder for reliability analysis. Although the selection of this threshold is somewhat arbitrary, the samples used in both phases are all examined by two authors, and their classification results are approved by the arbitrators. Therefore, we believe that this threshold selection does not affect the validity of our taxonomy.
External validity.
Similar to previous studies [70, 72, 73, 93, 95, 102, 103], our work uses SO as the only data source to study the challenges that developers encounter. As a result, we may overlook valuable insights from other sources. In future work, we plan to extend our study to diverse data sources and conduct in-depth interviews with researchers and practitioners to further validate our results. However, since SO contains both novices' and experts' posts [105], we believe that our results are still valid.
9 RELATED WORK
In this section, we summarize the relevant studies to position our work within the literature.
Challenges that ML/DL poses for SE.
The rapid development of ML technologies poses new challenges for software developers. To characterize such challenges, Thung et al. [96] collected and analyzed bugs in ML systems to study bug severities, the effort needed to fix bugs, and bug impacts. Alshangiti et al. [72] demonstrated that ML questions are more difficult to answer than other questions on SO and that model deployment is the most challenging phase across all ML phases. In addition, they found that DL related topics are the most popular among the ML related questions. In recent years, several studies have focused on the challenges in DL. By inspecting DL related posts on SO, Zhang et al. [103] found that program crashes, model deployment, and implementation questions are the top three most frequently asked question types. Besides, several studies characterized the faults in software that makes use of DL frameworks. Zhang et al. [105] collected bugs in TF programs from SO and GitHub. By manual examination, they categorized the symptoms and root causes of these bugs and proposed strategies to detect and locate DL bugs. Following this work, Islam et al. [83] and Humbatova et al. [82] extended the scope to bugs in programs written based on the top five popular DL frameworks to present more comprehensive results. Inspired by these pioneering studies, we also aim to investigate the challenges that DL poses for SE. However, different from the existing efforts, this study focuses on the deployment process of DL software.
DL software deployment.
To make DL software truly accessible to users, developers need to deploy it to different platforms according to various application scenarios. A popular way is to deploy DL software to server/cloud platforms, where the DL functionality can then be accessed as services. For this deployment scenario, Cummaudo et al. [78] analyzed the pain-points that developers face when using these services; in other words, they focused on the challenges that occur after the deployment of DL software. Different from this work, our study focuses on the challenges in the deployment process itself. In addition, mobile devices have created great opportunities for DL software. Researchers have built numerous DL software applications on mobile devices [90, 92, 100] and proposed various optimization techniques (e.g., model compression [87, 98] and cloud offloading [85, 101]) for deploying DL software to mobile platforms. To bridge the knowledge gap between research and practice, Xu et al. [99] conducted the first empirical study on large-scale Android apps to demystify how DL techniques are adopted in the wild. Moreover, in recent years, various JavaScript-based DL frameworks have been published to enable DL-powered Web applications in browsers. To investigate what and how well these frameworks can do, Ma et al. [88] selected seven JavaScript-based frameworks and measured their performance gap when running different DL tasks on Chrome. Their findings showed that DL in browsers is still at dawn. Recently, Guo et al. [81] turned their attention to DL software deployment across different platforms and investigated the performance gap when trained DL models are migrated from PC to mobile devices and Web browsers. Their findings unveiled that the deployment still suffers from compatibility and reliability issues. Despite these efforts, the specific challenges in deploying DL software remain under-investigated, and thus our study aims to fill this knowledge gap.
10 CONCLUSION
Based on SO posts related to DL software deployment, we find that this task is becoming increasingly popular among software engineers. By comparison, we further evidence that it is more challenging than other aspects of DL software and even other challenging topics in SE, such as big data and concurrency, which motivates us to identify the specific challenges behind DL software deployment. To this end, we manually inspect 769 sampled SO posts to derive a taxonomy of 72 challenges faced by developers in DL software deployment. Finally, we qualitatively discuss our findings and infer implications for researchers, developers, and DL framework vendors, with the goal of highlighting good practices and interesting research avenues in deploying DL software.
REFERENCES
[68] Martín Abadi, Paul Barham, Jianmin Chen, et al. 2016. TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016. 265–283.
[69] Emad Aghajani, Csaba Nagy, Olga Lucero Vega-Márquez, Mario Linares-Vásquez, Laura Moreno, Gabriele Bavota, and Michele Lanza. 2019. Software documentation issues unveiled. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019. 1199–1210.
[70] Syed Ahmed and Mehdi Bagherzadeh. 2018. What do concurrency developers ask about? A large-scale study using Stack Overflow. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2018. 30:1–30:10.
[71] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, et al. 2016. Theano: a Python framework for fast computation of mathematical expressions. CoRR abs/1605.02688 (2016).
[72] Moayad Alshangiti, Hitesh Sapkota, Pradeep K. Murukannaiah, Xumin Liu, and Qi Yu. 2019. Why is developing machine learning applications challenging? A study on Stack Overflow posts. In Proceedings of the 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2019. 1–11.
[73] Mehdi Bagherzadeh and Raffi Khatchadourian. 2019. Going big: a large-scale study on what big data developers ask. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019. 432–442.
[74] Kartik Bajaj, Karthik Pattabiraman, and Ali Mesbah. 2014. Mining questions asked by Web developers. In Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014. 112–121.
[75] Stefanie Beyer, Christian Macho, Martin Pinzger, and Massimiliano Di Penta. 2018. Automatically classifying posts into question categories on Stack Overflow. In Proceedings of the 26th Conference on Program Comprehension, ICPC 2018. 211–221.
[76] Zhenpeng Chen, Yanbin Cao, Xuan Lu, Qiaozhu Mei, and Xuanzhe Liu. 2019. SEntiMoji: an emoji-powered learning approach for sentiment analysis in software engineering. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019. 841–852.
[77] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960), 37–46.
[78] Alex Cummaudo, Rajesh Vasa, Scott Barnett, John Grundy, and Mohamed Abdelrazek. 2020. Interpreting cloud computer vision pain-points: a mining study of Stack Overflow. In Proceedings of the 42nd International Conference on Software Engineering, ICSE 2020.
[79] Ryosuke Furuta, Naoto Inoue, and Toshihiko Yamasaki. 2019. Fully convolutional network with multi-step reinforcement learning for image processing. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019. 3598–3605.
[80] Jingyue Gao, Xiting Wang, Yasha Wang, Zhao Yang, Junyi Gao, Jiangtao Wang, Wen Tang, and Xing Xie. 2019. CAMP: co-attention memory networks for diagnosis prediction in healthcare. In Proceedings of the 2019 IEEE International Conference on Data Mining, ICDM 2019. 1036–1041.
[81] Qianyu Guo, Sen Chen, Xiaofei Xie, Lei Ma, Qiang Hu, Hongtao Liu, Yang Liu, Jianjun Zhao, and Xiaohong Li. 2019. An empirical study towards characterizing deep learning development and deployment across different frameworks and platforms. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019. 810–822.
[82] Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. 2020. Taxonomy of real faults in deep learning systems. In Proceedings of the 42nd International Conference on Software Engineering, ICSE 2020.
[83] Md Johirul Islam, Giang Nguyen, Rangeet Pan, and Hridesh Rajan. 2019. A comprehensive study on deep learning bug characteristics. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019. 510–520.
[84] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, MM 2014. 675–678.
[85] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor N. Mudge, Jason Mars, and Lingjia Tang. 2017. Neurosurgeon: collaborative intelligence between the cloud and mobile edge. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2017. 615–629.
[86] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. 2019. Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019. 7644–7652.
[87] Sicong Liu, Yingyan Lin, Zimu Zhou, Kaiming Nan, Hui Liu, and Junzhao Du. 2018. On-demand deep model compression for mobile devices: a usage-driven model selection framework. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys 2018. 389–400.
[88] Yun Ma, Dongwei Xiang, Shuyu Zheng, Deyu Tian, and Xuanzhe Liu. 2019. Moving deep learning into Web browser: how far can we go? In Proceedings of the World Wide Web Conference, WWW 2019. 1234–1244.
[89] Brian McMahan and Delip Rao. 2018. Listening to the world improves speech command recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI 2018. 378–385.
[90] Gaurav Mittal, Kaushal B. Yagnik, Mohit Garg, and Narayanan C. Krishnan. 2016. SpotGarbage: smartphone app to detect garbage using deep learning. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp 2016. 940–945.
[91] Robert G. Newcombe. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 8 (1998), 873–890.
[92] Valentin Radu, Nicholas D. Lane, Sourav Bhattacharya, Cecilia Mascolo, Mahesh K. Marina, and Fahim Kawsar. 2016. Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, UbiComp Adjunct 2016. 185–188.
[93] Christoffer Rosen and Emad Shihab. 2016. What are mobile developers asking about? A large scale study using Stack Overflow. Empirical Software Engineering 21, 3 (2016), 1192–1223.
[94] Carolyn B. Seaman. 1999. Qualitative methods in empirical studies of software engineering. IEEE Transactions on Software Engineering 25, 4 (1999), 557–572.
[95] Mohammad Tahaei, Kami Vaniea, and Naomi Saphra. 2020. Understanding privacy-related questions on Stack Overflow. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI 2020.
[96] Ferdian Thung, Shaowei Wang, David Lo, and Lingxiao Jiang. 2012. An empirical study of bugs in machine learning systems. In Proceedings of the 23rd IEEE International Symposium on Software Reliability Engineering, ISSRE 2012. 271–280.
[97] Mario Linares Vásquez, Gabriele Bavota, Carlos Bernal-Cárdenas, Massimiliano Di Penta, Rocco Oliveto, and Denys Poshyvanyk. 2013. API change and fault proneness: a threat to the success of Android apps. In Proceedings of the Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2013. 477–487.
[98] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. 2016. Quantized convolutional neural networks for mobile devices. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. 4820–4828.
[99] Mengwei Xu, Jiawei Liu, Yuanqiang Liu, Felix Xiaozhu Lin, Yunxin Liu, and Xuanzhe Liu. 2019. A first look at deep learning apps on smartphones. In Proceedings of the World Wide Web Conference, WWW 2019. 2125–2136.
[100] Mengwei Xu, Feng Qian, Qiaozhu Mei, Kang Huang, and Xuanzhe Liu. 2018. DeepType: on-device deep learning for input personalization service with minimal privacy concern. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, IMWUT 2, 4 (2018), 197:1–197:26.
[101] Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, and Xuanzhe Liu. 2018. DeepCache: principled cache for mobile deep vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, MobiCom 2018. 129–144.
[102] Xinli Yang, David Lo, Xin Xia, Zhiyuan Wan, and Jian-Ling Sun. 2016. What security questions do developers ask? A large-scale study of Stack Overflow posts. Journal of Computer Science and Technology 31, 5 (2016), 910–924.
[103] Tianyi Zhang, Cuiyun Gao, Lei Ma, Michael R. Lyu, and Miryung Kim. 2019. An empirical study of common challenges in developing deep learning applications. In Proceedings of the 30th IEEE International Symposium on Software Reliability Engineering, ISSRE 2019.
[104] Xufan Zhang, Yilin Yang, Yang Feng, and Zhenyu Chen. 2019. Software engineering practice in the development of deep learning applications. CoRR (2019).
[105] Yuhao Zhang, Yifan Chen, Shing-Chi Cheung, Yingfei Xiong, and Lu Zhang. 2018. An empirical study on TensorFlow program bugs. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018. 129–140.
[106] Junji Zhi, Vahid Garousi-Yusifoglu, Bo Sun, Golara Garousi, S. M. Shahnewaz, and Günther Ruhe. 2015. Cost, benefits and quality of software development documentation: a systematic mapping. Journal of Systems and Software 99 (2015), 175–198.