Machine Learning Model Development from a Software Engineering Perspective: A Systematic Literature Review
Giuliano Lorenzoni, Paulo Alencar, Nathalia Nascimento, Donald Cowan
Giuliano Lorenzoni
David R. Cheriton School of Computer Science, University of Waterloo (UW)
Waterloo, [email protected]
Paulo Alencar
David R. Cheriton School of Computer Science, University of Waterloo (UW)
Waterloo, [email protected]
Nathalia Nascimento
David R. Cheriton School of Computer Science, University of Waterloo (UW)
Waterloo, [email protected]
Donald Cowan
David R. Cheriton School of Computer Science, University of Waterloo (UW)
Waterloo, [email protected]
Abstract—Data scientists often develop machine learning models to solve a variety of problems in industry and academia, but not without facing several challenges in terms of model development. One problem in machine learning development is that such professionals do not realize that they usually perform ad-hoc practices that could be improved by adopting activities presented in the software engineering development lifecycle. Of course, since machine learning systems are different from traditional software systems, some differences in their respective development processes are to be expected. In this context, this paper is an effort to investigate the challenges and practices that emerge during the development of ML models from the software engineering perspective, focusing on understanding how software developers could benefit from applying or adapting the traditional software engineering process to the machine learning workflow.
Index Terms—Software Engineering; Machine Learning; SE lifecycle; ML workflow; SE process
I. INTRODUCTION
In Software Engineering (SE), researchers and practitioners have spent decades developing tools and methodologies to create, manage, and assemble complex software modules. Software engineering refers to the comprehensive study of engineering for the design, development, and maintenance of software, with the main purpose of developing methods and procedures for software development, for example to scale it up for large systems and to guarantee high-quality software at low production cost. Indeed, SE has progressed beyond expectations to produce significant advances in its methods and processes. In light of these advances, it becomes important to understand how software developers could incorporate or adapt the existing SE process into their Machine Learning (ML) workflows.

In this context, we investigate the challenges and practices that emerge during the development of ML models from the software engineering perspective. By focusing our analysis on the well-known stages of the software engineering development process, we investigate how software developers could benefit from applying or adapting these processes to the ML workflow, since one might argue that data scientists would benefit from adopting classical software engineering disciplines (e.g., systems design, quality assurance, and verification) to build their models properly.

This article is organized in seven sections: (i) Section 2 presents the related work; (ii) Section 3 covers the research method by describing the research questions, the document sources used to select the articles, the search string/strategy, and the inclusion/exclusion criteria; (iii) Section 4 covers the result analysis, including demographics and the answers to each research question; (iv) Sections 5 and 6 present the challenges for future research and the threats to the validity of the results found in this article.
Finally, the conclusions and insights from this research can be found in Section 7.

II. RELATED WORK
Amershi et al. [1] conducted a comprehensive study involving AI professionals at Microsoft. By investigating scientists, researchers, managers, programmers, and other professionals in their respective daily activities, the authors identified the three major challenges in building large-scale AI applications: data management, reuse, and modularity. To address the challenges for ML, the paper analyzes how Microsoft software teams adapt their existing agile processes to incorporate the complexity of machine learning and then build artificial-intelligence-based applications such as text, voice, and video translators or the interactive speaking agents built on speech and language recognition. The authors interviewed company employees to find out how they tackle challenges from a development standpoint, especially in machine learning modeling. The paper contributes to the field by providing insights related to the adoption of software engineering processes in machine learning modeling, such as: (i) the dependency of machine learning modeling on data and the consequent importance of all data-related stages in a machine learning modeling workflow; (ii) the need for teams with skills in both software engineering and machine learning modeling; and (iii) the difficulty of modularization and reuse of machine learning models (which differs from regular software applications), since modules in machine learning have a greater impact on each other. Consequently, by acknowledging that ML applications are structurally different from applications in other domains, the authors were able to build a comprehensive and up-to-date workflow (adopted in our present work as a common reference to understand the ML modeling process) with typical stages that address ML model development. While Amershi et al. [1] identify the main challenges in building large-scale AI applications and use them to build an up-to-date ML workflow, another comprehensive study, by Correia et al.
[2], conducted interviews with data scientists from five different Brazilian companies in order to identify the most challenging stages of the ML workflow proposed by Amershi et al. [1]. The authors found that data scientists point to Data Processing and Feature Engineering as the most challenging stages in the ML workflow, even while they also mention important issues regarding Model Training, Model Evaluation, and Model Deployment. These results indicate the lack of a well-engineered process in ML model development practice.

A study conducted by Zhang et al. [3] investigated 138 research papers in search of methods for testing and debugging ML code. Their findings show that only a few contributions focus on testing interpretability, privacy, or efficiency. Zhang et al. [3] focus exclusively on analyzing the Model Evaluation stage, as opposed to targeting all ML stages.

In contrast to the partial information in works such as Amershi et al. [1], authors such as Hesenius et al. [4] argue that although there are challenges faced by software engineers when developing data-driven applications, the data dependency of ML/AI applications does not constitute an obstacle to the adoption of a common integrated Software Engineering (SE) process, upon which the project's overall success would depend. As a consequence, by defining a set of roles (Software Engineer, Data Scientist, Data Domain Expert, and Domain Expert), stages, and responsibilities to structure the necessary work, decisions, and documents, the authors provide a structured engineering process that suits all data-driven applications, ultimately filling the gap found in the literature.
It is worth mentioning that although the article presents a general framework, in contrast to specific solutions aiming to solve particular and partial issues related to ML/AI modeling, the adoption of the proposed model does not preclude specializations of the aforementioned process, including individual steps and specifically tailored tools for certain data-driven applications.

In a similar fashion, by presenting methods for measuring the degree of adoption of best practices, investigating the relationship between different groups of practices, and assessing/predicting their effects through regression models, Serban's [5] article reaches conclusions that are in line with Hesenius et al. [4], in that there is a set of best practices applicable to the development of any ML application, regardless of the type of data under consideration. Additionally, the author has contributed to the evolution of such practices by presenting a methodology in which each practice is related to its effects and adoption rate.

Washizaki et al. [6] present a related work that clearly complements both Hesenius et al. [4] and Serban et al. [5]. The paper addresses the classification of software engineering design patterns by conducting a systematic study that collects, classifies, and analyzes SE architecture and design patterns, including "bad" patterns, for ML systems, linking traditional software systems and ML systems architecture and design.
One interesting result presented by the article is the understanding that SE patterns for ML systems are divided between two processes: the machine learning pipeline and SE development.

Between these two very distinct approaches, one concerning customized ML workflows related to the specifics of each ML-based system and the other oriented towards a broadly general workflow (suitable for any ML application regardless of its specifics), there are several studies with different approaches to the relation/application of the software engineering development process to ML modeling. Some of them deal with specifics like the adoption of software engineering best practices related to the development of application programming interfaces (Reimann and Kniesel-Wünsch [7]), while others address the accountability gap in ML/AI by proposing a framework based on software development best practices (Hutchinson et al. [8]).

Further, Nascimento et al. [9] pointed out that the differences between traditional systems and machine learning systems can be identified by observing the differences between their respective software development activities. In fact, the authors identified that SE activities are more challenging for ML systems, which follow a specific four-stage software development process, namely: understanding the problem, handling data, building models, and monitoring those models.

Other works focus on well-known stages of the SE lifecycle or the ML workflow by identifying and addressing many different types of gaps, such as the works presented in [10]–[16].

Finally, we also found papers that demonstrate the difficulty of reconciling software engineering development with machine learning modeling due to the fundamental differences between these processes, such as the work of Kim [17].

III. RESEARCH METHOD
To systematize the aforementioned knowledge, we conducted a systematic literature review based on the guidelines described in [18]. Our review protocol includes: (i) the selection of the digital libraries; (ii) the definition and validation of the search string; (iii) the definition of the inclusion/exclusion criteria; and (iv) the application of snowballing. Following this protocol, two researchers performed a parallel search in order to identify studies that address the research questions. Before including the papers in the final result collection, they evaluated and interpreted the papers by discussing their relevance and the possible answers (findings) to the research questions.
A. Research Questions
The main question regards how the adoption of software engineering development processes and practices could address the issues of machine learning modeling. In order to answer it, we designed five research questions:
• RQ1: What are the phases addressed in terms of machine learning model development?
• RQ2: What are the techniques applied in each of the machine learning model development phases?
• RQ3: What are the pros and cons of each machine learning model development technique?
• RQ4: What are the gaps in terms of the model development lifecycle?
• RQ5: What are the trends regarding the techniques applied in the machine learning model development lifecycle?
The main objective of RQ1 is to establish common ground in terms of what phases are addressed in ML model development. By identifying these stages, we can address which techniques are applied in each stage (RQ2), as well as identify different model development techniques and their advantages/disadvantages (RQ3) and eventual gaps in the model development lifecycle of ML models (RQ4). The identification of the latest trends regarding techniques applied in the machine learning model development lifecycle (RQ5) can also provide important insights related to the main question.
B. Document Sources
The procedure to select the sources used for our systematic literature review starts with the choice of well-defined document sources in the field: in this case, we selected IEEExplore and the ACM Digital Library. After defining the document sources, we refined the results by identifying the main venues for publishing research in ML and/or SE, mainly based on the H-index. However, as this is an emerging topic, we also included workshops associated with conferences that are important in the respective communities, such as ICSE workshops.
C. Search Strategy
The search strategy consists in applying a selected search string and filtering out papers based on the inclusion/exclusion criteria. The final result set also includes papers obtained through the snowballing method. Given the novelty of the subject, we adopted a more comprehensive/general search string in order to get different combinations of the selected keywords:
Title:(machine AND learning AND software AND engineering) OR Abstract:(machine AND learning AND software AND engineering)

After the automatic search with this search string, we collected a total of 863 papers: 539 from IEEE and 381 from ACM (with an overlap of 57 papers). First, we reduced the collection of articles by filtering them according to the venue title. Then, we excluded items that satisfy any of the exclusion criteria, such as short papers. Finally, we read the abstract of each paper, evaluating whether it satisfies the inclusion criteria. As summarized in Table I, of the 863 papers selected with the automatic string match, only 53 papers directly contribute to our research questions. Of those 53 papers, we selected the 23 most relevant. We observed that the greater part of the papers was filtered out because they describe solutions of ML for SE, as in the article of Nascimento et al. [19].

TABLE I
SUMMARY OF THE SEARCH RESULTS

Automatic Search (2010-2020): 863    Selected Papers: 53

D. Inclusion and Exclusion Criteria
We consider papers that do not satisfy any of the exclusion criteria and satisfy at least two inclusion criteria. Thus, we excluded:
• Papers written in languages other than English.
• Tutorials, short papers, and editorials, because they do not contain sufficient data for our study.
• Items related to machine learning for software engineering.
We include studies that:
• Were published from January 2010 to June 2020.
• RQ1 and RQ2: Have abstracts or document titles that mention/discuss the adoption of practices/processes/workflows/frameworks from software engineering for machine learning modeling/systems/applications.
• RQ3 and RQ4: Matched the focus of the study (understanding how software teams could benefit from applying/adapting the traditional software engineering process to the machine learning workflow).
Finally, we adopted the snowballing method to refine our results based on the citations and references in the most relevant articles. For example, we included some articles that were cited by Amershi et al. [1] and also some articles that have Amershi et al. [1] as one of their references. Because of the novelty of the research topic, we considered grey literature that has already been cited. This step resulted in the addition of 10 papers.
(Note: IEEExplore uses different reserved words for document title and abstract. The full list of the papers is available at Drive - List of Papers.)

IV. RESULT ANALYSIS
A. RQ1: What are the phases addressed in terms of the machine learning model development?
The main purpose is to adapt or integrate machine learning frameworks into the software development process's stages, namely: requirements, design, implementation, testing, deployment, and maintenance. In this context, although there are some papers, such as Nascimento et al. [9], that have developed machine learning model workflows, the work of Amershi et al. [1] presented the most comprehensive and accepted machine learning workflow, which was mentioned and used in other articles (such as Correia et al. [2]). The stages addressed in terms of machine learning model development were:
• A model requirements stage, which is related to the agreement between stakeholders and the way the model should work.
• A data processing stage, which involves data collection, cleaning, and labelling (in the case of supervised learning).
• A feature engineering stage, which involves the modification of the selected data.
• A model training stage, which is related to the way the selected model is trained and tuned on the (labeled) data.
• A model evaluation stage, which regards the measurements used to evaluate the model.
• A model deployment stage, which includes deploying, monitoring, and maintaining the model.
Nascimento et al. [9] conducted a survey with 6 Brazilian software companies, concluding that ML development in these companies follows a four-stage process (understanding the problems, data handling, model building, and model monitoring).
The work of Banimustafa and Hardy [20] is a practical application of a proposed scientific data mining process model, more specifically in metabolomics.
The model was inspired by software engineering (among other fields), and although the paper proposes specific workflow stages (such as data pre-processing, data exploration, technique selection, knowledge evaluation, deployment, and process evaluation), the authors pointed out that the proposed framework could be generalized in order to perform data mining in other scientific disciplines.

The work of Gotz et al. [14] addresses the challenges that arise when trying to adopt traditional software engineering practices in machine learning modeling, since the authors identified issues regarding the requirements design stage, as well as differences between the lifecycles and workflows of traditional software systems and machine learning models.

The work of Hutchinson et al. [8] focuses on the data processing stage but also approaches other stages of the machine learning workflow, such as requirements, design, implementation, testing, and maintenance. The main goal of the authors is to better understand the processes behind the development of data and to highlight the importance of adopting practices that enable accountability throughout the data development lifecycle.

Singla, Bose, and Naik [21] studied the logs of a machine learning team following the agile methodology and compared them with the logs of a non-machine-learning team, analyzing the trends and their reasons. The authors then provided a few suggestions about how Agile could be better used by machine learning teams and projects.

The work of Kriens and Verleben [22], one of the few we found that was done from a software engineering perspective, also proposes a machine learning workflow based on and inspired by the stages of the software engineering lifecycle. However, the workflow's stages in this work are different from the ones proposed by Amershi et al. [1].

The work by Lo et al. [23] is part of the group of papers that take a software engineering perspective.
Although the authors proposed a cyclical workflow for federated ML (background understanding, requirements analysis, architecture design, implementation and evaluation, and back to background understanding), the article also mentions well-known machine learning stages such as data collection, data pre-processing, feature engineering, model training, and model deployment. It is worth mentioning that the paper also deals with anomaly detection, which is an ML technique. Among the findings, they highlighted that the most discussed phase is model training. They also found that "only a few studies cover data pre-processing, feature engineering, model evaluation, and only Google has discussions about model deployment (e.g., deployment strategies) and model inference. Model monitoring (e.g., dealing with performance degradation), and project management (e.g., model versioning) are not discussed in the existing studies."

In the work of Hesenius et al. [4], the authors argue that developing ML/AI applications is typically a subproject of an overarching development cycle, so feedback loops and connections are needed to integrate all activities. Consequently, they introduce their own proposed workflow for engineering data-driven applications, describe the roles team members take, and finally describe how the different phases are structured, namely: developing and understanding the application domain; creating a target data set; data cleaning and pre-processing; data reduction and projection; choosing the data mining task; choosing the data mining algorithm; data mining; interpreting mined patterns; and consolidating discovered knowledge.

Rahman et al. [24] presented an industrial case study in which they apply machine learning (ML) to automatically detect transaction errors and propose corrections.
The authors identified and discussed the challenges they faced during this collaborative research and development project from three distinct perspectives: software engineering, machine learning, and industry-academia collaboration. In this way, the work addresses the software engineering stages (requirements engineering, design, implementation, integration, testing, deployment), the machine learning development workflow stages (problem formulation, data acquisition, pre-processing, feature extraction, model building, evaluation, integration and deployment, model management, and AI ethics), and the industry-academia collaboration stages (problem understanding, knowledge transfer, focus on objectives, professional practice, and privacy and security). It is also worth mentioning that the authors adopted the Agile approach for research and development.

The work by Reimann and Kniesel-Wünsch [7] compares ML workflows (with stages similar to the ones mentioned by Amershi et al. [1]) against traditional SE workflows in order to address the lack of guidance in what currently used development environments and ML APIs offer developers of ML applications, and contrasts these with software engineering best practices to identify gaps in the current state of the art.

We also found other articles focusing on specific stages of the software engineering lifecycle or the machine learning workflow, such as [13], [15], [10], and [12]. Some of these papers are also addressed in the gap section (RQ4).
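As an illustration, the six-stage workflow identified by Amershi et al. [1] can be sketched as a linear pipeline. This is a minimal, hypothetical sketch: the stage names follow the survey, but the function bodies are placeholders of our own and are not drawn from any surveyed implementation.

```python
# Hypothetical sketch of the six-stage ML workflow discussed above.
# Stage names follow Amershi et al. [1]; all bodies are toy placeholders.

def model_requirements(problem: str) -> dict:
    # Agree with stakeholders on what the model should do and how it is judged.
    return {"problem": problem, "metric": "accuracy", "target": 0.9}

def data_processing(raw: list) -> list:
    # Collect, clean, and (for supervised learning) label the data.
    return [r for r in raw if r is not None]

def feature_engineering(rows: list) -> list:
    # Modify/transform the selected data into model features.
    return [(x, x * x) for x in rows]

def model_training(features: list) -> dict:
    # Train and tune the chosen model on the (labeled) data.
    return {"weights": [sum(f[0] for f in features)]}

def model_evaluation(model: dict, req: dict) -> bool:
    # Measure the model with the agreed metric; placeholder always passes.
    accuracy = 1.0  # stand-in measurement
    return accuracy >= req["target"]

def model_deployment(model: dict) -> str:
    # Deploy, then monitor and maintain the model in production.
    return "deployed"

def ml_workflow(problem: str, raw: list) -> str:
    req = model_requirements(problem)
    clean = data_processing(raw)
    feats = feature_engineering(clean)
    model = model_training(feats)
    assert model_evaluation(model, req), "evaluation gate failed"
    return model_deployment(model)
```

Calling `ml_workflow("churn prediction", [1, None, 2])` walks the stages in order; real pipelines add feedback loops between stages, as several of the surveyed papers note.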
B. RQ2: What are the techniques applied in each of the machine learning model development phases?
Although we did not find any mention of specific techniques regarding the model requirements and model training stages, we were able to collect several insights about the stages of data processing, feature engineering, model evaluation, and model deployment.

According to Correia et al. [2], the only data processing method reported is the use of charts, such as box plots and histograms, to aid the verification of data quality. The main reason for the adoption of such visual tools is to avoid the use of inappropriate data, so that the data scientist can avoid the risk of increasing development costs by having to re-execute data processing if an error is identified in later stages like feature engineering.

Amershi et al. [1] mention that the data processing stage makes use of rigorous data versioning and sharing techniques, since Microsoft teams have found it necessary to blend data management tools with their ML frameworks to avoid the fragmentation of data and model management activities, and also because the authors identified that a fundamental aspect of data management for machine learning is the fast pace at which data sources evolve. Continuous changes in data may arise either from (i) operations initiated by the engineers themselves or from (ii) incoming fresh data (e.g., sensor data, user interactions). In an example of the application of such techniques provided by the authors, each model is tagged with a provenance tag that explains which data it has been trained on and which model version was used, and each dataset is tagged with information about where it originated from and which code version was used to extract it (and any related features). These techniques are used for mapping datasets to deployed models and for facilitating data sharing and reusability.

As for the feature engineering stage, Correia et al.
[2] mentioned that statistical methods in data analysis and the use of automatic feature selectors in feature selection are the two main methods used to perform feature engineering. Although the authors mention that statistical methods were widely used to assist the data analysis process and help data scientists observe data behavior, they did not mention any specific tools. Regarding the use of automated feature selectors, it is important to note the fine distinction between their use with deep learning and their use with other algorithms. In fact, the feature engineering stage is skipped when dealing with deep learning algorithms (since such algorithms automatically learn the best features for problem solving and model training, removing the need for data scientists to do so), whereas with other kinds of algorithms, feature selection is performed manually, with data scientists executing operations like feature scoring to rank features based on relevance.

Regarding model evaluation, Amershi et al. [1] mention that machine-learning-centric software goes through frequent reviews initiated by model changes, parameter tuning, and data updates, and the combination of these has a significant impact on system performance.
In this context, they identified the use of agile techniques to evaluate experiments and the use of combo-flighting techniques (such as flighting a combination of changes and updates), including multiple metrics in experiment score cards and performing human-driven evaluation for more sensitive data categories, in order to develop systematic processes.

Finally, to ensure all aspects run seamlessly during model deployment, the authors recommend the following: (i) automating the training and deployment pipeline; (ii) integrating model building with the rest of the software; (iii) using common versioning repositories for both ML and non-ML codebases; and (iv) tightly coupling the ML and non-ML development sprints and standups.
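The provenance-tagging and pipeline-automation practices described above can be sketched as follows. This is a hypothetical illustration only: the tagging fields, the registry, and the "mean predictor" stand-in for training are our own inventions, not Microsoft's actual tooling.

```python
# Hypothetical sketch of dataset/model provenance tagging: each dataset
# records where it came from and which code version extracted it; each
# trained model records which dataset digest it was built from, so deployed
# models can be mapped back to their training data.
import hashlib

def tag_dataset(rows, origin, code_version):
    # Content digest links a model unambiguously to the exact data snapshot.
    digest = hashlib.sha256(repr(rows).encode()).hexdigest()[:12]
    return {"rows": rows, "origin": origin,
            "code_version": code_version, "digest": digest}

def train_and_tag(dataset, model_version):
    # Placeholder "training": a mean predictor over the dataset.
    mean = sum(dataset["rows"]) / len(dataset["rows"])
    return {"predict": lambda x: mean,
            "model_version": model_version,
            "trained_on": dataset["digest"]}  # provenance link

def deploy(model, registry):
    # Automated deployment step: the registry maps each deployed model
    # version back to the dataset it was trained on.
    registry[model["model_version"]] = model["trained_on"]
    return registry
```

With this shape, answering "which data produced the model running in production?" is a registry lookup rather than archaeology, which is the point of the versioning practices reported in [1].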
C. RQ3: What are the pros and cons of each machine learning model development technique?
For RQ3, the problem of venues addressing the question from a perspective different from our main research purpose was most prominent. In fact, we could only find limited answers, in articles written from the perspective of machine learning techniques applied to software engineering tasks or to the stages of the software engineering model development lifecycle. In other words, we only found limited answers, from articles that were mostly unrelated to our research purposes.

The work of Hesenius et al. [4] discusses different machine learning model techniques, from supervised to unsupervised learning, but without specifying the pros and cons of each one.

Additionally, Nguyen et al. [25] search for the learning paradigms classification (mentioned in the development of ML systems of organizations) by searching with keywords such as supervised learning, unsupervised learning, and reinforcement learning, among others. In the work of Shafiq [26], the author was interested in understanding whether a particular type/technique was consistently employed for a specific lifecycle stage, bearing in mind that the ML technique refers to how the models have been trained (e.g., supervised, semi-supervised, or unsupervised) and how it relates to algorithms such as support vector machines (SVM), random forests (RF), or neural networks (NN). Most of the articles reviewed by the author employed supervised learning, whereas 14 out of 227 articles employed unsupervised learning and 6 out of 227 employed semi-supervised learning. Likewise, 4 out of 227 addressed reinforcement learning and 1 out of 227 focused on analytical (inference-based) learning, while the rest of the articles (40 out of 227) reported none.
Although all ML techniques have certain pros and cons, the selection of the most suitable technique depends on the type of dataset being constructed or employed, and the authors did not provide further information on that.

Finally, an interesting contribution to this research question came from the article of Wang et al. [27], which summarizes the characteristics of each machine learning model development technique by highlighting some of their respective advantages: supervised learning (SL), unsupervised learning (UL), semi-supervised learning (SSL), and reinforcement learning (RL). SSL presents a challenging learning setting, while in SL the training data comprises examples (represented in the form of vectors). UL is often used to discover groups of similar examples, and RL is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward. According to the authors, by using a UL technique, the system's performance may be unstable in comparison with supervised techniques.
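To make the paradigm contrast concrete, the following toy sketch (our own illustration, not drawn from the surveyed articles) learns a split over the same one-dimensional data both ways: the supervised learner uses the labels directly, while the unsupervised one (a tiny two-cluster k-means) must invent groups whose numbering depends on initialization and carries no label meaning, echoing the stability caveat above.

```python
# Toy contrast between supervised and unsupervised learning on 1-D data.
# Purely illustrative; assumes well-separated data and non-empty clusters.

def supervised_threshold(points, labels):
    # Supervised: learn a split from labeled examples, here the midpoint
    # between the largest "0"-labeled value and the smallest "1"-labeled one.
    lo = max(p for p, y in zip(points, labels) if y == 0)
    hi = min(p for p, y in zip(points, labels) if y == 1)
    return (lo + hi) / 2

def kmeans_1d(points, iters=10):
    # Unsupervised: tiny 2-cluster k-means. Which cluster is "0" and which
    # is "1" depends on initialization, so cluster ids are arbitrary.
    c0, c1 = points[0], points[-1]
    for _ in range(iters):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0 = sum(g0) / len(g0)
        c1 = sum(g1) / len(g1)
    return c0, c1
```

On `points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]` with `labels = [0, 0, 0, 1, 1, 1]`, the supervised threshold is determined by the labels, while the k-means centroids recover the same two groups without any label semantics attached.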
D. RQ4: What are the gaps in terms of the model development lifecycle?
We identified some gaps in terms of the model development lifecycle that were mentioned in the literature, since the processes adopted by data scientists in their companies are non-linear, requiring too much rework to satisfy customers' needs (Correia et al. [2]). Additionally, Correia et al. [2] did not find any activities that address the verification and validation of the artifacts generated during the workflow stages. As a consequence, the gaps identified in the development stages were attributed to the particularities identified in machine learning model development. According to the authors, practitioners should anticipate problems and save resources in order to mitigate recurrent feedback loops in the process. Further, they state that the best way to accomplish that is by following software engineering practices starting from the early ML modeling stages, which not only allows companies to reduce rework and dependence on domain experts, but also improves the maintainability of ML models. For instance, they mentioned the importance of developing customized inspection techniques to support the verification of machine learning features and models. This vision is somewhat supported by Amershi et al. [1], as when the authors state that data scientists perceive Data Processing and Feature Engineering as among the more challenging stages of machine learning model development. They describe how ML development lacks the support of a well-engineered process, and that the validation of the ML model is often not done, given the difficulty of testing black-box ML models.

In a similar fashion, we found other works addressing gaps in specific stages of the software engineering development stages and the machine learning workflow. Among these articles we can highlight Meyer [10], Wan et al. [11], Wolf and Paine [12], Foidl and Ferderer [13], Gotz et al. [14], Tsay et al. [15], and Simmons et al.
[16].

We identified that most of the stage-related gaps are connected to the difficulties of adopting the software engineering lifecycle in machine learning modeling. This view is corroborated by the works of Ishikawa and Yoshioka [28], Khom et al. [29], and Kim [17]. By conducting a survey with 278 professionals with proven experience in ML or practical ML applications in Japan, the work of Ishikawa and Yoshioka [28] found that, due to the unique nature of ML-based systems, they would need new approaches in terms of software development processes. Moreover, according to the authors, the attempts to address this have not been enough to eliminate the gaps resulting from this difference. According to the article of Khom et al. [29], the failures and shortcomings professionals and researchers have been experiencing with machine learning systems are due to the fact that the rules of software development do not apply in machine learning modeling, where the rules come from the training data (from which the requirements are generated), representing an additional challenge in terms of model testing and model verification.

Finally, the article written by Kim [17] highlights the difficulty of incorporating software engineering development processes into the machine learning workflow, since data-centric software development, such as machine learning models, is significantly different from traditional software development, mostly regarding testing, debugging, and the probabilistic characteristics of those systems.
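One way the testing gap above is often approached in the broader literature, shown here only as an illustrative sketch (the model and the relation are hypothetical, not taken from the surveyed papers), is metamorphic testing: instead of asserting a "correct" output for a black-box model, which is exactly what is hard to define, one asserts a relation that should hold between outputs for related inputs.

```python
# Illustrative sketch of a metamorphic test for a black-box model.
# Both the model and the chosen relation are toy assumptions.

def classify(features):
    # Black-box stand-in: predicts 1 when the feature mean is positive.
    return 1 if sum(features) / len(features) > 0 else 0

def metamorphic_scale_invariance(model, features, factor=3.0):
    # Relation: scaling every feature by the same positive constant should
    # not change this model's prediction. No ground-truth label is needed;
    # only the two outputs are compared.
    original = model(features)
    scaled = model([f * factor for f in features])
    return original == scaled
```

A test suite built from such relations can exercise a model whose expected outputs are unknown, which is one response to the observation that ML model validation "is often not done".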
E. RQ5: What are the trends regarding the techniques applied in the machine learning model development life cycle?
We have identified two distinct trends regarding the Software Engineering development life cycle applied to Machine Learning model development. The first trend states that the integration of the SE development process into Machine Learning modeling must consider the intrinsic differences between Machine Learning based systems and other applications, such as data dependency. This pattern is clear in the works presented in [1]–[3]. Considering data dependency in machine learning model development then leads to different processes/adaptations and partial solutions.
The second trend, in contrast, defends the development of a single general Machine Learning framework (as opposed to specific solutions destined to solve particular and partial issues related to ML/AI modeling), regardless of the type of data under consideration. According to this trend, the particularities of Machine Learning based systems do not imply the need for different Software Engineering processes. We can consider the works in [4]–[6] to be aligned with this view.
F. Discussion
Most of the few existing approaches applying SE to ML focus broadly on ML workflows [1], [2], [4], [7], [9], [11]–[16], [20], [22]–[24]. However, many studies do not provide details for each stage of the workflow, nor describe the techniques and algorithms that were applied, nor provide an evaluation of their approach by discussing its pros and cons. Because of this lack of detail, there is also a need for specialized stage approaches, focusing on a specific step of the workflow. In particular, there is an evident lack of approaches to support the requirements and maintenance SE development stages. Table II provides a simplified view of our findings related to specific SE tasks.
V. THREATS TO VALIDITY
We identified three potential threats to the validity of our study and its results. First, there may be bias from the articles which provided answers for most of the research questions. Although we have identified a relatively significant number of articles related to software engineering applied to machine learning modeling, only some of them are related to the machine learning workflow, very few of those take the software engineering perspective, and even fewer consider the application of the Software Engineering life cycle to Machine Learning based systems. In order to mitigate this kind of bias, we have worked with a larger number of articles that were not directly related to our subject but that could provide us with insights about our research questions.
Second, a significant part of the articles whose conclusions were significant for this research were based on survey answers, which may be subject to bias, because some of them came from a single company or a small group of companies in the same geographical region.
Finally, with regard to results validation, we have identified only one article which applied robust quantitative measurement in order to determine (validate) the impact/importance of certain practices for the Machine Learning workflow. It is worth mentioning that these measurements were applied to evaluate the impact of practice adoption, not the adoption of methods or the impact of adopting the SE life cycle in Machine Learning modeling.
VI. CHALLENGES FOR FUTURE RESEARCH
The results we have found lead us toward a new set of possibilities in terms of future work. First, we could reconduct our literature review by refining our search strategy in order to compare/check the results and look for new insights. Second, experiments/surveys like those conducted by Amershi et al. [1] and Correia et al. [2] could be replicated with more companies from different geographic locations.
Perhaps the greatest achievement would be to incorporate some robust quantitative measurement, such as the regressions conducted in Serban et al. [5], in order to determine the impact or degree of improvement achieved from the adoption of a Machine Learning workflow based on the stages of the Software Development life cycle.
The study could also be extended to check the impact of Software Engineering processes on systems based on specific Machine Learning algorithms (or on machine learning based systems characterized by the use of a specific type of data), not only to determine the real impact of adopting Software Engineering development life cycle stages in machine learning modeling, but also to decide whether there is a need to adapt the machine learning workflow according to the machine learning algorithm (or type of data). Possibly this may answer the question of whether we need multiple workflows or can use a single, more general workflow suitable for all kinds of machine learning models.
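To make the kind of quantitative measurement we have in mind concrete, the following minimal sketch illustrates an ordinary least squares regression of a reported effect score on a practice-adoption score. The data and variable names are hypothetical, and Serban et al. [5] fit a considerably more elaborate model over real survey responses; this sketch only shows the shape of the analysis, in which a positive slope would indicate that higher practice adoption is associated with a higher reported effect.

```python
# Hypothetical survey rows: (practice adoption score, reported effect score).
# Serban et al. [5] use a more elaborate regression; this is only a sketch.
rows = [(0.2, 1.9), (0.4, 2.6), (0.5, 3.1), (0.7, 3.8), (0.9, 4.7)]

def ols(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = ols(rows)
# The slope estimates the change in reported effect per unit of adoption.
print(f"effect ~ {intercept:.2f} + {slope:.2f} * adoption")
```

A real replication would also report confidence intervals and control for confounders such as company size and team experience, which the simple fit above does not attempt.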
VII. CONCLUSIONS
This systematic literature review promotes a better understanding of how the Software Engineering development life cycle can improve/address the recurrent issues identified in Machine Learning model development. Among the expected contributions, it is important to point out that, although we found some articles proposing Machine Learning workflows that differ from each other, we were able to identify the most comprehensive one, which is aligned with the Software Engineering model development stages.
Another expected result was the highlighting of data dependency as the main characteristic of Machine Learning models. This finding led to the identification of the most challenging stages of the Machine Learning development process: Data Processing and Feature Engineering. We have also identified two distinct trends regarding the results and the research articles examined.
Finally, we concluded that understanding Software Engineering model development practices, and adopting a Machine Learning workflow in accordance with those practices (more specifically, with the Software Engineering life cycle), is a subject of vital importance for the evolution of Machine Learning/Artificial Intelligence and the continuing development of its applications (especially on a large scale), even if further research on the topic is needed.
ACKNOWLEDGMENT
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Ontario Research Fund (ORF) of the Ontario Ministry of Research, Innovation, and Science, and the Centre for Community Mapping (COMAP).
TABLE III
ISSUES AND APPROACHES REPORTED IN SOME OF THE PRIMARY STUDIES

SE process or practices | Issues | Solutions | Primary Studies
Data processing | Lack of methods and tools to support software engineers in performing data validation; lack of tools to support data management | Decision support for data prioritization and rigor; adoption of visual tools; use of rigorous data versioning and sharing techniques; provenance tags for data models | [1], [2], [13]
Documentation and Versioning | Extracting metadata from repositories is difficult | Catalog of ML models to support design and maintenance | [15]
Non-functional Requirements | Security; unassured reliability and lacking transparency | Identify parts of ISO 26262 to be adapted to ML; an approach based on dependability assurances | [30], [31]
Design and Implementation | APIs look and feel like conventional APIs, but abstract away data-driven behavior | Catalog of design patterns for ML development; information to support documentation and design of APIs | [6], [32]
Evaluation | Testing interpretability, privacy, or efficiency of ML | Proposal of new test semantics; tests based on a quality score | [3], [33], [34]
Deployment and Maintenance | Lack of support to adapt based on feedback | An approach to support adaptation based on quality gates | [34]
Software Capability Maturity Model (CMM) | Converting business requirements into data requirements; supporting project management (e.g., estimating the data budget) | A maturity framework for ML based on CMM | [35]

REFERENCES

[1] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann, "Software engineering for machine learning: A case study," 2019, pp. 291–300.
[2] J. L. Correia, J. A. Pereira, R. de Mello, A. Garcia, B. Fonseca, M. Ribeiro, R. Gheyi, W. Tiengo, M. Kalinowski, and R. Cerqueira, "Brazilian data scientists: Revealing their challenges and practices on machine learning model development," in Proceedings of the 19th Brazilian Symposium on Software Quality (SBQS), 2020, pp. 1–10.
[3] T. Zhang, C. Gao, L. Ma, M. Lyu, and M. Kim, "An empirical study of common challenges in developing deep learning applications," IEEE, 2019, pp. 104–115.
[4] M. Hesenius, N. Schwenzfeier, O. Meyer, W. Koop, and V. Gruhn, "Towards a software engineering process for developing data-driven applications," IEEE, 2019, pp. 35–41.
[5] A. Serban, K. van der Blom, H. Hoos, and J. Visser, "Adoption and effects of software engineering best practices in machine learning," in Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2020, pp. 1–12.
[6] H. Washizaki, H. Uchida, F. Khomh, and Y.-G. Guéhéneuc, "Studying software engineering patterns for designing machine learning systems," IEEE, 2019, pp. 49–495.
[7] L. Reimann and G. Kniesel-Wünsche, "Achieving guidance in applied machine learning through software engineering techniques," in Conference Companion of the 4th International Conference on Art, Science, and Engineering of Programming, 2020, pp. 7–12.
[8] B. Hutchinson, A. Smart, A. Hanna, E. Denton, C. Greer, O. Kjartansson, P. Barnes, and M. Mitchell, "Towards accountability for machine learning datasets: Practices from software engineering and infrastructure," arXiv preprint arXiv:2010.13561, 2020.
[9] E. de Souza Nascimento, I. Ahmed, E. Oliveira, M. P. Palheta, I. Steinmacher, and T. Conte, "Understanding development process of machine learning systems: Challenges and solutions," IEEE, 2019, pp. 1–6.
[10] O. Meyer and V. Gruhn, "Towards concept based software engineering for intelligent agents," May 2019, pp. 42–48.
[11] Z. Wan, X. Xia, D. Lo, and G. C. Murphy, "How does machine learning change software development practices?" IEEE Transactions on Software Engineering, pp. 1–1, 2019.
[12] C. T. Wolf and D. Paine, "Sensemaking practices in the everyday work of AI/ML software engineering," in Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, ser. ICSEW '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 86–92. [Online]. Available: https://doi.org/10.1145/3387940.3391496
[13] H. Foidl and M. Felderer, "Risk-based data validation in machine learning-based software systems," in Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, ser. MaLTeSQuE 2019. New York, NY, USA: Association for Computing Machinery, 2019, pp. 13–18. [Online]. Available: https://doi.org/10.1145/3340482.3342743
[14] M. Götz, M. Book, C. Bodenstein, and M. Riedel, "Supporting software engineering practices in the development of data-intensive HPC applications with the JUML framework," in Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational and Data-Enabled Science and Engineering, ser. SE-CoDeSE '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3144763.3144765
[15] J. Tsay, A. Braz, M. Hirzel, A. Shinnar, and T. Mummert, "AIMMX: Artificial intelligence model metadata extractor," in Proceedings of the 17th International Conference on Mining Software Repositories, ser. MSR '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 81–92. [Online]. Available: https://doi.org/10.1145/3379597.3387448
[16] A. J. Simmons, S. Barnett, J. Rivera-Villicana, A. Bajaj, and R. Vasa, "A large-scale comparative analysis of coding standard conformance in open-source data science projects," in Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), ser. ESEM '20. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3382494.3410680
[17] M. Kim, "Software engineering for data analytics," IEEE Software, vol. 37, no. 4, pp. 36–42, July 2020.
[18] S. Keele et al., "Guidelines for performing systematic literature reviews in software engineering," Technical report, Ver. 2.3 EBSE Technical Report. EBSE, Tech. Rep., 2007.
[19] N. Nascimento, P. Alencar, C. Lucena, and D. Cowan, "Toward human-in-the-loop collaboration between software engineers and machine learning algorithms," IEEE, 2018, pp. 3534–3540.
[20] A. Banimustafa and N. Hardy, "A scientific knowledge discovery and data mining process model for metabolomics," IEEE Access, vol. 8, pp. 209964–210005, 2020.
[21] K. Singla, J. Bose, and C. Naik, "Analysis of software engineering for agile machine learning projects," IEEE, 2018, pp. 1–5.
[22] P. Kriens and T. Verbelen, "Software engineering practices for machine learning," arXiv preprint arXiv:1906.10366, 2019.
[23] S. K. Lo, Q. Lu, C. Wang, H. Paik, and L. Zhu, "A systematic literature review on federated machine learning: From a software engineering perspective," arXiv preprint arXiv:2007.11354, 2020.
[24] M. S. Rahman, E. Rivera, F. Khomh, Y.-G. Guéhéneuc, and B. Lehnert, "Machine learning software engineering in practice: An industrial case study," arXiv preprint arXiv:1906.07154, 2019.
[25] E. Nascimento, A. Nguyen-Duc, I. Sundbø, and T. Conte, "Software engineering for artificial intelligence and machine learning software: A systematic literature review," arXiv preprint arXiv:2011.03751, 2020.
[26] S. Shafiq, A. Mashkoor, C. Mayr-Dorn, and A. Egyed, "Machine learning for software engineering: A systematic mapping," 2020.
[27] S. Wang, L. Huang, J. Ge, T. Zhang, H. Feng, M. Li, H. Zhang, and V. Ng, "Synergy between machine/deep learning and software engineering: How far are we?" arXiv preprint arXiv:2008.05515, 2020.
[28] F. Ishikawa and N. Yoshioka, "How do engineers perceive difficulties in engineering of machine-learning systems? Questionnaire survey," in Proceedings of the Joint 7th International Workshop on Conducting Empirical Studies in Industry and 6th International Workshop on Software Engineering Research and Industrial Practice, ser. CESSER-IP '19. IEEE Press, 2019, pp. 2–9. [Online]. Available: https://doi.org/10.1109/CESSER-IP.2019.00009
[29] F. Khomh, B. Adams, J. Cheng, M. Fokaefs, and G. Antoniol, "Software engineering for machine-learning applications: The road ahead," IEEE Software, vol. 35, no. 5, pp. 81–84, Sep. 2018.
[30] J. Henriksson, M. Borg, and C. Englund, "Automotive safety and machine learning: Initial results from a study on how to adapt the ISO 26262 safety standard," in Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, ser. SEFAIS '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 47–49. [Online]. Available: https://doi.org/10.1145/3194085.3194090
[31] M. Scheerer, J. Klamroth, R. Reussner, and B. Beckert, "Towards classes of architectural dependability assurance for machine-learning-based systems," in Proceedings of the IEEE/ACM 15th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, ser. SEAMS '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 31–37. [Online]. Available: https://doi.org/10.1145/3387939.3388613
[32] A. Cummaudo, R. Vasa, S. Barnett, J. Grundy, and M. Abdelrazek, "Interpreting cloud computer vision pain-points: A mining study of Stack Overflow," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 1584–1596. [Online]. Available: https://doi.org/10.1145/3377811.3380404
[33] S. Gerasimou, H. F. Eniser, A. Sen, and A. Cakan, "Importance-driven deep learning system testing," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 702–713. [Online]. Available: https://doi.org/10.1145/3377811.3380391
[34] I. Gerostathopoulos, S. Kugele, C. Segler, T. Bures, and A. Knoll, "Automated trainability evaluation for smart software functions," in Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '19. IEEE Press, 2019, pp. 998–1001. [Online]. Available: https://doi.org/10.1109/ASE.2019.00096
[35] R. Akkiraju, V. Sinha, A. Xu, J. Mahmud, P. Gundecha, Z. Liu, X. Liu, and J. Schumacher, "Characterizing machine learning processes: A maturity framework," in