Orchestrate: Infrastructure for Enabling Parallelism during Hyperparameter Optimization
Alexandra Johnson, Michael McCourt
SigOpt, San Francisco, CA, USA
{alexandra, mccourt}@sigopt.com
Abstract
Two key factors dominate the development of effective production-grade machine learning models. First, it requires a local software implementation and iteration process. Second, it requires distributed infrastructure to efficiently conduct training and hyperparameter optimization. While modern machine learning frameworks are very effective at the former, practitioners are often left building ad hoc frameworks for the latter. We present SigOpt Orchestrate, a library for such simultaneous training in a cloud environment. We describe the motivating factors and resulting design of this library, feedback from initial testing, and future goals.
1 Introduction

Deep learning models have enjoyed broad adoption [8] because of the development of popular libraries such as MXNet [5], PyTorch [10], and TensorFlow [1]. These libraries have provided an efficient and stable framework in which to develop models.

For these deep learning models to perform well, meta-decisions regarding their architecture and hyperparameters must be made; conducting this model tuning efficiently presents a challenge both in terms of strategy and implementation. Often, the strategy for hyperparameter tuning involves defining some measurement of generalization quality for a given model and then using a black-box optimization strategy to find an optimal configuration [7]. Suitable strategies include grid search [3], random search [2], evolutionary strategies [14], swarm intelligence methods [4], and Bayesian optimization [6, 11].

Each of these strategies requires training a model many times, each time with different hyperparameters or architecture. Training several models in parallel can reduce the wall-clock time required to complete this important step. Running multiple models in a local development environment is likely infeasible, usually because of the specialized hardware required for deep learning models. As such, distributed infrastructure for parallel model training is a necessary component of an efficient model-building pipeline. Deep learning models also require a great deal of high-quality labeled data, but this topic is not discussed in this article.

We present a library, Orchestrate, which seeks to manage the infrastructure complications fundamental to parallel hyperparameter tuning. Orchestrate was designed to work with SigOpt, a cloud-based optimization API for hyperparameter tuning in parallel [9].
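As a concrete illustration of the black-box setting, the sketch below implements generic random search over a hypothetical toy objective; it is an illustration of the strategy in [2], not SigOpt's optimizer or API.

```python
import random

def random_search(objective, space, n_trials, seed=0):
    """Generic random search: sample hyperparameters uniformly from
    `space` (a dict of (low, high) ranges) and keep the best result."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        value = objective(params)  # one full model training + evaluation
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

# Toy stand-in for a validation loss, minimized at lr=0.1, momentum=0.9.
def toy_loss(p):
    return (p["lr"] - 0.1) ** 2 + (p["momentum"] - 0.9) ** 2

space = {"lr": (0.001, 1.0), "momentum": (0.0, 1.0)}
best, value = random_search(toy_loss, space, n_trials=200)
```

Because each call to `objective` is independent, the trials in the loop are exactly the kind of work that can be distributed across machines, which is the parallelism Orchestrate targets.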
The goal of Orchestrate is to provide the necessary infrastructure to coordinate and simultaneously execute multiple hyperparameter configurations suggested by SigOpt; this allows the user to focus on the actual design of the deep learning model rather than the infrastructure used during hyperparameter tuning.

In this paper, we discuss the circumstances which led us to develop Orchestrate and the design decisions made to address these circumstances. We present an internal use case (conducted during alpha testing) and goals for future development.
2 Initial investigation and understanding expectations
Many organizations face the need to develop scalable infrastructure to support tools that have been developed in a local environment. Airflow (https://airflow.apache.org/) was developed at Airbnb to implement and monitor sequences of tasks in a distributed and asynchronous environment. Mesosphere (https://mesosphere.com) provides enterprise solutions around deploying containers to public clouds. Uber has developed Michelangelo (https://eng.uber.com/michelangelo/) to provide internal teams the ability to deploy their machine learning tools at scale.

To inform the development of Orchestrate, we interviewed SigOpt users (and, particularly, deep learning users who evaluate multiple models in parallel) to understand what was needed for Orchestrate to be effective. Below, we summarize the responses and the resulting Orchestrate workflow.

At the highest level, interviewees stated that they want the power to execute multiple hyperparameter optimization experiments simultaneously (to accelerate the optimization process); furthermore, each of these experiments could have drastically different compute times. Within a single experiment, interviewees wanted to be able to evaluate multiple model configurations simultaneously. Even within a single model configuration evaluation, interviewees wanted to distribute their model across multiple GPUs and evaluate multiple cross-validation folds simultaneously. When initially scoping Orchestrate, we had anticipated the desire to support evaluating multiple models in parallel, but, after the interviews, it became clear that for this project to be successful we would have to address parallelism on multiple levels.
Experimental model development may proceed at an erratic pace; interviewees reported needing significant compute resources at inconsistent intervals because of the development/tuning cycle at their company. Additionally, these interviewees were hoping to leverage the elastic nature of cloud computing to have access to the resources they needed, when they needed them.
In combining the desire to limit compute cost with the desire to tune multiple models simultaneously (mentioned in Section 2.1), we realized that the cluster should be able to support heterogeneous compute resources. Both CPU and GPU machines should be available within the same cluster, so that models which do not require GPUs do not incur GPU costs.
In entrusting Orchestrate to manage the infrastructure, interviewees voiced concern about loss of proximity to the actual functioning of the model tuning experiment; in our interviews we learned that they still want to be able to monitor the process despite abstracting away some of the details.
Monitor status
The process of monitoring Orchestrate status seemed to split into two key factors: the status of the cluster and the status of individual experiments on the cluster. Effective monitoring of Orchestrate would require both the ability to answer the question “Is the cluster infrastructure operating as planned?” and the question “How is work being distributed for each of my experiments?”
View Logs
A common fear among interviewees was that, if the infrastructure were managed by Orchestrate, they would lose the ability to easily detect and rectify mistakes in their models. This is especially complicated in situations where their models could behave very differently for different model configurations. The proposed solution was to be able to easily access logs from experiments as they were running (or after they crashed). In particular, interviewees wanted to be able to easily recover all the logs associated with a single experiment, irrespective of what other experiments were running on the cluster or how parallel model configurations were distributed.

Monitor performance

Perhaps most importantly, interviewees wanted to be able to monitor the actual quality of the models as they went through this model tuning process. Because this was already managed through the SigOpt website, it was not considered as a component of Orchestrate.
Interviewees stated that hyperparameter optimization can surface bugs within their models, whether due to their code throwing exceptions or their model’s performance failing to reach a threshold. In both circumstances, interviewees wanted the ability to terminate all execution on their experiment and free up the resources for future work.
The Orchestrate workflow, as guided by these customer discussions, is depicted in Figure 1. Of particular note is that the creation and destruction of the cluster is dissociated from the running of experiments. This allows multiple experiments to be run on a single cluster (rather than tying the existence of a cluster to a single experiment), and it allows model tuning artifacts (such as model-generated logging output) to remain available after the experiment has completed.

Figure 1: The results of our investigation into a workflow to support prospective users of Orchestrate with the parallel components of their hyperparameter optimization.
Because our customers have a variety of modeling backgrounds and goals, we prioritized building a model-agnostic tool. This tool could be installed anywhere on a user’s system and could be used to containerize models that lived anywhere. We also wanted our tool to be language agnostic, so that even though our tool is written in Python, it can be used from any environment, on any kind of model. Our core API commands are listed below; they, respectively, allow a user to create a cluster, start an experiment, monitor experiment status, view experiment logs, delete an experiment, and destroy a cluster.

sigopt cluster create -f cluster_configuration.yml
sigopt run -f experiment_configuration.yml
sigopt status $EXPERIMENT_ID
sigopt logs --follow $EXPERIMENT_ID
sigopt delete $EXPERIMENT_ID
sigopt cluster destroy -n $CLUSTER_NAME

Containerization

To fulfill our goal of allowing customers to take their locally developed models and tune them in the cloud, we needed a mechanism for moving a model (and all its supporting components). We found containers to be a logical solution for managing this process. This process is less brittle than copying individual files and more flexible than attempting to provide customers with pre-defined hard disc images already loaded with standard dependencies. Orchestrate uses Docker, an industry-standard tool for containerization.

One remaining point of contention when using containers is how to move/access the data on which a model should be trained; at present, we advise users to import their data from a cloud source after the container has started. Devising a strategy for reducing data traffic is a priority going forward.

Kubernetes

If containers are the emerging standard for wrapping up code and dependencies, Kubernetes (https://kubernetes.io/) is the emerging standard for running containers.
It is not purpose-built for machine learning (the stated goal of many of our users), but it has many features desired of Orchestrate clusters, such as facilitating communication between machines, starting containers across multiple machines, and managing running containers. Kubernetes provides built-in APIs and abstractions for starting jobs, monitoring infrastructure, viewing container logs, and other important functions. During our development of Orchestrate we relied heavily on the standard Kubernetes paradigms (such as jobs, pods, and containers) and APIs. Using these built-in tools shaved weeks off of our development cycle. Additionally, managed Kubernetes implementations are hosted by the Amazon, Google, and Azure clouds. This allowed us to save development time by relying on a managed Kubernetes implementation without being locked in to one cloud provider.

We chose Amazon Web Services (AWS) as our cloud provider because it was an environment that our early interviewees were familiar with and comfortable using. AWS released their Elastic Kubernetes Service (EKS) shortly after we began scoping this project. To facilitate transfer of containers from local environments to an EKS cluster, we use AWS Elastic Container Registry (ECR). Furthermore, EKS allows users to create clusters with both CPU and GPU machines within the same cluster; this helps support the “multiple experiments, one cluster” goal described in Section 2.2.

EKS came with a few limitations, however. EKS is billed separately from AWS machines, so while the cost of one EKS cluster may be negligible compared to a GPU machine, it is still an additional cost. Additionally, as of this article, AWS limits each account to three EKS instances by default.
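As a rough sketch of how these paradigms fit together, a minimal Kubernetes Job manifest of the sort a tool like Orchestrate could generate for one experiment might look like the following (the job name, image URI, and parallelism values are illustrative assumptions, not Orchestrate's actual output):

```yaml
# Illustrative Kubernetes Job: run two simultaneous model evaluations,
# each in its own pod with one GPU. Names and image URI are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: sigopt-experiment-worker     # hypothetical job name
spec:
  parallelism: 2                     # pods (model evaluations) running at once
  completions: 2                     # total pods to run for this job
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: model
        image: <account>.dkr.ecr.us-east-1.amazonaws.com/model:latest  # pushed via ECR
        resources:
          limits:
            nvidia.com/gpu: 1        # GPUs requested per evaluation
```

Relying on the Job abstraction means the Kubernetes scheduler, rather than custom code, decides which machine runs each evaluation and retries or reports failed pods.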
To avoid the friction of requiring customers to contact AWS support to exceed that limit, we opted to build a workflow for a team to share a single cluster and run multiple experiments on it. We expect to integrate Orchestrate with every other major cloud over time, so that it is fully agnostic to the underlying infrastructure as we progress toward general availability.
AWS EKS simplifies the process of creating a Kubernetes cluster on AWS, but it still requires knowledge of the AWS and Kubernetes APIs. With Orchestrate, we wanted to go one step further in reducing the complication of spinning up a cluster. When the user creates a cluster, they provide us with a short cluster configuration yaml file listing the cluster name, cloud provider (only AWS for now), and desired number and type of GPU and/or CPU machines.

Figure 2: Example cluster configuration yaml file.

Orchestrate relies heavily on SigOpt’s centralized API for distributed hyperparameter optimization to run model tuning experiments with simultaneous model evaluations. SigOpt also informs Orchestrate of the progress of the experiment so that experiments can be run for the desired duration (this information can also be recovered from the CLI to monitor experiment status). Additionally, use of the SigOpt API for experiments allows us to take advantage of the SigOpt web interface to view and share experiments, as shown in Figure 3. SigOpt serves as a system of record for completed experiments. While some experiment artifacts, such as container logs, will be lost once the cluster is destroyed, experiment metadata, including parameters and performance, will exist on SigOpt in perpetuity.

Figure 3: An in-progress Orchestrate experiment, viewed on https://sigopt.com.

For experiments requiring GPUs, users may provide the number of GPUs needed per model in their experiment configuration yaml file. Orchestrate passes this information to Kubernetes for use in creating a job on the cluster, and the Kubernetes scheduler manages resource and capacity limitations in the cluster. The experiment configuration yaml file is where the user defines the experiment structure (the number of different configurations with which they wish to evaluate their model and how many of those evaluations may be run in parallel).
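As a concrete illustration, the two yaml files described above might look roughly like the following sketch (all field names and values are illustrative assumptions, not Orchestrate's exact schema):

```yaml
# cluster_configuration.yml -- illustrative sketch, not the exact schema
cluster_name: tuning-cluster
provider: aws
cpu:
  instance_type: m5.large    # CPU-only machines for cheaper models
  max_nodes: 2
gpu:
  instance_type: p3.2xlarge  # GPU machines in the same cluster
  max_nodes: 4
---
# experiment_configuration.yml -- illustrative sketch, not the exact schema
name: demo-model-tuning
resources_per_model:
  gpus: 1                    # GPUs needed per model evaluation
observation_budget: 300      # total model evaluations in the experiment
parallel_bandwidth: 15       # evaluations run simultaneously
```

Keeping the cluster and experiment definitions in separate files mirrors the workflow in Figure 1, where one long-lived cluster hosts many experiments.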
Because of our focus on running model evaluations in parallel, we have not built features (visualizations, etc.) that could be useful to a practitioner during early model development.
User-provided containers
Because Orchestrate packages a model, dependencies, and Orchestrate-specific code into a Docker container, allowing a user to bring their own model container would run into a technical constraint of running Docker within Docker.
Non-Kubernetes Cluster Management
At present, this tool does not play nicely with in-house clusters that are not Kubernetes based. Specifically, we cannot support clusters running Slurm, a popular workload manager for universities.

High-GPU Models
The largest GPU instance type provided by AWS currently has 8 GPUs (p3.16xlarge); because Orchestrate currently only supports AWS, and we have not tested a model exceeding the constraints of one AWS EC2 instance, a single model configuration cannot be trained on more than 8 GPUs simultaneously.

Figure 4: A split-screen terminal showing two SigOpt Orchestrate CLI commands. On the top is “sigopt status $EXPERIMENT_ID”, and on the bottom is “sigopt logs --follow $EXPERIMENT_ID”. The green and blue colors in the logs represent output from two distinct simultaneous evaluations of the model.
Our initial alpha tester used Orchestrate over the course of several weeks during his development of a convolutional neural network with 3 convolutional layers and 2 fully connected layers. This neural network was trained on the German traffic sign dataset [12], and each model required training on one GPU. During each model tuning experiment with Orchestrate, the model was evaluated 300 times, with fifteen models evaluated simultaneously.

The alpha tester said that, in addition to being happy with the numerical results of the hyperparameter optimization, Orchestrate was “easy to use—I was able to get up and running very fast.” The constructive feedback aligned with some of our earlier thoughts on Orchestrate’s limitations. Specifically, the alpha tester found that Orchestrate was “a useful tool once you’ve defined your model but hard if you want to make incremental changes.” The user also stated that he found the ability to extract logs during and after the run to be “helpful”.
High priority areas for future work include:

• Incorporating information about how much storage or computational resources each model requires.
• Efficient dataset storage on the cluster. Good strategies for managing this have been discussed at the NIPS ML Systems workshop [13].
• Connecting to existing Kubernetes clusters not created by Orchestrate.
• Supporting other cloud providers.
• Meeting new collaborators to understand the needs of specific use cases, e.g., reinforcement learning or natural language processing.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.

[2] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

[3] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.

[4] Christian Blum and Xiaodong Li. Swarm intelligence in optimization. In Swarm Intelligence, pages 43–85. Springer, 2008.

[5] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

[6] Peter Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

[7] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.

[8] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[9] Ruben Martinez-Cantin, Kevin Tee, and Michael McCourt. Practical Bayesian optimization in the presence of outliers. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1722–1731, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.

[10] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Autodiff Workshop at NIPS, 2017.

[11] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

[12] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The German traffic sign recognition benchmark: A multi-class classification competition. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1453–1460. IEEE, 2011.

[13] Nako Sung, Minkyu Kim, Hyunwoo Jo, Youngil Yang, Jingwoong Kim, Leonard Lausen, Youngkwan Kim, Gayoung Lee, Donghyun Kwak, Jung-Woo Ha, et al. NSML: A machine learning platform that enables you to focus on your models. In ML Systems Workshop at NIPS, 2017.

[14] Steven R Young, Derek C Rose, Thomas P Karnowski, Seung-Hwan Lim, and Robert M Patton. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In