Elastic CoCoA: Scaling In to Improve Convergence
Michael Kaufmann†‡, Thomas Parnell†, Kornilios Kourtis†
† IBM Research, Zurich, Switzerland
‡ Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
{kau,tpa,kou}@zurich.ibm.com
Abstract
In this paper we experimentally analyze the convergence behavior of CoCoA and show that the number of workers required to achieve the highest convergence rate at any point in time changes over the course of the training. Based on this observation, we build Chicle, an elastic framework that dynamically adjusts the number of workers based on feedback from the training algorithm, in order to select the number of workers that results in the highest convergence rate. In our evaluation of 6 datasets, we show that Chicle is able to accelerate the time-to-accuracy by a factor of up to 5.96× compared to the best static setting, while being robust enough to find an optimal or near-optimal setting automatically in most cases.

Introduction

As data has become a major source of insight, machine learning (ML) has become a dominant workload in many (public and private) cloud environments. The ever-increasing collection of data further drives the development of efficient algorithms and systems for distributed ML [8, 2], as resource demands often exceed the capacity of single nodes. However, distributed execution, and the usage of cloud resources, pose additional challenges in terms of efficient and flexible resource utilization. Recently, several works have aimed to improve the resource utilization and flexibility of ML applications [3, 6, 11]. In this paper, we focus on Communication-efficient distributed dual Coordinate Ascent (CoCoA) [8], a state-of-the-art framework for efficient, distributed training of generalized linear models (GLMs). CoCoA significantly outperforms other distributed methods, such as mini-batch versions of stochastic gradient descent (SGD) and stochastic dual coordinate ascent (SDCA), by minimizing the amount of communication necessary between training steps.

Our work is motivated by two characteristics of the CoCoA algorithm. First, even assuming perfect scalability and no overheads, increasing the number of workers K does not, in general, reduce the time to reach a solution.
This is because the convergence rate of CoCoA degrades as K increases [4]. Overall, CoCoA execution is split into epochs, and increasing K reduces the execution time of each epoch, but also decreases the per-epoch convergence rate, requiring more epochs to reach a solution. Finding the K that minimizes execution time is not trivial and depends on the dataset.

Second, the number of workers K that minimizes execution time changes as the algorithm progresses. Figures 1a/1b show the convergence rate with K = {1, 2, 4, 8, 16} workers, using the kdda and higgs datasets as examples. We evaluate the convergence rate by plotting the duality-gap, which is given by the distance between the primal and dual formulation of the training objective and has been shown to provide a robust certificate of convergence [1, 8]. Both examples show that for larger values of K, the duality-gap converges faster initially, but slows down earlier than for smaller values of K, thus resulting in smaller values of K leading to a shorter time-to-(high)-accuracy than larger values of K.¹ However, this is not universally true, as Figure 1c shows for the rcv1 dataset, which scales almost perfectly with K.

Based on these observations, we built Chicle, an elastic distributed machine learning framework based on CoCoA, that reduces time-to-accuracy, robustly finds (near-)optimal settings automatically, and optimizes resource usage by exploiting the drifting of the optimal K.

¹ When we refer to the training accuracy we mean that a highly accurate solution to the optimization problem has been found (i.e., a small value of the duality gap), rather than the classification accuracy of the resulting classifier.

Preprint. Work in progress.

Figure 1: Example of the convergence of the duality-gap (a certificate for accuracy) for 3 datasets (a: KDDA, b: Higgs, c: RCV1) using 1 to 16 workers, assuming perfect scaling and zero communication cost.
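Concretely, for a regularized GLM training objective with primal P(w) and Fenchel dual D(α), the duality gap plotted in Figure 1 can be written as follows. The notation here is generic and not the exact formulation of [1, 8]:

```latex
\mathrm{gap}(\alpha)
  \;=\; \mathcal{P}\bigl(w(\alpha)\bigr) - \mathcal{D}(\alpha)
  \;\ge\; \mathcal{P}\bigl(w(\alpha)\bigr) - \mathcal{P}(w^{\star})
  \;\ge\; 0
```

By weak duality, D(α) ≤ P(w⋆) for every dual-feasible α, so the gap upper-bounds the primal suboptimality while being computable during training, which is what makes it a practical convergence certificate.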
Background

CoCoA [8] is a distributed machine learning framework for training GLMs across K workers. The training-data matrix A is partitioned column-wise across all workers and processed by local optimizers that independently apply updates to a shared vector v, which is synchronized periodically. In contrast to the mini-batch approach, local optimizers apply intermediate updates directly to their local version of the shared vector v, thus benefiting from previous updates within the same epoch.

Due to these immediate local updates to v, CoCoA outperforms previous state-of-the-art mini-batch versions of SGD and SDCA. However, for the same reason, it is not trivial to efficiently scale out CoCoA, as increasing the number of workers does not guarantee a decrease in time-to-accuracy, even assuming perfect linear scaling and zero communication cost between epochs. The reason for this counter-intuitive behavior is that, as each local optimizer gets a smaller partition of A, i.e., as it sees a smaller picture of the entire problem, the number of identifiable correlations within each partition decreases as well, leaving more correlations to be identified across partitions, which is slower due to the infrequent synchronization steps.

Moreover, as indicated in the previous section, there is no K for which the convergence rate is maximal at all times. This poses a challenge for the selection of the best K. It is up to the user to decide in advance whether to train quickly to a low accuracy and wait longer to reach a high accuracy, or vice versa. A wrong decision can lead to longer training times and wasted resources as well as money, as resources – at least in cloud offerings – are typically billed by the hour. Ideally, the system would automatically and dynamically select K such that the convergence rate is maximal at any point in time, in order to minimize training time and resource waste. As Figure 1b shows, the convergence rate, i.e.
the slope of the curve starting from the same level of accuracy, differs between different settings for K. For example, as the curve for K = 16 flattens, the curves for K ≤ 8 become relatively steeper until they too, one by one, flatten out. Hence, in order to stay within a region of fast convergence for as long as possible, the system should switch to a smaller K once the curve for the current K starts to flatten. We assume that the convergence rate, when switching from K to K′ < K workers at a certain level of accuracy, will follow a similar trajectory as if the training had reached said level of accuracy starting with K′ workers in the first place. However, the validity of this assumption is not obvious, given that the learned models in both cases are not guaranteed to be identical.

Apart from the algorithmic side, adjusting K also poses very practical challenges on the system side. Every change in K incurs a transfer of potentially several gigabytes of training data between nodes – a task that overwhelms many systems [10, 9, 7], as data (de-)serialization and transfer can be very time consuming. It is therefore crucial that the overhead introduced by the adjustment of K is small, such that a net benefit can be realized.

Chicle

Chicle is a distributed, auto-elastic machine learning system based on the state-of-the-art CoCoA [8] framework that enables efficient ML training with minimized time-to-accuracy and optimized resource usage. The core concept of Chicle is to reduce the number of workers (and therefore training-data partitions) dynamically, starting from a set maximum number, based on feedback from the training algorithm. This is rooted in the observation of a knee in the convergence rate, after which the convergence slows down significantly, and that this knee typically occurs at a lower duality-gap for fewer workers compared to more workers. This can be observed in Figure 1b.
Here, the knee occurs at a larger duality-gap for 16 workers than for 2 workers. The reasoning behind adjusting the number of workers is the assumption that CoCoA can be accelerated if, by reducing the number of workers, it can stay ahead of the knee for as long as possible.

Chicle implements a master/slave design in which a central driver (master) coordinates one or more workers (slaves), each running on a separate node. Driver and workers communicate via a custom remote procedure call (RPC) framework based on remote direct memory access (RDMA) to enable fast data transfer with minimal overhead. The driver is responsible for loading, partitioning and distributing the training data, hence no shared file system is required to store the training data. It partitions the data into P ≥ K partitions for K workers, such that each worker is assigned P/K partitions, with P being the least common multiple of K and all potential scale-in sizes K′ < K. Moreover, the central CoCoA component is implemented as a driver module. The workers implement an SDCA optimizer. Each optimizer instance works on all partitions assigned to a worker, such that it can train with a bigger picture once partitions get reassigned to a smaller set of workers. For each epoch, workers compute the partial primal and dual objective for their assigned partitions, which are sent to the driver where the duality-gap is computed and passed to the scale-in policy module.

Chicle enables efficient adjustment of the number of workers K (and the corresponding number of data partitions per worker process) using a decision policy and an RDMA-based data-copy mechanism. In the context of this paper, Chicle only scales in, i.e., it reduces the number of workers K and redistributes the P partitions across fewer workers.

Scale-in policy.
Our scale-in policy attempts to determine the earliest point in time at which it is beneficial to reduce the number of workers K (i.e., the beginning of the knee) while, at the same time, being robust against occasional outlier (i.e., exceptionally long) epochs. To that end, we use the slope of the duality-gap over time to identify the knee. The policy computes two slopes (see Figure 2): a long-term slope S_l, which considers the convergence of the duality-gap since the last scale-in event, and a short-term slope S_s, which considers only the last N epochs. As soon as S_s × d < S_l, the policy directs the driver process to initiate the scale-in mechanism. Larger values for N and d generally lead to a more robust decision w.r.t. occasional outlier epochs; however, they also increase the decision latency, thus potentially failing to maximize the benefits of an earlier scale-in. Empirically, we have determined that N = 2, with d slightly above 1, works well across all evaluated datasets. Our policy does not determine the optimal factor m of the scale-in, i.e., K → K/m. We use a fixed m of 4, as tests have shown that the difference in convergence rate for smaller m is often very small.

Initially, we attempted to implement the concept of Chicle in Spark. This, however, failed to a large degree due to the very time-consuming (de-)serialization of the training data.

Chicle is the Mexican-Spanish word for latex from the sapodilla tree that is used as a basis for chewing gum.

Figure 2: Schematic of the long-/short-term slope of the duality-gap that we use to identify the knee.

Scale-in mechanism.
We implement a simple, RDMA-based foreground data-copy mechanism to copy data from to-be-removed workers to the remaining workers. As the data transfer occurs in parallel between multiple pairs of workers, we are able to exceed the maximum single-link bandwidth. For a scale-in from K to K/m workers and a single-link bandwidth of r (e.g., 10 Gb/s), we can achieve a total transfer rate of m × r, e.g., 40 Gb/s to scale in from 16 to 4 workers on a 10 Gb/s network. While we do not employ a sophisticated data-partitioning scheme (we simply split the data into equally sized chunks as laid out in the input file), we use an in-memory layout optimized for efficient local access as well as efficient data transfer between workers (see Listing 1). In Chicle, the data for each partition is stored consecutively in the Partition::data array, which eliminates the need for costly serialization. On the receiving side, a simple deserialization step is required only to restore the Example::dp pointer into the Partition::data array for each Example. This data layout, combined with the usage of RDMA, enables us to transfer data at a rate close to the hardware limit.
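As a concrete illustration of this layout, here is a minimal C++ sketch. It is not Chicle's actual Listing 1; everything other than the names Partition::data and Example::dp (the label, num_values, and offsets fields, and the fixup_pointers helper) is our assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One training example: a label plus a pointer into the partition's
// contiguous feature buffer (assumed to hold the example's feature values).
struct Example {
    float    label;       // assumed field
    uint32_t num_values;  // number of feature values (assumed field)
    float*   dp;          // points into Partition::data; rebuilt after transfer
};

// A partition stores all feature data contiguously, so the buffer can be
// sent over RDMA as one blob, without per-example serialization.
struct Partition {
    std::vector<Example>     examples;
    std::vector<float>       data;     // all examples' features, back to back
    std::vector<std::size_t> offsets;  // per-example offset into data (assumed)

    // Receiver-side "deserialization": restore each Example::dp pointer.
    void fixup_pointers() {
        for (std::size_t i = 0; i < examples.size(); ++i)
            examples[i].dp = data.data() + offsets[i];
    }
};
```

Because only the dp pointers need to be recomputed after a transfer, the receiver-side cost is a single linear pass over the examples, independent of the feature data volume.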
Listing 1: In-memory data structures of Chicle

While we have considered an anticipatory background-transfer mechanism, our evaluation (see Table 3) shows that the overhead introduced by our mechanism does not necessitate this.
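The two decisions described in this section can be sketched as follows. This is a simplified model, not Chicle's implementation: num_partitions mirrors the lcm-based partitioning rule, and should_scale_in implements the two-slope trigger using simple secant slopes over the logged per-epoch duality-gap values; the sign convention (gaps decrease, so we compare descent magnitudes) is our reading of the S_s × d < S_l condition.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Number of partitions P: the least common multiple of the initial worker
// count and every potential scale-in size, so partitions can always be
// reassigned evenly (each worker holds P/K partitions).
int num_partitions(const std::vector<int>& worker_counts) {
    int p = 1;
    for (int k : worker_counts) p = std::lcm(p, k);
    return p;
}

// Two-slope scale-in trigger. `gaps` holds one duality-gap value per epoch
// (most recent last) since the last scale-in event. We compare a long-term
// slope (first to last value) against a short-term slope (last N epochs)
// and fire once the short-term descent has flattened by more than factor d.
bool should_scale_in(const std::vector<double>& gaps, int N, double d) {
    if (static_cast<int>(gaps.size()) < N + 1) return false;
    const std::size_t last = gaps.size() - 1;
    const double s_long  = (gaps[last] - gaps[0]) / static_cast<double>(last);
    const double s_short = (gaps[last] - gaps[last - N]) / static_cast<double>(N);
    // Both slopes are negative while the gap shrinks; trigger when the
    // recent descent rate falls well below the long-term descent rate.
    return std::fabs(s_short) * d < std::fabs(s_long);
}
```

For example, with scale-in sizes {16, 4, 2, 1} (the fixed m = 4 from the text), num_partitions returns 16, so every worker count divides P evenly; while the gap is still falling steeply, should_scale_in stays false and only fires once the recent epochs flatten.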
Evaluation

In our evaluation, we attempt to answer the question of how much the CoCoA algorithm can be improved by scaling in the training and thus staying in front of the knee for as long as possible. To answer this question, we compare the time-to-accuracy (duality-gap) of our static CoCoA implementation with our elastic version, using an SVM training algorithm and the 6 datasets shown in Table 1. We evaluate static settings with 1, 2, 4, 8 and 16 workers as well as two elastic settings. In the first elastic setting, we start with 16 workers and scale in to a single worker. This represents cases where the entire dataset fits inside a single node's memory but limited CPU resources make distribution beneficial anyway. In the second elastic setting, we start with 16 workers but scale in to only two workers. This represents cases where a dataset exceeds a single node's memory capacity and therefore has to be distributed. As the convergence behavior for 2+ nodes is similar (see Figure 3), this also indicates how our method works in a larger cluster, e.g., when scaling from 64 to 8 nodes. In all experiments, we use a constant regularizer term λ.

Figure 3: Duality-gap vs. time plots for the evaluated datasets and settings (a: RCV1, b: KDDA, c: Higgs, d: KDD12, e: Webspam, f: Criteo). Circles depict a scale-in from 16 to 4 workers, diamonds a scale-in from 4 to 2 and 1 worker(s), respectively.

Our evaluation shows that the basic concept of Chicle – to adjust the number of workers based on feedback from the training algorithm – has benefits for most evaluated datasets. When scaling down to a single worker, Chicle shows an average speedup of 2× compared to the best static setting, and 2.2× when scaling down to two workers.
While our method does not improve upon all evaluated settings and target accuracies (e.g., the smallest target for KDDA, Webspam and RCV1), the slowdown compared to the respective best static setting is tolerable, and speedups are still achieved compared to non-optimal static settings. It is important to note that the optimal static setting is not necessarily known in advance and may require several test runs to determine. Chicle, on the other hand, is able to find an optimal or near-optimal setting automatically, which shows its robustness.

Dataset    Target 1    Target 2    Target 3
RCV1       1.05 (16)   1.06 (16)   0.98 (8)
KDDA       1.49 (1)    1.12 (1)    0.83 (1)
Higgs      3.21 (4)    3.14 (1)    2.24 (1)
KDD12      2.75 (16)   >3.15       >2.25
Webspam    1.25 (4)    1.43 (2)    0.82 (2)
Criteo     2.82 (4)    3.80 (2)    2.76 (1)
(a) 1-16 workers
Dataset    Target 1    Target 2    Target 3
RCV1       1.31 (16)   1.12 (16)   0.64 (8)
KDDA       >1.28       –           –
Higgs      3.46 (4)    5.96 (16)   >3.63
KDD12      2.57 (16)   >3.12       >2.35
Webspam    1.12 (4)    1.59 (2)    0.77 (2)
Criteo     2.63 (4)    3.23 (2)    >1.08
(b) 2-16 workers
Table 2: Speed-up factor of an elastic vs. the best static setting (the number of workers of the best static setting is given in parentheses) for reaching three successively smaller duality-gap targets. In case no static setting reached the target accuracy within a 10-minute time limit, we provide a minimum speedup factor, and "–" in case neither an elastic nor a static setting reached the target accuracy.

Setting         RCV1    KDDA    Higgs   KDD12   Webspam   Criteo
1-16 workers    0.12 s  0.73 s  0.71 s  5.04 s  2.78 s    4.52 s
2-16 workers    0.06 s  0.39 s  0.38 s  2.78 s  1.53 s    2.18 s
Table 3: Total average scale-in overhead

Finally, we measured data-copy rates and the overhead due to scaling in. Both metrics include the actual data transfer, control-plane overhead and data deserialization. We measured data-transfer rates of up to 5.8 GiB/s (1.4 GiB/s on average) and overheads as shown in Table 3. As the measured times do not constitute a significant overhead on our system, we did not implement background data transfer. For slower networks, such a method could be used to hide data-transfer times behind regular computation.

Related Work

To our knowledge, Chicle is the first elastic CoCoA implementation. Several other elastic ML systems exist, but in contrast to Chicle, they target efficient resource utilization rather than reducing overall execution time. Litz [6] is an elastic ML framework that over-partitions the training data into P = n × K partitions for K physical workers. Elasticity is achieved by increasing or decreasing the number of partitions per node. In contrast to Chicle, Litz does not scale based on feedback from the training algorithm, nor does it improve the per-epoch convergence rate of the training algorithm when doing so, as partitions are always processed independently of each other. SLAQ [11] is a cluster scheduler for ML applications.
Like Chicle, SLAQ uses feedback from ML applications, but instead of optimizing the time to arbitrary accuracy for one application, SLAQ tries to minimize the time to low accuracy for many applications at the same time, by shifting resources from applications with low convergence rates to those with high ones, assuming that resources can be used more effectively there. Proteus [3] enables the execution of ML applications using transient, revocable resources, such as EC2's spot instances, by keeping worker state minimal at the cost of increased communication.

Conclusion

In this paper we have shown experimentally that the optimal number of workers for CoCoA changes over the course of the training. Based on this observation we built Chicle, an elastic ML framework, and have shown that it can outperform static CoCoA for several datasets and settings by a factor of 2–2.2× on average, often while using fewer resources. Future work includes additional ways to dynamically optimize CoCoA in terms of training time and resource usage, as well as related use-cases, e.g., neural networks [5]. Furthermore, we are working towards a theoretical foundation of our observations.

References

[1] C. Dünner, S. Forte, M. Takáč, and M. Jaggi. Primal-dual rates and certificates. arXiv preprint arXiv:1602.05205, 2016.
[2] C. Dünner, T. P. Parnell, D. Sarigiannis, N. Ioannou, A. Anghel, and H. Pozidis. Snap ML: A hierarchical framework for machine learning. CoRR abs/1803.06333, 2018.
[3] A. Harlap, A. Tumanov, A. Chung, G. R. Ganger, and P. B. Gibbons. Proteus: Agile ML elasticity through tiered reliability in dynamic resource markets. In Proceedings of the Twelfth European Conference on Computer Systems, ACM, 2017, pp. 589–604.
[4] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 3068–3076.
[5] T. Lin, S. U. Stich, and M. Jaggi. Don't use large mini-batches, use local SGD. CoRR abs/1808.07217, 2018.
[6] A. Qiao, A. Aghayev, W. Yu, H. Chen, Q. Ho, G. A. Gibson, and E. P. Xing. Litz: An elastic framework for high-performance distributed machine learning.
[7] S. Sikdar, K. Teymourian, and C. Jermaine. An experimental comparison of complex object implementations for big data systems. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC '17), ACM, 2017, pp. 432–444.
[8] V. Smith, S. Forte, C. Ma, M. Takáč, M. I. Jordan, and M. Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. JMLR 18 (2018), pp. 1–49.
[9] P. Stuedi, A. Trivedi, J. Pfefferle, R. Stoica, B. Metzler, N. Ioannou, and I. Koltsidas. Crail: A high-performance I/O architecture for distributed data processing. IEEE Data Eng. Bull. 40(1), 2017, pp. 38–49.
[10] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '10), USENIX Association, 2010.
[11] H. Zhang, L. Stafman, A. Or, and M. J. Freedman. SLAQ: Quality-driven scheduling for distributed machine learning. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC '17), ACM, 2017, pp. 390–404.