Smart Build Targets Batching Service at Google
Kaiyuan Wang, Daniel Rall, Greg Tener, Vijay Gullapalli, Xin Huang, Ahmed Gad
Google Inc.
Abstract — Google has a monolithic codebase with tens of millions of build targets. Each build target specifies the information that is needed to build a software artifact or run tests. It is common to execute a subset of build targets at each revision and make sure that the change does not break the codebase. Google's build service system uses Bazel to build targets. Bazel takes as input a build that specifies the execution context, flags and build targets to run. The outputs are the built libraries, binaries or test results. To be able to support developers' daily activities, the build service system runs millions of builds per day.

It is a known issue that a build with many targets could run out of the allocated memory or exceed its execution deadline. This is problematic because it reduces the developer's productivity, e.g. by slowing code submissions or binary releases. In this paper, we propose a technique that predicts the memory usage and executor occupancy of a build. The technique batches a set of targets such that the build created with those targets does not run out of memory or exceed its deadline. This approach significantly reduces the number of builds that run out of memory or exceed their deadlines, hence improving developer productivity.
I. INTRODUCTION
Google has a monolithic codebase and it grows rapidly, with an average of more than 100,000 code submissions per day. To make sure that new changes do not break the existing codebase, Google has adopted Continuous Integration (CI) [7], [11]. For each code change, a CI system first uses the build service system [14] to build the affected libraries and binaries. Then, it runs the affected tests to check if the built libraries and binaries are working as intended.

Google's build service system uses Bazel [1] to build libraries and run tests. Bazel takes as input a set of build specification files that declare build targets. We refer to a build target as a target in the rest of this paper. A target specifies what is needed to produce an artifact, such as a library or binary. A test target specifies what is needed to run tests and check if a code change breaks the codebase. Bazel decides how to build a given target based on the target's specification.

As the codebase grows, each change may affect a lot of targets. For example, a code change on a common library like Guava [4] could affect many Java targets. A C++ compiler change could affect all C++ targets. Moreover, the postsubmit service needs to guarantee that all targets in a given change revision compile and pass all tests. The above use cases are common and require Bazel to build up to millions of targets at once. Running a large number of targets in a build execution may cause out of memory (OOM) errors in Bazel, as it exceeds the memory of a single machine for dependency analysis. Moreover, Google's infrastructure limits the number of executors, e.g. CPU, GPU and TPU, a build can use in parallel at build execution time. So executing too many targets in a build may cause the build to take too long and exceed its invocation deadline, resulting in deadline exceeded (DE) errors.
On the contrary, executing too few targets in a build may use up all the machines allocated to run builds.

In this paper, we propose a build target batching service (BTBS) that partitions a large stream of targets into batches and creates a build for each batch of targets such that those builds do not have OOM or DE errors. The technique relies on a memory estimation model and an executor occupancy estimation model. The memory model predicts the peak memory usage of a build. The occupancy model predicts the average executor occupancy of a build. The technique partitions a large stream of targets into target batches such that the build with the flags and each batch of targets consumes a limited amount of memory or executor occupancy. The results show that BTBS generates few OOM and DE builds, with a 0.08% OOM rate and a 0.05% DE rate, which saves a lot of computational resources used for build failure retries.

The paper makes the following contributions:
• It presents the first technique that effectively creates builds from a large stream of targets with the goal of minimizing the number of OOM and DE errors. It is also the first technique that predicts memory and occupancy usage of build executions.
• It demonstrates that the technique is able to reduce the OOM rate to 0.08% and the DE rate to 0.05%. Our past experience shows that BTBS is critical and improves developer productivity by reducing build failure retries due to OOM or DE errors.

II. BACKGROUND
A. Bazel
Bazel is an open-source build and test tool and is widely used within Google. It is responsible for transforming source code into libraries, executable binaries, and other artifacts. Bazel takes as input a set of flags and targets that programmers declare in build files. It supports a large number of command line options and these options can affect the way Bazel generates outputs.

TABLE I: Example build flags
Category | Example
Error checking | --check_visibility enables checking if all dependent targets are visible to all targets to build.
Tool flags | --copt specifies the arguments to be passed to the C++ compiler.
Build semantics | --cpu specifies the target CPU architecture to be used for the compilation.
Execution strategy | --jobs specifies a limit on the number of concurrently running executors during build execution.
Output selection | --fuseless_output restricts Bazel to generate intermediate output files in memory.
Platform | --platforms specifies the labels of the platform rules describing the target platforms.
Miscellaneous | --use_action_cache enables Bazel's local action cache.

java_library(
    name = "HelloWorld",
    srcs = ["HelloWorld.java"],
)

java_library(
    name = "HelloWorldTest",
    srcs = ["HelloWorldTest.java"],
    deps = [":HelloWorld"],
)

java_test(
    name = "AllTests",
    size = "small",
    tags = ["requires-gpu"],
    deps = [":HelloWorldTest"],
)
Fig. 1: Example Java targets

Fig. 2: Build service architecture

Table I shows different flag categories with examples. For example, the --fuseless_output flag restricts Bazel to generate intermediate output in memory instead of writing it to the disk. The --jobs flag specifies a limit on the number of concurrently running executors during a build execution. So --fuseless_output and --jobs can significantly affect the memory usage and executor occupancy of a build, respectively. Figure 1 shows some example Java targets.
HelloWorld is a Java library target that compiles HelloWorld.java into a library (java_library is the target rule). HelloWorldTest is a Java library target that compiles HelloWorldTest.java into a library, and it depends on the HelloWorld target because HelloWorldTest.java uses HelloWorld.java. AllTests is a Java test target that runs the HelloWorldTest library using JUnit, and the execution requires a GPU. When a programmer issues a command to build a target, Bazel first ensures that the required dependencies of the target are built. Then, it builds the desired target from its sources and dependencies. When a programmer issues a command to run a test target, Bazel will first build all dependencies of the test target and then execute the tests.
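The dependency-first ordering described above can be sketched as a walk over the target dependency graph. This is a minimal illustration of the concept, not Bazel's actual implementation; the dependency map mirrors the targets in Figure 1.

```python
# Sketch of dependency-first build ordering, using the targets of Figure 1.
# This illustrates the concept only; it is not Bazel's real algorithm.
deps = {
    ":HelloWorld": [],
    ":HelloWorldTest": [":HelloWorld"],
    ":AllTests": [":HelloWorldTest"],
}

def build_order(target, deps):
    """Return targets in the order they must be built (dependencies first)."""
    order = []
    def visit(t):
        for d in deps[t]:
            visit(d)
        if t not in order:
            order.append(t)
    visit(target)
    return order

print(build_order(":AllTests", deps))
# → [':HelloWorld', ':HelloWorldTest', ':AllTests']
```

Running a test target thus first builds HelloWorld, then HelloWorldTest, and only then executes AllTests.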
B. Build Service System
BTBS takes as input a set of flags and a stream of targets, and outputs a set of builds that include all targets with the same flags. The output builds are sent to the build service system [14] for execution. Figure 2 shows how BTBS is connected to the build service system. The system diagram is simplified for brevity.

BTBS splits a stream of targets into batches and creates a build for each batch of targets. Those builds are enqueued to the scheduling service. The scheduling service waits until there are available resources and dequeues a build to a Bazel worker for execution. The worker allocates a fixed amount of memory for each Bazel process, and a build runs out of memory if the Bazel process uses more than the allocated memory during execution. Bazel translates the build flags and targets into actions, and sends those actions to the executor cluster for the actual execution. For example, if a test requires a GPU then Bazel will send it to the GPU executors. The executor cluster has a large but limited number of executors and each executor talks to a Bazel process to execute actions. The executor cluster also has an action cache to minimize duplicate work. Each Bazel process is configured to use a limited number of executors concurrently to avoid the case where a very large build uses a lot of executors and blocks other build executions. As a consequence, executing a large build that could use more executors than the limit causes the additional actions to queue until some executors become idle again. This may cause the build to exceed its deadline, and we call these builds Type I DE builds. It is also possible that some actions depend on other actions and they must be executed in sequence. This may cause a build to exceed its deadline if some actions are long running. We call these builds Type II DE builds.

An action uses both memory and executors in the executor cluster. We use an executor service unit (ESU) to unify the expense of both executor memory and CPU, and 1 ESU is equal to 2.5GB of memory or 1 executor. Note that the executor memory usage is different from the Bazel memory usage. We refer to the occupancy usage as the ESU used by a build in the rest of the paper. BTBS reduces Type I DE errors by limiting the occupancy of a build, and we do not consider Type II DE errors because they are not correlated with the build occupancy usage. BTBS assumes that a build with too much ESU usage is more likely to have queuing actions that cause Type I DE errors. The executor cluster reports the executor availability to the quota governor, and the scheduling service makes dequeuing decisions based on the executor availability from the quota governor. During a build execution, the Bazel process reports the progress and result to the build event service. Clients of BTBS can use the build request IDs to query the build event service and find the build progress and status in real time.
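As a concrete reading of the ESU definition above (1 ESU equals 2.5GB of executor memory or 1 executor), an action's cost can be sketched as below. Whether the memory and executor components are summed is an assumption made here for illustration; the function name and example numbers are likewise illustrative, not from the production system.

```python
# Illustrative ESU accounting: 1 ESU = 2.5GB of executor memory or 1 executor.
# Summing the two components is an assumption for illustration only.
ESU_GB = 2.5

def action_esu(executors, memory_gb):
    """Cost of an action in executor service units (ESU)."""
    return executors + memory_gb / ESU_GB

# An action occupying 2 executors and 5GB of executor memory: 2 + 5/2.5 = 4 ESU.
print(action_esu(2, 5.0))  # → 4.0
```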
C. Collateral Damage
Each build may use different executor types. For example, the AllTests Java test target can use both x86 CPU and GPU during execution. The build scheduling service throttles builds based on the availability of each executor type. For example, a build that requires both x86 CPU and GPU can only be dequeued when both x86 CPU executors and GPU executors are available. In other words, a build might be delayed if one of its required executor types is unavailable even if all other required executor types are available. Moreover, the delayed build will reserve the executor resources and block other lower priority builds from dequeuing. This design prevents high priority builds from being starved by lower priority builds that use very few ESU [14]. As a consequence, a build that requires more executor types is more likely to be delayed, and it may block other lower priority builds from dequeuing.

If a delayed build contains targets that require different sets of executor types, then some of the targets could have been dequeued if they were in a separate build. For example, assume that a build contains one target that only uses x86 executors and another target that uses both x86 and GPU executors. The build could be throttled due to insufficient GPU executors, but the target that only uses x86 executors could be dequeued if it is built separately. We call such delays collateral damage.

D. Linear Regression
In machine learning, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (denoted as $y$) and one or more independent variables (denoted as $x$). The most common form of regression analysis is linear regression [12]. It tries to find the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane) that minimizes the sum of squared distances between the true data and that line (or hyperplane).

Given a data set $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^{n}$ of $n$ statistical units, a simple linear regression model has the following form:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i, \quad i = 1, \ldots, n$$

where $T$ denotes the transpose, so that $\mathbf{x}_i^T \boldsymbol{\beta}$ is the inner product between vectors $\mathbf{x}_i$ and $\boldsymbol{\beta}$.

Fitting a linear model to a given data set usually requires estimating the regression coefficients $\boldsymbol{\beta}$ such that the error term $\varepsilon_i = y_i - \mathbf{x}_i^T \boldsymbol{\beta}$ is minimized across all $n$ samples. For example, it is common to use the sum of squared errors $\sum_{i=1}^{n} \varepsilon_i^2$ as the quality of the fit. Linear models can be efficiently trained using stochastic gradient descent [5].

E. Feature Cross
A feature cross is a synthetic feature formed by multiplying (crossing) two or more features [3]. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually. As a consequence, feature crosses can help us learn non-linear functions using linear regression. A well-known example is that the XOR function $f(x, y)$, where $x, y, f(x, y) \in \{0, 1\}$, is not linearly separable, and it cannot be written as $f(x, y) = \alpha x + \beta y + \gamma$ where $\alpha$, $\beta$ and $\gamma$ are real numbers. However, the XOR function can be written as $f(x, y) = x + y - 2xy$, where the $xy$ term is a feature cross for $x$ and $y$.

In practice, machine learning models seldom cross continuous features. However, machine learning models do frequently cross one-hot feature vectors. Feature crosses of one-hot feature vectors are analogous to logical conjunctions. For example, suppose we bin latitude and longitude, producing separate one-hot five-element feature vectors, e.g. [0, 0, 0, 1, 0]. Further assume that we want to predict the gross income of a person using the binned latitude and longitude as features. Creating a feature cross of these two features results in a 25-element one-hot vector (24 zeroes and 1 one). The single 1 in the cross identifies a particular conjunction of latitude and longitude. By feature crossing these two one-hot vectors, the model can form different conjunctions, which will ultimately produce far better results compared to a linear combination of individual features.

III. BUILD TARGET BATCHING SERVICE
A. EnqueueTargets API
BTBS provides a single remote procedure call (RPC) called EnqueueTargets. The RPC is a streaming RPC: it takes as input a sequence of requests and returns to the clients a sequence of responses. The main reason to have a streaming RPC is to avoid needing to stream millions of targets in a single request. We enforce the first EnqueueTargets request to include the execution context, build flags and optionally some targets. The execution context points to either a workspace that contains the unsubmitted code, or an existing code revision. The flags and targets are described in Section II-A. The subsequent requests may only contain the remaining targets. BTBS will create a set of builds with the same execution context and flags but different targets. The created builds include all targets in the requests. Each EnqueueTargets response contains the enqueued build request ID so that the client can use it to query the build status.
B. Group Targets
Once BTBS receives all targets from the client, it first groups targets by the executor types they use. This avoids the collateral damage described in Section II-C. For example, all targets that use only x86 executors will be grouped together. All targets that use both x86 and Mac executors will be grouped together. Given a build target, BTBS determines its required executor types by checking the tags attribute, e.g. the requires-gpu tag in the AllTests target in Figure 1.

Algorithm 1: Batching targets algorithm
Input: List of targets to batch allTargets; max targets per batch maxTargetsPerBatch; memory cutoff value memoryCutoff; occupancy cutoff value occupancyCutoff.
Output: Target batches batches.

  batches = []
  while len(allTargets) > 0 do
    targets = allTargets[:maxTargetsPerBatch]
    targets = limitBatchSizeByCutoff(targets, memoryCutoff, MEMORY_MODEL)
    targets = limitBatchSizeByCutoff(targets, occupancyCutoff, OCCUPANCY_MODEL)
    batches.append(targets)
    allTargets = allTargets[len(targets):]
  return batches

  Function limitBatchSizeByCutoff(targets, cutoff, modelName):
    low = 0, high = len(targets) - 1, cutoffIndex = 0
    while low <= high do
      mid = (low + high) / 2
      estimate = getEstimateFromTargets(targets[:mid + 1], modelName)
      if estimate < cutoff then
        cutoffIndex = mid
        low = mid + 1
      else
        high = mid - 1
    return targets[:cutoffIndex + 1]

If a target does not have a tag, then BTBS determines the required executor type by the target rule. For example, if a target has the rule ios_ui_test, then it uses Mac executors. Once all targets are grouped by their required executor types, we sort each group of targets in lexicographical order. This is a heuristic to increase the probability that the batched targets share similar dependencies and thus allow Bazel to construct tighter dependency graphs and reduce the memory usage. The idea is that developers tend to declare targets for the same project under the same directory. The targets in the same directory are likely to share dependent targets. For example, target //a/b/c:t1 may share more dependencies with target //a/b/c:t2 than with target //d/e/f:t3. The assumption may not hold for all targets, but this heuristic works well in practice.
C. Batch Targets
BTBS generates a set of target batches for each group oflexicographically sorted targets. Algorithm 1 shows the targetbatching algorithm. The algorithm takes as input a groupof sorted targets allTargets , a bound on the max numberof targets per batch maxTargetsPerBatch , the memory cutoffvalue memoryCutoff above which to consider the build withthe batch of targets running out of memory, and the occupancycutoff value occupancyCutoff above which to consider thebuild with the batch of targets exceeding the deadline. Theoutput batches is a list of target batches.The algorithm first initializes batches to an empty list.When allTargets is not empty, the algorithm takes the first maxTargetsPerBatch targets in allTargets as the initial targets list for binary search. The first binary search queries the memory model and tries to find the largest sublist of targets that fits into the memoryCutoff during execution. The secondbinary search queries the occupancy model and tries to findthe largest sublist of targets (updated in the first binary search)that uses less than occupancyCutoff
ESU. The binary searched targets list always starts with the same initial target in eachiteration. Finally, the batch of targets is added to batches andremoved from allTargets . limitBatchSizeByCutoff is the method that does thebinary search. It takes as input a list of targets to search,a cutoff value to fit the final target list and a modelname modelName to query the machine learning model. The getEstimateFromTargets method generates the featuresfrom a synthetic build with the same execution context, buildflags as present in the first EnqueueTargets request, butwith a different sublist of targets . Then, the method sendsthe generated features to the model which then returns backan estimate. For the memory model, it returns the estimatedpeak memory usage of the build. For the occupancy model,it returns the average occupancy of the build. Finally, the limitBatchSizeByCutoff method returns a sublist of tar-gets that uses less than cutoff
GB of memory at peak or fewerthan cutoff
ESU on average. Note that the returned targets sublist always includes at least one target.
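Algorithm 1 can be turned into a runnable sketch as below. The per-target cost model (a fixed cost per target) is a stand-in assumption for the real memory and occupancy estimators, which are learned models queried over RPC.

```python
# Runnable sketch of Algorithm 1 with a stubbed estimator. The real
# getEstimateFromTargets queries a learned model over RPC; here we assume
# a fixed cost per target purely for illustration.
def get_estimate_from_targets(targets, model_name):
    cost = {"MEMORY_MODEL": 1.0, "OCCUPANCY_MODEL": 2.0}[model_name]
    return cost * len(targets)

def limit_batch_size_by_cutoff(targets, cutoff, model_name):
    """Binary search for the largest prefix whose estimate is below cutoff."""
    low, high, cutoff_index = 0, len(targets) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        estimate = get_estimate_from_targets(targets[:mid + 1], model_name)
        if estimate < cutoff:
            cutoff_index = mid
            low = mid + 1
        else:
            high = mid - 1
    return targets[:cutoff_index + 1]  # always keeps at least one target

def batch_targets(all_targets, max_targets_per_batch,
                  memory_cutoff, occupancy_cutoff):
    batches = []
    while all_targets:
        targets = all_targets[:max_targets_per_batch]
        targets = limit_batch_size_by_cutoff(targets, memory_cutoff, "MEMORY_MODEL")
        targets = limit_batch_size_by_cutoff(targets, occupancy_cutoff, "OCCUPANCY_MODEL")
        batches.append(targets)
        all_targets = all_targets[len(targets):]
    return batches

# With 10 targets, a memory cutoff of 5 (so at most 4 targets per batch under
# the stub's cost of 1 per target) and a loose occupancy cutoff of 20, the
# memory cutoff dominates and we get batches of 4, 4 and 2 targets.
targets = [f"//pkg:t{i}" for i in range(10)]
print([len(b) for b in batch_targets(targets, 8, 5.0, 20.0)])  # → [4, 4, 2]
```

Note how the single-target guarantee falls out of initializing cutoff_index to 0: even if one target already exceeds the cutoff, the batch still contains that target.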
D. Batch Size Reasons
Each target batch is associated with a batch size reason, indicating why the batch of targets is created. We define a target batch to be valid if either (1) it uses less than memoryCutoff memory and occupancyCutoff occupancy; or (2) it only has one target. Table II shows all possible kinds of batch size reasons. ONLY_ONE_TARGET means that the remaining unbatched target list only has one target, and BTBS simply creates a single target batch. MAX_TARGETS means that the initial target batch with size maxTargetsPerBatch is valid. ALL_REMAINING_TARGETS means that the target batch includes all remaining unbatched targets and it is valid. MAX_MEMORY means that the target batch is created by a binary search with the memory model and is valid. MAX_OCCUPANCY means that the target batch is created by a binary search with the occupancy model and is valid. MEMORY_ESTIMATE_ERROR means that the memory model returns an error and the target batch includes a list of targets with a fallback size. OCCUPANCY_ESTIMATE_ERROR means that the occupancy model returns an error and the target batch includes a list of targets with a fallback size. These batch size reasons can help us understand the impact of the memory model and occupancy model.

Note that a target batch may have a higher estimate than memoryCutoff or occupancyCutoff when it only contains a single target. But there is nothing BTBS can do because it cannot split a single target. The memory and occupancy models are served in a separate server and BTBS queries those models via RPC. So it is possible that the RPC fails, e.g. with RPC deadline exceeded errors.

TABLE II: Batch size reasons

Batch Size Reason | Description
ONLY_ONE_TARGET | Remaining unbatched target list only has one target.
MAX_TARGETS | Initial target batch with size maxTargetsPerBatch is valid.
ALL_REMAINING_TARGETS | Target batch includes all remaining unbatched targets and is valid.
MAX_MEMORY | Target batch is created by a binary search with the memory model.
MAX_OCCUPANCY | Target batch is created by a binary search with the occupancy model.
MEMORY_ESTIMATE_ERROR | Target batch includes a fallback size of unbatched targets due to memory error.
OCCUPANCY_ESTIMATE_ERROR | Target batch includes a fallback size of unbatched targets due to occupancy error.
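One plausible way to assign these reasons is sketched below. The predicate names and their ordering are assumptions made for illustration (the error reasons are omitted); in the real service the reason is set at the point in Algorithm 1 where the batch boundary is decided.

```python
# Illustrative assignment of a batch size reason. The predicates and their
# priority order are assumptions; the production logic records the reason
# where the batch boundary is actually decided.
def batch_size_reason(batch, remaining, max_targets_per_batch,
                      shrunk_by_memory, shrunk_by_occupancy):
    if len(remaining) == 1 and batch == remaining:
        return "ONLY_ONE_TARGET"
    if shrunk_by_memory:
        return "MAX_MEMORY"
    if shrunk_by_occupancy:
        return "MAX_OCCUPANCY"
    if len(batch) == max_targets_per_batch:
        return "MAX_TARGETS"
    return "ALL_REMAINING_TARGETS"

print(batch_size_reason(["//a:t"], ["//a:t"], 900, False, False))
# → ONLY_ONE_TARGET
```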
E. Create Build
For each target batch, BTBS creates a build with the same execution context and flags as present in the first EnqueueTargets request. The build will be enqueued in the build service system [14]. Finally, BTBS sends back all build request IDs generated by the build service system to the clients so that they can use them to track the build statuses. The clients receive a sequence of EnqueueTargets responses and each response contains a single build request ID.
F. Build Failure Retry
Even with the memory model and the occupancy model, builds may still fail with OOM or DE errors. If that happens, BTBS can retry those builds on the build service system [14]. For an OOM build, BTBS splits the targets in that build into halves and reruns Algorithm 1 for each subset of the split targets. BTBS does not retry single target OOM builds. For a DE build, BTBS reruns Algorithm 1 with the exact same targets in that DE build. The reason is that we expect the build service system to cache some actions done for the DE build, so rerunning the build should be faster and may finish within the deadline.

IV. MEMORY AND OCCUPANCY PREDICTION
A. Data Collection
Bazel is a Java based build system and it runs on the JVM. The Bazel process keeps track of the memory and occupancy usage of a build execution.

In terms of memory usage, Bazel records both the peak heap memory usage and the peak post full GC heap memory usage. The peak heap memory usage is the max heap memory usage during the build execution. Bazel does not necessarily run out of memory if the peak heap memory usage is close to the allocated memory, because the JVM can garbage collect (GC) the unused references and free some heap memory. The peak post full GC heap memory usage is the max heap memory usage after a full GC during the build execution. If the peak post full GC heap memory usage is close to the allocated memory, then Bazel will run out of memory because GC cannot free up more memory space. We use the peak post full GC heap memory usage as the predicting label for the memory model. If no GC is triggered during the build execution, then we use the peak heap memory usage as the predicting label.

In terms of occupancy usage, Bazel records the exact build execution time and the total executor service time for the build. The build execution time is the wall time of the command-specific execution, excluding Bazel startup time. The executor service time for a Bazel action is the number of executors (or equivalent memory unified by ESU) occupied by that action, multiplied by its execution time. The total executor service time for a build is the sum of the executor service times of all actions generated by that build. The average executor occupancy is equal to the total executor service time divided by the build execution time. Conceptually, the average executor occupancy measures the average number of concurrently used executors (or amount of memory) during a build execution. If the average executor occupancy is above the concurrent executors limit, then it is more likely to cause a Type I deadline exceeded build. We use the average executor occupancy (in ESU) as the predicting label for the occupancy model.

The build flags and targets are the raw features we use to predict the memory and occupancy usage. The Bazel process is configured to store all execution statistics in a distributed file system [8] and we have a Flume [6] job to generate all features and labels from the execution statistics every day.
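The occupancy label described above can be computed as in this sketch. The action list format is an assumption made for illustration; the real statistics come from Bazel's execution logs.

```python
# Sketch of the average-executor-occupancy label. Each action is a pair
# (esu, seconds): the ESU it occupies and how long it runs. This input
# format is an illustrative assumption, not Bazel's real log schema.
def average_occupancy(actions, build_wall_seconds):
    """Total executor service time divided by build execution time, in ESU."""
    total_service_time = sum(esu * seconds for esu, seconds in actions)
    return total_service_time / build_wall_seconds

# Two actions: 4 ESU for 30s and 2 ESU for 60s, in a 60s build:
# (4*30 + 2*60) / 60 = 4 ESU on average.
print(average_occupancy([(4, 30), (2, 60)], 60))  # → 4.0
```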
B. Feature Engineering
By measuring how important a feature is in predicting the label, we can choose the most representative features instead of all the features. The benefits of feature reduction include (1) speeding up the model training, (2) making the model training stable (training the same model multiple times produces close results), (3) avoiding learning bad features, and (4) reduced data processing. We use mutual information [10], [13] to directly measure how much information a set of features can provide to predict the label.

Table III shows the set of important features we use to predict both memory and occupancy usages. The build priority does not affect the memory or occupancy usage of a build. But it happens that high priority builds often run a smaller set of targets compared to low priority builds. For example, the number of targets triggered by a human code change is often smaller than the number of targets triggered by a code coverage tool. The Bazel command name is important because it affects the way targets are built. For example, the test command not only builds the libraries/binaries but also runs the tests. As a comparison, the build command only builds a more restricted set, such as libraries and binaries. The originating user and the product area are important because certain users or teams often trigger targets of the same project and their builds often have similar cost. The tool tag is another important feature because the sets of targets triggered by different tools can vary a lot. For example, a presubmit service often triggers a small set of targets given a code change, but the postsubmit service often triggers millions of targets. The targets and the packages, i.e. directories within which the targets are declared, are very important because targets often have different costs to build. Typically, a build with more targets and packages is more expensive than a build with fewer targets and packages. Build flags are also important because they affect how Bazel builds the targets.

TABLE III: Important features

Features | Description
Build priority | The priority of the build.
Command name | The Bazel command name.
Originating user | The username that requests the build.
Product area | The product area under which the build is charged.
Tool tag | The tool that sends targets to BTBS.
Targets | The set of targets in the build.
Target count | The number of distinct targets in the build.
Packages | The set of packages for the corresponding targets in the build.
Package count | The number of distinct packages of all targets in the build.
--cache_test_results | Whether Bazel uses the previously saved test results when running tests.
--cpu | The target CPU architecture to be used for the compilation of binaries during the build.
--discard_analysis_cache | Whether Bazel should discard the analysis cache right before execution starts.
--fuseless_output | Whether to generate intermediate output files in memory.
--jobs | The limit on the number of concurrent executors used during the build execution.
--keep_going | Whether to proceed as much as possible even in the face of errors.
--keep_state_after_build | Whether to keep incremental in-memory state after the build execution.
--runs_per_test | The number of times each test should be executed.
--test_size_filters | Only run test targets with the specified size.
--use_action_cache | Whether to use Bazel's local action cache.
For example, the --discard_analysis_cache flag tells Bazel to discard the analysis cache for the previous build before the next build execution starts, so it affects the memory used by the build. The --keep_going flag tells Bazel to proceed even in the face of errors, and it may cause a failed build to use more memory compared to exiting the execution once a failure occurs. The --runs_per_test flag controls the number of times each test should be executed, so it can affect the number of concurrently used executors in the case of parallel execution. The --test_size_filters flag filters a subset of test targets to execute and thus can affect both the memory and occupancy usage of a build. In general, the important features selected based on the mutual information indeed affect the memory and occupancy usage, intuitively. There are more important flags, but we do not describe them for brevity.

In addition to the basic important features in Table III, we also generate synthetic features to improve the model accuracy. For example, the target and package counts are discretized into quantile buckets. Specifically, we can split the range of target counts by their median, so that half of the target counts fall into the first bucket and the other half into the second bucket. This strategy converts a numeric feature into a categorical feature. Another example is to split each target path into multiple fragments based on the path delimiter. Specifically, we generate the target prefix splits feature from a target //a/b/c:t by splitting its path into //a/b/c:t, //a/b/c, //a/b and //a. This helps the model to learn the memory or occupancy usage patterns for targets under different directories.

All basic and synthetic categorical features are used to generate feature crosses (Section II-E). We use an off-the-shelf blackbox optimization tool similar to AutoML [9] to generate a set of candidate feature crosses. The tool trains the model with a small set of steps to see which feature crosses give the best performance. All feature crosses are generated from the basic and synthetic categorical features, and we limit the max number of feature crosses to 6. Finally, the tool returns the best feature crosses for the models. We run the feature cross search separately for the memory and occupancy models.

C. Model Training
Once all features are finalized, they will be used to train the memory and occupancy models. We choose to train a regression model because it fits well with the binary search described in Algorithm 1. Another reason not to train a binary classification model is that the model may perform badly when only few OOM builds are present in the training data.

We want our models to be monotonic over all targets related features. This property makes sure that adding new targets to the build with the same flags always results in an increased memory estimate. If the monotonicity property does not hold, then the binary search may not work as expected. We achieve the monotonicity property by setting the regularization term to infinity when the weights of the targets related features are negative. Specifically, we use the loss function as follows:

$$L(\boldsymbol{\beta}, \mathbf{x}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + R(\boldsymbol{\beta})$$

$$R(\boldsymbol{\beta}) = \sum_{i=1}^{n} r_i(\beta_i), \qquad r_i(\beta_i) = \begin{cases} \lambda_i^- |\beta_i| & \beta_i < 0 \\ \lambda_i^+ |\beta_i| & \beta_i \geq 0 \end{cases}, \qquad \lambda_i^-, \lambda_i^+ \geq 0$$

where $L(\boldsymbol{\beta}, \mathbf{x}, \mathbf{y})$ is the mean squared error (MSE) loss function and $R(\boldsymbol{\beta})$ is the regularization term. $\lambda_i^-$ and $\lambda_i^+$ are the L1 regularization terms for the weight $\beta_i$ when $\beta_i < 0$ and $\beta_i \geq 0$, respectively. We set $\lambda_i^- = \infty$ for the weights of the targets related features. This causes all the weights of the targets related features to be non-negative, thus making the model monotonic over all targets related features.

We train the memory and occupancy models using the last 17 days of all builds that are executed in the build service system. This means that the training and testing datasets include builds that are not created by BTBS. 17 is chosen to cover at least 2 weeks of weekday data and be able to handle holiday weekends. However, using a long period of training data may cause the memory model to only slowly capture the new memory patterns of recent builds.
For example, it is possible that a memory usage regression is introduced into Bazel, which causes otherwise identical builds to use more memory than before. The memory model would not be able to capture the memory usage increase until a couple of days later, because a majority of the builds in the training window still use less memory. To solve this problem, we train another memory model that uses data from only the most recent day, and BTBS uses the maximum of the two models' memory estimates when performing the binary search. All the models are continuously trained and pushed to production.

V. EVALUATION
A. Production Setup
BTBS is deployed geographically at 3 locations in the US and each location has 5 running jobs that serve traffic from all over the world. This distributed setting avoids a single point of failure. Our experiment lasts from 2020/01/16 to 2020/02/19 (35 days). We report the performance of BTBS using the production data. During this time, BTBS received 51 million EnqueueTargets RPC calls (1.47 million daily on average) and created 102 million builds (2.92 million daily on average). The maxTargetsPerBatch is set to 900 in Algorithm 1. The memory cutoff value is set to 7GB for high priority builds, 9GB for medium priority builds and 10GB for low priority builds. The reason is that the memory model is not 100% accurate and the number of OOM builds would increase if we set the cutoff close to the allocated memory (13GB). We set a lower memory cutoff for high priority builds because we cannot afford to have human users wait for OOM failure retries. In contrast, we can tolerate more low priority OOM builds, e.g. builds that collect code coverage and do not block developers. The occupancy cutoff value is set to 500 ESU, which is smaller than the max allowed concurrent executors limit (600). If the memory or occupancy model returns errors, then we fall back to a default batch size of 300.

All regularization terms \(\lambda_i^{+}, i = 1 \dots n\) for positive weights are set to 7000. All the learning parameters, e.g. learning rate and batch size, are set to their default values. During the experiment, the build service system ran 17.8 million builds on average each day, and 34.8% of the builds were Bazel queries that do not use much memory or any executor. So we excluded those builds and used the remaining 11.6 million builds as daily training examples.

In the rest of this section, 1k represents 1 thousand and 1m represents 1 million. The error is calculated as the difference between the actual and estimated memory usage, i.e. \((y - \hat{y})\).

B. BTBS Performance
Table IV shows some metrics that measure the performance of BTBS. QPS and Latency show the queries per second and latency in milliseconds of the EnqueueTargets streaming RPC over the past 24 hours during the entire experiment span. StreamTC shows the total target count per RPC to BTBS. StreamBC shows the total generated build count per RPC to BTBS. BuildTC, ExecT, Memory and Occupancy show the target count, execution time in seconds, memory usage in gigabytes and occupancy usage in ESU of individual builds created by BTBS.

TABLE IV: BTBS performance

Metrics     Avg       Min    Median    Max
QPS         6-27      0-12   4-24      18-89
Latency     265-414   1-81   150-179   69k-886k
StreamTC    674       1      2         70m
StreamBC    2         1      1         84k
BuildTC     339       1      24        900
ExecT       394       0.1    250       7923
Memory      3.87      0.04   2.82      13.26
Occupancy   18.3      0      2.8       2066.4

BTBS is used heavily in production and the average QPS over the past 24 hours ranges from 6 to 27. The min and max QPS over the past 24 hours range from 0 to 12 and from 18 to 89, respectively. BTBS receives less traffic on weekends and off the peak hours, and more traffic on weekdays during the peak hours. The batching algorithm is efficient and the average latency over the past 24 hours ranges from 265 to 414 milliseconds. The min and max latency over the past 24 hours range from 1 to 81 milliseconds and from 69k to 886k milliseconds, respectively. The latency of BTBS is very small if the clients enqueue fewer than hundreds of targets, but it can go up to minutes if the clients enqueue millions of targets. The average, min, median and max number of targets received by BTBS per RPC are 674, 1, 2 and 70m, respectively. A majority of the EnqueueTargets requests only contain 1-2 targets and they are sent by tools like the presubmit failure rerunner, culprit finder, flaky test detector, etc. The postsubmit service can send millions of targets to BTBS because it needs to make sure that all tests pass at a given revision. The average, min, median and max number of builds created by BTBS per RPC are 2, 1, 1 and 84k, respectively. This is consistent with the number of targets received by BTBS, because BTBS creates more builds as it receives more targets per RPC. Since a majority of the EnqueueTargets requests only contain 1-2 targets, BTBS only needs to create a single build most of the time. The average, min, median and max number of targets per build created by BTBS are 339, 1, 24 and 900, respectively. Note that the number of targets per build is bounded by maxTargetsPerBatch. The average, min, median and max execution time of builds created by BTBS are 394, 0.1, 250 and 7923 seconds, respectively. This indicates that most builds run in a couple of seconds to minutes, but some builds can run for 1-2 hours. A majority of builds have a deadline of 1.5 hours and will expire if they do not finish by the deadline. The average, min, median and max memory usage of builds created by BTBS are 3.87, 0.04, 2.82 and 13.26 GB, respectively. This shows that many builds take less than a few GB of memory but some of them can still use up all the allocated memory. The average, min, median and max occupancy of builds created by BTBS are 18.3, 0, 2.8 and 2066.4 ESU, respectively. The max occupancy usage is greater than the max allowed concurrent executors because some builds use more executor memory. This shows that a majority of builds only use a few concurrent executors or little executor memory, but some builds can use thousands of ESU. It is worth mentioning that some builds hit the cached actions, so they have 0 occupancy usage.
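Putting the production parameters together, the per-batch binary search of Algorithm 1 can be sketched as follows. This is a simplified, hypothetical rendering (the function and constant names are ours, and the real service also tracks batch size reasons and streams batches as targets arrive); it assumes only what the text states: monotonic model estimates, per-priority memory cutoffs, the 500 ESU occupancy cutoff, and the fallback batch size of 300 on model errors.

```python
# Production parameters described in the text.
MAX_TARGETS_PER_BATCH = 900
OCCUPANCY_CUTOFF_ESU = 500
FALLBACK_BATCH_SIZE = 300
MEMORY_CUTOFF_GB = {"high": 7, "medium": 9, "low": 10}

def next_batch_size(targets, priority, estimate_memory, estimate_occupancy):
    """Binary-search the largest prefix of `targets` whose estimated
    memory and occupancy stay under the cutoffs. Because the models are
    monotonic in the target features, shrinking the prefix can only
    lower the estimates, so binary search is valid."""
    lo, hi = 1, min(len(targets), MAX_TARGETS_PER_BATCH)
    best = 1  # a single remaining target always forms a build
    try:
        while lo <= hi:
            mid = (lo + hi) // 2
            batch = targets[:mid]
            if (estimate_memory(batch) <= MEMORY_CUTOFF_GB[priority]
                    and estimate_occupancy(batch) <= OCCUPANCY_CUTOFF_ESU):
                best, lo = mid, mid + 1   # fits: try a bigger batch
            else:
                hi = mid - 1              # too big: shrink
    except Exception:
        # Model error: fall back to the default batch size.
        return min(len(targets), FALLBACK_BATCH_SIZE)
    return best

# Toy monotonic cost models: each target costs 0.05GB and 2 ESU.
targets = list(range(2000))
size = next_batch_size(targets, "high",
                       lambda b: len(b) / 20.0, lambda b: 2.0 * len(b))
```

With these toy cost models the high-priority memory cutoff (7GB at 0.05GB per target) binds first, so the search settles on a 140-target batch rather than the 900-target maximum.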
C. Model Impact
Table V shows the build count distribution by batch size reasons. Build shows the number of builds. OOM shows the number of OOM builds and its percentage out of all builds in each batch size reason. DE shows the number of DE builds and its percentage out of all builds in each batch size reason. %Build shows the percentage of builds in each batch size reason out of all builds. %OOM shows the percentage of OOM builds in each batch size reason out of all OOM builds. %DE shows the percentage of DE builds in each batch size reason out of all DE builds.

Builds with batch size reasons ONLY_ONE_TARGET, MAX_TARGETS, ALL_REMAINING_TARGETS, MAX_MEMORY, MAX_OCCUPANCY, MEMORY_ESTIMATE_ERROR and OCCUPANCY_ESTIMATE_ERROR account for 21.12%, 32.95%, 33.06%, 12.71%, 0.06%, 0.10% and 0.00% of the total builds, respectively. The memory model is used in all builds with batch size reasons MAX_TARGETS, ALL_REMAINING_TARGETS, MAX_MEMORY, MAX_OCCUPANCY and OCCUPANCY_ESTIMATE_ERROR, which is 78.78% of all builds. The occupancy model is used in all builds with batch size reasons MAX_TARGETS, ALL_REMAINING_TARGETS, MAX_MEMORY and MAX_OCCUPANCY, which is 78.78% of all builds. ONLY_ONE_TARGET builds use neither the memory model nor the occupancy model, because BTBS always creates a build if a single target is left in the remaining target list. MEMORY_ESTIMATE_ERROR and OCCUPANCY_ESTIMATE_ERROR builds may use the memory and occupancy models, respectively, in the initial binary search iterations before the failure. So both the memory and occupancy models are heavily used in BTBS.
MAX_MEMORY builds provide an indicator of the potential number of OOM builds if BTBS simply created builds with maxTargetsPerBatch targets. It shows that 0.36% of MAX_MEMORY builds still run out of memory and they account for 58.75% of all out of memory builds. This is expected because the MAX_MEMORY builds should use more memory than other builds. The MEMORY_ESTIMATE_ERROR builds approximately indicate how BTBS would behave without the memory model. It shows that 0.93% of MEMORY_ESTIMATE_ERROR builds run out of memory. If the MAX_MEMORY builds had the same out of memory rate as the MEMORY_ESTIMATE_ERROR builds, we would have had 74k more out of memory builds during the experiment. In practice, the memory model is very important because an OOM build is very expensive to retry and it blocks the developer's productivity. The memory model is useful in limiting the memory usage while maximizing the batch density. BTBS used a small maxTargetsPerBatch before the memory model existed; the problem was that it generated too many builds and used up almost all Bazel workers, which blocked other builds from running.
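The 74k figure follows directly from Table V: apply the MEMORY_ESTIMATE_ERROR OOM rate (0.93%) to the 13.0m MAX_MEMORY builds and subtract the 46.9k OOMs they actually had.

```python
max_memory_builds = 13.0e6   # MAX_MEMORY build count (Table V)
actual_oom = 46.9e3          # OOM builds among MAX_MEMORY builds (Table V)
no_model_rate = 0.0093       # OOM rate of MEMORY_ESTIMATE_ERROR builds
extra_oom = max_memory_builds * no_model_rate - actual_oom  # ~74k more OOMs
```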
MAX_OCCUPANCY builds provide an indicator of the potential number of Type I DE builds if BTBS simply created builds with maxTargetsPerBatch targets. It shows that 1.4% of MAX_OCCUPANCY builds still exceed their deadlines and they account for 1.84% of all deadline exceeded builds. Interestingly, a majority of DE builds are not MAX_OCCUPANCY builds. This indicates that most DE builds are Type II DE builds. Out of all DE builds, 98.68% use less than or equal to 600 ESU and only 1.32% use more than 600 ESU. This confirms that Type II DE builds dominate in the current setup. Unfortunately, we only have 716 OCCUPANCY_ESTIMATE_ERROR builds and none of them exceeds the deadline, so it is hard to quantitatively measure how many more deadline exceeded builds we would have without the occupancy model. In practice, we used to have more DE builds before the occupancy model existed, so we believe that the occupancy model is useful to reduce Type I DE errors.

Table VI shows the build target count (BuildTC), execution time in seconds (ExecT), memory usage in GB (Memory) and occupancy usage in ESU (Occup), on average and at the median, for each batch size reason.

TABLE V: Build count distribution by batch size reasons

Batch Size Reason          Build    OOM (%)        DE (%)         %Build   %OOM    %DE
ONLY_ONE_TARGET            21.6m    22.5k (0.10)   6.6k (0.03)    21.12    28.13   14.22
MAX_TARGETS                33.7m    1.9k (0.01)    6.8k (0.02)    32.95    2.34    14.59
ALL_REMAINING_TARGETS      33.8m    7.6k (0.02)    21.0k (0.06)   33.06    9.52    45.17
MAX_MEMORY                 13.0m    46.9k (0.36)   11.2k (0.09)   12.71    58.75   24.13
MAX_OCCUPANCY              59.3k    35 (0.06)      855 (1.4)      0.06     0.04    1.84
MEMORY_ESTIMATE_ERROR      104.3k   967 (0.93)     25 (0.02)      0.10     1.21    0.05
OCCUPANCY_ESTIMATE_ERROR   716      0 (0.00)       0 (0.00)       0.00     0.00    0.00
Sum

TABLE VI: Average and median build statistics by batch size reason

                           Avg                              Median
Batch Size Reason          BuildTC  ExecT  Memory  Occup    BuildTC  ExecT  Memory  Occup
ONLY_ONE_TARGET            1        241    2.2     2.0      1        92     1.6     0.3
MAX_TARGETS                900      371    4.0     33.4     900      275    3.7     18.8
ALL_REMAINING_TARGETS      67       403    3.2     13.2     5        227    2.4     2.5
MAX_MEMORY                 154      679    8.1     15.2     9        574    8.3     2.6
MAX_OCCUPANCY              506      1424   5.5     471.9    555      1301   5.6     521.6
MEMORY_ESTIMATE_ERROR      239      443    3.0     21.0     300      306    1.8     11.9
OCCUPANCY_ESTIMATE_ERROR   222      391    2.2     19.6     300      273    1.6     11.3

Fig. 3: Memory model accuracy (GB). (a) RMSE by actual memory. (b) Build count by memory error. (c) Build count by actual memory.

Fig. 4: Occupancy model accuracy (ESU). (a) RMSE by actual occupancy. (b) Build count by occupancy error. (c) Build count by actual occupancy.

The MAX_MEMORY builds use 8.1GB and 8.3GB of memory on average and at the median, respectively. This indicates that MAX_MEMORY builds use more memory than other builds and that their usage is close to the memory cutoff. The MAX_OCCUPANCY builds use 471.9 and 521.6 occupancy on average and at the median, respectively. This indicates that MAX_OCCUPANCY builds use more occupancy than other builds and that their usage is close to the occupancy cutoff. Moreover, it takes 1424s and 1301s to run MAX_OCCUPANCY builds on average and at the median, respectively. This indicates that MAX_OCCUPANCY builds run longer than other builds and that the execution time is in general proportional to the occupancy usage. The result shows that the memory and occupancy models can restrict the memory and occupancy usage of the builds, thus reducing the OOM or DE errors.
ONLY_ONE_TARGET builds are single target builds, so they typically have the least execution time and memory/occupancy usage, as expected. All MAX_TARGETS builds have 900 targets, and BTBS cannot generate builds with more than maxTargetsPerBatch targets. So MAX_TARGETS builds typically have higher execution time and memory/occupancy usage.
ALL_REMAINING_TARGETS builds are generated at the end of Algorithm 1, so these builds typically have few targets and use little memory and occupancy. The average build target count for the MEMORY_ESTIMATE_ERROR and OCCUPANCY_ESTIMATE_ERROR builds is less than 300 because the initial binary search iterations may have already cut the batch size to less than 300. Both MEMORY_ESTIMATE_ERROR and OCCUPANCY_ESTIMATE_ERROR builds are rare and they use little memory and occupancy.

D. Model Accuracy
Figure 3a shows the root mean square error (RMSE) of the memory model grouped by actual memory usage (0GB to 13GB with a step of 0.1GB). The graph is broken down by whether the memory model overestimates (blue line), underestimates (red line) or accurately predicts (yellow line) the actual memory usage. It also shows the overall RMSE (green line). We can see that the memory model heavily overestimates builds which use very little memory. A common reason is that some targets may finish without garbage collection (GC), so their recorded memory usage is much larger than it would be after GC. The memory model learns from builds without GC that some targets have large memory usage, so it often overestimates the memory usage of builds with the same targets that finish after GC. The model also overestimates builds which use 7-8GB of memory. Many of these builds are single target builds, and the model gives large memory estimates for all of them. In general, the model tends to have a larger error in overestimation than in underestimation. The error tends to go up for larger memory builds. Figure 3b shows the number of builds grouped by error value. The graph is broken down by build priority. It shows that most builds, regardless of their priorities, have an error close to 0, which means that the model performs well. Figure 3c shows the number of MAX_MEMORY builds grouped by their actual memory usage. We can see that the memory usage of high, medium and low priority builds peaks at around 7GB, 9GB and 10GB, respectively. This is consistent with our setup.

Figure 4a shows the RMSE of the occupancy model grouped by actual occupancy usage (0 ESU to 500 ESU with a step of 1 ESU). The occupancy error increases for larger occupancy builds. The overestimation error increases slowly and then decreases until it diminishes at 500. The model only underestimates builds that use more than 500 ESU (not shown in the graph). The underestimation error increases for larger occupancy builds. The overall RMSE increases as the actual build occupancy usage increases. This is expected because most builds use <100 ESU and the model does not have much data to learn from larger occupancy builds. Figure 4b shows the number of builds grouped by occupancy error value. It shows that most builds have an error close to 0, which again indicates that the model performs well. Figure 4c shows the number of MAX_OCCUPANCY builds grouped by their actual occupancy usage. We can see that the occupancy usage peaks at both 0 and around 500 ESU. The peak at 0 ESU is caused by many builds hitting the action cache in the executor cluster [14]; these builds do not occupy any executor. The peak at around 500 ESU is consistent with our setup.

VI. RELATED WORK
We did not find any related work on using machine learning models to construct builds that use limited resources in order to reduce OOM and DE errors. This problem is quite common at Google and we expect it to be common in other companies that have a monolithic code repository and many build targets. Our experience is that a service like BTBS becomes more important as the need to build a large number of targets increases. We believe that this work provides insight for developers who have a similar need, e.g. Bazel [1] and Buck [2] users.

Google's TAP service [11] mentions that test executions can fail due to OOM errors, but it does not give a solution. BTBS is designed exactly to solve that problem. In fact, the TAP service is one of the BTBS users.

The build service system [14] describes how builds are executed remotely at Google. BTBS is one of the build service system's users. It is designed to work around the constraints, i.e. the limited memory allocation and the max allowed concurrent executors, imposed by the build service system, which is not capable of building a very large number of targets.

VII. CONCLUSIONS
In this paper, we describe the first build target batching service (BTBS), which is able to partition a large stream of targets into a set of target batches such that the builds created from those target batches do not have OOM or DE errors. We discuss the batching algorithm as well as the machine learning models used in BTBS. Overall, we show that BTBS achieves a low OOM rate (0.08%) and a low DE rate (0.05%). We believe that BTBS introduces useful insights and can help in designing new target batching systems.

REFERENCES

[1] Bazel build tool. https://bazel.build/.
[2] Buck build tool. https://buck.build/.
[3] Feature cross. https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture/.
[4] Google Guava. https://github.com/google/guava/.
[5] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010.
[6] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. ACM SIGPLAN Notices, 2010.
[7] P. M. Duvall, S. Matyas, and A. Glover. Continuous Integration: Improving Software Quality and Reducing Risk. Pearson Education, 2007.
[8] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, 2003.
[9] F. Hutter, L. Kotthoff, and J. Vanschoren. Automated Machine Learning. 2019.
[10] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 2004.
[11] A. Memon, Z. Gao, B. Nguyen, S. Dhanda, E. Nickell, R. Siemborski, and J. Micco. Taming Google-scale continuous testing. In ICSE-SEIP, 2017.
[12] D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis. 2012.
[13] B. C. Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 2014.
[14] K. Wang, G. Tener, V. Gullapalli, X. Huang, A. Gad, and D. Rall. Scalable Build Service System with Smart Scheduling Service. In