Smart Build Targets Batching Service at Google
Kaiyuan Wang, Daniel Rall, Greg Tener, Vijay Gullapalli, Xin Huang, Ahmed Gad
Google Inc.
Abstract — Google has a monolithic codebase with tens of millions of build targets. Each build target specifies the information that is needed to build a software artifact or run tests. It is common to execute a subset of build targets at each revision and make sure that the change does not break the codebase. Google's build service system uses Bazel to build targets. Bazel takes as input a build that specifies the execution context, flags and build targets to run. The outputs are the built libraries, binaries or test results. To be able to support developers' daily activities, the build service system runs millions of builds per day.

It is a known issue that a build with many targets could run out of the allocated memory or exceed its execution deadline. This is problematic because it reduces the developer's productivity, e.g. by slowing code submissions or binary releases. In this paper, we propose a technique that predicts the memory usage and executor occupancy of a build. The technique batches a set of targets such that the build created with those targets does not run out of memory or exceed its deadline. This approach significantly reduces the number of builds that run out of memory or exceed their deadlines, hence improving developer productivity.
I. INTRODUCTION
Google has a monolithic codebase and it grows rapidly, with an average of more than 100,000 code submissions per day. To make sure that new changes do not break the existing codebase, Google has adopted Continuous Integration (CI) [7], [11]. For each code change, a CI system first uses the build service system [14] to build the affected libraries and binaries. Then, it runs the affected tests to check if the built libraries and binaries are working as intended.

Google's build service system uses Bazel [1] to build libraries and run tests. Bazel takes as input a set of build specification files that declare build targets. We refer to a build target as a target in the rest of this paper. A target specifies what is needed to produce an artifact, such as a library or binary. A test target specifies what is needed to run tests and check if a code change breaks the codebase. Bazel decides how to build a given target based on the target's specification.

As the codebase grows, each change may affect a lot of targets. For example, a code change on a common library like Guava [4] could affect many Java targets. A C++ compiler change could affect all C++ targets. Moreover, the postsubmit service needs to guarantee that all targets in a given change revision compile and pass all tests. The above use cases are common and require Bazel to build up to millions of targets at once. Running a large number of targets in a build execution may cause out of memory (OOM) errors in Bazel, as it exceeds the memory of a single machine for dependency analysis. Moreover, Google's infrastructure limits the number of executors, e.g. CPU, GPU and TPU, a build can use in parallel at build execution time. So executing too many targets in a build may cause the build to take too long and exceed its invocation deadline, resulting in deadline exceeded (DE) errors.
On the contrary, executing too few targets in a build may use up all the machines allocated to run builds.

In this paper, we propose a build target batching service (BTBS) that partitions a large stream of targets into batches and creates a build for each batch of targets such that those builds do not have OOM or DE errors. The technique relies on a memory estimation model and an executor occupancy estimation model. The memory model predicts the peak memory usage of a build. The occupancy model predicts the average executor occupancy of a build. The technique partitions a large stream of targets into target batches such that the build with the flags and each batch of targets consumes a limited amount of memory or executor occupancy. The results show that BTBS generates few OOM and DE builds, with a 0.08% OOM rate and a 0.05% DE rate, which saves a lot of computational resources used for build failure retries.

The paper makes the following contributions:
• It presents the first technique that effectively creates builds from a large stream of targets with the goal of minimizing the number of OOM and DE errors. It is also the first technique that predicts memory and occupancy usage of build executions.
• It demonstrates that the technique is able to reduce the OOM rate to 0.08% and the DE rate to 0.05%. Our past experience shows that BTBS is critical and improves developer productivity by reducing build failure retries due to OOM or DE errors.

II. BACKGROUND
A. Bazel
Bazel is an open-source build and test tool and is widely used within Google. It is responsible for transforming source code into libraries, executable binaries, and other artifacts. Bazel takes as input a set of flags and targets that programmers declare in build files. It supports a large number of command line options and these options can affect the way Bazel generates outputs.

TABLE I: Example build flags
Category | Example
Error checking | --check_visibility enables checking if all dependent targets are visible to all targets to build.
Tool flags | --copt specifies the arguments to be passed to the C++ compiler.
Build semantics | --cpu specifies the target CPU architecture to be used for the compilation.
Execution strategy | --jobs specifies a limit on the number of concurrently running executors during build execution.
Output selection | --fuseless_output restricts Bazel to generate intermediate output files in memory.
Platform | --platforms specifies the labels of the platform rules describing the target platforms.
Miscellaneous | --use_action_cache enables Bazel's local action cache.

java_library(
    name = "HelloWorld",
    srcs = ["HelloWorld.java"],
)

java_library(
    name = "HelloWorldTest",
    srcs = ["HelloWorldTest.java"],
    deps = [":HelloWorld"],
)

java_test(
    name = "AllTests",
    size = "small",
    tags = ["requires-gpu"],
    deps = [":HelloWorldTest"],
)
Fig. 1: Example Java targets

Fig. 2: Build service architecture

Table I shows different flag categories with examples. For example, the --fuseless_output flag restricts Bazel to generate intermediate output in memory instead of writing it to the disk. The --jobs flag specifies a limit on the number of concurrently running executors during a build execution. So --fuseless_output and --jobs can significantly affect the memory usage and executor occupancy of a build, respectively. Figure 1 shows some example Java targets.
HelloWorld is a Java library target that compiles HelloWorld.java into a library (java_library is the target rule). HelloWorldTest is a Java library target that compiles HelloWorldTest.java into a library, and it depends on the HelloWorld target because HelloWorldTest.java uses HelloWorld.java. AllTests is a Java test target that runs the HelloWorldTest library using JUnit, and the execution requires a GPU. When a programmer issues a command to build a target, Bazel first ensures that the required dependencies of the target are built. Then, it builds the desired target from its sources and dependencies. When a programmer issues a command to run a test target, Bazel will first build all dependencies of the test target and then execute the tests.
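The dependency-first ordering described above can be sketched as a walk over the target dependency graph. This is a minimal illustration of the concept, not Bazel's actual implementation; the dependency map mirrors the targets in Figure 1.

```python
# Sketch of dependency-first build ordering, using the targets of Figure 1.
# This illustrates the concept only; it is not Bazel's real algorithm.
deps = {
    ":HelloWorld": [],
    ":HelloWorldTest": [":HelloWorld"],
    ":AllTests": [":HelloWorldTest"],
}

def build_order(target, deps):
    """Return targets in the order they must be built (dependencies first)."""
    order = []
    def visit(t):
        for d in deps[t]:
            visit(d)
        if t not in order:
            order.append(t)
    visit(target)
    return order

print(build_order(":AllTests", deps))
# → [':HelloWorld', ':HelloWorldTest', ':AllTests']
```

Running a test target thus first builds HelloWorld, then HelloWorldTest, and only then executes AllTests.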
B. Build Service System
BTBS takes as input a set of flags and a stream of targets, and outputs a set of builds that include all targets with the same flags. The output builds are sent to the build service system [14] for execution. Figure 2 shows how BTBS is connected to the build service system. The system diagram is simplified for brevity.

BTBS splits a stream of targets into batches and creates a build for each batch of targets. Those builds are enqueued to the scheduling service. The scheduling service waits until there are available resources and dequeues a build to a Bazel worker for execution. The worker allocates a fixed amount of memory for each Bazel process, and a build runs out of memory if the Bazel process uses more than the allocated memory during execution. Bazel translates the build flags and targets into actions, and sends those actions to the executor cluster for the actual execution. For example, if a test requires a GPU then Bazel will send it to the GPU executors. The executor cluster has a large but limited number of executors and each executor talks to a Bazel process to execute actions. The executor cluster also has an action cache to minimize duplicate work. Each Bazel process is configured to use a limited number of executors concurrently to avoid the case where a very large build uses a lot of executors and blocks other build executions. As a consequence, executing a large build that could use more executors than the limit causes the additional actions to queue until some executors become idle again. This may cause the build to exceed its deadline, and we call these builds Type I DE builds. It is also possible that some actions depend on other actions and they must be executed in sequence. This may cause a build to exceed its deadline if some actions are long running. We call these builds Type II DE builds.

An action uses both memory and executors in the executor cluster. We use an executor service unit (ESU) to unify the expense of both executor memory and CPU, and 1 ESU is equal to 2.5GB of memory or 1 executor. Note that the executor memory usage is different from the Bazel memory usage. We refer to the occupancy usage as the ESU used by a build in the rest of the paper. BTBS reduces Type I DE errors by limiting the occupancy of a build, and we do not consider Type II DE errors because they are not correlated with the build occupancy usage. BTBS assumes that a build with too much ESU usage is more likely to have queuing actions that cause Type I DE errors. The executor cluster reports the executor availability to the quota governor, and the scheduling service makes dequeuing decisions based on the executor availability from the quota governor. During a build execution, the Bazel process reports the progress and result to the build event service. Clients of BTBS can use the build request IDs to query the build event service and find the build progress and status in real time.
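As a concrete reading of the ESU definition above (1 ESU equals 2.5GB of executor memory or 1 executor), an action's cost can be sketched as below. Whether the memory and executor components are summed is an assumption made here for illustration; the function name and example numbers are likewise illustrative, not from the production system.

```python
# Illustrative ESU accounting: 1 ESU = 2.5GB of executor memory or 1 executor.
# Summing the two components is an assumption for illustration only.
ESU_GB = 2.5

def action_esu(executors, memory_gb):
    """Cost of an action in executor service units (ESU)."""
    return executors + memory_gb / ESU_GB

# An action occupying 2 executors and 5GB of executor memory: 2 + 5/2.5 = 4 ESU.
print(action_esu(2, 5.0))  # → 4.0
```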
C. Collateral Damage
Each build may use different executor types. For example, the AllTests Java test target can use both x86 CPU and GPU during execution. The build scheduling service throttles builds based on the availability of each executor type. For example, a build that requires both x86 CPU and GPU can only be dequeued when both x86 CPU executors and GPU executors are available. In other words, a build might be delayed if one of its required executor types is unavailable even if all other required executor types are available. Moreover, the delayed build will reserve the executor resources and block other lower priority builds from dequeuing. This design prevents high priority builds from being starved by lower priority builds that use very few ESU [14]. As a consequence, a build that requires more executor types is more likely to be delayed, and it may block other lower priority builds from dequeuing.

If a delayed build contains targets that require different sets of executor types, then some of the targets could have been dequeued if they were in a separate build. For example, assume that a build contains one target that only uses x86 executors and another target that uses both x86 and GPU executors. The build could be throttled due to insufficient GPU executors, but the target that only uses x86 executors could be dequeued if it is built separately. We call such delays collateral damage.

D. Linear Regression
In machine learning, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (denoted as $y$) and one or more independent variables (denoted as $x$). The most common form of regression analysis is linear regression [12]. It tries to find the line that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane) that minimizes the sum of squared distances between the true data and that line (or hyperplane).

Given a data set $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^{n}$ of $n$ statistical units, a simple linear regression model has the following form:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i, \quad i = 1, \ldots, n$$

where $T$ denotes the transpose, so that $\mathbf{x}_i^T \boldsymbol{\beta}$ is the inner product between vectors $\mathbf{x}_i$ and $\boldsymbol{\beta}$.

Fitting a linear model to a given data set usually requires estimating the regression coefficients $\boldsymbol{\beta}$ such that the error term $\varepsilon_i = y_i - \mathbf{x}_i^T \boldsymbol{\beta}$ is minimized across all $n$ samples. For example, it is common to use the sum of squared errors $\sum_{i=1}^{n} \varepsilon_i^2$ as the quality of the fit. Linear models can be efficiently trained using stochastic gradient descent [5].

E. Feature Cross
A feature cross is a synthetic feature formed by multiplying (crossing) two or more features [3]. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually. As a consequence, feature crosses can help us learn non-linear functions using linear regression. A well-known example is that the XOR function $f(x, y)$, where $x, y, f(x, y) \in \{0, 1\}$, is not linearly separable, and it cannot be written as $f(x, y) = \alpha x + \beta y + \gamma$ where $\alpha$, $\beta$ and $\gamma$ are real numbers. However, the XOR function can be written as $f(x, y) = x + y - 2xy$, where the $xy$ term is a feature cross for $x$ and $y$.

In practice, machine learning models seldom cross continuous features. However, machine learning models do frequently cross one-hot feature vectors. Feature crosses of one-hot feature vectors are analogous to logical conjunctions. For example, suppose we bin latitude and longitude, producing separate one-hot five-element feature vectors, e.g. [0, 0, 0, 1, 0]. Further assume that we want to predict the gross income of a person using the binned latitude and longitude as features. Creating a feature cross of these two features results in a 25-element one-hot vector (24 zeroes and 1 one). The single 1 in the cross identifies a particular conjunction of latitude and longitude. By feature crossing these two one-hot vectors, the model can form different conjunctions, which will ultimately produce far better results compared to a linear combination of individual features.

III. BUILD TARGET BATCHING SERVICE
A. EnqueueTargets API
BTBS provides a single remote procedure call (RPC) called EnqueueTargets. The RPC is a streaming RPC: it takes as input a sequence of requests and returns to the clients a sequence of responses. The main reason to have a streaming RPC is to avoid needing to stream millions of targets in a single request. We enforce the first EnqueueTargets request to include the execution context, build flags and optionally some targets. The execution context points to either a workspace that contains the unsubmitted code, or an existing code revision. The flags and targets are described in Section II-A. The subsequent requests may only contain the remaining targets. BTBS will create a set of builds with the same execution context and flags but different targets. The created builds include all targets in the requests. Each EnqueueTargets response contains the enqueued build request ID so that the client can use it to query the build status.
B. Group Targets
Once BTBS receives all targets from the client, it first groups targets by the executor types they use. This avoids the collateral damage described in Section II-C. For example, all targets that use only x86 executors will be grouped together. All targets that use both x86 and Mac executors will be grouped together. Given a build target, BTBS determines its required executor types by checking the tags attribute, e.g. the requires-gpu tag in the AllTests target in Figure 1.

Algorithm 1: Batching targets algorithm
Input: List of targets to batch allTargets; max targets per batch maxTargetsPerBatch; memory cutoff value memoryCutoff; occupancy cutoff value occupancyCutoff.
Output: Target batches batches.

  batches = []
  while len(allTargets) > 0 do
    targets = allTargets[:maxTargetsPerBatch]
    targets = limitBatchSizeByCutoff(targets, memoryCutoff, MEMORY_MODEL)
    targets = limitBatchSizeByCutoff(targets, occupancyCutoff, OCCUPANCY_MODEL)
    batches.append(targets)
    allTargets = allTargets[len(targets):]
  return batches

  Function limitBatchSizeByCutoff(targets, cutoff, modelName):
    low = 0, high = len(targets) - 1, cutoffIndex = 0
    while low <= high do
      mid = (low + high) / 2
      estimate = getEstimateFromTargets(targets[:mid + 1], modelName)
      if estimate < cutoff then
        cutoffIndex = mid
        low = mid + 1
      else
        high = mid - 1
    return targets[:cutoffIndex + 1]

If a target does not have a tag, then BTBS determines the required executor type by the target rule. For example, if a target has the rule ios_ui_test, then it uses Mac executors. Once all targets are grouped by their required executor types, we sort each group of targets in lexicographical order. This is a heuristic to increase the probability that the batched targets share similar dependencies and thus allow Bazel to construct tighter dependency graphs and reduce the memory usage. The idea is that developers tend to declare targets for the same project under the same directory. The targets in the same directory are likely to share dependent targets. For example, target //a/b/c:t1 may share more dependencies with target //a/b/c:t2 than with target //d/e/f:t3. The assumption may not hold for all targets, but this heuristic works well in practice.
C. Batch Targets
BTBS generates a set of target batches for each group oflexicographically sorted targets. Algorithm 1 shows the targetbatching algorithm. The algorithm takes as input a groupof sorted targets allTargets , a bound on the max numberof targets per batch maxTargetsPerBatch , the memory cutoffvalue memoryCutoff above which to consider the build withthe batch of targets running out of memory, and the occupancycutoff value occupancyCutoff above which to consider thebuild with the batch of targets exceeding the deadline. Theoutput batches is a list of target batches.The algorithm first initializes batches to an empty list.When allTargets is not empty, the algorithm takes the first maxTargetsPerBatch targets in allTargets as the initial targets list for binary search. The first binary search queries the memory model and tries to find the largest sublist of targets that fits into the memoryCutoff during execution. The secondbinary search queries the occupancy model and tries to findthe largest sublist of targets (updated in the first binary search)that uses less than occupancyCutoff
ESU. The binary searched targets list always starts with the same initial target in eachiteration. Finally, the batch of targets is added to batches andremoved from allTargets . limitBatchSizeByCutoff is the method that does thebinary search. It takes as input a list of targets to search,a cutoff value to fit the final target list and a modelname modelName to query the machine learning model. The getEstimateFromTargets method generates the featuresfrom a synthetic build with the same execution context, buildflags as present in the first EnqueueTargets request, butwith a different sublist of targets . Then, the method sendsthe generated features to the model which then returns backan estimate. For the memory model, it returns the estimatedpeak memory usage of the build. For the occupancy model,it returns the average occupancy of the build. Finally, the limitBatchSizeByCutoff method returns a sublist of tar-gets that uses less than cutoff
GB of memory at peak or fewerthan cutoff
ESU on average. Note that the returned targets sublist always includes at least one target.
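Algorithm 1 can be turned into a runnable sketch as below. The per-target cost model (a fixed cost per target) is a stand-in assumption for the real memory and occupancy estimators, which are learned models queried over RPC.

```python
# Runnable sketch of Algorithm 1 with a stubbed estimator. The real
# getEstimateFromTargets queries a learned model over RPC; here we assume
# a fixed cost per target purely for illustration.
def get_estimate_from_targets(targets, model_name):
    cost = {"MEMORY_MODEL": 1.0, "OCCUPANCY_MODEL": 2.0}[model_name]
    return cost * len(targets)

def limit_batch_size_by_cutoff(targets, cutoff, model_name):
    """Binary search for the largest prefix whose estimate is below cutoff."""
    low, high, cutoff_index = 0, len(targets) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        estimate = get_estimate_from_targets(targets[:mid + 1], model_name)
        if estimate < cutoff:
            cutoff_index = mid
            low = mid + 1
        else:
            high = mid - 1
    return targets[:cutoff_index + 1]  # always keeps at least one target

def batch_targets(all_targets, max_targets_per_batch,
                  memory_cutoff, occupancy_cutoff):
    batches = []
    while all_targets:
        targets = all_targets[:max_targets_per_batch]
        targets = limit_batch_size_by_cutoff(targets, memory_cutoff, "MEMORY_MODEL")
        targets = limit_batch_size_by_cutoff(targets, occupancy_cutoff, "OCCUPANCY_MODEL")
        batches.append(targets)
        all_targets = all_targets[len(targets):]
    return batches

# With 10 targets, a memory cutoff of 5 (so at most 4 targets per batch under
# the stub's cost of 1 per target) and a loose occupancy cutoff of 20, the
# memory cutoff dominates and we get batches of 4, 4 and 2 targets.
targets = [f"//pkg:t{i}" for i in range(10)]
print([len(b) for b in batch_targets(targets, 8, 5.0, 20.0)])  # → [4, 4, 2]
```

Note how the single-target guarantee falls out of initializing cutoff_index to 0: even if one target already exceeds the cutoff, the batch still contains that target.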
D. Batch Size Reasons
Each target batch is associated with a batch size reason, indicating why the batch of targets is created. We define a target batch to be valid if either (1) it uses less than memoryCutoff memory and occupancyCutoff occupancy; or (2) it only has one target. Table II shows all possible kinds of batch size reasons. ONLY_ONE_TARGET means that the remaining unbatched target list only has one target, and BTBS simply creates a single target batch. MAX_TARGETS means that the initial target batch with size maxTargetsPerBatch is valid. ALL_REMAINING_TARGETS means that the target batch includes all remaining unbatched targets and it is valid. MAX_MEMORY means that the target batch is created by a binary search with the memory model and is valid. MAX_OCCUPANCY means that the target batch is created by a binary search with the occupancy model and is valid. MEMORY_ESTIMATE_ERROR means that the memory model returns an error and the target batch includes a list of targets with a fallback size. OCCUPANCY_ESTIMATE_ERROR means that the occupancy model returns an error and the target batch includes a list of targets with a fallback size. These batch size reasons can help us understand the impact of the memory model and occupancy model.

Note that a target batch may have a higher estimate than memoryCutoff or occupancyCutoff when it only contains a single target. But there is nothing BTBS can do because it cannot split a single target. The memory and occupancy models are served in a separate server and BTBS queries those models via RPC. So it is possible that the RPC fails, e.g. with RPC deadline exceeded errors.

TABLE II: Batch size reasons

Batch Size Reason | Description
ONLY_ONE_TARGET | Remaining unbatched target list only has one target.
MAX_TARGETS | Initial target batch with size maxTargetsPerBatch is valid.
ALL_REMAINING_TARGETS | Target batch includes all remaining unbatched targets and is valid.
MAX_MEMORY | Target batch is created by a binary search with the memory model.
MAX_OCCUPANCY | Target batch is created by a binary search with the occupancy model.
MEMORY_ESTIMATE_ERROR | Target batch includes a fallback size of unbatched targets due to memory error.
OCCUPANCY_ESTIMATE_ERROR | Target batch includes a fallback size of unbatched targets due to occupancy error.
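One plausible way to assign these reasons is sketched below. The predicate names and their ordering are assumptions made for illustration (the error reasons are omitted); in the real service the reason is set at the point in Algorithm 1 where the batch boundary is decided.

```python
# Illustrative assignment of a batch size reason. The predicates and their
# priority order are assumptions; the production logic records the reason
# where the batch boundary is actually decided.
def batch_size_reason(batch, remaining, max_targets_per_batch,
                      shrunk_by_memory, shrunk_by_occupancy):
    if len(remaining) == 1 and batch == remaining:
        return "ONLY_ONE_TARGET"
    if shrunk_by_memory:
        return "MAX_MEMORY"
    if shrunk_by_occupancy:
        return "MAX_OCCUPANCY"
    if len(batch) == max_targets_per_batch:
        return "MAX_TARGETS"
    return "ALL_REMAINING_TARGETS"

print(batch_size_reason(["//a:t"], ["//a:t"], 900, False, False))
# → ONLY_ONE_TARGET
```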
E. Create Build
For each target batch, BTBS creates a build with the same execution context and flags as present in the first EnqueueTargets request. The build will be enqueued in the build service system [14]. Finally, BTBS sends back all build request IDs generated by the build service system to the clients so that they can use them to track the build statuses. The clients receive a sequence of EnqueueTargets responses and each response contains a single build request ID.
F. Build Failure Retry
Even with the memory model and the occupancy model, builds may still fail with OOM or DE errors. If that happens, BTBS can retry those builds on the build service system [14]. For an OOM build, BTBS splits the targets in that build into halves and reruns Algorithm 1 for each subset of the split targets. BTBS does not retry single target OOM builds. For a DE build, BTBS reruns Algorithm 1 with the exact same targets in that DE build. The reason is that we expect the build service system to cache some actions done for the DE build, so rerunning the build should be faster and may finish within the deadline.

IV. MEMORY AND OCCUPANCY PREDICTION
A. Data Collection
Bazel is a Java based build system and it runs on the JVM. The Bazel process keeps track of the memory and occupancy usage of a build execution.

In terms of memory usage, Bazel records both the peak heap memory usage and the peak post full GC heap memory usage. The peak heap memory usage is the max heap memory usage during the build execution. Bazel does not necessarily run out of memory if the peak heap memory usage is close to the allocated memory, because the JVM can garbage collect (GC) the unused references and free some heap memory. The peak post full GC heap memory usage is the max heap memory usage after a full GC during the build execution. If the peak post full GC heap memory usage is close to the allocated memory, then Bazel will run out of memory because GC cannot free up more memory space. We use the peak post full GC heap memory usage as the predicting label for the memory model. If no GC is triggered during the build execution, then we use the peak heap memory usage as the predicting label.

In terms of occupancy usage, Bazel records the exact build execution time and the total executor service time for the build. The build execution time is the wall time of the command-specific execution, excluding Bazel startup time. The executor service time for a Bazel action is the number of executors (or equivalent memory unified by ESU) occupied by that action, multiplied by its execution time. The total executor service time for a build is the sum of the executor service times of all actions generated by that build. The average executor occupancy is equal to the total executor service time divided by the build execution time. Conceptually, the average executor occupancy measures the average number of concurrently used executors (or amount of memory) during a build execution. If the average executor occupancy is above the concurrent executors limit, then it is more likely to cause a Type I deadline exceeded build. We use the average executor occupancy (in ESU) as the predicting label for the occupancy model.

The build flags and targets are the raw features we use to predict the memory and occupancy usage. The Bazel process is configured to store all execution statistics in a distributed file system [8] and we have a Flume [6] job to generate all features and labels from the execution statistics every day.
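The occupancy label described above can be computed as in this sketch. The action list format is an assumption made for illustration; the real statistics come from Bazel's execution logs.

```python
# Sketch of the average-executor-occupancy label. Each action is a pair
# (esu, seconds): the ESU it occupies and how long it runs. This input
# format is an illustrative assumption, not Bazel's real log schema.
def average_occupancy(actions, build_wall_seconds):
    """Total executor service time divided by build execution time, in ESU."""
    total_service_time = sum(esu * seconds for esu, seconds in actions)
    return total_service_time / build_wall_seconds

# Two actions: 4 ESU for 30s and 2 ESU for 60s, in a 60s build:
# (4*30 + 2*60) / 60 = 4 ESU on average.
print(average_occupancy([(4, 30), (2, 60)], 60))  # → 4.0
```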
B. Feature Engineering
By measuring how important a feature is in predicting the label, we can choose the most representative features instead of all the features. The benefits of feature reduction include (1) speeding up the model training, (2) making the model training stable (training the same model multiple times produces close results), (3) avoiding learning bad features, and (4) reduced data processing. We use mutual information [10], [13] to directly measure how much information a set of features can provide to predict the label.

Table III shows the set of important features we use to predict both memory and occupancy usages. The build priority does not affect the memory or occupancy usage of a build. But it happens that high priority builds often run a smaller set of targets compared to low priority builds. For example, the number of targets triggered by a human code change is often smaller than the number of targets triggered by a code coverage tool. The Bazel command name is important because it affects the way targets are built. For example, the test command not only builds the libraries/binaries but also runs the tests. As a comparison, the build command only builds a more restricted set, such as libraries and binaries. The originating user and the product area are important because certain users or teams often trigger targets of the same project and their builds often have similar cost. The tool tag is another important feature because the sets of targets triggered by different tools can vary a lot. For example, a presubmit service often triggers a small set of targets given a code change, but the postsubmit service often triggers millions of targets. The targets and the packages, i.e. directories within which the targets are declared, are very important because targets often have different costs to build. Typically, a build with more targets and packages is more expensive than a build with fewer targets and packages. Build flags are also important because they affect how Bazel builds the targets.

TABLE III: Important features

Features | Description
Build priority | The priority of the build.
Command name | The Bazel command name.
Originating user | The username that requests the build.
Product area | The product area under which the build is charged.
Tool tag | The tool that sends targets to BTBS.
Targets | The set of targets in the build.
Target count | The number of distinct targets in the build.
Packages | The set of packages for the corresponding targets in the build.
Package count | The number of distinct packages of all targets in the build.
--cache_test_results | Whether Bazel uses the previously saved test results when running tests.
--cpu | The target CPU architecture to be used for the compilation of binaries during the build.
--discard_analysis_cache | Whether Bazel should discard the analysis cache right before execution starts.
--fuseless_output | Whether to generate intermediate output files in memory.
--jobs | The limit on the number of concurrent executors used during the build execution.
--keep_going | Whether to proceed as much as possible even in the face of errors.
--keep_state_after_build | Whether to keep incremental in-memory state after the build execution.
--runs_per_test | The number of times each test should be executed.
--test_size_filters | Only run test targets with the specified size.
--use_action_cache | Whether to use Bazel's local action cache.
For example, the --discard_analysis_cache flag tells Bazel to discard the analysis cache for the previous build before the next build execution starts, so it affects the memory used by the build. The --keep_going flag tells Bazel to proceed even in the face of errors, and it may cause a failed build to use more memory compared to exiting the execution once a failure occurs. The --runs_per_test flag controls the number of times each test should be executed, so it can affect the number of concurrently used executors in the case of parallel execution. The --test_size_filters flag filters a subset of test targets to execute and thus can affect both the memory and occupancy usage of a build. In general, the important features selected based on the mutual information indeed affect the memory and occupancy usage, intuitively. There are more important flags, but we do not describe them for brevity.

In addition to the basic important features in Table III, we also generate synthetic features to improve the model accuracy. For example, the target and package counts are discretized into quantile buckets. Specifically, we can split the range of target counts by their median, so that half of the target counts fall into the first bucket and the other half into the second bucket. This strategy converts a numeric feature into a categorical feature. Another example is to split each target path into multiple fragments based on the path delimiter. Specifically, we generate the target prefix splits feature from a target //a/b/c:t by splitting its path into //a/b/c:t, //a/b/c, //a/b and //a. This helps the model to learn the memory or occupancy usage patterns for targets under different directories.

All basic and synthetic categorical features are used to generate feature crosses (Section II-E). We use an off-the-shelf blackbox optimization tool similar to AutoML [9] to generate a set of candidate feature crosses. The tool trains the model with a small set of steps to see which feature crosses give the best performance. All feature crosses are generated from the basic and synthetic categorical features, and we limit the max number of feature crosses to 6. Finally, the tool returns the best feature crosses for the models. We run the feature cross search separately for the memory and occupancy models.

C. Model Training
Once all features are finalized, they will be used to train the memory and occupancy models. We choose to train a regression model because it fits well with the binary search described in Algorithm 1. Another reason not to train a binary classification model is that the model may perform badly when only few OOM builds are present in the training data.

We want our models to be monotonic over all targets related features. This property makes sure that adding new targets to the build with the same flags always results in an increased memory estimate. If the monotonicity property does not hold, then the binary search may not work as expected. We achieve the monotonicity property by setting the regularization term to infinity when the weights of the targets related features are negative. Specifically, we use the loss function as follows:

$$L(\boldsymbol{\beta}, \mathbf{x}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + R(\boldsymbol{\beta})$$

$$R(\boldsymbol{\beta}) = \sum_{i=1}^{n} r_i(\beta_i), \qquad r_i(\beta_i) = \begin{cases} \lambda_i^- |\beta_i| & \beta_i < 0 \\ \lambda_i^+ |\beta_i| & \beta_i \geq 0 \end{cases}, \qquad \lambda_i^-, \lambda_i^+ \geq 0$$

where $L(\boldsymbol{\beta}, \mathbf{x}, \mathbf{y})$ is the mean squared error (MSE) loss function and $R(\boldsymbol{\beta})$ is the regularization term. $\lambda_i^-$ and $\lambda_i^+$ are the L1 regularization terms for the weight $\beta_i$ when $\beta_i < 0$ and $\beta_i \geq 0$, respectively. We set $\lambda_i^- = \infty$ for the weights of the targets related features. This causes all the weights of the targets related features to be non-negative, thus making the model monotonic over all targets related features.

We train the memory and occupancy models using the last 17 days of all builds that are executed in the build service system. This means that the training and testing datasets include builds that are not created by BTBS. 17 is chosen to cover at least 2 weeks of weekday data and be able to handle holiday weekends. However, using a long period of training data may cause the memory model to only slowly capture the new memory patterns of recent builds.
For example, it is possible that a memory usage regression is introduced into Bazel, which causes otherwise identical builds to use more memory than before. The memory model would not be able to capture the memory usage increase until a couple of days later, because a majority of the builds in the training window still use less memory. To solve this problem, we train another memory model that uses data from only the most recent day, and BTBS uses the maximum of the two models' memory estimates when performing the binary search. All the models are continuously trained and pushed to production.

V. EVALUATION
A. Production Setup
BTBS is deployed geographically at 3 locations in the US and each location has 5 running jobs that serve traffic from all over the world. This distributed setting avoids a single point of failure. Our experiment lasts from 2020/01/16 to 2020/02/19 (35 days). We report the performance of BTBS using the production data. During this time, BTBS received 51 million EnqueueTargets RPC calls (1.47 million daily on average) and created 102 million builds (2.92 million daily on average). The maxTargetsPerBatch is set to 900 in Algorithm 1. The memory cutoff value is set to 7GB for high priority builds, 9GB for medium priority builds and 10GB for low priority builds. The reason is that the memory model is not 100% accurate and the number of OOM builds would increase if we set the cutoff close to the allocated memory (13GB). We set a lower memory cutoff for high priority builds because we cannot afford to have human users wait for OOM failure retries. In contrast, we can tolerate more low priority OOM builds, e.g. builds that collect code coverage and do not block developers. The occupancy cutoff value is set to 500 ESU, which is smaller than the max allowed concurrent executors limit (600). If the memory or occupancy model returns errors, then we fall back to a default batch size of 300.

All regularization terms \(\lambda_i^{+}, i = 1 \dots n\) for positive weights are set to 7000. All the learning parameters, e.g. learning rate and batch size, are set to their default values. During the experiment, the build service system ran 17.8 million builds on average each day, and 34.8% of the builds were Bazel queries that do not use much memory or any executor. So we excluded those builds and used the remaining 11.6 million builds as daily training examples.

In the rest of this section, 1k represents 1 thousand and 1m represents 1 million. The error is calculated as the difference between the actual and estimated memory usage, i.e. \((y - \hat{y})\).

B. BTBS Performance
Table IV shows some metrics that measure the performance of BTBS. QPS and Latency show the queries per second and latency in milliseconds of the EnqueueTargets streaming RPC over the past 24 hours during the entire experiment span. StreamTC shows the total target count per RPC to BTBS. StreamBC shows the total generated build count per RPC to BTBS. BuildTC, ExecT, Memory and Occupancy show the target count, execution time in seconds, memory usage in gigabytes and occupancy usage in ESU of individual builds created by BTBS.

TABLE IV: BTBS performance

Metrics     Avg       Min    Median    Max
QPS         6-27      0-12   4-24      18-89
Latency     265-414   1-81   150-179   69k-886k
StreamTC    674       1      2         70m
StreamBC    2         1      1         84k
BuildTC     339       1      24        900
ExecT       394       0.1    250       7923
Memory      3.87      0.04   2.82      13.26
Occupancy   18.3      0      2.8       2066.4

BTBS is used heavily in production and the average QPS over the past 24 hours ranges from 6 to 27. The min and max QPS over the past 24 hours range from 0 to 12 and from 18 to 89, respectively. BTBS receives less traffic on weekends and off the peak hours, and more traffic on weekdays during the peak hours. The batching algorithm is efficient and the average latency over the past 24 hours ranges from 265 to 414 milliseconds. The min and max latency over the past 24 hours range from 1 to 81 milliseconds and from 69k to 886k milliseconds, respectively. The latency of BTBS is very small if the clients enqueue fewer than hundreds of targets, but it can go up to minutes if the clients enqueue millions of targets. The average, min, median and max number of targets received by BTBS per RPC are 674, 1, 2 and 70m, respectively. A majority of the EnqueueTargets requests only contain 1-2 targets and they are sent by tools like the presubmit failure rerunner, culprit finder, flaky test detector, etc. The postsubmit service can send millions of targets to BTBS because it needs to make sure that all tests pass at a given revision. The average, min, median and max number of builds created by BTBS per RPC are 2, 1, 1 and 84k, respectively. This is consistent with the number of targets received by BTBS, because BTBS creates more builds as it receives more targets per RPC. Since a majority of the EnqueueTargets requests only contain 1-2 targets, BTBS only needs to create a single build most of the time. The average, min, median and max number of targets per build created by BTBS are 339, 1, 24 and 900, respectively. Note that the number of targets per build is bounded by maxTargetsPerBatch. The average, min, median and max execution time of builds created by BTBS are 394, 0.1, 250 and 7923 seconds, respectively. This indicates that most builds run in a couple of seconds to minutes, but some builds can run for 1-2 hours. A majority of builds have a deadline of 1.5 hours and will expire if they do not finish by the deadline. The average, min, median and max memory usage of builds created by BTBS are 3.87, 0.04, 2.82 and 13.26 GB, respectively. This shows that many builds take less than a few GB of memory but some of them can still use up all the allocated memory. The average, min, median and max occupancy of builds created by BTBS are 18.3, 0, 2.8 and 2066.4 ESU, respectively. The max occupancy usage is greater than the max allowed concurrent executors because some builds use more executor memory. This shows that a majority of builds only use a few concurrent executors or little executor memory, but some builds can use thousands of ESU. It is worth mentioning that some builds hit the cached actions, so they have 0 occupancy usage.
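Putting the production parameters together, the per-batch binary search of Algorithm 1 can be sketched as follows. This is a simplified, hypothetical rendering (the function and constant names are ours, and the real service also tracks batch size reasons and streams batches as targets arrive); it assumes only what the text states: monotonic model estimates, per-priority memory cutoffs, the 500 ESU occupancy cutoff, and the fallback batch size of 300 on model errors.

```python
# Production parameters described in the text.
MAX_TARGETS_PER_BATCH = 900
OCCUPANCY_CUTOFF_ESU = 500
FALLBACK_BATCH_SIZE = 300
MEMORY_CUTOFF_GB = {"high": 7, "medium": 9, "low": 10}

def next_batch_size(targets, priority, estimate_memory, estimate_occupancy):
    """Binary-search the largest prefix of `targets` whose estimated
    memory and occupancy stay under the cutoffs. Because the models are
    monotonic in the target features, shrinking the prefix can only
    lower the estimates, so binary search is valid."""
    lo, hi = 1, min(len(targets), MAX_TARGETS_PER_BATCH)
    best = 1  # a single remaining target always forms a build
    try:
        while lo <= hi:
            mid = (lo + hi) // 2
            batch = targets[:mid]
            if (estimate_memory(batch) <= MEMORY_CUTOFF_GB[priority]
                    and estimate_occupancy(batch) <= OCCUPANCY_CUTOFF_ESU):
                best, lo = mid, mid + 1   # fits: try a bigger batch
            else:
                hi = mid - 1              # too big: shrink
    except Exception:
        # Model error: fall back to the default batch size.
        return min(len(targets), FALLBACK_BATCH_SIZE)
    return best

# Toy monotonic cost models: each target costs 0.05GB and 2 ESU.
targets = list(range(2000))
size = next_batch_size(targets, "high",
                       lambda b: len(b) / 20.0, lambda b: 2.0 * len(b))
```

With these toy cost models the high-priority memory cutoff (7GB at 0.05GB per target) binds first, so the search settles on a 140-target batch rather than the 900-target maximum.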
C. Model Impact
Table V shows the build count distribution by batch size reasons. Build shows the number of builds. OOM shows the number of OOM builds and its percentage out of all builds in each batch size reason. DE shows the number of DE builds and its percentage out of all builds in each batch size reason. %Build shows the percentage of builds in each batch size reason out of all builds. %OOM shows the percentage of OOM builds in each batch size reason out of all OOM builds. %DE shows the percentage of DE builds in each batch size reason out of all DE builds.

Builds with batch size reasons ONLY_ONE_TARGET, MAX_TARGETS, ALL_REMAINING_TARGETS, MAX_MEMORY, MAX_OCCUPANCY, MEMORY_ESTIMATE_ERROR and OCCUPANCY_ESTIMATE_ERROR account for 21.12%, 32.95%, 33.06%, 12.71%, 0.06%, 0.10% and 0.00% of the total builds, respectively. The memory model is used in all builds with batch size reasons MAX_TARGETS, ALL_REMAINING_TARGETS, MAX_MEMORY, MAX_OCCUPANCY and OCCUPANCY_ESTIMATE_ERROR, which is 78.78% of all builds. The occupancy model is used in all builds with batch size reasons MAX_TARGETS, ALL_REMAINING_TARGETS, MAX_MEMORY and MAX_OCCUPANCY, which is 78.78% of all builds. ONLY_ONE_TARGET builds use neither the memory model nor the occupancy model, because BTBS always creates a build if a single target is left in the remaining target list. MEMORY_ESTIMATE_ERROR and OCCUPANCY_ESTIMATE_ERROR builds may use the memory and occupancy models, respectively, in the initial binary search iterations before the failure. So both the memory and occupancy models are heavily used in BTBS.
MAX_MEMORY builds provide an indicator of the potential number of OOM builds if BTBS simply created builds with maxTargetsPerBatch targets. It shows that 0.36% of MAX_MEMORY builds still run out of memory and they account for 58.75% of all out of memory builds. This is expected because the MAX_MEMORY builds should use more memory than other builds. The MEMORY_ESTIMATE_ERROR builds approximately indicate how BTBS would behave without the memory model. It shows that 0.93% of MEMORY_ESTIMATE_ERROR builds run out of memory. If the MAX_MEMORY builds had the same out of memory rate as the MEMORY_ESTIMATE_ERROR builds, we would have had 74k more out of memory builds during the experiment. In practice, the memory model is very important because an OOM build is very expensive to retry and it blocks the developer's productivity. The memory model is useful in limiting the memory usage while maximizing the batch density. BTBS used a small maxTargetsPerBatch before the memory model existed; the problem was that it generated too many builds and used up almost all Bazel workers, which blocked other builds from running.
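The 74k figure follows directly from Table V: apply the MEMORY_ESTIMATE_ERROR OOM rate (0.93%) to the 13.0m MAX_MEMORY builds and subtract the 46.9k OOMs they actually had.

```python
max_memory_builds = 13.0e6   # MAX_MEMORY build count (Table V)
actual_oom = 46.9e3          # OOM builds among MAX_MEMORY builds (Table V)
no_model_rate = 0.0093       # OOM rate of MEMORY_ESTIMATE_ERROR builds
extra_oom = max_memory_builds * no_model_rate - actual_oom  # ~74k more OOMs
```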
MAX_OCCUPANCY builds provide an indicator of the potential number of Type I DE builds if BTBS simply created builds with maxTargetsPerBatch targets. It shows that 1.4% of MAX_OCCUPANCY builds still exceed their deadlines and they account for 1.84% of all deadline exceeded builds. Interestingly, a majority of DE builds are not MAX_OCCUPANCY builds. This indicates that most DE builds are Type II DE builds. Out of all DE builds, 98.68% use less than or equal to 600 ESU and only 1.32% use more than 600 ESU. This confirms that Type II DE builds dominate in the current setup. Unfortunately, we only have 716 OCCUPANCY_ESTIMATE_ERROR builds and none of them exceeds the deadline, so it is hard to quantitatively measure how many more deadline exceeded builds we would have without the occupancy model. In practice, we used to have more DE builds before the occupancy model existed, so we believe that the occupancy model is useful to reduce Type I DE errors.

Table VI shows the build target count (BuildTC), execution time in seconds (ExecT), memory usage in GB (Memory) and occupancy usage in ESU (Occup), on average and at the median, for each batch size reason.

TABLE V: Build count distribution by batch size reasons

Batch Size Reason          Build    OOM (%)        DE (%)         %Build   %OOM    %DE
ONLY_ONE_TARGET            21.6m    22.5k (0.10)   6.6k (0.03)    21.12    28.13   14.22
MAX_TARGETS                33.7m    1.9k (0.01)    6.8k (0.02)    32.95    2.34    14.59
ALL_REMAINING_TARGETS      33.8m    7.6k (0.02)    21.0k (0.06)   33.06    9.52    45.17
MAX_MEMORY                 13.0m    46.9k (0.36)   11.2k (0.09)   12.71    58.75   24.13
MAX_OCCUPANCY              59.3k    35 (0.06)      855 (1.4)      0.06     0.04    1.84
MEMORY_ESTIMATE_ERROR      104.3k   967 (0.93)     25 (0.02)      0.10     1.21    0.05
OCCUPANCY_ESTIMATE_ERROR   716      0 (0.00)       0 (0.00)       0.00     0.00    0.00
Sum

TABLE VI: Average and median build statistics by batch size reason

                           Avg                              Median
Batch Size Reason          BuildTC  ExecT  Memory  Occup    BuildTC  ExecT  Memory  Occup
ONLY_ONE_TARGET            1        241    2.2     2.0      1        92     1.6     0.3
MAX_TARGETS                900      371    4.0     33.4     900      275    3.7     18.8
ALL_REMAINING_TARGETS      67       403    3.2     13.2     5        227    2.4     2.5
MAX_MEMORY                 154      679    8.1     15.2     9        574    8.3     2.6
MAX_OCCUPANCY              506      1424   5.5     471.9    555      1301   5.6     521.6
MEMORY_ESTIMATE_ERROR      239      443    3.0     21.0     300      306    1.8     11.9
OCCUPANCY_ESTIMATE_ERROR   222      391    2.2     19.6     300      273    1.6     11.3

Fig. 3: Memory model accuracy (GB). (a) RMSE by actual memory. (b) Build count by memory error. (c) Build count by actual memory.

Fig. 4: Occupancy model accuracy (ESU). (a) RMSE by actual occupancy. (b) Build count by occupancy error. (c) Build count by actual occupancy.

The MAX_MEMORY builds use 8.1GB and 8.3GB of memory on average and at the median, respectively. This indicates that MAX_MEMORY builds use more memory than other builds and that their usage is close to the memory cutoff. The MAX_OCCUPANCY builds use 471.9 and 521.6 occupancy on average and at the median, respectively. This indicates that MAX_OCCUPANCY builds use more occupancy than other builds and that their usage is close to the occupancy cutoff. Moreover, it takes 1424s and 1301s to run MAX_OCCUPANCY builds on average and at the median, respectively. This indicates that MAX_OCCUPANCY builds run longer than other builds and that the execution time is in general proportional to the occupancy usage. The result shows that the memory and occupancy models can restrict the memory and occupancy usage of the builds, thus reducing the OOM or DE errors.
ONLY_ONE_TARGET builds are single target builds, so they typically have the least execution time and memory/occupancy usage, as expected. All MAX_TARGETS builds have 900 targets, and BTBS cannot generate builds with more than maxTargetsPerBatch targets. So MAX_TARGETS builds typically have higher execution time and memory/occupancy usage.
ALL_REMAINING_TARGETS builds are generated at the end of Algorithm 1, so these builds typically have few targets and use little memory and occupancy. The average build target count for the MEMORY_ESTIMATE_ERROR and OCCUPANCY_ESTIMATE_ERROR builds is less than 300 because the initial binary search iterations may have already cut the batch size to less than 300. Both MEMORY_ESTIMATE_ERROR and OCCUPANCY_ESTIMATE_ERROR builds are rare and they use little memory and occupancy.

D. Model Accuracy
Figure 3a shows the root mean square error (RMSE) of the memory model grouped by actual memory usage (0GB to 13GB with a step of 0.1GB). The graph is broken down by whether the memory model overestimates (blue line), underestimates (red line) or accurately predicts (yellow line) the actual memory usage. It also shows the overall RMSE (green line). We can see that the memory model heavily overestimates builds which use very little memory. A common reason is that some targets may finish without garbage collection (GC), so their recorded memory usage is much larger than it would be after GC. The memory model learns from builds without GC that some targets have large memory usage, so it often overestimates the memory usage of builds with the same targets that finish after GC. The model also overestimates builds which use 7-8GB of memory. Many of these builds are single target builds, and the model gives large memory estimates for all of them. In general, the model tends to have a larger error in overestimation than in underestimation. The error tends to go up for larger memory builds. Figure 3b shows the number of builds grouped by error value. The graph is broken down by build priority. It shows that most builds, regardless of their priorities, have an error close to 0, which means that the model performs well. Figure 3c shows the number of MAX_MEMORY builds grouped by their actual memory usage. We can see that the memory usage of high, medium and low priority builds peaks at around 7GB, 9GB and 10GB, respectively. This is consistent with our setup.

Figure 4a shows the RMSE of the occupancy model grouped by actual occupancy usage (0 ESU to 500 ESU with a step of 1 ESU). The occupancy error increases for larger occupancy builds. The overestimation error increases slowly and then decreases until it diminishes at 500. The model only underestimates builds that use more than 500 ESU (not shown in the graph). The underestimation error increases for larger occupancy builds. The overall RMSE increases as the actual build occupancy usage increases. This is expected because most builds use <100 ESU and the model does not have much data to learn from larger occupancy builds. Figure 4b shows the number of builds grouped by occupancy error value. It shows that most builds have an error close to 0, which again indicates that the model performs well. Figure 4c shows the number of MAX_OCCUPANCY builds grouped by their actual occupancy usage. We can see that the occupancy usage peaks at both 0 and around 500 ESU. The peak at 0 ESU is caused by many builds hitting the action cache in the executor cluster [14]; these builds do not occupy any executor. The peak at around 500 ESU is consistent with our setup.

VI. RELATED WORK
We did not find any related work on using machine learning models to construct builds that use limited resources in order to reduce OOM and DE errors. This problem is quite common at Google and we expect it to be common in other companies that have a monolithic code repository and many build targets. Our experience is that a service like BTBS becomes more important as the need to build a large number of targets increases. We believe that this work provides insight for developers who have a similar need, e.g. Bazel [1] and Buck [2] users.

Google's TAP service [11] mentions that test executions can fail due to OOM errors, but it does not give a solution. BTBS is designed exactly to solve that problem. In fact, the TAP service is one of the BTBS users.

The build service system [14] describes how builds are executed remotely at Google. BTBS is one of the build service system's users. It is designed to work around the constraints, i.e. the limited memory allocation and the max allowed concurrent executors, imposed by the build service system, which is not capable of building a very large number of targets.

VII. CONCLUSIONS
In this paper, we describe the first build target batching service (BTBS), which is able to partition a large stream of targets into a set of target batches such that the builds created from those target batches do not have OOM or DE errors. We discuss the batching algorithm as well as the machine learning models used in BTBS. Overall, we show that BTBS achieves a low OOM rate (0.08%) and a low DE rate (0.05%). We believe that BTBS introduces useful insights and can help in designing new target batching systems.

REFERENCES

[1] Bazel build tool. https://bazel.build/.
[2] Buck build tool. https://buck.build/.
[3] Feature cross. https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture/.
[4] Google Guava. https://github.com/google/guava/.
[5] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010.
[6] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. ACM SIGPLAN Notices, 2010.
[7] P. M. Duvall, S. Matyas, and A. Glover. Continuous Integration: Improving Software Quality and Reducing Risk. Pearson Education, 2007.
[8] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, 2003.
[9] F. Hutter, L. Kotthoff, and J. Vanschoren. Automated Machine Learning. 2019.
[10] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 2004.
[11] A. Memon, Z. Gao, B. Nguyen, S. Dhanda, E. Nickell, R. Siemborski, and J. Micco. Taming Google-scale continuous testing. In ICSE-SEIP, 2017.
[12] D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis. 2012.
[13] B. C. Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 2014.
[14] K. Wang, G. Tener, V. Gullapalli, X. Huang, A. Gad, and D. Rall. Scalable Build Service System with Smart Scheduling Service. In