docs-v2/design_proposals/concurrent-bazel-builds.md
Bazel supports high levels of concurrency in its action graph model; however, Skaffold is not currently able to
take advantage of that concurrency to build multiple targets because it is modeled around a separate builder
invocation per artifact. This does not play nicely with Bazel's workspace lock model; invoking multiple builds
in parallel as separate processes forces them to serialize. This effectively renders Skaffold's --build-concurrency
null and void (see https://github.com/GoogleContainerTools/skaffold/issues/6047), and results in longer overall
build times than if multiple target's action execution could interleave.
This document proposes changes to Skaffold which will enable multiple targets to be build in a single Bazel invocation.
As part of this design, I considered three places to introduce the concept of multi-artifact builds:
builder_mux, group builds before scheduling.BuildMultiple method,
and wrap that in a BatchingBuilder which can queue up multiple incoming builds before sending groups into BuildMultiple.bazel/build.go.Ultimately, this design proposes #3.
builder_mux level results in a
ton of method forking (into multi-artifact variants); it also introduces significant scheduling complexity (to avoid
cycles introduced by batching build targets). This complexity did not feel warranted for a feature that is likely
to only be used by Bazel.artifactBuilder implementation.
We only want to group up the actual bazel build invocation, not the subsequent docker.Push/b.loadImage which
is also part of the Build implementation. Batching at the Build implementation level would (a) reduces the speed gains, because we would serialize the loads/pushes,
and (b) adds implementation complexity, because we need to handle BatchingBuilders with different artifactBuilder
constructor parameters (like loadImages).This design proposes implementing build batching for Bazel builds by altering the Bazel Builder to build multiple JARs, and then multiplexing contemporaneous builds (e.g. waiting for a period of time for builds targets to "collect" and then building them all together).
--build_concurrency flag, which limits how many builds will be scheduled contemporaneously.This is an in-place change; we replace buildTar with buildTars, which builds up a list of buildTargets.
A computeBatchKey function which returns the same value if two different builds can be grouped together. Elements of
the Bazel batch key include:
A BatchingBuilder wraps an underlying multi-artifact build function (the buildTars function), and maintains a map of
non-started MultiArtifactBuild objects (per batch key).
type batchingBuilder struct {
builder multiJarBuilder
batchSize time.Duration
builds map[string]*MultiArtifactBuild
mu sync.Mutex
}
type MultiArtifactBuild struct {
ctx context.Context
artifacts []*latest.Artifact
platforms platform.Matcher
workspace string
mu sync.Mutex
doneCond chan bool
running bool
out io.Writer
digest map[*latest.BazelArtifact]string
err error
}
When it is asked to build an artifact, it first checks to see if there is an existing build for the batch key.
artifacts list.batchSize before
invoking its multiJarBuilder.Given our choice in where to break the "one artifact per build" abstraction, this design presents some rough edges around logging and metrics that are worth acknowledging:
OutputWriter is actually used (the first one to create the MultiArtifactBuild object).
<Should we include either of the configuration knobs listed in the Design Overview>
Resolution: Not Yet Resolved
I believe this change is well-contained enough to be implemented in a single PR, please let me know what you think. You can find a functional proof-of-concept here: https://github.com/GoogleContainerTools/skaffold/compare/main...sethnelson-tecton:skaffold:sethn/parallel-builds
New test cases (in addition to status quo test cases of unbatched builds):