design/one-pager-render-engine.md
The crossplane render command has become an important testing tool in the
Crossplane ecosystem. It lets you run a Composition's function pipeline locally
and see what composed resources the reconciler would produce, without a cluster.
Downstream tools build on it: the up CLI uses render for up project render
and up project test, and crossplane-diff uses it to preview changes in CI.
All of these tools depend on the same cmd/crank/render package, which
reimplements the XR reconciler's composition pipeline. Render shares the
mechanics of calling functions with the real reconciler, but everything the
reconciler does before and after the pipeline runs is a parallel implementation.
How composed resource metadata is rendered, how names are generated and
validated, how readiness and conditions are derived: all reimplemented.
Any time someone changes the reconciler, they must remember to update render too. History suggests this doesn't always happen. For example, when name validation was added to the reconciler to reject composed resources with invalid RFC 1123 names, render didn't get the same check. A user could render successfully, deploy with confidence, and then see failures in production.
I want render to be a high-fidelity digital twin of the real reconciler. This means:
crossplane render should support something like
--crossplane-version=v2.2.1, fetching that version's render logic to test
against exactly the reconciler behavior from that release.It's not a goal to replace the existing crossplane render CLI UX. The existing
CLI handles Docker container management, file parsing, and output formatting.
Those stay where they are. I'm proposing a lower-level engine that the existing
CLI (and other tools) can build on.
It's also not a goal to auto-iterate through the reconcile loop, stepping through successive reconciles like a debugger. That said, this design makes it more feasible. The engine runs one reconcile loop per invocation. A harness could call it repeatedly, feeding the composed resources from one loop back as observed resources for the next.
I propose adding a hidden crossplane internal render subcommand to the
Crossplane binary. It runs one real Reconcile() call, the exact same code
path as the production reconciler, but backed by a fake in-memory client instead
of a real API server.
The fake client is simple. It satisfies reads from an in-memory store populated with the caller's input (the XR, Composition, observed resources, secrets), and captures writes (composed resources the reconciler would apply, resources it would delete, status it would set). The reconciler doesn't know it's not talking to a real API server. Every code path (pipeline execution, metadata rendering, name generation, garbage collection, condition derivation) is the real production code.
The subcommand reads a protobuf RenderRequest from stdin and writes a protobuf
RenderResponse to stdout:
cat request.pb | crossplane internal render > response.pb
The expected usage is that tools like crossplane render, up project render,
and crossplane-diff shell out to this subcommand rather than maintaining their
own render implementations. The caller handles UX concerns (file parsing, Docker
container management, output formatting) and delegates the actual reconcile to
the Crossplane binary via stdin/stdout.
The schema is defined in proto/render/v1alpha1/render.proto. The request uses a
protobuf oneof to select the resource type to render:
message RenderRequest {
RequestMeta meta = 1;
oneof input {
CompositeInput composite = 2;
OperationInput operation = 3;
CronOperationInput cron_operation = 4;
WatchOperationInput watch_operation = 5;
}
}
Each input variant contains the resource to render plus any supporting resources
(functions, observed resources, credentials). Kubernetes resources are
represented as google.protobuf.Struct, the same approach used by the function
protocol. For example, CompositeInput contains the XR, Composition, function
gRPC addresses, and optionally observed resources, credentials, and context.
The response mirrors the request with a corresponding output variant:
message RenderResponse {
ResponseMeta meta = 1;
oneof output {
CompositeOutput composite = 2;
OperationOutput operation = 3;
CronOperationOutput cron_operation = 4;
WatchOperationOutput watch_operation = 5;
}
}
Using protobuf gives us a well-defined, self-documenting schema that can generate client types in any language. If we want to expose this as a gRPC service in future, adding a service definition to the existing proto is trivial.
Function runtime management (Docker containers, development servers) is the caller's responsibility. The render engine accepts gRPC addresses, not container images.
Because the render engine lives in the Crossplane binary, callers can fetch a
specific version's binary (or Docker image) to get exactly that version's render
logic. A tool like crossplane render could support
--crossplane-version=v2.2.1 by pulling the v2.2.1 image and shelling out to
it. This guarantees the render output matches what that specific version of
Crossplane would produce in production. This is impossible when the render logic
is a separate reimplementation.
The architecturally pure approach: restructure the reconciler into a "render" phase (determine what to do) and an "apply" phase (do it). Both the production reconciler and the render engine would call the same render function. The production reconciler would additionally call apply.
I explored this in detail. The problem is that the boundary between computation
and I/O in the reconciler isn't clean. Apply results feed back into conditions.
When a composed resource fails to apply with an IsInvalid error, it's marked
as unsynced, which affects the XR's Synced condition. Credentials are fetched
per-pipeline-step during function execution. The result is a three-phase model
(render, apply, finalize conditions) with events and per-resource state spanning
all three phases.
More importantly, forcing the reconciler to maintain a clean computation/I/O boundary constrains its design. The production reconciler is the important thing. Its structure should be driven by what makes it correct, performant, and maintainable as a controller, not by what makes it easy to split into phases for an offline render tool. The fake client approach inverts this priority: the render engine adapts to whatever the reconciler does, and the reconciler stays free to evolve without considering render at all.
The fake client approach is also more robust to reconciler evolution. Any future change that interleaves I/O and computation in new ways would break the render/apply boundary. The fake client approach handles it automatically. New client calls get fake responses, and new logic runs on those fake responses.
Rather than shelling out to a binary, downstream tools could import the
reconciler (or a render wrapper around it) as a Go library. This is essentially
what the current cmd/crank/render package is: a library that downstream tools
like up and crossplane diff import.
I'd prefer not to do this. It would mean taking on Hyrum's law style
compatibility obligations for the reconciler's internal API. Today the
reconciler lives in crossplane/crossplane and its internals can change freely.
Exposing it as a library would constrain that freedom. It would also suggest
moving render-related code to crossplane-runtime, where shared libraries
conventionally live. That means two repos to update for every reconciler change.
The stdin/stdout binary interface avoids this entirely. Downstream tools depend on a stable protobuf envelope, not on Go types. It also enables version pinning, which is impossible when the render logic is compiled into the caller.
We could expose the render engine as a gRPC server, either directly or via HashiCorp's go-plugin framework which adds version negotiation and process lifecycle management. We already spin up functions in Docker containers and call them via gRPC, so the machinery exists.
The difference is that functions are gRPC servers because they run as long-lived services in production. It makes sense for their local development mode to match the production protocol. The render engine has no production deployment. It's only ever a plugin for other CLIs, spun up in a Docker container to handle one request and exit.
A persistent gRPC server would be faster for batch use cases like crossplane-diff rendering many resources in a row, since it would avoid per-call container startup overhead. But it's unlikely to be fast enough to matter for the kinds of UXes we imagine building around this. The cost is server lifecycle management (startup, health checking, graceful shutdown) that every caller has to orchestrate.
Protobuf over stdin gives us the schema benefits (generated types, self-documenting contract, language-agnostic) without the server lifecycle complexity. If we ever need a long-lived server, adding a gRPC service definition to the existing proto is a one-line change.