rfc/20250728-execution-architecture.md
As with many codebases of this size, age, and popularity, OpenTofu has arguably outgrown its initial design. Further additions within the current architecture have it groaning at the seams. This can be quantified both by the time it takes to add complex new features and by how long it takes to properly onboard a qualified developer (months).
This RFC discusses a proposed future architecture which would address that primary concern of outgrowing the current design, while keeping an eye on both performance and potential feature additions.
We do not propose an exact path to achieve this goal, but instead expect that multiple smaller RFCs will follow to help describe progress toward this shared vision.
OpenTofu performs "operations" such as "validate", "plan", and "apply" on a given configuration and state. Each of these operations follows the general pattern of iterating through the configuration and the state to produce an output state and error messages, where the output state should match what was specified by the configuration as closely as possible.
The configuration contains elements which either represent desired entities in the state file or are intermediate values used to link other elements together.
In a given action, OpenTofu must answer the following questions:
OpenTofu builds a graph from the configuration and state to answer the questions above. Once the graph has been built, it walks through the graph to produce the resulting state and messages. The documentation contains a more in-depth write-up and should be used as a reference: https://github.com/opentofu/opentofu/blob/main/docs/architecture.md.
The graph is made up of nodes which each correspond to an element within the configuration, state, or both. Each node is able to express which dependencies it has on other nodes, as well as the action it will take during the graph walk. These dependencies are used to build the edges of the graph.
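As a rough sketch of the concept (not OpenTofu's actual implementation; all names here are invented for illustration), nodes declare dependencies, dependencies become edges, and the walk visits nodes in dependency order:

```go
package main

import "fmt"

// node is a hypothetical stand-in for a graph node: a config/state
// element that knows which other elements it depends on.
type node struct {
	name string
	deps []string // names of nodes this node depends on
}

// walkOrder returns a dependency-respecting visit order via a
// depth-first topological sort. Cycle detection is omitted for brevity.
func walkOrder(nodes map[string]node) []string {
	var order []string
	visited := map[string]bool{}
	var visit func(name string)
	visit = func(name string) {
		if visited[name] {
			return
		}
		visited[name] = true
		for _, dep := range nodes[name].deps {
			visit(dep) // dependencies are visited (executed) first
		}
		order = append(order, name)
	}
	for name := range nodes {
		visit(name)
	}
	return order
}

func main() {
	nodes := map[string]node{
		"provider.aws": {name: "provider.aws"},
		"aws_vpc.main": {name: "aws_vpc.main", deps: []string{"provider.aws"}},
		"aws_subnet.a": {name: "aws_subnet.a", deps: []string{"aws_vpc.main"}},
	}
	fmt.Println(walkOrder(nodes))
}
```

The real graph walk is concurrent and far more involved, but the invariant is the same: a node executes only after everything it depends on.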
This describes a relatively simple process, but it has grown in complexity over time as requirements have become more demanding.
In the earliest usable versions of this software, the relationship between the static configuration structure and the dynamic execution graph was relatively straightforward: resource blocks supported count, but only with a constant value not derived from anything else, and nothing else in the language supported any sort of dynamic repetition at all.
These requirements grew over time: first to support resource count referring to dynamic data from elsewhere in the configuration, and later to support module-level repetition, which made it necessary for everything that can be declared in a module to support multi-instancing. This made the initial static execution graph a very poor approximation of the actual execution behavior.
Additional layers have therefore been added over time to deal with that complexity outside of the dependency graph, but the execution model is still dominated by the naive graph traversal, so the resulting logic is very hard to follow and maintain correctly.
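A minimal illustration of why dynamic expansion breaks a static graph (types and names are invented for this sketch): the graph can only contain one node per resource block, but the actual instances only become known when count or for_each is evaluated during the walk:

```go
package main

import "fmt"

// resourceNode is a hypothetical static graph node for one resource
// block. At graph-construction time the instance count is unknown.
type resourceNode struct {
	addr  string
	count func() int // count expression, only evaluable during the walk
}

// expand produces concrete instance addresses once count can be
// evaluated, loosely mirroring the role that instances.Expander
// plays in the real codebase.
func (n resourceNode) expand() []string {
	addrs := make([]string, 0, n.count())
	for i := 0; i < n.count(); i++ {
		addrs = append(addrs, fmt.Sprintf("%s[%d]", n.addr, i))
	}
	return addrs
}

func main() {
	// The static graph sees a single node...
	n := resourceNode{
		addr: "aws_instance.web",
		// ...but the instance count depends on a value computed
		// elsewhere, so it can only be resolved mid-walk.
		count: func() int { return 3 },
	}
	fmt.Println(n.expand())
	// → [aws_instance.web[0] aws_instance.web[1] aws_instance.web[2]]
}
```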
Significant changes related to this (included just as a historical overview of how things evolved; no need to read all of the details!):
- The initial `count` feature.
- Support for `count` to refer to other symbols (but only those that can be evaluated during graph construction).
- A change so that `count` could be handled dynamically during the graph walk, rather than during graph construction.
- Support for `count` to refer to known resource instance attributes, effectively introducing the possibility of dynamic failure if `count` refers to something unknown.
- Separation of `count` evaluation and planning logic.
- Use of `hcl.Expression.Variables` to discover what other objects an expression refers to as part of building the dependency graph, and with it the need for roughly everything to have access to provider schemas.
- `for_each` as a new kind of repetition for resources.
- `instances.Expander`, which generalizes the handling of `count` and `for_each` in preparation for module-level repetition.
- `count` and `for_each` on modules, which made effectively everything need to support dynamic expansion.
- `for_each` for provider configurations, which was implemented primarily as a static eval feature due to the complexity of retrofitting this as a dynamic expansion on top of all of the above. (Discussion about the abandoned dynamic alternative in opentofu/opentofu#2088.)

One of OpenTofu's most important responsibilities is making sure that operations happen in an appropriate order based on the dependencies declared in the configuration. However, that is far easier said than done, in large part because there is no single "appropriate order" that applies in all cases.
The system was originally quite naive: it simply visited each resource in an order decided by the references between them, executing a "destroy" operation if necessary and then a "create" or "update" operation if necessary. But that is already wrong, because destroy operations need to happen in reverse order so that when B depends on A, the system doesn't try to destroy A before B has been destroyed.
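The ordering inversion can be stated compactly: if the create order is a topological sort of the dependency graph, the destroy order is that same sort reversed. A sketch with invented names:

```go
package main

import "fmt"

// createOrder is a dependency-respecting order: each element is
// created only after everything it depends on already exists.
var createOrder = []string{"network", "subnet", "server"}

// destroyOrder reverses a create order, so that when B depends on A,
// B is destroyed before A.
func destroyOrder(create []string) []string {
	out := make([]string, len(create))
	for i, name := range create {
		out[len(create)-1-i] = name
	}
	return out
}

func main() {
	fmt.Println(destroyOrder(createOrder)) // → [server subnet network]
}
```

Mixing creates and destroys in a single run is what makes this hard in practice: a plain reversal only works when every action in the run is a destroy.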
So the first complication was in separating create/update from destroy so that they could have inverted dependency ordering. But that was only the beginning of the complexity:
Therefore the handling of the interaction between create/update and destroy actions has ballooned in complexity over time, and even the current implementation does not properly model all reasonable remote API designs. The implementation has become so complicated and so sprawled throughout the system that it is very challenging to evolve and very challenging to debug when someone reports an issue. Sometimes it's even hard to determine if a certain behavior is consistent with the intended design or not, because "the intended design" isn't a single coherent system but rather a set of interacting and sometimes-contradictory requirements, some of which are documented in OpenTofu Core Resource Destruction Notes.
Historical changes related to this:
- The initial `create_before_destroy` implementation, which treated it as a single-resource idea but did not properly handle transitive dependencies with other resources that don't set this flag, causing dependency cycles as described in hashicorp/terraform#2493.
- Automatically forcing `create_before_destroy` on anything else in a dependency chain: this made the system automatically handle the flaw where `create_before_destroy` had to be set consistently on everything in a dependency chain. It was initially motivated by working with data resources (since they don't have an explicit `create_before_destroy` argument to set), but applied to managed resources too. (This initial commit was not sufficient to fully solve the problem, and there were various followup fixes, but that's too much detail for this already-long list.)
- Tracking `create_before_destroy` in state: it initially seemed like create-before-destroy was a config-only problem, but because of the need to propagate up and down dependency chains it also interacted with "orphaned" resources that were only present in the state, and so needed to be tracked there too.

For a long time the system supported rebinding existing objects to different resource instance addresses only as a special "out-of-band" operation using the command `tofu state mv`, which meant that the main language runtime could generally assume that all objects would have a fixed resource instance address throughout an entire plan/apply round.
However, treating address rebinding as an out-of-band operation meant that it didn't fit well into the typical OpenTofu workflow: the configuration and state both need to change together for such a change to be effective, but `tofu state mv` only updates the state. Teams using it would need either to first update their shared configuration (in version control) and then run `tofu state mv`, or conversely to run `tofu state mv` and then update the shared configuration to match. In both cases this creates a window of time where the configuration and state are inconsistent, so anyone concurrently running a normal plan/apply round could cause the affected objects to be destroyed completely.
To help address that, today's OpenTofu includes language features for "refactoring" which work, in essence, by asking authors to describe configuration changes that would require state updates as part of the configuration, and then OpenTofu uses that extra information to propose to update the state to match during the next plan/apply round.
The language runtime fundamentally assumes that all objects have fixed resource instance addresses throughout a plan/apply round, though, so in practice those features were implemented as a special extra preprocessing step that modifies the state before plan graph construction begins. That preprocessing step needs to take dependencies into account itself, so it ends up dealing with several concerns that the language runtime also shares, yet very little code can be shared between the two.
Additionally, resource instance addresses include a dynamic instance key, yet the preprocessing step that handles the moves cannot know which instance keys each resource has in the desired state, because the desired state hasn't been created yet. The system therefore has a counter-intuitive design where the preprocessing step optimistically assumes that the instance keys in source addresses are valid, and then retroactively checks whether they actually were only after the planning phase has completed.
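The optimistic-then-retroactive shape described above can be sketched roughly as follows (all names are invented for illustration; this is not the real refactoring package API):

```go
package main

import "fmt"

// move is a hypothetical from→to address rebinding, e.g. from a moved block.
type move struct{ from, to string }

// applyMoves rewrites state addresses before planning, optimistically
// assuming every source address (including its instance key) is valid.
func applyMoves(state map[string]string, moves []move) map[string]string {
	out := map[string]string{}
	for addr, obj := range state {
		out[addr] = obj
	}
	for _, m := range moves {
		if obj, ok := out[m.from]; ok {
			delete(out, m.from)
			out[m.to] = obj
		}
	}
	return out
}

// validateMoves runs only after planning, once the desired instance keys
// are finally known, and reports moves that targeted instances that
// turned out not to exist in the desired state.
func validateMoves(moves []move, desired map[string]bool) []string {
	var errs []string
	for _, m := range moves {
		if !desired[m.to] {
			errs = append(errs, fmt.Sprintf("moved target %q does not exist in the desired state", m.to))
		}
	}
	return errs
}

func main() {
	state := map[string]string{`aws_instance.old[0]`: "i-abc123"}
	moves := []move{{from: `aws_instance.old[0]`, to: `aws_instance.new[0]`}}

	state = applyMoves(state, moves) // optimistic, pre-plan
	// ... planning happens here, producing the desired instance set ...
	errs := validateMoves(moves, map[string]bool{`aws_instance.new[0]`: true}) // retroactive
	fmt.Println(state, errs)
}
```

The awkwardness the text describes falls directly out of this shape: the rewrite and its validation are separated by the entire planning phase.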
These constraints make the implementation of the refactoring features more complicated, and also constrain the refactoring blocks to supporting only static expressions, because they must be handled before the main evaluation process can start, and thus before dynamic expressions can be evaluated. That led, for example, to opentofu/opentofu#1475. Although we cannot be certain that dealing with address changes as part of the main execution is sufficient to support more-dynamic situations, it does seem to be necessary.
Historical changes related to this:
- The `refactoring.ApplyMoves` and `refactoring.ValidateMoves` calls, which sandwich the main planning process.
- Adjustments to the `context.Context` API so that `refactoring.ApplyMoves` could be called at a suitable time relative to everything else.

We are proposing a major design change to OpenTofu's internal architecture, focusing on breaking apart the complexity of the graph operations described above.
Operations today re-use many of the same concepts and code throughout their execution. In practice, this means that a lot of work must be repeated between operations. Matters are further complicated because the expected result of a given codepath may change dramatically based on the operation, leading to hard-to-trace bugs.
In short, we have several general-purpose structures and concepts that are heavily re-used in ways that are not ideal. Although code re-use should be a goal we strive for, pursuing it too aggressively can lead to hard-to-maintain patterns and brittle designs.
Therefore, we propose to form a stronger conceptual chain between the three classes of operations, Validate, Plan, and Apply. Each operation class would have a clear set of non-overlapping responsibilities that would feed into the next link in the chain.
The main conceptual difference between this model and what exists today is that each operation bases its execution primarily on the prior operation's output, instead of trying to replicate similar logic that has been twisted to fit the current operation.
For example, today Apply builds a similar graph to Plan's. The graphs differ in several important ways, but it is hard to know which differences are intentional and which are bugs. Consider instead if Plan were to output a set of "actions" to take during Apply, such that Apply is relegated to stepping through what Plan explicitly laid out.
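The idea in the paragraph above, with names invented purely for illustration: Plan's output includes an explicit list of actions, and Apply simply executes them rather than rebuilding its own graph:

```go
package main

import "fmt"

// action is a hypothetical planned step that Plan would serialize
// into its output, e.g. as part of an execution graph.
type action struct {
	kind string // "create", "update", "delete", ...
	addr string
}

// apply steps through exactly what the plan laid out; it makes no
// decisions of its own about what to do or in what order.
func apply(plan []action) []string {
	var log []string
	for _, a := range plan {
		log = append(log, fmt.Sprintf("%s %s", a.kind, a.addr))
	}
	return log
}

func main() {
	plan := []action{
		{kind: "create", addr: "aws_vpc.main"},
		{kind: "update", addr: "aws_subnet.a"},
	}
	for _, line := range apply(plan) {
		fmt.Println(line)
	}
}
```

Under this split, any disagreement between "what was planned" and "what was applied" becomes directly observable in the plan artifact, instead of being buried in two independently constructed graphs.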
```mermaid
flowchart LR
    Config("InputConfig") --> Validate["Validate"]
    Validate --> ConfigReferences("ConfigAnalysis")
    Validate --> ConfigValidate("ValidateConfig")
    ConfigReferences --> Plan
    ConfigValidate --> Plan
    InputState("InputState") --> Plan
    Plan --> ExpectedState("ExpectedState")
    Plan --> RefreshedState("RefreshedState")
    Plan --> ChangesToResources("ResourceChanges")
    Plan --> ConfigPlan("PlanConfig")
    Plan --> PlanGraph("ExecutionGraph")
    PlanGraph --> Apply["Apply"]
    ConfigPlan --> Apply
    RefreshedState --> Apply
    ChangesToResources --> Apply
    Apply --> OutputState("OutputState")
    classDef green fill:#cfb,stroke:#ccc
    class Config,ConfigReferences,ConfigValidate,InputState green
    class ExpectedState,PlanGraph,ConfigPlan,RefreshedState,ChangesToResources green
    class OutputState green
```
This diagram should serve as a conceptual representation of what this could look like, but is not prescribing an exact implementation. That is reserved for followup prototyping and RFCs.
In this diagram, we can see that each component has a clear job of taking a specific input and transforming it to the expected outputs.
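One way to read the diagram is as a set of function signatures, where each operation consumes only the previous operation's outputs. The types below are empty placeholders named after the diagram's boxes; the real shapes are deliberately left to followup prototyping and RFCs:

```go
package main

import "fmt"

// Placeholder types named after the boxes in the diagram.
type (
	InputConfig     struct{}
	ConfigAnalysis  struct{}
	ValidateConfig  struct{}
	InputState      struct{}
	ExpectedState   struct{}
	RefreshedState  struct{}
	ResourceChanges struct{}
	PlanConfig      struct{}
	ExecutionGraph  struct{}
	OutputState     struct{}
)

// Validate consumes only the input configuration.
func Validate(cfg InputConfig) (ConfigAnalysis, ValidateConfig) {
	return ConfigAnalysis{}, ValidateConfig{}
}

// Plan consumes Validate's outputs plus the input state.
func Plan(a ConfigAnalysis, v ValidateConfig, s InputState) (ExpectedState, RefreshedState, ResourceChanges, PlanConfig, ExecutionGraph) {
	return ExpectedState{}, RefreshedState{}, ResourceChanges{}, PlanConfig{}, ExecutionGraph{}
}

// Apply consumes only Plan's outputs; it never goes back to the raw config.
func Apply(g ExecutionGraph, p PlanConfig, r RefreshedState, c ResourceChanges) OutputState {
	return OutputState{}
}

func main() {
	a, v := Validate(InputConfig{})
	_, rs, ch, pc, g := Plan(a, v, InputState{})
	fmt.Printf("%T\n", Apply(g, pc, rs, ch)) // the chain ends in an OutputState
}
```

The value of this framing is that the compiler (or, conceptually, the architecture) forbids an operation from reaching back past its declared inputs.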
Let's follow this diagram through running tofu apply on a project:
It is unclear at this juncture whether several of the distinct outputs of the plan step should instead be represented as a single homogeneous structure, rather than piecemeal. For the purposes of this discussion they are kept distinct, to more closely model the current architecture.
This process could be alternately thought of as a variant of "Parse, Compile, Execute".
By separating OpenTofu's execution into compartmentalized and distinct concerns:
This architecture does not try to constrain or reduce the complexity of what OpenTofu can do in a given operation; instead, it should allow for easier expansion in the future.
As mentioned above, the implementation of these concepts is too large a topic to cover in a single RFC. However, we have set a goal of making incremental changes to the architecture.
In an ideal world, we would make the new "execution engine" opt-in for a period of time. In practice, it may not be possible to hold to that exact ideal. We should attempt to reduce the "churn" of each step toward this architecture by re-using and adapting as many existing constructs within the codebase as possible. All of the sub-RFCs should discuss this in detail.