docs/compilers/Design/Red-Green Trees.md
This document explains the internal architecture of Roslyn's syntax model, focusing on the "green" layer that powers its efficiency. It is intended for developers working on or with the Roslyn compiler internals.
For the public API and general usage of syntax trees, see the official documentation: Use the .NET Compiler Platform SDK syntax model. This document assumes familiarity with that material and dives into the implementation details that make the syntax model performant.
Roslyn's syntax trees use a design pattern sometimes called "red/green trees." The core idea is to maintain two parallel representations of the same syntactic structure:
Green nodes: The internal, immutable representation. These nodes store only their syntactic kind, their width (character count), and their children. They do not store absolute text positions or parent pointers.
Red nodes: The public-facing wrappers. These provide the familiar API with properties like
Span, Parent, and strongly-typed child accessors. Red nodes are thin wrappers that combine a
green node with positional and parental context.
This split exists because immutability and efficient navigation are fundamentally at odds. An immutable node cannot store a parent pointer (since the same node might have different parents in different contexts). But users need parent navigation. The red/green pattern solves this: green nodes are immutable and shareable, while red nodes provide the navigation API by computing positions and parents on demand.
A green node stores:
MethodDeclaration, IdentifierToken)Crucially, green nodes do not store:
This design choice is fundamental. Everything else flows from it.
Green nodes use integer "slots" to access children. Each green node type knows how many slots it has
and what kind of child belongs in each slot. For example, a green MethodDeclarationSyntax node
might have slots for: attributes (0), modifiers (1), return type (2), identifier (3), type
parameters (4), parameters (5), constraints (6), body (7), and so on.
For every red syntax node type, there is a corresponding internal green syntax node type. So there
is both a public red MethodDeclarationSyntax and an internal green MethodDeclarationSyntax. The
green type can be accessed by slot index for uniform traversal, or by strongly-typed properties that
correspond to each child. The red layer accesses these green properties directly when producing
child nodes through the red API.
This design enables internal algorithms to walk the green tree generically by slot index (without knowing what node type they're visiting), while also providing efficient strongly-typed access when the node type is known.
The decision to omit positions and parents from green nodes enables several powerful optimizations.
When a user edits a file, incremental parsing can reuse vast portions of the previous syntax tree. Because green nodes don't encode absolute positions, a method declaration that was at position 10,000 can be reused at position 10,050 after an earlier edit. The green node is identical; only the red wrapper's computed position changes.
This enables incremental parses to complete in microseconds with memory reuse approaching 99.99% for typical edits. For full details, see Roslyn Incremental Parsing: A Deep Dive.
The same green node can appear in completely unrelated syntax trees. Consider the ParameterList
node for an empty parameter list (). This exact construct appears in thousands of methods across a
large solution. Because green nodes have no position or parent, the same green node object can be
shared across all of them.
This sharing happens automatically through green node caching (described below). The result is significant memory savings at the solution level.
The same green node can appear multiple times within a single syntax tree. If a file contains ten
methods that all have empty parameter lists (), all ten can point to the same ParameterList
green node.
This means "green tree" is actually a misnomer. The green structure is an directed acyclic graph (DAG), not a tree. Multiple parents can share the same child. This dramatically reduces memory for files with repeated constructs.
The red layer hides this complexity. Each red node appears to be a distinct object with its own unique parent and position, even though the underlying green nodes may be shared.
In the public red API, the syntax model uses different types for different concepts:
SyntaxNode: A class (reference type) for syntax nodesSyntaxToken: A struct (value type) for tokensSyntaxTrivia: A struct for whitespace and commentsSyntaxList<T>: A struct for lists of nodesIn the green layer, all of these are heap-allocated node objects. There are no structs. Every
green token, every piece of green trivia, every green list is a GreenNode subclass.
Why? Because structs cannot participate in the sharing scenarios described above. If tokens were structs at the green level, reusing an identifier across multiple locations would require copying the struct each time. With green nodes as objects, we can instead point to the same instance.
Beyond sharing, this choice also enables an important aspect of the green layer’s design: specialization. The green tree contains many specialized implementations tailored for different scenarios (for example, optimized token representations and specialized list implementations). This specialization is essential for achieving good performance and memory usage, and will be discussed in more detail below.
Structs are not well suited to this role, as they cannot participate in inheritance hierarchies or support multiple specialized implementations behind a common abstraction. By representing green nodes as objects, we can use standard inheritance techniques to freely specialize the internal representation, while still exposing a simple, lightweight, struct-based API at th e public syntax level. The public structs effectively act as thin facades over these underlying green nodes.
The consequence is that red wrappers are extremely cheap. A red SyntaxToken struct is just:
No heap allocation occurs when you access a token. The same applies to trivia and lists at the red level.
Since tokens, lists, and trivia typically constitute 75% or more of a syntax tree's elements, this design avoids the vast majority of allocations that would otherwise occur when traversing a tree.
Lists are ubiquitous in syntax trees: parameter lists, argument lists, statement lists, modifier lists, and so on. Roslyn heavily optimizes list representation at the green level.
An empty list is represented as null at the green level. No allocation whatsoever. When wrapped by
a red SyntaxList<T>, it checks for null and returns Count = 0.
A list with exactly one element doesn't allocate a green list node. Instead, the parent green node
points directly to the single child element. The red SyntaxList<T> detects that it's pointing at a
non-list green node and returns Count = 1, providing access to just that element.
Lists of length 2, 3, and 4 have specialized green node subclasses (WithTwoChildren,
WithThreeChildren, etc.). These subclasses store child pointers directly in fields rather than
using an array. This avoids both the array allocation and the indirection of accessing array
elements.
Lists with 5 or more elements use an array-backed representation.
In practice, 90% or more of lists contain 4 elements or fewer. Think about how often you write methods with 0, 1, 2, or 3 parameters versus methods with 5+ parameters. Argument lists, type parameter lists, attribute lists—all follow similar distributions.
By optimizing for the common case, Roslyn avoids array allocations for the overwhelming majority of
lists. The red SyntaxList<T> and SeparatedSyntaxList<T> structs handle all these cases
transparently, presenting a uniform API regardless of the underlying representation.
Certain red lists are able to hold their red children using weak references. This allows portions of the red tree to be reclaimed by the GC when they are no longer strongly reachable, without affecting correctness.
This is safe because a red node can only be collected if nothing is holding a strong reference to it. If the only remaining reference is the weak reference held by the parent list, then there is no possible observer that could notice the red node going away. If the child is later requested again, a new red node will be created from the same underlying green node. Since red nodes are purely a cache over immutable green structure, this behavior is unobservable and semantically equivalent to retaining the original red node.
Currently, this optimization is applied only to blocks ({ ... }) that are owned by a member or
accessor (see here).
These blocks can be large, and are often only needed temporarily during analysis. By allowing
the red lists for such blocks to weakly reference their children, large amounts of red memory can be
released once nothing else is holding onto those nodes, reducing peak memory usage while preserving
the existing red/green tree semantics.
Roslyn maintains a cache of commonly-occurring green nodes. When constructing a new green node during parsing, the parser first checks if an equivalent node already exists in the cache. If so, it reuses the cached node instead of allocating a new one.
Due to combinatorial explosion, only nodes with 3 or fewer children are eligible for caching. The cache itself holds 65,536 entries. Despite these limitations, analysis of parsing the Roslyn codebase itself shows a 55% cache hit rate for cacheable nodes. This means over half the time, the parser can reuse an existing allocation rather than creating a new one.
For implementation details, see PR #80825 which
simplified and analyzed the SyntaxNodeCache.
This caching is what enables the cross-tree and intra-tree sharing described earlier. The ()
parameter list isn't just theoretically shareable—it's actually shared because the cache returns
the same node every time that construct is parsed.
Because green nodes are internal, we have freedom to optimize their storage in ways that would be awkward to expose publicly. Tokens illustrate this well.
An identifier token stores a pointer to its text (e.g., "myVariable"). Since the text is stored,
the width doesn't need separate storage—it can be computed from the string length. This saves space
on every identifier in the tree.
Keyword tokens work in reverse. All return keywords have the exact same text, so there's no need
to store it. Instead, keyword tokens store only their kind, and the text and width are inferred
from that. A return keyword token knows it's SyntaxKind.ReturnKeyword, and the text "return"
and width 6 follow automatically.
Roslyn pre-computes and caches every keyword token without trivia, as well as variants with common
leading and trailing whitespace patterns. Consider void (the keyword void followed by a
trailing space). This exact construct appears frequently in source files, and every occurrence
points to the same literal green node instance. The parser doesn't allocate anything—it just returns
the pre-cached node.
These optimizations are invisible to users of the red API, but they significantly reduce memory usage and allocation pressure during parsing.
The same caching strategy applies to trivia. Common patterns like a single space, a newline, or typical indentation sequences (two spaces, four spaces, eight spaces, tab sequences, etc.) appear countless times within a file and across files. Rather than allocating a new green trivia node each time, Roslyn pre-caches these common patterns. Every occurrence of a single space trivia in a solution points to the same green node instance.
Red nodes are created on-demand when you traverse the syntax tree. If you never access a particular node, its red wrapper is never created.
For red SyntaxNode instances (the class type), caching is required once created. This is necessary
because nodes have reference identity—users expect that accessing the same child twice returns the
same object. The caching is straightforward: when a red parent node is asked for a child, it checks
if the child red node has already been created. If not, it instantiates the child and atomically
writes it into its field storage. Whichever thread wins the race sets the pointer that all
subsequent reads will see.
For red SyntaxToken, SyntaxTrivia, and SyntaxList<T> (all struct types), no caching is needed.
Each access returns a new struct instance, but structs have value semantics, so this is fine. Two
red structs pointing to the same green node at the same position are considered equal.
Features that don't walk the entire tree pay only for what they touch. Consider IDE colorization: it only needs to examine the tokens visible on screen. The red nodes for methods scrolled off-screen are never created (or, if previously created, may be collected if memory pressure occurs).
Combined with the massive green reuse from incremental parsing, this means the syntax layer allocates very little under normal usage patterns:
Green nodes are fully immutable. Once created, a green node never changes. This has several benefits:
When you need to "modify" a syntax tree (via WithXxx methods or SyntaxFactory), you get a new
tree that shares as much structure as possible with the old tree. Only the nodes along the path from
the root to the modification point are newly allocated; everything else is reused.
For more on working with immutable syntax trees, see the official documentation.
Roslyn syntax trees are full-fidelity: every character from the source file is represented somewhere in the tree. Whitespace, comments, preprocessor directives, and even malformed syntax are all preserved. Concatenating all tokens and trivia in order reproduces the original source exactly.
This property is essential for tooling scenarios. Refactoring tools can modify specific parts of the tree while preserving the user's formatting elsewhere. Error recovery ensures that partially-written code still produces a navigable tree.
Full fidelity also enables incremental parsing to reason precisely about which portions of the tree are affected by an edit. See Roslyn Incremental Parsing: A Deep Dive for details.
For those wanting to explore the implementation:
GreenNode:
The base class for all green nodes, defining slot-based child accessSyntaxNodeCache:
The cache enabling green node reuse across and within treesSyntax/InternalSyntax/SyntaxList.cs:
Green list base class with WithTwoChildren, WithThreeChildren, etc. subclassesSyntax/InternalSyntax/:
Folder containing green node infrastructure and list helpersLanguage-specific green nodes live in their respective compiler directories:
CSharp/Portable/Syntax/InternalSyntax/:
C# green tokens, trivia, and syntax factorySyntaxNode:
The public red node base classSyntaxToken:
The public red token structSyntaxTrivia:
The public red trivia structSyntaxList<T>:
The public red list structThe codebase contains many specialized subclasses for different node configurations, list sizes, and token types. This allows heavy optimization at the green level. However, the power of the red/green split is that all this complexity is hidden from users. The public API presents a clean, intuitive syntax tree with parent navigation and absolute positions. Users don't need to know about green nodes, slot-based access, list representation tricks, or caching strategies.
Yet because of these internal optimizations, combined with the usage patterns described above, performance is excellent:
The complexity lives in the implementation so that users get both a pleasant API and great performance.