Back to V8

Parsing and AST in V8

docs/parsing/parser-and-ast.md

15.0.104.1 KB
Original Source

Parsing and AST in V8

Parsing is the process of taking JavaScript source code and turning it into a structured representation that the engine can understand and execute. V8 uses a recursive descent parser to create an Abstract Syntax Tree (AST).

The Parsing Pipeline

The process involves several stages:

  1. Scanning (Lexical Analysis): The Scanner reads the raw character stream of the source code and breaks it down into a sequence of tokens (e.g., Token::kIf, Token::kIdentifier, Token::kSmi). Tokens are defined in src/parsing/token.h.
  2. Parsing (Syntactic Analysis): The Parser consumes the tokens and builds the Abstract Syntax Tree (AST) according to the grammar of JavaScript.

The Scanner

The Scanner (defined in src/parsing/scanner.h) is responsible for breaking the raw character stream into tokens. It is highly optimized as it is the cornerstone of parser performance.

Key optimizations and design choices include:

  • UTF-16 Stream: The scanner operates on a UTF-16 encoded stream of characters to match JavaScript strings and avoid branching for different encodings.
  • Lookahead: It uses a maximum lookahead of 4 characters to disambiguate tokens.
  • ASCII Lookup Table: For common identifier characters (a-z, A-Z, 0-9, $, _), it uses a fast lookup table to check ID_Start and ID_Continue properties without expensive Unicode checks.
  • Perfect Hashing for Keywords: To quickly identify keywords (like if, else, function), V8 uses a perfect hash function generated by gperf based on the length and the first two characters of the identifier.
  • AdvanceUntil: A templatized helper that allows the scanner direct access to the underlying stream data in a tight loop, minimizing function call overhead.
  • Internalization: All string literals and identifiers are deduplicated (internalized) on the boundary between the scanner and the parser.

Eager vs. Lazy Parsing

To improve startup time and reduce memory usage, V8 does not parse all JavaScript code eagerly.

  • Eager Parsing: The full AST is built immediately. This is used for code that is executed right away (like top-level code).
  • Lazy Parsing: Functions that are not immediately executed are skipped by the full parser and processed by the Pre-Parser.

The Pre-Parser

The PreParser is a stripped-down version of the parser.

  • It checks for syntax errors.
  • It records information about the scope and variable declarations.
  • It does not build a full AST or generate bytecode.
  • When a lazily parsed function is eventually called, it is fully parsed (eagerly) at that time.

To share implementation details between the full Parser and the PreParser without virtual function overhead, V8 uses a C++ template base class, ParserBase (defined in src/parsing/parser-base.h), using the Curiously Recurring Template Pattern (CRTP). Both Parser and PreParser inherit from ParserBase passing themselves as the template argument.

The Abstract Syntax Tree (AST)

The AST is a tree structure where each node represents a construct in the source code. AST nodes are defined in src/ast/ast.h. The AST is the final product of the parser and serves as the input for bytecode generation by Ignition (see Ignition).

Key Node Types:

  • FunctionLiteral: Represents a function definition. It is the root of the AST for a function.
  • Block: A sequence of statements.
  • ExpressionStatement: A statement that consists of a single expression.
  • Literal: Represents a constant value (e.g., a number, string, or boolean).
  • VariableProxy: Represents a reference to a variable. It is resolved to a Variable in a specific Scope during scope analysis.

Scope Analysis

During or immediately after parsing, V8 performs scope analysis to resolve variable references to their declarations. This is where the relationships documented in Scopes and ScopeInfos are established.

See Also