Java Grammar Update Process

This document outlines the procedures for updating the Java grammar and integrating new language features into Checkstyle.

Prerequisites

There are some tools and concepts that you should be familiar with before updating the Java grammar:

ANTLR4
Basic knowledge of lexers and parsers
Java Language Specification (JLS)
EBNF

Basics

A few basics to understand:

Lexer: A lexer is a program that performs lexical analysis. It takes the input text and converts it into tokens. Our lexer grammar is defined here.
Parser: A parser is a program that performs syntax analysis. It takes the tokens generated by the lexer and constructs a parse tree. Our parser grammar is defined here.
Parse Tree: A parse tree is a tree representation of the syntactic structure of the source code. It is generated by the parser. We do not use the parse tree directly in Checkstyle, but we use the parser to generate the parse tree and then visit the parse tree to build our Abstract Syntax Tree (AST).
Abstract Syntax Tree (AST): An AST is a tree representation of the syntactic structure of the source code. We build our AST by visiting the parse tree generated by the parser. We use the AST to perform static analysis.
Token: A token is a single element of the source code. For example, a keyword, an identifier, a literal, etc. We define tokens in our lexer grammar.
Token Type: A token type is a category of tokens. For example, IDENT is a token type that includes all identifiers in the source code. We define token types in our lexer grammar.
Parser Rule: A parser rule is a rule that defines the syntax of the programming language. For example, a class declaration.
Parser Rule Context: A parser rule context is a context in which a parser rule is applied, and contains information about the source code that matches the parser rule. We use parser rule contexts to build our AST.
Visitor: A visitor is a design pattern that allows us to separate the algorithm from the object structure. We use a visitor to traverse the parse tree and build our AST.
JavaAstVisitor: A visitor that we use to traverse the parse tree and build our AST. We define our visitors here.

Java Grammar Update Procedure

Let's walk through an example of updating the Java grammar to support a new language feature. We will use the when expression as an example: https://openjdk.org/jeps/441.

Review the JEP and JLS

It is good to first take some time to read the JEP to understand the new language feature. The JEP provides detailed information about the goals and motivations behind the new feature, And help us to come up with good testing strategies.

It is also important to review the Java Language Specification (JLS) to understand the syntax and semantics of the new language feature. The JLS provides the formal definition of the Java programming language and is the ultimate reference for language features.

The JLS defines the when expression as follows (surrounding context provided for clarity):

txt

SwitchBlock:
  { SwitchRule {SwitchRule} }
  { {SwitchBlockStatementGroup} {SwitchLabel :} }

SwitchRule:
  SwitchLabel -> Expression ;
  SwitchLabel -> Block
  SwitchLabel -> ThrowStatement

SwitchBlockStatementGroup:
  SwitchLabel : {SwitchLabel :} BlockStatements

SwitchLabel:
  case CaseConstant {, CaseConstant}
  case null [, default]
  case CasePattern {, CasePattern} [Guard]
  default

CaseConstant:
  ConditionalExpression

CasePattern:
  Pattern

Guard:
  when Expression

Identify New Tokens and Update Lexer Grammar

Not every new language feature requires new tokens. But in the case of the when expression, we need to introduce a new token to represent the when keyword. This requires an update to the lexer grammar.

antlr

LITERAL_WHEN : 'when' ;

Notes:

We preface literal tokens with LITERAL_ to distinguish them from other tokens.
Lexer rules are written in uppercase by convention.

We will also need to add a new TokenType in this case. However, we may not need to always add a token type for every new lexer token, it depends on the use case. An example would be the LCURLY (left curly) token, which is used to represent the {. This token is also a SLIST (statement list) in certain contexts. This greatly eases static analysis, since we do not need to differentiate between the two tokens by checking their context in checks. Token "reuse" is a common pattern, and helps to avoid code duplication and having an unnecessarily large number of token types.

Update Parser Grammar

Next, we need to update the parser grammar to understand in what contexts the new tokens should appear. We need to add a new parser rule for the when expression. The when expression is used in the context of a guard in a switch statement or switch expression. We need to update the guard rule to recognize the when expression.

antlr

guardedPattern
    : primaryPattern guard expression
    ;

guard: LITERAL_WHEN;

Notes:

Parser rules are written in lower camel case by convention.
It is important to consider how many subrules are needed to represent the new language feature. Due to the recursive nature of parsing, Stackoverflow errors can occur if the grammar is too complex with too many subrules.
It is good to try to strike a balance between the number of subrules, the complexity of the grammar, and the complexity of JavaAstVisitor to ease maintenance.

Update JavaAstVisitor

Now, our lexer recognizes the when keyword, and our parser recognizes the when expression. We have updated TokenTypes to include LITERAL_WHEN. At this point, we are able to parse the new syntax, however, the new tokens will not appear in the AST unless we update the JavaAstVisitor. This is because our ANTLR grammar provides us with a parse tree, which we traverse using the JavaAstVisitor to build our AST.

java

@Override
public DetailAstImpl visitGuardedPattern(JavaLanguageParser.GuardedPatternContext ctx) {
    // since the `guard` rule is a terminal rule, we need to create a new AST node for it
    final DetailAstImpl guardAstNode = flattenedTree(ctx.guard());
    // we add the children of the `primaryPattern` and `expression` rules to the AST node
    guardAstNode.addChild(visit(ctx.primaryPattern()));
    guardAstNode.addChild(visit(ctx.expression()));
    return guardAstNode;
}

Above is an example of the transformation of the guardedPattern rule in the JavaAstVisitor. We create a new AST node for the guard rule, and add the children of the primaryPattern and expression rules to the AST node. This is a simplified example, and the actual implementation may vary depending on the complexity of the rule.

Notes:

When we visit a parser rule context, we have access to all the parse tree elements that correspond to that rule. We can use this information to build our AST.
The structure of the AST should be carefully considered; we want to ease static analysis by providing a clear and concise representation of the source code. We can review existing AST nodes within other contexts to get an idea of how to structure the AST; however, there is no strict rules regarding this topic. This part of the process relies heavily on heuristics and experience.
In many cases, we can consider the ASTs yielded by visiting the subrule of a given rule to be the children of the AST node for that rule. This is a common pattern in the JavaAstVisitor; if we did not visit the subrule, the AST would not get built.

Update Tests

Finally, we need to update our tests to ensure that the new language feature is correctly parsed and analyzed. You can find our AST tests here.

Notes:

We typically separate tests into directories based on the language level.
We test the content of the AST only, we generally do not test the actual parse tree.

Additional Considerations

DetailAstPair

The DetailAstPair class is used to represent a pair of AST nodes. It is used in the JavaAstVisitor to represent the nested parent-child relationship between AST nodes, especially where there is recursive nesting of only two to three types of tokens. An example of this is the qualifiedName rule, where the parent node is a DOT token and the child nodes are IDENT tokens. We use DetailAstPair to represent this relationship in the JavaAstVisitor#visitQualifiedName method.

Imaginary Tokens

We often create "Imaginary tokens" to represent the structure of the source code. These tokens are not directly generated by the lexer (or parser in most cases) but are used to represent the structure of the source code in the AST, and ease the process of static analysis. An example of this is the EXPR token; no such token is generated by the lexer, but it is used to represent expressions in the AST to make it easy for checks to find and analyze expressions.

Notes:

Imaginary tokens are not always necessary, and should be used judiciously.
Imaginary tokens take on the column and line number of the first child token in the AST.

Why we don't use the parse tree directly

ANTLR generates a parse tree that represents the syntactic structure of the source code. We could use this parse tree directly to perform static analysis. However, we use the parse tree to build our AST because it provides a more structured representation of the source code that is easier to analyze. The AST is designed specifically for static analysis and provides a more convenient interface for writing checks; it abstracts away the details of the parse tree and provides a simplified view of the source code.

Performance Regressions

When updating the grammar, we should be mindful of the performance impact of our changes. For that, we have a CI job that compares the performance of the changes against a baseline. Make sure to check the results of the performance regression tests after making changes to the grammar. You can find the tests here and the CI job here.

ANTLR Regression Report

Whenever making changes to the grammar, you must also generate an ANTLR Regression Report using the tool described here. This report compares the behavior of the updated grammar against the current baseline (the master branch) and helps us detect unintended parsing regressions. Please include the generated report in your pull request description for review.