docs/GRAMMAR_UPDATES.md
This document outlines the procedures for updating the Java grammar and integrating new language features into Checkstyle.
There are some tools and concepts that you should be familiar with before updating the Java grammar:
A few basics to understand:
IDENT is a token type that includes all identifiers in the source code. We define token types in our lexer grammar.
Let's walk through an example of updating the Java grammar to support a new language feature. We will use the when expression as an example: https://openjdk.org/jeps/441.
It is good to first take some time to read the JEP to understand the new language feature. The JEP provides detailed information about the goals and motivations behind the new feature, and helps us come up with good testing strategies.
It is also important to review the Java Language Specification (JLS) to understand the syntax and semantics of the new language feature. The JLS provides the formal definition of the Java programming language and is the ultimate reference for language features.
The JLS defines the when expression as follows (surrounding context provided for clarity):
SwitchBlock:
    { SwitchRule {SwitchRule} }
    { {SwitchBlockStatementGroup} {SwitchLabel :} }
SwitchRule:
    SwitchLabel -> Expression ;
    SwitchLabel -> Block
    SwitchLabel -> ThrowStatement
SwitchBlockStatementGroup:
    SwitchLabel : {SwitchLabel :} BlockStatements
SwitchLabel:
    case CaseConstant {, CaseConstant}
    case null [, default]
    case CasePattern {, CasePattern} [Guard]
    default
CaseConstant:
    ConditionalExpression
CasePattern:
    Pattern
Guard:
    when Expression
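To make the grammar concrete, here is a small sketch of source code that exercises the Guard production. The class and method names are made up for illustration, and the example needs Java 21 or later, where JEP 441 is final. It also shows that `when` is only a contextual keyword, so it remains a legal identifier elsewhere; test inputs for the grammar should cover both uses.

```java
// Hypothetical example for illustration; requires Java 21 or later.
class GuardExample {

    // The guarded labels below match the
    // `case CasePattern {, CasePattern} [Guard]` alternative of SwitchLabel.
    static String describe(Object obj) {
        return switch (obj) {
            case null -> "null";                                // case null
            case String s when s.isEmpty() -> "empty string";  // pattern with guard
            case String s -> "string of length " + s.length(); // pattern, no guard
            case Integer i when i > 10 -> "large integer";      // pattern with guard
            default -> "something else";
        };
    }

    public static void main(String[] args) {
        // `when` is a contextual keyword, so it can still name a variable.
        int when = 42;
        System.out.println(describe("") + " / " + describe(when));
    }
}
```

Inputs like this are useful when writing tests later, because they combine guarded and unguarded labels with `when` used as a plain identifier.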
Not every new language feature requires new tokens. But in the case of the when expression,
we need to introduce a new token to represent the when keyword. This requires an update
to the lexer grammar.
LITERAL_WHEN : 'when' ;
Notes:
Keyword tokens are prefixed with LITERAL_ to distinguish them from other tokens.
We will also need to add a new TokenType
in this case. However, we do not always need to add a token type for every new lexer token; it
depends on the use case. An example is the LCURLY (left curly) token, which
represents the { character. This token is also typed as SLIST (statement list) in certain contexts. This greatly
eases static analysis, since checks do not need to differentiate between the two tokens by inspecting
their context. Token "reuse" is a common pattern; it helps avoid code duplication and
an unnecessarily large number of token types.
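As a rough illustration of why typing the same lexeme by its role helps checks, the sketch below uses stand-in types. `TokenKind`, `Node`, `leftCurly` and `countStatementLists` are hypothetical and are not Checkstyle's API. A check that only cares about statement lists can match on one token type and never has to inspect the surrounding context of a brace.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; not Checkstyle's real classes.
class TokenReuseSketch {

    enum TokenKind { LITERAL_IF, SLIST, LCURLY, OBJBLOCK }

    static final class Node {
        final TokenKind kind;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(TokenKind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // The same "{" lexeme gets a type that reflects its role:
    // SLIST when it opens a list of statements, LCURLY otherwise.
    static Node leftCurly(boolean opensStatementList) {
        TokenKind kind = opensStatementList ? TokenKind.SLIST : TokenKind.LCURLY;
        return new Node(kind, "{");
    }

    // A check-like traversal that finds statement lists by type alone.
    static int countStatementLists(Node node) {
        int count = node.kind == TokenKind.SLIST ? 1 : 0;
        for (Node child : node.children) {
            count += countStatementLists(child);
        }
        return count;
    }

    // Models `if (..) { if (..) { } }`: two nested statement lists.
    static Node nestedIf() {
        Node inner = new Node(TokenKind.LITERAL_IF, "if").add(leftCurly(true));
        Node outerBody = leftCurly(true).add(inner);
        return new Node(TokenKind.LITERAL_IF, "if").add(outerBody);
    }

    public static void main(String[] args) {
        System.out.println(countStatementLists(nestedIf()));
    }
}
```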
Next, we need to update the parser grammar so that it knows in which contexts the new token may
appear. The when expression acts as a guard on a case label in a switch statement or switch
expression, so we add a guard rule that matches LITERAL_WHEN and reference it from the
guardedPattern rule.
guardedPattern
    : primaryPattern guard expression
    ;
guard: LITERAL_WHEN;
Now, our lexer recognizes the when keyword, our parser recognizes the when expression, and
TokenTypes includes LITERAL_WHEN. At this point, we are able to parse the new syntax; however,
the new tokens will not appear in the AST unless we update the JavaAstVisitor. This is because
our ANTLR grammar provides us with a parse tree, which we traverse using the JavaAstVisitor to
build our AST.
@Override
public DetailAstImpl visitGuardedPattern(JavaLanguageParser.GuardedPatternContext ctx) {
    // since the `guard` rule is a terminal rule, we need to create a new AST node for it
    final DetailAstImpl guardAstNode = flattenedTree(ctx.guard());
    // we add the children of the `primaryPattern` and `expression` rules to the AST node
    guardAstNode.addChild(visit(ctx.primaryPattern()));
    guardAstNode.addChild(visit(ctx.expression()));
    return guardAstNode;
}
Above is an example of the transformation of the guardedPattern rule in the JavaAstVisitor. We
create a new AST node for the guard rule, and add the children of the primaryPattern and
expression rules to the AST node. This is a simplified example, and the actual implementation
may vary depending on the complexity of the rule.
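To visualise the tree this method builds, here is a sketch with a stand-in `Node` class (hypothetical; the real visitor works with `DetailAstImpl`). For a label such as `case String s when s.isEmpty()`, the `LITERAL_WHEN` node becomes the parent of the pattern subtree and the guard expression subtree. The child type names `PATTERN` and `EXPR` are placeholders, not necessarily the token types Checkstyle emits here.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for DetailAstImpl, for illustration only.
class GuardAstSketch {

    static final class Node {
        final String type;
        final List<Node> children = new ArrayList<>();

        Node(String type) {
            this.type = type;
        }

        void addChild(Node child) {
            children.add(child);
        }

        // Renders the tree in a compact parenthesised form.
        String render() {
            if (children.isEmpty()) {
                return type;
            }
            StringBuilder sb = new StringBuilder(type).append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(children.get(i).render());
            }
            return sb.append(')').toString();
        }
    }

    // Mirrors the shape of visitGuardedPattern: the guard token is the root,
    // and the pattern and expression subtrees are attached as its children.
    static Node guardedPattern(Node primaryPattern, Node expression) {
        Node guard = new Node("LITERAL_WHEN");
        guard.addChild(primaryPattern);
        guard.addChild(expression);
        return guard;
    }

    public static void main(String[] args) {
        Node tree = guardedPattern(new Node("PATTERN"), new Node("EXPR"));
        System.out.println(tree.render()); // prints LITERAL_WHEN(PATTERN, EXPR)
    }
}
```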
Finally, we need to update our tests to ensure that the new language feature is correctly parsed and analyzed. You can find our AST tests here.
The DetailAstPair class is used to represent a pair of AST nodes. It is used in the
JavaAstVisitor to represent the nested parent-child relationship between AST nodes, especially
where there is
recursive nesting of only two to three types of tokens. An example of this is the qualifiedName
rule, where the parent node is a DOT token and the child nodes are IDENT tokens. We use
DetailAstPair to represent this relationship in the JavaAstVisitor#visitQualifiedName method.
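The nesting described above can be sketched without Checkstyle's classes. The `Node` type and the `qualifiedName` helper below are hypothetical stand-ins, not the DetailAstPair API. They only show the left-nested DOT/IDENT shape that such a helper has to maintain while walking a name like `a.b.c`: at each step the tree built so far becomes the first child of a new DOT root.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; not Checkstyle's real classes.
class QualifiedNameSketch {

    static final class Node {
        final String type;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        // Leaves render as their text, inner nodes as TYPE(child, child).
        String render() {
            if (children.isEmpty()) {
                return text;
            }
            StringBuilder sb = new StringBuilder(type).append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(children.get(i).render());
            }
            return sb.append(')').toString();
        }
    }

    // Folds identifiers left to right: each DOT takes the tree built so far
    // as its first child and the next identifier as its second child.
    static Node qualifiedName(String... identifiers) {
        if (identifiers.length == 0) {
            throw new IllegalArgumentException("at least one identifier is required");
        }
        Node root = new Node("IDENT", identifiers[0]);
        for (int i = 1; i < identifiers.length; i++) {
            Node dot = new Node("DOT", ".");
            dot.children.add(root);
            dot.children.add(new Node("IDENT", identifiers[i]));
            root = dot;
        }
        return root;
    }

    public static void main(String[] args) {
        System.out.println(qualifiedName("a", "b", "c").render()); // prints DOT(DOT(a, b), c)
    }
}
```

Tracking only the current root and the most recent child is enough for this kind of recursion, which is why a small pair-like helper fits the job.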
We often create "imaginary tokens". These tokens are not generated directly by the lexer (or, in
most cases, by the parser); they represent the structure of the source code in the AST and ease
static analysis. An example is the EXPR token: no such token is generated by the lexer, but it
represents expressions in the AST so that checks can easily find and analyze them.
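The idea can be sketched with a stand-in `Node` class. Everything here is hypothetical, including the use of a `null` text to mark a node that has no lexeme of its own; Checkstyle's own representation may differ. The point is that the tree builder, not the lexer, introduces the EXPR node, and a check can then look for EXPR regardless of which operator sits beneath it.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; not Checkstyle's real classes.
class ImaginaryTokenSketch {

    static final class Node {
        final String type;
        final String text; // source text, or null for an imaginary node (an assumption of this sketch)
        final List<Node> children = new ArrayList<>();

        Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        boolean isImaginary() {
            return text == null;
        }
    }

    // Wraps an expression subtree in an EXPR node that has no lexeme of its own.
    static Node wrapInExpr(Node expressionRoot) {
        Node expr = new Node("EXPR", null);
        expr.children.add(expressionRoot);
        return expr;
    }

    // Builds the tree for `a + b`: EXPR -> PLUS -> (IDENT a, IDENT b).
    static Node sum() {
        Node plus = new Node("PLUS", "+");
        plus.children.add(new Node("IDENT", "a"));
        plus.children.add(new Node("IDENT", "b"));
        return wrapInExpr(plus);
    }

    public static void main(String[] args) {
        Node expr = sum();
        System.out.println(expr.type + " imaginary=" + expr.isImaginary()
            + " operator=" + expr.children.get(0).type);
    }
}
```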
ANTLR generates a parse tree that represents the syntactic structure of the source code. We could use this parse tree directly to perform static analysis. Instead, we use it to build our AST, because the AST provides a more structured representation of the source code that is easier to analyze. The AST is designed specifically for static analysis and provides a more convenient interface for writing checks; it abstracts away the details of the parse tree and provides a simplified view of the source code.
When updating the grammar, we should be mindful of the performance impact of our changes. For that, we have a CI job that compares the performance of the changes against a baseline. Make sure to check the results of the performance regression tests after making changes to the grammar. You can find the tests here and the CI job here.
Whenever making changes to the grammar, you must also generate an ANTLR Regression Report using the
tool described here.
This report compares the behavior of the updated grammar against the current baseline (the master
branch) and helps us detect unintended parsing regressions. Please include the generated report in
your pull request description for review.