docs/RFCS/20171011_adding_sql_syntax.md
This RFC recommends some general steps for projects that add features visible through SQL syntax (either by adding new syntax or repurposing existing syntax).
There are two aspects to the RFC:
one that focuses on changes to the SQL grammar; in short, new syntactic features should be discussed via the RFC process and merged with the same PR that actually implements the feature.
one that focuses on the general process of adding new language features; this serves as point-in-time reminder of the work currently needed.
Acknowledgement: This RFC complements/accompanies the codelab for adding a new SQL statement by Andrew Couch.
Until this point there have been instances in the past where the SQL grammar was modified/extended to "prepare" for some yet-to-be-implemented feature, thereby reserving a space in the set of valid SQL statements for future use.
The main motivation for that approach was to ensure that a later PR introducing said functionality would contain fewer commits / changes to review, to ease the review process.
This RFC posits this motivation is misguided and instead the SQL contributors should aim to avoid the following pitfalls:
"Things happen" with the grammar over time - cleanups, refactorings, optimizations, renames, etc. This is routine work for the SQL team, but it's also work where the people doing the work need to have ownership of the SQL grammar. The problem with having an orphan syntax, especially without a comment about who's responsible for it, what it is to become and when, is that we can't really work around that in the grammar without risking breaking that person's plans.
there is such a thing about the UI of a language, and that is its grammar. There is also such a thing as UI design, which ensures a number of properties:
Granted, SQL is not really "pleasant" to look at but it has some internal structure and general patterns regarding extensibility that have been adopted as the result of 40 years of accumulated wisdom. For example, the first keyword in a statement has semantic properties all over the place in the language. One cannot just add a new statement prefix and not expect lots of work; whereas adding an alternate form for an existing prefix is relatively harmless.
the natural progression of this practice we've started already would be a grammar definition that is fostering a large population of orphan syntax without functionality, created for projects abandoned before completion. We do not have this problem now, but from experience this just happens with team/company growth unless we're unrealistically careful (in other words, in this author's experience this really is a problem waiting to happen).
A contributor to CockroachDB often has good reasons to suggest a SQL language extension.
One should always approach this task with a clear mindset that distinguishes the mechanisms which deliver the desired feature from the interface presented to users to control it.
The SQL syntax is the "interface". The code behind the syntax is the mechanism. We can have multiple interfaces for the same mechanism: either two or more ways to access the mechanism in SQL directly, or other interfaces (e.g. an admin RPC) that enable access to the mechanism besides SQL.
As a matter of good engineering, it is always useful to discuss and define the mechanisms separately from the design of the interfaces. (And preferably mechanisms no latter than interfaces.)
Soliciting input on both is equally important. That's why both should undergo the RFC process.
If your functionality mainly "takes place" elsewhere than the SQL layer but you still need a SQL interface, be sure to involve the SQL team in the SQL-related discussion for the interface part of your RFC.
This RFC proposes to delay merging syntax additions until the change that actually implements the functionality.
We foresee three accepted and documented exceptions to this policy.
We can reserve syntax that is specified in the SQL standard but currently unsupported in CockroachDB. This provides us a convenient way to ensure we don't introduce conflicting syntax while also giving us a natural mechanism to inform the user that, yes, we know that's a reasonable thing to ask, and that there is GH issue where they can see why we don't support it yet and perhaps provide arguments to influence our roadmap.
We can reserve syntax also when a new mechanism RFC has been accepted which benefits from exposing common functionality to different parts of CockroachDB, but not client applications, via a SQL syntax extension.
In this case, the contributor should be mindful that a feature that is once considered "internal" may be serendipitously found later useful by end-users; to prepare for this eventuality we must pave the road to "publicize" a feature once internal:
Preferably, the code should also contain provision to make the syntax invalid for regular client sessions, e.g. by requiring special permissions or a planner flag.
We can reserve syntax also when all the following conditions hold:
In this case, adding new syntax is acceptable, given the following conditions:
experimental_uuid_v4(), ALTER TABLE EXPERIMENTAL ...)In CockroachDB SQL features will need work at multiple levels, touching the various architectural components discussed in the SQL architecture document.
An RFC planning to add a new SQL feature would therefore do well to outline and clearly distinguish the following aspects:
Mechanism:
Interface:
Somewhere in the process of writing about those things comes the question: what should the SQL interface be?
Whether the author already has a clear idea about what they want/recommend or whether they don't, the RFC should at least cursorily examine past work in other SQL engines and suggest how other SQL engines have offered similar functionality (if at all). Then two situations can arise:
either similar functionality is available in one or more other SQL engines:
no similar functionality as a whole is available elsewhere:
In general:
During parsing the code translates the input text into an abstract syntax tree (AST). Different node types carry different semantics.
Sometimes a new feature with different syntax can reuse existing AST nodes; we (compiler community) say that the new syntax is "sugar" for a semantic construct that was already valid in principle.
For example, the syntax a IS UNKNOWN can be desugared into a IS NULL,
SHOW ALL CLUSTER SETTINGS can be desugared into SHOW CLUSTER SETTING "all", etc.
In general, the fewer the AST nodes the better. Each new AST node multiplies the work by the number of algorithms/transforms in the SQL middle-end layer.
Sometimes, you can't reuse an existing AST node as-is, but it is possible to update/enhance it so that it becomes reusable both for the new features and the statement(s) that already use it. Consider doing that instead of adding a new AST node.
And in general it is not required that a given SQL input text can be parsed and its AST pretty-printed back to exactly the same SQL input. Desugaring is good, use it whenever possible.
Even though you may only have a few use cases in mind to use the new feature being proposed, try to not restrict/tailor the syntax and semantics of the SQL interface to just these use cases.
For example, suppose you are designing a feature to "import CSV data into CockroachDB":
IMPORTCSV FILE <filename> INTO <tablename> USING DELIMITER <string>IMPORT <format> FILE <filename> INTO <tablename> [WITH <option>...]IMPORT [INTO] <relation> <format> DATA (<URL>...) [WITH <option>...]General principles:
format = CSV);Do not forget parse_test.go
Do not forget contextual inline help. Check out sql/parser/README.md.
If your feature needs to use SQL expressions:
analyzeExpr() and do the same;RunFilter() for conditions instead of evaluating manually and comparing
the DBool value.If your feature needs to support placeholders:
WalkableStmt if you're adding a new statement type;pgwire_test.go.If your feature is adding a new SQL statement type:
None.