rfcs/limit-over-join-optimization.md
---
authors: Gil Mizrahi <[email protected]>
discussion:
https://github.com/hasura/graphql-engine-mono/pull/2239
state: draft
---
Optimize GraphQL queries containing a relationships and a limit by limiting the amount of returned results before joining the relationship.
Currently when a user runs a complex query with relationships and is using the limit operator, we construct an SQL query for postgresql that looks somewhat like this:
SELECT *
FROM <base-table> LEFT JOIN <other-table>
LIMIT <limit>;
Since join is an expensive operation, it would be useful if we could limit the number of rows it needs to process before running the join.
In SQL, trying to push limits down into each side of the join is not a semantic perserving operation. This is because the relationship between the two sides is unspecified, and could be one-to-one, many-to-one, or many-to-many.
For example, in a database of users and streaming providers, a user could be subscribed to multiple providers, and streaming providers provide services to multiple users. trying to get all users and their providers, limit by 10, is different than:
For this reason, postgresql will not apply this optimization when it is valid, because it cannot distinguish the cases.
Fortunately, in GraphQL we do specify either have a one-to-one relationship, which means that we can limit one side and get the same result, or we have a one-to-many relationship where we aggregate the results, so we can limit the side of the "one", this side is always the root table in the query, or the "base" table.
It can improve the performance of queries by orders of magnitude (as described by a customer). Was requested by customers (graphql-engine/#5745) which consider this feature a must-have.
Implement this optimization ourselves by pushing the LIMIT into the base table. This has a few caveats:
json_agg function), otherwise the results order is unspecified.Because of (2) and (3) this optimization is only valid when the columns referred from any DISTINCT or ORDER BY are from the base table. If other columns exist, this optimization is not valid.
We can verify the feature works by writing tests inspecting the generated SQL.
Work on this features has been implemented in #2239 by changing the way we translate RQL ASTs to postresql ASTs. This optimization might be better to express as an SQL to SQL transformation.
In order to refactor this code, we'd need to first document the Postgres.Translate.Select module. After that we could refactor this optimization to a straightforward translation of RQL to SQL and then an SQL transformation.