console/doc/design/20231103_server_state_management_migration.md
There are specific features we want in the Console that are either impossible or too complex to implement with how we currently manage server data.
jun_test schema then remove filters, we show a loading spinner in between even though we've already fetched the initial state. This is more apparent for pagination when you flip between pages.
useEffect dependency array, automatic reopening when the connection is closed but the environment is healthy, caching of accessToken)
Prefetching
SourcesList.tsx). The data is locally scoped to the component, and anything nested within SourcesList can access this data, but anything out of scope (e.g. the navigation bar) can't. The solution is to hoist the data fetching to the common ancestor. However, this means we can't prefetch data, since each fetch requires a mount of a React component, and sometimes we want to prefetch via browser event handlers (e.g. onClick).
Caching
Retrying
usePoll
useSqlTyped
useSqlMany
useSqlApiRequest
State transitions aren't atomic. They're also duplicated and split between hooks (such as isPolling from usePoll and isLoading from useSqlApiRequest). Because of this, we get unpredictable, convoluted behavior when trying to implement retrying: how would we differentiate an isRetrying state from isPolling and isLoading?
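To illustrate the problem (a hypothetical sketch, not our actual hook code): when each hook owns its own boolean flag, contradictory combinations like `isPolling && isRetrying` are representable. A single discriminated-union status makes the states mutually exclusive and the transitions atomic:

```typescript
// With flags spread across hooks, combinations such as
// { isPolling: true, isLoading: true, isRetrying: true } are representable
// but meaningless. A discriminated union rules them out at the type level:
type QueryStatus =
  | { kind: "idle" }
  | { kind: "loading" }
  | { kind: "polling"; intervalMs: number }
  | { kind: "retrying"; attempt: number }
  | { kind: "error"; error: Error };

// Exhaustive switch: the compiler guarantees every state is handled,
// and "retrying" can never overlap with "polling" or "loading".
function describe(status: QueryStatus): string {
  switch (status.kind) {
    case "idle":
      return "idle";
    case "loading":
      return "loading";
    case "polling":
      return `polling every ${status.intervalMs}ms`;
    case "retrying":
      return `retry attempt ${status.attempt}`;
    case "error":
      return `failed: ${status.error.message}`;
  }
}
```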
Replace polling with websockets
Similar to above, the current server state management is a bit too complex to integrate a push system without weird behavior.
A state management solution for server state that cleanly satisfies the following:
Fetch data, then download code, via server-side rendering
Micro-optimizations to make cached data seem fresh
Normalization of data in cache
```
post_id: {
  title,
  author,
  date_created,
  comments: [commentId]
}
```
similar to other caching solutions like Relay or Apollo. This is because we don't need to worry about cache invalidation issues (e.g. you create a new cluster and need to update anything that fetched that cluster's information), since all our data is considered stale anyway. This makes the desired caching solution and mutation code a lot simpler too.
Multiplexing of websocket connections
Optimistic updates
List Pagination
Integration testing for real time updates
Changing our current queries when not necessary
React Query is a popular, highly configurable JavaScript library that acts as a server state manager. It's optimized for syncing server state, and its core is decoupled from React. It handles caching, prefetching, retries, error states, loading states, refetching, optimistic updates, and garbage collection of the cache. The API is clear and fulfills all the criteria above, and its caching system acts as the source of truth for server data.
Example / General Algorithm:
Go to the source list page. We need a separate SQL query for the list of sources, the list of schemas, and the list of databases. Each of these lists has an ID associated with it. We create a cache key based on the ID and the parameters used in the query. For example, for the queries `SELECT * FROM mz_sources;`, `SELECT * FROM mz_sources WHERE name = 'postgres';`, and `SELECT * FROM mz_sources WHERE name = 'kafka';`, the ID would be `source_list` and the cache keys, in order, would be something like `['source_list', '']`, `['source_list', 'postgres']`, and `['source_list', 'kafka']`.
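A minimal sketch of how such cache keys could be derived (the `buildCacheKey` helper is hypothetical, not the Console's actual implementation):

```typescript
// Hypothetical helper: derive a React Query-style cache key from a
// query's ID plus the parameters that vary between fetches.
type CacheKey = [id: string, ...params: string[]];

function buildCacheKey(id: string, params: Record<string, string>): CacheKey {
  // Sort by parameter name so the same filters always produce the same key.
  const values = Object.entries(params)
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([, value]) => value);
  return [id, ...values];
}

// The unfiltered source list and each filtered variant get distinct keys:
buildCacheKey("source_list", { name: "" });         // ["source_list", ""]
buildCacheKey("source_list", { name: "postgres" }); // ["source_list", "postgres"]
```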
Fetch the snapshot state of sources, schemas, and databases using the queries built via Kysely. If the fetch was successful, we populate the cache with the fetched data via each query’s cache key. React Query automatically handles the caching, retries, and the state of the query here via its query functions.
Only when these fetches are successful and in a steady state (not re-fetching or erroring) do we initiate the WebSocket (WS) connection and start updating the data in real time.
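The ordering above can be sketched as follows (hypothetical function names; the real implementation would hook into React Query's fetch status rather than raw promises):

```typescript
// Hypothetical sketch of the snapshot-then-subscribe ordering: the
// WebSocket is opened only after every snapshot query has settled.
type Snapshot = Record<string, unknown[]>;

async function startRealtime(
  fetchSnapshots: Array<() => Promise<unknown[]>>,
  openSocket: () => void,
  cache: Snapshot,
  keys: string[],
): Promise<void> {
  // 1. Fetch all snapshots over HTTP; Promise.all rejects if any fails,
  //    in which case we never open the socket.
  const results = await Promise.all(fetchSnapshots.map((fetch) => fetch()));
  // 2. Populate the cache under each query's cache key.
  results.forEach((rows, i) => {
    cache[keys[i]] = rows;
  });
  // 3. Only now do we open the WebSocket for real-time updates.
  openSocket();
}
```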
To update the data in real time:
SUBSCRIBE (with SNAPSHOT = true and ENVELOPE UPSERT) via the WebSocket connection.
The WS connection will be closed under the following conditions:
If the WS connection closes for whatever reason, the client can choose to transiently ignore it, retry a number of times, or surface the disconnection via UI.
If you filter the list, we issue a new source list query with a new cache key, running steps 2→3 again.
Note: We don’t need to use the same query for the snapshot fetch and real-time updating. A case where this is important is when the snapshot query calls an unmaterializable function since SUBSCRIBE doesn’t work for unmaterializable functions. The two should be decoupled.
Edge cases
Our RBAC CTEs use mz_is_superuser and current_user. Because these functions are unmaterializable, we're unable to port from polling to SUBSCRIBE using the same queries. With the new system proposed, the initial snapshot would query the object table and the RBAC table, then SUBSCRIBE would query just the object table. The caveat is that when updating the cache, we need to map the result we get from the WS connection onto the snapshot result set.
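A sketch of that mapping (hypothetical row shapes and field names; the real shapes come from our Kysely queries): the WS update carries only object-table columns, so we merge it with the RBAC fields already present in the cached snapshot row.

```typescript
// Hypothetical row shapes: the snapshot joins the object table with RBAC
// info, but the SUBSCRIBE stream carries only object-table columns.
interface SnapshotRow {
  id: string;
  name: string;
  canEdit: boolean; // derived from the RBAC table, absent from the stream
}
interface ObjectUpdate {
  id: string;
  name: string;
}

// Merge a WS update into the cached snapshot, preserving RBAC fields
// the subscription cannot provide.
function applyObjectUpdate(
  snapshot: Map<string, SnapshotRow>,
  update: ObjectUpdate,
): void {
  const existing = snapshot.get(update.id);
  snapshot.set(update.id, {
    ...update,
    // Fall back to a safe default for rows we have no RBAC info for.
    canEdit: existing?.canEdit ?? false,
  });
}
```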
Another approach: instead of including the RBAC CTE in each of our queries, we can query the RBAC CTE for all objects on the Console's initial load, and any client code that needs to hide UI due to RBAC can check client-side via the global privilege table. Then our snapshot query and real-time query can be the same. The caveat is a slower initial load (depending on whether we block the UI) and fetching more data than we need.
Pre-work
Testing
I created an MVP during a Skunkworks in this PR. The PR contains code pretty similar to the final implementation, along with many TODOs.
We plan to do this in the following phases:
This way any future features can use our new system.
These are pages that don't poll. Each of these can be broken into a separate PR/issue. An example is SecretsList.tsx.
These are pages like our Kafka source creation form. An example is NewKafkaSource.tsx.
An example is SinkRoutes.tsx.
So the issue with the proposed solution is you can run into this case:
Snapshot:
| name | age |
|---|---|
| barry | 5 |
| alice | 2 |
Events we miss between the HTTP fetching of the snapshot and setting up the WebSocket connection:
| name | age |
|---|---|
| barry | 5 |
| alice | 1 |

| name | age |
|---|---|
| barry | 5 |
| alice | 3 |
WebSocket sends back the snapshot
| name | age |
|---|---|
| barry | 5 |
| alice | 3 |
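The tables above can be sketched as a last-write-wins cache update (a simplified model of ENVELOPE UPSERT keyed by name; helper names are hypothetical):

```typescript
// Simplified model of applying ENVELOPE UPSERT messages: each upsert
// overwrites the row for its key, so only the latest value survives.
type Row = { name: string; age: number };

function applyUpserts(state: Map<string, number>, rows: Row[]): void {
  for (const row of rows) state.set(row.name, row.age);
}

// HTTP snapshot: barry=5, alice=2.
const cache = new Map<string, number>([
  ["barry", 5],
  ["alice", 2],
]);

// We miss the intermediate event alice=1, but the WS snapshot
// (barry=5, alice=3) still converges the cache to the present state.
applyUpserts(cache, [
  { name: "barry", age: 5 },
  { name: "alice", age: 3 },
]);
// cache now reflects alice=3, the state we care about.
```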
Now you might be asking: if the WebSocket connection is already giving us the snapshot, why are we using HTTP in the first place? We could even use the progress option to derive the loading state! Then, for caching, we could still use React Query and update the cache with our WS messages. The following reasons are why this would make the migration much harder:
With progress for SUBSCRIBE, we get ~3 null messages every second per connection. This makes debugging a nightmare.
Which leads me to ask: does it matter if we miss updates? For our graphs, because we're computing on historic data, it does matter. However, if our job is to show the present state of some data (like the tables above), we don't care that Alice briefly changed to 1 in such a small time frame, since Alice = 3 is now the correct state, the state we care about.
Because the two biggest server-side frameworks, Remix and Next.js, don't support updating their server-side caching systems from client code, we would have to integrate a client-side cache anyway.
SUBSCRIBE, (3 × 3 × 1000 = 9000 connections). Is this a realistic case, and is that okay for environmentd?
Can mz_is_superuser and current_user be made materializable? If so, how long would this change take? Asking since it blocks the work of this migration (refer to the "can't subscribe on our RBAC CTEs" section).