rfcs/2020-08-31-3645-graphql-api.md
This RFC proposes using GraphQL for the Vector observability API.
The Vector team is working on an observability dashboard that will enable users to:
vector.toml configuration.
The protocol used for communication between Vector and a connecting client is required to deliver data that is:
The initial proposal is to provide observability via two clients:
vector top / vector tap CLI commands.
This RFC focuses on the web UI, but applies equally to the CLI client, since the CLI may also observe a remote Vector instance.
This section summarizes various communication protocols and their disadvantages for Vector observability:
REST:
gRPC:
WebSockets:
I propose using GraphQL for API communications.
Advantages:
async-graphql (Vector). I initially tried Juniper, but subscriptions (i.e. real-time data) are still WIP and several key elements of the GraphQL spec (such as interfaces) are TBD. Source: Feature comparison.
urql (UI). We used the React Apollo client in Alloy, and the experience was mostly positive. We did run into some issues where data that lacks an id field would return null. Urql has a simpler caching story and may side-step these issues. Both clients use React hooks, which matches our vector-ui tooling.
GraphQL Code Generator (UI). We used this internally at Timber to generate TypeScript types and React hooks. It offers an urql plugin and builds type-safe, declarative React hooks that wrap urql and re-render the host React component with data and loading state.
I started work in #3514 to test async-graphql, and expose internal metrics.
A playground is available in #3514 to test queries, including subscription queries:
One of the major benefits of opting for the GraphQL ecosystem is enabling a more declarative style for defining the API, and consuming it.
The following example uses snippets from our current 'heartbeat' subscription, which sends a UTC timestamp to a connected WebSocket client every interval milliseconds.
In this example, I will demonstrate:
async-graphql
async-graphql is 'code first'; method implementations become GraphQL SDL and provide an implicit HTTP flow against an incoming request (in my PoC, I used Warp, since we already depend on it).
#[SimpleObject]
pub struct Heartbeat { // <-- simple GraphQL object type to provide a `utc` field
    utc: DateTime<Utc>,
}

impl Heartbeat {
    fn new() -> Self {
        Heartbeat { utc: Utc::now() }
    }
}

#[derive(Default)]
pub struct HealthSubscription; // <-- 'root' subscription type to merge

#[Subscription]
impl HealthSubscription {
    /// Heartbeat, containing the UTC timestamp of the last server-sent payload
    async fn heartbeat(
        &self,
        #[arg(default = 1000, validator(IntRange(min = "100", max = "60_000")))] interval: i32,
        // ^^ `interval` param -- defaults to 1,000ms; validates between 100ms - 60 seconds
    ) -> impl Stream<Item = Heartbeat> {
        // Return a stream of heartbeats
        tokio::time::interval(Duration::from_millis(interval as u64)).map(|_| Heartbeat::new())
    }
}
In the GraphQL playground, this is surfaced as a strongly typed API. Doc comments become GraphQL API comments:
heartbeat Subscription
The above example can be queried with:
subscription {
  heartbeat(interval: 1000) {
    utc
  }
}
This returns data as JSON every interval milliseconds, e.g.:
{
  "data": {
    "heartbeat": {
      "utc": "2020-08-31T13:10:47.152412+00:00"
    }
  }
}
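For comparison, here is a minimal sketch of the hand-written validation a client would need for this payload if no types were generated. The names HeartbeatPayload, isHeartbeatPayload and parseHeartbeat are hypothetical and are not part of Vector or any library; the shape is taken from the JSON example above.

```typescript
// Shape of one subscription message, mirroring the JSON example in this RFC.
interface HeartbeatPayload {
  data: { heartbeat: { utc: string } };
}

// Hypothetical helper: narrows an unknown value to HeartbeatPayload.
function isHeartbeatPayload(value: unknown): value is HeartbeatPayload {
  if (typeof value !== "object" || value === null) return false;
  const data = (value as { data?: unknown }).data;
  if (typeof data !== "object" || data === null) return false;
  const heartbeat = (data as { heartbeat?: unknown }).heartbeat;
  if (typeof heartbeat !== "object" || heartbeat === null) return false;
  return typeof (heartbeat as { utc?: unknown }).utc === "string";
}

// Hypothetical helper: returns the timestamp string, or null when the text
// is not valid JSON or does not match the expected shape.
function parseHeartbeat(raw: string): string | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (_err) {
    return null;
  }
  return isHeartbeatPayload(parsed) ? parsed.data.heartbeat.utc : null;
}
```

With generated types, this kind of guard is produced from the schema rather than maintained by hand.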
Urql
After generating types and the web client with GraphQL Code Generator, the (simplified) implementation looks similar to this:
import React from "react";

// This is generated for us by GraphQL Code Generator
import { useHeartbeatSubscription } from "@/vector/graphql";

// Example component that consumes it
const ExampleComponent: React.FC = () => {
  const [{ data, fetching }] = useHeartbeatSubscription({
    variables: { interval: 1000 },
  });

  // `data` mirrors the query shape, so the timestamp sits under `heartbeat`
  return <pre>{data?.heartbeat.utc}</pre>;
};
This renders the HTML <pre>2020-08-31T13:10:47.152412+00:00</pre>, and auto-refreshes the data with a new UTC timestamp received from the server every 1000ms.
Many of the benefits we receive with GraphQL are felt during development.
Types and clients are auto-generated on introspection of a live endpoint.
That advantage is hard to appreciate in static code blocks. For that reason, I've recorded a 16-minute live coding session which demonstrates the typical dev workflow in the front-end. Use this tree in vector-ui to follow along:
In the above example, we side-stepped a lot of complexity that would otherwise have to be explicitly designed for:
The method implementation becomes the public API. There's no additional schema to maintain.
Compile-time type safety (both server and client). If the query included invalid data, it would fail to compile in Vector and in GraphQL Code Generator in the UI.
Declarative client in the UI that handles the distinction between HTTP/WebSocket connections (for query / mutation and subscription queries, respectively), (re)connection handling, bookkeeping of parallel in-flight requests, response caching, request fetching status, matching responses with requests, and React component re-rendering. It saves considerable time compared with explicitly designing for those scenarios ourselves.
A type system that accommodates interfaces, unions, enums, primitives and custom scalar types. async-graphql provides built-in abstractions for chrono DateTime types, uuid, and other popular crates.
A known schema for errors. Snafu compatibility for FieldResult<T> custom errors.
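To give a sense of the bookkeeping the client library takes off our hands, here is a minimal sketch of matching responses to in-flight operations on a single socket. OperationRegistry is a hypothetical name, and the "start" / "data" message shapes are assumed from the subscriptions-transport-ws convention rather than taken from the PoC; a real client also handles reconnection, caching and errors.

```typescript
// Message shapes assumed from the subscriptions-transport-ws convention.
interface StartMessage {
  id: string;
  type: "start";
  payload: { query: string; variables: { [name: string]: unknown } };
}

interface DataMessage {
  id: string;
  type: "data";
  payload: { data: unknown };
}

type Handler = (data: unknown) => void;

// Hypothetical registry that routes each server message to the operation that requested it.
class OperationRegistry {
  private nextId = 1;
  private handlers: { [id: string]: Handler } = {};

  // Registers a handler and returns the message to send over the socket.
  start(query: string, variables: { [name: string]: unknown }, onData: Handler): StartMessage {
    const id = String(this.nextId++);
    this.handlers[id] = onData;
    return { id, type: "start", payload: { query, variables } };
  }

  // Delivers a server message to its handler; returns false if the id is unknown.
  dispatch(message: DataMessage): boolean {
    const handler = this.handlers[message.id];
    if (handler === undefined) return false;
    handler(message.payload.data);
    return true;
  }

  // Forgets an operation, e.g. when the owning component unmounts.
  stop(id: string): void {
    delete this.handlers[id];
  }
}
```

Each React component would otherwise need to register, dispatch and clean up in this way; the generated urql hooks hide that behind a single call.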
Initial Vector observability will be single-instance, and available to anyone that has access to the configured port.
Locking down access will initially rely on network configuration.
Later, as we move into multi-instance observability and more granular API permissions, the requirement for user auth and persistence will surface.
While those concerns are out-of-scope for this RFC, choosing a protocol that facilitates authentication is important to avoid backing ourselves into a corner.
GraphQL is not opinionated about auth. Any authorization mechanism that works over both HTTP and WebSockets is available to us.
In previous Timber projects, we appended an Authorization: Bearer <jwt> header for queries/mutations.
For subscriptions, we passed a JWT along with the initial WebSocket connection payload; browsers pass limited headers with Upgrade requests, so this provided a neat approach to sidestep the lack of a comparable header for WebSockets. The JWT persisted for the life of the open WS.
I anticipate doing something similar with the Vector API. async-graphql has a Context struct, which is typically used for passing in shared resources such as a database connection pool, or request-specific data such as the current user session.
This is largely TBD, but the basic mechanisms are there to allow for flexible auth when we need it.
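As a rough illustration of the two approaches described above, the client side could look something like the following. This is a sketch only: the function names are hypothetical, the "connection_init" message type is assumed from the subscriptions-transport-ws convention, and the token key inside the payload is an arbitrary name the server would have to agree on.

```typescript
// Builds HTTP headers for queries/mutations, carrying the JWT as a bearer token.
function authHeaders(jwt: string): { [name: string]: string } {
  return {
    "Content-Type": "application/json",
    Authorization: `Bearer ${jwt}`,
  };
}

// Builds the first message sent on a newly opened WebSocket.
// The `token` key is a hypothetical name, not something Vector defines.
function connectionInit(jwt: string): string {
  return JSON.stringify({ type: "connection_init", payload: { token: jwt } });
}
```

On the server, the token from either path could then be validated once and placed in the request context for resolvers to read.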
A Vector observability layer has already been agreed internally. Work is underway.
This proposal discusses the protocol which will govern communications between a running Vector instance, and a web UI and CLI.
The result of this proposal won't directly impact user interaction with observability tooling.
We have built a decent body of experience with GraphQL at Timber, albeit on non-Vector projects.
The tooling I am proposing here represents the same stack (save for swapping Apollo for Urql.)
In a previous Timber project, we used gqlgen as our server library, written in Go. The tooling with async-graphql is similar, though Rust's language features enable a more composable 'code first' approach using macros.
I believe the type of data we are consuming benefits from a strongly typed interface, and that compile-time client generation will significantly reduce the time-to-market of what will already be a very complex front-end web app.
From relatively superficial searching, I've not been able to find a comparable approach to using GraphQL for internal metrics / observability. However, there are plenty of large companies using GraphQL for public APIs and very possibly for internal machinery that isn't public facing.
Article refs: GitHub, Facebook, Shopify, Intuit, Airbnb, Trello / Atlassian.
I looked at Rancher and the Kubernetes web UI for comparison, as both offer observability over an API that's intended for internal team use.
Both use JSON payloads. Rancher uses WebSockets to stream data, offering a protocol comparable to the GraphQL schema proposed in this RFC, albeit untyped.
Kubernetes uses OpenAPI v2. There are TypeScript generation tools for OpenAPI such as swagger-to-ts and OpenAPI Generator. I have no experience with these tools. Given the REST interface to each, I'm not sure this compares directly to typed messages over WebSockets for streaming data.
Essentially any browser-compatible protocol could be used, and any text/binary format. This could include other libs such as Protobuf, which might form a hybrid of the aforementioned approaches.
The maturity of tooling would need further investigation, as well as the development experience of working with any given format.
vector top and Vector UI requests?
vector top?
The following comprises links to relevant discussion in the original RFC PR: