beps/0005-split-backend-discovery/README.md
The goal of this BEP is to define the architecture for an automatic discovery API that handles split backends. While users can use the current target-based config, it is not runtime driven, and adding plugins requires a config update. The new system allows existing backends to register with gateway nodes at start time, allows them to unregister before the program exits, and catches system errors by timing out plugin registrations that haven't been refreshed recently.
Split backends are a consistently difficult space to operate in and design for. There has been a growing desire for the framework to provide a way to get a list of the installed plugins. This was nearly impossible in the old backend, where plugins were hosted on denormalized routes and had non-standard startup sequences. In the new backend, this has become significantly more doable. Moving this forward would unblock a number of cases that require knowledge of your entire Backstage installation, namely a single OpenAPI spec for your instance, checking installed permissions, and DevTools information.
Ideally, this work will also make it easier for adopters to go down the path of split backends.
A gateway node serves as the primary discovery point: it has all of the information necessary to route traffic through your system. The gateway node is not an API gateway; it only knows which plugins are on which nodes and will not automatically route traffic to the correct instance.

Every other node needs to call a gateway node for routing outside of its own plugins. Such a node still fires requests to other instances directly, but does not know which instances host which plugins.
We propose a new Discovery plugin that exposes a set of HTTP endpoints allowing dynamic registration and unregistration of plugins across multiple instances in a deployment. This allows a single instance or type of instance (a "gateway node") to hold information about all installed plugins across your entire deployment, which may span multiple instances. The API would look like this:
```yaml
openapi: 3.0.3
info:
  title: Discovery API
  version: v1
paths:
  /register:
    post:
      summary: Register a plugin with the gateway.
      operationId: registerPlugin
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                pluginId:
                  type: string
                externalUrl:
                  type: string
                internalUrl:
                  type: string
        required: true
      responses:
        '200':
          description: Successful operation
        '400':
          description: Invalid input
  /unregister:
    post:
      summary: Unregister a plugin with the gateway.
      operationId: unregisterPlugin
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                pluginId:
                  type: string
                externalUrl:
                  type: string
                internalUrl:
                  type: string
        required: true
      responses:
        '200':
          description: Successful operation
        '400':
          description: Invalid input or already unregistered.
  /registrations:
    get:
      summary: Get all registered pluginIds
      operationId: listRegistrations
      responses:
        '200':
          description: Success
          content:
            application/json:
              schema:
                type: array
                items:
                  type: string
  /by-plugin/{pluginId}:
    get:
      summary: Get registration URLs for a given pluginId.
      operationId: getRegistrationByPlugin
      parameters:
        - name: pluginId
          in: path
          description: The plugin ID to get information for.
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Success
          content:
            application/json:
              schema:
                type: object
                properties:
                  internal:
                    type: string
                  external:
                    type: string
```
Backend `DiscoveryService` requests will either:

- resolve to a plugin hosted on the same instance, falling back to the existing `HostDiscovery` implementation, or
- resolve to a plugin hosted elsewhere, which requires extra routing information.

To get that extra information, the backend instances will call the gateway node's `by-plugin/{pluginId}` endpoint and route with the given response.
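The routing split described above can be sketched as follows. This is a hypothetical illustration, not the final implementation: the class name, the injected fetch function, and the gateway URL shape are all assumptions made for the example.

```typescript
// Sketch: a DiscoveryService-like resolver that answers locally-hosted plugins
// itself (as HostDiscovery does today) and asks the gateway node's
// /by-plugin/{pluginId} endpoint for everything else.
type FetchLike = (url: string) => Promise<{ json(): Promise<unknown> }>;

interface ResolvedUrls {
  internal: string;
  external: string;
}

class GatewayAwareDiscovery {
  constructor(
    // pluginId -> base URL for plugins hosted on this instance
    private readonly localPlugins: Map<string, string>,
    private readonly gatewayUrl: string,
    private readonly fetchFn: FetchLike,
  ) {}

  async getBaseUrl(pluginId: string): Promise<string> {
    // Local plugins resolve without any network round trip.
    const local = this.localPlugins.get(pluginId);
    if (local) return local;
    // Everything else is looked up through the gateway node.
    const res = await this.fetchFn(`${this.gatewayUrl}/by-plugin/${pluginId}`);
    const urls = (await res.json()) as ResolvedUrls;
    return urls.internal;
  }
}
```

Injecting the fetch function keeps the routing policy testable without a running gateway.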
We also propose adding a new method to this service to give a list of plugins across the deployment:

```ts
interface DiscoveryService {
  // ...
  listPlugins: () => Promise<string[]>;
}
```
This new method is expected to call the gateway node's `discovery://registrations` endpoint.
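A minimal sketch of that method, assuming the gateway's `/registrations` endpoint returns a JSON array of pluginIds as in the spec above. The function shape and injected fetch are illustrative only.

```typescript
// Sketch: listPlugins simply fetches the gateway node's /registrations
// endpoint, which returns a JSON array of registered pluginIds.
async function listPlugins(
  gatewayUrl: string,
  fetchFn: (url: string) => Promise<{ json(): Promise<unknown> }>,
): Promise<string[]> {
  const res = await fetchFn(`${gatewayUrl}/registrations`);
  return (await res.json()) as string[];
}
```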
On startup, each instance will send a request to the gateway node's `discovery://register` endpoint with information about the instance and which plugins it owns. This request must be signed using service-to-service auth keys between the two instances to prevent malicious registrations. We may revisit this in the future.
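The registration call might look like the sketch below. The token acquisition is out of scope here (in practice it would come from the instance's service-to-service auth); the field names mirror the `/register` schema above, while the helper name is an assumption.

```typescript
// Sketch: build the signed POST request an instance sends to the gateway's
// /register endpoint on startup. The token stands in for a real
// service-to-service credential.
interface Registration {
  pluginId: string;
  internalUrl: string;
  externalUrl: string;
}

function buildRegisterRequest(reg: Registration, token: string) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Signed service-to-service token; the gateway rejects unsigned requests.
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(reg),
  };
}
```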
**InstanceMetadataService**

While we could attach the existing information to the `PluginMetadataService`, we propose a new service that handles instance-level information. The existing `PluginMetadataService` should reveal information about the plugin itself: its pluginId, its dependencies, and so on. The new `InstanceMetadataService` should give you information about the entire Backstage instance that you're interrogating. At launch, this should include the list of features installed on the instance, which can then be aggregated by the discovery API across the gateway nodes. One could imagine this service also carrying information about instance URLs, health, or gateway status.
```ts
interface InstanceMetadataService {
  listFeatures: () => BackendFeature[];
  // or
  listFeatures: () => string[]; // list of pluginIds/moduleIds
}
```
This will leverage the existing `DiscoveryApi`. We propose adding a new method, `listPlugins`, that will return a list of all plugins installed in your Backstage deployment.

```ts
interface DiscoveryApi {
  // ...
  listPlugins: () => Promise<string[]>;
}
```
All methods will just call the gateway node's HTTP discovery endpoint for the data; see the diagram for more information.
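On the frontend that could look like the following sketch, which resolves the discovery backend's own base URL through the existing `DiscoveryApi` and then fetches `/registrations`. The wrapper class name and the `'discovery'` pluginId are assumptions for illustration.

```typescript
// Sketch: a frontend DiscoveryApi wrapper that adds listPlugins on top of an
// existing getBaseUrl implementation.
interface DiscoveryApi {
  getBaseUrl(pluginId: string): Promise<string>;
}

class ListingDiscoveryApi {
  constructor(
    private readonly inner: DiscoveryApi,
    private readonly fetchFn: (url: string) => Promise<{ json(): Promise<unknown> }>,
  ) {}

  getBaseUrl(pluginId: string): Promise<string> {
    return this.inner.getBaseUrl(pluginId);
  }

  async listPlugins(): Promise<string[]> {
    // Resolve the discovery backend itself, then ask it for all registrations.
    const base = await this.inner.getBaseUrl('discovery');
    const res = await this.fetchFn(`${base}/registrations`);
    return (await res.json()) as string[];
  }
}
```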
The primary concern with having multiple gateway nodes is keeping them aligned on which plugins are installed across the deployment. For this, we propose a new database table that will store:
```ts
export interface PluginRegistrations {
  plugin_id: string;
  internal_url: string;
  external_url: string;
  last_health_check: Date; // stored as a timestamp column
}
```
As any gateway node could be hit by any given plugin, the implementation should not rely on in-memory values per node. Gateway nodes should read and write from/to the database directly.
The triplet `plugin_id`, `internal_url`, `external_url` should be unique. We may have multiple plugins on multiple URLs, either internal or external. Routing in those cases is not covered in this BEP; horizontally scaled plugins should use external technologies to route requests and handle load balancing, and this should be reflected in their `backend.baseUrl` properties. Instance IP addresses should not be sent or stored in this database.
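The uniqueness rule can be illustrated with a pure-TypeScript sketch of the store, independent of the actual database: registrations are keyed by the triplet, so re-registering the same triplet refreshes `last_health_check` instead of adding a row. The class and key encoding are assumptions, not the proposed schema itself.

```typescript
// Sketch: an in-memory stand-in for the registrations table demonstrating
// the (plugin_id, internal_url, external_url) uniqueness constraint.
interface PluginRegistrationRow {
  plugin_id: string;
  internal_url: string;
  external_url: string;
  last_health_check: number; // epoch millis for the sketch
}

class RegistrationStore {
  private rows = new Map<string, PluginRegistrationRow>();

  private key(r: Omit<PluginRegistrationRow, 'last_health_check'>): string {
    return `${r.plugin_id}|${r.internal_url}|${r.external_url}`;
  }

  // Same triplet -> refresh the health-check timestamp; new triplet -> new row.
  upsert(r: Omit<PluginRegistrationRow, 'last_health_check'>, now: number): void {
    this.rows.set(this.key(r), { ...r, last_health_check: now });
  }

  list(): PluginRegistrationRow[] {
    return [...this.rows.values()];
  }
}
```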
To prevent stale data, we propose implementing heartbeats. Each plugin should expose a /health endpoint (this will be configurable per plugin). When registering, the plugin will send its internal and external reachable URLs to the gateway node. After registering, every x seconds the gateway node will send a heartbeat request to the health endpoint. That endpoint is expected to return a 200 HTTP response. If it doesn't respond with a 200 or doesn't respond within a reasonable timeout, that plugin will be considered unregistered at its instance URL. If this represents the last instance URL for a given plugin, the entire plugin will be considered unavailable.
The goal with adopting heartbeats is to leverage the decentralized nature of health checks per plugin to enable horizontal scaling easily. Only a single instance has to respond to a health check for it to pass.
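The failure policy can be sketched as below. The interval wiring (`setInterval`) and real HTTP client are omitted; the probe function is injected so the only logic shown is "anything other than a 200 within the timeout counts as a failed heartbeat". Names and the fixed `/health` suffix are assumptions.

```typescript
// Sketch: probe each registered URL's /health endpoint and return the URLs
// whose registrations should be considered stale (non-200 or timed out).
type HealthProbe = (url: string) => Promise<number>; // resolves to HTTP status

async function checkRegistrations(
  urls: string[],
  probe: HealthProbe,
  timeoutMs: number,
): Promise<string[]> {
  const stale: string[] = [];
  for (const url of urls) {
    // A timeout resolves to status 0, which fails the 200 check below.
    const timeout = new Promise<number>((resolve) =>
      setTimeout(() => resolve(0), timeoutMs),
    );
    const status = await Promise.race([probe(`${url}/health`), timeout]);
    if (status !== 200) stale.push(url);
  }
  return stale;
}
```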
These would initially be implemented naively, using in-memory intervals on each gateway node that a plugin registers with. This brings with it two risks that could result in lost traffic; both may cause a temporary outage in plugin traffic.
<!-- TODO: Figure out a good story around error handling and restoring state. -->

Each plugin will send a request to the gateway node's `/unregister` endpoint on shutdown. Responses will not block the shutdown process. This will remove the plugin/instance combination from the database. As we have to consider horizontally scaled deployments, this may be triggered multiple times for a given plugin/instance combination; for duplicates, the endpoint will respond with a 400.
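The "never block shutdown" behavior can be sketched as below. The function name and injected fetch are assumptions; the payload mirrors the `/unregister` schema above.

```typescript
// Sketch: fire the unregister request on shutdown, but swallow any failure
// (network error, or a 400 from a duplicate unregister) so the process can
// always exit cleanly.
async function unregisterOnShutdown(
  gatewayUrl: string,
  payload: { pluginId: string; internalUrl: string; externalUrl: string },
  fetchFn: (url: string, init: object) => Promise<unknown>,
): Promise<void> {
  try {
    await fetchFn(`${gatewayUrl}/unregister`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
  } catch {
    // Errors must not block the shutdown process.
  }
}
```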
An alternative would involve sending a request on instance startup with a list of the current plugins and information about how to route to them. While this works well when the instance restarts on every plugin addition, it struggles to unregister plugins when an instance crashes without restarting, and it puts more load on the gateway node than a pure health-check/heartbeat approach.