beps/0005-split-backend-discovery/README.md
The goal of this BEP is to define the architecture for an automatic discovery API that handles split backends. While users can use the current target-based config, it is not runtime driven, and adding plugins requires a config update. The new system allows existing backends to register with gateway nodes at start time, allows them to unregister before the program exits, and catches system errors by timing out plugin registrations that haven't been refreshed recently.
Split backends are a consistently difficult space to operate in and design for. There has been a growing desire for the framework to provide a way to get a list of the installed plugins. This was nearly impossible in the old backend, where plugins were hosted on denormalized routes and had non-standard startup sequences. In the new backend, this has become significantly more doable. Moving this forward would unblock a number of cases that require knowledge of your entire Backstage installation, namely a single OpenAPI spec for your instance, checking installed permissions, and DevTools information.
Ideally, this work will also make it easier for adopters to go down the path of split backends.
A gateway node serves as the primary discovery point: it has all of the information necessary to route traffic through your system. The gateway node is not an API gateway; it only knows which plugins are on which nodes and will not automatically route traffic to the correct instance.

Every other node needs to call a gateway node for routing outside of its own plugins. Such a node still fires requests to other instances directly, but does not know which instances host which plugins.
We propose a new Discovery plugin that exposes a set of HTTP endpoints allowing dynamic registration and unregistration of plugins across multiple instances in a deployment. This allows a single instance or type of instance (a "gateway node") to hold information about all installed plugins across your entire deployment, which may span multiple instances. The API would look like this:
```yaml
openapi: 3.0.3
info:
  title: Discovery API
  version: v1
paths:
  /register:
    post:
      summary: Register a plugin with the gateway.
      operationId: registerPlugin
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                pluginId:
                  type: string
                externalUrl:
                  type: string
                internalUrl:
                  type: string
        required: true
      responses:
        '200':
          description: Successful operation
        '400':
          description: Invalid input
  /unregister:
    post:
      summary: Unregister a plugin with the gateway.
      operationId: unregisterPlugin
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                pluginId:
                  type: string
                externalUrl:
                  type: string
                internalUrl:
                  type: string
        required: true
      responses:
        '200':
          description: Successful operation
        '400':
          description: Invalid input or already unregistered.
  /registrations:
    get:
      summary: Get all registered pluginIds
      operationId: listRegistrations
      responses:
        '200':
          description: Success
          content:
            application/json:
              schema:
                type: array
                items:
                  type: string
  /by-plugin/{pluginId}:
    get:
      summary: Get registration URLs for a given pluginId.
      operationId: getRegistrationByPlugin
      parameters:
        - name: pluginId
          in: path
          description: The plugin ID to get information for.
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Success
          content:
            application/json:
              schema:
                type: object
                properties:
                  internal:
                    type: string
                  external:
                    type: string
```
Backend `DiscoveryService` requests will either:

- resolve to a plugin hosted on the same instance, falling back to the existing `HostDiscovery` implementation, or
- resolve to a plugin hosted elsewhere, which requires extra routing information.

To get that extra information, the backend instances will call the gateway node's `by-plugin/{pluginId}` endpoint and route with the given response.
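The routing split described above can be sketched as follows. This is a hypothetical illustration, not the final implementation: the class name, the injected fetch function, and the gateway URL shape are all assumptions made for the example.

```typescript
// Sketch: a DiscoveryService-like resolver that answers locally-hosted plugins
// itself (as HostDiscovery does today) and asks the gateway node's
// /by-plugin/{pluginId} endpoint for everything else.
type FetchLike = (url: string) => Promise<{ json(): Promise<unknown> }>;

interface ResolvedUrls {
  internal: string;
  external: string;
}

class GatewayAwareDiscovery {
  constructor(
    // pluginId -> base URL for plugins hosted on this instance
    private readonly localPlugins: Map<string, string>,
    private readonly gatewayUrl: string,
    private readonly fetchFn: FetchLike,
  ) {}

  async getBaseUrl(pluginId: string): Promise<string> {
    // Local plugins resolve without any network round trip.
    const local = this.localPlugins.get(pluginId);
    if (local) return local;
    // Everything else is looked up through the gateway node.
    const res = await this.fetchFn(`${this.gatewayUrl}/by-plugin/${pluginId}`);
    const urls = (await res.json()) as ResolvedUrls;
    return urls.internal;
  }
}
```

Injecting the fetch function keeps the routing policy testable without a running gateway.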
We also propose adding a new method to this service to give a list of plugins across the deployment:

```ts
interface DiscoveryService {
  // ...
  listPlugins: () => Promise<string[]>;
}
```
This new method is expected to call the gateway node's `discovery://registrations` endpoint.
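A minimal sketch of that method, assuming the gateway's `/registrations` endpoint returns a JSON array of pluginIds as in the spec above. The function shape and injected fetch are illustrative only.

```typescript
// Sketch: listPlugins simply fetches the gateway node's /registrations
// endpoint, which returns a JSON array of registered pluginIds.
async function listPlugins(
  gatewayUrl: string,
  fetchFn: (url: string) => Promise<{ json(): Promise<unknown> }>,
): Promise<string[]> {
  const res = await fetchFn(`${gatewayUrl}/registrations`);
  return (await res.json()) as string[];
}
```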
On startup, each instance will send a request to the gateway node's `discovery://register` endpoint with information about the instance and which plugins it owns. This request must be signed using service-to-service auth keys between the two instances to prevent malicious registrations. We may revisit this in the future.
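The registration call might look like the sketch below. The token acquisition is out of scope here (in practice it would come from the instance's service-to-service auth); the field names mirror the `/register` schema above, while the helper name is an assumption.

```typescript
// Sketch: build the signed POST request an instance sends to the gateway's
// /register endpoint on startup. The token stands in for a real
// service-to-service credential.
interface Registration {
  pluginId: string;
  internalUrl: string;
  externalUrl: string;
}

function buildRegisterRequest(reg: Registration, token: string) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Signed service-to-service token; the gateway rejects unsigned requests.
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(reg),
  };
}
```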
**InstanceMetadataService**

While we could attach the existing information to the `PluginMetadataService`, we propose a new service that handles instance-level information. The existing `PluginMetadataService` should reveal information about the plugin itself: its pluginId, its dependencies, and so on. The new `InstanceMetadataService` should give you information about the entire Backstage instance that you're interrogating. At launch, this should include the list of features installed on the instance, which can then be aggregated by the discovery API across the gateway nodes. One could imagine this service also carrying information about instance URLs, health, or gateway status.
```ts
interface InstanceMetadataService {
  listFeatures: () => BackendFeature[];
  // or
  listFeatures: () => string[]; // list of pluginIds/moduleIds
}
```
This will leverage the existing `DiscoveryApi`. We propose adding a new method, `listPlugins`, that will return a list of all plugins installed in your Backstage deployment.

```ts
interface DiscoveryApi {
  // ...
  listPlugins: () => Promise<string[]>;
}
```
All methods will just call the gateway node's HTTP discovery endpoint for the data; see the diagram for more information.
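On the frontend that could look like the following sketch, which resolves the discovery backend's own base URL through the existing `DiscoveryApi` and then fetches `/registrations`. The wrapper class name and the `'discovery'` pluginId are assumptions for illustration.

```typescript
// Sketch: a frontend DiscoveryApi wrapper that adds listPlugins on top of an
// existing getBaseUrl implementation.
interface DiscoveryApi {
  getBaseUrl(pluginId: string): Promise<string>;
}

class ListingDiscoveryApi {
  constructor(
    private readonly inner: DiscoveryApi,
    private readonly fetchFn: (url: string) => Promise<{ json(): Promise<unknown> }>,
  ) {}

  getBaseUrl(pluginId: string): Promise<string> {
    return this.inner.getBaseUrl(pluginId);
  }

  async listPlugins(): Promise<string[]> {
    // Resolve the discovery backend itself, then ask it for all registrations.
    const base = await this.inner.getBaseUrl('discovery');
    const res = await this.fetchFn(`${base}/registrations`);
    return (await res.json()) as string[];
  }
}
```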
The primary concern with having multiple gateway nodes is keeping them aligned on which plugins are installed across the deployment. For this, we propose a new database table that will store:
```ts
export interface PluginRegistrations {
  plugin_id: string;
  internal_url: string;
  external_url: string;
  last_health_check: Date; // stored as a timestamp column
}
```
As any gateway node could be hit by any given plugin, the implementation should not rely on in-memory values per node. Gateway nodes should read and write from/to the database directly.
The triplet `plugin_id`, `internal_url`, `external_url` should be unique. We may have multiple plugins on multiple URLs, either internal or external. Routing in those cases is not covered in this BEP; horizontally scaled plugins should use external technologies to route requests and handle load balancing, and this should be reflected in their `backend.baseUrl` properties. Instance IP addresses should not be sent or stored in this database.
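The uniqueness rule can be illustrated with a pure-TypeScript sketch of the store, independent of the actual database: registrations are keyed by the triplet, so re-registering the same triplet refreshes `last_health_check` instead of adding a row. The class and key encoding are assumptions, not the proposed schema itself.

```typescript
// Sketch: an in-memory stand-in for the registrations table demonstrating
// the (plugin_id, internal_url, external_url) uniqueness constraint.
interface PluginRegistrationRow {
  plugin_id: string;
  internal_url: string;
  external_url: string;
  last_health_check: number; // epoch millis for the sketch
}

class RegistrationStore {
  private rows = new Map<string, PluginRegistrationRow>();

  private key(r: Omit<PluginRegistrationRow, 'last_health_check'>): string {
    return `${r.plugin_id}|${r.internal_url}|${r.external_url}`;
  }

  // Same triplet -> refresh the health-check timestamp; new triplet -> new row.
  upsert(r: Omit<PluginRegistrationRow, 'last_health_check'>, now: number): void {
    this.rows.set(this.key(r), { ...r, last_health_check: now });
  }

  list(): PluginRegistrationRow[] {
    return [...this.rows.values()];
  }
}
```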
To prevent stale data, we propose implementing heartbeats. Each plugin should expose a /health endpoint (this will be configurable per plugin). When registering, the plugin will send its internal and external reachable URLs to the gateway node. After registering, every x seconds the gateway node will send a heartbeat request to the health endpoint. That endpoint is expected to return a 200 HTTP response. If it doesn't respond with a 200 or doesn't respond within a reasonable timeout, that plugin will be considered unregistered at its instance URL. If this represents the last instance URL for a given plugin, the entire plugin will be considered unavailable.
The goal with adopting heartbeats is to leverage the decentralized nature of health checks per plugin to enable horizontal scaling easily. Only a single instance has to respond to a health check for it to pass.
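The failure policy can be sketched as below. The interval wiring (`setInterval`) and real HTTP client are omitted; the probe function is injected so the only logic shown is "anything other than a 200 within the timeout counts as a failed heartbeat". Names and the fixed `/health` suffix are assumptions.

```typescript
// Sketch: probe each registered URL's /health endpoint and return the URLs
// whose registrations should be considered stale (non-200 or timed out).
type HealthProbe = (url: string) => Promise<number>; // resolves to HTTP status

async function checkRegistrations(
  urls: string[],
  probe: HealthProbe,
  timeoutMs: number,
): Promise<string[]> {
  const stale: string[] = [];
  for (const url of urls) {
    // A timeout resolves to status 0, which fails the 200 check below.
    const timeout = new Promise<number>((resolve) =>
      setTimeout(() => resolve(0), timeoutMs),
    );
    const status = await Promise.race([probe(`${url}/health`), timeout]);
    if (status !== 200) stale.push(url);
  }
  return stale;
}
```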
These would initially be implemented naively, using in-memory intervals on each gateway node that a plugin registers with. This brings with it two risks that could result in lost traffic; both may cause a temporary outage in plugin traffic.
<!-- TODO: Figure out a good story around error handling and restoring state. -->

Each plugin will send a request to the gateway node's `/unregister` endpoint on shutdown. Responses will not block the shutdown process. This will remove the plugin/instance combination from the database. As we have to consider horizontally scaled deployments, this may be triggered multiple times for a given plugin/instance combination; for duplicates, the endpoint will respond with a 400.
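The "never block shutdown" behavior can be sketched as below. The function name and injected fetch are assumptions; the payload mirrors the `/unregister` schema above.

```typescript
// Sketch: fire the unregister request on shutdown, but swallow any failure
// (network error, or a 400 from a duplicate unregister) so the process can
// always exit cleanly.
async function unregisterOnShutdown(
  gatewayUrl: string,
  payload: { pluginId: string; internalUrl: string; externalUrl: string },
  fetchFn: (url: string, init: object) => Promise<unknown>,
): Promise<void> {
  try {
    await fetchFn(`${gatewayUrl}/unregister`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
  } catch {
    // Errors must not block the shutdown process.
  }
}
```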
An alternative would involve sending a request on instance startup with a list of the current plugins and information about how to route to them. While this works well when the instance restarts on every plugin addition, it struggles to unregister plugins when an instance crashes without restarting, and it puts more load on the gateway node than a pure health-check/heartbeat approach.