beps/0012-metrics-service/README.md
Add a core MetricsService to Backstage's framework to provide a unified interface for metrics instrumentation. The service builds on an industry standard, OpenTelemetry (OTEL), while keeping the MetricsService focused on Backstage-specific concerns, following the same pattern as other core services (DatabaseService builds on Knex, LoggerService builds on Winston, HttpRouterService builds on Express, etc.).
While individual plugins may implement their own metrics, there is no standardized approach, which leads to inconsistent metrics patterns across the ecosystem and incompatibility with OpenTelemetry semantic conventions. For example, a plugin implementing MCP functionality might incorrectly namespace metrics as backstage_mcp_client_duration when OpenTelemetry semantic conventions explicitly define mcp.client.operation.duration as the standard.
By providing a core metrics service:
The catalog and scaffolder plugins will be updated to use the new metrics service in the initial alpha release.
LoggerService. Future work to unify observability-related concerns would be ideal, but is not a goal.
Following similar patterns to other core services, create a new RootMetricsService responsible for root-level concerns and the creation of plugin-specific MetricsService instances.
All Backstage metrics follow this hierarchical pattern:
backstage.{scope}.{scope_name}.{metric_name}
Where:
- backstage is the root namespace for all Backstage metrics
- {scope} is the system scope (either plugin or core)
- {scope_name} is the name of the plugin or core service (e.g., catalog, scaffolder, database, scheduler)
- {metric_name} is the hierarchical metric name as provided by the plugin author (e.g., entity.count, tasks.completed.total)
The scope represents where the metric belongs in the Backstage ecosystem.
- plugin - A plugin-specific metric (e.g. backstage.plugin.catalog.entity.count)
- core - A metric provided by the core system (e.g. backstage.core.database.connections.active)
Pattern: backstage.plugin.{pluginId}.{metric_name}
# Examples
backstage.plugin.catalog.entities.processed.total
backstage.plugin.scaffolder.tasks.completed.total
backstage.plugin.techdocs.builds.active
backstage.plugin.auth.sessions.active.total # todo: technically a core service and a backend plugin
Pattern: backstage.core.{service}.{metric_name}
# Examples
backstage.core.database.connections.active
backstage.core.scheduler.tasks.queued.total
backstage.core.httpRouter.requests.total
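The naming pattern above can be sketched as a small string-building helper. This is only an illustration of the pattern: `MetricScope` and `buildMetricName` are hypothetical names introduced here and are not part of the proposed MetricsService API.

```typescript
// Hypothetical helper illustrating backstage.{scope}.{scope_name}.{metric_name}.
// Neither the type nor the function is defined by this BEP.
type MetricScope = 'plugin' | 'core';

function buildMetricName(
  scope: MetricScope,
  scopeName: string,
  metricName: string,
): string {
  if (scopeName.length === 0 || metricName.length === 0) {
    throw new Error('scopeName and metricName must be non-empty');
  }
  return `backstage.${scope}.${scopeName}.${metricName}`;
}

// Reproduces two of the example names listed above.
console.log(buildMetricName('plugin', 'catalog', 'entities.processed.total'));
console.log(buildMetricName('core', 'database', 'connections.active'));
```

Restricting `scope` to a union type means any value other than plugin or core is rejected at compile time.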
The MetricsService complements rather than duplicates auto-instrumentation by focusing on application-level metrics that only Backstage can provide. For example, the catalog plugin may want to track the number of entities processed by the refresh operation and the kind of entity being processed.
// Auto-instrumentation provides (automatically):
// - http.server.requests.total{method="GET", route="/catalog/entities", status_code="200"}
// - http.server.request.duration{method="GET", route="/catalog/entities"}
// MetricsService provides (manually):
const entityMetrics = metricsService.createCounter('entities.processed.total');
entityMetrics.add(entities.length, {
  operation: 'refresh',
  'entity.kind': 'Component',
});
// Metric is now available as `entities.processed.total`
A complication of introducing only a MetricsService is that other OTEL-related configuration, such as resources, tracing providers, and views, must be collected before the SDK starts. This means that in order to introduce a MetricsService, we would have to support all OTEL Node SDK configuration along with it. In addition, the official recommendation from the OTEL team is to not initialize and start the SDK on behalf of the user.
For these reasons, we will not include any configuration as part of this BEP. Users will be responsible for initializing the SDK based on the current guidance.
Provide a wrapper around OpenTelemetry's API while leveraging the types from the @opentelemetry/api package. This reuses concepts already familiar to both the Backstage community and OpenTelemetry users.
import type {
  Counter,
  Gauge,
  Histogram,
  MetricOptions,
  ObservableCounter,
  ObservableGauge,
  ObservableUpDownCounter,
  UpDownCounter,
} from '@opentelemetry/api';

interface MetricsService {
  // Synchronous instrumentation
  createCounter(name: string, options?: MetricOptions): Counter;
  createUpDownCounter(name: string, options?: MetricOptions): UpDownCounter;
  createHistogram(name: string, options?: MetricOptions): Histogram;
  createGauge(name: string, options?: MetricOptions): Gauge;
  // Asynchronous instrumentation
  createObservableCounter(
    name: string,
    options?: MetricOptions,
  ): ObservableCounter;
  createObservableUpDownCounter(
    name: string,
    options?: MetricOptions,
  ): ObservableUpDownCounter;
  createObservableGauge(name: string, options?: MetricOptions): ObservableGauge;
  // Future - add additional convenience methods as we learn more about the needs of the framework
}
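To make the wrapper idea concrete, here is a minimal sketch of a service that delegates to an underlying meter. It assumes a great deal: the types below are simplified local stand-ins for the `@opentelemetry/api` types so the snippet runs without dependencies, only two of the proposed methods are covered, and `SketchMetricsService`, `MeterLike`, and `InMemoryMeter` are hypothetical names that do not appear in the proposal.

```typescript
// Simplified local stand-ins for the @opentelemetry/api types (assumption:
// real code would import Counter, Histogram, and MetricOptions instead).
type Attributes = Record<string, string | number | boolean>;
interface MetricOptions { description?: string; unit?: string }
interface Counter { add(value: number, attributes?: Attributes): void }
interface Histogram { record(value: number, attributes?: Attributes): void }
interface MeterLike {
  createCounter(name: string, options?: MetricOptions): Counter;
  createHistogram(name: string, options?: MetricOptions): Histogram;
}

// Hypothetical wrapper: each method delegates directly to the meter.
class SketchMetricsService {
  private readonly meter: MeterLike;
  constructor(meter: MeterLike) {
    this.meter = meter;
  }
  createCounter(name: string, options?: MetricOptions): Counter {
    return this.meter.createCounter(name, options);
  }
  createHistogram(name: string, options?: MetricOptions): Histogram {
    return this.meter.createHistogram(name, options);
  }
}

// In-memory meter used only to demonstrate the wrapper; it ignores attributes.
class InMemoryMeter implements MeterLike {
  readonly totals = new Map<string, number>();
  readonly samples = new Map<string, number[]>();
  createCounter(name: string): Counter {
    return {
      add: (value: number) => {
        this.totals.set(name, (this.totals.get(name) ?? 0) + value);
      },
    };
  }
  createHistogram(name: string): Histogram {
    return {
      record: (value: number) => {
        const list = this.samples.get(name) ?? [];
        list.push(value);
        this.samples.set(name, list);
      },
    };
  }
}

const meter = new InMemoryMeter();
const metrics = new SketchMetricsService(meter);
const processed = metrics.createCounter('entities.processed.total', {
  unit: '{entity}',
});
processed.add(100, { operation: 'refresh' });
processed.add(5);
console.log(meter.totals.get('entities.processed.total'));
```

Because the wrapper only forwards calls, swapping the in-memory meter for a real OpenTelemetry meter would not change plugin code, which is the property the interface above is after.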
Each plugin receives a metrics service that automatically configures the Instrumentation Scope to identify the plugin. The scope name follows the pattern backstage-plugin-{pluginId}.
export const metricsServiceFactory = createServiceFactory({
  service: coreServices.metrics,
  deps: {
    pluginMetadata: coreServices.pluginMetadata,
  },
  factory: ({ pluginMetadata }) => {
    const pluginId = pluginMetadata.getId();
    const scopeName = `backstage-plugin-${pluginId}`;
    return new DefaultMetricsService(scopeName, version, ...);
  },
});
const entitiesProcessed = metricsService.createCounter(
  'entities.processed.total',
  {
    description: 'Total entities processed during refresh',
    unit: '{entity}',
  },
entitiesProcessed.add(100);
// ...
// metric is now available as `backstage.plugin.catalog.entities.processed.total`
@alpha.
MetricsService.
@public
The OTEL SDK MUST be initialized as early as possible to prevent dependents from receiving no-op meters. We will not change the current guidance on this.
Prepend backstage.plugin.{pluginId}. to all metric names. This was the original proposal, but it conflicts with OpenTelemetry semantic conventions.
Problems:
mcp.*, gen_ai.*, http.*
Example of conflict:
// Plugin wants to emit: mcp.client.operation.duration
// Framework forces: backstage.plugin.mcp-actions.mcp.client.operation.duration
// This violates the semantic convention and breaks tooling
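The conflict can be reproduced in a few lines. The sketch below is illustrative only: `forcePrefix`, `isSemanticConventionName`, and the `SEMCONV_PREFIXES` list are hypothetical, and the list simply mirrors the three namespaces mentioned above rather than the full set of OpenTelemetry semantic conventions.

```typescript
// Namespaces named in the text as belonging to semantic conventions.
// Assumption: this list is illustrative, not exhaustive.
const SEMCONV_PREFIXES = ['mcp.', 'gen_ai.', 'http.'];

// Hypothetical model of the rejected approach: always prepend the plugin prefix.
function forcePrefix(pluginId: string, metricName: string): string {
  return `backstage.plugin.${pluginId}.${metricName}`;
}

// Hypothetical check: does the name start with a semantic-convention namespace?
function isSemanticConventionName(metricName: string): boolean {
  return SEMCONV_PREFIXES.some(prefix => metricName.startsWith(prefix));
}

const wanted = 'mcp.client.operation.duration';
const forced = forcePrefix('mcp-actions', wanted);

console.log(forced);
console.log(isSemanticConventionName(wanted));
console.log(isSemanticConventionName(forced));
```

Once the prefix is applied, the name no longer starts with the `mcp.` namespace, so tooling that matches on the conventional name would not recognize it.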