pip/pip-320.md
PIP-264, which can also be viewed here, describes at a high level a plan to greatly enhance the Pulsar metrics system by replacing it with OpenTelemetry. You can read in that PIP about the numerous existing problems PIP-264 solves.
You can also read here why OpenTelemetry was chosen.
Since OpenTelemetry (a.k.a. OTel) is an emerging industry standard, there are plenty of good articles, videos, and documentation about it. In this very short paragraph I'll describe what you need to know about OTel from this PIP's perspective.
OpenTelemetry is a project that aims to standardize the way we instrument, collect, and ship metrics from applications to telemetry backends, be they databases (e.g. Prometheus, Cortex, Thanos) or vendors (e.g. Datadog, Logz.io). It is divided into three parts: the API, the SDK, and the Collector.
Just to have some context: the Pulsar codebase will use the OTel API to create counters and histograms and record values to them, and so will Pulsar plugins and Pulsar Function authors. Pulsar itself will be the one creating the SDK and using it to hand over an implementation of the API wherever it is needed in Pulsar. The Collector is up to the choice of the user, as OTel provides a way to expose the metrics as a /metrics endpoint on a configured port, so Prometheus-compatible scrapers can grab the metrics from it directly. Users can also ship the metrics via OTLP to an OTel Collector.
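To make this concrete, here is a minimal sketch of what instrumenting through the OTel API looks like. The meter name follows the convention proposed later in this PIP, while the instrument name and topic value are hypothetical, chosen only for illustration:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;

class InstrumentationExample {
    // `openTelemetry` is the implementation Pulsar hands over: backed by the
    // SDK when OTel is enabled, a no-op instance otherwise.
    void instrument(OpenTelemetry openTelemetry) {
        LongCounter received = openTelemetry
                .getMeter("org.apache.pulsar.broker")          // meter per component
                .counterBuilder("pulsar.broker.example.count") // hypothetical instrument
                .build();
        // Record a value with a per-topic attribute.
        received.add(1, Attributes.of(AttributeKey.stringKey("pulsar.topic"), "my-topic"));
    }
}
```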
PIP-264 clearly outlined that there will be two layers of metrics, collected and exported side by side: OpenTelemetry and the existing metric system, which currently exports in Prometheus format. This PIP will explain in detail how that will work. The basic premise is that you will be able to enable or disable OTel metrics alongside the existing Prometheus metric exporting.
As specified in PIP-264, the OpenTelemetry Java SDK has several issues the Pulsar community must fix before it can be used in production. They are documented in PIP-264. The most important one is reducing memory allocations to a negligible amount. The OTel SDK is built upon immutability, hence it allocates memory in O(#topics), which is a performance killer for a low-latency application like Pulsar.
You can track the proposal and the progress the Pulsar and OTel communities are making in this issue.
Today, the Pulsar metrics endpoint /metrics has an option to be protected by the configured AuthenticationProvider. The configuration option is named authenticateMetricsEndpoint in both the broker and the proxy.
Implementing PIP-264 consists of a long list of steps, which are detailed in this issue. The first step is adding all the bare-bones infrastructure needed to use OpenTelemetry in Pulsar, such that subsequent PRs can use it to start translating existing metrics to their OTel form. This means the same metrics will co-exist in the codebase, and also at runtime if OTel is enabled.
OpenTelemetry, like any good telemetry library (e.g. Log4j, Logback), has its own configuration mechanisms: system properties, environment variables, and a configuration file.
Pulsar doesn't need to introduce any additional configuration; using OTel configuration, the user can decide things like which exporter to use and how metrics are collected and exported.
Pulsar will use AutoConfiguredOpenTelemetrySdk, which uses all the above configuration mechanisms
(documented here).
This class builds an OpenTelemetrySdk based on those configurations. The result is the entry point to the
OpenTelemetry API, as it implements the OpenTelemetry API class.
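As a rough sketch (not the final Pulsar wiring), initializing the SDK this way is essentially a one-liner:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

// Builds an OpenTelemetrySdk from system properties and environment
// variables; the returned instance implements the OpenTelemetry API.
OpenTelemetry openTelemetry = AutoConfiguredOpenTelemetrySdk.initialize()
        .getOpenTelemetrySdk();
```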
There are some configuration options whose defaults we wish to change, while still allowing users to override them if they wish. We think those default values will make for a much better user experience.
otel.experimental.metrics.cardinality.limit - value: 10,000
This property sets an upper bound on the number of unique Attributes (attribute sets) an instrument can have. Take Pulsar, for example:
for an instrument like pulsar.broker.messaging.topic.received.size, the number of unique Attributes would equal the number of
active topics in the broker. Since Pulsar can handle up to 1M topics, it makes more sense to set the default value
to 10k, which translates to 10k topics.

AutoConfiguredOpenTelemetrySdkBuilder allows adding properties using the method addPropertiesSupplier.
System properties and environment variables override properties supplied this way. The file-based configuration doesn't yet take
those supplied properties into account, but it will.
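A sketch of how such a default could be supplied (the exact wiring in Pulsar may differ):

```java
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import java.util.Map;

// Defaults supplied here lose to system properties and environment
// variables, so users can still override them.
OpenTelemetrySdk sdk = AutoConfiguredOpenTelemetrySdk.builder()
        .addPropertiesSupplier(() -> Map.of(
                "otel.experimental.metrics.cardinality.limit", "10000"))
        .build()
        .getOpenTelemetrySdk();
```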
We would like the ability to toggle OpenTelemetry-based metrics, as they are still new. We won't need any special Pulsar configuration, as the OpenTelemetry SDK comes with a configuration key to do that. Since OTel support is still experimental, it will have to be opt-in; hence we will add the following property as a default, using the mechanism described above:
otel.sdk.disabled - value: true
This property value disables OpenTelemetry. With OTel disabled, the user remains with the existing metrics system. OTel in a disabled state operates in a
no-op mode. This means instruments do get built, but the instrument builders return the same instance of a
no-op instrument, which does nothing in its record-value methods (e.g. add(number), record(number)). The no-op
MeterProvider has no registered MetricReader, hence no metric collection will be made. The memory impact
is almost zero, and the same goes for the CPU impact.
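For illustration, a disabled SDK behaves like the API's built-in no-op implementation; the counter name below is hypothetical:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;

// What "disabled" amounts to: instruments can be created and used,
// but recording is a no-op and nothing is collected or exported.
OpenTelemetry otel = OpenTelemetry.noop();
LongCounter counter = otel.getMeter("org.apache.pulsar.broker")
        .counterBuilder("pulsar.broker.example.count")
        .build();
counter.add(1); // does nothing
```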
The current metric system doesn't have a toggle that causes all existing data structures to stop collecting
data. Adding one would require changes in many places, since we don't have a single place through which
all metric instruments are created (one of the motivations for PIP-264).

The current system does have one toggle: exposeTopicLevelMetricsInPrometheus. It enables toggling off
topic-level metrics, which means the highest-cardinality metrics will be at the namespace level.
Once that toggle is false, the number of data structures accounting for memory would be in the range of
a few thousand, which shouldn't pose a burden memory-wise. If the user refrains from calling
/metrics, it will also reduce the CPU and memory cost associated with collecting metrics.
When the user enables OTel, there will be a memory increase; but if the user disables topic-level metrics in the existing system, as specified above, the majority of the memory increase will be due to topic-level metrics in OTel, at the expense of not having them in the existing metric system.
A broker is part of a cluster, which is configured via the Pulsar configuration key clusterName. When the broker is part
of a cluster, it shares the topics defined in that cluster (persisted in the metadata service, e.g. ZooKeeper)
with the other brokers of that cluster.
Today, each unique time series emitted in the Prometheus metrics contains the cluster label (almost all of them, as it
is added manually). We wish for the same with OTel: to have that attribute in each exported unique time series.
OTel has the perfect location for attributes which are shared across all time series: the Resource. An application's Resource is composed of one or more attributes (multiple Resources can be merged into one). You define it once, at OTel initialization or via configuration. It can contain attributes like the hostname, AWS region, etc. The default Resource contains the service name and some info on the SDK version.
Attributes can be added dynamically through addResourceCustomizer() in AutoConfiguredOpenTelemetrySdkBuilder.
We will use that to inject the cluster attribute, taken from the configuration.
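A sketch of that injection, assuming the cluster name has already been read from the broker configuration (the accessor shown is illustrative):

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;

String clusterName = brokerConfig.getClusterName(); // hypothetical accessor

OpenTelemetrySdk sdk = AutoConfiguredOpenTelemetrySdk.builder()
        // Merge the pulsar.cluster attribute into the SDK's Resource so it
        // is attached to every exported time series.
        .addResourceCustomizer((resource, config) -> resource.merge(
                Resource.create(Attributes.of(
                        AttributeKey.stringKey("pulsar.cluster"), clusterName))))
        .build()
        .getOpenTelemetrySdk();
```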
For Prometheus, we submitted a proposal to the OpenTelemetry specification, which was merged, to allow copying resource attributes into each exported unique time series in the Prometheus exporter. We plan to contribute its implementation to the OTel Java SDK.
By default, the Resource in the Prometheus exporter is exported as target_info{} 1, with the resource attributes added to this
time series. Querying then requires joins to get at those attributes, making them extremely difficult to use.
The other alternative was to introduce our own PulsarAttributesBuilder class on top of
OTel's AttributesBuilder. Getting every contributor to know about this class and use it is hard; getting it
across to Pulsar Functions or plugin authors would be immensely hard. Also, when exporting as
OTLP, it is very inefficient to repeat the attribute across all unique time series instead of specifying it once using the
Resource. Hence, this needed to be solved in the Prometheus exporter, as we did in the proposal.
The attribute will be named pulsar.cluster, as both the proxy and the broker are part of this cluster.
Attribute names will be prefixed with pulsar.. Example: pulsar.topic, pulsar.cluster.

We should have a clear hierarchy, hence instrument names will use the following prefixes:
- pulsar.broker
- pulsar.proxy
- pulsar.function_worker

It's customary to use reverse domain names for meter names. Hence, we'll use:
- org.apache.pulsar.broker
- org.apache.pulsar.proxy
- org.apache.pulsar.function_worker

The OTel meter name is converted to the attribute otel_scope_name and added to each unique time series' attributes by the Prometheus exporter.
We won't specify a meter version, as it is used solely to signify the version of the instrumentation; since we are currently at the first version, we will not use it.
OpenTelemetryService class

PulsarBrokerOpenTelemetry class
- Creates an OpenTelemetryService using the cluster name taken from the broker configuration
- getMeter() returns the Meter for the broker

PulsarProxyOpenTelemetry class
- Same as PulsarBrokerOpenTelemetry, but for the Pulsar Proxy

PulsarWorkerOpenTelemetry class
- Same as PulsarBrokerOpenTelemetry, but for the Pulsar function worker

- /metrics endpoint on a user-defined port, if the user chose to use it
- AuthenticationProvider
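To tie things together, here is a minimal sketch of how these pieces could fit. This is an illustrative shape under the assumptions above (defaults via addPropertiesSupplier, cluster attribute via addResourceCustomizer), not the final implementation:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import java.util.Map;

// Wraps SDK creation: overridable defaults plus the pulsar.cluster Resource attribute.
public class OpenTelemetryService implements AutoCloseable {
    private final OpenTelemetrySdk sdk;

    public OpenTelemetryService(String clusterName) {
        this.sdk = AutoConfiguredOpenTelemetrySdk.builder()
                .addPropertiesSupplier(() -> Map.of(
                        "otel.sdk.disabled", "true", // opt-in: disabled by default
                        "otel.experimental.metrics.cardinality.limit", "10000"))
                .addResourceCustomizer((resource, config) -> resource.merge(
                        Resource.create(Attributes.of(
                                AttributeKey.stringKey("pulsar.cluster"), clusterName))))
                .build()
                .getOpenTelemetrySdk();
    }

    public Meter getMeter(String meterName) {
        return sdk.getMeter(meterName);
    }

    @Override
    public void close() {
        sdk.close();
    }
}

// Broker-facing wrapper; the proxy and function worker variants would differ
// only in the meter name (org.apache.pulsar.proxy, org.apache.pulsar.function_worker).
class PulsarBrokerOpenTelemetry implements AutoCloseable {
    private final OpenTelemetryService service;

    PulsarBrokerOpenTelemetry(String clusterName) {
        this.service = new OpenTelemetryService(clusterName);
    }

    public Meter getMeter() {
        return service.getMeter("org.apache.pulsar.broker");
    }

    @Override
    public void close() {
        service.close();
    }
}
```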