docs/rfcs/044-feature-flag.md
In this RFC, we will describe how we will implement per-tenant feature flags.
Before we start, let's talk about how current feature flag services work. PostHog is the feature flag service we are currently using across multiple user-facing components in the company. PostHog has two modes of operation: HTTP evaluation and server-side local evaluation.
Let's assume we have a storage feature flag called gc-compaction and we want to roll it out to scale-tier users with resident size >= 10GB and <= 100GB.
The first step is to synchronize our user profiles to the PostHog service. We can simply assume that each tenant is a user in PostHog. Each user profile has some properties associated with it. In our case, it will be: plan type (free, scale, enterprise, etc); resident size (in bytes); primary pageserver (string); region (string).
We would create a feature flag called gc-compaction in PostHog with 4 variants: disabled, stage-1, stage-2, fully-enabled. We will flip the feature flags from disabled to fully-enabled stage by stage for some percentage of our users.
When using PostHog's HTTP evaluation mode, the client will make request to the PostHog service, asking for the value of a feature flag for a specific user.
/decide API on PostHog.Using the HTTP evaluation mode we will issue 754X requests a month.
When using PostHog's HTTP evaluation mode, the client (usually the server in a browser/server architecture) will poll the feature flag configuration every 30s (default in the Python client) from PostHog. Such configuration contains data like:
<details> <summary>Example JSON response from the PostHog local evaluation API</summary>[
{
"id": 1,
"name": "Beta Feature",
"key": "person-flag",
"is_simple_flag": True,
"active": True,
"filters": {
"groups": [
{
"properties": [
{
"key": "location",
"operator": "exact",
"value": ["Straße"],
"type": "person",
}
],
"rollout_percentage": 100,
},
{
"properties": [
{
"key": "star",
"operator": "exact",
"value": ["ſun"],
"type": "person",
}
],
"rollout_percentage": 100,
},
],
},
}
]
Note that the API only contains information like "under what condition => rollout percentage". The user is responsible to provide the properties required to the client for local evaluation, and the PostHog service (web UI) cannot know if a feature is enabled for the tenant or not until the client uses the capture API to report the result back. To control the rollout percentage, the user ID gets mapped to a float number in [0, 1) on a consistent hash ring. All values <= the percentage will get the feature enabled or set to the desired value.
To use the local evaluation mode, the system needs:
In this case, we will issue 86400Y + 40X requests per month.
Assume X = 1,000,000 and Y = 100,
| HTTP Evaluation | Local Evaluation | |
|---|---|---|
| Latency of propagating the conditions/properties for feature flag | 24 hours | available locally |
| Latency of applying the feature flag | 1 hour | 5 minutes |
| Can properties be reported from different services | Yes | No |
| Do we need to sync billing info etc to pageserver | No | Yes |
| Cost | 75400$ / month | 4864$ / month |
We will use PostHog only as an UI to configure the feature flags. Whether a feature is enabled or not can only be queried through storcon/pageserver instead of using the PostHog UI. (We could report it back to PostHog via capture_event but it costs $$$.) This allows us to ramp up the feature flag functionality fast at first. At the same time, it would also give us the option to migrate to our own solution once we want to have more properties and more complex evaluation rules in our system.
We only need to pay for the 86400Y local evaluation requests (that would be setting Y=0 in solution 2 => $864/month, and even less if we proxy it through storcon).
evaluate(tenant_id, feature_flag, properties) -> json.:tenant_id/feature/:feature to retrieve the current feature flag status.posthog_lite_client that supports local feature evaluations.Each tenant has a feature_resolver object. After you add a feature flag in the PostHog console, you can retrieve it with:
// Boolean flag
self
.feature_resolver
.evaluate_boolean("flag")
.is_ok()
// Multivariate flag
self
.feature_resolver
.evaluate_multivariate("gc-comapction-strategy")
.ok();
The user needs to handle the case where the evaluation result is an error. This can occur in a variety of cases:
For boolean flags, the return value is Result<(), Error>. Ok(()) means the flag is evaluated to true. Otherwise,
there is either an error in evaluation or it does not match any groups.
For multivariate flags, the return value is Result<String, Error>. Ok(variant) indicates the flag is evaluated
to a variant. Otherwise, there is either an error in evaluation or it does not match any groups.
The evaluation logic is documented in the PostHog lite library. It compares the consistent hash of a flag key + tenant_id with the rollout percentage and determines which tenant to roll out a specific feature.
Users can use the feature flag evaluation API to get the flag evaluation result of a specific tenant for debugging purposes.
curl http://localhost:9898/v1/tenant/:tenant_id/feature_flag?flag=:key&as=multivariate/boolean"
By default, the storcon pushes the feature flag specs to the pageservers every 30 seconds, which means that a change in feature flag in the PostHog UI will propagate to the pageservers within 30 seconds.
plan_type (needs cplane to pass it down).AtomicBool and use it on the read path).