Back to Posthog

Database Interaction Patterns

docs/internal/feature-flags/database-interaction-patterns.md

1.43.117.4 KB
Original Source

Database Interaction Patterns

This document explains how the Rust feature flags service interacts with PostgreSQL, including connection pooling, query routing, error handling, and observability.

Architecture overview

The service uses a four-pool architecture (with an optional fifth pool for behavioral cohorts) to separate concerns and optimize for different access patterns:

text
┌─────────────────────────────────────────────────────────────────┐
│                        DatabasePools                            │
├─────────────────────────────────────────────────────────────────┤
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    PostgresRouter                         │  │
│  │  ┌─────────────────┐  ┌─────────────────┐                │  │
│  │  │ persons_reader  │  │ persons_writer  │  ← Persons DB  │  │
│  │  └─────────────────┘  └─────────────────┘    (optional)  │  │
│  │  ┌─────────────────┐  ┌─────────────────┐                │  │
│  │  │ non_persons_    │  │ non_persons_    │  ← Main DB     │  │
│  │  │ reader          │  │ writer          │                │  │
│  │  └─────────────────┘  └─────────────────┘                │  │
│  └───────────────────────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────┐                        │
│  │ behavioral_cohorts (optional)       │  ← Behavioral cohorts  │
│  └─────────────────────────────────────┘    database             │
└─────────────────────────────────────────────────────────────────┘

When the persons database is not configured separately, the persons pools alias to the non-persons pools, effectively creating a two-pool architecture.

When BEHAVIORAL_COHORTS_READ_DATABASE_URL is configured, a separate reader pool is created for realtime cohort membership lookups. This pool has tight limits (max 5 connections, 1s statement timeout) to avoid impacting flag evaluation latency. When not configured, realtime cohort evaluation is disabled with no impact on existing flag evaluation.

Connection pooling

Pool configuration

The service uses SQLx's PgPool with configurable parameters per pool:

rust
pub struct PoolConfig {
    pub min_connections: u32,        // Minimum idle connections to maintain
    pub max_connections: u32,        // Maximum connections in the pool
    pub acquire_timeout: Duration,   // Timeout for acquiring a connection
    pub idle_timeout: Option<Duration>, // Close idle connections after this duration
    pub test_before_acquire: bool,   // Validate connection health before use
    pub statement_timeout_ms: Option<u64>, // PostgreSQL statement_timeout per connection
    pub pool_name: Option<String>,   // Pool identity for connection creation metrics
}

Default values

ParameterLibrary defaultService defaultPurpose
min_connections00 per poolStart with no connections, scale on demand
max_connections1010Maximum connections per pool
acquire_timeout10s3s (test)Wait time for connection from pool
idle_timeout300s (5 min)300sClose unused connections
test_before_acquiretruetrueValidate connection before use
statement_timeout_msNone5000msCancel queries exceeding this duration

Per-pool statement timeouts

Different pools can have different statement timeouts to match their workload:

PoolConfig keyTypical use
non_persons_readerNON_PERSONS_READER_STATEMENT_TIMEOUT_MSFlag definitions, team data
persons_readerPERSONS_READER_STATEMENT_TIMEOUT_MSPerson lookups, cohort membership
persons_writerWRITER_STATEMENT_TIMEOUT_MSHash key override writes
non_persons_writerWRITER_STATEMENT_TIMEOUT_MSSame as persons_writer
behavioral_cohortshardcoded (1000ms)Realtime cohort membership lookups

Statement timeouts are set via SET statement_timeout = {ms} on each new connection using SQLx's after_connect hook.

The same hook also increments the db_connection_created_total counter when pool_name is set, providing visibility into connection churn per pool.

Total connection count

text
With persons DB routing enabled:  4 pools × max_connections
With persons DB routing disabled: 2 pools × max_connections (pools are aliased)

For production with max_connections=10:

  • Routing enabled: 40 connections max per service instance
  • Routing disabled: 20 connections max per service instance

When BEHAVIORAL_COHORTS_READ_DATABASE_URL is configured, an additional 5 connections (hardcoded max) are added to the total.

Query routing

The PostgresRouter routes queries to the appropriate pool based on the table being accessed:

rust
pub struct PostgresRouter {
    pub persons_reader: PostgresReader,
    pub persons_writer: PostgresWriter,
    pub non_persons_reader: PostgresReader,
    pub non_persons_writer: PostgresWriter,
}

Routing rules

TablesPool
posthog_person, posthog_persondistinctid, posthog_featureflaghashkeyoverridepersons_*
posthog_featureflag, posthog_team, posthog_grouptypemapping, posthog_cohortnon_persons_*

Note: cohort_membership queries bypass PostgresRouter entirely. They are served by the behavioral_cohorts pool on DatabasePools, accessed directly via DatabasePools.behavioral_cohorts_reader. See Architecture overview.

Usage pattern

rust
// Read person data - always include team_id for partition efficiency
let mut conn = router.get_persons_reader().get_connection().await?;
let person = sqlx::query(
    "SELECT * FROM posthog_person WHERE team_id = $1 AND id = $2"
)
    .bind(team_id)
    .bind(person_id)
    .fetch_optional(&mut *conn)
    .await?;

// Read flag definitions
let mut conn = router.get_non_persons_reader().get_connection().await?;
let flags = sqlx::query("SELECT * FROM posthog_featureflag WHERE team_id = $1")
    .bind(team_id)
    .fetch_all(&mut *conn)
    .await?;

Important: Always include team_id in queries against persons tables. These tables are partitioned by team_id, and queries without it will scan all partitions instead of targeting the correct one via the index.

Error handling

Transient error detection

The common_database crate provides error classification for retry logic:

rust
pub fn is_transient_error(error: &SqlxError) -> bool

Transient errors (suitable for retry):

SQLSTATE classMeaning
08***Connection exception
53***Insufficient resources
57***Operator intervention (includes query cancellation)
58***System error
40001Serialization failure
40003Statement completion unknown
40P01Deadlock detected

Non-transient errors (fail immediately):

SQLSTATE classMeaning
23***Integrity constraint violation
42***Syntax error or access violation
22***Data exception

Timeout detection

rust
pub fn is_timeout_error(error: &SqlxError) -> bool
pub fn extract_timeout_type(error: &SqlxError) -> Option<&'static str>

Timeout types detected:

TypeSource
pool_timeoutPool acquisition timed out
io_timeoutNetwork/socket timeout
protocol_timeoutProtocol-level timeout
query_canceledSQLSTATE 57014 (statement_timeout)
lock_not_availableSQLSTATE 55P03 (lock_timeout)
idle_in_transaction_timeoutSQLSTATE 25P03

Foreign key constraint detection

rust
pub fn is_foreign_key_constraint_error(error: &SqlxError) -> bool

Used for retrying hash key override writes when a person is deleted during the operation (race condition).

Retry strategies

The service uses the tokio-retry crate with exponential backoff:

Read operations

rust
let retry_strategy = ExponentialBackoff::from_millis(50)
    .max_delay(Duration::from_millis(300))
    .take(3)  // 3 attempts total
    .map(jitter);
  • Initial delay: 50ms
  • Max delay: 300ms
  • Max attempts: 3
  • Retry on: Transient errors only

Write operations

rust
let retry_strategy = ExponentialBackoff::from_millis(100)
    .max_delay(Duration::from_millis(300))
    .take(2)  // 2 attempts for writes
    .map(jitter);
  • Initial delay: 100ms (slower to avoid overwhelming)
  • Max delay: 300ms
  • Max attempts: 2 (more conservative)
  • Retry on: Foreign key constraint errors (person deletion race)

Observability

Prometheus metrics

MetricLabelsPurpose
flags_db_connection_timepool, operationConnection acquisition latency
flags_person_query_time-Person lookup query duration
flags_definition_query_time-Flag definition query duration
flags_pool_utilization_ratiopoolPool utilization (0.0-1.0)
flags_connection_hold_time_mspool, operationHow long connections are held
flags_hash_key_retries_totalteam_id, operationRetry counter
flags_flag_evaluation_error_totalerror_typeError counter
db_connection_created_totalpoolConnection creation events (physical TCP/TLS, not pool reuse)
flags_db_connection_pool_sizepoolTotal pool size (should equal active + idle)
flags_db_connection_pool_active_totalpoolActive (in-use) connections
flags_db_connection_pool_idle_totalpoolIdle (available) connections
flags_db_connection_pool_max_totalpoolConfigured maximum connections

Example PromQL queries

promql
# Connection creation rate per pool
rate(db_connection_created_total{pool="non_persons_reader"}[5m])

# Pool reuse rate (fraction of acquires that reused an existing connection)
1 - (
  rate(db_connection_created_total[5m])
  /
  sum without(operation) (rate(flags_db_connection_time_count[5m]))
)

Pool stats

Each pool exposes stats via get_pool_stats():

rust
pub struct PoolStats {
    pub size: u32,      // Current number of connections
    pub num_idle: usize, // Connections not currently in use
}

Utilization is calculated as: (size - num_idle) / size

Slow query warnings

Queries exceeding 500ms are logged at WARN level with timing information.

Configuration reference

Environment variables

VariableDefaultPurpose
READ_DATABASE_URLrequiredMain database read replica URL
WRITE_DATABASE_URLrequiredMain database primary URL
PERSONS_READ_DATABASE_URLemptyPersons database read replica (enables routing)
PERSONS_WRITE_DATABASE_URLemptyPersons database primary (enables routing)
MAX_PG_CONNECTIONS10Max connections per pool
MIN_NON_PERSONS_READER_CONNECTIONS0Min idle connections for non-persons reader
MIN_NON_PERSONS_WRITER_CONNECTIONS0Min idle connections for non-persons writer
MIN_PERSONS_READER_CONNECTIONS0Min idle connections for persons reader
MIN_PERSONS_WRITER_CONNECTIONS0Min idle connections for persons writer
ACQUIRE_TIMEOUT_SECS10Connection acquisition timeout
IDLE_TIMEOUT_SECS300Idle connection timeout
TEST_BEFORE_ACQUIREtrueValidate connections before use
NON_PERSONS_READER_STATEMENT_TIMEOUT_MS0 (disabled)Statement timeout for non-persons reads
PERSONS_READER_STATEMENT_TIMEOUT_MS0 (disabled)Statement timeout for persons reads
WRITER_STATEMENT_TIMEOUT_MS0 (disabled)Statement timeout for writes
BEHAVIORAL_COHORTS_READ_DATABASE_URLemptyBehavioral cohorts database (enables realtime cohort evaluation)

Tuning guidance

High traffic deployment:

bash
MAX_PG_CONNECTIONS=25  # Increase pool size
MIN_NON_PERSONS_READER_CONNECTIONS=5  # Keep connections warm
MIN_PERSONS_READER_CONNECTIONS=5

Bursty traffic:

bash
IDLE_TIMEOUT_SECS=600  # Keep connections warm longer
MIN_NON_PERSONS_READER_CONNECTIONS=3  # Pre-warm some connections

Strict timeout enforcement:

bash
NON_PERSONS_READER_STATEMENT_TIMEOUT_MS=5000  # 5s for reads
PERSONS_READER_STATEMENT_TIMEOUT_MS=5000
WRITER_STATEMENT_TIMEOUT_MS=2000  # 2s for writes (should be fast)
FilePurpose
rust/common/database/src/lib.rsPool configuration, error classification
rust/feature-flags/src/database_pools.rsPool architecture (including behavioral cohorts pool)
rust/feature-flags/src/database/postgres_router.rsQuery routing
rust/feature-flags/src/config.rsEnvironment configuration
rust/feature-flags/src/flags/flag_matching_utils.rsQuery patterns, retry logic
rust/feature-flags/src/metrics/consts.rsMetric constants