docs/published/handbook/engineering/person-processing.md
Note: This document describes the person processing system at a point in time. The source of truth is, as always, the source code. If you find a mistake or something out of date, please open a PR!
It's not intended to provide a perfectly detailed view of any one system; rather, it should explain how the systems fit together at a high level, going into detail when relevant.
PostHog's person processing system provides stable identity for users across multiple sessions, devices, and platforms. A single real-world user (a "Person") may interact with your product from their phone, laptop, and through server-side API calls - person processing ensures all of these interactions are attributed to the same identity.
Person profiles power many PostHog products:
Events flow through several systems before they're queryable:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Client SDK │
│ (posthog-js, posthog-node, etc.) │
│ │
└─────────────────────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Capture (Rust service) │
│ - Validates events │
│ - Rate limiting / overflow │
│ - Produces to Kafka │
│ - Partition key: <token>:<distinct_id> │
│ │
└─────────────────────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Kafka (events_plugin_ingestion topic) │
│ - Partitioned by token:distinct_id │
│ - Events for same distinct_id go to same partition (ordering guarantee) │
│ - Different distinct_ids may go to different partitions (no ordering) │
│ │
└─────────────────────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Ingestion Pipeline (Node.js) - formerly called Plugin Server │
│ - Person processing (creates/updates/merges persons in PostgreSQL) │
│ - Property updates ($set, $set_once, $unset) │
│ - Produces person updates to Kafka │
│ - Produces processed events to Kafka │
│ │
└─────────────────────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ │
│ Kafka (clickhouse_events_json, person topics) │
│ │
└─────────────────────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ │
│ ClickHouse │
│ - events table (with person_id column) │
│ - person table │
│ - person_distinct_id2 table (distinct_id → person mapping) │
│ - person_distinct_id_overrides table (for squashing) │
│ │
└─────────────────────────────────┬───────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ │
│ HogQL / Query Engine │
│ - Translates queries to ClickHouse SQL │
│ - Handles person joins based on PoE (Persons on Events) mode │
│ - Applies person_distinct_id_overrides when needed │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
A distinct_id is an identifier attached to every event. It's how we know which person an event belongs to. A person can have multiple distinct IDs (e.g., an anonymous session ID and a logged-in user ID).
A user's distinct ID is provided by the client as part of the identify call. Before identify is called, the distinct ID is a randomly generated UUID.
Some commonly used Distinct ID formats are: the user's email address, a UUID randomly generated by a client SDK, the primary key id in the customer's User table in their database, a Stripe cus_xxx ID.
A distinct ID must be associated with exactly one user, so it would be invalid to use e.g. "backend", "python", or anything else that relates to a class of users rather than an individual.
Every person has a single UUID, generated deterministically from (team_id, distinct_id) at creation time using UUIDv5.
// nodejs/src/worker/ingestion/person-uuid.ts
function uuidFromDistinctId(teamId: number, distinctId: string): string {
    return uuidv5(`${teamId}:${distinctId}`, PERSON_UUIDV5_NAMESPACE)
}
A person profile contains:
When a user identifies themselves (typically on login), the SDK calls $identify with:
- distinct_id: The identified user ID (e.g., email, user ID from your database)
- $anon_distinct_id: The anonymous ID that was being used before login

This triggers a merge: all events from both the anonymous distinct ID and the identified ID should be attributed to the same person. We do this through a level of indirection (a join between person_ids and distinct_ids), which we explain in more depth later on.
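The indirection can be pictured with a small in-memory model. This is only a sketch: `DistinctIdMap`, its methods, and the `person-for-...` IDs are hypothetical names invented for illustration, and the real merge logic (which distinguishes several cases and lives in PostgreSQL) is more involved.

```typescript
// Hypothetical in-memory model; the real mapping lives in PostgreSQL and ClickHouse.
type PersonId = string;

class DistinctIdMap {
  // distinct_id -> person_id: the level of indirection that makes merges cheap
  private readonly personByDistinctId = new Map<string, PersonId>();

  // Look up the person for a distinct ID, creating one the first time it is seen.
  resolve(distinctId: string): PersonId {
    let personId = this.personByDistinctId.get(distinctId);
    if (personId === undefined) {
      personId = `person-for-${distinctId}`; // stand-in for a real UUID
      this.personByDistinctId.set(distinctId, personId);
    }
    return personId;
  }

  // Simplified $identify: repoint every distinct ID of the anonymous person
  // to the identified person. Events themselves are never touched here.
  identify(distinctId: string, anonDistinctId: string): PersonId {
    const target = this.resolve(distinctId);
    const source = this.resolve(anonDistinctId);
    if (source !== target) {
      this.personByDistinctId.forEach((personId, id) => {
        if (personId === source) {
          this.personByDistinctId.set(id, target);
        }
      });
    }
    return target;
  }
}

const ids = new DistinctIdMap();
ids.resolve("anon-123"); // anonymous browsing
const mergedPerson = ids.identify("user-42", "anon-123"); // login
```

After `identify`, both `ids.resolve("anon-123")` and `ids.resolve("user-42")` return the same person, which is the property the merge is meant to guarantee.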
When a user identifies themselves, there's a problem: historical events in ClickHouse still have the old person_id. If we do nothing, queries for that user would miss all their anonymous events.
The naive solution is to JOIN the events table with a mapping table (person_distinct_id2) that knows which distinct_ids belong to which person. But person_distinct_id2 contains every distinct_id mapping for every user - this table can have hundreds of millions of rows. Joining the events table (which can have billions of rows) with this massive mapping table on every query is prohibitively slow.
Instead, we use a two-part strategy:
1. Periodically rewriting the person_id on events to respect these merges: We call this process squashing. Once events are rewritten, we delete those rows from person_distinct_id_overrides. This keeps the overrides table small.
2. Small overrides table for queries: We maintain person_distinct_id_overrides, which only contains distinct_ids whose person mapping has changed (i.e., been merged) since the last squash. This table is tiny compared to person_distinct_id2 - typically just thousands of rows instead of millions. Queries can quickly LEFT JOIN to this small table instead of the massive person_distinct_id2 table.
- When a merge leaves a distinct_id with version > 0, a row is inserted into person_distinct_id_overrides
- A Dagster job (posthog/dags/person_overrides.py) periodically:
  - Runs an ALTER TABLE ... UPDATE mutation to rewrite person_ids in the events table
- Until a squash has run, queries LEFT JOIN person_distinct_id_overrides to get the correct person_id

The result: queries stay fast because they only join with a small table, and that table stays small because we continuously squash and clean up.
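The interplay between overrides and squashing can be sketched with plain in-memory data. Everything here (`StoredEvent`, `effectivePersonId`, `squash`, the sample IDs) is a hypothetical stand-in for the ClickHouse tables and the Dagster job, not the actual implementation.

```typescript
// Hypothetical in-memory stand-ins for the ClickHouse tables described above.
interface StoredEvent {
  distinctId: string;
  personId: string; // written at ingestion time; may be stale after a merge
}

const events: StoredEvent[] = [
  { distinctId: "anon-123", personId: "person-anon" },
  { distinctId: "user-42", personId: "person-user" },
];

// person_distinct_id_overrides: only distinct IDs whose person changed since the last squash
const overrides = new Map<string, string>();

// Query-time resolution: prefer the override, fall back to the person_id stored
// on the event (the in-memory equivalent of a LEFT JOIN plus a fallback).
function effectivePersonId(event: StoredEvent): string {
  return overrides.get(event.distinctId) ?? event.personId;
}

// Squash: rewrite stale person_ids on events, then clear the processed overrides.
function squash(): void {
  for (const event of events) {
    const override = overrides.get(event.distinctId);
    if (override !== undefined) {
      event.personId = override;
    }
  }
  overrides.clear();
}

// Merging anon-123 into person-user records an override instead of touching events.
overrides.set("anon-123", "person-user");
const beforeSquash = events.map(effectivePersonId);
squash();
const afterSquash = events.map(effectivePersonId);
```

Queries see the merged person both before and after the squash; the squash only moves the correction from the overrides table onto the events themselves.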
Terminology: In user-facing documentation, these are called "anonymous events". Internally, we call them "personless events". They're the same thing.
Much of the per-event cost of ingestion comes from person processing - looking up persons in PostgreSQL, creating/updating records, handling merges, and producing to multiple Kafka topics. By skipping person processing for some events, we can offer significantly lower pricing.
The typical use case: most of your traffic is logged-out users browsing your site. You don't need person profiles for these users - you just want to count pageviews and track basic analytics. But when a user logs in, makes a purchase, or does something valuable, you want full person tracking so you can analyze their journey, target them with feature flags, etc.
We put significant engineering effort into making the transition from personless to identified work seamlessly - when a user identifies, their previous anonymous events are linked to their new person profile automatically.
By default, every event creates or updates a person profile. Personless mode ($process_person_profile: false) skips this:
If a personless user later identifies themselves via $identify, an override is created to link their anonymous events to their real person. This gives you the best of both worlds: cheap ingestion for anonymous users, full person support once they identify.
Location: rust/capture/
Responsibilities:
Key behavior for person processing:
Kafka Topic: events_plugin_ingestion
The Kafka partition key for most events is <token>:<distinct_id>:
// rust/common/types/src/event.rs
pub fn key(&self) -> String {
    if self.is_cookieless_mode {
        format!("{}:{}", self.token, self.ip)
    } else {
        format!("{}:{}", self.token, self.distinct_id)
    }
}
(Cookieless events use a placeholder distinct ID, which is replaced later with a privacy-preserving hash. The placeholder is not suitable as a partitioning key because it is the same value for every cookieless event, so the IP address is used instead.)
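To see why the key matters, here is a minimal sketch of key-based partition assignment. The FNV-1a hash, the partition count of 64, and the `phc_example` token are assumptions chosen for brevity; real Kafka clients use their own partitioner and hash function.

```typescript
// FNV-1a over UTF-16 code units: a stand-in hash, not the one Kafka clients use.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// Same shape as the non-cookieless branch of key(): <token>:<distinct_id>
function partitionKey(token: string, distinctId: string): string {
  return `${token}:${distinctId}`;
}

function partitionFor(key: string, partitionCount: number): number {
  return fnv1a(key) % partitionCount;
}

const PARTITIONS = 64; // hypothetical partition count
const first = partitionFor(partitionKey("phc_example", "anon-123"), PARTITIONS);
const second = partitionFor(partitionKey("phc_example", "anon-123"), PARTITIONS);
const other = partitionFor(partitionKey("phc_example", "user-42"), PARTITIONS);
```

`first` and `second` are always equal, which is what gives per-distinct_id ordering. `other` may or may not match them, which is why events for two distinct IDs of the same person carry no ordering guarantee relative to each other.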
Implications:
$identify event (which has a different distinct_id) can be processed in parallel by different workers; the ingestion pipeline code is careful to avoid race conditions here.
Location: nodejs/src/worker/ingestion/
The ingestion pipeline processes events in batches. For person processing:
Location: nodejs/src/worker/ingestion/event-pipeline/prefetchPersonsStep.ts
Location: nodejs/src/worker/ingestion/event-pipeline/processPersonlessDistinctIdsBatchStep.ts
For events with $process_person_profile: false:
posthog_personlessdistinctid table
is_merged flag)
-- nodejs/src/worker/ingestion/persons/repositories/postgres-person-repository.ts
INSERT INTO posthog_personlessdistinctid (team_id, distinct_id, is_merged, created_at)
VALUES ($1, $2, false, now())
ON CONFLICT (team_id, distinct_id) DO UPDATE
SET is_merged = posthog_personlessdistinctid.is_merged
RETURNING is_merged
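The self-assignment in the DO UPDATE clause is a no-op whose purpose is to make RETURNING yield a row even when the insert conflicts (ON CONFLICT DO NOTHING would return nothing for an existing row). The semantics can be sketched in memory as follows; the `personlessDistinctIds` map and `addPersonlessDistinctId` function are hypothetical names for illustration only.

```typescript
// Hypothetical in-memory stand-in for posthog_personlessdistinctid,
// keyed by "team_id:distinct_id".
const personlessDistinctIds = new Map<string, { isMerged: boolean }>();

// Mirrors the upsert: insert with is_merged = false if the row is absent,
// otherwise leave the stored flag untouched. Either way, return the current
// is_merged value.
function addPersonlessDistinctId(teamId: number, distinctId: string): boolean {
  const key = `${teamId}:${distinctId}`;
  let row = personlessDistinctIds.get(key);
  if (row === undefined) {
    row = { isMerged: false };
    personlessDistinctIds.set(key, row);
  }
  return row.isMerged;
}

const firstSeen = addPersonlessDistinctId(1, "anon-123");
personlessDistinctIds.set("1:anon-123", { isMerged: true }); // simulate a later merge
const afterMerge = addPersonlessDistinctId(1, "anon-123");
```

The first call reports an unmerged distinct ID; once the flag has been set elsewhere, repeated calls keep reporting it as merged rather than resetting it.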
Location: nodejs/src/worker/ingestion/event-pipeline/processPersonsStep.ts
Two branches based on $process_person_profile:
If $process_person_profile: false (personless mode):
If $process_person_profile: true (or not set):
$identify event, handle the merge
Location: nodejs/src/worker/ingestion/persons/person-merge-service.ts
When $identify is called with $anon_distinct_id:
// nodejs/src/worker/ingestion/persons/person-merge-service.ts
async mergeDistinctIds(
    otherPersonDistinctId: string, // e.g., "anon-123"
    mergeIntoDistinctId: string, // e.g., "[email protected]"
    teamId: number,
    timestamp: DateTime
)
Three cases:
When are overrides created?
An override is needed when events exist in ClickHouse with a person_id that's now incorrect (because of a merge). The version field in posthog_persondistinctid controls this:
- version = 0: No override - this distinct_id's events already have the correct person_id (e.g. the first $identify for a user, due to the deterministic UUID v5)
- version >= 1: Override created - events exist with an old person_id that needs rewriting

posthog_person: The source of truth for person data
- id: Internal integer ID
- uuid: The person's UUID (deterministic from primary distinct_id)
- team_id: Which team this person belongs to
- properties: JSONB of person properties
- is_identified: Whether $identify was called
- version: Incremented on updates (for ClickHouse consistency)

posthog_persondistinctid: Maps distinct_ids to persons
- distinct_id: The distinct ID string
- person_id: FK to posthog_person
- team_id: Which team
- version: 0 for primary, >=1 for merged (triggers override)

posthog_personlessdistinctid: Tracks distinct_ids used in personless mode
- distinct_id: The distinct ID
- team_id: Which team
- is_merged: Whether this has been merged into a real person

After person processing, updates are produced to Kafka:
- KAFKA_PERSON: Person creates/updates/deletes
- KAFKA_PERSON_DISTINCT_ID: Distinct ID mapping changes

events: The main events table
- person_id: UUID of the person (may be outdated if merge hasn't been squashed)
- distinct_id: The distinct_id that was sent with the event (not changed by squashing)

person: ReplacingMergeTree of person data
person_distinct_id2: All distinct_id → person mappings
person_distinct_id_overrides: Only pending overrides
version > 0
HogQL is PostHog's query language - a dialect of SQL that provides useful abstractions over raw ClickHouse queries. It serves two main purposes (and many others):
team_id filters to all queries, ensuring customers can only access their own data - this lets us safely expose SQL access to users
HogQL is smart about when to add JOINs - it only adds them when you actually need them.
If your query doesn't reference person_id, HogQL won't add the overrides JOIN:
-- This query doesn't need the overrides JOIN
SELECT count() FROM events WHERE event = '$pageview'
But if you reference events.person_id or events.person.id, HogQL automatically adds the LEFT JOIN to person_distinct_id_overrides:
-- HogQL adds: LEFT JOIN person_distinct_id_overrides ON ...
SELECT count() FROM events WHERE events.person.id = 'some-uuid'
This means queries that don't need person data stay fast, while queries that do need it get correct results (even for recently-merged persons that haven't been squashed yet).
Option 1: person.properties (PoE)
Person properties are stored directly on the events table at ingestion time. No JOIN needed:
-- Fast: reads directly from events table
SELECT person.properties.email FROM events
These properties reflect the state at the time the event was ingested. If a person's email changes later, historical events still show the old email.
Option 2: pdi.person.properties (JOIN to person table)
This JOINs through person_distinct_id to the person table:
-- Slower: requires JOIN to person table
SELECT pdi.person.properties.email FROM events
These properties reflect the current state of the person. All events show the person's current email, even if it was different when the event occurred.
PDI = Person Distinct ID
| Access pattern | JOIN required? | Property state |
|---|---|---|
| person.properties.X | No | At ingestion time |
| pdi.person.properties.X | Yes | Current |
Most queries should use person.properties for performance. Use pdi.person.properties only when you specifically need the current property values.
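The difference between the two access patterns can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: `EventRow`, `ingest`, `poeProperty`, and `pdiProperty` are hypothetical names, and the current person state is keyed directly by distinct ID rather than going through a person table.

```typescript
type Properties = Record<string, string>;

interface EventRow {
  distinctId: string;
  personProperties: Properties; // snapshot copied onto the event at ingestion time
}

// Hypothetical stand-ins: current person state, keyed by distinct ID for brevity.
const currentPersonProperties = new Map<string, Properties>();
const eventRows: EventRow[] = [];

// Ingesting an event updates the person and stores a snapshot on the event.
function ingest(distinctId: string, properties: Properties): void {
  currentPersonProperties.set(distinctId, { ...properties });
  eventRows.push({ distinctId, personProperties: { ...properties } });
}

// person.properties.X: read the snapshot stored on the event, no lookup needed.
function poeProperty(event: EventRow, name: string): string | undefined {
  return event.personProperties[name];
}

// pdi.person.properties.X: look up the person's current state (the JOIN).
function pdiProperty(event: EventRow, name: string): string | undefined {
  return currentPersonProperties.get(event.distinctId)?.[name];
}

ingest("user-42", { email: "old@example.com" });
ingest("user-42", { email: "new@example.com" });
const poeEmails = eventRows.map((e) => poeProperty(e, "email"));
const pdiEmails = eventRows.map((e) => pdiProperty(e, "email"));
```

`poeEmails` keeps the old address for the first event, while `pdiEmails` shows the new address for both events, matching the "at ingestion time" versus "current" distinction in the table.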
It is possible to change person.properties to use the PDI properties instead, using the PoE mode setting. This can be set at both the query level and the team level, though we would like to remove the team-level setting soon.
This is set through the HogQLQueryModifiers class.
If this setting is overridden, you can still access PoE properties regardless of the PoE mode by using poe.properties.X.
-- Run against ClickHouse
SELECT * FROM person_distinct_id_overrides
WHERE team_id = X AND distinct_id = 'anon-123'
-- Run against ClickHouse
SELECT * FROM person_distinct_id2
WHERE team_id = X AND distinct_id IN ('anon-123', '[email protected]')
-- Run against PostgreSQL
SELECT * FROM posthog_personlessdistinctid
WHERE team_id = X AND distinct_id = 'anon-123'
-- Run against PostgreSQL
SELECT * FROM posthog_persondistinctid
WHERE team_id = X AND distinct_id = 'anon-123'
-- version = 0 means primary, no override
-- version >= 1 means override should exist
| File | Purpose |
|---|---|
| rust/capture/src/sinks/kafka.rs | Produces events to Kafka, sets partition key |
| rust/common/types/src/event.rs | Event type, includes key() method for partition key |

| File | Purpose |
|---|---|
| nodejs/src/worker/ingestion/person-uuid.ts | Deterministic UUID generation |
| nodejs/src/worker/ingestion/event-pipeline/processPersonsStep.ts | Entry point for person processing |
| nodejs/src/worker/ingestion/event-pipeline/processPersonlessStep.ts | Personless event handling |
| nodejs/src/worker/ingestion/event-pipeline/processPersonlessDistinctIdsBatchStep.ts | Batch personless tracking |
| nodejs/src/worker/ingestion/persons/person-merge-service.ts | Merge/identify handling, override version logic |
| nodejs/src/worker/ingestion/persons/person-create-service.ts | Person creation |
| nodejs/src/worker/ingestion/persons/repositories/postgres-person-repository.ts | PostgreSQL queries for person operations |

| File | Purpose |
|---|---|
| posthog/models/person/person.py | Django models: Person, PersonDistinctId, PersonlessDistinctId |

| File | Purpose |
|---|---|
| posthog/models/person/sql.py | Person, person_distinct_id, person_distinct_id_overrides table definitions |

| File | Purpose |
|---|---|
| posthog/dags/person_overrides.py | Dagster job that squashes person_id overrides |

| File | Purpose |
|---|---|
| posthog/hogql/database/schema/persons.py | HogQL schema for persons table |
| posthog/hogql/database/schema/person_distinct_ids.py | HogQL schema for person_distinct_id tables |
| posthog/hogql/database/schema/person_distinct_id_overrides.py | HogQL schema for overrides table |