doc/development/aggregation_engines.md
The Aggregation Framework provides a unified interface for building analytical queries across different database backends. It supports both PostgreSQL (via ActiveRecord) and ClickHouse, allowing developers to define reusable aggregation engines with metrics, dimensions, and filters.
The ActiveRecord engine (Gitlab::Database::Aggregation::ActiveRecord::Engine) generates PostgreSQL queries using ActiveRecord's query interface.
class IssueAggregationEngine < Gitlab::Database::Aggregation::ActiveRecord::Engine
filters do
exact_match :project_id, :integer, description: 'Filter by project ID'
exact_match :state, :string, description: 'Filter by issue state'
end
dimensions do
column :author_id, :integer, description: 'Group by author'
date_bucket :created_at, :datetime,
parameters: { granularity: { in: %i[daily weekly monthly yearly], type: :string } },
description: 'Group by creation date'
end
metrics do
count description: 'Total number of issues'
mean :weight, :float, description: 'Average issue weight'
end
end
The ActiveRecord engine generates a single-level SQL query:
SELECT
"issues"."author_id" AS aeq_author_id,
date_trunc('month', "issues"."created_at") AS aeq_created_at,
COUNT(*) AS aeq_total_count,
AVG("issues"."weight") AS aeq_mean_weight
FROM "issues"
WHERE "issues"."project_id" IN (1, 2, 3)
AND "issues"."state" IN ('opened')
GROUP BY aeq_author_id, aeq_created_at
ORDER BY aeq_author_id, aeq_created_at
Key characteristics:
aeq_ (Aggregation Engine Query). This prefix is removed by AggregationResult object.WHERE or HAVING clausesGROUP BY columnsCOUNT, AVG)count metricCounts rows using COUNT(*).
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | No | Name for the count metric. Default: 'total'. Identifier becomes :{name}_count |
type | Symbol | No | Data type. Default: :integer |
formatter | Proc | No | Formatting function applied to results |
description | String | No | Human-readable description |
mean metricCalculates the average value using AVG().
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Column name to average. Identifier becomes :mean_{name} |
type | Symbol | No | Data type. Default: :float |
expression | Proc | No | Custom Arel expression instead of column |
scope_proc | Proc | No | Modifies the ActiveRecord scope (for example for JOINs) |
formatter | Proc | No | Formatting function applied to results |
description | String | No | Human-readable description |
column dimensionGroups results by a column value.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Column name or identifier |
type | Symbol | Yes | Data type (:string, :integer, :datetime, etc.) |
expression | Proc | No | Custom Arel expression instead of column |
scope_proc | Proc | No | Modifies the ActiveRecord scope (for example for JOINs) |
formatter | Proc | No | Formatting function applied to results |
description | String | No | Human-readable description |
date_bucket dimensionGroups results by time intervals using PostgreSQL's date_trunc() function. Supports parameters.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Date/datetime column name |
type | Symbol | Yes | Data type (:date or :datetime) |
expression | Proc | No | Custom Arel expression instead of column |
scope_proc | Proc | No | Modifies the ActiveRecord scope |
parameters | Hash | No | Parameter configuration (see below) |
description | String | No | Human-readable description |
Supported Parameters:
| Parameter | Type | Values | Default | Description |
|---|---|---|---|---|
granularity | String | daily, weekly, monthly, yearly | monthly | Time interval for grouping |
exact_match filterFilters rows by exact value match using WHERE column IN (...).
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Column name to filter |
type | Symbol | Yes | Data type of filter values |
expression | Proc | No | Custom Arel expression instead of column |
max_size | Integer | No | Maximum number of values allowed in filter |
description | String | No | Human-readable description |
The ClickHouse engine (Gitlab::Database::Aggregation::ClickHouse::Engine) generates optimized queries for ClickHouse's columnar database.
class SessionAnalyticsEngine < Gitlab::Database::Aggregation::ClickHouse::Engine
self.table_name = 'sessions'
filters do
exact_match :flow_type, :string, description: 'Filter by flow type'
range :created_at, :datetime, description: 'Filter by creation date'
end
dimensions do
column :flow_type, :string, description: 'Group by flow type'
date_bucket :created_at, :datetime,
parameters: { granularity: { in: %i[daily weekly monthly], type: :string } },
description: 'Group by date'
end
metrics do
count description: 'Total sessions'
count :completed, :integer,
expression: -> { Arel.sql('1') },
if: -> { Arel.sql('finished_at IS NOT NULL') },
description: 'Completed sessions'
mean :duration, :float,
expression: -> { Arel.sql('finished_at - created_at') },
if: -> { Arel.sql('finished_at IS NOT NULL') },
description: 'Average session duration'
rate :completion,
numerator_if: -> { Arel.sql('finished_at IS NOT NULL') },
description: 'Session completion rate'
quantile :duration, :float,
expression: -> { Arel.sql('finished_at - created_at') },
parameters: { quantile: { type: :float, description: 'Quantile value (0.0-1.0)' } },
description: 'Duration percentile'
end
end
The ClickHouse engine generates a two-level nested query for optimal performance. Overall structure can be expressed like this:
-- metacode query to emphasize on query structure
SELECT dimensions, metrics
FROM (
SELECT
primary_key_columns,
dimensions_expressions,
metrics_expressions,
FROM source_table
WHERE filters
GROUP BY ALL
) ch_aggregation_inner_query
GROUP BY ALL
ORDER BY orders
Inner query precalculates data for each primary key in source table. Outer query calculates metrics and dimensions based on inner query.
Example full query:
SELECT
`ch_aggregation_inner_query`.`aeq_flow_type` AS aeq_flow_type,
toStartOfInterval(
`ch_aggregation_inner_query`.`aeq_created_at`,
INTERVAL 1 month
) AS aeq_created_at,
COUNT(*) AS aeq_total_count,
countIf(`ch_aggregation_inner_query`.`aeq_completed_secondary` = 1) AS aeq_completed_count,
avgIf(
`ch_aggregation_inner_query`.`aeq_mean_duration`,
`ch_aggregation_inner_query`.`aeq_mean_duration_secondary` = 1
) AS aeq_mean_duration,
countIf(`ch_aggregation_inner_query`.`aeq_completion_rate` = 1) / COUNT(*) AS aeq_completion_rate,
quantile(0.5)(`ch_aggregation_inner_query`.`aeq_duration_quantile`) AS aeq_duration_quantile
FROM (
SELECT
`sessions`.`flow_type` AS aeq_flow_type,
`sessions`.`created_at` AS aeq_created_at,
finished_at IS NOT NULL AS aeq_completed_secondary,
finished_at - created_at AS aeq_mean_duration,
finished_at IS NOT NULL AS aeq_mean_duration_secondary,
finished_at IS NOT NULL AS aeq_completion_rate,
finished_at - created_at AS aeq_duration_quantile,
`sessions`.`user_id`,
`sessions`.`session_id`
FROM `sessions`
WHERE `sessions`.`created_at` BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY ALL
) ch_aggregation_inner_query
GROUP BY ALL
ORDER BY aeq_flow_type, aeq_created_at
Key characteristics:
*Merge columns easily as well as *If aggregations.*If functionsaeq_ (Aggregation Engine Query). This prefix is removed by AggregationResult object.WHERE or HAVING clauses on the inner queryHAVING clauses on the outer queryGROUP BY columns on outer querycount metricCounts rows with support for distinct counting and conditional counting using countIf().
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | No | Name for the count metric. Default: 'total'. Identifier becomes :{name}_count |
type | Symbol | No | Data type. Default: :integer |
expression | Proc | No | Custom expression for counting specific values |
if | Proc | No | Condition expression for conditional counting (countIf) |
distinct | Boolean | No | Enable distinct counting. Default: false |
formatter | Proc | No | Formatting function applied to results |
description | String | No | Human-readable description |
mean metricCalculates the average value with support for conditional averaging using avgIf().
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Column name or identifier. Identifier becomes :mean_{name} |
type | Symbol | No | Data type. Default: :float |
expression | Proc | No | Custom expression for the value to average |
if | Proc | No | Condition expression for conditional averaging (avgIf) |
formatter | Proc | No | Formatting function applied to results |
description | String | No | Human-readable description |
rate metricCalculates the ratio between rows matching a numerator condition and rows matching a denominator condition (or total rows).
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Identifier name. Identifier becomes :{name}_rate |
type | Symbol | No | Data type. Default: :float |
numerator_if | Proc | Yes | Condition for the numerator (rows to count) |
denominator_if | Proc | No | Condition for the denominator. If not provided, uses total count |
formatter | Proc | No | Formatting function applied to results |
description | String | No | Human-readable description |
quantile metricCalculates percentiles using ClickHouse's quantile() function. Supports parameters.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Column name or identifier. Identifier becomes :{name}_quantile |
type | Symbol | No | Data type. Default: :float |
expression | Proc | No | Custom expression for the value |
parameters | Hash | No | Parameter configuration (see below) |
formatter | Proc | No | Formatting function applied to results |
description | String | No | Human-readable description |
Supported Parameters:
| Parameter | Type | Values | Default | Description |
|---|---|---|---|---|
quantile | Float | 0.0 - 1.0 | 0.5 | Quantile value (0.5 = median, 0.9 = p90, 0.99 = p99) |
retained_count metricCounts values that appear in both the current and previous period, using groupBitmapState
and arrayIntersect. Use retained_count for feature retention or returning-user counts.
The dimension referenced by over: must be requested in the query.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Identifier name. Identifier becomes :{name}_count |
type | Symbol | No | Data type. Default: :integer |
expression | Proc | No | Expression for the value to deduplicate, for example user_id |
over | Symbol | Yes | Dimension that defines the period. Must be a dimension on the engine |
lag_offset | Integer | No | Number of periods to compare against. Default: 1 |
description | String | No | Human-readable description |
Example:
metrics do
retained_count :returning_users, :integer, -> { sql('user_id') }, over: :timestamp,
description: 'Users present in both the current and previous period'
end
lagged_count metricReturns the distinct count of values from the previous period, using uniqExact with
lagInFrame. Pair lagged_count with retained_count to compute retention rates
(returning ÷ previous). The dimension referenced by over: must be requested in the query.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Identifier name. Identifier becomes :{name}_count |
type | Symbol | No | Data type. Default: :integer |
expression | Proc | No | Expression for the value to deduplicate |
over | Symbol | Yes | Dimension that defines the period |
lag_offset | Integer | No | Number of periods to look back. Default: 1 |
description | String | No | Human-readable description |
Example:
metrics do
lagged_count :previous_period_users, :integer, -> { sql('user_id') }, over: :timestamp,
description: 'Distinct users in the previous period'
end
When a request includes more dimensions than just over:, the framework partitions the
lag window by the extra dimensions. Each combination gets an independent sequence, so
values do not leak across categories. For example, with dimensions: [feature, timestamp]
where timestamp is a date_bucket with granularity: 'daily' and the metric uses
over: :timestamp, the generated SQL contains
OVER (PARTITION BY aeq_feature ORDER BY aeq_timestamp_daily ASC). Retention for
code_suggestions does not mix with chat.
column dimensionGroups results by a column value.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Column name or identifier |
type | Symbol | Yes | Data type (:string, :integer, :datetime, etc.) |
expression | Proc | No | Custom expression instead of column |
formatter | Proc | No | Formatting function applied to results |
description | String | No | Human-readable description |
association | Boolean | No | When true, the dimension is also accessible without the _id suffix as an object. Defaults to false. |
date_bucket dimensionGroups results by time intervals using ClickHouse's toStartOfInterval() function. Supports parameters.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Date/datetime column name |
type | Symbol | Yes | Data type (:date or :datetime) |
expression | Proc | No | Custom expression instead of column |
parameters | Hash | No | Parameter configuration (see below) |
description | String | No | Human-readable description |
Supported Parameters:
| Parameter | Type | Values | Default | Description |
|---|---|---|---|---|
granularity | String | daily, weekly, monthly, yearly | monthly | Time interval for grouping |
exact_match filterFilters rows by exact value match. Supports filtering on regular columns or merge columns (pre-aggregated data).
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Column name to filter |
type | Symbol | Yes | Data type of filter values |
expression | Proc | No | Custom expression instead of column |
merge_column | Boolean | No | If true, applies filter using HAVING instead of WHERE |
max_size | Integer | No | Maximum number of values allowed in filter |
description | String | No | Human-readable description |
range filterFilters rows by value range using BETWEEN. Supports filtering on regular columns or merge columns.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Column name to filter |
type | Symbol | Yes | Data type of filter values (:datetime, :integer, etc.) |
expression | Proc | No | Custom expression instead of column |
merge_column | Boolean | No | If true, applies filter using HAVING instead of WHERE |
description | String | No | Human-readable description |
metric_exact_match filterFilters groups by exact match on an aggregated metric value. Applied as a HAVING clause in post-aggregation.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Identifier of the metric to filter by. Must match a metric defined in the same engine. |
type | Symbol | Yes | Data type of filter values |
max_size | Integer | No | Maximum number of values allowed in filter |
description | String | No | Human-readable description |
The referenced metric must also be requested in the same Request. For parameterized metrics,
the filter parameters must match the parameters of the requested metric instance.
Example:
filters do
metric_exact_match :total_count, :integer
end
Gitlab::Database::Aggregation::Request.new(
filters: [{ identifier: :total_count, values: [1, 2] }],
dimensions: [{ identifier: :user_id }],
metrics: [{ identifier: :total_count }]
)
metric_range filterFilters groups by value range on an aggregated metric using BETWEEN. Applied as a HAVING
clause in post-aggregation.
| Option | Type | Required | Description |
|---|---|---|---|
name | Symbol | Yes | Identifier of the metric to filter by. Must match a metric defined in the same engine. |
type | Symbol | Yes | Data type of filter values (:integer, :float, etc.) |
description | String | No | Human-readable description |
The referenced metric must also be requested in the same Request. For parameterized metrics,
the filter parameters must match the parameters of the requested metric instance, so the filter
targets the correct metric instance.
Example:
filters do
metric_range :total_count, :integer
metric_range :duration_quantile, :float
end
Gitlab::Database::Aggregation::Request.new(
filters: [
{ identifier: :duration_quantile, parameters: { quantile: 0.1 }, values: 200..nil }
],
dimensions: [{ identifier: :user_id }],
metrics: [{ identifier: :duration_quantile, parameters: { quantile: 0.1 } }]
)
Transient columns are named SQL expression aliases you define once and
reference across dimensions, metrics, and filters blocks. They are
not projected in the final query result. Use transient columns to
eliminate duplication of complex SQL expressions.
Call transient at the class level with a name and a block that returns
an Arel expression. Define transient columns before you reference them.
transient(:duration) do
sql("dateDiff('seconds', anyIfMerge(created_event_at), anyIfMerge(finished_event_at))")
end
transient(:is_finished) { sql('anyIfMerge(finished_event_at) IS NOT NULL') }
Inside dimensions, metrics, or filters blocks, call
transient(:name) to insert the stored expression. Pass the return
value anywhere a lambda expression is accepted: as a positional
argument or as a keyword argument value.
metrics do
mean :duration, :float, transient(:duration),
description: 'Average session duration in seconds'
count :finished, if: transient(:is_finished),
description: 'Number of finished sessions'
end
request = Gitlab::Database::Aggregation::Request.new(
filters: [
{ identifier: :project_id, values: [1, 2, 3] },
{ identifier: :state, values: ['opened'] }
],
dimensions: [
{ identifier: :author_id },
{ identifier: :created_at, parameters: { granularity: 'monthly' } },
{ identifier: :created_at, parameters: { granularity: 'weekly' } },
],
metrics: [
{ identifier: :total_count },
{ identifier: :mean_weight }
],
order: [
{ identifier: :total_count, direction: :desc } # order identifier must reference dimension or metric.
]
)
engine = IssueAggregationEngine.new(context: { scope: Issue.all })
response = engine.execute(request)
if response.success?
puts "Success: #{response.payload[:data].to_a.inspect}"
else
puts "Errors: #{response.errors}"
end
The framework consists of several key components:
┌────────────────────────────────────────────────────┐
│ Request │
│ (metrics, dimensions, filters, order) │
└────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────┐
│ QueryPlan │
│ (validates request, builds plan parts) │
└────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────┐
│ Engine │
│ (executes query plan, returns AggregationResult) │
└────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────┐
│ AggregationResult │
│ implements Enumerable to access formatted results │
└────────────────────────────────────────────────────┘
The framework validates requests before execution:
granularity: { in: %i[daily weekly monthly], type: :string } will require granularity value to be one of 3 provided strings.Use the Gitlab::Database::Aggregation::Graphql::Mounter module to expose aggregation engines
in the GraphQL API.
The GraphQL integration automatically generates:
OFFSET paginationUse the mount_aggregation_engine method in your GraphQL type to expose an aggregation engine:
module Types
class ProjectType < BaseObject
extend Gitlab::Database::Aggregation::Graphql::Mounter
mount_aggregation_engine(
IssueAggregationEngine,
field_name: 'issue_analytics',
description: 'Issue analytics aggregation'
) do
# Define base aggregation scope. Build your own scope or inherit one from parent object.
def aggregation_scope
object.issues
end
end
end
end
[!note] All filters, metrics, and dimensions are exposed automatically.
| Option | Type | Description |
|---|---|---|
field_name | String/Symbol | The GraphQL field name. Defaults to :aggregation |
types_prefix | String/Symbol | Prefix for all child types like *AggregationResponse. Defaults to field_name |
description | String | Description for the GraphQL field |
authorize | Symbol | Permission required to access the field (e.g. :read_project). Passed directly to the GraphQL field definition |
Use the authorize option to restrict access to the field:
mount_aggregation_engine(
IssueAggregationEngine,
field_name: 'issue_analytics',
description: 'Issue analytics aggregation',
authorize: :read_project
) do
# authorize :read_project - this also supported.
def aggregation_scope
object.issues
end
end
If authorize is not specified, you must take care of authorization manually.
The generated GraphQL subtree uses a two-level structure:
issueAnalytics) accepts dimension and non-metric filter arguments.aggregated field accepts metric filter, ordering, and pagination arguments,
and returns the paginated connection.query IssueAnalytics($projectId: ID!) {
project(fullPath: $projectId) {
issueAnalytics(
state: ["opened", "closed"]
createdAtFrom: "2024-01-01"
createdAtTo: "2024-12-31"
) {
aggregated(
totalCountFrom: 5
orderBy: [{ identifier: "totalCount", direction: DESC }]
first: 10
) {
nodes {
dimensions {
createdAt(granularity: "monthly")
}
totalCount
meanWeight
highQuantile: durationQuantile(0.9)
medianQuantile: durationQuantile(0.5)
}
pageInfo {
hasNextPage
endCursor
}
}
}
}
}
Filter arguments are split across the two levels based on when the filter is applied:
exact_match or range) appear on the outer field (e.g. issueAnalytics).metric_exact_match or metric_range) appear on the
inner aggregated field.Add custom validation logic to discard specific aggregation requests while maintaining the GraphQL schema. This is useful when you need to enforce custom runtime constraints on specific requests.
Raise a GraphQL::ExecutionError to reject the request with a custom error message.
To add custom validations, override the validate_request! method in the mounting block:
module Types
class ProjectType < BaseObject
extend Gitlab::Database::Aggregation::Graphql::Mounter
mount_aggregation_engine(IssueAggregationEngine) do
# Other configuration options...
# Custom validation logic
def validate_request!(engine_request)
if engine_request.dimensions.empty?
raise GraphQL::ExecutionError, 'At least one dimension must be specified'
end
end
end
end
end
The validate_request! method receives a Gitlab::Database::Aggregation::Request object containing dimensions, metrics, filters and order specifications.
Dimensions can be marked as associations using the association: true option. This changes how the dimension is exposed in GraphQL, automatically resolving the associated model instead of exposing just the ID.
In your aggregation engine, declare a dimension with association: true:
class AgentPlatformSessions < Gitlab::Database::Aggregation::ClickHouse::Engine
dimensions do
column :flow_type, :string, description: 'Type of session'
column :user_id, :integer, description: 'Session owner', association: true
end
end
When a dimension is marked as an association, an object is exposed instead of the raw *_id field. The dimension above transforms to field :user, Types::UserType, ... in GraphQL with batch loading by ID.
You can order the dimensions by the association ID using the association name without _id suffix (for example, orderBy: [{ identifier: "user", direction: DESC }]).
[!note] You must ensure all proper authorization checks on association GraphQL type (e.g.
authorize :read_user).
By default, the association model and GraphQL type are inferred from the dimension name:
user_id → UserUser → Types::UserTypeYou can customize this behavior by passing a hash to the association option:
dimensions do
column :author_id, :integer,
description: 'Issue author',
association: { model: User }
# or model and GraphQL type
# association: { model: User, graphql_type: Types::CurrentUserType }
end
The following example is a query without association:
query {
project(fullPath: "gitlab-org/gitlab") {
aiUsage {
agentPlatformSessions {
aggregated {
nodes {
dimensions {
userId # Returns: 123 (integer)
}
}
}
}
}
}
}
The following example is a query with association:
query {
project(fullPath: "gitlab-org/gitlab") {
aiUsage {
agentPlatformSessions(
userId: [1, 2] # Filter still uses original dimension identifier
) {
aggregated(
orderBy: [{ identifier: "user", direction: DESC }] # Order uses association name
) {
nodes {
dimensions {
user { # Returns: full User object
id
username
name
}
}
}
}
}
}
}
}