# Tiling
Tiling is an optimization technique for streaming time-windowed aggregations. It enables efficient feature computation by pre-aggregating data into smaller time intervals (tiles) and storing Intermediate Representations (IRs) so that tiles can be merged into full windows correctly.
Primary Use Case: Streaming
Tiling provides a significant speedup in streaming scenarios where features are updated frequently (every few minutes) from sources like Kafka, Kinesis, or a PushSource.
Traditional approaches to time-windowed aggregations either recompute the entire window on every update (expensive) or naively merge pre-computed tile values (incorrect for many functions). You cannot correctly merge many common aggregations from their final values:
```
WRONG: avg(tile1 ∪ tile2) ≠ (avg_tile1 + avg_tile2) / 2
```

Example:

```
tile1: [10, 20, 30] → avg = 20
tile2: [100]        → avg = 100

Correct merged avg: (10 + 20 + 30 + 100) / 4 = 40
Wrong merged avg:   (20 + 100) / 2 = 60
```
The same problem exists for `std` and `var`.

The solution: instead of storing final aggregated values, store intermediate data that preserves the mathematical properties needed for correct merging.
Traditional (Incorrect):

```
Tile 1: avg = 20
Tile 2: avg = 100
Merged avg = (20 + 100) / 2 = 60 - WRONG
```
With IRs (Correct):

```
Tile 1: sum = 60,  count = 3
Tile 2: sum = 100, count = 1
Merged: sum = 160, count = 4
Merged avg = 160 / 4 = 40 - CORRECT
```
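The difference is easy to verify in plain Python. This is an illustrative sketch only; `tile_ir` and `merge_avg` are hypothetical helpers, not Feast APIs:

```python
def tile_ir(values):
    """Intermediate representation for avg: (sum, count)."""
    return (sum(values), len(values))

def merge_avg(*irs):
    """Merge avg IRs: sum the sums and counts, then divide once."""
    total = sum(s for s, _ in irs)
    count = sum(c for _, c in irs)
    return total / count

tile1 = tile_ir([10, 20, 30])  # (60, 3)
tile2 = tile_ir([100])         # (100, 1)

naive = (60 / 3 + 100 / 1) / 2    # averaging the averages -> 60.0 (wrong)
correct = merge_avg(tile1, tile2)  # 160 / 4 -> 40.0 (matches the full window)
```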
Simple aggregations can be merged by applying the same aggregation function to the per-tile results:
| Aggregation | Stored Value | Merge Strategy | Storage |
|---|---|---|---|
| sum | sum | sum(tile_sums) | 1 column |
| count | count | sum(tile_counts) | 1 column |
| max | max | max(tile_maxes) | 1 column |
| min | min | min(tile_mins) | 1 column |
No IRs needed - the final value is the IR!
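A minimal sketch of this self-merging property (plain Python, not Feast code):

```python
# Three tiles of raw values; for simple aggregations the per-tile
# result IS the IR, and merging just re-applies the same function.
tiles = [[10, 20, 30], [100], [5, 15]]

per_tile = {
    "sum":   [sum(t) for t in tiles],
    "count": [len(t) for t in tiles],
    "max":   [max(t) for t in tiles],
    "min":   [min(t) for t in tiles],
}

merged = {
    "sum":   sum(per_tile["sum"]),    # 180
    "count": sum(per_tile["count"]),  # 6
    "max":   max(per_tile["max"]),    # 100
    "min":   min(per_tile["min"]),    # 5
}
```

Each merged value equals the aggregation computed directly over all raw events, which is why no extra IR columns are required.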
Complex aggregations require storing multiple intermediate values:
avg (mean):

- Stored IRs: `sum`, `count`
- Final computation: `avg = sum / count`
- Merge strategy: sum the sums and counts, then divide
- Storage: 3 columns (final + 2 IRs)
std (stddev):

- Stored IRs: `count`, `sum`, `sum_of_squares`
- Final computation:

```
variance = (sum_sq - sum² / count) / (count - δ)   # δ = 1 for sample, 0 for population
std = sqrt(variance)
```

- Merge strategy: sum all three IRs, then apply the formula
- Storage: 4 columns (final + 3 IRs)
var (variance):

- Stored IRs: `count`, `sum`, `sum_of_squares`
- Final computation: same as std, but without the `sqrt()`
- Storage: 4 columns (final + 3 IRs)
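The std formula above can be checked with a small sketch (illustrative only; the `ir` and `merge_std` helpers are assumptions, not Feast APIs):

```python
import math

def ir(values):
    """IRs for std/var: count, sum, and sum of squares."""
    return {"count": len(values), "sum": sum(values),
            "sum_sq": sum(v * v for v in values)}

def merge_std(tiles, ddof=1):
    """Sum each IR across tiles, then apply the closed-form formula."""
    count  = sum(t["count"] for t in tiles)
    total  = sum(t["sum"] for t in tiles)
    sum_sq = sum(t["sum_sq"] for t in tiles)
    variance = (sum_sq - total ** 2 / count) / (count - ddof)  # ddof=1: sample
    return math.sqrt(variance)

tiles = [ir([10, 20, 30]), ir([100])]
# merge_std(tiles) equals the sample std of the full window [10, 20, 30, 100]
```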
Tiling is optimized for streaming scenarios with frequent updates (e.g., every few minutes).
```
Stream Events → Partition by Hop Intervals → Compute IRs → Store Windowed Aggregations
     |                    |                       |                  |
     |                    |                       |                  └─> Online Store (Redis, etc.)
     |                    |                       └─> avg_sum, avg_count, std_sum_sq, etc.
     |                    └─> 5-min hops: [00:00-00:05], [00:05-00:10], ...
     └─> customer_id=1: [txn1, txn2, txn3, ...]
```
Every 5 minutes:
- New events arrive
- Only 1 new tile computed (5 min of data)
- 11 previous tiles reused (in memory during streaming session)
- Final aggregation = merge 12 tiles (1 new + 11 reused)
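The update loop above can be sketched in plain Python (a simplified illustration of the idea, not Feast internals; `on_hop` and the sum/count tile buffer are assumptions):

```python
from collections import deque

WINDOW_TILES = 12  # 1-hour window / 5-minute hops

# Keep the last 12 tiles in memory; deque evicts the oldest automatically.
tile_buffer = deque(maxlen=WINDOW_TILES)

def on_hop(new_events):
    """Called every 5 minutes with only the events from the newest hop."""
    # Compute IRs for the single new tile; the other 11 are reused as-is.
    tile_buffer.append({"sum": sum(new_events), "count": len(new_events)})
    # Final 1h aggregation = merge of all buffered tiles.
    total = sum(t["sum"] for t in tile_buffer)
    count = sum(t["count"] for t in tile_buffer)
    return total / count if count else None
```

Each hop touches only the newest tile's raw events, which is where the ~92% reuse in the table above comes from.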
Why It's Fast:
| Update | Without Tiling | With Tiling | Tile Reuse |
|---|---|---|---|
| T=00:00 | Compute 1hr | Compute 12 tiles | 0% reuse (initial) |
| T=00:05 | Compute 1hr (1000+ events) | Compute 1 tile + reuse 11 | 92% reuse |
| T=00:10 | Compute 1hr (1000+ events) | Compute 1 tile + reuse 11 | 92% reuse |
| T=00:15 | Compute 1hr (1000+ events) | Compute 1 tile + reuse 11 | 92% reuse |
Key Benefit: Tiles stay in memory during the streaming session, enabling massive reuse.
Tiling provides maximum benefit for streaming scenarios with frequent updates:
```python
from datetime import timedelta

from feast import Aggregation, StreamFeatureView
from feast.data_source import KafkaSource, PushSource

# Example with a Kafka streaming source
# (`customer` and `file_source` are defined elsewhere in the repo)
customer_features = StreamFeatureView(
    name="customer_transaction_features",
    entities=[customer],
    source=KafkaSource(
        name="transactions_stream",
        kafka_bootstrap_servers="localhost:9092",
        topic="transactions",
        timestamp_field="event_timestamp",
        batch_source=file_source,  # For historical data
    ),
    aggregations=[
        Aggregation(column="amount", function="sum", time_window=timedelta(hours=1), name="sum_amount_1h"),
        Aggregation(column="amount", function="avg", time_window=timedelta(hours=1), name="avg_amount_1h"),
        Aggregation(column="amount", function="std", time_window=timedelta(hours=1), name="std_amount_1h"),
    ],
    timestamp_field="event_timestamp",
    online=True,
    # Tiling configuration
    enable_tiling=True,                    # Speedup for streaming
    tiling_hop_size=timedelta(minutes=5),  # Update frequency
)
```
- `aggregations`: list of time-windowed aggregations to compute. Each `Aggregation` accepts:
  - `column`: source column to aggregate
  - `function`: aggregation function (`sum`, `avg`, `mean`, `min`, `max`, `count`, `std`)
  - `time_window`: duration of the aggregation window
  - `slide_interval`: hop/slide size (defaults to `time_window`)
  - `name` (optional): output feature name. Defaults to `{function}_{column}` (e.g., `sum_amount`). Set this to use a custom name (e.g., `name="sum_amount_1h"`).
- `timestamp_field`: column name for timestamps (required when aggregations are specified)
- `enable_tiling`: enable the tiling optimization (default: `False`). Set to `True` for streaming scenarios.
- `tiling_hop_size`: time interval between tiles (default: 5 minutes)
Tiling in Feast uses a simple, pure pandas architecture that works with any compute engine:
```
┌─────────────────┐
│ Engine DataFrame│ (Spark/Ray/etc.)
└────────┬────────┘
         │ .toPandas() / .to_pandas()
         ▼
┌─────────────────┐
│ Pandas DataFrame│
└────────┬────────┘
         │ orchestrator.apply_sawtooth_window_tiling()
         ▼
┌─────────────────┐
│ Cumulative      │ (pandas with _tile_start, _tile_end, IRs)
│ Tiles           │
└────────┬────────┘
         │ tile_subtraction.convert_cumulative_to_windowed()
         ▼
┌─────────────────┐
│ Windowed        │ (pandas with final aggregations)
│ Aggregations    │
└────────┬────────┘
         │ spark.createDataFrame() / ray.from_pandas()
         ▼
┌─────────────────┐
│ Engine DataFrame│
└─────────────────┘
```
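The middle, pure-pandas step can be sketched as follows. This is a simplified illustration of bucketing events into tiles and computing IR columns, not Feast's actual orchestrator code; the `_tile_start` and IR column names follow the diagram above:

```python
import pandas as pd

# A tiny stream of transaction events
events = pd.DataFrame({
    "event_timestamp": pd.to_datetime(
        ["2024-01-01 00:01", "2024-01-01 00:03", "2024-01-01 00:07"]),
    "amount": [10.0, 20.0, 100.0],
})

# Bucket each event into its 5-minute tile
events["_tile_start"] = events["event_timestamp"].dt.floor("5min")

# Per-tile IRs: sum and count (for avg), sum of squares (for std/var)
tiles = events.groupby("_tile_start")["amount"].agg(
    avg_sum="sum",
    avg_count="count",
    std_sum_sq=lambda s: (s ** 2).sum(),
).reset_index()
# Windowed aggregations are then produced by merging the tile IRs
# covering each window, as in the earlier examples.
```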
Tiling with Intermediate Representations provides a powerful optimization for streaming time-windowed aggregations in Feast.