Robotic and sensor data is inherently messy: different sensors produce data at different rates, on different timelines, and with heterogeneous schemas. Machine learning workloads, on the other hand, rely on aligned rows, where each row represents one sample with a consistent schema and a single index.
Dataframe queries are designed to bridge this gap. They allow you to query arbitrary Rerun data and produce a dataframe as output.
Dataframe queries can be used in two contexts. In this page, we focus on querying datasets: the `DatasetEntry` object provides an API to filter and query datasets and turn them into dataframes.

Let's use an example to illustrate how dataframe queries work.
Dataframe queries run against datasets stored on a Data Platform. We can create a demo recording and load it into a temporary local catalog using the following code:
snippet: concepts/query-and-transform/dataframe_query_example[setup]
We can then perform a dataframe query (against the local open-source Data Platform included in Rerun):
snippet: concepts/query-and-transform/dataframe_query_example[query]
This should produce an output similar to:
```
┌──────────────────────────────────┬────────────────────┬───────────────────────────────────┐
│ rerun_segment_id                 ┆ step               ┆ /data:Scalars:scalars             │
│ ---                              ┆ ---                ┆ ---                               │
│ type: Utf8                       ┆ type: nullable i64 ┆ type: nullable List[nullable f64] │
│                                  ┆ index_name: step   ┆ archetype: Scalars                │
│                                  ┆ kind: index        ┆ component: Scalars:scalars        │
│                                  ┆                    ┆ component_type: Scalar            │
│                                  ┆                    ┆ entity_path: /data                │
│                                  ┆                    ┆ kind: data                        │
╞══════════════════════════════════╪════════════════════╪═══════════════════════════════════╡
│ 5712205b356b470e8d1574157e55f65e ┆ 13                 ┆ [0.963558185417193]               │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5712205b356b470e8d1574157e55f65e ┆ 14                 ┆ [0.9854497299884601]              │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5712205b356b470e8d1574157e55f65e ┆ 15                 ┆ [0.9974949866040544]              │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5712205b356b470e8d1574157e55f65e ┆ 16                 ┆ [0.9995736030415051]              │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5712205b356b470e8d1574157e55f65e ┆ 17                 ┆ [0.9916648104524686]              │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5712205b356b470e8d1574157e55f65e ┆ 18                 ┆ [0.9738476308781951]              │
└──────────────────────────────────┴────────────────────┴───────────────────────────────────┘
```
Let's unpack what happened here:
- We use `rr.server.Server()` to spin up a temporary local catalog. In production, you might connect to a Rerun Data Platform deployment instead. We then obtain the dataset to be queried from the catalog.
- The `filter_contents()` method restricts the scope of the query to specific entities. This affects which columns are returned, but may also change which rows are returned, since rows are only produced where at least one filtered column has data (see "How are rows produced?" below).
- The `reader(index=…)` method returns a DataFusion dataframe. The `index` parameter specifies which timeline drives row generation: a row is produced for each unique value of this index where data exists. The returned dataframe doesn't execute until it is collected.
- We use `filter()` to filter rows based on the data. Again, these are lazy operations that only build a query plan.
- Finally, `print(df)` implicitly executes the dataframe's query plan and returns the final result. The same happens when converting the dataframe for use with other frameworks (Pandas, Polars, PyArrow, etc.).

The overall flow is summarized by the following diagram:

```d2
direction: down

Dataset: {
  shape: cylinder
}

view: {
  label: "Dataset view"
}

DataFrame: {
  label: "DataFusion DataFrame"
}

Result: {
  label: "Materialized rows\n(Arrow RecordBatch)"
  shape: page
}

Dataset -> view: "filter_contents()\nfilter_segments()"
Dataset -> DataFrame: "reader()"
view -> DataFrame: "reader()"
DataFrame -> Result: "collect()"
```
## How are rows produced?

A row is produced for each distinct index (or timeline) value for which there is at least one value in the filtered content.
For example, if you filter for entities `/camera` and `/lidar`, and `/camera` has data at timestamps `[1, 2, 3]` while `/lidar` has data at `[2, 4]`, the output will have rows for timestamps `[1, 2, 3, 4]`. Columns without data at a given timestamp will contain null values (unless sparse fill is enabled).
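This row-production rule can be sketched in plain Python. The snippet below is a simulation of the semantics only, not the Rerun API; the entity paths and timestamps are taken from the example above:

```python
# Each entity maps to the set of index values at which it has data.
entity_times = {
    "/camera": {1, 2, 3},
    "/lidar": {2, 4},
}

# One row is produced per distinct index value across all filtered entities.
row_index = sorted(set().union(*entity_times.values()))  # [1, 2, 3, 4]

# Columns contain None (null) where an entity has no data at that index value.
rows = [
    {"timestamp": t, **{e: (t if t in ts else None) for e, ts in entity_times.items()}}
    for t in row_index
]

for row in rows:
    print(row)
```

At timestamp `4`, the `/camera` column is null; at timestamp `1`, the `/lidar` column is null.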
## What is the difference between `filter_contents()` and DataFusion's `select()`?

At first glance, both methods control which columns appear in the result. However, they differ in an important way:

- `filter_contents()` restricts which entities are considered for row generation. This affects both which columns and which rows are returned.
- `select()` is a DataFusion operation that only filters columns after rows have been determined. It does not affect row generation.

Building on the previous example, if `/camera` has data at timestamps `[1, 2, 3]` and `/lidar` has data at `[2, 4]`:
```python
# Rows at [1, 2, 3], with only /camera columns
dataset.filter_contents("/camera").reader(index="timestamp")

# Rows at [1, 2, 3, 4], with only /camera columns
# (null values at timestamp 4, where /camera has no data)
dataset.filter_contents(["/camera", "/lidar"]).reader(index="timestamp").select("/camera")
```
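The row-count difference between the two pipelines can be checked with a small pure-Python simulation of the semantics (a hypothetical helper, not the Rerun API):

```python
def simulate_rows(filtered_entities, entity_times):
    """Rows are generated from the union of index values of the *filtered* entities."""
    return sorted(set().union(*(entity_times[e] for e in filtered_entities)))

entity_times = {"/camera": {1, 2, 3}, "/lidar": {2, 4}}

# filter_contents("/camera"): only /camera participates in row generation.
rows_filtered = simulate_rows(["/camera"], entity_times)

# filter_contents(["/camera", "/lidar"]) + select("/camera"): both entities
# generate rows; select() only drops columns afterwards.
rows_selected = simulate_rows(["/camera", "/lidar"], entity_times)

print(rows_filtered)  # [1, 2, 3]
print(rows_selected)  # [1, 2, 3, 4]
```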
When querying a dataset with multiple segments, the query is applied on a segment-by-segment basis. This means:

- The output includes a `rerun_segment_id` column identifying which segment each row comes from.
- You can use `filter_segments()` on a dataset or dataset view to restrict the query to specific segment IDs.

## How is static data handled?

As a reminder, static data has no associated timeline and represents values that don't change over time. When data is logged as static to a column, it is considered valid for all timelines and for all times, overriding any temporal data otherwise logged to the same column.
As a consequence, static data cannot by itself generate rows. However, for rows generated by other (temporal) data, static data will show up in the corresponding columns, provided those columns are part of the filtered content.
In practice, this can cause performance and/or memory issues when the same large static data is yielded in every row.
For this reason, it may be preferable to filter static columns out (e.g. using `filter_contents()`) and query the static data separately.
Querying static data only can also be useful for retrieving configuration, calibration data, or other time-invariant information.
This is achieved by setting the `index` parameter to `None`:

```python
df = dataset.reader(index=None)
```
The returned dataframe contains a single row with all the static data from the filtered content.
By default, rows are produced only at index values where data exists. To sample at specific timestamps (even if no data exists there), use the `using_index_values` parameter combined with `fill_latest_at=True`:
```python
# Sample at a fixed 10 Hz (100 ms intervals)
timestamps = np.arange(start_time, end_time, np.timedelta64(100, "ms"))

df = dataset.reader(
    index="timestamp",
    using_index_values=timestamps,
    fill_latest_at=True,
)
```
- `using_index_values` specifies the exact timestamps to sample
- `fill_latest_at=True` fills null values with the most recent data (latest-at/forward-fill semantics)

For a complete example, see the Time-align data how-to.
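The latest-at fill behavior can be sketched in plain Python. This is a simulation of the semantics under assumed data, not the Rerun API:

```python
# Data exists only at a few timestamps (in ms); we sample on a fixed grid.
data = {0: "a", 250: "b", 600: "c"}  # timestamp -> value

sample_times = range(0, 1000, 100)  # 10 Hz sampling grid

def latest_at(t, data):
    """Return the most recent value at or before time t, or None if none exists."""
    candidates = [ts for ts in data if ts <= t]
    return data[max(candidates)] if candidates else None

samples = {t: latest_at(t, data) for t in sample_times}
print(samples)
```

Every sampled row carries the most recent value at or before its timestamp: samples at 0, 100, and 200 ms hold `"a"`, samples from 300 to 500 ms hold `"b"`, and samples from 600 ms onward hold `"c"`.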