docs/connectors/bigtable.md
!!! warning "Experimental"
    This connector is experimental and the API may change in future releases.

Google Cloud Bigtable is a fully managed, scalable NoSQL database service. Daft can write DataFrames to Bigtable tables using [df.write_bigtable()][daft.dataframe.DataFrame.write_bigtable].

Bigtable support requires the `google-cloud-bigtable` package:

```bash
pip install google-cloud-bigtable
```

Write a DataFrame by specifying the target project, instance, and table, along with the row key column and column family mappings:

```python
import daft

# Create a DataFrame
df = daft.from_pydict({
    "user_id": ["user_001", "user_002", "user_003"],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
    "email": ["alice@example.com", "bob@example.com", "charlie@example.com"],
})

# Write to Bigtable
result = df.write_bigtable(
    project_id="my-gcp-project",
    instance_id="my-bigtable-instance",
    table_id="users",
    row_key_column="user_id",
    column_family_mappings={
        "name": "profile",
        "age": "profile",
        "email": "contact",
    },
)
```

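To check what was written, you can read the rows back with the `google-cloud-bigtable` client directly. A minimal sketch, reusing the project, instance, and table IDs from the example above:

```python
from google.cloud import bigtable

# Connect to the table that the DataFrame was written to.
client = bigtable.Client(project="my-gcp-project")
table = client.instance("my-bigtable-instance").table("users")

# Fetch a single row by its row key and print every cell.
row = table.read_row(b"user_001")
for family, columns in row.cells.items():
    for qualifier, cells in columns.items():
        print(family, qualifier.decode(), cells[0].value)
```
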
Every Bigtable row requires a unique row key. Use `row_key_column` to specify which DataFrame column should be used as the row key:

```python
df.write_bigtable(
    ...,
    row_key_column="user_id",  # This column becomes the row key
)
```

Bigtable organizes columns into column families. Use `column_family_mappings` to specify which family each column belongs to:

```python
df.write_bigtable(
    ...,
    column_family_mappings={
        "name": "user_data",      # 'name' column goes to the 'user_data' family
        "age": "user_data",       # 'age' column goes to the 'user_data' family
        "email": "contact_info",  # 'email' column goes to the 'contact_info' family
    },
)
```

The column families must already exist in the Bigtable table; if they do not, create them ahead of time with the Bigtable admin client, as in the sketch below.

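A minimal sketch using the `google-cloud-bigtable` admin API to create the table and the families from the first example (the GC rule is just an illustrative choice):

```python
from google.cloud import bigtable
from google.cloud.bigtable import column_family

# Admin access is required to create tables and column families.
client = bigtable.Client(project="my-gcp-project", admin=True)
table = client.instance("my-bigtable-instance").table("users")

# Keep only the most recent version of each cell.
gc_rule = column_family.MaxVersionsGCRule(1)
if not table.exists():
    table.create(column_families={"profile": gc_rule, "contact": gc_rule})
```
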
[df.write_bigtable()][daft.dataframe.DataFrame.write_bigtable] accepts the following parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| `project_id` | `str` | Yes | Google Cloud project ID |
| `instance_id` | `str` | Yes | Bigtable instance ID |
| `table_id` | `str` | Yes | Bigtable table ID |
| `row_key_column` | `str` | Yes | Column name to use as the row key |
| `column_family_mappings` | `dict[str, str]` | Yes | Mapping of column names to column families |
| `client_kwargs` | `dict` | No | Additional arguments for the Bigtable `Client` |
| `write_kwargs` | `dict` | No | Additional arguments for `MutationsBatcher` |
| `serialize_incompatible_types` | `bool` | No | Auto-convert incompatible types to JSON (default: `True`) |

Bigtable cells only accept data that can be converted to bytes. By default, Daft automatically serializes incompatible types to JSON:

```python
df = daft.from_pydict({
    "id": ["row1"],
    "data": [{"nested": "object"}],  # Complex type
})

# Complex types are automatically serialized to JSON
df.write_bigtable(
    ...,
    serialize_incompatible_types=True,  # Default behavior
)
```

To disable automatic serialization (incompatible types will then raise an error):

```python
df.write_bigtable(
    ...,
    serialize_incompatible_types=False,
)
```

Pass additional options to the Bigtable `Client`:

```python
result = df.write_bigtable(
    ...,
    client_kwargs={
        "admin": True,
        "channel": custom_channel,
    },
)
```

Configure the `MutationsBatcher` used for write operations:

```python
result = df.write_bigtable(
    ...,
    write_kwargs={
        "flush_count": 1000,
        "max_row_bytes": 5 * 1024 * 1024,  # 5 MB
    },
)
```

The following end-to-end example writes IoT sensor readings from Parquet to Bigtable, building a composite row key from the device ID and timestamp:

```python
import daft
from daft import col

# Read sensor data
df = daft.read_parquet("s3://bucket/sensors/*.parquet")

# Prepare for Bigtable (build a composite row key)
df = df.with_column(
    "row_key",
    col("device_id") + "#" + col("timestamp").cast(daft.DataType.string()),
)

# Write to Bigtable
df.write_bigtable(
    project_id="iot-project",
    instance_id="sensor-data",
    table_id="readings",
    row_key_column="row_key",
    column_family_mappings={
        "temperature": "metrics",
        "humidity": "metrics",
        "device_id": "metadata",
        "timestamp": "metadata",
    },
)
```

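Because the row key is `device_id#timestamp`, all readings for one device fall in a contiguous key range and can be fetched with a prefix scan. A minimal sketch using the Bigtable client (the device ID `sensor-42` is a made-up example):

```python
from google.cloud import bigtable

client = bigtable.Client(project="iot-project")
table = client.instance("sensor-data").table("readings")

# Scan every row whose key starts with "sensor-42#".
# "$" is the byte after "#", so it closes the prefix range.
rows = table.read_rows(start_key=b"sensor-42#", end_key=b"sensor-42$")
for row in rows:
    print(row.row_key.decode())
```
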
This example stores user profiles whose nested `preferences` struct is serialized to JSON automatically:

```python
import daft

df = daft.from_pydict({
    "user_id": ["u001", "u002"],
    "preferences": [{"theme": "dark"}, {"theme": "light"}],
    "last_login": ["2024-01-15", "2024-01-16"],
})

df.write_bigtable(
    project_id="my-project",
    instance_id="user-store",
    table_id="profiles",
    row_key_column="user_id",
    column_family_mappings={
        "preferences": "settings",
        "last_login": "activity",
    },
)
```

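When a serialized value is read back with the Bigtable client, it arrives as UTF-8 encoded JSON bytes. A minimal sketch, assuming the connector stores each column under its own name as the column qualifier:

```python
import json
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("user-store").table("profiles")

# `preferences` was serialized to JSON, so decode it after reading.
row = table.read_row(b"u001")
cell = row.cells["settings"][b"preferences"][0]  # family/qualifier are assumptions
prefs = json.loads(cell.value.decode("utf-8"))
print(prefs)  # {'theme': 'dark'}
```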