# Writing to Google Cloud Bigtable

!!! warning "Experimental"

    This connector is experimental and the API may change in future releases.

Google Cloud Bigtable is a fully managed, scalable NoSQL database service. Daft can write DataFrames to Bigtable tables using [df.write_bigtable()][daft.dataframe.DataFrame.write_bigtable].

## Installing Dependencies

Bigtable support requires the `google-cloud-bigtable` package:

```bash
pip install google-cloud-bigtable
```

## Basic Usage

```python
import daft

# Create a DataFrame
df = daft.from_pydict({
    "user_id": ["user_001", "user_002", "user_003"],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
    "email": ["[email protected]", "[email protected]", "[email protected]"],
})

# Write to Bigtable
result = df.write_bigtable(
    project_id="my-gcp-project",
    instance_id="my-bigtable-instance",
    table_id="users",
    row_key_column="user_id",
    column_family_mappings={
        "name": "profile",
        "age": "profile",
        "email": "contact",
    },
)
```
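
The returned `result` is itself a DataFrame. Assuming it behaves like Daft's other write connectors, it holds a small summary of the write that you can inspect like any other DataFrame:

```python
# Inspect the write summary returned by write_bigtable
result.show()
```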

## Key Concepts

### Row Keys

Every Bigtable row requires a unique row key. Use `row_key_column` to specify which DataFrame column should be used as the row key:

```python
df.write_bigtable(
    ...,
    row_key_column="user_id",  # This column becomes the row key
)
```
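
Bigtable row keys are stored as byte strings, so if your natural key column is numeric, casting it to a string is a safe way to make it encodable. A minimal sketch (the `order_id` column is illustrative):

```python
import daft
from daft import col

df = daft.from_pydict({"order_id": [1001, 1002], "total": [9.99, 24.50]})

# Cast the integer ID to a string so it can serve as the row key
df = df.with_column("order_key", col("order_id").cast(daft.DataType.string()))
```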

### Column Families

Bigtable organizes columns into column families. Use `column_family_mappings` to specify which family each column belongs to:

```python
df.write_bigtable(
    ...,
    column_family_mappings={
        "name": "user_data",      # 'name' column goes to 'user_data' family
        "age": "user_data",       # 'age' column goes to 'user_data' family
        "email": "contact_info",  # 'email' column goes to 'contact_info' family
    },
)
```

The column families must already exist in the Bigtable table.
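
If you need to create the table and its families up front, the `google-cloud-bigtable` admin API can do it. A minimal sketch using that package's legacy client, assuming the caller has admin rights on the instance:

```python
from google.cloud import bigtable
from google.cloud.bigtable import column_family

# Admin operations require admin=True on the client
client = bigtable.Client(project="my-gcp-project", admin=True)
table = client.instance("my-bigtable-instance").table("users")

if not table.exists():
    # Create the table together with the column families used above,
    # keeping only the latest version of each cell
    table.create(column_families={
        "user_data": column_family.MaxVersionsGCRule(1),
        "contact_info": column_family.MaxVersionsGCRule(1),
    })
```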

## Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `project_id` | `str` | Yes | Google Cloud project ID |
| `instance_id` | `str` | Yes | Bigtable instance ID |
| `table_id` | `str` | Yes | Bigtable table ID |
| `row_key_column` | `str` | Yes | Column name to use as the row key |
| `column_family_mappings` | `dict[str, str]` | Yes | Mapping of column names to column families |
| `client_kwargs` | `dict` | No | Additional arguments for the Bigtable `Client` |
| `write_kwargs` | `dict` | No | Additional arguments for the `MutationsBatcher` |
| `serialize_incompatible_types` | `bool` | No | Auto-convert incompatible types to JSON (default: `True`) |

## Data Type Handling

Bigtable cells only accept data that can be converted to bytes. By default, Daft automatically serializes incompatible types to JSON:

```python
df = daft.from_pydict({
    "id": ["row1"],
    "data": [{"nested": "object"}],  # Complex type
})

# Complex types are automatically serialized to JSON
df.write_bigtable(
    ...,
    serialize_incompatible_types=True,  # Default behavior
)
```

To disable automatic serialization (incompatible types will then raise an error):

```python
df.write_bigtable(
    ...,
    serialize_incompatible_types=False,
)
```
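
With auto-serialization disabled, complex columns must be converted to a byte-compatible type yourself. One option is serializing them to JSON strings with `Expression.apply`; a sketch reusing the `data` column from above:

```python
import json

import daft
from daft import col

# Manually serialize the struct column to JSON strings before writing
df = df.with_column(
    "data",
    col("data").apply(json.dumps, return_dtype=daft.DataType.string()),
)
```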

## Advanced Configuration

### Client Options

Pass additional options to the Bigtable `Client`:

```python
result = df.write_bigtable(
    ...,
    client_kwargs={
        "admin": True,
        "channel": custom_channel,  # a pre-configured gRPC channel
    },
)
```
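
`client_kwargs` is also a natural place for explicit credentials. A sketch assuming a service-account key file (the path is illustrative):

```python
from google.oauth2 import service_account

# Load explicit credentials instead of relying on application-default auth
creds = service_account.Credentials.from_service_account_file("service-account.json")

result = df.write_bigtable(
    ...,
    client_kwargs={"credentials": creds},
)
```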

### Write Options

Configure the `MutationsBatcher` for write operations:

```python
result = df.write_bigtable(
    ...,
    write_kwargs={
        "flush_count": 1000,
        "max_row_bytes": 5 * 1024 * 1024,  # 5MB
    },
)
```
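
In the `google-cloud-bigtable` batcher, `flush_count` bounds how many row mutations are buffered before a flush and `max_row_bytes` bounds the buffered payload size, so both are worth tuning for large writes.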

## Use Cases

### IoT Data Storage

```python
import daft
from daft import col

# Read sensor data
df = daft.read_parquet("s3://bucket/sensors/*.parquet")

# Prepare for Bigtable (create composite row key)
df = df.with_column(
    "row_key",
    col("device_id") + "#" + col("timestamp").cast(daft.DataType.string()),
)

# Write to Bigtable
df.write_bigtable(
    project_id="iot-project",
    instance_id="sensor-data",
    table_id="readings",
    row_key_column="row_key",
    column_family_mappings={
        "temperature": "metrics",
        "humidity": "metrics",
        "device_id": "metadata",
        "timestamp": "metadata",
    },
)
```

### User Profile Storage

```python
import daft

df = daft.from_pydict({
    "user_id": ["u001", "u002"],
    "preferences": [{"theme": "dark"}, {"theme": "light"}],
    "last_login": ["2024-01-15", "2024-01-16"],
})

df.write_bigtable(
    project_id="my-project",
    instance_id="user-store",
    table_id="profiles",
    row_key_column="user_id",
    column_family_mappings={
        "preferences": "settings",
        "last_login": "activity",
    },
)
```

## Notes

- The Bigtable table and its column families must exist before writing.
- Design row keys carefully around your access patterns; see [Bigtable's schema design best practices](https://cloud.google.com/bigtable/docs/schema-design) for guidance.