docs/notebooks/quickstart.ipynb
Daft is a multimodal data processing engine that lets you load data from anywhere, transform it with a powerful DataFrame API and AI functions, and store it in your destination of choice. In this quickstart, you'll see what this looks like in practice with a realistic e-commerce data workflow.
Daft requires Python 3.10 or higher.
You can install Daft using pip. Run the following command in your terminal or notebook:
!pip install -U "daft[openai]" # Includes OpenAI extras needed for this quickstart
Additionally, install these packages for image processing (used later in this quickstart):
!pip install numpy pillow
Let's start by loading an e-commerce dataset from Hugging Face. This dataset contains 10,000+ Amazon products from diverse categories including electronics, toys, home goods, and more. Each product includes details like names, prices, descriptions, technical specifications, and product images.
import daft
df_original = daft.read_huggingface("calmgoose/amazon-product-data-2020")
Daft can load data from many sources including <a href="https://docs.daft.ai/en/stable/connectors/aws/">S3</a>, <a href="https://docs.daft.ai/en/stable/connectors/iceberg/">Iceberg</a>, <a href="https://docs.daft.ai/en/stable/connectors/delta_lake/">Delta Lake</a>, <a href="https://docs.daft.ai/en/stable/connectors/hudi/">Hudi</a>, and <a href="https://docs.daft.ai/en/stable/connectors/">more</a>. We're using Hugging Face here as a demonstration.
Now let's take a look at what we loaded. You can inspect the DataFrame by simply printing it:
df_original
You see the above output because Daft is lazy by default - it displays the schema (column names and types) but doesn't actually load or process your data until you explicitly tell it to. This allows Daft to optimize your entire workflow before executing anything.
To actually view your data, you have two options:
Option 1: Preview with .show() - View the first few rows:
df_original.show(2)
This materializes and displays just the first 2 rows, which is perfect for quickly inspecting your data without loading the entire dataset.
Option 2: Materialize with .collect() - Load the entire dataset:
# df_original.collect()
This would materialize the entire DataFrame (all 10,000+ rows in this case) into memory. Use .collect() when you need to work with the full dataset in memory.
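As an analogy only (plain Python generators, not Daft's API): lazy evaluation means that building the pipeline does no work, and results are only computed when you ask for them.

```python
# A generator is lazy: building it computes nothing yet
squares = (n * n for n in range(10_000))

# Pulling a couple of items is like .show(2): only that much work happens
preview = [next(squares) for _ in range(2)]
print(preview)  # [0, 1]

# Exhausting it is like .collect(): everything is materialized into memory
rest = list(squares)
print(len(rest))  # 9998 remaining values
```

Daft goes further than this analogy: because the whole pipeline is known before execution, it can reorder and fuse operations before running them.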
For quick experimentation, let's create a smaller, simplified version of the dataframe with just the essential columns:
# Select only the columns we need and limit to 5 rows for faster iteration
df = df_original.select("Product Name", "About Product", "Image").limit(5)
Now we have a manageable dataset of 5 products with just the product name, description, and image URLs. This simplified dataset lets us explore Daft's features without the overhead of unnecessary columns.
Let's extract and download product images. The Image column contains pipe-separated URLs. We'll extract the first URL and download it:
# Extract the first image URL from the pipe-separated list
# The pattern captures everything before the first pipe or the entire string if no pipe
df = df.with_column(
    "first_image_url",
    daft.functions.regexp_extract(
        df["Image"],
        r"^([^|]+)",  # Extract everything before the first pipe
        1,  # Get the first capture group
    ),
)
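To see what the `^([^|]+)` pattern does, here is the same match in plain Python `re` on a hypothetical pipe-separated string (the example URLs are made up):

```python
import re

# Hypothetical value mimicking the pipe-separated "Image" column
raw = "https://example.com/a.jpg|https://example.com/b.jpg"

# Same pattern as above: capture everything before the first pipe
match = re.match(r"^([^|]+)", raw)
print(match.group(1))  # https://example.com/a.jpg

# With no pipe present, the whole string is captured
print(re.match(r"^([^|]+)", "https://example.com/solo.jpg").group(1))
```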
# Download the image data
df = df.with_column("image_data", daft.functions.download(df["first_image_url"], on_error="null"))
# Decode images for visual display (in Jupyter notebooks, this shows actual images!)
df = df.with_column("image", daft.functions.decode_image(df["image_data"], on_error="null"))
# Check what we have - in Jupyter notebooks, the 'image' column shows actual images!
df.select("Product Name", "first_image_url", "image_data", "image").show(3)
In Jupyter notebooks, the image column will display actual thumbnail images instead of <Image> text.
This demonstrates Daft's multimodal capabilities:
- regexp_extract() to parse structured text with Rust-powered regex
- daft.functions.download() to fetch the image bytes from each URL
- decode_image() for visual display

The decoded images are now ready for further processing.
Let's use AI to analyze product materials at scale. Daft automatically parallelizes AI operations across your local machine's cores, making it efficient to process multiple images concurrently.
Let's suppose you want to create a new column that shows whether each product is made of wood. This might be useful, for example, for a filtering feature on your website.
If you're running this in Google Colab or Jupyter, run the following cell to set your OpenAI API key. In Colab, first add your key to Secrets (🔑 icon in the left sidebar) with the name OPENAI_API_KEY. In Jupyter, you'll be prompted to enter your key and the input will be hidden.
import os
try:
    from google.colab import userdata

    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
except ImportError:
    from getpass import getpass

    if "OPENAI_API_KEY" not in os.environ:
        os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
from pydantic import BaseModel, Field
from daft.functions import prompt
# Define a simple structured output model
class WoodAnalysis(BaseModel):
    is_wooden: bool = Field(description="Whether the product appears to be made of wood")
# Run AI inference on each image - Daft automatically batches and parallelizes this
df = df.with_column(
    "wood_analysis",
    prompt(
        ["Is this product made of wood? Look at the material.", df["image"]],
        return_format=WoodAnalysis,
        model="gpt-4o-mini",  # Using mini for cost-efficiency
        provider="openai",
        # api_key="your-key-here",  # Use OPENAI_API_KEY env var, or uncomment to set manually
    ),
)
# Extract the boolean value from the structured output
# The result is a struct, so we extract the 'is_wooden' field
df = df.with_column("is_wooden", df["wood_analysis"]["is_wooden"])
# Materialize the dataframe to compute all transformations
df = df.collect()
# View results
df.select("Product Name", "image", "is_wooden").show()
The AI analyzes each product image to determine if it's made of wood. Notice that the longboard is identified as wooden (true), while the electronic circuits, design studio, puzzle, and 3D printing filament are identified as not wooden (false).
<div style="background-color: #448aff22; border-left: 4px solid #448aff; padding: 12px; margin: 16px 0;"> <strong style="color: #448aff;">Improving Accuracy</strong> Looking at the actual product data, the longboard is made of bamboo and fiberglass, not wood. However, this is exactly what a human might categorize from the image alone! To improve accuracy, you could feed additional context to the AI like the product name, category, and description alongside the image. This example demonstrates how to get started with image-based analysis.
</div>

Now, suppose you're satisfied with the results from your small subset and want to scale up. Instead of analyzing just 5 products, let's run the same analysis on 100 products to get more meaningful insights:
from pydantic import BaseModel, Field
from daft.functions import prompt
# Define a simple structured output model (same as before)
class WoodAnalysis(BaseModel):
    is_wooden: bool = Field(description="Whether the product appears to be made of wood")
# Start fresh with the first 100 products
df_large = df_original.select("Product Name", "About Product", "Image").limit(100)
# Apply the same image processing pipeline
# 1. Extract first image URL
df_large = df_large.with_column("first_image_url", daft.functions.regexp_extract(df_large["Image"], r"^([^|]+)", 1))
# 2. Download images
df_large = df_large.with_column("image_data", daft.functions.download(df_large["first_image_url"], on_error="null"))
# 3. Decode images
df_large = df_large.with_column("image", daft.functions.decode_image(df_large["image_data"], on_error="null"))
# 4. Run AI analysis on all 100 products
df_large = df_large.with_column(
    "wood_analysis",
    prompt(
        ["Is this product made of wood? Look at the material.", df_large["image"]],
        return_format=WoodAnalysis,
        model="gpt-4o-mini",  # Using mini for cost-efficiency
        provider="openai",
        # api_key="your-key-here",  # Use OPENAI_API_KEY env var, or uncomment to set manually
    ),
)
# 5. Extract the boolean value
df_large = df_large.with_column("is_wooden", df_large["wood_analysis"]["is_wooden"])
# Materialize the dataframe to compute all transformations
df_large = df_large.collect()
# Count wooden products
wooden_count = df_large.where(df_large["is_wooden"]).count_rows()
total_count = df_large.count_rows()
print(f"Out of {total_count} products analyzed:")
print(f" - {wooden_count} are made of wood")
print(f" - {total_count - wooden_count} are not made of wood")
print(f" - Percentage of wooden products: {(wooden_count / total_count * 100):.1f}%")
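The tally above relies on a handy Python detail: booleans sum as 0/1. The same arithmetic on a hypothetical list of results standing in for the "is_wooden" column:

```python
# Hypothetical per-product answers (made-up values, not the real analysis output)
is_wooden = [True, False, False, True, False]

wooden_count = sum(is_wooden)  # True counts as 1, False as 0
total_count = len(is_wooden)
print(f"{wooden_count}/{total_count} wooden ({wooden_count / total_count * 100:.1f}%)")
# 2/5 wooden (40.0%)
```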
AI models are non-deterministic, so you may see slightly different numbers when running this analysis.
After processing your data, you'll often want to save it for later use. Let's store our analyzed dataset as Parquet files:
# Write the analyzed data to local Parquet files
df_large.write_parquet("product_analysis", write_mode="overwrite")
This writes your data to the product_analysis/ directory. Daft automatically handles file naming using UUIDs to prevent conflicts. The write_mode="overwrite" parameter ensures that any existing data in the directory is replaced.
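The collision-avoidance idea behind UUID file names can be sketched in plain Python (the naming scheme below is an illustrative assumption, not Daft's internal format):

```python
import uuid

# Hypothetical part-file namer in the spirit of UUID-based naming
def part_filename() -> str:
    return f"{uuid.uuid4()}-0.parquet"

# Independent writers generating names never need to coordinate:
# random 128-bit UUIDs make collisions vanishingly unlikely
names = {part_filename() for _ in range(1_000)}
print(len(names))  # 1000 distinct file names
```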
Just like reading, Daft can write data to many destinations including <a href="https://docs.daft.ai/en/stable/connectors/aws/">S3</a>, <a href="https://docs.daft.ai/en/stable/connectors/iceberg/">Iceberg</a>, <a href="https://docs.daft.ai/en/stable/connectors/delta_lake/">Delta Lake</a>, and <a href="https://docs.daft.ai/en/stable/connectors/">more</a>.
Let's verify the stored data by loading it back from those Parquet files:
# Read the data back from Parquet files
df_loaded = daft.read_parquet("product_analysis/*.parquet")
# Verify the data loaded correctly
df_loaded.show(5)
Now that you have a basic sense of Daft's functionality and features, here are some more resources to help you get the most out of Daft:
Work with your favorite table and catalog formats:
Explore our Examples to see Daft in action: