# DEVELOPMENT.md

How to set up your local machine.
uv is faster than pip and provides reproducible builds via its lockfile.

```bash
uv sync                        # Creates .venv and installs all dependencies
uv run data_formulator         # Run app (opens browser automatically)
uv run data_formulator --dev   # Run backend only (for frontend development)
```
Which command to use:

- `uv run data_formulator`: starts the server and opens the browser at http://localhost:5567
- `uv run data_formulator --dev`: starts the backend server only; run `yarn start` separately for the Vite dev server at http://localhost:5173

## Create a Virtual Environment
```bash
python -m venv venv
source venv/bin/activate        # Unix
# or: .\venv\Scripts\activate   # Windows
```
## Install Dependencies

```bash
pip install -r requirements.txt
```
## Configure environment variables (optional)

Copy `.env.template` to `.env` and fill in your values:

- `{PROVIDER}_ENABLED=true`, `{PROVIDER}_API_KEY=...`, and `{PROVIDER}_MODELS=...` for each LLM provider you want to use. See the LiteLLM setup guide for provider-specific fields.
- Other settings such as `DISABLE_DISPLAY_KEYS`, `SANDBOX`, etc.

## Run the app
```bash
# Unix
./local_server.sh

# Windows
.\local_server.bat

# Or directly
data_formulator          # Opens browser automatically
data_formulator --dev    # Backend only (for frontend development)
```
## Install NPM packages

```bash
yarn
```
## Development mode

First, start the backend server (in a separate terminal):

```bash
uv run data_formulator --dev   # or ./local_server.sh
```
Then, run the frontend in development mode with hot reloading:

```bash
yarn start
```
Open http://localhost:5173 to view it in the browser. The page will reload if you make edits. You will also see any lint errors in the console.
## Build the frontend and then the backend

Compile the TypeScript files and bundle the project:

```bash
yarn build
```

This builds the app for production to the `py-src/data_formulator/dist` folder.
Then, build the Python package:

```bash
# With uv
uv build

# Or with pip
pip install build
python -m build
```

This creates a Python wheel in the `dist/` folder named `data_formulator-<version>-py3-none-any.whl`.
## Test the artifact

You can then install the resulting wheel (testing in a fresh virtual environment is recommended):

```bash
# Replace <version> with the actual build version.
pip install dist/data_formulator-<version>-py3-none-any.whl
```
Once installed, you can run Data Formulator with:

```bash
data_formulator
# or
python -m data_formulator
```
Open http://localhost:5567 to view it in the browser.
AI-generated Python code runs inside a sandbox to isolate it from the main server process. Two backends are available:
| Backend | Flag | How it works | Overhead |
|---|---|---|---|
| `local` (default) | `--sandbox local` | Persistent warm subprocess with pre-imported pandas/numpy/duckdb. Audit hooks block file writes and dangerous operations (`subprocess`, `shutil`, etc.). | ~1 ms |
| `docker` | `--sandbox docker` | Each execution runs in a disposable `docker run --rm` container. The workspace is mounted read-only; output is returned via a bind-mounted parquet file. Memory/CPU/PID limits are enforced. | ~700 ms |
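To make the docker backend's behavior concrete, here is a hedged sketch of what assembling such an invocation could look like. This is an illustrative reconstruction from the table above, not the project's actual command: the limit values, the `/workspace` mount path, the `--network none` flag, and the `build_docker_command` helper are all assumptions.

```python
def build_docker_command(workspace_dir: str,
                         image: str = "data-formulator-sandbox") -> list[str]:
    """Assemble a disposable `docker run` invocation: --rm container,
    read-only workspace mount, and memory/CPU/PID limits, as the table
    above describes. All concrete values here are illustrative."""
    return [
        "docker", "run", "--rm",                 # disposable container
        "--network", "none",                     # no network access (assumption)
        "--memory", "512m",                      # memory limit (assumption)
        "--cpus", "1",                           # CPU limit (assumption)
        "--pids-limit", "64",                    # PID limit (assumption)
        "-v", f"{workspace_dir}:/workspace:ro",  # workspace mounted read-only
        image,
    ]

cmd = build_docker_command("/tmp/ws")
print(" ".join(cmd))
```

Spinning up a fresh `--rm` container per execution is what accounts for the ~700 ms overhead relative to the warm local subprocess.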
```bash
# Use the default local sandbox
python -m data_formulator

# Use the Docker sandbox (requires a running Docker daemon)
python -m data_formulator --sandbox docker
```
The Docker sandbox image is built from `py-src/data_formulator/sandbox/Dockerfile.sandbox`:

```bash
docker build -t data-formulator-sandbox -f py-src/data_formulator/sandbox/Dockerfile.sandbox .
```

Source: `py-src/data_formulator/sandbox/`
By default, workspace data (uploaded files, parquet tables, metadata) is stored on the local filesystem under `~/.data_formulator/workspaces/`. For cloud deployments you can switch to Azure Blob Storage so that all workspace data lives in a blob container instead.
Install extra dependencies:

```bash
pip install azure-storage-blob
# or with uv:
uv pip install azure-storage-blob
```
Create a storage account and container (one-time setup):

```bash
az storage account create -n <account> -g <resource-group> -l eastus --sku Standard_LRS
az storage container create -n data-formulator --account-name <account>
```
Get the connection string:

```bash
az storage account show-connection-string -n <account> -g <resource-group> -o tsv
```
Add to `.env`:

```
WORKSPACE_BACKEND=azure_blob
AZURE_BLOB_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=...
# AZURE_BLOB_CONTAINER=data-formulator   # default, change if needed
```
Run normally:

```bash
uv run data_formulator --dev
```
Or pass the settings as CLI flags:

```bash
data_formulator --workspace-backend azure_blob \
  --azure-blob-connection-string "DefaultEndpointsProtocol=https;AccountName=..."
```
In production (Azure App Service, AKS, etc.) you can authenticate the app to blob storage via Managed Identity instead of a connection string. This eliminates secrets entirely.
Install extra dependencies:

```bash
pip install azure-storage-blob azure-identity
```
Assign a role to the app's Managed Identity:

```bash
# Get the App Service's principal ID
PRINCIPAL_ID=$(az webapp identity show -n <app-name> -g <rg> --query principalId -o tsv)

# Grant it "Storage Blob Data Contributor" on the storage account
az role assignment create \
  --assignee "$PRINCIPAL_ID" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
```
Set environment variables (no secrets needed):

```
WORKSPACE_BACKEND=azure_blob
AZURE_BLOB_ACCOUNT_URL=https://<account>.blob.core.windows.net
# AZURE_BLOB_CONTAINER=data-formulator
```

The app uses `DefaultAzureCredential`, which automatically picks up the Managed Identity.
For local development with the same Entra ID path, log in with the Azure CLI:

```bash
az login

# Grant your user the same "Storage Blob Data Contributor" role
az role assignment create \
  --assignee "<[email protected]>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
```
Then set:

```
WORKSPACE_BACKEND=azure_blob
AZURE_BLOB_ACCOUNT_URL=https://<account>.blob.core.windows.net
```

`DefaultAzureCredential` will use your `az login` session.
| Method | Env var | When to use |
|---|---|---|
| Connection string | `AZURE_BLOB_CONNECTION_STRING` | Local dev, quick tests |
| Entra ID (Managed Identity) | `AZURE_BLOB_ACCOUNT_URL` | Azure App Service, AKS (no secrets) |
| Entra ID (az login) | `AZURE_BLOB_ACCOUNT_URL` | Local dev without secrets |
| Entra ID (service principal) | `AZURE_BLOB_ACCOUNT_URL` + `AZURE_CLIENT_ID` / `AZURE_TENANT_ID` / `AZURE_CLIENT_SECRET` | CI/CD pipelines |
If both `AZURE_BLOB_CONNECTION_STRING` and `AZURE_BLOB_ACCOUNT_URL` are set, the connection string takes precedence.
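That precedence rule can be sketched as a small resolver. `resolve_blob_auth` is a hypothetical helper for illustration only, not the app's actual API:

```python
def resolve_blob_auth(env: dict) -> tuple[str, str]:
    """Choose the Azure Blob auth method from environment variables,
    mirroring the precedence above: a connection string, if present,
    wins over an account URL."""
    conn = env.get("AZURE_BLOB_CONNECTION_STRING")
    url = env.get("AZURE_BLOB_ACCOUNT_URL")
    if conn:
        return ("connection_string", conn)
    if url:
        # With only an account URL, DefaultAzureCredential covers
        # Managed Identity, `az login`, and service-principal env vars.
        return ("entra_id", url)
    raise ValueError("azure_blob backend requires a connection string or account URL")

# Both set: the connection string wins.
method, _ = resolve_blob_auth({
    "AZURE_BLOB_CONNECTION_STRING": "DefaultEndpointsProtocol=https;AccountName=x",
    "AZURE_BLOB_ACCOUNT_URL": "https://x.blob.core.windows.net",
})
assert method == "connection_string"
```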
All workspace data is stored under `<datalake_root>/<sanitized_identity_id>/` inside the container:

```
data-formulator/                ← container
  workspaces/                   ← datalake_root (default)
    browser_550e8400.../        ← anonymous user workspace
      workspace.yaml
      sales_data.parquet
    user_alice_example_com/     ← authenticated user workspace
      workspace.yaml
      quarterly_report.parquet
```
| Flag | Env var | Default | Description |
|---|---|---|---|
| `--workspace-backend` | `WORKSPACE_BACKEND` | `local` | `local` or `azure_blob` |
| `--azure-blob-connection-string` | `AZURE_BLOB_CONNECTION_STRING` | — | Shared-key connection string |
| `--azure-blob-account-url` | `AZURE_BLOB_ACCOUNT_URL` | — | Account URL for Entra ID auth |
| `--azure-blob-container` | `AZURE_BLOB_CONTAINER` | `data-formulator` | Blob container name |
⚠️ IMPORTANT SECURITY WARNING FOR PRODUCTION DEPLOYMENT
When deploying Data Formulator to production, please be aware of the following security considerations:
Workspace and table data: Table data is stored in per-identity workspaces (e.g. parquet files). DuckDB is used only in-memory per request when needed (e.g. for SQL mode); no persistent DuckDB database files are created by the app.
Identity Management:

- Each request is resolved to an identity that keys all storage (e.g. `user:[email protected]` or `browser:550e8400-...`).

Data persistence: User data may be written to workspace storage (e.g. parquet files) on the server. In multi-tenant deployments, ensure workspace directories are isolated and access-controlled.
For production deployment, consider:

- Using the `--disable-database` flag to disable table-connector routes when you do not need external or uploaded table support.

```bash
# For stateless deployment (recommended for public hosting)
python -m data_formulator.app --disable-database
```
Data Formulator uses a hybrid identity system that supports both anonymous and authenticated users.
```
┌─────────────────────────────────────────────────────────────────────┐
│                          Frontend Request                           │
├─────────────────────────────────────────────────────────────────────┤
│ Headers:                                                            │
│   X-Identity-Id: "browser:550e8400-..." (namespace sent by client)  │
│   Authorization: Bearer <jwt>           (if custom auth implemented)│
│   (Azure also adds X-MS-CLIENT-PRINCIPAL-ID automatically)          │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Backend Identity Resolution                     │
│                     (auth.py: get_identity_id)                      │
├─────────────────────────────────────────────────────────────────────┤
│ Priority 1: Azure X-MS-CLIENT-PRINCIPAL-ID → "user:<azure_id>"      │
│ Priority 2: JWT Bearer token (if implemented) → "user:<jwt_sub>"    │
│ Priority 3: X-Identity-Id header → ALWAYS "browser:<id>"            │
│              (client-provided namespace is IGNORED for security)    │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          Storage Isolation                          │
├─────────────────────────────────────────────────────────────────────┤
│ "user:[email protected]"  →  alice's workspace (ONLY via auth)      │
│ "browser:550e8400-..."  →  anonymous user's workspace               │
└─────────────────────────────────────────────────────────────────────┘
```
Critical Security Rule: The backend NEVER trusts the namespace prefix from the client-provided `X-Identity-Id` header. Even if a client sends `X-Identity-Id: "user:alice@..."`, the backend strips the prefix and forces `browser:alice@...`. Only verified authentication (Azure headers or JWT) can result in a `user:`-prefixed identity.
The key security principle is namespaced isolation with forced prefixing:
| Scenario | X-Identity-Id Sent | Backend Resolution | Storage Key |
|---|---|---|---|
| Anonymous user | `browser:550e8400-...` | Strips prefix, forces `browser:` | `browser:550e8400-...` |
| Azure logged-in user | `browser:550e8400-...` | Uses Azure header (priority 1) | `user:alice@...` |
| Attacker spoofing | `user:alice@...` (forged) | No valid auth; strips & forces `browser:` | `browser:alice@...` |
Why this is secure: An attacker sending `X-Identity-Id: user:alice@...` gets `browser:alice@...` as their storage key, which is completely separate from the real `user:alice@...` that only authenticated Alice can access.
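The resolution order can be sketched as a simplified model of `get_identity_id`. This is not the actual implementation in auth.py, just an illustration of the priority chain and the forced-prefix rule; the JWT branch is omitted:

```python
def resolve_identity(headers: dict) -> str:
    """Simplified model of the priority order described above."""
    # Priority 1: platform-verified Azure EasyAuth header
    azure_id = headers.get("X-MS-CLIENT-PRINCIPAL-ID")
    if azure_id:
        return f"user:{azure_id}"
    # Priority 2: verified JWT bearer token (omitted in this sketch)
    # Priority 3: client-supplied ID; its namespace is NEVER trusted
    raw = headers.get("X-Identity-Id", "")
    _, _, bare = raw.rpartition(":")   # strip any client-sent prefix
    return f"browser:{bare}"

# A spoofed "user:" prefix is forced back into the browser namespace:
print(resolve_identity({"X-Identity-Id": "user:alice@example"}))  # browser:alice@example
```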
To add JWT-based authentication:

- Backend (`tables_routes.py`): uncomment and configure the JWT verification code in `get_identity_id()`.
- Frontend (`utils.tsx`): implement `getAuthToken()` to retrieve the JWT from your auth context.
- Set the signing secret via `current_app.config['JWT_SECRET']`.

When deployed to Azure with EasyAuth enabled:

- Azure automatically adds the `X-MS-CLIENT-PRINCIPAL-ID` header to authenticated requests.

The frontend (`src/app/identity.ts`) manages identity as follows:
```typescript
// Identity is always initialized with a browser ID
identity: { type: 'browser', id: getBrowserId() }

// If the user logs in (e.g., via Azure), it is updated to:
identity: { type: 'user', id: userInfo.userId }

// All API requests send the namespaced identity:
// X-Identity-Id: "browser:550e8400-..." or "user:alice@..."
```
This ensures that every request carries a namespaced identity the backend can resolve safely.

See the Usage section in README.md.