DEVELOP.md

docs/DEVELOP.md

This document explains how to configure the environment, run the LLM benchmark tool, and work with the benchmark suite.


Table of Contents

  1. Prerequisites
  2. Quick Checks & Fixes
  3. Environment Variables
  4. Benchmark Suite
  5. Context Construction
  6. Troubleshooting

Prerequisites

  • Run from repo root — cargo llm and related commands must be run from the workspace root (this repo).
  • TypeScript benchmarks — Run pnpm build in crates/bindings-typescript first. Rust and C# use local crates that are built as part of the workspace.
  • Windows (nvm4w) — If pnpm is not found when running TypeScript benchmarks, set NODEJS_DIR to your Node.js bin directory (e.g. C:\nvm\v20.10.0).
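For example, the one-time TypeScript setup and the Windows workaround look like this (the Node.js path is illustrative):

```shell
# Build the TypeScript bindings once before running TypeScript benchmarks:
cd crates/bindings-typescript
pnpm build
cd ../..

# Windows (nvm4w) only, if pnpm is not found -- point NODEJS_DIR at your
# Node.js bin directory (example path; use your installed version):
#   cmd:        set NODEJS_DIR=C:\nvm\v20.10.0
#   PowerShell: $env:NODEJS_DIR="C:\nvm\v20.10.0"
```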

Quick Checks & Fixes

Use this single command to quickly unblock CI by regenerating hashes and running only GPT-5 for the minimal Rust + C# passes. This is not the full benchmark suite.

Note: You will need OpenAI API keys to run this locally. Alternatively, any SpacetimeDB member can comment /update-llm-benchmark on a PR to start a CI job to do this.

```bash
cargo llm ci-quickfix
```

What this does:

  1. Runs the Rust rustdoc_json pass for GPT-5 only.
  2. Runs the C# docs pass for GPT-5 only.
  3. Writes updated results & summary.

Model IDs passed to --models must match configured routes (see model_routes.rs), e.g. "openai:gpt-5".

Spacetime CLI

Publishing is performed via the spacetime CLI (spacetime publish -c -y --server <name> <db>). Ensure:

  • spacetime is on PATH
  • The target server is reachable/running
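A minimal pre-flight check, assuming a local server and a hypothetical database name quickstart:

```shell
# Verify the CLI is installed and on PATH:
command -v spacetime || echo "spacetime is not on PATH"

# Publish, clearing any existing data (-c) and skipping confirmation (-y):
spacetime publish -c -y --server local quickstart
```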

Environment Variables

These are the defaults and/or recommended dev values.

| Name | Purpose | Values / Example | Required |
|---|---|---|---|
| `SPACETIME_SERVER` | Target SpacetimeDB environment | `local` | |
| `LLM_DEBUG` | Print short debug info while generating | `true` / `false` (default `true` in dev) | |
| `LLM_DEBUG_VERBOSE` | Extra‑verbose logs (payloads, scoring detail) | `false` | |
| `LLM_BENCH_CONCURRENCY` | Parallel task concurrency across the whole bench run | `20` | |
| `LLM_BENCH_ROUTE_CONCURRENCY` | Per‑route concurrency (throttle per vendor/model) | `4` | |
| `OPENAI_API_KEY` | OpenAI credential | `sk-...` | optional* |
| `OPENAI_BASE_URL` | OpenAI-compatible base URL override | `https://api.openai.com/` | optional |
| `ANTHROPIC_API_KEY` | Anthropic credential | `...` | optional* |
| `ANTHROPIC_BASE_URL` | Anthropic base URL override | `https://api.anthropic.com` | optional |
| `GOOGLE_API_KEY` | Gemini credential | `...` | optional* |
| `GOOGLE_BASE_URL` | Gemini base URL override | `https://generativelanguage.googleapis.com` | optional |
| `XAI_API_KEY` | xAI Grok credential | `...` | optional |
| `DEEPSEEK_API_KEY` | DeepSeek credential | `...` | optional |
| `META_API_KEY` | Meta Llama credential | `...` | optional* |

*Required only if you plan to run that provider locally.

Canonical dev block (copy/paste into your shell profile):

```bash
OPENAI_API_KEY=
OPENAI_BASE_URL=https://api.openai.com/

ANTHROPIC_API_KEY=
ANTHROPIC_BASE_URL=https://api.anthropic.com

GOOGLE_API_KEY=
GOOGLE_BASE_URL=https://generativelanguage.googleapis.com

XAI_API_KEY=
XAI_BASE_URL=https://api.x.ai

DEEPSEEK_API_KEY=
DEEPSEEK_BASE_URL=https://api.deepseek.com

META_API_KEY=
META_BASE_URL=https://openrouter.ai/api/v1

SPACETIME_SERVER="local"
LLM_DEBUG=true
LLM_DEBUG_VERBOSE=false
LLM_BENCH_CONCURRENCY=20
LLM_BENCH_ROUTE_CONCURRENCY=4
```

Windows PowerShell:

```powershell
$env:SPACETIME_SERVER="local"
$env:LLM_DEBUG="true"
$env:LLM_DEBUG_VERBOSE="false"
$env:LLM_BENCH_CONCURRENCY="20"
$env:LLM_BENCH_ROUTE_CONCURRENCY="4"
```

LLM Providers — Keys & Base URLs

Notes

  • These match the providers wired in this repo (`OpenAiClient`, `AnthropicClient`, `GoogleGeminiClient`, `XaiGrokClient`, `DeepSeekClient`, `MetaLlamaClient`).

| Provider | API Key Env | Base URL Env (optional) | Default Base URL |
|---|---|---|---|
| OpenAI | `OPENAI_API_KEY` | `OPENAI_BASE_URL` | `https://api.openai.com` |
| Anthropic | `ANTHROPIC_API_KEY` | `ANTHROPIC_BASE_URL` | `https://api.anthropic.com` |
| Google Gemini | `GOOGLE_API_KEY` | `GOOGLE_BASE_URL` | `https://generativelanguage.googleapis.com` |
| xAI Grok | `XAI_API_KEY` | `XAI_BASE_URL` | `https://api.x.ai` |
| DeepSeek | `DEEPSEEK_API_KEY` | `DEEPSEEK_BASE_URL` | `https://api.deepseek.com` |
| Meta | `META_API_KEY` | `META_BASE_URL` | `https://openrouter.ai/api/v1` |

Benchmark Suite

Results directory: docs/llms

Result Files

There are two sets of result files, each serving a different purpose:

| Files | Purpose | Updated By |
|---|---|---|
| `docs-benchmark-details.json`, `docs-benchmark-summary.json` | Test documentation quality with a single reference model (GPT-5) | `cargo llm ci-quickfix` |
| `llm-comparison-details.json`, `llm-comparison-summary.json` | Compare all LLMs against the same documentation | `cargo llm run` |

  • docs-benchmark: Used by CI to ensure documentation quality. Contains only GPT-5 results.
  • llm-comparison: Used for manual benchmark runs to compare LLM performance. Contains results from all configured models.

Result writes are lock-safe and atomic: the tool takes an exclusive lock, writes via a temp file, then renames it, so concurrent runs won't corrupt results.
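The same lock-then-rename pattern can be sketched in shell (illustrative only; the tool implements this in Rust, and the payload below is a placeholder):

```shell
# Write a results file atomically under an exclusive lock.
tmp=$(mktemp results.json.XXXXXX)        # temp file alongside the target
(
  flock -x 9                             # take an exclusive advisory lock on fd 9
  printf '%s\n' '{"runs":[]}' > "$tmp"   # write the complete payload to the temp file
  mv "$tmp" results.json                 # atomic rename: readers see old or new, never partial
) 9> results.lock
```

Because `mv` on the same filesystem is a single rename syscall, a concurrent reader can never observe a half-written `results.json`.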

Open llm_benchmark_stats_viewer.html in a browser to inspect merged results locally.

Current Benchmarks

basics

  • 000. empty-reducers — tests whether it can create basic reducers with various arguments
  • 001. basic-tables — can it create tables with basic columns
  • 002. scheduled-table — can it create a scheduled table and reducer
  • 003. struct-in-table — can it put a struct in a table
  • 004. insert — can it insert a row
  • 005. update — can it update a row
  • 006. delete — can it delete a row
  • 007. crud — can it insert, update, and delete a row in the same reducer
  • 008. index-lookup — can it look up something from an index
  • 009. init — can it write the init reducer
  • 010. connect — can it write the client_connected/client_disconnected reducers
  • 011. helper-function — can it create a non-reducer helper function

schema

  • 012. spacetime-product-type — can it define a new spacetime product type
  • 013. spacetime-sum-type — can it define a new sum type
  • 014. elementary-columns — can it create columns with basic types
  • 015. product-type-columns — can it create columns with product types
  • 016. sum-type-columns — can it create columns with sum types
  • 017. scheduled — can it create scheduled columns
  • 018. constraints — can it add primary keys, unique constraints, and indexes
  • 019. many-to-many — can it create a many-to-many relationship
  • 020. ecs — can it create a basic ecs
  • 021. multi-column-index — can it create a multi-column index

Benchmarks live under benchmarks/ with structure like:

```
benchmarks/
  category/
    t_001_foo/
      tasks/
        rust.txt
        csharp.txt
      answers/
        rust.rs
        csharp.cs
      spec.rs          # scoring config, reducer/schema checks, etc.
```

Creating a new benchmark

  1. Copy an existing benchmark
  • Duplicate any existing benchmark folder.
  • Bump the numeric prefix to a new, unused ID: t_123_my_task.
  2. Rename for the new task
  • Rename the folder to your ID + short slug: t_123_my_task.
  3. Write the task prompt
  • Create/update tasks/rust.txt and/or tasks/csharp.txt.
  • Be explicit (tables, reducers, helpers, constraints). Avoid ambiguity.
  4. Add golden answers
  • Implement the canonical solution in answers/rust.rs and/or answers/csharp.cs.
  5. Define scoring
  • Edit spec.rs to add scorers (e.g., schema/table/field checks, reducer/func exists).
  6. Quick validation
  • Build goldens only:
    cargo llm run --goldens-only --tasks t_123_my_task
  7. Categorize
  • Ensure the folder sits under the right category path.
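Steps 1 and 2 above can be sketched as follows (both folder names are illustrative; pick any real benchmark as the source):

```shell
# Duplicate an existing benchmark under a new, unused ID:
cp -r benchmarks/basics/t_000_empty-reducers benchmarks/basics/t_123_my_task

# Then edit tasks/, answers/, and spec.rs inside the new folder.
```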

Typical Commands

```bash
# Run everything with current env (providers/models from your .env)
cargo llm run

# Only Rust (or C#)
cargo llm run --lang rust
cargo llm run --lang csharp

# Only certain categories (use your actual category names)
cargo llm run --categories basics,schema

# Only certain tasks by number (globally numbered)
cargo llm run --tasks 0,7,12

# Limit providers/models explicitly
cargo llm run \
  --providers openai,anthropic \
  --models "openai:gpt-5 anthropic:claude-sonnet-4-5"

# Dry runs
cargo llm run --hash-only         # build context only (no provider calls)
cargo llm run --goldens-only      # build/check goldens only

# Be aggressive (skip some safety checks)
cargo llm run --force

# CI sanity check per language
cargo llm ci-check --lang rust
cargo llm ci-check --lang csharp

# Generate PR comment markdown (compares against master baseline)
cargo llm ci-comment
# With custom baseline ref
cargo llm ci-comment --baseline-ref origin/main
```

Outputs:

  • Logs to stdout/stderr (respecting LLM_DEBUG/LLM_DEBUG_VERBOSE).
  • JSON results in a per‑run folder (timestamped), merged into aggregate reports.

Context Construction

The benchmark tool constructs a context (documentation) that is sent to the LLM along with each task prompt. The context varies by language and mode.

Modes

| Mode | Language | Source | Description |
|---|---|---|---|
| `rustdoc_json` | Rust | `crates/bindings` | Generates rustdoc JSON and extracts documentation from the `spacetimedb` crate |
| `docs` | C# | `docs/docs/**/*.md` | Concatenates all markdown files from the documentation |
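To inspect roughly what the Rust mode sees, you can generate rustdoc JSON yourself. The exact invocation below is an assumption, not what the tool necessarily runs (rustdoc JSON output is nightly-only and its flags can change between toolchains):

```shell
# Assumed invocation -- rustdoc JSON requires a nightly toolchain:
cargo +nightly rustdoc -p spacetimedb --lib -- -Z unstable-options --output-format json

# The JSON typically lands under target/doc/ (e.g. target/doc/spacetimedb.json)
```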

Tab Filtering

When building context for a specific language, the tool filters <Tabs> components to only include content relevant to the target language. This reduces noise and helps the LLM focus on the correct syntax.

Filtered tab groupIds:

| groupId | Purpose | Tab Values |
|---|---|---|
| `server-language` | Server module code examples | `rust`, `csharp`, `typescript` |
| `client-language` | Client SDK code examples | `rust`, `csharp`, `typescript`, `cpp`, `blueprint` |

Filtering behavior:

  • For C# tests: Only value="csharp" tabs are kept
  • For Rust tests: Only value="rust" tabs are kept
  • If no matching tab exists (e.g., client-language with only cpp/blueprint), the entire tabs block is removed

Example transformation:

Before (in markdown):

```html
<Tabs groupId="server-language" queryString>
<TabItem value="csharp" label="C#">
C# code here
</TabItem>
<TabItem value="rust" label="Rust">
Rust code here
</TabItem>
</Tabs>
```

After (for C# context):

C# code here
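For a quick local check, the C# filtering step can be approximated with sed. This is a simplified sketch, not the tool's actual implementation (which also removes the whole block when no tab matches):

```shell
# Recreate the example doc from above:
cat > doc.md <<'EOF'
<Tabs groupId="server-language" queryString>
<TabItem value="csharp" label="C#">
C# code here
</TabItem>
<TabItem value="rust" label="Rust">
Rust code here
</TabItem>
</Tabs>
EOF

# Keep only the csharp TabItem body, dropping the wrapper tags:
sed -n '/<TabItem value="csharp"/,/<\/TabItem>/p' doc.md | sed '1d;$d'
# prints: C# code here
```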

Documentation Best Practices

When writing documentation that will be used by the benchmark:

  1. Use consistent tab groupIds: Always use server-language for server module code and client-language for client SDK code
  2. Include all supported languages: Ensure each <Tabs> block has tabs for all languages you want to test
  3. Use consistent naming conventions: The benchmark compares LLM output against golden answers, so documentation should reflect the expected conventions (e.g., PascalCase table names for C#)

Troubleshooting

HTTP 400/404 from providers

  • Check the model ID spelling and whether it’s available for your account/region.
  • Verify the correct base URL for non-default gateways.

Timeouts / Rate-limits

  • Lower LLM_BENCH_CONCURRENCY or LLM_BENCH_ROUTE_CONCURRENCY.
  • Some providers aggressively throttle bursts; use backoff/retry when supported.
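For a one-off run at reduced parallelism, both variables can be overridden inline without touching your profile (values are examples):

```shell
LLM_BENCH_CONCURRENCY=4 LLM_BENCH_ROUTE_CONCURRENCY=1 cargo llm run
```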