beps/docs/proposals/BEP-005-prompt-optimization/01_design.md
# The `baml-cli optimize` command

Copy the great things about DSPy. The people demand prompt optimization!
BAML already has the key components needed for automated optimization:
- `prompt #"..."#` text
- `@description("...")` annotations
- `@alias(...)` annotations

All optimized together as a cohesive system. Changes to schema descriptions/aliases can improve parsing without requiring test updates.
The optimization system follows GEPA's evolutionary approach:
Key differences from DSPy's GEPA:
- Success criteria come from BAML tests (`@@assert` and `@@check`) instead of custom metrics

```bash
# Basic usage - optimize all functions with tests
baml-cli optimize

# Optimize specific function(s)
baml-cli optimize --function ExtractReceipt --function ClassifyEmail

# Optimize with test filtering
baml-cli optimize --test "ExtractReceipt::*"

# Control optimization budget
baml-cli optimize --max-evals 50   # Total function evaluations
baml-cli optimize --trials 20      # Optimization iterations

# Auto-sized optimization budgets
baml-cli optimize --auto light    # Quick exploration (6 candidates)
baml-cli optimize --auto medium   # Balanced (12 candidates)
baml-cli optimize --auto heavy    # Thorough (18 candidates)

# Multi-objective optimization
baml-cli optimize --weight accuracy=0.8,tokens=0.2
baml-cli optimize --weight accuracy=0.7,latency=0.2,prompt_tokens=0.1
baml-cli optimize --weight accuracy=0.9,completion_tokens=0.1

# Resume previous optimization run
baml-cli optimize --resume .baml_optimize/run_20250106_143022

# Reset GEPA reflection prompts to defaults
baml-cli optimize --reset-gepa-prompts

# Control parallelism
baml-cli optimize --parallel 8

# Output and logging
baml-cli optimize --output-dir .baml_optimize/custom_run
baml-cli optimize --verbose
```
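The `--weight` flag combines objectives into a single score. A minimal sketch of how that combination could work (the `parse_weights` and `weighted_score` helpers are hypothetical, not the CLI's internals), assuming accuracy-style metrics are maximized while token and latency metrics are minimized:

```python
# Hypothetical sketch: turn a --weight spec into a single scalar score.
# Cost-style metrics are negated so that a larger score is always better.

MINIMIZED = {"tokens", "latency", "prompt_tokens", "completion_tokens"}

def parse_weights(spec: str) -> dict[str, float]:
    """Parse a spec like 'accuracy=0.8,tokens=0.2' into a dict."""
    pairs = (item.split("=") for item in spec.split(","))
    return {name: float(value) for name, value in pairs}

def weighted_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Combine normalized metrics into one scalar for ranking candidates."""
    score = 0.0
    for name, weight in weights.items():
        value = metrics[name]
        score += weight * (-value if name in MINIMIZED else value)
    return score

weights = parse_weights("accuracy=0.8,tokens=0.2")
# Metrics here are assumed to be pre-normalized to comparable scales.
print(weighted_score({"accuracy": 0.9, "tokens": 0.1}, weights))  # roughly 0.70
```

Note that the cost metrics only make sense to weight after normalization, since raw token counts would otherwise dwarf an accuracy in [0, 1].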
The optimization system reuses existing BAML syntax with no new language features required:
```baml
// Existing BAML function - the prompt will be optimized
function ExtractReceipt(image: image) -> Receipt {
  client GPT4o
  prompt #"
    Extract structured receipt information from this image.
    Return the merchant name, date, items, and total.
  "#
}

// Existing BAML tests - these define the optimization objective
test ReceiptTest1 {
  functions [ExtractReceipt]
  args {
    image { file "test_receipts/starbucks.jpg" }
  }
  // Assertions are the success criteria
  @@assert({{ this.merchant == "Starbucks" }})
  @@assert({{ this.total > 0 }})
  @@check(correct_items, {{ this.items|length == 2 }})
}

test ReceiptTest2 {
  functions [ExtractReceipt]
  args {
    image { file "test_receipts/target.jpg" }
  }
  @@assert({{ this.merchant == "Target" }})
  @@assert({{ this.total == 45.67 }})
}

// Example: Custom checks for multi-objective optimization (Phase 3)
test ReceiptWithGroundedness {
  functions [ExtractReceipt]
  args {
    image { file "test_receipts/complex.jpg" }
  }
  @@assert({{ this.merchant != "" }})
  // Custom checks can be weighted in optimization
  @@check(groundedness, {{ this.confidence > 0.8 }})
  @@check(safety, {{ this.contains_no_pii }})
}
```
A key design principle: GEPA's reflection logic is implemented in BAML itself. This makes the optimization process transparent and customizable, and it dogfoods BAML to optimize BAML.
GEPA reflection functions live in `.baml_optimize/gepa/baml_src/`:

```
.baml_optimize/
└── gepa/
    └── baml_src/
        ├── gepa.baml      # Reflection functions
        ├── clients.baml   # Client configs
        └── .gepa_version  # Tracks baml-cli version
```
First run behavior:
```
$ baml-cli optimize
Creating .baml_optimize/gepa/baml_src/ with defaults from baml-cli 0.73.0...
Using reflection model: gpt-4o (default)
```
Customization workflow:
```bash
# User modifies reflection logic
$ vim .baml_optimize/gepa/baml_src/gepa.baml

# Or changes the reflection model
$ vim .baml_optimize/gepa/baml_src/clients.baml

# Next run uses custom GEPA implementation
$ baml-cli optimize
```
Version tracking:
The `.gepa_version` file contains:

```json
{
  "baml_cli_version": "0.73.0",
  "created_at": "2025-01-06T14:30:22Z",
  "gepa_baml_hash": "a3f5c9d..."
}
```
Modifications are detected by comparing the `gepa.baml` file hash to the embedded default. On version mismatch:

```
$ baml-cli --version
baml-cli 0.74.0

$ baml-cli optimize
Warning: Your GEPA implementation is from baml-cli 0.73.0
Run 'baml-cli optimize --reset-gepa-prompts' to upgrade
```
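This detection can be sketched with a content hash. The `gepa_status` helper below is hypothetical (the real CLI would embed its default `gepa.baml` at build time), but it shows the decision logic implied by `.gepa_version`:

```python
import hashlib

def gepa_status(current_text: str, record: dict, cli_version: str) -> str:
    """Classify the user's gepa.baml against the recorded .gepa_version state."""
    current_hash = hashlib.sha256(current_text.encode()).hexdigest()
    if current_hash != record["gepa_baml_hash"]:
        return "modified"  # user customized the reflection logic: keep it
    if record["baml_cli_version"] != cli_version:
        return "stale"     # unmodified defaults from an older CLI: suggest a reset
    return "default"

default_text = 'function ProposeImprovements(...) -> ImprovedFunction { ... }'
record = {
    "baml_cli_version": "0.73.0",
    "gepa_baml_hash": hashlib.sha256(default_text.encode()).hexdigest(),
}
print(gepa_status(default_text, record, "0.74.0"))  # stale
```

The ordering matters: a user's edits are never flagged as merely "stale", so an upgrade warning is only shown when resetting would lose nothing.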
The default `gepa.baml` embedded in `baml-cli` includes:

Data Models:

```baml
class SchemaFieldDefinition {
  field_name string
  field_type string
  description string?
  aliases string[]
  is_optional bool
}

class ClassDefinition {
  class_name string
  description string?
  fields SchemaFieldDefinition[]
}

class EnumDefinition {
  enum_name string
  values string[]
  value_descriptions map<string, string>
}

class OptimizableFunction {
  function_name string
  prompt_text string
  classes ClassDefinition[]  // All reachable classes
  enums EnumDefinition[]     // All reachable enums
}

class ReflectiveExample {
  inputs map<string, string>
  generated_outputs map<string, string>
  feedback string
  failure_location string?  // "prompt" | "parsing" | "schema"
}

class ImprovedFunction {
  prompt_text string
  classes ClassDefinition[]  // Only modified classes
  enums EnumDefinition[]     // Only modified enums
  rationale string
}
```
Core Reflection Function:
````baml
function ProposeImprovements(
  current_function: OptimizableFunction,
  failed_examples: ReflectiveExample[],
  successful_examples: ReflectiveExample[]?
) -> ImprovedFunction {
  client ReflectionModel
  prompt #"
    You are optimizing a BAML function. Improve both the prompt and schema.

    ## Current Implementation

    Prompt:
    ```
    {{ current_function.prompt_text }}
    ```

    Schema:
    {% for class in current_function.classes %}
    class {{ class.class_name }} {
      {% for field in class.fields %}
      /// @description("{{ field.description or 'none' }}")
      {{ field.field_name }} {{ field.field_type }}{% if field.aliases %} @alias({{ field.aliases | join(", ") }}){% endif %}
      {% endfor %}
    }
    {% endfor %}

    ## Failures
    {% for ex in failed_examples %}
    Inputs: {{ ex.inputs }}
    Generated: {{ ex.generated_outputs }}
    Issue: {{ ex.feedback }}
    {% endfor %}

    ## Your Task
    Analyze failures and propose improvements to:
    1. Prompt text - clarity, instructions, examples
    2. Field descriptions - guide LLM parsing
    3. Field aliases - catch output variations

    Consider: Do the prompt and schema work well together?
    Return improvements as ImprovedFunction JSON.
  "#
}
````
```baml
function MergeVariants(
  variant_a: OptimizableFunction,
  variant_b: OptimizableFunction,
  variant_a_strengths: string[],
  variant_b_strengths: string[]
) -> ImprovedFunction {
  client ReflectionModel
  prompt #"
    Merge two successful BAML function variants.
    Combine their strengths into a single improved version.
    [Details omitted for brevity]
  "#
}
```
Default `clients.baml`:

```baml
client<llm> ReflectionModel {
  provider openai
  options {
    model "gpt-4o"
    temperature 1.0
    max_tokens 8000
  }
}
```
**Example 1: Adding aliases based on failures**

Before:

```baml
class Receipt {
  /// @description("Merchant name")
  merchant string
}
```

After reflection on failures where the LLM outputs `"store_name"`:

```baml
class Receipt {
  /// @description("Merchant name exactly as shown on receipt")
  merchant string @alias("store_name", "shop_name", "vendor")
}
```

**Example 2: Improving descriptions**

Before:

```baml
class Receipt {
  /// @description("Total")
  total float
}
```

After reflection on failures with parsing errors:

```baml
class Receipt {
  /// @description("Total amount in decimal format (e.g., 12.99, not '$12.99')")
  total float @alias("amount", "total_amount")
}
```
**Example 3: Coordinated prompt and schema improvements**

GEPA optimizes prompt and schema together, ensuring they work cohesively:

```baml
// Before
function ExtractReceipt(image: image) -> Receipt {
  prompt #"Extract receipt information"#
}

class Receipt {
  merchant string
  total float
}

// After GEPA optimization
function ExtractReceipt(image: image) -> Receipt {
  prompt #"
    Extract structured receipt data:
    - merchant: exact name as printed
    - total: decimal amount (e.g., 45.67)
  "#
}

class Receipt {
  /// @description("Merchant name preserving exact capitalization")
  merchant string @alias("store_name", "vendor")
  /// @description("Total in decimal format, no currency symbols")
  total float @alias("amount", "sum")
}
```
The optimization objective is computed from BAML test results:
**Primary metric: test pass rate**

- A test passes only when all of its `@@assert` statements pass
- A failed `@@assert` stops evaluation of the remaining assertions

**Secondary metrics (optional weights)**

Like DSPy's GEPA, BAML supports multi-objective optimization with the following metrics:

- `tokens`: minimize total tokens (prompt + completion). Useful for reducing API costs.
- `latency`: minimize inference latency (milliseconds). Useful for real-time applications.
- `prompt_tokens`: minimize prompt tokens specifically. Useful when optimizing prompt length.
- `completion_tokens`: minimize completion tokens. Useful for controlling output verbosity.
- `@@check`: user-defined checks can be weighted (Phase 3 feature), for example:
  - `groundedness`: for RAG applications, measure citation quality
  - `safety`: domain-specific safety constraints
  - `compliance`: regulatory or policy compliance checks

The optimizer stores artifacts in `baml_src/../.baml_optimize/run_<timestamp>/`:
```
.baml_optimize/
└── run_20250106_143022/
    ├── config.json             # Optimization parameters
    ├── candidates/
    │   ├── 00_initial.baml     # Initial prompts
    │   ├── 01_candidate.baml   # Generated variations
    │   ├── 02_candidate.baml
    │   └── ...
    ├── evaluations/
    │   ├── 00_initial.json     # Test results per candidate
    │   ├── 01_candidate.json
    │   └── ...
    ├── reflections/
    │   ├── iteration_01.json   # Failure analysis
    │   └── ...
    ├── state.json              # Resumable optimization state
    ├── pareto_frontier.json    # Current best candidates
    └── final_results.json      # Summary statistics
```
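The `run_<timestamp>` naming and subdirectory layout could be created along these lines (a hypothetical sketch, not the optimizer's actual code):

```python
import datetime
import pathlib
import tempfile

def create_run_dir(root: pathlib.Path, now: datetime.datetime) -> pathlib.Path:
    """Create the run_<timestamp> layout with its three artifact subdirectories."""
    run = root / f"run_{now:%Y%m%d_%H%M%S}"
    for sub in ("candidates", "evaluations", "reflections"):
        (run / sub).mkdir(parents=True, exist_ok=True)
    return run

# Demo against a throwaway directory; the real root would be baml_src/../.baml_optimize
root = pathlib.Path(tempfile.mkdtemp()) / ".baml_optimize"
run = create_run_dir(root, datetime.datetime(2025, 1, 6, 14, 30, 22))
print(run.name)  # run_20250106_143022
```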
State Format (JSON, not pickle):
All optimization state is stored in human-readable JSON format for language-agnostic resumability:
```json
{
  "version": "1.0",
  "baml_cli_version": "0.73.0",
  "iteration": 15,
  "total_evals": 450,
  "budget_remaining": 550,
  "rng_seed": 42,
  "pareto_frontier_indices": [3, 7, 12, 15],
  "candidate_lineage": {
    "0": {"parents": null, "method": "initial"},
    "1": {"parents": [0], "method": "reflection"},
    "2": {"parents": [0, 1], "method": "merge"}
  },
  "normalization_stats": {
    "tokens": {"mean": 1500, "std": 300, "min": 800, "max": 2500},
    "latency": {"mean": 1200, "std": 200, "min": 800, "max": 1800}
  }
}
```
This replaces pickle files, keeping optimization state human-readable, diffable, and resumable from any language implementation.
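The `normalization_stats` entry suggests that secondary metrics are normalized before weighting so that objectives on different scales (token counts vs. milliseconds) become comparable. A z-score sketch under that assumption:

```python
def normalize(value: float, stats: dict[str, float]) -> float:
    """Z-score a raw metric using the running stats stored in state.json."""
    std = stats["std"] or 1.0  # guard against zero variance
    return (value - stats["mean"]) / std

stats = {"tokens": {"mean": 1500, "std": 300, "min": 800, "max": 2500}}
# A candidate that used 1200 tokens sits one standard deviation below the mean.
print(normalize(1200, stats["tokens"]))  # -1.0
```

After normalization, a weight like `tokens=0.2` penalizes "standard deviations above average", not raw token counts.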
Each candidate file contains only the optimized functions:
```baml
// Generated candidate #5
// Iteration: 3
// Parent candidates: [2, 4]
// Score: 0.85 (accuracy=0.90, tokens=-0.05)
function ExtractReceipt(image: image) -> Receipt {
  client GPT4o
  prompt #"
    Carefully analyze the receipt image and extract:
    1. Merchant name (exactly as shown)
    2. Purchase date (in ISO format)
    3. Line items with prices
    4. Total amount
    Pay special attention to currency formatting.
  "#
}
```
The reflection phase analyzes test failures to guide prompt evolution:
**Collect failure data:** run each test against the current candidate and record any failing assertions.

**Generate reflective dataset:**
```json
{
  "function": "ExtractReceipt",
  "examples": [
    {
      "inputs": {"image": "test_receipts/starbucks.jpg"},
      "outputs": {"merchant": "STARBUCKS", "total": 8.50},
      "feedback": "Assertion failed: this.merchant == 'Starbucks'. The merchant name should match the expected casing exactly."
    },
    {
      "inputs": {"image": "test_receipts/target.jpg"},
      "outputs": {"merchant": "Target", "total": 45.0},
      "feedback": "Assertion failed: this.total == 45.67. The total is incorrect, possibly due to missing cents."
    }
  ]
}
```
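Producing those entries can be sketched as pairing each failed assertion with the candidate's outputs (hypothetical shapes; the real feedback strings would come from the test runner):

```python
def build_reflective_examples(runs: list[dict]) -> list[dict]:
    """Turn failed test runs into ReflectiveExample-shaped records."""
    examples = []
    for run in runs:
        for assertion, ok in run["assertions"].items():
            if not ok:  # only failures feed the reflection step
                examples.append({
                    "inputs": run["inputs"],
                    "generated_outputs": run["outputs"],
                    "feedback": f"Assertion failed: {assertion}",
                })
    return examples

runs = [{
    "inputs": {"image": "test_receipts/starbucks.jpg"},
    "outputs": {"merchant": "STARBUCKS", "total": 8.50},
    "assertions": {"this.merchant == 'Starbucks'": False, "this.total > 0": True},
}]
print(build_reflective_examples(runs)[0]["feedback"])
```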
**Propose new prompt:**
```
You are optimizing a BAML prompt. Here is the current prompt:

<current_prompt>
{current_prompt_text}
</current_prompt>

Here are examples where the prompt failed:

<failures>
{reflective_dataset}
</failures>

Based on these failures, propose an improved version of the prompt that:
1. Addresses the specific issues shown in the failures
2. Maintains the overall structure and intent
3. Is clear and concise

New prompt:
```
**Merge successful variants (optional):**
When optimizing multiple objectives, the optimizer maintains a Pareto frontier: the set of candidates that no other candidate beats on every objective.
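Frontier maintenance amounts to keeping only non-dominated candidates. A minimal sketch, assuming every objective has been flipped so that larger is better (e.g. negated token counts):

```python
def dominates(a: dict[str, float], b: dict[str, float]) -> bool:
    """a dominates b if it is at least as good on every objective and
    strictly better on at least one (larger is better here)."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def pareto_frontier(candidates: list[dict[str, float]]) -> list[dict[str, float]]:
    """Keep only candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

candidates = [
    {"accuracy": 0.90, "neg_tokens": -1500},
    {"accuracy": 0.85, "neg_tokens": -1000},  # less accurate but cheaper: kept
    {"accuracy": 0.80, "neg_tokens": -1600},  # worse on both axes: dropped
]
print(len(pareto_frontier(candidates)))  # 2
```

Keeping the whole frontier rather than a single best candidate lets the user pick their own accuracy/cost trade-off at the end of the run.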
The optimizer respects BAML's existing client configuration:
```baml
function ExtractReceipt(image: image) -> Receipt {
  client GPT4o
  prompt #"..."#
}

// The optimizer will use GPT4o for all evaluations.
// To optimize for a different model, create a variant:
function ExtractReceipt(image: image) -> Receipt {
  client GPT4oMini  // Changed client
  prompt #"..."#
}
```
Users can optimize separately for different models by using function variants or by modifying the client between optimization runs.
If a function uses `@@dynamic` types, tests can override type definitions:

```baml
class Receipt {
  merchant string
  total float
  @@dynamic
}

test ReceiptWithCustomFields {
  functions [ExtractReceipt]
  args {
    image { file "test_receipts/custom.jpg" }
  }
  type_builder {
    dynamic Receipt {
      merchant string
      total float
      loyalty_number string  // Additional field for this test
    }
  }
  @@assert({{ this.loyalty_number != "" }})
}
```
The optimizer will respect these test-specific type extensions when evaluating candidates.
This proposal should be as orthogonal to the expression language as possible, so we can ship it ASAP.
When we ship expression language, we get two benefits:
More interesting checks and asserts - e.g. checks and asserts could load data from external sources and make their own LLM calls, so that we could test RAG groundedness directly.
Cool Alert 😎. We can optimize expression functions. Write a test block for an expression function, and `baml-cli optimize` will build a Pareto frontier for the full workflow by optimizing every LLM call the workflow makes and evaluating the workflow's own test checks and asserts. I think DSPy does something like this too.
Fully backward compatible.
- **Reflection Model Selection**: Should we default to a specific reflection model, or require users to specify one? Proposal: default to `gpt-4o`, with opt-in to others via `--reflection-model`.
- **Automatic Prompt Updates**: Should the optimizer automatically update BAML source files, or just recommend changes?
- **Validation**: How do we ensure optimized prompts don't break type safety or introduce security issues?