fern/01-guide/09-comparisons/openai-sdk.mdx
The OpenAI SDK now supports structured outputs natively, making it easy to get typed responses from GPT models.
Let's explore how this works in practice and where you might hit limitations.
OpenAI's structured outputs look fantastic at first:
```python
from pydantic import BaseModel
from openai import OpenAI


class Resume(BaseModel):
    name: str
    skills: list[str]


client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "John Doe, Python, Rust"}
    ],
    response_format=Resume,
)

resume = completion.choices[0].message.parsed
```
Simple and type-safe! Let's add education to make it more realistic:
```diff
+class Education(BaseModel):
+    school: str
+    degree: str
+    year: int

 class Resume(BaseModel):
     name: str
     skills: list[str]
+    education: list[Education]
```

```python
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": """John Doe
Python, Rust
University of California, Berkeley, B.S. in Computer Science, 2020"""}
    ],
    response_format=Resume,
)
```
Still works! But let's dig deeper...
Your extraction works 90% of the time, but fails on certain resumes. You need to debug:
```python
# What prompt is actually being sent?
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": resume_text}],
    response_format=Resume,
)

# You can't see:
# - How the schema is formatted
# - What instructions the model receives
# - Why certain fields are misunderstood
```
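You can at least inspect the JSON Schema that Pydantic generates for your model, which is roughly what the SDK passes along as the response format; the extra formatting and instructions the SDK layers on top of it stay hidden. A minimal sketch, using the `Resume` model from above:

```python
import json

# Roughly what gets sent as the schema: the Pydantic-generated
# JSON Schema for the Resume model. The SDK's exact framing of it
# is not exposed.
print(json.dumps(Resume.model_json_schema(), indent=2))
```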
You start experimenting with system messages:
```python
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract resume information accurately."},
        {"role": "user", "content": resume_text}
    ],
    response_format=Resume,
)

# But what if you need more specific instructions?
# How do you tell it to handle edge cases?
```
Now you need to classify resumes by seniority:
```python
from enum import Enum


class SeniorityLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"


class Resume(BaseModel):
    name: str
    skills: list[str]
    education: list[Education]
    seniority: SeniorityLevel
```
But the model doesn't know what these levels mean! You try adding a docstring:
```python
class Resume(BaseModel):
    """Resume with seniority classification.

    Seniority levels:
    - junior: 0-2 years experience
    - mid: 2-5 years experience
    - senior: 5-10 years experience
    - staff: 10+ years experience
    """
    name: str
    skills: list[str]
    education: list[Education]
    seniority: SeniorityLevel
```
But you can't tell how much of that docstring actually reaches the model, if any; the schema translation happens inside the SDK. So you resort to prompt engineering:
```python
messages = [
    {"role": "system", "content": """Extract resume information.
Classify seniority as:
- junior: 0-2 years experience
- mid: 2-5 years experience
- senior: 5-10 years experience
- staff: 10+ years experience"""},
    {"role": "user", "content": resume_text}
]
```
Now your business logic is split between types and prompts...
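Putting it together makes the split concrete: field names and types live in the Pydantic model, while the meaning of each seniority level lives in a prose string. A sketch reusing the `messages` list above:

```python
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=messages,        # seniority definitions live here, as prose
    response_format=Resume,   # field names and types live here, as Pydantic
)
resume = completion.choices[0].message.parsed
```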
Your team wants to experiment with Claude for better reasoning:
```python
# With the OpenAI SDK, you're stuck with OpenAI
from openai import OpenAI

client = OpenAI()

# Want to try Claude? Start over with a different SDK
from anthropic import Anthropic

anthropic_client = Anthropic()

# Completely different API
message = anthropic_client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,  # required by the Anthropic API
    messages=[{"role": "user", "content": resume_text}],
    # No structured outputs support!
)

# Now you need custom parsing
import json

resume_data = json.loads(message.content[0].text)
resume = Resume(**resume_data)  # Hope it matches!
```
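"Hope it matches" quickly turns into real code: the model might wrap the JSON in prose or rename a field, so you end up owning the validation layer yourself. A minimal sketch of the guard you'd write (the handling strategy here is illustrative, not part of either SDK):

```python
import json

from pydantic import ValidationError


def parse_resume(raw_text: str) -> Resume | None:
    """Coerce raw model output into a Resume; None means 'try again upstream'."""
    try:
        return Resume(**json.loads(raw_text))
    except (json.JSONDecodeError, ValidationError):
        # The model wrapped the JSON in prose, renamed a field, etc.
        # Re-prompting, repair heuristics, and logging are all on you.
        return None
```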
You want to test your extraction and track costs:
```python
# How do you test without burning tokens?
def test_resume_extraction():
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": test_resume}],
        response_format=Resume,
    )
    # This costs money every time!


# Mock the OpenAI client?
from unittest.mock import Mock

mock_client = Mock()
mock_client.beta.chat.completions.parse.return_value = ...
# You're not really testing the extraction logic

# Track token usage?
completion = client.beta.chat.completions.parse(...)
print(completion.usage.total_tokens)  # At least this exists!

# But how many tokens does the schema formatting use?
# Could you optimize it?
```
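The closest you can get is a rough lower bound: tokenize the JSON Schema yourself and accept that whatever formatting the SDK applies on top of it is invisible. A sketch, assuming a recent version of `tiktoken` is installed:

```python
import json

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
schema_json = json.dumps(Resume.model_json_schema())

# A lower-bound estimate only: the real overhead depends on how OpenAI
# injects the schema, which the SDK doesn't expose.
print(f"~{len(enc.encode(schema_json))} tokens of schema overhead")
```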
As your app scales, you need:

- Retries with backoff when you hit rate limits
- Fallbacks to a cheaper model when the primary fails
- Usage logging to keep track of costs

Your code evolves:
```python
import time

from openai import OpenAI, RateLimitError


class ResumeExtractor:
    def __init__(self):
        self.client = OpenAI()
        self.fallback_client = OpenAI()  # Different API key?

    def extract_with_retries(self, text: str, max_retries: int = 3):
        for attempt in range(max_retries):
            try:
                return self._extract(text, model="gpt-4o")
            except RateLimitError:
                if attempt == max_retries - 1:
                    # Try a cheaper fallback model
                    return self._extract(text, model="gpt-3.5-turbo")
                time.sleep(2 ** attempt)

    def _extract(self, text: str, model: str):
        messages = self._build_messages(text)
        completion = self.client.beta.chat.completions.parse(
            model=model,
            messages=messages,
            response_format=Resume,
        )
        self._log_usage(completion, model)
        return completion.choices[0].message.parsed

    # ... more infrastructure code
```
The simple API is now buried in error handling and logging.
BAML was built for real-world LLM applications. Here's the same resume extraction:
```baml
class Education {
  school string
  degree string
  year int
}

enum SeniorityLevel {
  JUNIOR @description("0-2 years of experience")
  MID @description("2-5 years of experience")
  SENIOR @description("5-10 years of experience")
  STAFF @description("10+ years of experience, technical leadership")
}

class Resume {
  name string
  skills string[]
  education Education[]
  seniority SeniorityLevel
}

function ExtractResume(resume_text: string) -> Resume {
  client GPT4
  prompt #"
    Extract structured information from this resume.

    Resume:
    ---
    {{ resume_text }}
    ---

    {{ ctx.output_format }}
  "#
}
```
See the difference? The seniority definitions live on the enum itself via `@description`, the prompt is right there in the function, and `{{ ctx.output_format }}` marks exactly where the schema instructions go. Switching models is just as direct:
```baml
// Define all your models
client<llm> GPT4 {
  provider openai
  options {
    model "gpt-4o"
    temperature 0.1
  }
}

client<llm> GPT35 {
  provider openai
  options {
    model "gpt-3.5-turbo"
    temperature 0.1
  }
}

client<llm> Claude {
  provider anthropic
  options {
    model "claude-3-opus-20240229"
  }
}

client<llm> Llama {
  provider ollama
  options {
    model "llama3"
  }
}

// Use ANY model with the SAME function
function ExtractResume(resume_text: string) -> Resume {
  client GPT4  // Just change this line!
  prompt #"..."#
}
```
In Python:
```python
from baml_client import baml as b

# Default model
resume = await b.ExtractResume(resume_text)

# Use different models for different scenarios
cheap_extraction = await b.ExtractResume(simple_text, {"client": "GPT35"})
quality_extraction = await b.ExtractResume(complex_text, {"client": "Claude"})
private_extraction = await b.ExtractResume(sensitive_text, {"client": "Llama"})

# Same interface, same types, different models!
```
With BAML's VSCode extension, you can preview the fully rendered prompt, including the schema instructions produced by `{{ ctx.output_format }}`, without making a single API call, and run your functions against saved test inputs right in the playground as you edit. No mocking, no wasted tokens, real testing.
```baml
// Retry configuration
retry_policy Exponential {
  max_retries 3
  strategy {
    type exponential_backoff
  }
}

client<llm> GPT4WithRetries {
  provider openai
  retry_policy Exponential
  options {
    model "gpt-4o"
    temperature 0.1
  }
}

// Fallback chains
client<llm> SmartRouter {
  provider fallback
  options {
    strategy [GPT4, Claude, GPT35]
  }
}
```
All the production concerns handled declaratively.
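Point `ExtractResume` at `SmartRouter` (or `GPT4WithRetries`) in the BAML file and the application code stays exactly as it was; a sketch reusing the client import from above:

```python
from baml_client import baml as b

# Retries and fallbacks happen inside the client definition;
# there's no retry loop or fallback branch in the calling code.
resume = await b.ExtractResume(resume_text)
```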
OpenAI's structured outputs are great if you:

- Only ever call OpenAI models
- Have simple schemas that need no extra explanation
- Are comfortable not seeing the prompt that's actually sent

But production LLM applications need more:

- Visibility into the exact prompt and schema instructions
- Field and enum descriptions that live with the types instead of in prompt strings
- The freedom to switch or mix providers without rewriting
- Testing that doesn't rely on mocks
- Retries and fallbacks that aren't hand-rolled

BAML's advantages over the OpenAI SDK:

- Prompts, types, and descriptions defined together in one file
- The same function runs on OpenAI, Anthropic, Ollama, and other providers
- A playground in your editor for inspecting prompts and running tests
- Declarative retry policies and fallback chains

Why this matters: your business logic stops being split between Pydantic models, system messages, and retry wrappers, and the infrastructure code that buried your simple API goes away.
With BAML, you get all the benefits of OpenAI's structured outputs plus the flexibility and control needed for production applications.
BAML has some limitations: it's a new language to learn, and it adds a code-generation step to your build.
If you're building a simple OpenAI-only prototype, the OpenAI SDK is fine. If you're building production LLM features that need to scale, try BAML.