ee/hogai/eval/README.md
We use AI evaluations (evals) to test our AI outputs against a curated set of inputs. Evals allow us to verify prompt performance, spot regressions, or compare different model versions.
We currently use Braintrust as our evaluation platform. Braintrust tracks evaluation results, including LLM traces, which helps us both monitor performance over time and dig into issues on a case-by-case basis. To access Braintrust and/or get an API key for it, ask #team-posthog-ai.
Export the `BRAINTRUST_API_KEY` environment variable (ask #team-posthog-ai for a key), then run all evals with:

```bash
pytest ee/hogai/eval/ci
```
The key bit is specifying the `ee/hogai/eval/ci` directory – that activates our eval-specific config, `ee/hogai/eval/pytest.ini`!

As always with pytest, you can also run a specific file:

```bash
pytest ee/hogai/eval/ci/eval_root.py
```

Pass the `--eval sql` argument to only run evals for test cases that contain `sql`.
Voila! Max ran, evals executed, and results and traces uploaded to the Braintrust platform + summarized in the terminal.
For historical eval runs, see the full Experiments list in Braintrust.
For offline evaluation, you typically need to collect a dataset first. You can do that in PostHog LLM Analytics. There are a few requirements for the shape of a dataset item:

- The `input`, `output`, and `metadata` fields must be valid JSON objects.
- `metadata` must contain the `team_id` field.

Remember to continuously review traces and curate your datasets – it's the key to quality.
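To illustrate the expected shape, here is a minimal sketch of a dataset item and a check against the requirements above. The field values and the `is_valid_item` helper are purely illustrative, not part of the codebase:

```python
# Hypothetical dataset item satisfying the shape requirements:
# input/output/metadata are JSON objects, and metadata carries team_id.
dataset_item = {
    "input": {"query": "Which events fired most often last week?"},
    "output": {"sql": "SELECT event, count() FROM events GROUP BY event"},
    "metadata": {"team_id": 1},
}


def is_valid_item(item: dict) -> bool:
    """Check that input/output/metadata are JSON objects and metadata has team_id."""
    # JSON objects deserialize to dicts; the `and` short-circuits if metadata is missing.
    return all(isinstance(item.get(key), dict) for key in ("input", "output", "metadata")) and (
        "team_id" in item["metadata"]
    )
```
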
Additionally, you need an evaluation module in `ee/hogai/eval/offline/*` that contains an evaluation test case with defined scorers. A test suite may contain multiple test cases, and each is reported separately. For example, to evaluate SQL generation, we could implement the following evaluation module:
```python
import pytest
from braintrust import EvalCase, Score
from pydantic import BaseModel

from posthog.schema import HumanMessage
from posthog.models import Team

from ee.hogai.eval.base import MaxPrivateEval
from ee.hogai.eval.offline.conftest import EvaluationContext, capture_score, get_eval_context
from ee.hogai.eval.schema import DatasetInput
from ee.hogai.eval.scorers.sql import SQLSemanticsCorrectness, SQLSyntaxCorrectness
from ee.hogai.chat_agent import AssistantGraph
from ee.hogai.utils.types import AssistantState
from ee.models import Conversation


class EvalOutput(BaseModel):
    ...


async def call_graph(entry: DatasetInput, *args):
    eval_ctx = get_eval_context()  # Get the local evaluation context
    team = await Team.objects.aget(id=entry.team_id)
    conversation = await Conversation.objects.acreate(team=team, user=eval_ctx.user)
    graph = AssistantGraph(team, eval_ctx.user).compile_full_graph()
    state = await graph.ainvoke(
        AssistantState(messages=[HumanMessage(content=entry.input["query"])]),
        {
            "callbacks": eval_ctx.get_callback_handlers(entry.trace_id),
            "configurable": {
                "thread_id": conversation.id,
                "team": team,
                "user": eval_ctx.user,
                "distinct_id": eval_ctx.distinct_id,
            },
        },
    )
    return EvalOutput(...)


@capture_score  # Decorator to automatically capture the score result
async def sql_semantics_scorer(input: DatasetInput, expected: str, output: EvalOutput, **kwargs) -> Score:
    # Make sure you pass the traced OpenAI client to an LLM-based scorer, so the scorer traces are captured.
    client = get_eval_context().get_openai_client_for_tracing(input.trace_id)
    metric = SQLSemanticsCorrectness(client=client)
    return await metric.eval_async(...)


@capture_score
async def sql_syntax_scorer(input: DatasetInput, expected: str, output: EvalOutput, **kwargs) -> Score:
    # An algorithmic scorer doesn't need the traced OpenAI client.
    metric = SQLSyntaxCorrectness()
    return await metric.eval_async(...)


# Generate eval cases from dataset items
def generate_test_cases(eval_ctx: EvaluationContext):
    for entry in eval_ctx.dataset_inputs:
        yield EvalCase(input=entry, expected=entry.expected["output"])


@pytest.mark.django_db
async def eval_offline_sql(eval_ctx: EvaluationContext, pytestconfig):
    await MaxPrivateEval(
        experiment_name=eval_ctx.formatted_experiment_name,
        task=call_graph,
        scores=[sql_syntax_scorer, sql_semantics_scorer],
        data=generate_test_cases(eval_ctx),
        pytestconfig=pytestconfig,
    )
```
Log in to Dagster Cloud and launch a new `run_evaluation` job with the following config:

```yaml
ops:
  prepare_dataset:
    config:
      dataset_id: '01992de8-3773-7946-afad-e028d45eba01' # Dataset ID
  spawn_evaluation_container:
    config:
      evaluation_module: ee/hogai/eval/offline/eval_sql.py # Evaluation module
      image_name: posthog-ai-evals # Leave as is or provide another image
      image_tag: master # Use master or the commit hash of the branch you want to evaluate
```
The job will pull the provided dataset, validate dataset items, export team data, run the evaluation, and report results back to you.
If you want to run an evaluation for a branch other than `master`, you will first need to build an image using the `build-ai-evals-image` tag. Once CI is complete, you are ready to run the evaluation.
Evaluation results are automatically reported to the #evals-max-ai channel in Slack. You can also access the same data in the Dagster asset catalog. The report contains links to the captured traces for the evaluation run.