.agents/skills/runtime-behavior-probe/SKILL.md
Use this skill to investigate real runtime behavior, not to restate code or documentation. Start by planning the investigation, then execute a case matrix, record observed behavior, and report both the findings and the method used to obtain them.
When planning the case matrix, account for:

- `OPENAI_API_KEY` and other expected default names for the system under test.
- `tool_choice` when the question depends on tool invocation.
- `container_auto` and `container_reference` as separate cases, not interchangeable setup details.

Choose a run mode:

- `single-shot` for deterministic one-run checks.
- `repeat-N` for cache, retry, streaming, interruption, rate-limit, concurrency, or other run-to-run-sensitive behavior.
- `warm-up + repeat-N` when first-run cold-start effects could distort the result.
Use these defaults unless the task clearly needs something else:

- For `repeat-N`, default to `repeat-3`.
- For `warm-up + repeat-N`, default to `warm-up + repeat-10`.
- When unsure, start with `repeat-3`, then expand only if the answer remains unclear.
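Where a repeat-mode harness helps, a minimal sketch of `warm-up + repeat-N` might look like the following; the `run_case` callable and its return shape are assumptions for illustration, not part of this skill:

```python
import statistics
import time

def probe(run_case, n=10, warmup=1):
    """Run `run_case` with warm-up runs discarded, then n measured repeats.

    `run_case` is a hypothetical zero-argument callable that executes one
    probe case and returns a comparable observation (string, dict, etc.).
    """
    for _ in range(warmup):          # warm-up: absorb cold-start effects
        run_case()
    observations, timings = [], []
    for _ in range(n):               # repeat-N: surface run-to-run variance
        start = time.perf_counter()
        observations.append(run_case())
        timings.append(time.perf_counter() - start)
    distinct = {repr(o) for o in observations}
    return {
        "runs": n,
        "distinct_observations": len(distinct),  # >1 means run-to-run variance
        "latency_mean_s": statistics.mean(timings),
        "latency_stdev_s": statistics.stdev(timings) if n > 1 else 0.0,
    }
```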
If it is genuinely unclear whether extra runs are worth the time or cost, ask the user before expanding the probe.

For comparative probes, choose a credible baseline such as `origin/main`, the latest release, or the same request without the suspected option.

For `openai-agents-python`, make the runtime context explicit:

- Prefer `uv run python` when practical.
- If you cannot use `uv run python`, say exactly why and what interpreter or environment was used instead.

Use a matrix that makes the news easy to scan. Start from the runtime question and the observation summary, not just from an expected value and a pass-or-fail flag.
Use a matrix with at least these columns:
- `case_id`
- `scenario`
- `mode`
- `question`
- `setup`
- `observation_summary`
- `result_flag`
- `evidence`

Add these columns when they materially improve the investigation:
- `comparison_basis`
- `variable_under_test`
- `held_constant`
- `output_constraint`
- `status`
- `confidence`
- `state_setup`
- `repeats`
- `warm_up`
- `variance`
- `usage_note`
- `risk_profile`
- `env_vars`
- `approval`
- `control`

Treat `result_flag` as a fast-scan field with values such as `unexpected`, `negative`, `expected`, or `blocked`. Use `status` only when there is a credible comparison basis, baseline, or documented contract to compare against.
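As a sketch, a one-row matrix for a hypothetical `tool_choice` probe might look like this; every value below is illustrative, not prescribed by this skill:

| case_id | scenario | mode | question | setup | observation_summary | result_flag | evidence |
|---|---|---|---|---|---|---|---|
| tc-01 | forced tool call | repeat-3 | Is the tool invoked when `tool_choice` names it? | `tool_choice="get_weather"`, default model | tool invoked in 3/3 runs | expected | /tmp/probe-logs/tc-01.jsonl |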
Always consider whether the matrix should include additional scenario categories. Open `validation-matrix.md` when you need a stronger prioritization model or a reusable case template.
Write one-off scripts in a temporary file or directory, such as one created by `mktemp -d` or Python's `tempfile`. Keep the script outside the repository by default, even when it imports code from the repository.
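A minimal sketch of that setup with Python's `tempfile`; the probe body here is a placeholder:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Keep the one-off script outside the repository by default.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "probe.py"
    script.write_text('print("probe placeholder")\n')  # placeholder probe body
    # Run with the current interpreter; swap in `uv run python` when practical.
    result = subprocess.run([sys.executable, str(script)],
                            capture_output=True, text=True, check=False)
    print(result.stdout)
```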
If the probe needs repository code:
- Set `PYTHONPATH` or the equivalent import path explicitly.
- For `openai-agents-python`, prefer `uv run python /tmp/probe.py` from the repository root.

Design the probe to maximize observability, as in the sketch below.
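This sketch shows a disposable probe that sets an explicit import path and prints everything it observes; the `src` layout, the commented import, and the observation fields are assumptions for illustration:

```python
# /tmp/probe.py — disposable; run from the repository root, e.g.
#   uv run python /tmp/probe.py
import json
import sys
from pathlib import Path

REPO_ROOT = Path.cwd()                      # assumes you run from the repo root
sys.path.insert(0, str(REPO_ROOT / "src"))  # explicit import path, not implicit

# Hypothetical import: replace with the module actually under test.
# from agents import Runner

observation = {
    "question": "does X happen at runtime?",  # the runtime question under test
    "setup": {"cwd": str(REPO_ROOT), "python": sys.version.split()[0]},
    "observed": None,  # fill with the raw observed behavior, not an expectation
}
print(json.dumps(observation, indent=2))    # machine-readable evidence
```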
Before deleting the temporary script or directory, record a short run summary: the script path, the command used, the runtime context, and whether the evidence was kept or deleted.
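A run summary can be as short as the following; the paths and details are illustrative:

```text
probe:    /tmp/tmpq1w2e3/probe.py        (deleted)
command:  uv run python /tmp/tmpq1w2e3/probe.py
runtime:  uv-managed Python, openai-agents-python repo root
evidence: raw output deleted; observation summaries kept in the matrix
```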
Open `python_probe.py` when you want a lightweight, disposable Python probe scaffold.
Report in the order set out in the recommended response template.
For comparative probes, the report should also say what was held constant, what variable was under test, and whether the result supports only pattern parity or a broader quality claim.
Open `reporting-format.md` for the recommended response template.