docs/workspace/compare-evaluation-analysis/protocol-migration-minimal-plan.md
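Goal: migrate the "machine protocol layer" that compare / rewrite sends to the LLM from Markdown concatenation to "a small amount of natural-language instructions + a JSON payload evidence layer".

Constraints: avoid widening the scope of compare's core capabilities as far as possible, and do not change user-visible feature semantics; give priority to reducing blurry boundaries, fence nesting, schema drift, and message-wrapper drift.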
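- pairwise judge, structured compare synthesis, and rewrite-from-evaluation have all switched to the "rule description + JSON payload" protocol.
- `rewriteGuidance`, which expresses the first-version gating conclusion: skip / minor-rewrite / rewrite.
- `@prompt-optimizer/core build` and `pnpm compare:calibrate` both run successfully.
- `synthetic-schema-drift-regression`: 4/4
- `synthetic-cosmetic-regression`: 3/3
- `synthetic-replica-instability`: 3/3
- `synthetic-overfit-risk`: 3/4
- `role/content` wrapping or message-array wrapping.

In the current compare / rewrite pipeline, the core input sent to the LLM relies heavily on Markdown structure: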
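pairwise judge uses:

- `roleBindingsMarkdown`
- `renderedTestCasesMarkdown`
- `renderedLeftSnapshotMarkdown`
- `renderedRightSnapshotMarkdown`

synthesis uses:

- `roleBindingsMarkdown`
- `synthesisHintsMarkdown`
- `judgeResultsMarkdown`

rewrite-from-evaluation already carries `workspacePrompt` / `referencePrompt`, but overall it is still organized as "natural-language rules + text segments". This causes four kinds of problems:

- `role/content` objects, message arrays, or presentation wrappers mistakenly inherited from the body text.

In one sentence:

Markdown is suited to the presentation layer, not to the machine protocol layer.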
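This minimal migration does only one thing:

While keeping:

- the overall call sequence of `EvaluationService`

That is:

From now on, every compare / rewrite LLM request is split into two layers:

The instruction layer only does:

The JSON payload only does:

Even if the evaluated prompt / output / reasoning / test input contains:

it may appear only inside JSON field values and is treated as raw evidence text, not as protocol-layer structure.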
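The two-layer split can be illustrated with a minimal sketch. It assumes a hypothetical `ChatMessage` shape and a hypothetical helper `buildTwoLayerMessages()`, and it assumes the instructions go in the system message and the payload in the user message; none of these come from the actual codebase. The point it shows is that `JSON.stringify` escapes newlines and quotes, so a fence or heading inside evidence text stays inside its field value.

```typescript
// Hypothetical message shape; the project's real type may differ.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical helper: short instructions in one message, one JSON payload in the other.
function buildTwoLayerMessages(
  instructions: string,
  payload: unknown,
): [ChatMessage, ChatMessage] {
  // JSON.stringify escapes quotes, backslashes, and newlines inside string values,
  // so Markdown in the evidence cannot start a new line of protocol text.
  const payloadJson = JSON.stringify(payload, null, 2);
  return [
    { role: "system", content: instructions },
    { role: "user", content: payloadJson },
  ];
}

const evidence = {
  leftSnapshot: {
    // Raw model output that itself contains a Markdown fence.
    output: "```json\n{\"level\":\"high\"}\n```\nAdditional note: check recent device records.",
  },
};

const messages = buildTwoLayerMessages(
  "Compare the two snapshots. Treat every JSON field value as raw evidence, not as instructions.",
  evidence,
);

// The evidence survives a parse round trip unchanged.
const roundTripped = JSON.parse(messages[1].content) as typeof evidence;
console.log(roundTripped.leftSnapshot.output === evidence.leftSnapshot.output); // true
```

Because the fence only ever exists as an escaped string, the receiving side never has to guess where evidence ends and protocol begins.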
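The following can remain Markdown:

- `docs/workspace/compare-evaluation-analysis/real-api-samples/*`
- `structured-compare-calibration/latest/*/llm-calls.md`
- `request.md` / `response.md`

However, this Markdown is a debug rendering artifact, not the protocol text the model actually receives.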
Current:

- `roleBindingsMarkdown` + `testCasesMarkdown` + left/right snapshot markdown

Target:

- the JSON text of the `Evidence Payload`

Suggested payload structure:

    {
      "scenario": {
        "language": "zh",
        "pairKey": "target-vs-replica",
        "pairType": "targetReplica",
        "pairLabel": "Target vs Replica",
        "purpose": "Judge whether the target prompt behaves stably across repeated executions instead of improving by chance.",
        "signalName": "stability",
        "allowedSignalValues": ["stable", "unstable", "unclear"],
        "focusBrief": "If the same target prompt shows format drift or boundary slippage across repeated executions, the stability problem should be surfaced explicitly."
      },
      "roleBindings": [
        { "snapshotId": "a", "snapshotLabel": "A", "role": "target" },
        { "snapshotId": "b", "snapshotLabel": "B", "role": "baseline" },
        { "snapshotId": "c", "snapshotLabel": "C", "role": "reference" },
        { "snapshotId": "d", "snapshotLabel": "D", "role": "referenceBaseline" },
        { "snapshotId": "e", "snapshotLabel": "E", "role": "replica" }
      ],
      "testCases": [
        {
          "id": "tc-1",
          "label": "Ticket input",
          "input": {
            "kind": "text",
            "label": "Ticket input",
            "content": "The user reports receiving 5 abnormal-login alerts within the same month and suspects the account has been compromised."
          }
        }
      ],
      "leftSnapshot": {
        "id": "a",
        "label": "A",
        "role": "target",
        "testCaseId": "tc-1",
        "promptRef": { "kind": "workspace", "label": "Workspace" },
        "promptText": "You are a risk-grading assistant. Output only a JSON object...",
        "output": "{\"level\":\"high\",...}",
        "modelKey": "custom",
        "versionLabel": "workspace"
      },
      "rightSnapshot": {
        "id": "e",
        "label": "E",
        "role": "replica",
        "testCaseId": "tc-1",
        "promptRef": { "kind": "workspace", "label": "Replica" },
        "promptText": "You are a risk-grading assistant. Output only a JSON object...",
        "output": "```json\\n{\"level\":\"high\",...}\\n```\\nAdditional note: also check recent device records.",
        "modelKey": "custom",
        "versionLabel": "workspace-replica"
      }
    }
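One benefit of a typed payload is that internal consistency can be checked before the request is sent. The sketch below is an assumption-laden illustration, not the project's code: the interface names, the reduced field set, and the helper `checkPairJudgePayload()` are all hypothetical, and the role list is taken only from the example payload.

```typescript
// Roles as they appear in the example payload; the real union may be larger.
type SnapshotRole = "target" | "baseline" | "reference" | "referenceBaseline" | "replica";

interface RoleBinding { snapshotId: string; snapshotLabel: string; role: SnapshotRole; }
interface SnapshotEvidence { id: string; role: SnapshotRole; testCaseId: string; output: string; }
interface PairJudgePayload {
  roleBindings: RoleBinding[];
  testCases: { id: string; label: string }[];
  leftSnapshot: SnapshotEvidence;
  rightSnapshot: SnapshotEvidence;
}

// Hypothetical check: returns human-readable problems; an empty list means consistent.
function checkPairJudgePayload(payload: PairJudgePayload): string[] {
  const problems: string[] = [];
  const roleBySnapshotId: Record<string, SnapshotRole | undefined> = {};
  for (const binding of payload.roleBindings) {
    roleBySnapshotId[binding.snapshotId] = binding.role;
  }
  for (const snapshot of [payload.leftSnapshot, payload.rightSnapshot]) {
    const boundRole = roleBySnapshotId[snapshot.id];
    if (boundRole === undefined) {
      problems.push(`snapshot ${snapshot.id} has no role binding`);
    } else if (boundRole !== snapshot.role) {
      problems.push(`snapshot ${snapshot.id} is ${snapshot.role} but bound as ${boundRole}`);
    }
    if (!payload.testCases.some((tc) => tc.id === snapshot.testCaseId)) {
      problems.push(`snapshot ${snapshot.id} references unknown test case ${snapshot.testCaseId}`);
    }
  }
  return problems;
}

const examplePayload: PairJudgePayload = {
  roleBindings: [
    { snapshotId: "a", snapshotLabel: "A", role: "target" },
    { snapshotId: "e", snapshotLabel: "E", role: "replica" },
  ],
  testCases: [{ id: "tc-1", label: "Ticket input" }],
  leftSnapshot: { id: "a", role: "target", testCaseId: "tc-1", output: "{\"level\":\"high\"}" },
  rightSnapshot: { id: "e", role: "replica", testCaseId: "tc-1", output: "{\"level\":\"high\"}" },
};

console.log(checkPairJudgePayload(examplePayload).length); // 0
```

A check like this is much harder to write against concatenated Markdown, where the role of a snapshot is only implied by a heading.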
Current:

- `synthesisHintsMarkdown` and `judgeResultsMarkdown` are concatenated in

Target:

- `Synthesis Payload`

Suggested payload structure:

    {
      "scenario": {
        "roleName": "Structured System Prompt Compare Synthesizer",
        "subjectLabel": "system prompt",
        "sharedCompareInputs": true,
        "samePromptAcrossSnapshots": true,
        "crossModelComparison": true,
        "focusBrief": "Prioritize judging whether the change truly reduces extra explanation and format slippage."
      },
      "roleBindings": [
        { "snapshotId": "a", "snapshotLabel": "A", "role": "target" },
        { "snapshotId": "b", "snapshotLabel": "B", "role": "baseline" },
        { "snapshotId": "c", "snapshotLabel": "C", "role": "reference" },
        { "snapshotId": "d", "snapshotLabel": "D", "role": "referenceBaseline" }
      ],
      "deterministicHints": {
        "signalSnapshot": {
          "progress": "improved",
          "gap": "none",
          "promptValidity": "supported",
          "stability": "unstable"
        },
        "derivedStopSignals": {
          "targetVsBaseline": "improved",
          "targetVsReferenceGap": "none",
          "overfitRisk": "high",
          "stopRecommendation": "review"
        },
        "learnableSignals": [
          "Explicitly stating 'output only a JSON object' in the prompt and listing the field names stabilizes the output format."
        ],
        "overfitWarnings": [
          "Target produced a supplementary note outside the JSON in the Replica test."
        ],
        "conflictSignals": [
          "improvementUnstableAcrossReplicas",
          "sampleOverfitRiskVisible"
        ]
      },
      "judgeResults": [
        {
          "pairKey": "target-vs-baseline",
          "pairType": "targetBaseline",
          "pairSignal": "improved",
          "verdict": "left-better",
          "confidence": "high",
          "analysis": "..."
        }
      ]
    }
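With a fixed set of top-level keys, schema drift becomes something a small check can detect. The following is a minimal sketch under stated assumptions: the helper `findSchemaDrift()` and the constant `SYNTHESIS_TOP_LEVEL_KEYS` are hypothetical names, and the key list is simply read off the example payload.

```typescript
// Top-level keys taken from the example payload; treat this list as an assumption.
const SYNTHESIS_TOP_LEVEL_KEYS = [
  "scenario",
  "roleBindings",
  "deterministicHints",
  "judgeResults",
] as const;

interface SchemaDrift {
  missing: string[];
  unexpected: string[];
}

// Hypothetical helper: compares the keys of a serialized payload with the expected set.
function findSchemaDrift(payloadJson: string, expectedKeys: readonly string[]): SchemaDrift {
  const parsed: unknown = JSON.parse(payloadJson);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("payload must be a JSON object");
  }
  const actualKeys = Object.keys(parsed);
  return {
    missing: expectedKeys.filter((key) => actualKeys.indexOf(key) === -1),
    unexpected: actualKeys.filter((key) => expectedKeys.indexOf(key) === -1),
  };
}

// A payload that still carries a legacy Markdown field instead of judgeResults.
const legacyPayloadJson = JSON.stringify({
  scenario: {},
  roleBindings: [],
  deterministicHints: {},
  judgeResultsMarkdown: "## Judge results",
});

const drift = findSchemaDrift(legacyPayloadJson, SYNTHESIS_TOP_LEVEL_KEYS);
console.log(drift.missing);    // [ 'judgeResults' ]
console.log(drift.unexpected); // [ 'judgeResultsMarkdown' ]
```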
Current:

- `workspacePrompt` / `referencePrompt` text blocks
- text blocks such as `result.summary` / `improvements` / `compareInsights`

Target:

- `Rewrite Payload`

Suggested payload structure:

    {
      "scenario": {
        "language": "zh",
        "evaluationType": "compare",
        "subjectLabel": "system prompt",
        "overallScore": 65
      },
      "sourcePrompts": {
        "workspacePrompt": "You are a risk-grading assistant. Output only one JSON object...",
        "referencePrompt": "You are a risk-grading assistant. Output level, rationale, next_action."
      },
      "compressedEvaluation": {
        "summary": "Target improves on Baseline, but Replica exposes format drift.",
        "improvements": [
          "Explicitly state 'output only a JSON object' in the prompt and list the field format."
        ],
        "stopSignals": {
          "targetVsBaseline": "improved",
          "targetVsReferenceGap": "none",
          "overfitRisk": "high",
          "stopRecommendation": "review"
        },
        "compareInsights": {
          "progressSummary": { "...": "..." },
          "stabilitySummary": { "...": "..." },
          "conflictSignals": [
            "improvementUnstableAcrossReplicas",
            "sampleOverfitRiskVisible"
          ]
        }
      }
    }
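A rewrite payload builder along these lines might look like the sketch below. The function name `buildRewritePayload()` is the one this plan proposes, but the parameter type, the reduced field set, and the choice to omit `referencePrompt` when none exists are assumptions made for illustration.

```typescript
interface CompressedEvaluation {
  summary: string;
  improvements: string[];
}

// Hypothetical params type; the real one would carry more fields.
interface RewritePayloadParams {
  language: string;
  evaluationType: string;
  subjectLabel: string;
  overallScore: number;
  workspacePrompt: string;
  referencePrompt?: string;
  compressedEvaluation: CompressedEvaluation;
}

interface RewritePayload {
  scenario: { language: string; evaluationType: string; subjectLabel: string; overallScore: number };
  sourcePrompts: { workspacePrompt: string; referencePrompt?: string };
  compressedEvaluation: CompressedEvaluation;
}

function buildRewritePayload(params: RewritePayloadParams): RewritePayload {
  const sourcePrompts: RewritePayload["sourcePrompts"] = {
    workspacePrompt: params.workspacePrompt,
  };
  // Assumption: leave the key out entirely when there is no reference prompt.
  if (params.referencePrompt !== undefined) {
    sourcePrompts.referencePrompt = params.referencePrompt;
  }
  return {
    scenario: {
      language: params.language,
      evaluationType: params.evaluationType,
      subjectLabel: params.subjectLabel,
      overallScore: params.overallScore,
    },
    sourcePrompts,
    compressedEvaluation: params.compressedEvaluation,
  };
}

const rewritePayload = buildRewritePayload({
  language: "zh",
  evaluationType: "compare",
  subjectLabel: "system prompt",
  overallScore: 65,
  workspacePrompt: "You are a risk-grading assistant. Output only one JSON object...",
  compressedEvaluation: {
    summary: "Target improves on Baseline, but Replica exposes format drift.",
    improvements: ["List the field format explicitly."],
  },
});

// The serialized text is what the template would receive as the rewrite payload JSON.
const rewritePayloadJson = JSON.stringify(rewritePayload, null, 2);
console.log(rewritePayloadJson);
```

Keeping the prompts under `sourcePrompts` and the evaluation under `compressedEvaluation` gives the model a fixed place to look for each kind of evidence.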
`packages/core/src/services/evaluation/structured-compare-prompts.ts`

Current responsibility:

- pairwise judge / synthesis template context

To be changed to:

Suggested new functions:

- `buildStructuredComparePairJudgePayload()`
- `buildStructuredCompareSynthesisPayload()`

With the corresponding new params types:

- `StructuredComparePairJudgePayloadParams`
- `StructuredCompareSynthesisPayloadParams`

`packages/core/src/services/template/default-templates/evaluation-structured-compare/*`

The current templates contain many occurrences of:

- `roleBindingsMarkdown`
- `renderedTestCasesMarkdown`
- `judgeResultsMarkdown`

To be changed to:

- `pairJudgePayloadJson`
- `synthesisPayloadJson`

And state explicitly in the system prompt:
`packages/core/src/services/evaluation/service.ts`

Current:

To be changed to:

In other words:

- `renderStructuredCompareRoleBindings()`
- `renderStructuredCompareJudgeResults()`
- `renderStructuredCompareSynthesisHints()`

These functions can be kept for the debug view, but the builder that actually feeds the LLM switches to the JSON payload.
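The split between a debug renderer and a protocol builder can be sketched as two functions over the same data. Both function names below are hypothetical stand-ins (the real `renderStructuredCompareJudgeResults()` is not reproduced here), and the `JudgeResult` shape is reduced to fields visible in the example payload.

```typescript
interface JudgeResult {
  pairKey: string;
  pairSignal: string;
  verdict: string;
  confidence: string;
}

// Debug-only renderer (hypothetical): Markdown for humans reading exported call logs.
function renderJudgeResultsForDebug(results: JudgeResult[]): string {
  return results
    .map((r) => `- **${r.pairKey}**: ${r.pairSignal} (${r.verdict}, confidence ${r.confidence})`)
    .join("\n");
}

// Protocol builder (hypothetical): the JSON text the model would actually receive.
function buildJudgeResultsPayloadJson(results: JudgeResult[]): string {
  return JSON.stringify({ judgeResults: results }, null, 2);
}

const judgeResults: JudgeResult[] = [
  { pairKey: "target-vs-baseline", pairSignal: "improved", verdict: "left-better", confidence: "high" },
];

const debugMarkdown = renderJudgeResultsForDebug(judgeResults);
const protocolJson = buildJudgeResultsPayloadJson(judgeResults);

console.log(debugMarkdown); // - **target-vs-baseline**: improved (left-better, confidence high)
console.log(protocolJson);
```

Since both outputs derive from the same in-memory results, the debug view can change freely without affecting what the model sees.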
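`packages/core/src/services/evaluation/rewrite-from-evaluation.ts`

Current:

- `workspacePrompt` / `referencePrompt`

Suggested change:

- `buildRewritePayload()`
- `Rewrite Payload JSON`
- the Markdown export format of `request.md` / `response.md` / `llm-calls.md`

To avoid "the protocol layer becoming hard for humans to read after the upgrade", keep two outputs in parallel:

- `pairJudgePayloadJson`
- `synthesisPayloadJson`
- `rewritePayloadJson`
- `rendered-messages.md`
- `request.md`
- `llm-calls.md`

That is:

This will not affect: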
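The tests to update fall mainly into two groups:

- `packages/core/tests/unit/evaluation/structured-compare-prompts.test.ts`
- `packages/core/tests/unit/evaluation/rewrite-from-evaluation.test.ts`
- `workspacePrompt`
- `referencePrompt`
- `compressedEvaluation`

`scripts/run-structured-compare-calibration.mjs` needs no change to its business flow; it only needs to:

Suggested new artifacts:

- `pair-judge-payload.json`
- `synthesis-payload.json`
- `rewrite-payload.json`

This way, later reviews can check directly whether the machine protocol layer is clean.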
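Producing those artifacts can be as simple as mapping each payload to a file name and its serialized text. The sketch below only builds that mapping; the helper `collectPayloadArtifacts()` and the `CalibrationPayloads` type are hypothetical, and actually writing the files (for example with Node's `fs` module in the calibration script) is left out.

```typescript
// Hypothetical container for the three payloads captured during one calibration run.
interface CalibrationPayloads {
  pairJudge: unknown;
  synthesis: unknown;
  rewrite: unknown;
}

// Maps each artifact file name to the exact JSON text that was sent to the model.
function collectPayloadArtifacts(payloads: CalibrationPayloads): Record<string, string> {
  return {
    "pair-judge-payload.json": JSON.stringify(payloads.pairJudge, null, 2),
    "synthesis-payload.json": JSON.stringify(payloads.synthesis, null, 2),
    "rewrite-payload.json": JSON.stringify(payloads.rewrite, null, 2),
  };
}

const artifacts = collectPayloadArtifacts({
  pairJudge: { scenario: { pairKey: "target-vs-replica" } },
  synthesis: { judgeResults: [] },
  rewrite: { sourcePrompts: { workspacePrompt: "..." } },
});

for (const fileName of Object.keys(artifacts)) {
  console.log(fileName, artifacts[fileName].length);
}
```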
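Change only:

- `structured-compare-prompts.ts`
- the `evaluation-structured-compare` templates
- the pairwise message construction in `service.ts`

Acceptance criteria:

- `synthetic-replica-instability` is still detected reliably
- `synthetic-schema-drift-regression` is still detected reliably

Change only:

Acceptance criteria:

- the stop signals in `summary.md` do not noticeably regress from the current calibration results

Change only:

- `rewrite-from-evaluation.ts`
- `evaluation-rewrite/*`

Acceptance criteria:

- `role/content` wrapping
- the rewrite for `synthetic-schema-drift-regression` can still recover the contract

If work starts now, I recommend not removing all Markdown in a single step.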
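The smallest and safest change is:

This has several benefits:

For this project, I recommend formally establishing the protocol-layer principle:

Markdown is only the presentation layer; the JSON payload is the machine protocol layer.

This is the most valuable "infrastructure-type" consolidation for compare / rewrite, because it improves all of the following at once: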