docs/refactor/acp.md
ACP lifecycle currently works, but too much of it is inferred after the fact.
Process cleanup reconstructs ownership from PIDs, command strings, wrapper
paths, and the live process table. Session visibility reconstructs ownership
from session-key strings plus secondary sessions.list({ spawnedBy }) lookups.
That makes narrow fixes possible, but it also makes edge cases easy to miss:
PID reuse, quoted commands, adapter grandchildren, multi-gateway state roots,
cancel versus close, and tree versus all visibility all become separate
places to rediscover the same ownership rules.
This refactor makes ownership first-class. The goal is not a new ACP product surface; it is a safer internal contract for the existing ACP and ACPX behavior.
- cancel, close, and startup reaping have distinct lifecycle intents.
- sessions_list, sessions_history, sessions_send, and status checks use the same requester-owned session model.
- /acp command surface.
- cancel close reusable ACP sessions.

Each Gateway process should have a stable runtime instance id:
type GatewayInstanceId = string;
It can be generated on Gateway startup and persisted in state for the life of that install. It is not a security secret; it is an ownership discriminator used to avoid confusing one Gateway's ACP processes with another Gateway's processes.
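A minimal sketch of how such an id could be generated once and then reused, assuming a Node.js runtime. The `gateway-instance-id` file name and the `loadOrCreateGatewayInstanceId` helper are hypothetical and only illustrate the "generate on startup, persist in state" idea; the real state layout may differ.

```typescript
import { randomUUID } from "node:crypto";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

type GatewayInstanceId = string;

// Hypothetical file name inside the Gateway state directory.
const INSTANCE_ID_FILE = "gateway-instance-id";

// Returns the persisted id when one exists, otherwise generates and stores one.
function loadOrCreateGatewayInstanceId(stateDir: string): GatewayInstanceId {
  const file = join(stateDir, INSTANCE_ID_FILE);
  if (existsSync(file)) {
    const existing = readFileSync(file, "utf8").trim();
    if (existing.length > 0) return existing;
  }
  // Not a secret: a random UUID is only used as an ownership discriminator.
  const id = randomUUID();
  mkdirSync(stateDir, { recursive: true });
  writeFileSync(file, `${id}\n`);
  return id;
}
```

Calling the helper twice against the same state directory should return the same id, which is what lets a restarted Gateway recognize leases it created earlier.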
Every spawned ACP session should have normalized ownership metadata:
type AcpSessionOwner = {
sessionKey: string;
spawnedBy?: string;
parentSessionKey?: string;
ownerSessionKey: string;
agentId: string;
backend: "acpx";
gatewayInstanceId: GatewayInstanceId;
createdAt: number;
};
The Gateway should return these fields on session rows where they are known. Visibility filtering should be a pure check over row metadata:
canSeeSessionRow({
row,
requesterSessionKey,
visibility,
a2aPolicy,
});
That removes hidden secondary sessions.list({ spawnedBy }) calls from
visibility checks. A spawned cross-agent ACP child is requester-owned because
the row says so, not because a second query happens to find it.
Every generated wrapper launch should create a lease record:
type AcpxProcessLease = {
leaseId: string;
gatewayInstanceId: GatewayInstanceId;
sessionKey: string;
wrapperRoot: string;
wrapperPath: string;
rootPid: number;
processGroupId?: number;
commandHash: string;
startedAt: number;
state: "open" | "closing" | "closed" | "lost";
};
The wrapper process should receive the lease id and gateway instance id in its environment:
OPENCLAW_ACPX_LEASE_ID=...
OPENCLAW_GATEWAY_INSTANCE_ID=...
When the platform allows it, verification should prefer live process metadata that cannot be confused by command quoting:
- wrapperRoot

If the live process cannot be verified, cleanup fails closed.
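The fail-closed check can be sketched as a pure function over a snapshot of the live process. Everything here is an assumption for illustration: `LiveProcessInfo`, `LeaseIdentity`, `verifyLeaseOwnsProcess`, and the start-time tolerance are hypothetical, and which signals can actually be read (environment, start time) depends on the platform.

```typescript
// Subset of the lease fields used for verification in this sketch.
type LeaseIdentity = {
  leaseId: string;
  gatewayInstanceId: string;
  rootPid: number;
  startedAt: number; // assumed to be epoch milliseconds
};

// Hypothetical snapshot gathered by platform-specific code.
type LiveProcessInfo = {
  pid: number;
  startedAt?: number;
  env?: Record<string, string | undefined>;
};

// Assumed tolerance between the recorded and the observed start time.
const START_TIME_SLACK_MS = 2_000;

// Returns true only when every available signal matches the lease.
function verifyLeaseOwnsProcess(lease: LeaseIdentity, live: LiveProcessInfo | null): boolean {
  if (live === null || live.pid !== lease.rootPid) return false;
  // Missing metadata means "cannot verify", which fails closed.
  if (live.env === undefined || live.startedAt === undefined) return false;
  if (live.env.OPENCLAW_ACPX_LEASE_ID !== lease.leaseId) return false;
  if (live.env.OPENCLAW_GATEWAY_INSTANCE_ID !== lease.gatewayInstanceId) return false;
  // A very different start time suggests the PID was reused by another process.
  return Math.abs(live.startedAt - lease.startedAt) <= START_TIME_SLACK_MS;
}
```

Because the function never looks at command strings, quoting differences cannot produce a false match; an unreadable environment simply yields `false`.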
Introduce one ACPX lifecycle controller that owns process leases and cleanup policy:
interface AcpxLifecycleController {
ensureSession(input: AcpRuntimeEnsureInput): Promise<AcpRuntimeHandle>;
cancelTurn(handle: AcpRuntimeHandle): Promise<void>;
closeSession(input: {
handle: AcpRuntimeHandle;
discardPersistentState?: boolean;
reason?: string;
}): Promise<void>;
reapStartupOrphans(): Promise<void>;
verifyOwnedTree(lease: AcpxProcessLease): Promise<OwnedProcessTree | null>;
}
cancelTurn requests turn cancellation only. It must not reap reusable wrapper
or adapter processes.
closeSession is allowed to reap, but only after loading the session record,
loading the lease, and verifying the live process tree still belongs to that
lease.
reapStartupOrphans starts from open leases in state. It may use the process
table to find descendants, but it should not scan arbitrary ACP-looking
commands first and then decide they are probably ours.
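The ordering that closeSession is expected to follow (session record, then lease, then verified tree, then reap) might look like the sketch below. The `CloseDeps` interface, the `closeWithLease` function, and the outcome strings are hypothetical names introduced for illustration; only the ordering and the lease states come from the text above.

```typescript
type Lease = { leaseId: string; state: "open" | "closing" | "closed" | "lost" };
type OwnedProcessTree = { pids: number[] };

// Hypothetical dependencies, injected so the ordering is easy to test.
interface CloseDeps {
  loadLeaseIdForSession(sessionKey: string): Promise<string | undefined>;
  loadLease(leaseId: string): Promise<Lease | undefined>;
  verifyOwnedTree(lease: Lease): Promise<OwnedProcessTree | null>;
  reapTree(tree: OwnedProcessTree): Promise<void>;
  saveLease(lease: Lease): Promise<void>;
}

type CloseOutcome = "reaped" | "skipped-no-lease" | "skipped-unverified";

async function closeWithLease(sessionKey: string, deps: CloseDeps): Promise<CloseOutcome> {
  // 1. Session record first, then the lease it points at.
  const leaseId = await deps.loadLeaseIdForSession(sessionKey);
  const lease = leaseId ? await deps.loadLease(leaseId) : undefined;
  if (!lease) return "skipped-no-lease";

  // 2. Verify the live tree still belongs to this lease; otherwise touch nothing.
  const tree = await deps.verifyOwnedTree(lease);
  if (tree === null) return "skipped-unverified";

  // 3. Only now reap, recording the state transitions around it.
  await deps.saveLease({ ...lease, state: "closing" });
  await deps.reapTree(tree);
  await deps.saveLease({ ...lease, state: "closed" });
  return "reaped";
}
```

A cancelTurn implementation would deliberately have no equivalent of step 3, since it must leave reusable wrapper and adapter processes running.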
Generated wrappers should stay small. They should:
Wrappers should not decide session policy. They only enforce local process-tree cleanup for their own adapter group.
Visibility should use normalized row ownership:
type SessionVisibilityInput = {
requesterSessionKey: string;
row: {
key: string;
agentId: string;
ownerSessionKey?: string;
spawnedBy?: string;
parentSessionKey?: string;
};
visibility: "self" | "tree" | "agent" | "all";
a2aPolicy: AgentToAgentPolicy;
};
Rules:
- self: only the requester session.
- tree: requester session plus rows owned by or spawned from the requester.
- all: all same-agent rows, a2a-allowed cross-agent rows, and requester-owned spawned cross-agent rows even when general a2a is disabled.
- agent: same agent only, unless an explicit owner relationship says the row belongs to the requester.

This makes tree and all monotonic: all must not hide an owned child that tree would show.
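The rules above can be sketched as a pure function over row metadata. Several pieces are assumptions: the `AgentToAgentPolicy` shape (a single `allows` method), the extra `requesterAgentId` input, and the choice to treat only direct `ownerSessionKey` or `spawnedBy` matches as ownership (the sketch ignores `parentSessionKey` and deeper descendants).

```typescript
type Visibility = "self" | "tree" | "agent" | "all";

type SessionRow = {
  key: string;
  agentId: string;
  ownerSessionKey?: string;
  spawnedBy?: string;
  parentSessionKey?: string;
};

// Hypothetical policy shape; the real AgentToAgentPolicy may differ.
type AgentToAgentPolicy = { allows(fromAgentId: string, toAgentId: string): boolean };

type VisibilityCheck = {
  row: SessionRow;
  requesterSessionKey: string;
  requesterAgentId: string; // assumed to be known to the caller
  visibility: Visibility;
  a2aPolicy: AgentToAgentPolicy;
};

// Direct ownership only: the row names the requester as owner or spawner.
function isOwnedByRequester(row: SessionRow, requesterSessionKey: string): boolean {
  return row.ownerSessionKey === requesterSessionKey || row.spawnedBy === requesterSessionKey;
}

function canSeeSessionRow(input: VisibilityCheck): boolean {
  const { row, requesterSessionKey, requesterAgentId, visibility, a2aPolicy } = input;
  const isSelf = row.key === requesterSessionKey;
  const owned = isOwnedByRequester(row, requesterSessionKey);
  const sameAgent = row.agentId === requesterAgentId;
  switch (visibility) {
    case "self":
      return isSelf;
    case "tree":
      return isSelf || owned;
    case "agent":
      return sameAgent || owned;
    case "all":
      // Owned rows stay visible even when the a2a policy denies the agent pair.
      return sameAgent || owned || a2aPolicy.allows(requesterAgentId, row.agentId);
  }
}
```

Since every term in the `tree` branch also appears in the `all` branch (the requester's own row is always same-agent), any row visible under `tree` is also visible under `all` in this sketch.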
- gatewayInstanceId to Gateway state.
- leaseId on new ACPX session records.
- leaseId first.
- closed after verified cleanup.
- lost when the process is gone before cleanup.
- closed and lost leases with a bounded retention window.
- ownerSessionKey or spawnedBy.
- sessions.list({ spawnedBy }) lookups.

After one release window:
Add two table-driven suites.
Process lifecycle simulator:
- ps command is not

Session visibility matrix:
- self, tree, agent, all
- tree

The important invariant: a requester-owned spawned child is visible wherever the configured visibility includes the requester session tree, and all is not less capable than tree.
Old session records may not have leaseId. They should use the legacy
fail-closed cleanup path:
If a legacy record cannot be verified, leave it alone. Startup lease cleanup and the next release window should eventually retire the fallback.
- cancel aborts the active turn without closing reusable sessions.
- sessions_list can show requester-owned cross-agent ACP children under both tree and all.