.agent/research/isolation-model-comparison.md
Comparing five isolation approaches from weakest to strongest security boundary, with focus on how each binds to the host kernel and what the actual trusted computing base is.
| | Namespaces/Jailing | Docker (default) | agentOS/Secure-Exec | gVisor | Firecracker |
|---|---|---|---|---|---|
| Isolation primitive | Kernel namespaces | Namespaces + seccomp + caps + LSM | Userspace TypeScript kernel | Userspace Go kernel (Sentry) | Hardware virtualization (VT-x/AMD-V) |
| Host kernel shared? | Yes (full) | Yes (filtered) | Yes (underneath Node.js) | Yes (minimal surface) | No (guest gets own kernel) |
| Host syscalls reachable | ~385 (all) | ~361 unconditional + ~65 conditional | All (via Node.js), jailing coming soon | 53-68 (seccomp-enforced) | ~40 (seccomp-enforced, from VMM only) |
| Kernel exploit = host compromise? | Yes | Yes | N/A (no guest kernel) | Only if in Sentry's 53-68 syscalls | No (only compromises guest) |
| TCB size | Host kernel | Host kernel + runc | V8 + Node.js + kernel code | Host kernel (53-68 paths) + Sentry (Go) | KVM + VMM (50K lines Rust) + Jailer |
| Memory safety | C (kernel) | C (kernel) | TypeScript/JS (GC'd) | Go (GC'd) | Rust (compile-time) |
| Boot time | Instant | ~50ms | Near-instant | 50-100ms | ~125ms |
| Memory overhead | Negligible | <10 MiB | Minimal | 10-50 MiB | <5 MiB |
| Multi-tenant safe? | No | No | No | Yes (with caveats) | Yes (production-proven at AWS scale) |
Raw Linux kernel primitives that virtualize global resources. Namespaces change what a process can see, not what it can do.
- PID (`CLONE_NEWPID`): Isolated process ID tree. Process sees its own PID 1.
- Network (`CLONE_NEWNET`): Isolated network stack, routing tables, firewall rules, port space.
- Mount (`CLONE_NEWNS`): Isolated filesystem mount points. Different view of the filesystem hierarchy.
- UTS (`CLONE_NEWUTS`): Isolated hostname and NIS domain name.
- IPC (`CLONE_NEWIPC`): Isolated System V IPC objects and POSIX message queues.
- User (`CLONE_NEWUSER`): Isolated UID/GID mappings. Root inside maps to an unprivileged UID on the host.
- Cgroup (`CLONE_NEWCGROUP`): Virtualized view of /proc/[pid]/cgroup.
- Time (`CLONE_NEWTIME`): Isolated CLOCK_MONOTONIC and CLOCK_BOOTTIME.

It touches the host kernel directly: the process makes syscalls straight to the host kernel.
A namespaced process has access to the entire ~385 syscall table. Namespaces do not restrict syscalls at all. They only virtualize resource views (PID numbers, network stacks, mount trees). The kernel processes every syscall from a namespaced process identically to any other process, just with a different namespace context.
```
Process in namespace --> syscall --> Host kernel (all ~385 syscalls available)
                                          ^
                       Only namespace context changes what
                       the process sees, NOT what it can call
```
cgroups (v1 or v2) limit resource consumption (CPU, memory, PIDs, block I/O) but do not prevent privilege escalation or restrict syscalls.
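A minimal sketch of that distinction, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and root privileges (the `demo` group name is illustrative): resource caps are just files, and none of them filters a syscall.

```typescript
// Sketch: what cgroups actually do -- cap resources, not syscalls.
import { mkdirSync, writeFileSync } from "node:fs";

const cg = "/sys/fs/cgroup/demo";
mkdirSync(cg, { recursive: true });
writeFileSync(`${cg}/memory.max`, "268435456");           // 256 MiB hard cap
writeFileSync(`${cg}/pids.max`, "64");                    // fork-bomb guard
writeFileSync(`${cg}/cgroup.procs`, String(process.pid)); // enroll this process

// None of the limits above blocks a single syscall: the enrolled
// process can still invoke the entire host syscall table.
```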
A kernel vulnerability in any of the ~385 syscall handlers is exploitable from within a namespace. There is zero syscall filtering, zero capability reduction, and zero MAC enforcement unless you add those layers yourself.
Never alone for untrusted code. Raw namespaces are building blocks, not a complete isolation solution.
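To see the "visibility, not capability" point concretely, here is a hedged sketch using util-linux's `unshare` from Node.js, assuming Linux with unprivileged user namespaces enabled:

```typescript
// Sketch: a process in fresh user/PID/mount namespaces sees itself as
// PID 1, yet still talks to the same host kernel with the full syscall table.
import { execFileSync } from "node:child_process";

const out = execFileSync(
  "unshare",
  ["--user", "--map-root-user", "--pid", "--mount", "--fork",
   "sh", "-c", 'echo "pid inside: $$"; uname -r'],
  { encoding: "utf8" },
);
console.log(out); // "pid inside: 1" ... followed by the *host* kernel release
```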
Namespaces + cgroups + seccomp + capability dropping + AppArmor/SELinux + /proc and /sys masking. Defense-in-depth on a shared kernel.
- LSM profile (AppArmor `docker-default` or SELinux `container_t`)
- /proc and /sys masking: tmpfs over sensitive paths, /dev/null bind-mounts over /proc/kcore, /proc/keys, etc.
- pivot_root + unmount old root: old root completely inaccessible (unlike chroot)
- no_new_privs bit: prevents setuid escalation
- Default seccomp profile blocks io_uring_*, kexec_load, pivot_root, userfaultfd, vm86, kernel module syscalls, etc.
- Capabilities granted: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE.
- Critically NOT granted: SYS_ADMIN, SYS_MODULE, SYS_RAWIO, SYS_PTRACE, NET_ADMIN, DAC_READ_SEARCH, BPF.
The container still touches the host kernel directly, but with a reduced attack surface.
```
Container process --> seccomp filter --> Host kernel (~361 syscalls reachable)
                           ^                   ^
                     Blocks ~23+        Still processes all allowed
                     dangerous calls    syscalls on the shared kernel
```
The container process still makes real host kernel syscalls. seccomp reduces which ones, capabilities reduce what root can do, LSM profiles add MAC restrictions. But the kernel is shared. A vulnerability in any of the ~361 allowed syscall paths is exploitable.
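What tightening those layers further can look like in practice: a hedged launcher sketch, assuming a local `docker` CLI and a `./seccomp-min.json` profile you would supply yourself.

```typescript
// Sketch: shrinking Docker's default attack surface from a Node.js launcher.
import { execFileSync } from "node:child_process";

execFileSync("docker", [
  "run", "--rm",
  "--cap-drop", "ALL",                             // drop all 14 default capabilities
  "--security-opt", "no-new-privileges",           // set the no_new_privs bit
  "--security-opt", "seccomp=./seccomp-min.json",  // replace the ~361-syscall default
  "--pids-limit", "128",                           // cgroup PID cap
  "--read-only",                                   // immutable root filesystem
  "alpine", "true",
], { stdio: "inherit" });
```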
- `--privileged` flag: grants all capabilities, disables seccomp/AppArmor, gives access to all devices. The container can mount the host filesystem, load kernel modules, and nsenter into host namespaces.
- Docker socket mount: if /var/run/docker.sock is mounted, the container can create new privileged containers.
- Writable /proc/sys/kernel/core_pattern: can specify a pipe program that runs on the host.
- cgroup v1 release_agent: executes on the host when the last process in a cgroup exits.

```
Docker CLI -> dockerd -> containerd -> containerd-shim -> runc -> container process
                                             ^
                                  Per-container parent process,
                                  survives containerd restarts
```
runc does the actual kernel setup: creates namespaces, configures cgroups, performs pivot_root, masks /proc and /sys, drops capabilities, applies seccomp, applies LSM profiles, then execve()s into the container entrypoint.
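Two of the escape vectors above are easy to probe from inside a container. A hedged audit sketch (CAP_SYS_ADMIN is bit 21 of the CapEff mask in /proc/self/status):

```typescript
// Sketch: flag a mounted Docker socket and an over-privileged capability set.
import { existsSync, readFileSync } from "node:fs";

if (existsSync("/var/run/docker.sock")) {
  console.warn("docker.sock is mounted: container can drive the daemon");
}

const hex = readFileSync("/proc/self/status", "utf8")
  .match(/^CapEff:\s*([0-9a-f]+)/m)?.[1];
if (hex && (BigInt("0x" + hex) >> 21n) & 1n) {
  console.warn("CAP_SYS_ADMIN present: likely --privileged or misconfigured");
}
```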
A POSIX-compatible operating system kernel written in TypeScript that virtualizes all I/O and process management. All syscalls from guest code are intercepted and mediated by the kernel before reaching the host.
This is architecturally most similar to gVisor. Both implement a userspace kernel that intercepts syscalls. The key differences are the language, platform, and enforcement mechanism.
```
Agent code (V8 isolate / Worker thread)
        |
        v
Syscall shim (SharedArrayBuffer RPC / Node.js module interception)
        |
        v
Secure-Exec Kernel (TypeScript)
  |-- VFS (in-memory, host dir, S3 backends)
  |-- Process Table (global PIDs, signals, waitpid across runtimes)
  |-- Socket Table (loopback in-kernel, external via HostNetworkAdapter)
  |-- Pipe Manager (64KB buffers, cross-runtime IPC)
  |-- PTY Manager (terminal emulation)
  |-- Permission Wrapper (deny-by-default)
        |
        v
Node.js APIs (fs, net, child_process, crypto)
        |
        v
Host kernel (all syscalls available to Node.js process)
```
Three execution environments, all sharing the same kernel. Agent code runs in V8 isolates via node-ivm or in Worker threads, with all fs/net/process calls shimmed to go through the kernel via SharedArrayBuffer RPC.

Guest code reaches the host kernel indirectly, through Node.js, but the full Node.js API surface is available to the kernel itself.
```
Guest code -> Kernel permission check -> TypeScript kernel -> Node.js -> Host kernel
                      ^                        ^                 ^
               Deny-by-default          Mediates all I/O    Full libuv/V8/OpenSSL
               per-path/socket/env      No direct bypass    surface available
```
The kernel itself is a normal Node.js process. It can call any Node.js API and therefore any host syscall that Node.js/libuv uses. The isolation comes from:
```typescript
interface Permissions {
  fs?: (request: FsAccessRequest) => { allow: boolean; reason?: string };
  network?: (request: NetworkAccessRequest) => { allow: boolean; reason?: string };
  childProcess?: (request: ChildProcessAccessRequest) => { allow: boolean; reason?: string };
  env?: (request: EnvAccessRequest) => { allow: boolean; reason?: string };
}
```
Programmable, fine-grained, per-path/per-socket/per-env-var decisions. This is more flexible than any other model in this comparison.
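For illustration, a minimal policy against the interface above. The request field names (`path`, `mode`, `host`, `name`) are assumptions about the request shapes, not confirmed API:

```typescript
// Hypothetical request fields (path/mode/host/name) shown for illustration only:
// allow reads under /workspace, one outbound host, and nothing else.
const policy: Permissions = {
  fs: (req) =>
    req.path.startsWith("/workspace") && req.mode === "read"
      ? { allow: true }
      : { allow: false, reason: "fs access outside /workspace" },
  network: (req) =>
    req.host === "api.example.com"
      ? { allow: true }
      : { allow: false, reason: "outbound network denied" },
  childProcess: () => ({ allow: false, reason: "no subprocesses" }),
  env: (req) => ({ allow: req.name === "PATH" }),
};
```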
- Layer 1: Runtime isolation (V8 isolate heap / Worker thread). No shared JS state between processes.
- Layer 2: Kernel permission checks (deny-by-default). All I/O mediated.
- Layer 3 (coming soon): OS-level jailing (namespaces + seccomp on the Node.js process). This will restrict the host syscall surface available to the kernel itself, similar to how gVisor's Sentry self-imposes seccomp. This closes the biggest gap with gVisor and makes agentOS a 3-step escape chain.
| | Secure-Exec | gVisor |
|---|---|---|
| Kernel language | TypeScript | Go |
| Syscall interception | SharedArrayBuffer RPC / module shimming | seccomp SIGSYS trap / KVM ring switch |
| Host syscall restriction | None yet (full Node.js surface), jailing coming soon | seccomp-enforced 53-68 syscalls |
| Filesystem proxy | VFS backends (in-memory, host dir, S3) | Gofer process (LISAFS protocol) |
| Network stack | Socket table + HostNetworkAdapter | Netstack (full userspace TCP/IP) |
| Enforcement mechanism | Software (V8 isolate boundary + permission code) | Software (Go memory safety) + seccomp hardware |
| Can bypass kernel? | Only via V8 isolate escape (~1-2/year historically) | Only via Sentry escape + host seccomp bypass |
The critical gap (being addressed): gVisor applies a seccomp filter to itself (the Sentry), restricting it to 53-68 host syscalls. Even if the Sentry is fully compromised, the attacker can only use those 53-68 calls. Secure-Exec's kernel currently runs as unrestricted Node.js. OS-level jailing (namespaces + seccomp on the Node.js process) is coming soon, which will restrict the host syscall surface and add a third isolation layer.
Processes get global PIDs, signals, and waitpid() across runtimes. All processes in a Secure-Exec instance share the same TypeScript kernel: they share process table metadata, clock resolution, and heap state. This is agent-sandboxing, not multi-tenant isolation.
A userspace kernel written in Go that reimplements Linux syscall semantics. Guest processes make syscalls that are intercepted and handled entirely by the Sentry (gVisor's kernel), which then makes a minimal set of host syscalls.
This is the closest analog to Secure-Exec/agentOS, but with hardware-assisted enforcement and a restricted host syscall surface.
```
Application process
        |
        v
Syscall trap (seccomp SIGSYS on systrap platform, or VM exit on KVM platform)
        |
        v
Sentry (userspace kernel, Go)
  |-- VFS2 (full virtual filesystem)
  |-- Netstack (userspace TCP/IP stack)
  |-- Memory management (backed by single memfd)
  |-- Process scheduling (goroutines)
  |-- Implements 274 of 350 Linux amd64 syscalls
  |
  |-- [filesystem access] --> Gofer process (LISAFS protocol)
  |                                 |
  |                                 v
  |                           Host filesystem
  |
  v
Host kernel (only 53-68 syscalls reachable, seccomp-enforced)
```
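Adopting gVisor is typically just a runtime swap. A sketch, assuming runsc is installed and registered as a Docker runtime:

```typescript
// Sketch: same image, different kernel. Under runsc, dmesg prints the
// Sentry's own boot messages instead of the host's ring buffer.
import { execFileSync } from "node:child_process";

const out = execFileSync(
  "docker",
  ["run", "--rm", "--runtime=runsc", "alpine", "dmesg"],
  { encoding: "utf8" },
);
console.log(out);
```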
Systrap (default since mid-2023):
- Uses SECCOMP_RET_TRAP to deliver SIGSYS when the application attempts a syscall.
- Rewrites hot `syscall` instructions with a `jmp` to a trampoline, bypassing seccomp overhead entirely after first interception.

KVM:

- The Sentry acts as both guest kernel and VMM; application syscalls are intercepted via hardware VM exits rather than signals.
ptrace (deprecated):
- Uses PTRACE_SYSEMU. Very high context-switch overhead.

The Sentry touches the host kernel minimally: it applies a seccomp filter to itself.
```
Application -> Sentry (handles syscall in userspace) -> Host kernel
                                                            ^
                                               Only 53-68 syscalls allowed
                                               (seccomp self-imposed)
```
Critically blocked from the Sentry: open, socket, execve, fork, mount, ptrace. The Sentry cannot open files, create sockets, or spawn processes on the host. Filesystem access goes through the Gofer process.
Compare: Secure-Exec's TypeScript kernel can currently reach all of these through Node.js, until OS-level jailing lands.
A separate Go process that mediates all host filesystem access. Communicates with the Sentry via LISAFS protocol. The Sentry operates in an empty mount namespace and cannot open files itself. In directfs mode (now default), the Gofer donates FDs to the Sentry, with seccomp enforcing O_NOFOLLOW to prevent symlink traversal attacks.
gVisor implements a complete userspace TCP/IP stack. No host kernel networking code is involved for packet processing. This eliminates the entire host kernel network stack as attack surface. Throughput: ~17 Gbps vs 42 Gbps native (significant but acceptable for security).
Example: a host kernel vulnerability in the AF_PACKET code (PACKET_RX_RING) was unexploitable from gVisor because the Sentry never implemented that code path.

A lightweight Virtual Machine Monitor (VMM) that uses hardware virtualization (KVM + VT-x/AMD-V) to run microVMs. Each VM gets its own Linux kernel. The guest never shares a kernel with the host.
```
Guest application
        |
        v
Guest Linux kernel (entirely separate from host)
        |
        v
VM Exit (hardware trap, CPU-enforced)
        |
        v
Firecracker VMM (50K lines Rust, single process)
  |-- virtio-net (network)
  |-- virtio-block (storage)
  |-- virtio-vsock (host communication)
  |-- Serial console
  |-- Keyboard controller (reset only)
        |
        v
Jailer sandbox (namespaces + chroot + seccomp + privilege drop)
        |
        v
Host kernel (~40 syscalls allowed, seccomp-enforced)
```
The guest reaches the host kernel through the narrowest possible pipe, with hardware enforcement.
The guest never makes host syscalls. The CPU hardware enforces this. When the guest does something that requires VMM intervention (I/O to a virtio device, for example), the CPU traps (VM Exit) and transfers control to the Firecracker VMM process on the host.
```
Guest process -> Guest kernel -> VM Exit (hardware) -> KVM -> Firecracker VMM -> ~40 syscalls -> Host kernel
                      ^                   ^                          ^
               Separate kernel     CPU enforces boundary       Rust, 50K lines
               (exploit only       (no software bypass)        (memory-safe)
               affects guest)
```
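For concreteness, a hedged sketch of driving the VMM's REST API over its Unix socket, assuming a `firecracker --api-sock /tmp/fc.sock` process is already running; the kernel and rootfs paths are placeholders.

```typescript
// Sketch: configure and boot a microVM via Firecracker's HTTP API.
import http from "node:http";

function api(method: string, path: string, body: unknown): Promise<void> {
  return new Promise((resolve, reject) => {
    const req = http.request(
      { socketPath: "/tmp/fc.sock", path, method,
        headers: { "Content-Type": "application/json" } },
      (res) => (res.statusCode && res.statusCode < 300
        ? resolve()
        : reject(new Error(`HTTP ${res.statusCode} for ${path}`))),
    );
    req.on("error", reject);
    req.end(JSON.stringify(body));
  });
}

await api("PUT", "/boot-source", {
  kernel_image_path: "./vmlinux",                // placeholder guest kernel
  boot_args: "console=ttyS0 reboot=k panic=1",
});
await api("PUT", "/drives/rootfs", {
  drive_id: "rootfs",
  path_on_host: "./rootfs.ext4",                 // placeholder root filesystem
  is_root_device: true,
  is_read_only: false,
});
await api("PUT", "/actions", { action_type: "InstanceStart" });
```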
The Firecracker VMM process itself is sandboxed by the Jailer: chroot into an empty directory, dedicated mount/PID/network namespaces, privileges dropped to an unprivileged UID/GID, and a seccomp filter allowing ~40 host syscalls.
Only 5 emulated devices (QEMU has hundreds): virtio-net, virtio-block, virtio-vsock, a serial console, and a minimal keyboard controller (reset only). Each device is a potential attack surface, so minimizing them is a security strategy. No USB, GPU, audio, or any other device.
```
Layer 1: CPU hardware (VT-x/AMD-V)
         Guest cannot execute host instructions. Period.
         Breaking this requires a CPU microarchitecture bug.

Layer 2: Firecracker VMM (Rust, 50K lines)
         Translates virtio requests to host I/O.
         Memory-safe. Minimal device model.

Layer 3: Jailer (namespaces + seccomp + chroot)
         Even if VMM is compromised, attacker is in a sandbox
         with ~40 syscalls and no filesystem access.

Layer 4: Privilege separation
         Unprivileged process. Cannot escalate.
```
A guest escape requires breaching ALL FOUR layers. Each is independent. This is why AWS trusts it for Lambda (billions of untrusted invocations on shared hardware).
```
Namespaces:  ████████████████████████████████████████  ~385 syscalls (full kernel)
Docker:      ██████████████████████████████████████    ~361 syscalls (seccomp filtered)
agentOS:     ████████████████████████████████████████  ~385 syscalls (Node.js unrestricted, jailing coming soon)
gVisor:      ██████                                    53-68 syscalls (self-imposed seccomp)
Firecracker: █████                                     ~40 syscalls (Jailer seccomp, from VMM only)
```
Note: agentOS's host syscall surface is currently comparable to raw namespaces because the kernel runs as an unrestricted Node.js process. The difference is that guest code cannot directly invoke those syscalls. It must go through the TypeScript kernel. OS-level jailing (namespaces + seccomp) is coming soon, which will significantly reduce this surface.
```
              Hardware             Software kernel    Software checks
              enforced?            reimplemented?     only?
Namespaces:                                           X
Docker:                                               X
agentOS:      X (jail, soon)       X (partial)        X
gVisor:       X (seccomp)          X (full)
Firecracker:  X (VT-x + seccomp)
```
```
Namespaces:  1 step  (kernel exploit)
Docker:      1 step  (kernel exploit, or runc bug, or misconfiguration)
agentOS:     3 steps (V8 escape -> kernel bug -> jail escape) [jailing coming soon]
gVisor:      2 steps (Sentry logic bug -> exploit one of 53-68 host syscalls)
Firecracker: 4 steps (guest kernel -> VM escape -> VMM bug -> Jailer escape)
```
Note: Without jailing, agentOS is currently 2 steps (V8 escape -> kernel bug). With jailing, even after breaching the V8 isolate and exploiting a kernel bug, the attacker must also escape the OS-level jail (namespaces + seccomp) to reach the full host.
agentOS and gVisor are architecturally very similar. Both:

- implement a userspace kernel that intercepts and mediates every guest syscall
- provide their own VFS, process table, and pipe/socket management
- mediate network access rather than exposing the host stack directly
- share the host kernel underneath the userspace kernel
The key differences that make gVisor stronger:

- The Sentry restricts itself with seccomp to 53-68 host syscalls; the Secure-Exec kernel currently has the full Node.js surface.
- Syscall interception is enforced by seccomp traps or hardware VM exits, not module shimming.
- Netstack is a full userspace TCP/IP stack; agentOS's external networking ultimately rides on Node's net/dgram.

What makes agentOS more flexible:

- Programmable, fine-grained, per-path/per-socket/per-env-var permission decisions (the Permissions interface above).
Coming soon: OS-level jailing. The Node.js kernel process will be run inside a jail (namespaces + seccomp), restricting the host syscall surface available even if the kernel itself is compromised. This is the single highest-impact improvement and brings agentOS to a 3-step escape chain (V8 escape -> kernel bug -> jail escape), comparable in structure to gVisor.
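A sketch of what such a jail could look like today, assuming systemd is available (the property names are real systemd-run options, though some may require a system-level unit rather than `--user`; kernel.js is a placeholder entrypoint):

```typescript
// Sketch: launch the kernel process under namespace + seccomp confinement.
import { spawn } from "node:child_process";

spawn("systemd-run", [
  "--user", "--wait", "--collect",
  "-p", "PrivateTmp=yes",                    // private mount namespace for /tmp
  "-p", "PrivateDevices=yes",                // minimal /dev
  "-p", "NoNewPrivileges=yes",               // set the no_new_privs bit
  "-p", "SystemCallFilter=@system-service",  // seccomp allow-list
  "-p", "CapabilityBoundingSet=",            // drop all capabilities
  "node", "kernel.js",
], { stdio: "inherit" });
```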
Further hardening opportunities: