docs/jailer.md
The jailer is a program designed to isolate the Firecracker process in order to enhance Firecracker's security posture. It is meant to address the security needs of Firecracker only and is not intended to work with other binaries. Additionally, each jailer binary should be used with a statically linked Firecracker binary (with the default musl toolchain) of the same version. Experimental gnu builds are not supported.
The jailer is invoked in this manner:
jailer --id <id> \
--exec-file <exec_file> \
--uid <uid> \
--gid <gid> \
[--cgroup-version <cgroup_version>] \
[--cgroup <cgroup>] \
[--parent-cgroup <parent_cgroup>] \
[--chroot-base-dir <chroot_base>] \
[--netns <netns>] \
[--resource-limit <resource=value>] \
[--daemonize] \
[--new-pid-ns] \
[--...extra arguments for Firecracker]
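When the jailer is launched from another program, it can help to assemble the argument list in code rather than by string concatenation. The following is a minimal sketch, not part of the jailer itself: `build_jailer_argv` and its parameter names are hypothetical, and it only covers a subset of the flags listed above.

```python
from typing import List, Optional, Sequence


def build_jailer_argv(
    vm_id: str,
    exec_file: str,
    uid: int,
    gid: int,
    cgroups: Sequence[str] = (),
    resource_limits: Sequence[str] = (),
    netns: Optional[str] = None,
    daemonize: bool = False,
    new_pid_ns: bool = False,
    firecracker_args: Sequence[str] = (),
    jailer_bin: str = "jailer",
) -> List[str]:
    """Assemble a jailer command line (hypothetical helper)."""
    argv = [jailer_bin, "--id", vm_id, "--exec-file", exec_file,
            "--uid", str(uid), "--gid", str(gid)]
    for cgroup in cgroups:            # each entry is "<cgroup_file>=<value>"
        argv += ["--cgroup", cgroup]
    for limit in resource_limits:     # each entry is "<resource>=<value>"
        argv += ["--resource-limit", limit]
    if netns is not None:
        argv += ["--netns", netns]
    if daemonize:
        argv.append("--daemonize")
    if new_pid_ns:
        argv.append("--new-pid-ns")
    if firecracker_args:
        # Extra arguments for Firecracker go after a literal "--".
        argv += ["--", *firecracker_args]
    return argv
```

The resulting list can be handed to `subprocess.run` without going through a shell, which avoids quoting problems in cgroup values.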
--id specifies the unique VM identification string, which may contain
alphanumeric characters and hyphens. The maximum length is currently 64
characters.
--exec-file specifies the path to the Firecracker binary that will be
exec-ed by the jailer.
--uid and --gid specify the uid and gid the jailer switches to as it execs
the target binary.
--cgroup-version is used to select which type of cgroup hierarchy to use for
the creation of cgroups. The default value is "1" which means that cgroups
specified with --cgroup will be created within a v1 hierarchy. Supported
options are "1" for cgroup-v1 and "2" for cgroup-v2.
--cgroup can be passed to the jailer to let it set the values when the
microVM process is spawned. The argument must follow this format:
<cgroup_file>=<value> (e.g. cpuset.cpus=0). This argument can be used
multiple times to set multiple cgroups. This is useful to avoid providing
privileged permissions to another process for setting the cgroups before or
after the jailer is executed. The --cgroup flag also makes it possible to set
the Firecracker process cgroups before the VM starts running, with no need to
create the entire cgroup hierarchy manually (which requires privileged
permissions).
--parent-cgroup is used to allow the placement of microVM cgroups in custom
nested hierarchies. The default value is the filename of <exec_file>, which
will be henceforth referred to as <exec_file_name>. The behavior of this
parameter depends on the following conditions:
If at least one --cgroup parameter is specified or --cgroup-version=1 is
passed, the jailer will create a new cgroup named <id> for the microVM in
the <cgroup_base>/<parent_cgroup> subfolder. <cgroup_base> is the cgroup
controller root for cgroup v1 (e.g. /sys/fs/cgroup/cpu) or the unified
controller hierarchy for cgroup v2 (e.g. /sys/fs/cgroup/unified).
<parent_cgroup> is a relative path within that hierarchy. For example, if
--parent-cgroup all_uvms/external_uvms is specified, the jailer will write
all cgroup parameters specified through --cgroup in
/sys/fs/cgroup/<controller_name>/all_uvms/external_uvms/<id>.
If no --cgroup parameters are specified and --cgroup-version=2 is
passed, the jailer will not create a new cgroup. If the cgroup specified
with --parent-cgroup exists, the jailer will move the process to the
specified cgroup, contrary to its name. This behavior can be used when users
want to configure a cgroup beforehand by themselves and move the process to
the configured cgroup. Note that, if the specified cgroup has domain
controllers (e.g. memory) enabled in cgroup.subtree_control, the move
fails due to the "no internal process constraint" and the jailer exits with an
error. If the cgroup specified with --parent-cgroup does not exist, the
jailer does not move the process to any cgroup and proceeds without error.
--chroot-base-dir specifies the base folder where chroot jails are built.
The default is /srv/jailer.
--netns specifies the path to a network namespace handle. If present, the
jailer will use this to join the associated network namespace.
--resource-limit can be used to set bounds on the process resources. The
argument must follow this format: <resource>=<value> (e.g. no-file=1024) and
can be used multiple times to set multiple bounds. The resources that can
currently be limited using this argument are:
fsize: The maximum size in bytes for files created by the process.
no-file: Specifies a value one greater than the maximum file descriptor
number that can be opened by this process.
Here is an example of how to set multiple resource limits using this argument:
--resource-limit fsize=250000000 --resource-limit no-file=1024
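To make the <resource>=<value> format concrete, here is a small sketch of how such arguments could be parsed and handed to setrlimit(). It is an illustration only: the mapping of fsize and no-file to RLIMIT_FSIZE and RLIMIT_NOFILE is inferred from the descriptions above, setting the soft and hard limit to the same value is an assumption, and the function names are hypothetical.

```python
import resource
from typing import Dict, Iterable

# Assumed mapping from the jailer's resource names to rlimit constants.
RESOURCE_NAMES = {
    "fsize": resource.RLIMIT_FSIZE,
    "no-file": resource.RLIMIT_NOFILE,
}


def parse_resource_limits(args: Iterable[str]) -> Dict[int, int]:
    """Turn ["fsize=250000000", ...] into {RLIMIT_FSIZE: 250000000, ...}."""
    limits: Dict[int, int] = {}
    for arg in args:
        name, sep, value = arg.partition("=")
        if not sep or name not in RESOURCE_NAMES:
            raise ValueError(f"invalid resource limit: {arg!r}")
        limits[RESOURCE_NAMES[name]] = int(value)
    return limits


def apply_resource_limits(limits: Dict[int, int]) -> None:
    """Apply each limit as both soft and hard bound (an assumption)."""
    for rlimit, value in limits.items():
        resource.setrlimit(rlimit, (value, value))
```

Calling `apply_resource_limits` lowers the limits of the calling process, so it is best tried in a throwaway child process.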
--daemonize causes the jailer to call setsid() and redirect
all three standard I/O file descriptors to /dev/null.
--new-pid-ns causes the jailer to spawn the provided binary
into a new PID namespace. It makes use of the libc clone() function with the
CLONE_NEWPID flag. As a result, the jailer and the process running the exec
file have different PIDs. The PID of the child process is stored in the jail
root directory inside <exec_file_name>.pid.
All arguments specified after -- are forwarded to Firecracker. For example,
this can be paired with the --config-file Firecracker argument to specify a
configuration file when starting Firecracker via the jailer (the file path and
the resources referenced within must be valid relative to a jailed
Firecracker). Please note that the jailer already passes the --id parameter
to the Firecracker process.
After starting, the Jailer goes through the following operations:
Close all open file descriptors listed in /proc/<jailer-pid>/fd except input,
output and error.
Create the <chroot_base>/<exec_file_name>/<id>/root folder, which will be
henceforth referred to as <chroot_dir>. Nothing is done if the path already
exists (it should not, since <id> is supposed to be unique).
Copy the file given by --exec-file to <chroot_dir>/<exec_file_name>.
This ensures the new process will not share memory with any other Firecracker
process.
Set resource bounds as requested through the --resource-limit argument, by
calling the setrlimit() system call with the specific resource argument. If no
limits are provided, the jailer bounds no-file to a maximum default value of
2048.
Create the cgroup sub-folders. The jailer can use either cgroup v1 or
cgroup v2. On most systems, this is mounted by default in /sys/fs/cgroup
(should be mounted by the user otherwise). The jailer will parse
/proc/mounts to detect where each of the controllers required in --cgroup
can be found (multiple controllers may share the same path). For each
identified location (referred to as <cgroup_base>), the jailer creates the
<cgroup_base>/<parent_cgroup>/<id> subfolder, and writes the current pid to
<cgroup_base>/<parent_cgroup>/<id>/tasks. Also, the value passed for each
<cgroup_file> is written to the file.
Call unshare() into a new mount namespace, use pivot_root() to switch the
old system root mount point with a new one based in <chroot_dir>, switch the
current working directory to the new root, unmount the old root mount point,
and call chroot into the current directory.
Use mknod to create a /dev/net/tun equivalent inside the jail.
Use mknod to create a /dev/kvm equivalent inside the jail.
Use chown to change ownership of the <chroot_dir> (root path / as seen
by the jailed firecracker), /dev/net/tun, /dev/kvm. The ownership is
changed to the provided <uid>:<gid>.
If --netns <netns> is present, attempt to join the specified network
namespace.
If --daemonize is specified, call setsid() and redirect STDIN, STDOUT,
and STDERR to /dev/null.
If --new-pid-ns is specified, call clone() with the CLONE_NEWPID flag to
spawn a new process within a new PID namespace. The new process will assume
the role of init(1) in the new namespace. The parent will store the child's PID
inside <exec_file_name>.pid, while the child drops privileges and exec()s
into the <exec_file_name>, as described below.
Drop privileges by switching to the provided uid and gid.
Exec into
<exec_file_name> --id=<id> --start-time-us=<opaque> --start-time-cpu-us=<opaque>
(and also forward any extra arguments provided to the jailer after --, as
mentioned in the Jailer Usage section), where:
<id>: (string) - The <id> argument provided to jailer.
<opaque>: (number) time calculated by the jailer that it spent doing its
work.
Let’s assume Firecracker is available as /usr/bin/firecracker, and the jailer
can be found at /usr/bin/jailer. We pick the unique id
551e7604-e35c-42b3-b825-416853441234, and we choose to run on NUMA node 0
(in order to isolate the process in the 0th NUMA node we need to set
cpuset.mems=0 and cpuset.cpus equal to the CPUs of that NUMA node), using
uid 123, and gid 100. For this example, we are content with the default
/srv/jailer chroot base dir.
We start by running:
/usr/bin/jailer --id 551e7604-e35c-42b3-b825-416853441234 \
--cgroup cpuset.mems=0 --cgroup cpuset.cpus=$(cat /sys/devices/system/node/node0/cpulist) \
--exec-file /usr/bin/firecracker --uid 123 --gid 100 \
--netns /var/run/netns/my_netns --daemonize
After opening the file descriptors mentioned in the previous section, the jailer will create the following resources (and all their prerequisites, such as the path which contains them):
/srv/jailer/firecracker/551e7604-e35c-42b3-b825-416853441234/root/firecracker
(copied from /usr/bin/firecracker)
We are going to refer to
/srv/jailer/firecracker/551e7604-e35c-42b3-b825-416853441234/root as
<chroot_dir>.
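The way <chroot_dir> is derived from the command line can be sketched in a few lines. This is an illustration rather than the jailer's own code: `chroot_dir` and `ID_PATTERN` are hypothetical names, and the id check simply encodes the rule stated for --id (alphanumeric characters and hyphens, at most 64 characters).

```python
import os
import re

# Assumed from the --id description: alphanumerics and hyphens, 1 to 64 chars.
ID_PATTERN = re.compile(r"[A-Za-z0-9-]{1,64}")


def chroot_dir(vm_id: str, exec_file: str, chroot_base: str = "/srv/jailer") -> str:
    """Return <chroot_base>/<exec_file_name>/<id>/root for the given inputs."""
    if not ID_PATTERN.fullmatch(vm_id):
        raise ValueError(f"invalid id: {vm_id!r}")
    exec_file_name = os.path.basename(exec_file)
    return os.path.join(chroot_base, exec_file_name, vm_id, "root")
```

With the values from this example, `chroot_dir("551e7604-e35c-42b3-b825-416853441234", "/usr/bin/firecracker")` yields the path shown above.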
Let’s also assume the cpuset cgroups are mounted at
/sys/fs/cgroup/cpuset. The jailer will create the following subfolder (which
will inherit settings from the parent cgroup):
/sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234
It’s worth noting that, whenever a folder already exists, nothing will be done,
and we move on to the next directory that needs to be created. This should only
happen for the common firecracker subfolder (but, as for creating the chroot
path before, we do not issue an error if folders directly associated with the
supposedly unique <id> already exist).
The jailer then writes the current pid to
/sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234/tasks.
It also writes 0 to
/sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234/cpuset.mems,
and the corresponding CPUs to
/sys/fs/cgroup/cpuset/firecracker/551e7604-e35c-42b3-b825-416853441234/cpuset.cpus.
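The cgroup v1 bookkeeping described above boils down to creating one folder and writing a handful of files. The sketch below mimics that with ordinary file operations; `apply_cgroup_v1` is a hypothetical helper, not the jailer's implementation. One deliberate choice: it writes the cgroup values before attaching the pid, because a cpuset cgroup refuses new tasks while cpuset.cpus or cpuset.mems is empty.

```python
import os
from typing import Dict


def apply_cgroup_v1(cgroup_base: str, parent_cgroup: str, vm_id: str,
                    pid: int, settings: Dict[str, str]) -> str:
    """Create <cgroup_base>/<parent_cgroup>/<id> and populate it."""
    cgroup_dir = os.path.join(cgroup_base, parent_cgroup, vm_id)
    # Folders that already exist are left alone.
    os.makedirs(cgroup_dir, exist_ok=True)
    # Write each <cgroup_file>=<value> pair first ...
    for cgroup_file, value in settings.items():
        with open(os.path.join(cgroup_dir, cgroup_file), "w") as f:
            f.write(value)
    # ... then move the process into the cgroup.
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(str(pid))
    return cgroup_dir
```

For the example run, the call would be `apply_cgroup_v1("/sys/fs/cgroup/cpuset", "firecracker", "<id>", os.getpid(), {"cpuset.mems": "0", "cpuset.cpus": cpus})`, where `cpus` is the content of the node0 cpulist file. Pointing `cgroup_base` at a scratch directory makes it easy to inspect the result without touching real cgroups.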
Since the --netns parameter is specified in our example, the jailer opens
/var/run/netns/my_netns to get a file descriptor fd, uses
setns(fd, CLONE_NEWNET) to join the associated network namespace, and then
closes fd.
The --daemonize flag is also present, so the jailer opens /dev/null as
RW and keeps the associated file descriptor as dev_null_fd (we do this
before going inside the jail), to be used later.
Build the chroot jail. First, the jailer uses unshare() to enter a new mount
namespace, and changes the propagation of all mount points in the new namespace
to private using mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL), as a
prerequisite to pivot_root(). Another required operation is to bind mount
<chroot_dir> on top of itself using
mount(<chroot_dir>, <chroot_dir>, NULL, MS_BIND, NULL). At this point, the
jailer creates the folder <chroot_dir>/old_root, changes the current directory
to <chroot_dir>, and calls syscall(SYS_pivot_root, ".", "old_root"). The
final steps of building the jail are unmounting old_root using
umount2("old_root", MNT_DETACH), deleting old_root with rmdir, and finally
calling chroot(".") for good measure. From now on, the process is jailed in
<chroot_dir>.
Create the special file /dev/net/tun, using
mknod("/dev/net/tun", S_IFCHR | S_IRUSR | S_IWUSR, makedev(10, 200)), and then
call chown("/dev/net/tun", 123, 100), so Firecracker can use it after dropping
privileges. This is required to use multiple TAP interfaces when running jailed.
Do the same for /dev/kvm.
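The mknod and chown calls above have direct counterparts in Python's os module, which makes the step easy to prototype. This is a sketch under assumptions: the helper names are hypothetical, and the device numbers for /dev/kvm (10, 232) are the usual Linux assignment rather than something stated in this document.

```python
import os
import stat

# (major, minor) per device, relative to the jail root.
# dev/net/tun comes from makedev(10, 200) in the text;
# dev/kvm = (10, 232) is an assumption (usual Linux value).
DEVICES = {
    "dev/net/tun": (10, 200),
    "dev/kvm": (10, 232),
}


def device_node_args(major: int, minor: int):
    """Return the (mode, device) pair to pass to mknod."""
    mode = stat.S_IFCHR | stat.S_IRUSR | stat.S_IWUSR
    return mode, os.makedev(major, minor)


def create_devices(root: str, uid: int, gid: int) -> None:
    """Create the device nodes under root; requires root privileges."""
    for rel_path, (major, minor) in DEVICES.items():
        path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        mode, dev = device_node_args(major, minor)
        os.mknod(path, mode, dev)
        os.chown(path, uid, gid)
```

`device_node_args` is split out so that the mode and device number can be checked without privileges; `create_devices` itself fails with a permission error unless it runs as root.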
Change ownership of <chroot_dir> to <uid>:<gid> so that Firecracker can
create its API socket there.
Since the --daemonize flag is present, call setsid() to join a new session,
a new process group, and to detach from the controlling terminal. Then, redirect
standard file descriptors to /dev/null by calling dup2(dev_null_fd, STDIN),
dup2(dev_null_fd, STDOUT), and dup2(dev_null_fd, STDERR). Close
dev_null_fd, because it is no longer necessary.
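The daemonize step can be sketched with the same system calls exposed through Python's os module. This is an illustration only, with hypothetical function names. Unlike the jailer, which opens /dev/null before entering the jail, this sketch opens it on the spot.

```python
import os


def redirect_to(src_fd: int, targets=(0, 1, 2)) -> None:
    """Make every fd in targets refer to the same file as src_fd."""
    for target in targets:
        os.dup2(src_fd, target)


def daemonize() -> None:
    dev_null_fd = os.open("/dev/null", os.O_RDWR)
    # New session and process group, no controlling terminal.
    # setsid() fails if the caller is already a process group leader.
    os.setsid()
    redirect_to(dev_null_fd)      # stdin, stdout, stderr
    os.close(dev_null_fd)         # no longer needed after dup2()
```

Because `daemonize` silences the standard streams of the calling process, it is best exercised in a forked child.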
Finally, the jailer switches the uid to 123, and gid to 100, and execs
./firecracker \
--id="551e7604-e35c-42b3-b825-416853441234" \
--start-time-us=<opaque> \
--start-time-cpu-us=<opaque>
Now Firecracker creates the socket at
/srv/jailer/firecracker/551e7604-e35c-42b3-b825-416853441234/root/<api-sock>
to interact with the VM.
Note: default value for <api-sock> is /run/firecracker.socket.
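Since <api-sock> is a path inside the jail, a client running outside the jail has to prefix it with <chroot_dir>. A minimal sketch of that translation, using a hypothetical helper name:

```python
import os


def host_socket_path(chroot_dir: str,
                     api_sock: str = "/run/firecracker.socket") -> str:
    """Map a socket path seen by jailed Firecracker to the host path."""
    # api_sock is absolute inside the jail; drop the leading "/" so that
    # os.path.join does not discard chroot_dir.
    return os.path.join(chroot_dir, api_sock.lstrip("/"))
```

For the example run with the default socket, this gives /srv/jailer/firecracker/551e7604-e35c-42b3-b825-416853441234/root/run/firecracker.socket.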
--cgroup command line argument.
notify_on_release mechanism, while being wary about potential race conditions
(the instance crashing before the subscription process is complete, for
example).
The --new-pid-ns flag enables the Jailer to exec the
binary file in a new PID namespace, in order to become a pseudo-init process.
Alternatively, the user can spawn the jailer in a new PID namespace via a
combination of clone() with the CLONE_NEWPID flag and exec().
The jailer runs as the root user; it actually requires a more restricted
set of capabilities, but that's to be determined as features stabilize.
The logic associated with --daemonize runs towards the end, instead of the very
beginning. We are working on adding better logging capabilities.
The jailer stores the PID of the child process in the jail root directory
inside <exec_file_name>.pid for all cases
regardless of whether --new-pid-ns was provided. The suggested way to fetch
Firecracker's PID when using the Jailer is to read the firecracker.pid file
present in the Jailer's root directory.
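Reading the PID file from a supervising process can look like the sketch below. The helper name is hypothetical, and the code assumes the file holds the PID as plain decimal text, which this document does not spell out.

```python
import os


def read_firecracker_pid(chroot_dir: str,
                         exec_file_name: str = "firecracker") -> int:
    """Read <chroot_dir>/<exec_file_name>.pid and return the PID."""
    pid_file = os.path.join(chroot_dir, f"{exec_file_name}.pid")
    with open(pid_file) as f:
        # Assumes a decimal PID, possibly followed by a newline.
        return int(f.read().strip())
```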