pkg/sentry/seccheck/README.md
This package provides a remote interface to observe behavior of the application running inside the sandbox. It was built with runtime monitoring in mind, e.g. threat detection, but it can be used for other purposes as well. It allows a process running outside the sandbox to receive a stream of trace data asynchronously. This process can watch actions performed by the application, generate alerts when something unexpected occurs, log these actions, etc.
First, let's go over a few concepts before we get into the details.
`container/start` is a point that is fired when a new container starts. `container/start` has an `id` field with the ID of the container that is getting started. If you're interested in exploring further, there are more details in the design doc.
Every trace point in the system is identified by a unique name. The naming convention is to scope the point with a main component followed by its name to avoid conflicts. Here are a few examples:
-   `sentry/signal_delivered`
-   `container/start`
-   `syscall/openat/enter`

Note: the syscall trace point contains an extra level to separate the enter/exit points.
Most of the trace points are in the syscall component. They come in two flavors: raw and schematized. Raw syscall points include all syscalls in the system and contain the 6 arguments for the given syscall. Schematized trace points exist for many syscalls, but not all. They provide fields that are specific to the syscall and fetch more information than is available from the raw syscall arguments. For example, here is the schema for the open syscall:
```proto
message Open {
  gvisor.common.ContextData context_data = 1;
  Exit exit = 2;
  uint64 sysno = 3;
  int64 fd = 4;
  string fd_path = 5;
  string pathname = 6;
  uint32 flags = 7;
  uint32 mode = 8;
}
```
As you can see, some fields are present in both raw and schematized points, like `fd`, which is also `arg1` in the raw syscall, but here it has a name and the correct type. Schematized points also have fields, like `pathname`, that are not available in the raw syscall event. In addition, `fd_path` is an optional field that can take the `fd` and translate it into a full path for convenience. In some cases, the same schema can be shared by many syscalls. In this example, `message Open` is used for the `open(2)`, `openat(2)` and `creat(2)` syscalls. The `sysno` field can be used to distinguish between them. The schema for all syscall trace points can be found here.
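Since `message Open` is shared by several syscalls, a consumer branches on `sysno` to tell them apart. A minimal sketch of that dispatch (the numeric values are the x86_64 Linux syscall numbers; other architectures use different numbers):

```python
# x86_64 syscall numbers for the syscalls that share the Open schema.
# These values are architecture-specific.
OPEN_FAMILY = {2: "open", 257: "openat", 85: "creat"}

def open_variant(sysno: int) -> str:
    """Return which open-family syscall an Open trace point refers to."""
    return OPEN_FAMILY.get(sysno, "unknown")
```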
Other components that exist today are `sentry` and `container`.
The following command lists all trace points available in the system:
```shell
$ runsc trace metadata
POINTS (973)
Name: container/start, optional fields: [env], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
Name: sentry/clone, optional fields: [], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
Name: syscall/accept/enter, optional fields: [fd_path], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
...
```
Note: the output format for `trace metadata` may change without notice.
The list above also includes which optional and context fields are available for each trace point. The optional fields schema is part of the trace point proto, like the `fd_path` field we saw above. Context fields are set in the `context_data` field of all points and are defined in `gvisor.common.ContextData`.
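The context fields listed in the metadata output map onto fields of `gvisor.common.ContextData`. As a rough sketch only — the field names, numbers, and types below are illustrative, not the actual definition, which lives in the gVisor source:

```proto
// Illustrative sketch; consult the real gvisor.common.ContextData
// definition for the authoritative schema.
message ContextData {
  int64 time_ns = 1;                     // time
  int32 thread_id = 2;                   // thread_id
  int64 thread_start_time_ns = 3;        // task_start_time
  int32 thread_group_id = 4;             // group_id
  int64 thread_group_start_time_ns = 5;  // thread_group_start_time
  string container_id = 6;               // container_id
  string cwd = 7;                        // cwd
  string process_name = 8;               // process_name
  // credentials are also available as a nested message.
}
```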
Sinks receive enabled trace points and do something useful with them. They are
identified by a unique name. The same `runsc trace metadata` command used above also lists all sinks:
```shell
$ runsc trace metadata
...
SINKS (2)
Name: remote
Name: null
```
Note: the output format for `trace metadata` may change without notice.
The remote sink serializes the trace point into protobuf and sends it to a separate process. For threat detection, external monitoring processes can receive connections from remote sinks and be sent a stream of trace points that are occurring in the system. This sink connects to a remote process via Unix domain socket and expects the remote process to be listening for new connections. If you're interested in creating a monitoring process that communicates with the remote sink, this document has more details.
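The example server under `examples/seccheck` implements such a monitoring process in C++. The overall shape of a listener can be sketched in Python; this sketch only accepts one connection and returns the raw bytes it received — it does not implement gVisor's actual message framing or protobuf decoding:

```python
import os
import socket

SOCK_PATH = "/tmp/gvisor_events.sock"  # address the remote sink connects to

def serve_once(path: str = SOCK_PATH) -> bytes:
    """Accept one sink connection and return the raw bytes it sent."""
    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket from a previous run
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(path)
        server.listen(1)
        conn, _ = server.accept()  # blocks until a sandbox connects
        with conn:
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:
                    break  # sandbox disconnected
                chunks.append(data)
    os.unlink(path)
    return b"".join(chunks)
```

A real monitoring process would loop over `accept` and parse each framed protobuf message instead of returning raw bytes.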
The remote sink has many properties that can be configured when it's created (more on how to configure sinks below):
-   `endpoint` (mandatory): Unix domain socket address to connect to.
-   `retries`: number of attempts to write the trace point before dropping it in
    case the remote process is not responding. Note that a high number of
    retries can significantly delay application execution.
-   `backoff`: initial backoff time after the first failed attempt. This value
    doubles with every failed attempt, up to the max.
-   `backoff_max`: max duration to wait between retries.

The `null` sink does nothing with the trace points and is used for testing. Syscall tests enable all trace points, with all optional and context fields, to ensure there is no crash with them enabled.
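Putting these properties together, a remote sink entry might look like the following. The endpoint path and all values here are illustrative, and the exact format accepted for the `backoff`/`backoff_max` durations is an assumption to check against the implementation:

```json
{
  "name": "remote",
  "config": {
    "endpoint": "/tmp/gvisor_events.sock",
    "retries": 3,
    "backoff": "100ms",
    "backoff_max": "1s"
  }
}
```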
The strace sink has not been implemented yet. It is meant to replace the strace mechanism that exists in the Sentry, to simplify the code and add more trace points to it. Note: it requires more than one trace session to be supported.
Trace sessions scope a set of trace points with their corresponding
configuration and a set of sinks that receive the points. Sessions can be
created at sandbox initialization time or during runtime. Creating sessions at
init time guarantees that no trace points are missed, which is important for
threat detection. It is configured using the `--pod-init-config` flag (more on it below). To manage sessions during runtime, use the `runsc trace create|delete|list` commands to manipulate trace sessions. Here are a few examples, assuming there is a running container with ID=`cont123` using Docker:
```shell
$ sudo runsc --root /run/docker/runtime-runc/moby trace create --config session.json cont123
$ sudo runsc --root /run/docker/runtime-runc/moby trace list cont123
SESSIONS (1)
"Default"
	Sink: "remote", dropped: 0

$ sudo runsc --root /var/run/docker/runtime-runc/moby trace delete --name Default cont123
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace list cont123
SESSIONS (0)
```
Note: there is a current limitation that only a single session can exist in the system and it must be called `Default`. This restriction can be lifted in the future when more than one session is needed.
The event session can be defined using JSON for the `runsc trace create` command. The session definition has 3 main parts:

1.  `name`: name of the session being created. Only `Default` for now.
2.  `points`: array of points being enabled in the session. Each point has:
    -   `name`: name of the trace point being enabled.
    -   `optional_fields`: array of optional fields to include with the trace point.
    -   `context_fields`: array of context fields to include with the trace point.
3.  `sinks`: array of sinks that will process the trace points. Each sink has:
    -   `name`: name of the sink.
    -   `config`: sink-specific configuration.
    -   `ignore_setup_error`: ignores failure to configure the sink. In the remote sink case, for example, it doesn't fail container startup if the remote process cannot be reached.

The session configuration above can also be used with the `--pod-init-config` flag, under the `"trace_session"` JSON object. There is a full example here.
Note: for convenience, the `--pod-init-config` file can also be used with the `runsc trace create` command. The portions of the Pod init config file that are not related to the session configuration are ignored.
Here, we're going to explore how to use runtime monitoring end to end. Under
the examples directory there is an implementation of the monitoring process
that accepts connections from remote sinks and prints out all the trace points
it receives.
First, let's start the monitoring process and leave it running:
```shell
$ bazel run examples/seccheck:server_cc
Socket address /tmp/gvisor_events.sock
```
The server is now listening on the socket at /tmp/gvisor_events.sock for new
gVisor sandboxes to connect. Now let's create a session configuration file with
some trace points enabled and the remote sink using the socket address from
above:
```shell
$ cat <<EOF >session.json
{
  "trace_session": {
    "name": "Default",
    "points": [
      {
        "name": "sentry/clone"
      },
      {
        "name": "syscall/fork/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/fork/exit",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/execve/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/sysno/35/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/sysno/35/exit"
      }
    ],
    "sinks": [
      {
        "name": "remote",
        "config": {
          "endpoint": "/tmp/gvisor_events.sock"
        }
      }
    ]
  }
}
EOF
```
Now, we're ready to start a container and watch it send traces to the monitoring
process. The container we're going to create simply loops every 5 seconds and
writes something to stdout. While the container is running, we're going to call the `runsc trace` command to create a trace session.
```shell
# Start the container and copy the container ID for future reference.
$ docker run --rm --runtime=runsc -d bash -c "while true; do echo looping; sleep 5; done"
dee0da1eafc6b15abffeed1abc6ca968c6d816252ae334435de6f3871fb05e61

$ CID=dee0da1eafc6b15abffeed1abc6ca968c6d816252ae334435de6f3871fb05e61

# Create new trace session in the container above.
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace create --config session.json ${CID?}
Trace session "Default" created.
```
In the terminal where you are running the monitoring process, you'll start seeing messages like this:
```
Connection accepted
E Fork context_data { thread_group_id: 1 process_name: "bash" } sysno: 57
CloneInfo => created_thread_id: 110 created_thread_group_id: 110 created_thread_start_time_ns: 1660249219204031676
X Fork context_data { thread_group_id: 1 process_name: "bash" } exit { result: 110 } sysno: 57
E Execve context_data { thread_group_id: 110 process_name: "bash" } sysno: 59 pathname: "/bin/sleep" argv: "sleep" argv: "5"
E Syscall context_data { thread_group_id: 110 process_name: "sleep" } sysno: 35 arg1: 139785970818200 arg2: 139785970818200
X Syscall context_data { thread_group_id: 110 process_name: "sleep" } exit { } sysno: 35 arg1: 139785970818200 arg2: 139785970818200
```
The first message in the log is a notification that a new sandbox connected to the monitoring process. The `E` and `X` in front of the syscall traces denote whether the trace belongs to an Enter or eXit syscall event. The first syscall trace shows a call to `fork(2)` from a process with `thread_group_id` (or PID) equal to 1 and process name `bash`. In other words, this is the init process of the container, running `bash`, and calling fork to execute `sleep 5`. The next trace is from `sentry/clone` and informs that the forked process has PID=110. Then, `X Fork` indicates that the `fork(2)` syscall returned to the parent. The child continues and executes `execve(2)` to call `sleep`, as can be seen from the `pathname` and `argv` fields. Note that at this moment, the PID is 110 (the child) but the process name is still `bash`, because it hasn't executed `sleep` yet. After `execve(2)` is called, the process name changes to `sleep` as expected. Next, the log shows the `nanosleep(2)` raw syscall, which has `sysno=35` (referred to as `syscall/sysno/35` in the configuration file): one trace for enter, with the exit trace happening 5 seconds later.
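The enter/exit structure of these lines can be extracted mechanically. Here is a hypothetical parser for the example output above — hypothetical because, as noted earlier, this text format may change without notice:

```python
import re

# Matches lines like:
#   E Syscall context_data { thread_group_id: 110 ... } sysno: 35 ...
LINE_RE = re.compile(
    r"^(?P<dir>[EX]) (?P<point>\w+) .*?"
    r"thread_group_id: (?P<tgid>\d+).*?sysno: (?P<sysno>\d+)"
)

def parse_trace_line(line: str):
    """Extract direction, trace point name, PID, and sysno from a log line."""
    m = LINE_RE.match(line)
    if not m:
        return None  # e.g. "Connection accepted" or CloneInfo payload lines
    return {
        "direction": "enter" if m.group("dir") == "E" else "exit",
        "point": m.group("point"),
        "thread_group_id": int(m.group("tgid")),
        "sysno": int(m.group("sysno")),
    }
```

Pairing each `enter` with the next `exit` for the same `thread_group_id` and `sysno` would, for example, recover the 5-second `nanosleep(2)` duration.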
Let's list all trace sessions that are active in the sandbox:
```shell
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace list ${CID?}
SESSIONS (1)
"Default"
	Sink: "remote", dropped: 0
```
It shows the `Default` session created above, using the `remote` sink, with no trace points dropped. Once we're done, the trace session can be deleted with the command below:
```shell
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace delete --name Default ${CID?}
Trace session "Default" deleted.
```
In the monitoring process, you should see a `Connection closed` message informing that the sandbox has disconnected.
If you want to set up `runsc` to connect to the monitoring process automatically before the application starts running, you can set the `--pod-init-config` flag to the configuration file created above. Here's an example:
```shell
$ sudo runsc --install --runtime=runsc-trace -- --pod-init-config=$PWD/session.json
```