pkg/sentry/vfs/g3doc/inotify.md
Inotify is a mechanism for monitoring filesystem events in Linux (see inotify(7)). An inotify instance can be used to monitor files and directories for modifications, creation/deletion, etc. The inotify API consists of system calls that create inotify instances (inotify_init/inotify_init1) and add/remove watches on files to an instance (inotify_add_watch/inotify_rm_watch). Events are generated from various places in the sentry, including the syscall layer, the vfs layer, the process fd table, and within each filesystem implementation. This document outlines the implementation details of inotify.
Inotify data structures are implemented in the vfs package.
Inotify instances are represented by vfs.Inotify objects, which implement vfs.FileDescriptionImpl. As in Linux, inotify fds are backed by a pseudo-filesystem (anonfs). Each inotify instance receives events from a set of vfs.Watch objects, which can be modified with inotify_add_watch(2) and inotify_rm_watch(2). An application can retrieve events by reading the inotify fd.
The set of all watches held on a single file (i.e., the watch target) is stored in vfs.Watches. Each watch will belong to a different inotify instance (an instance can only have one watch on any watch target). The watches are stored in a map indexed by their vfs.Inotify owner’s id. Hard links and file descriptions to a single file will all share the same vfs.Watches (with the exception of the gofer filesystem, described in a later section). Activity on the target causes its vfs.Watches to generate notifications on its watches’ inotify instances.
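The relationship between a watch target and its watches can be sketched roughly as follows. This is a minimal illustration with simplified, hypothetical types (`Inotify`, `Watch`, `Watches`, `Add`, and `Len` here are stand-ins, not the actual vfs definitions); it only shows why keying the map by owner id limits each instance to one watch per target. Overwriting the mask of an existing watch is a simplification chosen for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// Inotify is a simplified stand-in for an inotify instance; only the id
// that identifies the owner is modeled here.
type Inotify struct {
	id uint64
}

// Watch ties one owner to one target with an event mask.
type Watch struct {
	owner *Inotify
	mask  uint32
}

// Watches is the set of watches on a single target, keyed by owner id.
type Watches struct {
	mu sync.RWMutex
	ws map[uint64]*Watch
}

// Add registers a watch for owner on this target. Because the map is keyed
// by owner id, a second call for the same owner updates the existing watch
// instead of creating another one. It reports whether a new watch was made.
func (w *Watches) Add(owner *Inotify, mask uint32) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.ws == nil {
		w.ws = make(map[uint64]*Watch)
	}
	if existing, ok := w.ws[owner.id]; ok {
		existing.mask = mask // simplification: overwrite the old mask
		return false
	}
	w.ws[owner.id] = &Watch{owner: owner, mask: mask}
	return true
}

// Len returns the number of watches currently held on the target.
func (w *Watches) Len() int {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return len(w.ws)
}

func main() {
	a, b := &Inotify{id: 1}, &Inotify{id: 2}
	var target Watches
	target.Add(a, 0x2)
	target.Add(b, 0x2)
	target.Add(a, 0x100) // same owner: no new watch is created
	fmt.Println(target.Len())
}
```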
A vfs.Watch is a single watch, owned by one inotify instance and applied to one watch target. Both the vfs.Inotify owner and vfs.Watches on the target will hold a vfs.Watch, which leads to some complicated locking behavior (see Lock Ordering). Whenever a watch is notified of an event on its target, it will queue events to its inotify instance for delivery to the user.
vfs.Event is a simple struct encapsulating all the fields for an inotify event. It is generated by vfs.Watches and forwarded to the watches' owners. It is serialized to the user during read(2) syscalls on the associated vfs.Inotify's fd.
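As an illustration of what serializing an event for read(2) involves, the sketch below lays out a record in the shape of Linux's struct inotify_event (wd, mask, cookie, len, followed by a NUL-padded name). The `Event` type and `Marshal` method are hypothetical simplifications, not the vfs code. Little-endian byte order and rounding the name up to a multiple of the header size are assumptions of this sketch.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// headerSize is the size of the fixed part of the record:
// wd (4 bytes) + mask (4) + cookie (4) + len (4).
const headerSize = 16

// Event is a simplified, hypothetical stand-in for an inotify event.
type Event struct {
	Wd     int32
	Mask   uint32
	Cookie uint32
	Name   string
}

// Marshal encodes the event as a fixed header followed by the name, which
// is NUL-terminated and zero-padded. Byte order and padding granularity
// are assumptions made for this sketch.
func (e Event) Marshal() []byte {
	nameLen := 0
	if e.Name != "" {
		// Name plus at least one NUL byte, rounded up to headerSize.
		nameLen = (len(e.Name) + 1 + headerSize - 1) / headerSize * headerSize
	}
	buf := make([]byte, headerSize+nameLen)
	binary.LittleEndian.PutUint32(buf[0:], uint32(e.Wd))
	binary.LittleEndian.PutUint32(buf[4:], e.Mask)
	binary.LittleEndian.PutUint32(buf[8:], e.Cookie)
	binary.LittleEndian.PutUint32(buf[12:], uint32(nameLen))
	copy(buf[headerSize:], e.Name) // trailing bytes remain zero
	return buf
}

func main() {
	b := Event{Wd: 1, Mask: 0x100, Name: "bar"}.Marshal()
	fmt.Println(len(b), b[12])
}
```

A reader of the inotify fd would use the len field to find where the next record starts, which is why the padded length, not the string length, is what gets written.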
There are three locks related to the inotify implementation:
- Inotify.mu: the inotify instance lock.
- Inotify.evMu: the inotify event queue lock.
- Watches.mu: the watch set lock, used to protect the collection of watches on a target.
The correct lock ordering for inotify code is:
Inotify.mu -> Watches.mu -> Inotify.evMu.
Note that we use a distinct lock to protect the inotify event queue. If we simply used Inotify.mu, we could simultaneously have locks being acquired in the order of Inotify.mu -> Watches.mu and Watches.mu -> Inotify.mu, which would cause deadlocks. For instance, adding a watch to an inotify instance would require locking Inotify.mu, and then adding the same watch to the target would cause Watches.mu to be held. At the same time, generating an event on the target would require Watches.mu to be held before iterating through each watch, and then notifying the owner of each watch would cause Inotify.mu to be held.
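The two code paths described above can be sketched as follows. This is an illustrative model only: the types and methods (`AddWatch`, `Notify`, `queueEvent`, `Pending`, `runConcurrently`) are hypothetical simplifications, and only the lock names mirror the ones in this document. The point is that the add-watch path takes Inotify.mu then Watches.mu, while the notification path takes Watches.mu then Inotify.evMu, so neither path ever waits on a lock that the other acquires earlier in its own sequence.

```go
package main

import (
	"fmt"
	"sync"
)

// Inotify models an instance with separate locks for its watch set and
// its event queue.
type Inotify struct {
	mu      sync.Mutex // instance lock
	targets map[*Watches]struct{}

	evMu   sync.Mutex // event queue lock
	events []uint32
}

// Watches models the watch set on one target.
type Watches struct {
	mu     sync.RWMutex // watch set lock
	owners map[*Inotify]struct{}
}

// AddWatch follows the order Inotify.mu -> Watches.mu.
func (i *Inotify) AddWatch(t *Watches) {
	i.mu.Lock()
	defer i.mu.Unlock()
	t.mu.Lock()
	defer t.mu.Unlock()
	if i.targets == nil {
		i.targets = make(map[*Watches]struct{})
	}
	if t.owners == nil {
		t.owners = make(map[*Inotify]struct{})
	}
	i.targets[t] = struct{}{}
	t.owners[i] = struct{}{}
}

// Notify follows the order Watches.mu -> Inotify.evMu. It never touches
// Inotify.mu, so it cannot invert the order used by AddWatch.
func (t *Watches) Notify(ev uint32) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	for owner := range t.owners {
		owner.queueEvent(ev)
	}
}

// queueEvent appends to the event queue under evMu only.
func (i *Inotify) queueEvent(ev uint32) {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	i.events = append(i.events, ev)
}

// Pending returns the number of queued events.
func (i *Inotify) Pending() int {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	return len(i.events)
}

// runConcurrently races n AddWatch calls against n Notify calls on the same
// instance and target, and returns the number of events queued.
func runConcurrently(n int) int {
	ino, target := &Inotify{}, &Watches{}
	ino.AddWatch(target) // ensure the watch exists before any Notify
	var wg sync.WaitGroup
	for k := 0; k < n; k++ {
		wg.Add(2)
		go func() { defer wg.Done(); ino.AddWatch(target) }()
		go func() { defer wg.Done(); target.Notify(0x2) }()
	}
	wg.Wait()
	return ino.Pending()
}

func main() {
	fmt.Println(runConcurrently(100))
}
```

If queueEvent took Inotify.mu instead of evMu, the two goroutine bodies in runConcurrently would acquire the same pair of locks in opposite orders, which is the deadlock scenario described above.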
See the vfs package comment to understand how inotify locks fit into the overall ordering of filesystem locks.
In Linux, watches reside on inodes at the virtual filesystem layer. As a result, all hard links and file descriptions on a single file share the same watch set. In the sentry, there is no common inode structure across filesystem types (some may not even have inodes), so we have to plumb inotify support through each specific filesystem implementation. Some of the technical considerations are outlined below.
For filesystems with inodes, like tmpfs, the design is quite similar to that of Linux, where watches reside on the inode.
Technically, because inotify is implemented at the vfs layer in Linux, pseudo-filesystems on top of kernfs support inotify passively. However, watches can only track explicit filesystem operations like read/write, open/close, mknod, etc., so watches on a target like /proc/self/fd will not generate events every time a new fd is added or removed. As of this writing, we leave inotify unimplemented in kernfs and anonfs; it does not seem particularly useful.
The gofer filesystem has several traits that make it difficult to support inotify:
For events that must be generated above the vfs layer, we provide the following DentryImpl methods to allow interactions with targets on any FilesystemImpl:
There are several options that can be set for a watch, specified as part of the mask in inotify_add_watch(2). In particular, IN_EXCL_UNLINK requires some additional support in each filesystem.
A watch with IN_EXCL_UNLINK will not generate events for its target if it corresponds to a path that was unlinked. For instance, if an fd is opened on “foo/bar” and “foo/bar” is subsequently unlinked, any reads/writes/etc. on the fd will be ignored by watches on “foo” or “foo/bar” with IN_EXCL_UNLINK. This requires each DentryImpl to keep track of whether it has been unlinked, in order to determine whether events should be sent to watches with IN_EXCL_UNLINK.
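A minimal sketch of the filtering decision is shown below. The `watch` and `dentry` types and the `shouldNotify` helper are hypothetical and exist only for illustration; the constant values are those of IN_MODIFY and IN_EXCL_UNLINK in Linux. The sketch assumes the filesystem records an unlinked flag on the dentry through which the event was generated.

```go
package main

import "fmt"

const (
	inModify     uint32 = 0x00000002 // IN_MODIFY
	inExclUnlink uint32 = 0x04000000 // IN_EXCL_UNLINK
)

// watch is a hypothetical stand-in holding only the watch mask.
type watch struct{ mask uint32 }

// dentry is a hypothetical stand-in that tracks whether its path has been
// unlinked, as each DentryImpl would need to.
type dentry struct{ unlinked bool }

// shouldNotify reports whether events generated through d should be
// delivered to w. Events via an unlinked path are suppressed for watches
// that set IN_EXCL_UNLINK; otherwise the usual mask match applies.
func shouldNotify(w watch, d *dentry, events uint32) bool {
	if d.unlinked && w.mask&inExclUnlink != 0 {
		return false
	}
	return w.mask&events != 0
}

func main() {
	d := &dentry{}
	w := watch{mask: inModify | inExclUnlink}
	fmt.Println(shouldNotify(w, d, inModify)) // path still linked
	d.unlinked = true
	fmt.Println(shouldNotify(w, d, inModify)) // path unlinked
}
```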
One-shot watches expire after generating a single event. When an event occurs, all one-shot watches on the target that successfully generated an event are removed. Lock ordering can cause the management of one-shot watches to be quite expensive; see Watches.Notify() for more information.
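The expiry behavior can be sketched as below, using hypothetical simplified types. This sketch only removes expired watches from the target's set under a single lock. It deliberately leaves out the harder part mentioned above, namely removing the same watch from its owning inotify instance while respecting the lock ordering, so it should not be read as how Watches.Notify() is implemented.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	inModify  uint32 = 0x00000002 // IN_MODIFY
	inOneshot uint32 = 0x80000000 // IN_ONESHOT
)

// Watch is a hypothetical stand-in that counts delivered events.
type Watch struct {
	mask      uint32
	delivered int
}

// Watches is a simplified watch set keyed by owner id.
type Watches struct {
	mu sync.Mutex
	ws map[uint64]*Watch
}

// Notify delivers events to every matching watch, then removes the one-shot
// watches that actually generated an event. It returns the number of
// watches that were notified.
func (w *Watches) Notify(events uint32) int {
	w.mu.Lock()
	defer w.mu.Unlock()
	notified := 0
	var expired []uint64
	for id, watch := range w.ws {
		if watch.mask&events == 0 {
			continue // not interested: a one-shot watch stays armed
		}
		watch.delivered++
		notified++
		if watch.mask&inOneshot != 0 {
			expired = append(expired, id)
		}
	}
	for _, id := range expired {
		delete(w.ws, id)
	}
	return notified
}

// Len returns the number of watches remaining on the target.
func (w *Watches) Len() int {
	w.mu.Lock()
	defer w.mu.Unlock()
	return len(w.ws)
}

func main() {
	target := &Watches{ws: map[uint64]*Watch{
		1: {mask: inModify | inOneshot},
		2: {mask: inModify},
	}}
	fmt.Println(target.Notify(inModify), target.Len())
	fmt.Println(target.Notify(inModify), target.Len())
}
```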