eden/fs/docs/InodeLifetime.md
Inodes are managed via InodePtr objects. InodePtr is a smart-pointer class
that maintains a reference count on the underlying InodeBase object, similar
to std::shared_ptr.
However, unlike std::shared_ptr, inodes are not necessarily deleted
immediately when their reference count drops to zero. Instead, they may remain
in memory for a while in case they are used again soon.
InodeMap holds a reference to the root inode. This ensures that the root
inode remains in existence for as long as the EdenMount exists.
Each inode holds a reference to its parent TreeInode. This ensures that if
an inode exists, all of its parents all the way to the mount point root also
exist.
For all other call sites, callers obtain a reference to an inode when they
look it up. The lookup functions return InodePtr objects that the call site
should retain for as long as they need access to the inode.
InodeMap does not hold a reference to the inodes it contains. Otherwise, it
would never be possible to unload or destroy any inodes. Instead, the
InodeMap holds raw pointers to inode objects. When Inode objects are
unloaded they are always explicitly removed from the InodeMap's list of
loaded inodes.
A TreeInode does not hold a reference to any of its children. Otherwise,
this would cause circular reference, since each child holds a reference to its
parent TreeInode. The TreeInode is always explicitly informed when one of
its children inodes is unloaded, so it can remove the raw pointer to the child
from its child entries map.
Inodes may be looked up in one of two ways, either by name or by inode number.
TreeInode::getOrLoadChild() is the API for doing inode lookups by name, and
InodeMap::lookupinode() is the API for doing inode lookups by inode number.
Either of these two APIs may have to create the inode object. Alternatively, if
the specified inode already exists, they will increment the reference count to
the existing object and return it. It is possible the inode is already present
in the InodeMap, but was previously unreferenced, so these APIs may increment
the reference count from 0 to 1.
The InodeMap class keeps track of all currently loaded inodes as well as
information about inodes that have inode numbers allocated but are not loaded.
For each unloaded inode, InodeMap records if it is currently being loaded.
This allows InodeMap to avoid starting two load attempts for the same inode.
If a second lookup attempt occurs for an inode already being loaded, InodeMap
handles notifying both waiting callers when the single load attempt completes.
Inode unloading can be triggered by several events:
When the inode reference count drops to zero, we have a chance to decide if we want to unload the inode or not.
When shutting down the mount point, we always destroy each inode as soon as its reference count goes to zero.
If the inode is unlinked and its FUSE reference count is also zero, we also destroy the inode immediately.
In other cases we generally leave the inode object loaded, but it would be valid to decide to unload it based on other criteria (for instance, we could decide to immediately unload unreferenced inodes if we are low on memory).
When the FUSE reference count goes to zero, we should destroy the inode immediately if it is unlinked and its pointer reference count is also zero.
To simplify synchronization, we currently collapse this case into the one above:
we only decrement the FUSE reference count on a loaded inode when we are holding
a normal InodePtr reference to the inode. Therefore, we will always see the
normal reference count drop to zero at some point after the FUSE reference count
drops to zero, and we process the unload at that time.
We will likely add a periodic background task to unload unreferenced inodes that have not been accessed in some time. This unload operation could also be triggered in response to other events (for instance, a thrift call, or going over some memory usage limit).
Synchronization of inode loading and unloading is slightly tricky, particularly for unloading.
When loading an inode, we always hold the InodeMap lock to check if the inode
in question is already loaded or if a load is in progress. Once the inode is
loaded, we acquire its parent TreeInode's contents_ lock, then the
InodeMap lock (in that order), so we can insert the inode into its parent's
entry list and into the InodeMap's list of loaded inodes.
InodePtr itself does not hold any extra locks when performing reference count
updates. The main inode reference count is updated with atomic operations, but
without any other locks held.
However, there is one important item to note here: updates done via InodePtr
copying can never increment the reference count from 0 to 1. The lookup APIs
(TreeInode::getOrLoadChild() and InodeMap::lookupInode()) are the only two
places that can ever increment the reference count from 0 to 1. Both of these
lookup APIs hold a lock when potentially updating the reference count from 0
to 1.
TreeInode::getOrLoadChild() holds the parent TreeInode's contents_ lock,
and InodeMap::lookupInode() holds the InodeMap lock. This means that if you
hold both of these locks and you see that an inode's reference count is
currently 0, no other thread can acquire a reference count to that inode.
Holding the parent TreeInode's contents_ lock and the InodeMap lock
ensures that no other thread can acquire a new reference on an inode, but that
alone does not mean it is safe to destroy the inode. We still need to prevent
multiple threads from both trying to destroy an inode.
For instance, consider if thread A destroys the last InodePtr to an inode,
dropping its reference count to 0. However, before thread A has a chance to grab
the TreeInode and InodeMap locks and decide if it wants to unload the inode,
thread B looks up the inode, increasing the reference count from 0 to 1, but
then immediately destroys its InodePtr, dropping the reference count back
to 0.
In this situation thread A and thread B have both just dropped the reference count to 0. We need to make sure that only one of these two threads can try to destroy the inode.
This is achieved through another counter, called the "acquire" counter. This counter is incremented each time the inode reference count goes from 0 to 1, and decremented each time the reference count goes from 1 to 0. However, unlike the main reference count, the acquire counter is only modified while holding some additional locks.
Increments to the acquire counter are only done while holding either the parent
TreeInode's contents_ lock (in the case of TreeInode::getOrLoadChild()) or
the InodeMap lock (in the case of InodeMap::lookupInode()).
Decrements to the acquire counter are only done while holding both the parent
TreeInode's contents_ lock and the InodeMap lock.
When thread A and thread B both see that the main reference count drops to 0,
they both attempt to acquire both the TreeInode and InodeMap locks.
Whichever thread acquires the locks first will see that the acquire count is
non-zero (since both threads incremented it when bumping the main reference
count from 0 to 1). This thread decrements the acquire count and does nothing
else since the acquire count is non zero. The second thread can then acquire the
locks, decrement the acquire count and see that it is now zero. This second
thread can then perform the unload (while still holding both locks).
All inode objects store a pointer to the EdenMount that they are a part of.
This means that the EdenMount itself cannot be destroyed until all of its
inodes are destroyed.
We achieve this via the root TreeInode's reference count. During normal
operation, the EdenMount holds a reference to the root TreeInode
(technically the InodeMap holds the reference, but the EdenMount owns the
InodeMap). When the EdenMount needs to be destroyed, we release the
reference count on the root inode. When the root inode becomes unreferenced we
know that all of its children have been destroyed, and it is now safe to destroy
the EdenMount object itself.
All of this is triggered through the EdenMount::destroy() function. This
function marks the mount as shutting down, which causes the InodeMap to
immediately unload any inodes that become newly unreferenced. We then trigger an
immediate unload scan to unload any inodes that were already unreferenced. Once
this is done, we release the InodeMap's reference count on the root inode,
allowing it to become unreferenced once all of its children are destroyed.
In addition to the reference count tracking how many InodePtr objects are
currently referring to an inode, InodeBase also keeps track of how many
outstanding references to this inode exist in the FUSE layer (this is the number
of lookup()/create()/mkdir()/symlink()/link() calls made for this
inode, minus the number of times it was forgotten via forget()).
However, the FUSE reference count is not directly related to the inode object lifetime.
Inode objects may be unloaded even when the FUSE reference count is non-zero. In
this case, the InodeMap retains enough information needed to re-create the
Inode object if the inode number is later looked up again by the FUSE API.
The FUSE reference count is only adjusted while holding a normal InodePtr reference to the inode.