docs/_docs/30-fasterkv-record-locking.md
There are two levels of locking in FASTER:
- **Ephemeral**: FASTER takes these locks internally, for the duration of a single operation on a record.
- **Manual**: The user locks keys explicitly via `LockableContext` or `LockableUnsafeContext` (hereafter referred to collectively as `Lockable*Context`).

All locks are obtained by spinning on `Interlocked.CompareExchange` and `Thread.Yield()`. Ephemeral locks have a limited spin count, to avoid deadlocks; if an operation fails to acquire the desired lock within this limit, the operation retries.
Manual locks have a longer duration, and enter the LockTable when the readcache or hybrid log experiences memory pressure and must evict pages that contain locked records. FASTER registers a LockEvictionObserver that runs in this situation.
As noted above, manual locking is done by obtaining the `Lockable*Context` instance from a `ClientSession`. There are currently 4 `*Context` implementations; all are structs, for inlining. All `*Context` instances are obtained as properties on the `ClientSession`, named for the type (e.g. `clientSession.LockableContext`). The characteristics of each `*Context` are:
- `BasicContext`: Exactly the same as `ClientSession`, internally calling directly through to `ClientSession`'s methods and reusing `ClientSession`'s `FasterSession`. It provides safe epoch management (acquiring and releasing the epoch on each call) and ephemeral locking.
- `UnsafeContext : IUnsafeContext`: Provides ephemeral locking but, rather than safe epoch management, supports "unsafe" manual epoch management from the client via `BeginUnsafe()` and `EndUnsafe()`; it is the client's responsibility to make these calls correctly. `UnsafeContext` API methods call the internal `ContextRead` etc. methods without doing the Resume and Suspend (within try/finally) of epoch protection that is done by the "safe" API methods.
- `LockableContext : ILockableContext`: Provides safe epoch management but, rather than ephemeral locking, allows long-lived locks via `BeginLockable` and `EndLockable`. It is important that all locks are acquired before any methods accessing those keys are called.
- `LockableUnsafeContext : ILockableContext, IUnsafeContext`: Combines manual epoch management and manual locking, exposing both sets of calls.

See Ephemeral Locking Conceptual Flow for additional information on ephemeral locking.
Here are some Manual-locking use cases:
It is important not to mix operations between the different context types:
- If you call `BeginUnsafe()` on one of the `*UnsafeContext`s and then make a call on a `BasicContext` (or `ClientSession`) or `LockableContext`, the latter will try to acquire the epoch, which is held by the `*UnsafeContext`.
- If you lock a key with one of the `Lockable*Context`s and then make an update call with a non-Lockable context on the same key, the latter will try to acquire the exclusive lock.

All manual locking of keys must lock the keys in a deterministic order, and unlock in the reverse order, to avoid deadlocks.
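A lock-ordering discipline can be as simple as sorting the keys before acquiring locks. The following is an illustrative sketch (assuming a `Lockable*Context` named `luContext` over `long` keys, as in the examples in this document):

```cs
// Hedged sketch: acquire locks in a deterministic (sorted) key order and
// release them in reverse order, so two sessions locking overlapping key
// sets cannot deadlock each other.
var keys = new[] { 75L, 24L, 51L };
Array.Sort(keys);                                // deterministic order

foreach (var key in keys)
    luContext.Lock(key, LockType.Exclusive);

try
{
    // ... operate on the locked keys ...
}
finally
{
    for (int i = keys.Length - 1; i >= 0; --i)   // reverse order
        luContext.Unlock(keys[i], LockType.Exclusive);
}
```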
Manual locking introduces the possibility of deadlocks even with ephemeral locking. In the following examples, `LUC1` is a `Lockable*Context` and `S1` is a standard `ClientSession` (which does ephemeral locking):
LockTable-based
Because of this, ephemeral locking limits the number of spins for which it attempts the lock; after this, it will return failure and the caller will RETRY_LATER (which refreshes the epoch, allowing other operations such as the OnPagesClosed() mentioned above to complete).
Ephemeral locks are never held across pending I/O operations. All the data operations' low-level implementors (InternalRead, InternalUpsert, InternalRMW, and InternalDelete--collectively known as InternalXxx) release the lock when the call is exited; if the operations must be retried, the locks are reacquired as part of the normal operation there. This prevents potential unnecessary long-held locks while a Flush() completes, for example.
Here are examples of the above two use cases, taken from the unit tests in LockableUnsafeContextTests.cs:
Lock one key for read:
```cs
using (var luContext = session.GetLockableUnsafeContext())
{
    luContext.BeginUnsafe();
    luContext.BeginLockable();

    luContext.Lock(51, LockType.Shared);

    luContext.Read(24, out var value24);
    luContext.Read(51, out var value51);
    luContext.Upsert(75, value24 + value51);

    luContext.Unlock(51, LockType.Shared);

    luContext.EndLockable();
    luContext.EndUnsafe();
}
```
Lock multiple keys, one for update and two for read:
```cs
using (var luContext = session.GetLockableUnsafeContext())
{
    luContext.BeginUnsafe();
    luContext.BeginLockable();

    luContext.Lock(24, LockType.Shared);
    luContext.Lock(51, LockType.Shared);
    luContext.Lock(75, LockType.Exclusive);

    luContext.Read(24, out var value24);
    luContext.Read(51, out var value51);
    luContext.Upsert(75, value24 + value51);

    luContext.Unlock(24, LockType.Shared);
    luContext.Unlock(51, LockType.Shared);
    luContext.Unlock(75, LockType.Exclusive);

    luContext.EndLockable();
    luContext.EndUnsafe();
}
```
This section covers the internal design and implementation of manual locking. Although Sealing or Invalidating a record is not strictly a lock, these are still part of this document because they are closely intertwined with Record Transfers and other concurrency considerations.
For locking we use the following terminology for records:
The following sections refer to the following bits in the RecordInfo:
- **Lock bits**: `RecordInfo`s in the ReadOnly region may be locked and unlocked directly.
- **Sealed**: A record is Sealed via `RecordInfo.Seal`. This is only done when the `RecordInfo` is already exclusively locked, so `Seal()` does a simple bitwise operation. However, `Lock` checks the Sealed bit and fails if it is found; a Sealed record should never be operated on.
- The only time we `Unseal()` a record is when we are unable to complete the operation; the thread `Unseal()`s the record and retries (again, this is done while the record is XLocked).
- **Invalid**: Hash-chain traversal skips Invalid records, following `.PreviousAddress` to move along the chain. This has relevance to two areas of Record Transfers:
- In the ReadCache we do not Seal records; the semantics of Seal are that the operation is restarted when a Sealed record is found. For non-readcache records, this causes execution to restart at the tail of the main-log records. However, readcache records are at the beginning of the hash chain, before the main-log records; restarting would start again at the hash bucket, traverse the readcache records, and hit the Sealed record again, ad infinitum. Thus, we instead set readcache records to Invalid when they are no longer correct; this allows the traversal to continue until the readcache chain links to the first main-log record.
- A main-log record may be marked Invalid when another record for its key (in the readcache or LockTable) has been "made permanent" in a race that the newly-inserted main-log record loses. In that case, the main-log record is already in the hash chain, and must be marked Invalid. This behavior is new with this version of locking; previously, Invalid records were never found in the main-log portion of the hash chain (they existed only as a result of a failed CAS, which may still happen, so they were never in the hash chain).

This section provides the basic calls that implement ephemeral locking. See Ephemeral Locking Conceptual Flow to see how they're used.
IFunctions has been modified:
- The locking flag has been moved from `IFunctions` to a `FasterKV` constructor argument and renamed `DisableEphemeralLocking`. Its value must be uniform across all sessions. It controls only the ephemeral locking done by FasterKV; this replaces the concept of user-controlled locking that was provided within the `IFunctions` methods for concurrent record access.
- The locking methods on `IFunctions` have been removed; locking is now done internally only, using the `RecordInfo` bits, and controlled by `DisableEphemeralLocking`.
- `IFasterSession` has added methods that wrap ephemeral locking, based upon whether the session context does manual locking:
  - A query for whether the session does its own locking: for `Lockable*` contexts, this returns true; all locking in those contexts is explicitly controlled by `Lock()` and `Unlock()` calls. For other contexts, this returns whether the FASTER-level `DisableEphemeralLocking` option is set.
  - Ephemeral exclusive-lock acquisition, done only if `!DisableEphemeralLocking`: fails after some number of spins if the record is Sealed, Tentative, or Invalid. For `Lockable*Context`, this asserts that the record is already exclusively locked.
  - Ephemeral shared-lock acquisition, done only if `!DisableEphemeralLocking`: fails after some number of spins if the record is Sealed, Tentative, or Invalid. For `Lockable*Context`, this asserts that the record already has at least one shared lock.
  - Ephemeral exclusive-lock release, done only if `!DisableEphemeralLocking`: does not fail; an exclusively locked record cannot be modified or moved.
  - Ephemeral shared-lock release, done only if `!DisableEphemeralLocking`: fails if the record is Sealed or Invalid (since it is locked, it cannot be Tentative).

These methods are called in InternalXxx, which does all locking outside the `IFasterSession` calls. Previous `IFasterSession` private methods such as `ConcurrentWriterLock` and `ConcurrentWriterNoLock` are no longer needed; the locking is controlled outside the `IFasterSession` calls.
Ephemeral locks generally do not participate in record eviction, either from the main log or from the readcache, because InternalXxx releases those locks before going pending. However, because InternalXxx calls `BlockAllocate()` to create a new record, and `BlockAllocate()` can result in `BumpCurrentEpoch` being called, a locked in-memory record may go below `HeadAddress` during the operation; in that case, the `LockEvictionObserver` transfers its locks to the LockTable.
Manual locking is done via InternalLock(), called by the Lock() and Unlock() methods of LockableContext and LockableUnsafeContext. We do not expose promoting SLocks to XLocks in Lockable*Context because that could lead to deadlock; the caller must lock it the right way the first time.
Other record operations that must consider locks are InternalUpsert, InternalRead and InternalCompletePendingRead, InternalRMW and InternalCompletePendingRMW, InternalDelete, and Compact. This is described in Ephemeral Locking Conceptual Flow.
InternalLock() does not issue PENDING operations to retrieve on-disk data, and locking/unlocking is designed to avoid pending I/O operations by use of a LockTable consisting of {TKey, RecordInfo} pairs, where TKey is the FasterKV Key type and RecordInfo is used to perform the locking/unlocking. If a record to be locked is not found in memory above hlog.HeadAddress, then a record is created in the LockTable; if this record is subsequently read from the disk (via InternalContinuePending(Read|RMW)), the locks from the LockTable are applied (and the LockTable entry is removed).
Normally, we can assume that hlog.HeadAddress and readcache.HeadAddress will not change for the duration of an InternalXxx call, including InternalContinuePendingRead and InternalContinuePendingRMW. However, there are two calls in particular that can result in an epoch refresh:
- `BlockAllocate` and `BlockAllocateReadCache`: if these have to allocate a new page, they may refresh the epoch.
- The `SpinWaitUntil*()` methods, as described in the next subsection.

When a record has gone below the `HeadAddress` of its specified log (an `AllocatorBase` subclass), it will soon be evicted. Eviction is usually a result of `BlockAllocate` or `BlockAllocateReadCache` having to allocate a new page that would exceed the in-memory size limit of the log. Eviction does the following:

- `ShiftHeadAddress()` is called to `MonotonicUpdate(ref HeadAddress, newHeadAddress, ...)`. If successful, it calls `BumpCurrentEpoch(() => OnPagesClosed(newHeadAddress))`.
  - `BumpCurrentEpoch` may drain the epoch list immediately, which means that the `BlockAllocate` call may run `OnPagesClosed()`.
- `OnPagesClosed()` adjusts addresses and performs eviction:
  - It sets `SafeHeadAddress` to `newHeadAddress`, which is the highest address that will be closed.
  - It closes regions from `oldHeadAddress` to `newHeadAddress`, incrementing `ClosedUntilAddress` as each region is closed.
  - It calls `ReadCacheEvict` and any main-log observer; in our case, we register a `LockEvictionObserver`, which transfers locks from evicted records to the LockTable.

As InternalXxx (including pending) methods proceed, they check whether a record being operated on goes below its log's `HeadAddress`. If so, they must call one of the following functions, which spin-wait until the record has been "closed" by going below `hlog.ClosedUntilAddress` or, for a readcache record, `readcache.ClosedUntilAddress`:

- `SpinWaitUntilAddressIsClosed`: takes an address and spins until the address is below `ClosedUntilAddress`.
- `SpinWaitUntilRecordIsClosed`: takes an address and a `ref key`, and spins until the address is below `ClosedUntilAddress` or until the key is found in the LockTable. This optimization reduces time spent waiting if there are earlier, as-yet-unclosed ranges on the page.

The caller tests that the record is below `HeadAddress` rather than `SafeHeadAddress` because `SafeHeadAddress` is modified within `OnPagesClosed`; `HeadAddress` is set before `OnPagesClosed` is called.
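Conceptually, `SpinWaitUntilRecordIsClosed` is a bounded wait on `ClosedUntilAddress` with the LockTable early-out. The following is a simplified sketch (omitting epoch refresh and the readcache variant, and using member names from this document):

```cs
// Simplified sketch: wait for an evicted record's address to be closed.
// Returns early if the key has already appeared in the LockTable, which
// means the eviction observer has transferred its locks there.
void SpinWaitUntilRecordIsClosed(ref Key key, long logicalAddress)
{
    while (logicalAddress >= hlog.ClosedUntilAddress)
    {
        if (lockTable.ContainsKey(ref key))
            return;                // locks already transferred; done
        Thread.Yield();            // let OnPagesClosed() make progress
    }
}
```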
The actual semantics of "taking a lock", and to a lesser extent "releasing a lock", include more than simply the lock bits; as described in Relevant RecordInfo Bits, the Sealed and Invalid bits are significant. It is important that these be set and tested atomically. In particular:
- When a thread encounters a Sealed record, it returns `RETRY_LATER`, which does an epoch refresh before retrying the operation. Records are Sealed when they have been superseded by another record for that key; the new record is appended at the tail of the log (or at the readcache splice point).
- When we acquire a lock, we test the Sealed or Invalid condition during the CAS loop; if either is found, the loop terminates without attempting to acquire the lock. Once a record is marked Sealed or Invalid, it will never be modified or reused. While we could lock the record, then test for Sealed/Invalid and unlock if found, the fact that we will never want to operate on a Sealed or Invalid record allows us to centralize this in the `Lock()` operation, returning false if either condition is encountered.
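As an illustration of testing Sealed/Invalid atomically with the lock attempt, an exclusive-lock CAS loop might look like the following sketch. The bit positions here are hypothetical placeholders, not the actual `RecordInfo` layout; only the protocol matters:

```cs
using System.Threading;

static class RecordLockSketch
{
    // Illustrative only: bit positions are invented for this sketch.
    const long kValidBit         = 1L << 63;
    const long kExclusiveLockBit = 1L << 62;
    const long kSealedBit        = 1L << 61;

    public static bool TryLockExclusive(ref long word, int maxSpins)
    {
        for (int spins = 0; spins < maxSpins; ++spins)
        {
            long expected = Volatile.Read(ref word);

            // Sealed or Invalid records are never locked; fail so the
            // caller can RETRY_LATER and find the superseding record.
            if ((expected & kSealedBit) != 0 || (expected & kValidBit) == 0)
                return false;

            // Attempt the lock only if it is currently free; the CAS makes
            // the Sealed/Invalid test and the lock one atomic decision.
            if ((expected & kExclusiveLockBit) == 0 &&
                Interlocked.CompareExchange(ref word, expected | kExclusiveLockBit, expected) == expected)
                return true;

            Thread.Yield();
        }
        return false;   // spin limit reached; the operation retries
    }
}
```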
When operations have locked an in-memory record, that record may go below its log's HeadAddress due to BlockAllocate or SpinWaitUntil*Closed. In this case, we cannot unlock that recordInfo; the reference points to evicted memory. Instead, we must SpinWaitUntilRecordIsClosed, then unlock the LockTable record.
When Read() operations have locked a readcache record, they must check for the case that it has been set Invalid due to a CopyToTail or CopyUpdate from the immutable region. In this case, its locks have been transferred (actually just copied) to a new main-log record. To manage this, Unlock() tests the returned post-operation value from Interlocked.Add; if this has the Invalid bit set, Unlock() returns false, which lets the caller know there is a new record that was not affected by the Unlock(). In this case, the Unlock() must be retried via InternalLock(), to find the new record and release the lock.
The flow of locking differs for Read and updaters (RMW, Upsert, Delete).
For operations where the key exists in memory:
- The record is locked ephemerally by FASTER for the duration of the `IFunctions` method (rather than locking being done within the `IFunctions` method call, as in previous versions).
There are a number of variables necessary to track the main hash table entry information, the 'source' record as defined above (including locks), and other stack-based data relevant to the operation. The code has been refactored to place these within structs that live on the stack at the InternalXxx level.
This is used for hash-chain traversal and CAS updates. It consists primarily of:
- A copy of the `HashBucketEntry` at the time the `HashEntryInfo` was populated.
- `Address` (which may or may not include the readcache bit) and `AbsoluteAddress` (`Address` stripped of the readcache bit) accessors for that copy.
- A pointer to the 'live' `HashBucketEntry`, which may be updated by other sessions as our current operation proceeds.
- `CurrentAddress` (which may or may not include the readcache bit) and `AbsoluteCurrentAddress` (`CurrentAddress` stripped of the readcache bit) accessors for the live entry.
- A method to update the copied `HashBucketEntry` with the current information from the 'live' pointer.

The 'source' structure is actually `RecordSource<TKey, TValue>` and carries the information identifying the source record and lock information for that record. It contains:

- The address of the source record, starting from `HashEntryInfo.Address`, and the log (readcache or hlog) that contains it.
- The lowest readcache logical and physical addresses. These are used for 'splicing' a new main-log record into the gap between the readcache records and the main-log records; see ReadCache below.

A third stack-based struct contains all information that is necessary for the operation, making parameter lists much smaller. It contains:

- The `HashEntryInfo` and `RecordSource<TKey, TValue>` for this operation. These are generally used together; in some situations, such as when `hlog.HeadAddress` has changed due to an epoch refresh, `RecordSource<TKey, TValue>` is reinitialized from `HashEntryInfo` during the operation.
- Information about any new record allocated by a `CreateNewRecord*` method.

The readcache is a cache for records that are read from disk. For records that become 'hot' (read multiple times), this saves multiple I/Os. It is of fixed size, determined at FasterKV startup, and has no on-disk component; records in the ReadCache are evicted from the head, without being written to disk, as new records are added to the tail.
Records in the readcache participate in locking, either because they are the record to be read, or because they are the 'source' of an update. If the key is found in the readcache, then the RecordSource<TKey, TValue>.Log is set to the readcache. If the operation is an update, then the readcache record is Invalidated before being unlocked.
When locking for Read() operations, we never take an exclusive lock. This requires some checking when unlocking, as noted in Ephemeral Lock Release, to ensure we have released the shared lock on the key.
Read-locks records as soon as they are found, calls the caller's IFunctions callback, unlocks, and returns. The sequence is:
If the record is in the readcache:
- Lock the record, call `IFasterSession.SingleReader`, and unlock.

If the record is in the mutable region:

- Lock the record, call `IFasterSession.ConcurrentReader`, and unlock.

If the record is in the immutable region:

- Lock the record, call `IFasterSession.SingleReader`, and unlock. This is done within `InternalRead()` (or in this case, `ReadFromImmutableRegion()`).

If the record is on disk, this goes through the pending-read processing, starting with `InternalContinuePendingRead`.
We first look for an existing source record for the key; another session may have added that key to the log, readcache, or LockTable. If found, it is locked.
Call IFasterSession.SingleReader.
If the ReadCache is enabled or CopyToTail is true, call InternalTryCopyToTail; if successful, this transfers locks from the source record and clears the "source is locked" flag in RecordSource<TKey, TValue>.
Unlock the source record, if any (and if still locked), and return.
This is called when a pending read completes, during compaction, and when copying records from the immutable region to the readcache.
First, we see whether the key is found "later than" the expected logical address. This expected logical address is dependent upon the operation; for a pending Read, for example, it is the latest logical address seen by `InternalRead`. If the key is found at a later address, we return an appropriate NOT-FOUND code, as documented in the code.
Otherwise, we allocate a block on either the readcache or hlog, depending on which we are copying to (readcache or CopyToTail). Allocation may entail a flush, which may entail an epoch refresh; when this is complete, we ensure all of our in-memory addresses are still in memory (`VerifyInMemoryAddresses()`, which updates readcache addresses), and return `RETRY_NOW` if an in-memory main-log location was evicted.
After this we know HeadAddress will not change. Initialize the new record from the passed-in Value and call IFunctions.SingleWriter(), then insert the new record into the chain:
Call CompleteTwoPhaseCopyToTail (like InternalTryCopyToTail, this includes copying to ReadCache despite the name). Since we have a readlock, not an exclusive lock, we must transfer all read locks, either from an in-memory source record or from a LockTable entry.
Because this is a blind update, it operates differently depending on whether, and where, a source record is found.
If the record is in the readcache:
- Call `IFasterSession.SingleReader`, ...

If the record is in the mutable region:

- Call `IFasterSession.ConcurrentReader`

If the record is in the immutable region:

- Call `IFasterSession.SingleReader`; this is done within `InternalRead()` (or in this case, `ReadFromImmutableRegion()`).

TODO
TODO
TODO
For records of keys not in memory (including those which have been evicted due to memory pressure while locked), as well as keys that have not yet been added to the log, we use a LockTable that records locks "as if" the key were present on the log and in memory.
We optimize the operations that access the LockTable for the cases where no locks are present, or no locks are present for the bucket to which a key hashes.
We implement the LockTable with two custom data structures; ConcurrentDictionary is too slow, and requires an IHeapContainer<TKey> allocation even for lookups. The structures are:
- `InMemKV` and related: An underlying replacement for `ConcurrentDictionary` that optimizes bucket-level locking.
- `LockTable`: Provides the locking layer, using `InMemKV<TKey, THeapKey, TValue, TUserFunctions>` for underlying storage.

`InMemKV<>` is an in-memory key/value store that optimizes locking and reuse of C# allocations (to minimize GC overhead). In this discussion the LockTable is used to illustrate its classes and their methods, but it is a separate class from the LockTable, to allow reuse in other parts of FASTER. It and its related classes are generic, with the following type parameters:

- `TKey`: The Key type, which for the LockTable is the same as the FasterKV's.
- `THeapKey`: The type allocated for persistent storage of the Key, which for the LockTable is `IHeapContainer<TKey>`.
- `TValue`: The Value type, which for the LockTable is the `RecordInfo`.
- `TUserFunctions`: The implementation type for `IInMemKVUserFunctions<TKey, THeapKey, TValue>`, as described below.

The in-memory KV class hierarchy is described here. For brevity, abbreviations such as `InMemKV<>` are used to refer to the full `InMemKV<TKey, THeapKey, TValue, TUserFunctions>` or other generic classes within the hierarchy.

- `InMemKV` (without generic args): A static class that contains constants.
- `IInMemKVUserFunctions<TKey, THeapKey, TValue>`: An interface that must be implemented by the caller (such as the LockTable). It provides user functions that replace `IFasterEqualityComparer` to allow comparison of `TKey` and `THeapKey`, as well as other methods such as creating a `THeapKey` and determining whether a value `IsActive`.
- `InMemKVEntry<TKey, THeapKey, TValue, TUserFunctions>`: The actual entry for a key, which contains:
  - `THeapKey`: The persistent storage for the key. For the LockTable, it is an `IHeapContainer<TKey>`.
  - `TValue`: The value for the key. For the LockTable, it is a `RecordInfo`.
- `InMemKVChunk<TKey, THeapKey, TValue, TUserFunctions>`: A chunk of allocations; the unit by which a bucket grows, if needed. It is a class containing:
  - an array of `InMemKVEntry<>` structs
  - links to other `InMemKVChunk<>`s (the chunks form a linked list).
- `InMemKVBucket<TKey, THeapKey, TValue, TUserFunctions>`: The hash bucket, a vertebra of the hashtable spine. It is a struct containing:
  - `long word`: Analogous to the `word` of a `RecordInfo`; it is used for locking, including maintaining the `ExclusiveGeneration`, a counter that rolls over at 33 million. Buckets follow the protocol that any membership-modifying operation must exclusively lock the bucket; `ExclusiveGeneration` allows quick checking of whether the current thread acquired the exclusive lock without an intervening thread having potentially modified the bucket membership since the current thread's read-only pass.
  - `InMemKVEntry<> initialEntry`: The initial entry of the hash chain mapping to this bucket. We optimize for the case where there are no collisions on a bucket; if multiple colliding keys are present at the same time, we create overflow allocations.
  - `InMemKVChunk<> LastOverflowChunk`: The last overflow chunk, if any are allocated. We store the last one instead of the first to facilitate compaction, which accesses the end of the chunk list.
- `InMemKV<TKey, THeapKey, TValue, TUserFunctions>`: The high-level in-memory key/value store, containing:
  - `InMemKVBucket<>[] buckets`: The spine of the hashtable.
  - `TUserFunctions userFunctions`: The table-level user functions.

Below `InMemKV<>` in the hierarchy, all `InMemKV*` classes and structs are entirely internal, and are not used directly by production code (only by tests).
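The `ExclusiveGeneration` protocol described above can be sketched as follows (the helper names here are hypothetical; in the real structure the counter is packed into `bucket.word`):

```cs
// Hypothetical sketch of the ExclusiveGeneration fast path.
// 1. During the read-only pass, snapshot the generation.
int savedGeneration = bucket.ExclusiveGeneration;

// 2. Promote to an exclusive lock. Conceptually, the promote compares the
//    current generation to the snapshot *before* incrementing it:
//        immediateLock = (bucket.ExclusiveGeneration == savedGeneration);
//        bucket.ExclusiveGeneration++;
bucket.PromoteSharedToExclusive(savedGeneration, out bool immediateLock);

if (immediateLock)
{
    // No other thread held the XLock in between, so bucket membership is
    // unchanged: we may operate on our target entry directly.
}
else
{
    // Another thread may have compacted the bucket; re-verify that the
    // target entry is still present (and still inactive) before touching it.
}
```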
Unfortunately, the IHeapContainer<TKey> approach is necessary, despite the need to make interface calls instead of inlining, because we cannot specialize a generic type parameter for all the types that implement IHeapContainer<TKey> for the different allocator types, for a single member. Instead, we minimize the number of allocations made, and compare the TKey directly to the IHeapContainer<TKey>.Get().
The InMemKV<> is designed to expose functions that take references to user-implemented interfaces implemented via structures (for inlining). This allows it to provide generic functionality for Adding, Finding, and Removing entries. Because it is optimized to minimize locking and implement on-the-fly compaction, it relies on the calling class to provide the IsActive method; during compaction, an entry that is not active is removed.
Following are the public methods of `InMemKV<>` (the class is internal, but the "internal API" methods are marked "public"):

- `bool IsActive`: `InMemKV<>` maintains a count of active buckets; a bucket's active state is determined by whether its `InitialEntry` is active (compaction, described below, ensures this accurately reflects the bucket's state). The caller checks `IsActive` to avoid unnecessary lock operations.
- `bool FindEntry<TFuncs>`: Searches for an entry in the table and calls the appropriate method on the `TFuncs` instance, `where TFuncs : IFindEntryFunctions`. Returns true if the entry was found, else false. The `TFuncs` implementation retains further information to allow the caller to, for example, return the result of `RecordInfo.TryLock()`.
- `void AddEntry<TFuncs>`: Adds an entry to the table and calls the single method on the `TFuncs` instance, `where TFuncs : IAddEntryFunctions`. For speed, `AddEntry<>` does not search for an existing entry; it is the caller's responsibility to call this only in situations where uniqueness is known. Other situations should call `FindOrAddEntry<>`. Returns void; it always adds a new record.
- `bool FindOrAddEntry<TFuncs>`: Searches for an entry in the table, adding one if it is not present, and calls the appropriate method on the `TFuncs` instance, `where TFuncs : IFindEntryFunctions`. Returns true if the entry was found, else false. The `TFuncs` implementation retains further information to allow the caller to, for example, return the result of `RecordInfo.TryLock()`.

Note that there is no `Remove<TFuncs>` method. As described, `InMemKV<>` relies on the `IsActive` method to determine whether an entry is active and available for removal by compaction. Therefore, to remove an entry, supply a `TFuncs` implementation to `FindEntry` that simply sets the entry's value to something for which `IsActive` will return false. This `TFuncs` implementation may or may not verify that the entry was found.
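As a sketch, such a "remove by deactivation" `TFuncs` for the LockTable could look like this (the `IFindEntryFunctions` method signatures shown are assumptions for illustration):

```cs
// Illustrative TFuncs: "removing" a LockTable entry just clears its locks,
// so IsActive (conceptually "does the RecordInfo hold any locks?") returns
// false and on-the-fly compaction reclaims the entry later.
struct RemoveFuncs : IFindEntryFunctions
{
    internal bool wasFound;

    public void NotFound(ref Key key) => wasFound = false;

    public void FoundEntry(ref Key key, ref RecordInfo recordInfo)
    {
        wasFound = true;
        recordInfo.ClearLocks();    // entry becomes !IsActive
    }
}
```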
The LockTable uses `InMemKV<>` for its backing store. It exposes the following methods (as with `InMemKV<>`, the class is internal but "public" methods are marked 'public'):

- `bool IsActive`: A wrapper around `InMemKV<>.IsActive`.
- `bool TryLockEphemeral`: Try to lock the key for an ephemeral operation; if there is no lock present, it returns without locking and lets two-phase insert handle it, as described below. Calls `InMemKV<>.FindEntry` with the following implementation, and returns the value of `success`:
  - `NotFound()`: success = true
  - `FoundEntry()`: success = `RecordInfo.TryLock()`
- `bool TryLockManual`: Try to acquire a manual lock (which may be tentative), either by locking an existing LockTable record or by adding a new record. Calls `InMemKV<>.FindOrAddEntry` with the following implementation, and returns the value of `success`:
  - `FoundEntry()`: success = `RecordInfo.TryLock()`, and clear tentative state
  - `AddedEntry()`: initialize lock, store tentative state, success = true
- `bool Unlock`: Unlock a key. Calls `InMemKV<>.FindEntry` with the following implementation, and returns the value of `success`:
  - `NotFound()`: success = false. The caller may raise an error in this case, if the key was expected to be found.
  - `FoundEntry()`: success = `RecordInfo.Unlock()`. This may fail if the `RecordInfo` was marked Invalid, e.g. prior to a transfer.
- `bool Remove`: Remove the key from the LockTable, if it exists. Called after a record is transferred. Calls `InMemKV<>.FindEntry` with the following implementation, and returns the value of `wasFound`:
  - `NotFound()`: wasFound = false. This is OK; we may have spin-waited for an evicted record that had no locks.
  - `FoundEntry()`: wasFound = true; `RecordInfo.ClearLocks()`.
- `void TransferFromLogRecord`: Transfers locks from a main-log or readcache record that is being evicted due to memory pressure. Calls `InMemKV<>.AddEntry` with the following implementation:
  - `AddedEntry()`: `RecordInfo.TransferLocksFrom(log record)`
- `void TransferToLogRecord`: Transfers locks from the LockTable to a new log record. Calls `InMemKV<>.FindEntry` with the following implementation:
  - `FoundEntry()`: `logRecord.TransferLocksFrom(lock table record)`
- `bool ClearTentative`: Clears the Tentative bit from the key's lock, making it "real". Calls `InMemKV<>.FindEntry` with the following implementation, and returns the value of `success`:
  - `NotFound()`: success = false
  - `FoundEntry()`: clear the tentative flag; success = true
- `void UnlockOrRemoveTentative`: Unlock the key, or remove a tentative entry that was added. This is called when the caller is abandoning the current attempt and will retry. Calls `InMemKV<>.FindEntry` with the following implementation, and asserts the value of `success` (the record should always be found):
  - `NotFound()`: success = false
  - `FoundEntry()`: success = `RecordInfo.Unlock()`. This may fail if the `RecordInfo` was marked Invalid, e.g. prior to a transfer.
- `bool CompleteTwoPhaseInsert`: Complete the two-phase insertion protocol: wait for any tentative lock on the key (it will be cleared by the owner), or return false if a non-tentative lock was inserted. Calls `InMemKV<>.FindEntry` with the following implementation, and returns the value of `notFound`:
  - `NotFound()`: notFound = true
  - `FoundEntry()`: notFound = false; foundTentative = `recordInfo.IsTentative()`
- `bool ContainsKey`: Test whether a key is present in the LockTable. In production code, this is used to implement `FasterKV<>.SpinWaitUntilRecordIsClosed`. It is a wrapper around `InMemKV<>.ContainsKey`.

The LockTable is optimized to add minimal overhead when not in use; when in use, it is optimized for the no-collisions case per bucket.
This means that we expect the bucket.InitialEntry to be most often operated on:
When collisions occur, as may happen more often in memory-constrained cases with multiple transactions and numerous keys, the bucket obtains `InMemKVChunk<>`s from the C# `new` operator, currently in fixed chunks of size 16. Because `LockTableEntry<TKey>` contains an interface, `IHeapContainer<TKey>`, the memory cannot be pinned, so we have no motivation to use another allocator.
The bucket.LastOverflowChunk holds the last chunk. If it is null, then there are no overflow chunks. Chunks are maintained in a linked list. We store the last chunk instead of the first because we compact on-the-fly from the end of the list over the !IsActive entries.
Compaction is triggered by `InMemKV<>.FindEntry` detecting that the entry is `!Active` following the `IFindEntryFunctions.FoundEntry` call. When this condition is found, the thread promotes its bucket lock to exclusive. This process saves off `bucket.ExclusiveGeneration` before the exclusive lock and examines it on lock acquisition (prior to incrementing it). If it is unchanged, we know no other session has modified the bucket contents, and we can do an optimized swap of the last active entry over the just-inactivated entry (this is why we hold onto the last chunk, not the first, in the bucket field). Otherwise, we must verify that the no-longer-active entry is still in the list: the chunk must be in our assigned list (verifying this requires iterating the set of chunks), and the entry at the index must be inactive (but not default). Some notes about this:
- We only insert new entries at the end of the list, never in the middle. This means our target entry will never be at a position "later" than it was when we made this call.
- Every CompactInto() removes InActive entries at the end of the list (an InActive entry will never be swapped into a target entry). Since another thread got the XLock before us, our target entry was removed if it was at the end of the list.
When all entries have been compacted off the last chunk, it is "freed" and the bucket field is updated. "Freeing" a chunk means it is passed to the InMemKV<>'s freelist, which is limited in size, so the chunk may instead be dropped and garbage-collected.
The bucket.ExclusiveGeneration is stored in 25 bits of the bucket.word field, which means it rolls over after 2^25 (about 33 million) increments; this will not happen within one iteration of the loop (the bit count could be reduced if the bits are needed elsewhere).
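The rollover arithmetic can be illustrated with a bit-packing sketch. The field positions below are assumptions chosen for illustration, not the actual bucket.word layout:

```csharp
// Sketch: a 25-bit generation counter packed into a 64-bit word,
// above a 32-bit index field (positions are illustrative only).
static class BucketWordSketch
{
    const int IndexBits = 32;
    const int GenerationBits = 25;
    const long GenerationMask = (1L << GenerationBits) - 1;

    public static long GetGeneration(long word)
        => (word >> IndexBits) & GenerationMask;

    public static long IncrementGeneration(long word)
    {
        // Wraps after 2^25 (33,554,432) increments.
        long gen = (GetGeneration(word) + 1) & GenerationMask;
        return (word & ~(GenerationMask << IndexBits)) | (gen << IndexBits);
    }
}
```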
The InMemKV<TKey> maintains an invariant that once traversal hits an entry that is IsDefault (no heap key is set), traversal stops; on-the-fly compaction ensures this remains consistent. This works in conjunction with the locking rules, in which the ExclusiveGeneration value is used: if, after acquiring the promoted XLock, we see that the current ExclusiveGeneration is the same as before the XLock attempt, then we know the list membership has not changed; we return this knowledge to the PromoteSharedLockToX caller as out bool immediateLock.

With these locking rules, we can implement Compaction quite quickly:
- The last active entry is at LastActiveChunkEntryIndex (stored in 32 bits of bucket.word); after it is swapped over the inactivated entry, we decrement LastActiveChunkEntryIndex. If this releases the last entry on the chunk, then we free the chunk as described above.
- Traversal stops when it reaches an entry that IsDefault (or reaches the end of the final chunk).

END OF REVISIONS 12/09/2022
For records not found in memory, the LockTable is used. The semantics of LockTable entries are as follows. This is a conceptual view; implementation details are described in subsequent sections.
- In a Lock call, if the key is not found in memory, the LockTable is searched for the Key.
  - If the key is found in the LockTable it is locked as specified.
- In an Unlock call, if the key is not found in memory, the LockTable is searched for the Key.
  - If the key is found, the specified LockType is unlocked. If this leaves the RecordInfo unlocked, its entry is removed.
- When an update operation must act on a key that is not in memory, it searches the LockTable; if the key is found:
  - If the entry is in the LockTable it is Sealed, else a new Tentative record is added.
  - Locks are transferred from the LockTable entry to the retrieved recordInfo.
  - The operation then unseals the entry in the LockTable, or deletes the entry if it was Tentative.
- A pending operation that completes also searches the LockTable, and if the key is found:
  - Locks are transferred from the LockTable entry to the retrieved recordInfo, and the entry is unsealed in the LockTable, or deleted if it was Tentative.

Because LockTable use does not verify that the key actually exists (as it does not issue a pending operation to ensure the requested key, and not a collision, is found in the on-disk portion), it is possible that keys will exist in the LockTable that do not in fact exist in the log. This is fine; if we do more than Lock them, then they will be added to the log at that time, and the locks applied to them.

This is the complementary side of Insertion to LockTable due to Update:
When a thread doing Lock() looks for a key in the LockTable and cannot find it, it must do a Tentative insertion into the locktable, because it is possible that another thread CAS'd that key to the Tail of the log after the current thread had passed the hash table lookup:
- A thread appending such a record must be able to see that the LockTable has an entry for that key.
- If no such record appears, the thread finalizes its Tentative LockTable entry and exits successfully.

This is the complementary side of Insertion to LockTable due to Lock and applies to Upsert, RMW, and Delete, when any of these appends a record to the tail of the log (for brevity, Update is used). It is necessary so that threads that try to Lock() the Upsert()ed record as soon as it is CAS'd into the Log will not "split" locks between the log record and a LockTable entry. There is a bit of a Catch-22 here: we cannot CAS in the non-Tentative log record before we have transferred the locks from a LockTable entry, but we must have a record in the log so that Lock() will not try to add a new entry, or lock an existing entry, while Upsert is in the process of creating the record and possibly transferring the locks from the LockTable.
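The two-phase Lock() insertion described above can be sketched as follows; the LockTable calls shown are hypothetical simplifications of the internal API:

```csharp
// Sketch of the Lock() side of the two-phase protocol: insert Tentative,
// re-check the log tail, then finalize or back out.
bool TryLockViaLockTable(TKey key)
{
    // Phase 1: add a Tentative entry so a concurrent Upsert/RMW/Delete
    // appending this key to the log tail will see it and wait.
    var entry = lockTable.AddTentativeEntry(key);   // hypothetical call

    // Phase 2: another thread may have CAS'd a record for this key to the
    // tail after our hash-table lookup; re-scan to find out.
    if (FoundRecordInMemory(key))                   // hypothetical call
    {
        // Back out and retry against the in-memory record instead.
        lockTable.RemoveEntry(entry);
        return false;                               // caller does RETRY_NOW
    }

    // No record appeared: clear the Tentative bit to finalize the lock.
    entry.ClearTentative();
    return true;
}
```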
For performance reasons, Upsert cannot do an operation on the LockTable for each added record; therefore, we defer the cost until the last possible point, where we know we have to do something with the LockTable (which is very rare).
When Upsert must append a new record:
- It checks the LockTable to see if there is an entry in it for this key.
- If there is an entry in the LockTable, then Upsert checks to see if it is marked Tentative.
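The Upsert-side check above might look like the following sketch; the method names are hypothetical stand-ins for the internal API:

```csharp
// Sketch of Upsert's deferred LockTable handling after appending a record.
void OnUpsertAppend(TKey key, ref RecordInfo newRecordInfo)
{
    // Common case: no entry for this key, so no LockTable cost at all.
    if (!lockTable.ContainsKey(key))                 // hypothetical call
        return;

    // A Tentative entry means a Lock() is mid-insertion for this key;
    // wait until it is finalized or backed out.
    lockTable.SpinWaitWhileTentative(key);           // hypothetical call

    // Transfer any locks from the LockTable entry to the new log record.
    lockTable.TransferLocks(key, ref newRecordInfo); // hypothetical call
}
```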
Here are the sequences of operations to remove records from the Lock Table:
- Unlock removes the entry from the LockTable conditionally on IsLocked == false and Sealed == false.
- On a Pending Read to ReadCache or CopyToTail, a Pending RMW to Tail, or an Upsert or Delete of a key in the LockTable:
  - The LockTable record is Sealed as described in Relevant RecordInfo bits.
  - If Sealing fails, another thread owns the LockTable entry and the thread does RETRY_NOW; otherwise, the locks are transferred and the record is removed from the LockTable.
When the ReadCache is enabled, "records" from the ReadCache (actually simply their RecordInfo headers) are inserted into the chain starting at the HashTable (these records are identified as ReadCache by the combination of FasterKV.UseReadCache being set and the ReadCache bit being set in the RecordInfo). All ReadCache records come before any main log record. So (using r#### to indicate a ReadCache record and m#### to indicate a main log record):
- With no ReadCache entries in a hash chain, it looks like: HashTable -> m4000 -> m3000 -> m...
- With ReadCache entries in a hash chain, it looks like: HashTable -> r8000 -> r7000 -> m4000 -> m3000 -> m...

As a terminology note, the sub-chain of r#### records is referred to as the ReadCache prefix chain of that hash chain.
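Because every r#### record precedes any m#### record, skipping the ReadCache prefix is a simple loop. The sketch below is illustrative only; the helper names are not the internal API:

```csharp
// Sketch: walk a hash chain past the ReadCache prefix to the first
// main-log record (e.g., from r8000 -> r7000 down to m4000 above).
long SkipReadCachePrefix(long logicalAddress)
{
    // ReadCache records are identified by the ReadCache bit; all of
    // them come before any main-log record in the chain.
    while (IsReadCacheAddress(logicalAddress))               // hypothetical
        logicalAddress = GetReadCacheRecordInfo(logicalAddress).PreviousAddress;
    return logicalAddress;   // first main-log address
}
```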
In FASTER v1, updates involving ReadCache records stripped the entire ReadCache prefix from the chain. Additionally, the ReadCache prefix is stripped from the hash chain when a ReadCache page with that hashcode is evicted due to memory limits. In FASTER v2, because ReadCache records may be locked, we must not lose those locks. This is resolved in two ways:
- ReadCache prefix chains are preserved except for the specific record being updated, which is spliced out and transferred to a CopyToTail on the main log, including any locks.
- When ReadCache pages are evicted, their records are removed from the ReadCache prefix chain, and any with locks are transferred to the LockTable.

In normal FASTER operation, records are appended at the tail of the log and do not move. The HashTable points to these records for each distinct hash code.
Record transfers occur when a ReadCache entry must be updated, or a record is evicted from either ReadCache or the main log while it holds locks.
ReadCache Records at Tail of Log

For brevity, ReadCache is abbreviated RC, CopyToTail is abbreviated CTT, and LockTable is abbreviated LT. Main refers to the main log. The "final RC record" is the one at the RC->Main log boundary. As always, avoiding locking cost is a primary concern.
For record transfers involving the ReadCache, we have the following high-level considerations:
- ReadCache entry.
- .PreviousAddress.
- The ReadCacheEvict thread may have outspliced the final RC record (in this case, the (formerly) final RC record will have its Invalid bit set), so RETRY_NOW.
- Locks need to be transferred only if there was an active LockableUnsafeContext session at the time ReadCacheEvict is called; otherwise there will be no locks. However, we must already traverse the ReadCache records, and it is possible for a new LockableUnsafeContext session to start during the duration of ReadCacheEvict, so there is no benefit to checking for the no-LockableUnsafeContext case (unlike Main Log Evictions, which can avoid page scans by checking for this).

The above covers single-record operations on the RC prefix. Two-record operations occur when we must outsplice one record and insplice another because the value for a record in the RC prefix is updated: e.g., Upsert updating a record in the ReadOnly region, RMW doing a CopyUpdater (of mutable or readonly), or either of these operations updating a key that is in the RC prefix chain. The considerations here are:
When main log pages are evicted due to memory limits, if there are any active LockableUnsafeContext sessions, then each record on those pages must be examined and any locks transferred to LockTable entries.
Transfers to the LockTable due to main log evictions are handled in the following manner:
- A TentativeHeadAddress (THA) field is added next to HeadAddress.

We must clear in-memory records' lock bits during recovery.
- RecoveryInfo.manualLockingActive: an indication of whether any LockableUnsafeContext were active during the Checkpoint.

Following are the 4 FASTER data operations and Lock/Unlock, and their flow for the various lock states.
Abbreviations:
- The LockOperations instance passed to one of the InternalXxx methods.

One big consideration for Upsert is that it blindly upserts when a scan for a record drops below HeadAddress. This, in conjunction with our two insertion points--at HT->RC and at RC->MainLog--gives rise to the following lost-update anomaly:
General algorithm, iff readcache entries are present: each participating thread adds a Tentative entry, then scans; if it does not find an existing record, then finalize. This is modified by ensuring that the update wins (its data is more recent). We must have such a two-phase operation at both ends, to ensure that whichever side completes the scan first, it will find an entry, either final or Tentative, for the other operation.
OPTIMIZATION: Use readcache records rather than going to disk. However, there are issues here with the record being marked Invalid/Sealed in case multiple threads do it.
- LockTable insertion as described in Insertion to LockTable due to Lock.

Note that the changes specified here, including both shared and exclusive locks in the ReadOnly region, clarify the distinction between a data-centric view of the ReadOnly region being implicitly read-locked (because it cannot be updated) and a transactional view that requires explicit read locks. In a transactional view, a read lock prevents an exclusive lock; implicit read locks based on address cannot do this in FASTER, because we can always do an XLock or an RCU. Therefore, we need explicit read locks, and reads must of course block if there is an XLock. This also means that SingleReader would have to lock anyway, losing any distinction between it and ConcurrentReader. Therefore, we have consolidated ConcurrentReader and SingleReader into a single function.
RMW considerations are similar to Upsert from the sealing and "encountering locks" point of view.