Revivification in Tsavorite refers to reusing ("revivifying") Tombstoned (deleted) records, as well as the in-memory source records of RCUs done by Upsert or RMW. This minimizes log growth (and unused space) in high-delete scenarios.

There are two forms of revivification:

- **In-chain revivification**: a Tombstoned record that remains in its tag chain is reused in place by a subsequent Upsert or RMW of the same key.
- **FreeList revivification**: eligible records are elided from the tag chain and added to a pool of free records (the FreeRecordPool), from which new-record allocations may be satisfied.
These are separate from the reuse of a record due to a RETRY return from CreateNewRecordXxx. In that case the record address is stored in the PendingContext, and CreateNewRecordXxx returns either RETRY_LATER or RETRY_NOW; when CreateNewRecordXxx is attempted again, it retrieves this record (if its length is still sufficient and it has not fallen below the minimum required address). Retry reuse is always enabled; revivification may not be.
## Public API

This section describes the external API and command-line arguments for Revivification.
### RevivificationSettings

This struct indicates whether revivification is to be active:
- `EnableRevivification`: If true, then at least in-chain revivification is done; otherwise, record revivification is not done (but Retry reuse still is).
- `FreeListBins`: If this array of `RevivificationBin` is non-null, then revivification includes a freelist of records, as defined below.
- `SearchNextHigherBin`: By default, when looking for FreeRecords we search only the bin for the specified size. If this is set and there are no records in the initial bin, the next-higher bin is searched as well.
- `RevivifiableFraction`: It may be desirable not to use a record that is too close to the ReadOnlyAddress; some apps prefer that a newly-inserted record remain in the mutable region as long as possible. RevivifiableFraction limits the eligible range of revivification to within this fraction of memory immediately below the TailAddress; it has the same semantics as `LogSettings.MutablePercent`, and cannot be greater than it. For example, if RevivifiableFraction is .2, TailAddress is 100,000, and HeadAddress is 50,000, then the .2 x 50,000 = 10,000 address-space units closest to the tail are eligible for reuse. This is done on an address-space basis, not record count, so the actual number of records that can be revivified will vary for variable-length records.
- `RestoreDeletedRecordsIfBinIsFull`: Deleted records that are to be added to a RevivificationBin are elided from the hash chain. If the bin is full, this option controls whether the record is restored (if possible) to the hash chain. This preserves it as an in-chain revivifiable record, at the potential cost of having the record evicted to disk while part of the hash chain, and thus having to do an I/O only to find that the record is deleted. For applications that add and delete the same keys repeatedly, this option should be set true if the FreeList is used.
- `UseFreeRecordPoolForCopyToTail`: When doing explicit CopyToTail operations, such as Compaction or CopyToTail when reading from the immutable in-memory region or from disk I/O, this controls whether the allocation for the retrieved records may be satisfied from the FreeRecordPool. These operations may require that the records be allocated at the tail of the log, to remain in memory as long as possible.

### RevivificationBin

This struct contains the definition of the free-list bins, which are lists of free addresses within a certain range of sizes.
- `RecordSize`: The maximum size of records in this partition (the minimum size is either 16 or the previous bin's RecordSize + 8). Bins should be partitioned according to the anticipated record sizes for your app. Ignored if the TsavoriteKV uses fixed-length data; in that case there is only a single bin for fixed-length records.
- `NumberOfRecords`: The number of records (addresses) to be held in this bin. Tsavorite makes a best-effort attempt to partition the bin into segments to reduce sequential search among the various sizes.
- `BestFitScanLimit`: The maximum number of entries to scan for best fit after finding first fit; may be the special `UseFirstFit` value to use first-fit. Ignored for fixed-length datatypes.

### Command-Line Arguments

GarnetServer.exe supports the following command-line arguments for Revivification:
- `--reviv`: A shortcut to specify revivification with default power-of-2-sized bins. This default can be overridden by `--reviv-in-chain-only` or by the combination of `--reviv-bin-record-sizes` and `--reviv-bin-record-counts`.
- `--reviv-bin-record-sizes`: For the main store, the sizes of records in each revivification bin, in order of increasing size. Supersedes the default `--reviv`; cannot be used with `--reviv-in-chain-only`.
- `--reviv-bin-record-counts`: For the main store, the number of records in each bin:
  - If not specified, each bin has `RevivificationBin.DefaultRecordsPerBin` records.
  - If a single count is given and `--reviv-bin-record-sizes` is specified, then all bins have this number of records; otherwise it is an error.

  Supersedes the default `--reviv`; cannot be used with `--reviv-in-chain-only`.
- `--reviv-fraction`: Fraction of mutable in-memory log space, from the highest log address down to the read-only region, that is eligible for revivification. Applies to both main and object stores.
- `--reviv-search-next-higher-bins`: Search this number of next-higher bins if the search cannot be satisfied in the best-fitting bin. Requires `--reviv` or the combination of `--reviv-bin-record-sizes` and `--reviv-bin-record-counts`.
- `--reviv-bin-best-fit-scan-limit`: Number of records to scan for best fit after finding first fit. Requires `--reviv` or the combination of `--reviv-bin-record-sizes` and `--reviv-bin-record-counts`. Values are:
  - `RevivificationBin.UseFirstFit`: Return the first address whose record is large enough to satisfy the request.
  - `RevivificationBin.BestFitScanAll`: Scan all records in the bin for best fit, stopping early if an exact match is found.
- `--reviv-in-chain-only`: Revivify tombstoned records in tag chains only (do not use the free list). Cannot be used with `--reviv-bin-record-sizes` or `--reviv-bin-record-counts`. Also applies to the object store.
- `--reviv-obj-bin-record-count`: Number of records in the single free-record bin for the object store. The object store has only a single bin, unlike the main store. Ignored unless the main store is using the free-record list.

## Internal Design

This section describes the internal design and implementation of Revivification.
### Maintaining Extra Value Length

Because variable-length record Values can grow and shrink, we must store the actual length of the value in addition to the Value data. Fixed-length datatypes (including objects) do not need to store this, because their Value length does not change.
This storage of record lengths applies to both Revivification and non-Revivification scenarios. ConcurrentWriter and InPlaceUpdater may change the record size, and need to know how much space they have available if they are to be able to grow in place later. They use this information to set the UpsertInfo and RMWInfo properties UsedValueLength and FullValueLength, which are discussed below.

Storing the extra value length is done by rounding up the used value length (the current length of the value) to int boundary, and then, if the allocated Value space is 4 bytes or more greater, storing the extra length at (start of Value plus rounded-up used space). Thus, the full value size is the rounded-up used value size plus the extra value size. If this extra length can be stored, the RecordInfo.Filler bit is set to indicate that the extra length is present. Variable-length log allocations are 8-byte (RecordInfo size) aligned, so any extra length less than sizeof(int) can be inferred from this rounding up.
Storing the extra value length must be done carefully so we maintain the invariant that unused space is zeroed. This is necessary because log scans assume any non-zero data they land on past the previous record length is a valid RecordInfo header. If the extra length is set before the Filler bit is, or if the Filler bit is cleared before the extra length is zeroed, then a log scan could momentarily see a short length, land on the nonzero extra length, and think it is a valid RecordInfo. The other direction is safe: the filler bit may be set but the extra length is zero. In this case the scan sees a zeroed RecordInfo (RecordInfo.IsNull()) and assumes it is uninitialized space and steps forward by the size of a RecordInfo (8 bytes).
This invariant is also followed when variable-length values are shrunk, even in the absence of the Filler bit. When a value is shrunk, the data beyond the new (shorter) size must be zeroed before the shorter size is set.
Combining these two, when we adjust a variable-length value length, we must:

- When shrinking, zero the data beyond the new (shorter) size before setting the shorter size.
- Set the Filler bit before storing the extra length.
- Zero the extra length before clearing the Filler bit.
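The length arithmetic above can be sketched as follows. This is an illustrative model under the assumptions in this section (int-rounded used length, 8-byte-aligned allocations); the method names are hypothetical, not the Tsavorite implementation:

```csharp
using System;

class ExtraValueLengthSketch
{
    const int IntSize = sizeof(int);

    // Round the used value length up to an int boundary.
    static int RoundUpToInt(int length) => (length + IntSize - 1) & ~(IntSize - 1);

    // The extra length can be stored explicitly only if at least sizeof(int)
    // bytes remain after the rounded-up used length; otherwise the (sub-int)
    // slack is inferred from the 8-byte record alignment.
    static bool CanStoreExtraLength(int usedValueLength, int fullValueLength)
        => fullValueLength - RoundUpToInt(usedValueLength) >= IntSize;

    static void Main()
    {
        // A 13-byte value in a 24-byte allocation: rounded-up used length is 16,
        // leaving 8 bytes, so the extra length is stored at (Value start + 16)
        // and the RecordInfo.Filler bit is set.
        Console.WriteLine(RoundUpToInt(13));            // 16
        Console.WriteLine(CanStoreExtraLength(13, 24)); // True
        // Only 3 bytes of slack after rounding up: nothing is stored; the
        // slack is inferred from alignment instead.
        Console.WriteLine(CanStoreExtraLength(13, 16)); // False
    }
}
```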
The Tsavorite-provided variable-length implementation is SpanByte, and SpanByteFunctions variants do the right thing here.
For variable-length datatypes, the UpsertInfo, RMWInfo, and DeleteInfo structs have two fields that let the relevant IFunctions callbacks know about the record length (without having to know the internals of value storage):

- `UsedValueLength`: the current length of the value data.
- `FullValueLength`: the full allocated length of the value space.
For SpanByte, the VariableLengthBlittableAllocator.GetValue overload with (physicalAddress, physicalAddress + allocatedSize) that calls the default SpanByte implementation of IVariableLengthStructureSettings actually initializes the value space to be a valid SpanByte that includes the requested Value length (up to the maximum number of elements that can fit). However, other variable-length data types may not do this.
### DisposeForRevivification

In general, the IFunctions Dispose* methods are intended for data that does not get written to the log (usually due to CAS failure, or a failure in SingleWriter, InitialUpdater, or CopyUpdater). Data written to the log will be seen by OnEvictionObserver if one is registered. However, revivified records are taken from the log and reused; thus they must give the application an opportunity to dispose their contents.
To handle this, a new DisposeForRevivification IFunctions method has been added. When a record is moved to the FreeList, revivified from either in-chain or the freelist, or reused from Retry, it has a Key and potentially a Value. Tsavorite calls DisposeForRevivification twice:

- When the record is freed, with the newKeySize parameter set to -1, indicating that we are not reusing it yet. The main reason for this call is to release no-longer-used objects as soon as possible. For non-object types there is minimal cost: fixed-length types do nothing, and the only Garnet variable-length type is SpanByte, which also does not need to do anything at this time (there are no allocations).
  - DisposeForRevivification is not called for in-chain tombstoned records; the key must remain valid, and the Value will be disposed at eviction.
  - DisposeForRevivification is not called for Retry-recycled records at the time they are added to the PendingContext, because it is called when retrieving them, and that is always done (we always retry).
- When the record is reused, with newKeySize set to the actual size of the new key if it is being reused from the FreeList. This is only a concern for variable-length types, and for Garnet that means SpanByte only, which ensures that the record is always zero-initialized such that it is "valid" for scan at any point. At reuse time, Tsavorite guarantees that a valid Key and Value (even if the Value is default) are in the record, and calls DisposeForRevivification with newKeySize > 0 if the record was retrieved from the freelist.
If DisposeForRevivification clears the Value and possibly the Key, it must do so in accordance with the protocol for zeroing data after used space described in Maintaining Extra Value Length. SpanByte does not need to clear anything, as a SpanByte contains nothing that needs to be freed. A non-SpanByte variable-length implementation must either implement DisposeForRevivification to zero-init the space, or revise its SingleWriter, InitialUpdater, and CopyUpdater implementations to recognize that an existing value is there and handle it similarly to ConcurrentWriter or InPlaceUpdater.
For SpanByte, SingleWriter et al. will have a valid target value in the destination for non-revivified records, because VariableLengthBlittableAllocator.GetAndInitializeValue (called after record allocation) calls the default SpanByte implementation of IVariableLengthStructureSettings and actually initializes the value space to be a valid SpanByte that includes the entire requested value size. For newly-allocated (not revivified) records the value data is initialized to zero; for revivified records this is not guaranteed; only the value space after the usedValueLength is guaranteed to be zeroed, so SingleWriter et al. must be prepared to shrink the destination value.

### In-Chain Revivification

If the FreeList is not active, all Tombstoned records are left in the hash chain; and even if the FreeList is active, a deleted record may not be elidable. A subsequent Upsert or RMW of the same key will revivify the record if it has sufficient allocated value length.
In-Chain revivification is always active if RevivificationSettings.EnableRevivification is true. It functions as follows:
- Delete() is done in the normal way: the Tombstone is set. If the Tombstoned record is at the tail of the tag chain (i.e. is the record in the HashBucketEntry) and the FreeList is enabled, then it is moved to the FreeList. Otherwise (or if this move fails), it remains in the tag chain.
- Upsert() and RMW() will try to revivify a Tombstoned record they find for the key, clearing the Tombstone and reusing the record's space, calling DisposeForRevivification as described in Maintaining Extra Value Length.

### FreeList Revivification

If RevivificationSettings.FreeListBins is not null, this creates the freelist according to the RevivificationBin elements of the array. If the data types are fixed-length, the array must have exactly one element; otherwise it can have many.
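As an illustration of bin partitioning, a configuration might look like the following. These are illustrative stand-in types using the field names from this document, not necessarily the exact Tsavorite API:

```csharp
using System;

class RevivificationConfigSketch
{
    // Stand-ins for the settings types described in this document.
    class Bin { public int RecordSize; public int NumberOfRecords; public int BestFitScanLimit; }
    class Settings { public bool EnableRevivification; public double RevivifiableFraction; public Bin[] FreeListBins; }

    static void Main()
    {
        // Power-of-2-sized bins for a variable-length store; a fixed-length
        // store would instead use exactly one bin.
        var settings = new Settings
        {
            EnableRevivification = true,
            RevivifiableFraction = 0.2,   // only the 20% of memory nearest the tail is eligible
            FreeListBins = new[]
            {
                new Bin { RecordSize = 32,  NumberOfRecords = 1024, BestFitScanLimit = 0 },  // first-fit
                new Bin { RecordSize = 64,  NumberOfRecords = 1024, BestFitScanLimit = 8 },
                new Bin { RecordSize = 128, NumberOfRecords = 512,  BestFitScanLimit = 8 },
            }
        };
        Console.WriteLine(settings.FreeListBins.Length);  // 3
    }
}
```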
When the FreeList is active and a record is deleted, if it is at the tail of the hash chain, is not locked, and its PreviousAddress points below BeginAddress, then it can be CASed (Compare-And-Swapped) out of the hash chain. If this succeeds, it is added to the freelist. Similar considerations apply to freelisting the source records of RCUs done by Upsert and RMW.
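The elision check and CAS described above can be sketched as follows. This is a simplification (no locking, Sealing, or tentative bits) with hypothetical names, not the Tsavorite implementation:

```csharp
using System;
using System.Threading;

class ElisionSketch
{
    class Record { public long Address; public long PreviousAddress; }

    static long bucketEntry;  // stands in for the HashBucketEntry word

    // Elide the record from the tag chain: if the bucket entry still points at
    // the record, CAS it to the record's PreviousAddress.
    static bool TryElide(Record record, long beginAddress)
    {
        if (record.PreviousAddress >= beginAddress)
            return false;  // an earlier record for this key may exist; must stay in the chain
        return Interlocked.CompareExchange(ref bucketEntry, record.PreviousAddress, record.Address)
               == record.Address;
    }

    static void Main()
    {
        var rec = new Record { Address = 5000, PreviousAddress = 100 };
        bucketEntry = rec.Address;
        Console.WriteLine(TryElide(rec, 1000));  // True: PreviousAddress is below BeginAddress
        Console.WriteLine(TryElide(rec, 1000));  // False: the bucket no longer points at the record
    }
}
```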
FreeList revivification functions as follows:
- Delete() checks whether the record is at the tail of the hash chain (i.e. is the record in the HashBucketEntry).
- Delete() checks whether the record's PreviousAddress points below BeginAddress. If not, the record must be left in place: it may be the deletion marker for an earlier record for the same key, and removing it from the tag chain would allow that earlier record to be reached erroneously.
- If both checks pass, the record is elided from the tag chain by CASing its .PreviousAddress into the HashBucketEntry.
- If the elision succeeds, the record is added onto the freelist. This Seals it, in case other threads are traversing the tag chain.
- If the Add fails, it will be due to the bin being full. Rather than lose the record entirely, Tsavorite attempts to re-insert it as a deleted record.

Upsert and RMW operations that perform RCU check, before doing the CAS, whether the source record is in the HashBucketEntry. If so, and all other conditions apply, the source record is freelisted as for Delete (except that no attempt is made to re-insert it if the Add to the freelist fails).

Upsert() and RMW() will try to revivify a freelist record if they must create a new record:
- They call TryTakeFreeRecord to remove a record from the freelist.
- TryTakeFreeRecord initializes the record, calling DisposeForRevivification as described in Maintaining Extra Value Length.

FreeList revivification requires that ConcurrencyControlMode not be ConcurrencyControlMode.None; otherwise the tag chain can be unstable, with the first record in the tag chain potentially removed and reused by revivification while a thread is tracing back from it:
- Thread1: Get the first record address from the HashBucketEntry
- Thread2: Delete and elide the first record, putting it on the revivification FreeList
- Thread3: Revivify that record with a different key, setting its .PreviousAddress
- Thread1: Follow the former .PreviousAddress, landing on an entirely different tag chain.
This is only possible for the first record in the tag chain (the tail-most record, in the HashBucketEntry); we do not elide records in the middle of the tag chain.
For ConcurrencyControlMode.LockTable, because the only current LockTable implementation is via the HashBuckets and locking at the HashBucket level is higher than the tag chain, we get a stable tag chain almost for free. The cost is that the HashBucket must be locked before calling TracebackForKeyMatch.
### FreeRecordPool Design

The FreeList hierarchy consists of:

- `FreeRecordPool`, which maintains the bins, deciding which bin should be used for Enqueue and Dequeue.
- `FreeRecordBin`s, one for each RevivificationBin in the RevivificationSettings, with the corresponding number of records, max record size, and best-fit scan specification.
- `FreeRecord`s, the individual free-record entries.

Each bin is a separate allocation, so the pool uses a size index: a separate cache-aligned vector of ints holding the bin sizes, which is searched sequentially to determine the correct bin. This avoids pulling in each bin's separate cache line just to check its size.
#### FreeRecord Design

A FreeRecord contains a free log address and the epoch in which it was freed; its address word is a single `long` containing:

- the address, with the same number of bits as RecordInfo.PreviousAddress
- the size of the freed record, in the remaining bits

Because this word is only a `long`, we have atomic compare-and-exchange when setting and taking the address.
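A sketch of this single-word packing, assuming a 48-bit address (as in RecordInfo.PreviousAddress) with the size in the remaining bits; the epoch word is omitted, and the names and layout are illustrative, not Tsavorite's:

```csharp
using System;
using System.Threading;

class FreeRecordWordSketch
{
    const int AddressBits = 48;
    const long AddressMask = (1L << AddressBits) - 1;

    static long Pack(long address, long size) => (size << AddressBits) | (address & AddressMask);
    static long UnpackAddress(long word) => word & AddressMask;
    static long UnpackSize(long word) => (long)((ulong)word >> AddressBits);

    static long slot;  // one FreeRecord slot in a bin

    // Because the word is a single long, taking it is one atomic CAS:
    // swap the expected word for 0 (empty).
    static bool TryTake(long expected, out long address)
    {
        address = UnpackAddress(expected);
        return Interlocked.CompareExchange(ref slot, 0, expected) == expected;
    }

    static void Main()
    {
        long word = Pack(0x1234, 64);
        slot = word;
        Console.WriteLine(UnpackAddress(word));  // 4660
        Console.WriteLine(UnpackSize(word));     // 64
        Console.WriteLine(TryTake(word, out _)); // True
        Console.WriteLine(TryTake(word, out _)); // False: already taken
    }
}
```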
#### FreeRecordBin Design

The records of a bin are of sizes between [previous bin's max record size + 8] and [current bin's max record size]. As a shortcut we refer to bins by their max record size; i.e. "the 64-byte bin" means "the bin whose min record size is 8 more than the max size of the previous bin, and whose max record size is 64."
Each bin has a cache-aligned vector of FreeRecords that operates as a circular buffer. Unlike FIFO queues, we do not maintain read/write pointers.
We wish to obtain a solution that has only one interlock on read and write operations, does not iterate the entire segment, and makes a best effort at best-fit. To accomplish this we use a strategy of segmenting the bin by record sizes.
For varlen Key and Value types, we segment bins based on the range of sizes at 8-byte alignment.
First, the max number of records in the bin is rounded up to a cache-aligned size, and may be further rounded up to ensure that each segment starts on a cache boundary. FreeRecords are 16 bytes, as stated above, so there are 4 per cache line. However, 4 elements is too small for a segment, so we require a minimum segment size of 8 (2 cache lines). This is a defined constant, MinSegmentSize, so it can be modified if desired.
We calculate bin segments by evenly dividing the number of records in the bin by the range of possible 8-byte-aligned record sizes in the bin, with a minimum of two cache lines (8 items) per segment. Segments do not have read/write pointers; each is just a start index into the bin's record array, calculated on the fly from the record size (if Adding) or the requested record size (if Taking).
Here are three examples of determining the size ranges (the bin definitions determine the exact numbers):

- In the first, the record count divides evenly into segments of at least MinSegmentSize, and there are 1032 total elements in the bin; the three segments start at indexes 0, 344, and 688.
- In the second, the record count is rounded up to the MinSegmentSize-aligned 256, and therefore the segments start at 0, 256, 512, and 768.
- In the third, an even division would yield segments smaller than MinSegmentSize, so instead we set the bin's segment size to MinSegmentSize and divide the bin record count by that to get the number of segments (rounded up to an integer), and then set the segment size increment to the size range divided by the segment count. We therefore have 16 segments, and thus we divide the size range by 16 to get the record size ranges for the segments. The number of records is the segment size multiplied by the segment count.
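The size-to-segment mapping can be sketched as follows. This is a simplified model of the division described above (it omits the cache-line rounding), with hypothetical helper names:

```csharp
using System;

class BinSegmentSketch
{
    const int MinSegmentSize = 8;  // two cache lines of 16-byte FreeRecords

    // For a bin holding 8-byte-aligned sizes in (minSize, maxSize] with
    // recordCount entries, compute where the segment for a given size starts.
    static int SegmentStart(int size, int minSize, int maxSize, int recordCount)
    {
        int sizeRange = (maxSize - minSize) / 8;       // distinct sizes in the bin
        int segments = Math.Max(1, Math.Min(sizeRange, recordCount / MinSegmentSize));
        int segmentSize = recordCount / segments;
        int sizeIndex = (size - minSize - 1) / 8;      // 0-based index of this size
        int segmentIndex = sizeIndex * segments / sizeRange;
        return segmentIndex * segmentSize;             // Add/Take wrap forward from here
    }

    static void Main()
    {
        // A 64-record bin covering sizes 72..128 (the "128 bin" following a
        // "64 bin"): 8 distinct sizes => 8 segments of 8 records each.
        Console.WriteLine(SegmentStart(72, 64, 128, 64));   // 0
        Console.WriteLine(SegmentStart(128, 64, 128, 64));  // 56
    }
}
```

Adding uses the record's actual size to pick the starting segment; Taking uses the requested size and scans forward (wrapping) from there, which is what makes overflow land in the next-higher size segment.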
An alternative to size-based partitioning would be to use the thread id, as LightEpoch does. However, for threadId-based partitioning there is no organization to the size; we could have a random distribution of sizes. With size-based partitioning, we will much more likely land on the size (or close to it) that we want, and overflowing will put us into the next-highest record-size partition half the time, potentially giving us near-best-fit with minimal effort. Additionally, size-based partitioning makes best-fit search for a request easier.
As the names imply, we have the option of first-fit or best-fit to satisfy a request. We allow both, via RevivificationBin.BestFitScanLimit:
- `UseFirstFit`: The value is zero, so we do not scan; when a record is requested from the bin, it returns the first one that has a size >= the requested size, an addedEpoch that is safe to reclaim (per SafeToReclaimEpoch), and an address >= the minAddress passed to the Take call.
- `BestFitScanAll`: The value is int.MaxValue, so we scan the entire bin (possibly wrapping), keeping track of the best fit, then try to return that record (which may fail if another thread took it first, in which case we retry). If at any point there is an exact fit, the bin returns that record immediately.
- Any other value: operates like `BestFitScanAll`, except that it limits the number of records to scan.

For fixed-length Key and Value datatypes there is only one bin and of course no size-based partitioning. In this case we could use thread-id-based partitioning (as could any sufficiently large partition) to determine the index within the partition to start iterating from. However, threadId-based partitioning suffers from the possibility that writers write to different partitions than the ones readers read from; in the worst case readers must wrap all the way around through all partitions to get to the records. This may be offset by reduced cache-line ping-ponging between processor caches if the first 4 records of the partition are repeatedly updated by different threads, but this has not been tried.
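The BestFitScanLimit behaviors above can be sketched over a flat array of record sizes (0 meaning an empty slot). This illustrative model omits the epoch, minAddress, and wrap-around handling of the real bin:

```csharp
using System;

class BestFitScanSketch
{
    // 0 => first fit; int.MaxValue => scan all; otherwise scan that many
    // entries after finding the first fit. Returns the chosen index, or -1.
    static int Take(int[] sizes, int requested, int bestFitScanLimit)
    {
        int bestIndex = -1, scanned = 0;
        for (int i = 0; i < sizes.Length; i++)
        {
            // Count entries examined after the first fit; stop at the limit.
            if (bestIndex >= 0 && ++scanned > bestFitScanLimit) break;
            if (sizes[i] < requested) continue;                 // empty or too small
            if (sizes[i] == requested) return i;                // exact fit: stop early
            if (bestIndex < 0 || sizes[i] < sizes[bestIndex]) bestIndex = i;
        }
        return bestIndex;
    }

    static void Main()
    {
        int[] sizes = { 40, 96, 64, 48 };
        Console.WriteLine(Take(sizes, 48, 0));            // 1: first fit (96)
        Console.WriteLine(Take(sizes, 48, 1));            // 2: better fit (64) within the limit
        Console.WriteLine(Take(sizes, 48, int.MaxValue)); // 3: exact fit (48) found scanning all
    }
}
```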
We do not want to maintain a per-operation counter of records in the FreeRecordPool or each FreeRecordBin, because this would be an additional interlock per Add or Take. Because of this, we cannot have an "accurate at all times" indication of whether the pool or a bin is empty. However, checking empty bins is expensive, so we want to optimize this.
There are numerous "performance vs. accuracy" trade-offs in setting one or more coarse-grained flags indicating whether the pool and/or individual bins are empty. The approach selected is to maintain bin-level isEmpty flags.
The work of iterating bins to set isEmpty is done by the CheckEmptyWorker class, which uses a separate on-demand Task. This Task runs a loop that scans the bins and updates each bin's isEmpty flag.

CheckEmptyWorker has a Start method that is called by FreeRecordPool Add or Take. This Start method checks whether the CheckEmptyWorker has been started, and starts it if not.
Because this is a persistent Task, it has a CancellationTokenSource that is signaled by CheckEmptyWorker.Dispose() when we are shutting down the TsavoriteKV.
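The on-demand, start-once, cancellable worker pattern described above can be sketched as follows. The real CheckEmptyWorker's loop body scans bins and maintains their isEmpty flags; here it is a placeholder, and the details are assumptions:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CheckEmptyWorkerSketch
{
    static int started;  // 0 = not started, 1 = started
    static readonly CancellationTokenSource cts = new CancellationTokenSource();

    // Called by FreeRecordPool Add or Take; only the first call launches the Task.
    public static bool Start()
    {
        if (Interlocked.CompareExchange(ref started, 1, 0) != 0)
            return false;  // already started
        Task.Run(async () =>
        {
            while (!cts.IsCancellationRequested)
            {
                // ... iterate bins and set isEmpty flags here ...
                await Task.Delay(10).ConfigureAwait(false);
            }
        });
        return true;
    }

    // Signaled when the TsavoriteKV is shutting down.
    public static void Dispose() => cts.Cancel();

    static void Main()
    {
        Console.WriteLine(Start());  // True: this call launched the worker
        Console.WriteLine(Start());  // False: already running
        Dispose();
    }
}
```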
### Adding and Taking Records

For non-variable-length types, the record size is fixed, so the FreeRecordPool has only a single bin. Otherwise, it has the full range of variable-length bins.
When Adding a record to the pool:
- The record is passed to FreeRecordPool.TryAdd, which selects the appropriate bin and issues the bin-level TryAdd call. TryAdd then returns true.
- If the record cannot be added (e.g. the bin is full), TryAdd returns false.

When Taking a record, the HashBucketEntry.Address is passed as a minimum required address. This is either a valid lower address or an invalid address (below BeginAddress). The operation then proceeds as:

- The request is passed to FreeRecordPool.TryTake, which selects the bin for the requested size and scans it as described above.
- If a suitable record is found, TryTake returns true.
- Otherwise, TryTake returns false.
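The pool-level bin selection, including the optional next-higher-bin search (SearchNextHigherBin), can be sketched as follows. Bin internals are reduced to queues of record sizes, and the epoch and minAddress checks are omitted; this is illustrative, not the Tsavorite implementation:

```csharp
using System;
using System.Collections.Generic;

class FreeRecordPoolSketch
{
    // The size index: a separate vector of bin max-sizes, searched sequentially.
    static readonly int[] binSizeIndex = { 64, 128, 256 };
    static readonly Queue<int>[] bins = { new Queue<int>(), new Queue<int>(), new Queue<int>() };

    static bool TryAdd(int recordSize)
    {
        for (int i = 0; i < binSizeIndex.Length; i++)
            if (recordSize <= binSizeIndex[i]) { bins[i].Enqueue(recordSize); return true; }
        return false;  // larger than the largest bin
    }

    static bool TryTake(int requestedSize, bool searchNextHigherBin, out int size)
    {
        size = 0;
        for (int i = 0; i < binSizeIndex.Length; i++)
        {
            if (requestedSize > binSizeIndex[i]) continue;   // bin's max size is too small
            if (bins[i].Count > 0 && bins[i].Peek() >= requestedSize)
            {
                size = bins[i].Dequeue();
                return true;
            }
            if (!searchNextHigherBin) return false;          // only the initial bin is searched
            searchNextHigherBin = false;                     // search at most one higher bin
        }
        return false;
    }

    static void Main()
    {
        TryAdd(100);                                          // lands in the 128 bin
        Console.WriteLine(TryTake(40, false, out _));         // False: the 64 bin is empty
        Console.WriteLine(TryTake(40, true, out int s));      // True: satisfied from the 128 bin
        Console.WriteLine(s);                                 // 100
    }
}
```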