Replace RefreshSafeTail with epoch table based safe tail tracking [dev]#1720
Draft
Replace RefreshSafeTail with epoch table based safe tail tracking [dev]#1720
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR replaces the previous “background refresh / refresh-frequency” SafeTail mechanism in TsavoriteLog with a lazy, epoch-table-based SafeTail computation (via LightEpoch user-word slots), and removes the associated tuning knobs from settings/options across the repo.
Changes:
- Introduces
LightEpoch“user-word” slots and uses them inTsavoriteLogto track per-thread in-flight enqueue lower bounds for SafeTail computation. - Removes
SafeTailRefreshFrequencyMsand related Garnet config/options, updating AOF/pubsub construction accordingly. - Updates log-scan iterators and tests to use
RefreshSafeTailAddress()and cached SafeTail reads instead of relying on periodic refresh.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| libs/storage/Tsavorite/cs/test/LogTests.cs | Updates tests to wait using RefreshSafeTailAddress() and removes deprecated settings usage. |
| libs/storage/Tsavorite/cs/test/LogScanTests.cs | Removes SafeTailRefreshFrequencyMs from test log settings and refreshes safe tail via new API. |
| libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogSettings.cs | Removes SafeTailRefreshFrequencyMs setting. |
| libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogScanSingleIterator.cs | Adds per-iterator backoff around safe-tail refresh scans for single-iterator waits. |
| libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogScanIterator.cs | Uses cached SafeTailAddress on hot paths; refreshes safe tail in uncommitted wait path. |
| libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs | Implements epoch-user-word-based SafeTail caching/refresh and replaces old callback/worker refresh logic. |
| libs/storage/Tsavorite/cs/src/core/Epochs/LightEpoch.cs | Adds user-word slot allocation, per-thread access, and min-fold scanning APIs. |
| libs/server/Servers/GarnetServerOptions.cs | Removes AOF replication refresh-frequency option and stops passing it into TsavoriteLogSettings. |
| libs/server/PubSub/SubscribeBroker.cs | Removes subscriber refresh-frequency constructor parameter and setting usage. |
| libs/host/defaults.conf | Removes removed refresh-frequency configuration keys. |
| libs/host/GarnetServer.cs | Updates SubscribeBroker construction to new signature (no refresh-frequency). |
| libs/host/Configuration/Options.cs | Removes CLI options for AOF/subscriber refresh-frequency and corresponding mapping. |
| libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs | Renames and simplifies safe-tail page-shift callback usage for fast truncation. |
| benchmark/BDN.benchmark/Embedded/EmbeddedRespServer.cs | Updates SubscribeBroker construction to new signature. |
Comments suppressed due to low confidence (1)
libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogScanSingleIterator.cs:92
- The backoff gate can cause the iterator to wait even though data is already available: if scanBackoffUntil is in the future (after previous failed scans) and a producer enqueues a single record then stops, the onEnqueue signal wakes the iterator but it may skip RefreshSafeTailAddress() due to backoff and then await onEnqueue again, potentially blocking forever. Reset scanBackoffUntil (and/or bypass backoff) on enqueue signals and when resetting scanBackoffTicks to 0 so a wake-up always leads to at least one safe-tail refresh attempt.
// Fast check: cached SafeTailAddress may already be ahead (e.g., another iterator
// or the multi-iterator pre-refresh in NotifyParkedWaiters advanced it).
if (this.NextAddress < this.tsavoriteLog.SafeTailAddress)
{
scanBackoffTicks = 0; // data available — reset backoff
return true;
}
// Scan the epoch table only if the backoff interval has elapsed. This limits scan
// frequency when the consumer is caught up and the producer is slow, capping at one
// scan per 10ms in the worst case (matching the old baseline refresh frequency).
if (Stopwatch.GetTimestamp() >= scanBackoffUntil)
{
if (this.NextAddress < this.tsavoriteLog.RefreshSafeTailAddress())
{
scanBackoffTicks = 0; // scan advanced past NextAddress — reset backoff
return true;
}
// Scan didn't help — double the backoff interval (starting from ~100μs).
scanBackoffTicks = scanBackoffTicks == 0
? Stopwatch.Frequency / 10_000 // initial: ~100μs
: Math.Min(scanBackoffTicks * 2, MaxBackoffTicks);
scanBackoffUntil = Stopwatch.GetTimestamp() + scanBackoffTicks;
}
// Ignore refresh-uncommitted exceptions, except when the token is signaled
await onEnqueue.WaitAsync().ConfigureAwait(false);
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
603d52f to
c4855e4
Compare
…ducers Replaces the TsavoriteLog SafeTailAddress background-worker/bump design and its SafeTailRefreshFrequencyMs / AofReplicationRefreshFrequencyMs tuning knob with a per-thread in-flight publish protocol, using a new general-purpose user-word primitive on LightEpoch. LightEpoch: - Extend Entry (still 64-byte cache line) with 6 per-thread long user-word slots. - New API: AllocateUserWord/ReleaseUserWord, ThisThreadUserWord, ForEachUserWord<TVisitor> (struct visitor for JIT-specialized folds), and ILightEpochUserWordVisitor interface. - Acquire/Release reset allocated user words to their configured initial values. TsavoriteLog: - Every enqueue publishes a conservative lower bound into its thread's slot before the allocator FAA, tightens it to the exact slot start after the FAA, and clears it (long.MaxValue sentinel) after the payload write. - SafeTailAddress is now a monotonic cached property. RefreshSafeTailAddress() computes min(TailAddress, min-over-in-flight) via a struct-visitor scan, advances the cache monotonically, and fires SafeTailShiftCallback on advance. - Iterators call RefreshSafeTailAddress() when the cached value is behind, so scanUncommitted no longer requires a refresh-frequency setting. - TrueDispose releases the user-word slot; background worker and auto-reset events are gone. Removed: - TsavoriteLogSettings.SafeTailRefreshFrequencyMs - GarnetServerOptions.AofReplicationRefreshFrequencyMs, aof-refresh-freq option and defaults.conf entry - DoAutoRefreshSafeTailAddress, SafeTailRefreshWorker, PeriodicRefreshSafeTailAddressBumpCallback, associated CTS / events / tasks. - scanUncommitted guard that required SafeTailRefreshFrequencyMs >= 0. SubscribeBroker retains its subscriberRefreshFrequencyMs ctor parameter for API compatibility (now unused). Tests updated to call RefreshSafeTailAddress() where they previously polled the SafeTailAddress field expecting background advancement. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Producers publish their in-flight slot via a release store (BeginInflightEnqueue) and then advance TailAddress via an interlocked FAA. The reader must observe these two effects in order: if it sees the new tail, it must also see the producer's in-flight publish. The previous code read the inflight column first and the tail second. On reordering-friendly architectures (and even with in-order x86 loads interleaved with concurrent producer stores), a reader could observe the stale inflight value (pre-BeginInflightEnqueue) while also seeing the fresh tail (post-FAA), incorrectly advancing SafeTailAddress past an in-flight slot whose payload has not yet been written. This manifested as replica AOF sync lag in cluster replication when a replica reconnected after primary had populated new data. Fix: read TailAddress first, issue an Interlocked.MemoryBarrier(), then scan the inflight column. With this ordering plus the fence, observing a fresh tail guarantees observing at least as recent an inflight column state, preventing the reader from over-advancing SafeTailAddress. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-callback regression - Rename field SafeTailShiftCallback -> SafeTailPageShiftCallback to reflect its page-granular semantics. - Move the old AofTaskStore-side page-diff filter into TsavoriteLog via MaybeInvokePageShiftCallback (monotonic CAS on lastPublishedSafeTailPage), so concurrent advances fire the callback at most once per page transition. - Add producer-side page-boundary drive: when an enqueue crosses into a new page, the CAS winner calls RefreshSafeTailAddress. This keeps the callback progressing with no active iterators (the previous design silently stalled when there were no replicas / subscribers driving scans). - Remove dead logPageSizeBits/logPageSizeMask fields from AofTaskStore. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wake parked iterators from AllocateBlock retry-wait path: Suspend resets the user-word slot, which can raise min(inflight) — call NotifyParkedWaiters so readers do not wait until the next enqueue. - Invoke SafeTailPageShiftCallback outside epoch protection (as in the old background-refresh design). Callers that are epoch-protected (e.g., producer-drive from EndInflightEnqueue) now Suspend/Resume around the callback so it can safely re-enter Tsavorite APIs. - Reset monotonic page trackers (lastPublishedSafeTailPage, lastProducerObservedPage) on Reset() and Initialize() so that post-recovery producer drive and callback dispatch re-arm from the new floor. - Serialize LightEpoch.AllocateUserWord / ReleaseUserWord under a lock; the previous CAS-first-write-later pattern could race two concurrent allocators to write different initial values into the same column before the loser retried, leaving the winner with the loser's initial value. - Document ReleaseUserWord's quiescence requirement. - Remove SubscriberRefreshFrequencyMs config option / plumbing: the new per-enqueue publish protocol makes refresh-frequency obsolete for pub/sub as well, matching the earlier removal of AofReplicationRefreshFrequencyMs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetNext's scan-uncommitted boundary check was calling RefreshSafeTailAddress() on every iteration — an O(kTableSize) epoch-table scan per record consumed. This caused severe regression at low thread counts with an active consumer. Revert to reading the cached SafeTailAddress (O(1)) in GetNext. The scan is deferred to the WaitAsync / SlowWaitUncommittedAsync slow path, which only triggers when the iterator actually catches up to the cache boundary. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add LightEpoch.GetMinUserWord(wordIndex) that walks the epoch table via raw pointer arithmetic instead of going through the generic ForEachUserWord<T> visitor pattern. The direct scan avoids interface dispatch overhead and gives the JIT a simpler loop to optimize. RefreshSafeTailAddress now calls GetMinUserWord instead of instantiating a MinVisitor struct and calling ForEachUserWord. Remove the now-unused MinVisitor struct from TsavoriteLog. The generic ForEachUserWord and ILightEpochUserWordVisitor are retained in LightEpoch as general-purpose API. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The visitor pattern is fully replaced by GetMinUserWord. Remove the unused interface and generic method to keep the API surface minimal. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When more than one ScanSingle iterator exists, NotifyParkedWaiters now calls RefreshSafeTailAddress before signaling so all woken iterators see the fresh cache and skip their own redundant scans. This makes the scan cost O(1) instead of O(N_iterators) per wake cycle in the multi-replica case. The activeSingleIteratorCount is a stale-tolerant hint read via Volatile.Read: a stale value of 1 (actually 2) causes one redundant scan in the iterator; a stale value of 2 (actually 1) causes one extra scan in the producer. Both are harmless — correctness is maintained because each iterator falls back to its own RefreshSafeTailAddress if the cache is still stale. With a single iterator (the common 1-replica case), no scan happens in NotifyParkedWaiters — the scan cost stays on the iterator's WaitAsync path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BeginInflightEnqueue publishes allocator.GetTailAddress() which is always ≤ the eventual slot start (TailAddress is monotonic, FAA can only increase it). The CommitInflightEnqueue step that tightened the publish to the exact slot address was an unnecessary Volatile.Write on the hot path — the conservative lower bound from Begin is sufficient for correctness. The only effect of removing Commit is that SafeTailAddress may lag by up to O(N_threads × entry_size) bytes more than before, because concurrent threads' Begin publishes can overlap. This is negligible compared to the page-level granularity of downstream consumers (replication, pub/sub, truncation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When TailAddress <= cached SafeTailAddress, no new records have been allocated so scanning the inflight column cannot yield a higher SafeTailAddress. Return the cached value immediately, avoiding the O(kTableSize) epoch-table scan. This eliminates scans in two common cases: - Consumer caught up with no active producers (tail not moving) - Consumer wakes from signal but another consumer already refreshed the cache The fast path is a single volatile read of TailAddress + comparison — no memory barrier, no epoch-table walk. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the consumer catches up and RefreshSafeTailAddress fails to advance past NextAddress, back off exponentially (100μs → 200μs → ... → 10ms cap) before scanning again. Reset immediately when data becomes available (either via cached SafeTailAddress or a successful scan). Also defer the RefreshSafeTailAddress scan from WaitAsync's fast path into SlowWaitUncommittedAsync so the backoff logic controls all scan decisions. WaitAsync now only checks the cached SafeTailAddress (O(1)). This reduces scan frequency when the consumer is caught up and the producer is slow, matching the old baseline's ~100 Hz scan rate in the steady state while preserving immediate response when records flow. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ease reset AllocateUserWord now uses CAS on userWordMask to claim a slot, then the CAS winner exclusively initializes the column. No lock object needed. Remove ResetAllocatedUserWords from the Acquire and Release hot paths. The application (TsavoriteLog) is responsible for writing its own initial value via BeginInflightEnqueue/EndInflightEnqueue — LightEpoch does not need to reset slots on every epoch entry recycle. This removes a mask read + bit iteration + Volatile.Write per slot from every Resume/Suspend call. Deleted: userWordSlotLock, userWordInitialValues array, ResetAllocatedUserWords method. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The JIT was failing to inline EndInflightEnqueue into TryEnqueue because NotifyParkedWaiters' body (foreach, Interlocked.CompareExchange, RefreshSafeTailAddress call) exceeded the IL size budget for forced inlining. This pushed the entire enqueue hot path out-of-line, adding ~3ns per call. Split NotifyParkedWaiters into: - Fast path (AggressiveInlining): two null checks + conditional branch. In the NoCons/no-iterator case both are null, so the JIT emits just two loads and a branch — trivially inlinable. - Slow path (NoInlining): the actual wake logic, only called when iterators or TCS waiters are present. Benchmark shows 1T NoCons recovering from ~15 Mops to ~18.6 Mops, matching the dev baseline. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…offset 1. AllocateBlock / AllocateBlock<TEpochAccessor>: call EndInflightEnqueue() before epoch.Suspend() on allocation failure. LightEpoch no longer auto- resets user words on Release, so the stale BeginInflightEnqueue publish would pin min(inflight) and prevent SafeTailAddress from advancing. Fix stale comments referencing the removed ResetAllocatedUserWords. 2. GetMinUserWord: derive base address from UserWordRef(1, wordIndex) instead of hardcoding the byte offset (+16) into Entry. Resilient to layout changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetNext was using only the cached SafeTailAddress, which could remain stale when no consumer-side RefreshSafeTailAddress had run yet. This caused the iterator to return false prematurely, and since no scan happened in GetNext, replication consumers would hang waiting for records that were already safe to read. Fix: check the cached value first (O(1)). Only call RefreshSafeTailAddress when the current address has caught up to the cached boundary. This preserves the fast path (no scan while the cache is ahead) while ensuring correctness when the cache is stale. Also restore RefreshSafeTailAddress in WaitAsync's fast path for the same reason — the cache-only check could miss records and enter SlowWait unnecessarily. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the iterator catches up to the cached SafeTailAddress, spin briefly (SpinWait(100) ≈ a few μs) before scanning the epoch table. This lets concurrent producers complete in-flight writes and advance TailAddress, so the subsequent RefreshSafeTailAddress scan jumps the cache forward by many records. The next batch of GetNext calls then use the O(1) cached path, amortizing the scan cost to ~1ns/record. Benchmark: 1T WithConsumer improves from ~2.6 Mops (immediate scan) to ~8.0 Mops (spin-wait), matching the dev@10ms baseline (7.7 Mops). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move BeginInflightEnqueue() right after epoch.Resume() and wrap the enqueue body in try/finally so that EndInflightEnqueue() + epoch.Suspend() always execute, even if payload serialization, SetHeader, or commitNum checks throw. Without this, an exception between Begin and End would leave the inflight slot pinned at a low address, permanently preventing SafeTailAddress from advancing past that point and stalling all uncommitted scan iterators. 15 methods fixed: 9 TryEnqueue variants (moved Begin up, wrapped in try/ finally) and 6 void Enqueue variants using AllocateBlock (wrapped post- AllocateBlock payload writes in try, moved End+Suspend to finally). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
178c5b6 to
c299a1f
Compare
…ssor overloads 1. MaybeInvokePageShiftCallback: catch and log callback exceptions so they cannot escape EndInflightEnqueue and skip epoch.Suspend() in producer cleanup paths. Callback is best-effort (AOF truncation). 2. TEpochAccessor Enqueue overloads (3 methods): move EndInflightEnqueue() from try body into finally block. If SerializeTo/CopyTo/SetHeader throws after AllocateBlock, the inflight slot is now always cleared. 3. AllocateBlock / AllocateBlock<TEpochAccessor>: wrap TryAllocateRetryNow in try/catch so that if the allocator throws, EndInflightEnqueue() is called before rethrowing. Prevents pinning min(inflight) on exceptional allocator paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.