Skip to content

Replace RefreshSafeTail with epoch table based safe tail tracking [dev]#1720

Draft
badrishc wants to merge 19 commits intodevfrom
badrishc/lightepoch-fold-safetail
Draft

Replace RefreshSafeTail with epoch table based safe tail tracking [dev]#1720
badrishc wants to merge 19 commits intodevfrom
badrishc/lightepoch-fold-safetail

Conversation

@badrishc
Copy link
Copy Markdown
Collaborator

No description provided.

@badrishc badrishc changed the title Badrishc/lightepoch fold safetail Replace RefreshSafeTail with epoch table based safe tail tracking [dev] Apr 21, 2026
@badrishc badrishc marked this pull request as ready for review April 22, 2026 01:13
Copilot AI review requested due to automatic review settings April 22, 2026 01:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the previous “background refresh / refresh-frequency” SafeTail mechanism in TsavoriteLog with a lazy, epoch-table-based SafeTail computation (via LightEpoch user-word slots), and removes the associated tuning knobs from settings/options across the repo.

Changes:

  • Introduces LightEpoch “user-word” slots and uses them in TsavoriteLog to track per-thread in-flight enqueue lower bounds for SafeTail computation.
  • Removes SafeTailRefreshFrequencyMs and related Garnet config/options, updating AOF/pubsub construction accordingly.
  • Updates log-scan iterators and tests to use RefreshSafeTailAddress() and cached SafeTail reads instead of relying on periodic refresh.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
libs/storage/Tsavorite/cs/test/LogTests.cs Updates tests to wait using RefreshSafeTailAddress() and removes deprecated settings usage.
libs/storage/Tsavorite/cs/test/LogScanTests.cs Removes SafeTailRefreshFrequencyMs from test log settings and refreshes safe tail via new API.
libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogSettings.cs Removes SafeTailRefreshFrequencyMs setting.
libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogScanSingleIterator.cs Adds per-iterator backoff around safe-tail refresh scans for single-iterator waits.
libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogScanIterator.cs Uses cached SafeTailAddress on hot paths; refreshes safe tail in uncommitted wait path.
libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs Implements epoch-user-word-based SafeTail caching/refresh and replaces old callback/worker refresh logic.
libs/storage/Tsavorite/cs/src/core/Epochs/LightEpoch.cs Adds user-word slot allocation, per-thread access, and min-fold scanning APIs.
libs/server/Servers/GarnetServerOptions.cs Removes AOF replication refresh-frequency option and stops passing it into TsavoriteLogSettings.
libs/server/PubSub/SubscribeBroker.cs Removes subscriber refresh-frequency constructor parameter and setting usage.
libs/host/defaults.conf Removes removed refresh-frequency configuration keys.
libs/host/GarnetServer.cs Updates SubscribeBroker construction to new signature (no refresh-frequency).
libs/host/Configuration/Options.cs Removes CLI options for AOF/subscriber refresh-frequency and corresponding mapping.
libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs Renames and simplifies safe-tail page-shift callback usage for fast truncation.
benchmark/BDN.benchmark/Embedded/EmbeddedRespServer.cs Updates SubscribeBroker construction to new signature.
Comments suppressed due to low confidence (1)

libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogScanSingleIterator.cs:92

  • The backoff gate can cause the iterator to wait even though data is already available: if scanBackoffUntil is in the future (after previous failed scans) and a producer enqueues a single record then stops, the onEnqueue signal wakes the iterator but it may skip RefreshSafeTailAddress() due to backoff and then await onEnqueue again, potentially blocking forever. Reset scanBackoffUntil (and/or bypass backoff) on enqueue signals and when resetting scanBackoffTicks to 0 so a wake-up always leads to at least one safe-tail refresh attempt.
                // Fast check: cached SafeTailAddress may already be ahead (e.g., another iterator
                // or the multi-iterator pre-refresh in NotifyParkedWaiters advanced it).
                if (this.NextAddress < this.tsavoriteLog.SafeTailAddress)
                {
                    scanBackoffTicks = 0; // data available — reset backoff
                    return true;
                }

                // Scan the epoch table only if the backoff interval has elapsed. This limits scan
                // frequency when the consumer is caught up and the producer is slow, capping at one
                // scan per 10ms in the worst case (matching the old baseline refresh frequency).
                if (Stopwatch.GetTimestamp() >= scanBackoffUntil)
                {
                    if (this.NextAddress < this.tsavoriteLog.RefreshSafeTailAddress())
                    {
                        scanBackoffTicks = 0; // scan advanced past NextAddress — reset backoff
                        return true;
                    }

                    // Scan didn't help — double the backoff interval (starting from ~100μs).
                    scanBackoffTicks = scanBackoffTicks == 0
                        ? Stopwatch.Frequency / 10_000  // initial: ~100μs
                        : Math.Min(scanBackoffTicks * 2, MaxBackoffTicks);
                    scanBackoffUntil = Stopwatch.GetTimestamp() + scanBackoffTicks;
                }

                // Ignore refresh-uncommitted exceptions, except when the token is signaled
                await onEnqueue.WaitAsync().ConfigureAwait(false);
            }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs Outdated
Comment thread libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs
Comment thread libs/storage/Tsavorite/cs/src/core/Epochs/LightEpoch.cs
Comment thread libs/storage/Tsavorite/cs/src/core/Epochs/LightEpoch.cs Outdated
@badrishc badrishc force-pushed the badrishc/lightepoch-fold-safetail branch 3 times, most recently from 603d52f to c4855e4 Compare April 23, 2026 23:38
Copilot and others added 18 commits April 24, 2026 12:35
…ducers

Replaces the TsavoriteLog SafeTailAddress background-worker/bump design and its
SafeTailRefreshFrequencyMs / AofReplicationRefreshFrequencyMs tuning knob with a
per-thread in-flight publish protocol, using a new general-purpose user-word
primitive on LightEpoch.

LightEpoch:
- Extend Entry (still 64-byte cache line) with 6 per-thread long user-word slots.
- New API: AllocateUserWord/ReleaseUserWord, ThisThreadUserWord,
  ForEachUserWord<TVisitor> (struct visitor for JIT-specialized folds), and
  ILightEpochUserWordVisitor interface.
- Acquire/Release reset allocated user words to their configured initial values.

TsavoriteLog:
- Every enqueue publishes a conservative lower bound into its thread's slot
  before the allocator FAA, tightens it to the exact slot start after the FAA,
  and clears it (long.MaxValue sentinel) after the payload write.
- SafeTailAddress is now a monotonic cached property. RefreshSafeTailAddress()
  computes min(TailAddress, min-over-in-flight) via a struct-visitor scan,
  advances the cache monotonically, and fires SafeTailShiftCallback on advance.
- Iterators call RefreshSafeTailAddress() when the cached value is behind, so
  scanUncommitted no longer requires a refresh-frequency setting.
- TrueDispose releases the user-word slot; background worker and auto-reset
  events are gone.

Removed:
- TsavoriteLogSettings.SafeTailRefreshFrequencyMs
- GarnetServerOptions.AofReplicationRefreshFrequencyMs, aof-refresh-freq option
  and defaults.conf entry
- DoAutoRefreshSafeTailAddress, SafeTailRefreshWorker,
  PeriodicRefreshSafeTailAddressBumpCallback, associated CTS / events / tasks.
- scanUncommitted guard that required SafeTailRefreshFrequencyMs >= 0.

SubscribeBroker retains its subscriberRefreshFrequencyMs ctor parameter for
API compatibility (now unused). Tests updated to call RefreshSafeTailAddress()
where they previously polled the SafeTailAddress field expecting background
advancement.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Producers publish their in-flight slot via a release store (BeginInflightEnqueue)
and then advance TailAddress via an interlocked FAA. The reader must observe
these two effects in order: if it sees the new tail, it must also see the
producer's in-flight publish.

The previous code read the inflight column first and the tail second. On
reordering-friendly architectures (and even with in-order x86 loads interleaved
with concurrent producer stores), a reader could observe the stale inflight
value (pre-BeginInflightEnqueue) while also seeing the fresh tail (post-FAA),
incorrectly advancing SafeTailAddress past an in-flight slot whose payload has
not yet been written. This manifested as replica AOF sync lag in cluster
replication when a replica reconnected after primary had populated new data.

Fix: read TailAddress first, issue an Interlocked.MemoryBarrier(), then scan
the inflight column. With this ordering plus the fence, observing a fresh tail
guarantees observing at least as recent an inflight column state, preventing
the reader from over-advancing SafeTailAddress.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-callback regression

- Rename field SafeTailShiftCallback -> SafeTailPageShiftCallback to reflect
  its page-granular semantics.
- Move the old AofTaskStore-side page-diff filter into TsavoriteLog via
  MaybeInvokePageShiftCallback (monotonic CAS on lastPublishedSafeTailPage),
  so concurrent advances fire the callback at most once per page transition.
- Add producer-side page-boundary drive: when an enqueue crosses into a new
  page, the CAS winner calls RefreshSafeTailAddress. This keeps the callback
  progressing with no active iterators (the previous design silently stalled
  when there were no replicas / subscribers driving scans).
- Remove dead logPageSizeBits/logPageSizeMask fields from AofTaskStore.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wake parked iterators from AllocateBlock retry-wait path: Suspend resets
  the user-word slot, which can raise min(inflight) — call NotifyParkedWaiters
  so readers do not wait until the next enqueue.
- Invoke SafeTailPageShiftCallback outside epoch protection (as in the old
  background-refresh design). Callers that are epoch-protected (e.g.,
  producer-drive from EndInflightEnqueue) now Suspend/Resume around the
  callback so it can safely re-enter Tsavorite APIs.
- Reset monotonic page trackers (lastPublishedSafeTailPage,
  lastProducerObservedPage) on Reset() and Initialize() so that post-recovery
  producer drive and callback dispatch re-arm from the new floor.
- Serialize LightEpoch.AllocateUserWord / ReleaseUserWord under a lock; the
  previous CAS-first-write-later pattern could race two concurrent allocators
  to write different initial values into the same column before the loser
  retried, leaving the winner with the loser's initial value.
- Document ReleaseUserWord's quiescence requirement.
- Remove SubscriberRefreshFrequencyMs config option / plumbing: the new
  per-enqueue publish protocol makes refresh-frequency obsolete for pub/sub
  as well, matching the earlier removal of AofReplicationRefreshFrequencyMs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetNext's scan-uncommitted boundary check was calling RefreshSafeTailAddress()
on every iteration — an O(kTableSize) epoch-table scan per record consumed.
This caused severe regression at low thread counts with an active consumer.
Revert to reading the cached SafeTailAddress (O(1)) in GetNext. The scan is
deferred to the WaitAsync / SlowWaitUncommittedAsync slow path, which only
triggers when the iterator actually catches up to the cache boundary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add LightEpoch.GetMinUserWord(wordIndex) that walks the epoch table via raw
pointer arithmetic instead of going through the generic ForEachUserWord<T>
visitor pattern. The direct scan avoids interface dispatch overhead and gives
the JIT a simpler loop to optimize.

RefreshSafeTailAddress now calls GetMinUserWord instead of instantiating a
MinVisitor struct and calling ForEachUserWord. Remove the now-unused
MinVisitor struct from TsavoriteLog. The generic ForEachUserWord and
ILightEpochUserWordVisitor are retained in LightEpoch as general-purpose API.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The visitor pattern is fully replaced by GetMinUserWord. Remove the unused
interface and generic method to keep the API surface minimal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When more than one ScanSingle iterator exists, NotifyParkedWaiters now calls
RefreshSafeTailAddress before signaling so all woken iterators see the fresh
cache and skip their own redundant scans. This makes the scan cost O(1)
instead of O(N_iterators) per wake cycle in the multi-replica case.

The activeSingleIteratorCount is a stale-tolerant hint read via Volatile.Read:
a stale value of 1 (actually 2) causes one redundant scan in the iterator;
a stale value of 2 (actually 1) causes one extra scan in the producer. Both
are harmless — correctness is maintained because each iterator falls back to
its own RefreshSafeTailAddress if the cache is still stale.

With a single iterator (the common 1-replica case), no scan happens in
NotifyParkedWaiters — the scan cost stays on the iterator's WaitAsync path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BeginInflightEnqueue publishes allocator.GetTailAddress() which is always ≤
the eventual slot start (TailAddress is monotonic, FAA can only increase it).
The CommitInflightEnqueue step that tightened the publish to the exact slot
address was an unnecessary Volatile.Write on the hot path — the conservative
lower bound from Begin is sufficient for correctness.

The only effect of removing Commit is that SafeTailAddress may lag by up to
O(N_threads × entry_size) bytes more than before, because concurrent threads'
Begin publishes can overlap. This is negligible compared to the page-level
granularity of downstream consumers (replication, pub/sub, truncation).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When TailAddress <= cached SafeTailAddress, no new records have been allocated
so scanning the inflight column cannot yield a higher SafeTailAddress. Return
the cached value immediately, avoiding the O(kTableSize) epoch-table scan.

This eliminates scans in two common cases:
- Consumer caught up with no active producers (tail not moving)
- Consumer wakes from signal but another consumer already refreshed the cache

The fast path is a single volatile read of TailAddress + comparison — no
memory barrier, no epoch-table walk.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the consumer catches up and RefreshSafeTailAddress fails to advance past
NextAddress, back off exponentially (100μs → 200μs → ... → 10ms cap) before
scanning again. Reset immediately when data becomes available (either via
cached SafeTailAddress or a successful scan).

Also defer the RefreshSafeTailAddress scan from WaitAsync's fast path into
SlowWaitUncommittedAsync so the backoff logic controls all scan decisions.
WaitAsync now only checks the cached SafeTailAddress (O(1)).

This reduces scan frequency when the consumer is caught up and the producer
is slow, matching the old baseline's ~100 Hz scan rate in the steady state
while preserving immediate response when records flow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ease reset

AllocateUserWord now uses CAS on userWordMask to claim a slot, then the CAS
winner exclusively initializes the column. No lock object needed.

Remove ResetAllocatedUserWords from the Acquire and Release hot paths. The
application (TsavoriteLog) is responsible for writing its own initial value
via BeginInflightEnqueue/EndInflightEnqueue — LightEpoch does not need to
reset slots on every epoch entry recycle. This removes a mask read + bit
iteration + Volatile.Write per slot from every Resume/Suspend call.

Deleted: userWordSlotLock, userWordInitialValues array,
ResetAllocatedUserWords method.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The JIT was failing to inline EndInflightEnqueue into TryEnqueue because
NotifyParkedWaiters' body (foreach, Interlocked.CompareExchange,
RefreshSafeTailAddress call) exceeded the IL size budget for forced inlining.
This pushed the entire enqueue hot path out-of-line, adding ~3ns per call.

Split NotifyParkedWaiters into:
- Fast path (AggressiveInlining): two null checks + conditional branch.
  In the NoCons/no-iterator case both are null, so the JIT emits just two
  loads and a branch — trivially inlinable.
- Slow path (NoInlining): the actual wake logic, only called when iterators
  or TCS waiters are present.

Benchmark shows 1T NoCons recovering from ~15 Mops to ~18.6 Mops, matching
the dev baseline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…offset

1. AllocateBlock / AllocateBlock<TEpochAccessor>: call EndInflightEnqueue()
   before epoch.Suspend() on allocation failure. LightEpoch no longer auto-
   resets user words on Release, so the stale BeginInflightEnqueue publish
   would pin min(inflight) and prevent SafeTailAddress from advancing.
   Fix stale comments referencing the removed ResetAllocatedUserWords.

2. GetMinUserWord: derive base address from UserWordRef(1, wordIndex) instead
   of hardcoding the byte offset (+16) into Entry. Resilient to layout changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetNext was using only the cached SafeTailAddress, which could remain stale
when no consumer-side RefreshSafeTailAddress had run yet. This caused the
iterator to return false prematurely, and since no scan happened in GetNext,
replication consumers would hang waiting for records that were already safe
to read.

Fix: check the cached value first (O(1)). Only call RefreshSafeTailAddress
when the current address has caught up to the cached boundary. This preserves
the fast path (no scan while the cache is ahead) while ensuring correctness
when the cache is stale.

Also restore RefreshSafeTailAddress in WaitAsync's fast path for the same
reason — the cache-only check could miss records and enter SlowWait
unnecessarily.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the iterator catches up to the cached SafeTailAddress, spin briefly
(SpinWait(100) ≈ a few μs) before scanning the epoch table. This lets
concurrent producers complete in-flight writes and advance TailAddress,
so the subsequent RefreshSafeTailAddress scan jumps the cache forward by
many records. The next batch of GetNext calls then use the O(1) cached
path, amortizing the scan cost to ~1ns/record.

Benchmark: 1T WithConsumer improves from ~2.6 Mops (immediate scan) to
~8.0 Mops (spin-wait), matching the dev@10ms baseline (7.7 Mops).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move BeginInflightEnqueue() right after epoch.Resume() and wrap the enqueue
body in try/finally so that EndInflightEnqueue() + epoch.Suspend() always
execute, even if payload serialization, SetHeader, or commitNum checks throw.

Without this, an exception between Begin and End would leave the inflight
slot pinned at a low address, permanently preventing SafeTailAddress from
advancing past that point and stalling all uncommitted scan iterators.

15 methods fixed: 9 TryEnqueue variants (moved Begin up, wrapped in try/
finally) and 6 void Enqueue variants using AllocateBlock (wrapped post-
AllocateBlock payload writes in try, moved End+Suspend to finally).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/lightepoch-fold-safetail branch from 178c5b6 to c299a1f Compare April 24, 2026 19:36
@badrishc badrishc marked this pull request as draft April 24, 2026 19:37
…ssor overloads

1. MaybeInvokePageShiftCallback: catch and log callback exceptions so they
   cannot escape EndInflightEnqueue and skip epoch.Suspend() in producer
   cleanup paths. Callback is best-effort (AOF truncation).

2. TEpochAccessor Enqueue overloads (3 methods): move EndInflightEnqueue()
   from try body into finally block. If SerializeTo/CopyTo/SetHeader throws
   after AllocateBlock, the inflight slot is now always cleared.

3. AllocateBlock / AllocateBlock<TEpochAccessor>: wrap TryAllocateRetryNow
   in try/catch so that if the allocator throws, EndInflightEnqueue() is
   called before rethrowing. Prevents pinning min(inflight) on exceptional
   allocator paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants