Trace preserved-order writes at the recovery boundary #14588
xingbowang wants to merge 3 commits into facebook:main
Preserve-write-order tracing regressed after ec832ff moved trace emission earlier to reduce the WAL/trace gap. That placement is correct for WAL-backed writes because a successful WAL update is the recovery boundary, but it is wrong for disableWAL writes, which are not recoverable until the memtable path succeeds. A disableWAL WriteBatchWithIndex ingest can still fail in SwitchMemtable, leaving the trace with a write that never became visible in the DB.

Fix this by encoding tracing as a stage policy instead of scattering mode checks through each write path. Introduce stage-named helpers for tracing after WAL and after memtable, keep early tracing only for successful WAL-backed writes, delay disableWAL tracing until successful memtable publish, and skip non-memtable writers on the late tracing path. Also move unordered disableWAL tracing into UnorderedWriteMemtable so it actually runs after memtable success, and require the WAL status to be OK before the after-WAL helper emits anything.

Add a deterministic regression test for disableWAL IngestWriteBatchWithIndex that injects a WAL-creation failure in SwitchMemtable, verifies the failed write never becomes visible in the live DB, replays the trace into a fresh DB, and verifies replay also skips the failed write.

Tested:
- make db_test2 -j192
- timeout 60s ./db_test2 --gtest_filter='DBTest2.TracePreserveWriteOrderSkipsFailedWrite:DBTest2.TracePreserveWriteOrderSkipsFailedDisableWALWBWIIngest'
- COERCE_CONTEXT_SWITCH=1 timeout 60s ./db_test2 --gtest_filter='DBTest2.TracePreserveWriteOrderSkipsFailedDisableWALWBWIIngest' --gtest_repeat=5
- COERCE_CONTEXT_SWITCH=1 timeout 60s ./db_test2 --gtest_filter='DBTest2.TracePreserveWriteOrderSkipsFailedWrite' --gtest_repeat=5
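For readers who want the stage policy in concrete form, here is a minimal, self-contained sketch. It is not the code in this PR: `Status`, `Writer`, `Tracer`, `TraceAfterWAL`, `TraceAfterMemtable`, and `SimulateWrite` are hypothetical stand-ins chosen for illustration, and the WAL and memtable outcomes are injected by the caller rather than produced by a real write path. It only aims to show where each kind of write would be traced under the policy described above.

```cpp
#include <cassert>  // for the caller-side checks described in the note below
#include <string>
#include <vector>

// Hypothetical stand-in for a status type; not RocksDB's Status class.
class Status {
 public:
  static Status OK() { return Status(true); }
  static Status IOError() { return Status(false); }
  bool ok() const { return ok_; }

 private:
  explicit Status(bool ok) : ok_(ok) {}
  bool ok_;
};

// Hypothetical writer descriptor (assumed fields, for illustration only).
struct Writer {
  std::string batch;     // payload that would be written to the trace
  bool disable_wal;      // models a write issued with disableWAL set
  bool writes_memtable;  // false models a writer that skips the memtable
};

// Hypothetical tracer that simply records what was emitted, in order.
struct Tracer {
  std::vector<std::string> records;
  void Write(const std::string& batch) { records.push_back(batch); }
};

// Stage 1: after the WAL append. Emits only for WAL-backed writers whose
// WAL status is OK, because a successful WAL update is the recovery boundary.
void TraceAfterWAL(Tracer& tracer, const Writer& w, const Status& wal_status) {
  if (w.disable_wal || !wal_status.ok()) return;
  tracer.Write(w.batch);
}

// Stage 2: after memtable publish. Emits only for disableWAL writers that
// actually wrote to the memtable and succeeded.
void TraceAfterMemtable(Tracer& tracer, const Writer& w,
                        const Status& memtable_status) {
  if (!w.disable_wal || !w.writes_memtable || !memtable_status.ok()) return;
  tracer.Write(w.batch);
}

// Drives one write through both stages with caller-injected outcomes.
Status SimulateWrite(Tracer& tracer, const Writer& w, const Status& wal_status,
                     const Status& memtable_status) {
  if (!w.disable_wal) {
    TraceAfterWAL(tracer, w, wal_status);
    if (!wal_status.ok()) return wal_status;  // WAL failed: nothing traced
  }
  TraceAfterMemtable(tracer, w, memtable_status);
  return memtable_status;
}
```

In this sketch, a disableWAL writer whose memtable step fails leaves `tracer.records` unchanged, which is the property the regression test checks through replay. A WAL-backed writer is recorded as soon as its WAL status is OK, regardless of what happens later. A caller could assert on `tracer.records.size()` after each `SimulateWrite` call to confirm this.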
✅ clang-tidy: No findings on changed lines. Completed in 329.0s.
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D100056251.
✅ Claude Code Review
Auto-triggered after CI passed; reviewing commit a5d6148.

Code Review: Trace Preserved-Order Writes at the Recovery Boundary
Verdict: APPROVE with suggestions

Executive Summary
This PR correctly fixes a regression from ec832ff where disableWAL writes could be traced before memtable insertion succeeded, causing inconsistent trace replay. The two-stage tracing model (AfterWAL / AfterMemtable) is sound and aligns tracing with the correct recovery boundary for each write type. After multi-agent review (9 agents), debate, and manual adjudication: no critical correctness bugs found.

Findings (7 total: 0 Critical, 0 High, 2 Medium, 5 Low)
Verified Safe (dismissed after debate)
Conflicts: db/db_impl/db_impl_write.cc