Replace procedural finalize_and_sanitize with declarative incompatibility system by xingbowang · Pull Request #14516 · facebook/rocksdb

xingbowang · 2026-03-27T14:26:15Z

Summary

Replaces the ~500-line procedural finalize_and_sanitize() in tools/db_crashtest.py with a declarative Feature Requirements System that handles flag priority correctly when users pass --extra_flags to the stress test runner.

Problem

The old finalize_and_sanitize() was a large if/else chain with implicit ordering dependencies. When users passed conflicting flags via --extra_flags, the outcome depended on which branch ran last — not on which flag was explicitly requested.

Solution

A declarative engine with three priority rules:

Explicit + Explicit conflict: Two features both forced via --extra_flags that contradict each other → print a clear error and exit(1).
Explicit beats random: Explicitly forced feature wins over a randomly enabled conflicting feature; the random one is disabled.
Random vs random: 50/50 deterministic tiebreak (stable hash of feature pair), ensuring both paths get test coverage.
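
The three priority rules above can be sketched as a single decision function. This is a minimal illustration, not the code from the PR: `pick_feature_to_disable`, `ExplicitConflictError`, and `alphabetical_tiebreak` are hypothetical names, and the sketch raises an exception where the PR prints an error and calls exit(1).

```python
class ExplicitConflictError(Exception):
    """Two features that were both explicitly requested contradict each other."""


def alphabetical_tiebreak(feature_a, feature_b):
    # Placeholder tiebreak for this sketch only: the later name loses.
    # The PR describes a stable hash of the feature pair instead.
    return max(feature_a, feature_b)


def pick_feature_to_disable(feature_a, feature_b, explicit_features,
                            tiebreak=alphabetical_tiebreak):
    """Return the name of the feature that should be disabled."""
    a_explicit = feature_a in explicit_features
    b_explicit = feature_b in explicit_features
    if a_explicit and b_explicit:
        # Rule 1: both were forced by the user, so surface the conflict.
        raise ExplicitConflictError(
            f"{feature_a} conflicts with {feature_b}; both were explicitly requested"
        )
    if a_explicit:
        # Rule 2: the explicit feature wins, the random one is disabled.
        return feature_b
    if b_explicit:
        return feature_a
    # Rule 3: neither was explicit, so defer to the tiebreak.
    return tiebreak(feature_a, feature_b)
```

Passing the tiebreak as a parameter keeps the provenance logic separate from the question of how random-vs-random conflicts are settled.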

Key changes

FEATURE_REQUIREMENTS dict: each feature declares active_when, disable_self, and requires (the param values it needs). Conflicts are auto-detected by the engine — no need to enumerate every pair.
INCOMPATIBILITY_RULES list: one-way consequence rules (e.g. disable_wal=1 → atomic_flush=1). These propagate effects in a fixed-point loop.
finalize_and_sanitize(src_params, explicit_keys=None): new signature accepts the set of param names that came from --extra_flags, enabling priority-based conflict resolution.
gen_cmd() updated to extract explicit_keys from unknown_params and pass them through.
tools/fuzz_convergence.py and tools/test_db_crashtest.py added: regression tests for convergence (no oscillation), idempotency, and known conflict scenarios.
Fix: use_optimistic_txn=1 now enforces max_write_buffer_size_to_maintain >= write_buffer_size (required for OCC conflict detection against old memtable data).
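
To make the fixed-point idea concrete, here is a rough sketch of applying (when, then) rules until nothing changes. It assumes rules are pairs of a predicate and a dict of values to set; `apply_until_stable` and `example_dependent_flag` are hypothetical names used only for illustration, and the sketch raises RuntimeError where the PR prints to stderr and calls sys.exit(1).

```python
def apply_until_stable(params, rules, max_iterations=20):
    """Apply one-way (when, then) rules repeatedly until params stop changing."""
    params = dict(params)  # work on a copy so the caller's dict is untouched
    for _ in range(max_iterations):
        before = dict(params)
        for when, then in rules:
            if when(params):
                params.update(then)
        if params == before:
            return params  # converged: a full pass changed nothing
    raise RuntimeError(f"rules did not converge after {max_iterations} iterations")


EXAMPLE_RULES = [
    # Hypothetical dependent rule, listed first so that it only fires on
    # the second pass; this shows why a single pass is not enough.
    (lambda p: p.get("atomic_flush") == 1, {"example_dependent_flag": 0}),
    # The consequence named in the PR description: disable_wal=1 -> atomic_flush=1.
    (lambda p: p.get("disable_wal") == 1, {"atomic_flush": 1}),
]

result = apply_until_stable({"disable_wal": 1}, EXAMPLE_RULES)
```

In this example the first pass sets atomic_flush, the second pass sets the dependent flag, and the third pass confirms that nothing else changes.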

Test coverage

17 unit tests in tools/test_db_crashtest.py covering convergence, idempotency, and known conflicts
100,000-iteration fuzz convergence test (50 workers): 0 oscillations
All existing crash test modes pass
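
As a rough illustration of the idempotency property those tests check, the sketch below feeds random parameter sets through a sanitizer and counts cases where sanitizing twice differs from sanitizing once. `toy_sanitize` and `count_idempotency_failures` are stand-ins invented for this example; the real fuzzer in tools/fuzz_convergence.py may be structured quite differently.

```python
import random


def toy_sanitize(params):
    # Stand-in for finalize_and_sanitize with a single consequence rule.
    out = dict(params)
    if out["disable_wal"] == 1:
        out["atomic_flush"] = 1
    return out


def count_idempotency_failures(sanitize, trials, seed=0):
    """Count trials where sanitize(sanitize(p)) != sanitize(p)."""
    rng = random.Random(seed)  # seeded so any failure can be reproduced
    failures = 0
    for _ in range(trials):
        params = {
            "disable_wal": rng.randint(0, 1),
            "atomic_flush": rng.randint(0, 1),
        }
        once = sanitize(params)
        twice = sanitize(once)
        if once != twice:
            failures += 1
    return failures


failures = count_idempotency_failures(toy_sanitize, trials=1000)
```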

github-actions · 2026-03-27T14:27:40Z

✅ clang-tidy: No findings on changed lines

Completed in 0.0s.

tools/db_crashtest.py

xingbowang · 2026-03-27T14:58:52Z

Addressed — both non-convergence warnings.warn sites replaced with sys.exit(1) + stderr error message. Also added in the same push:

Maintenance guide comments on FEATURE_REQUIREMENTS and INCOMPATIBILITY_RULES explaining how to add new rules
make check now runs test_db_crashtest.py: always on Linux (local + CI), always locally on other platforms, skipped on non-Linux CI

meta-codesync · 2026-03-27T15:32:05Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98495218.

xingbowang · 2026-03-30T14:32:05Z

/claude-review

hx235 · 2026-03-31T22:30:19Z

Test coverage

@xingbowang do you have some data on how many iterations are needed to converge the sanitization on a randomly generated command like in our stress/crash test CI? Just want to ensure we are not often stuck in exceeding max iteration.

hx235 · 2026-03-31T22:38:54Z

tools/db_crashtest.py

-        dest_params["allow_resumption_one_in"] = 0
+
+
+def _apply_declarative_rules(dest_params, max_iterations=20):


not called?

Full review from codex

Dead-path ambiguity (most important)
Evidence: advisory comment says _apply_declarative_rules exists but is not called, while rules are already applied in _apply_feature_requirements Phase 3.

Why it matters: two apparent “rule application” paths confuse maintainers and can cause bugs if someone updates the wrong one.

Suggestion:

remove _apply_declarative_rules if unused,
keep _apply_feature_requirements as single source of truth,
fix any error string referencing the wrong helper.

Removed the unused alternate path and kept _apply_feature_requirements() as the single rule-application entry point. Fixed the convergence error text so it points at the right helper.

hx235 · 2026-03-31T22:40:41Z

tools/db_crashtest.py

+# When a conflict is detected, resolution depends on provenance:
+#   - Both features explicitly forced via --extra_flags: exit(1) with error
+#   - One explicit, one random: explicit feature wins; random feature disabled
+#   - Both random: 50/50 random choice of which feature to disable


Codex review:

Random-vs-random wording mismatch
Evidence: comments describe “50/50 random”, but code uses stable hash of feature pair to choose loser deterministically.

Why it matters: behavior is deterministic per pair, not stochastic per run.

Suggestion:

either update wording to “deterministic per-pair tiebreak,”
or switch implementation to real randomness if that was intended.

Agreed. The implementation is deterministic per feature pair, not stochastic per run. Updated the comments to say "deterministic per-pair tiebreak" so the wording matches the code.
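
For illustration, a deterministic per-pair tiebreak could look like the sketch below. It is an assumption about the general technique, not the PR's `_pick_random_conflict_loser`: the function name and the choice of SHA-256 are mine. A cryptographic digest is used because Python's built-in hash() on strings is salted per process by default, so it would not give the same answer across runs.

```python
import hashlib


def pick_conflict_loser_sketch(feature_a, feature_b):
    """Deterministically pick which of two conflicting features to disable."""
    # Sort first so the result does not depend on argument order.
    first, second = sorted((feature_a, feature_b))
    digest = hashlib.sha256(f"{first}|{second}".encode("utf-8")).digest()
    # The low bit of the first digest byte selects the loser.
    return (first, second)[digest[0] % 2]
```

The same pair always yields the same loser, while different pairs are split roughly evenly between the two sides, which is how both paths still get coverage across the feature set.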

hx235 · 2026-03-31T22:57:59Z

tools/db_crashtest.py

+        # Should never happen in practice, but safety valve
+        print(
+            f"ERROR: finalize_and_sanitize did not converge after"
+            f" {max_iterations} iterations in _apply_special_rules()."


Why does the message say "_apply_special_rules()"? I thought the function containing this statement is a different one.

I fixed it so the non-convergence error now references _apply_feature_requirements(), which is the actual fixed-point loop containing the logic.

hx235 · 2026-03-31T23:00:53Z

tools/db_crashtest.py

+    dest_params = {k: v() if callable(v) else v for (k, v) in src_params.items()}
+
+    # Special rules first (external deps, cache computation, compression manager)
+    _apply_special_rules(dest_params)


Is it possible for the later steps to effectively overwrite _apply_special_rules result? Is there any test to verify it does not happen?

The special rules run in Phase 0 on every fixed-point iteration, so later phases cannot permanently override them. I also added/kept regression coverage that checks the final sanitized config still reflects those special rules, e.g. unsupported direct I/O stays disabled and remote DB paths keep blob direct write disabled through the full loop.

hx235 · 2026-03-31T23:03:06Z

tools/db_crashtest.py

    return dest_params


+INCOMPATIBILITY_RULES = [


I had some trouble understanding why INCOMPATIBILITY_RULES can't be part of FEATURE_REQUIREMENTS ... can we sync offline?

I clarified this in the maintenance comments. The short version is:

FEATURE_REQUIREMENTS is for mutual feature conflicts where there is a clear "feature A vs feature B" relationship and a meaningful disable_self(). E.g. blob_direct_write vs best_efforts_recovery: they are peer features that want incompatible recovery/WAL settings, so one feature must lose.

CONSEQUENCE_RULES (the earlier draft called these INCOMPATIBILITY_RULES) is for one-way normalization like disable_wal -> atomic_flush/sync/reopen adjustments or optimistic-txn write-buffer maintenance. Those are not two features fighting over ownership of one parameter, so forcing them into FEATURE_REQUIREMENTS would make the model less clear. E.g. when disable_wal=1, atomic_flush, sync, reopen, etc. are normalized. That is not two features fighting; it is a one-way consequence.
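
The difference between the two structures can be sketched as follows. The field names active_when, disable_self, and requires come from the PR description; everything else (feature_a, feature_b, shared_param, and the helper functions) is hypothetical and only meant to show the shape of each structure.

```python
# Peer features: each declares what it needs. A conflict exists when two
# active features require different values for the same parameter.
FEATURE_REQUIREMENTS_SKETCH = {
    "feature_a": {
        "active_when": lambda p: p.get("feature_a") == 1,
        "disable_self": lambda p: p.update(feature_a=0),
        "requires": {"shared_param": 0},
    },
    "feature_b": {
        "active_when": lambda p: p.get("feature_b") == 1,
        "disable_self": lambda p: p.update(feature_b=0),
        "requires": {"shared_param": 1},
    },
}


def find_conflicts(params, requirements):
    """Return pairs of active features whose requires dicts disagree."""
    active = [n for n, spec in requirements.items() if spec["active_when"](params)]
    conflicts = []
    for i, a in enumerate(active):
        for b in active[i + 1:]:
            req_a, req_b = requirements[a]["requires"], requirements[b]["requires"]
            if any(k in req_b and req_b[k] != v for k, v in req_a.items()):
                conflicts.append((a, b))
    return conflicts


def optimistic_txn_consequence(params):
    """One-way consequence: no peer feature to lose, just a normalization."""
    if params.get("use_optimistic_txn") == 1:
        params["max_write_buffer_size_to_maintain"] = max(
            params.get("max_write_buffer_size_to_maintain", 0),
            params["write_buffer_size"],
        )
```

In the first structure something has to be disabled to resolve the disagreement; in the second, a parameter is simply adjusted and nothing competes with it.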

…lity system

Replaces the old procedural finalize_and_sanitize() with a data-driven approach. Incompatibility rules are declared as data, then applied in a fixed-point loop to handle transitive dependencies.

Key changes:
- INCOMPATIBILITY_RULES: declarative list of (when, then) rules covering all the same incompatibilities as the old procedural code
- Fixed-point loop: applies rules repeatedly until convergence, handling transitive dependencies automatically
- 10M-trial fuzz test (fuzz_convergence.py) confirms 0 oscillations

Also adds:
- tools/fuzz_convergence.py: convergence fuzzer for finalize_and_sanitize
- tools/test_db_crashtest.py: unit tests for the new system

Test:
python3 tools/fuzz_convergence.py 10000000 100
Trials: 10,000,000 | Oscillations: 0 | Elapsed: 1243.1s

…e_buffer_size

…t.py

Summary:

CONTEXT: New engineers adding stress test incompatibility rules need clear guidance on which data structure to use (FEATURE_REQUIREMENTS vs INCOMPATIBILITY_RULES) and how to add entries correctly. The test file also wasn't wired into make check.

WHAT:
- Add "Maintenance Guide" comment blocks at the top of both FEATURE_REQUIREMENTS and INCOMPATIBILITY_RULES in db_crashtest.py, with when-to-use guidance, field docs, and worked examples.
- Wire test_db_crashtest.py into `make check` with conditional logic: always on Linux, local-only on other platforms (skip non-Linux CI).

Test Plan:
- make -n check (dry run, no syntax errors)
- python3 -m unittest discover -s tools -p 'test_db_crashtest.py': Ran 17 tests in 21.225s, OK

xingbowang · 2026-04-06T20:41:51Z

<< @xingbowang do you have some data on how many iterations are needed to converge the sanitization on a randomly generated command like in our stress/crash test CI? Just want to ensure we are not often stuck in exceeding max iteration.

Less than 10. I don't think our config has a very long chain of transitive dependency.

meta-codesync · 2026-04-06T21:56:34Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98495218.

github-actions · 2026-04-06T22:07:21Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit e5d1072

Code Review: Replace procedural finalize_and_sanitize with declarative incompatibility system

PR: Replace procedural finalize_and_sanitize with declarative incompatibility system
Author: xingbowang
Files changed: 7 (2063 insertions, 576 deletions)

Executive Summary

This PR replaces the ~591-line procedural finalize_and_sanitize() with a declarative Feature Requirements System. It also changes gen_cmd() to merge --extra_flags before sanitization (instead of appending after), and adds unit tests + a fuzz convergence test. The approach is architecturally sound and solves a real problem (flag priority with --extra_flags).

Recommendation: Approve with minor revisions.

Findings

HIGH

F4: Behavioral Change — Extra Flags Now Sanitized
Previously, --extra_flags bypassed sanitization via gflags last-one-wins. Now they're sanitized. This is the correct fix (prevents C++ assertion failures), but should be explicitly documented as a behavioral change. Consider logging when a user-specified flag is overridden.

F5: New sys.exit(1) for Explicit+Explicit Conflicts
Two explicitly-forced conflicting features now cause exit(1) instead of silently passing conflicting flags. Correct behavior, but a new failure mode that needs documentation.

F6: Type Conversion in Unknown Parameter Parsing (Medium)
--flag=true/--flag=false remain strings but some params expect int 0/1. --flag= (empty) stays empty string. Consider handling boolean string values.
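
One possible way to handle F6 is sketched below. This is only an illustration of the suggestion: `coerce_flag_value` and `parse_extra_flag` are hypothetical helpers, and how gen_cmd() actually parses unknown parameters is not shown in this thread.

```python
def coerce_flag_value(raw):
    """Convert a raw flag string into an int, float, or string."""
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        # Map boolean words onto the 0/1 ints that many params expect.
        return 1 if lowered == "true" else 0
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    # Anything else, including the empty string, is left as a string.
    return raw


def parse_extra_flag(arg):
    """Split '--name=value' into (name, coerced_value)."""
    name, _, raw = arg.lstrip("-").partition("=")
    return name, coerce_flag_value(raw)
```

Whether an empty value should stay an empty string or be rejected outright is a policy choice this sketch leaves open.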

MEDIUM

F7: Exclusion Set Params Merged But Then Dropped
If --duration=100 is passed via extra flags, it's merged and sanitized but then excluded from the command. Low practical risk since these are Python orchestration params.

F10: random.choice in Fixed-Point Loop
Both _apply_special_rules (compression_type) and multiscan_shape_adjustments (prefetch memory) use random.choice() inside the loop. Both are properly guarded — the condition that triggers them becomes false after the first evaluation, preventing re-evaluation. Defended.

LOW / SUGGESTIONS

F12: Complexity Trade-off — The ~1200 lines (vs 591) are justified by: solving --extra_flags priority, auto-conflict detection, and convergence tests as safety nets. The maintenance guide comments are excellent.

F14: Whitespace-Only Changes — db/db_iter.h, blob_file_partition_strategy.h, perf_context.h should ideally be in a separate commit.

F15: New OCC Rule — The optimistic_txn_write_buffer_maintain consequence rule is a new bug fix, not just refactoring. Should be called out in PR description and release notes.

F16: fuzz_convergence.py Shebang — The #!/usr/bin/env python3 shebang is placed AFTER the copyright header, making it ineffective. Move it to line 1.

F13: Makefile Conditional — The CI_TOTAL_SHARDS guard is correct (always run on Linux, skip non-Linux CI).

Verified Correct (Initial Concerns Dismissed)

enable_blob_direct_write: 0 IS present in best_efforts_recovery requires
use_multi_cf_iterator: 0 IS present in txn_non_write_committed requires
Percentage redistribution guards prevent double-counting in the fixed-point loop
UDT+BER+disable_wal atomic_flush interaction is explicitly and correctly handled
The _pick_random_conflict_loser hash is deterministic and adequate for ~10 features

Action Items

Document behavioral change (F4, F5) in PR description
Fix shebang in fuzz_convergence.py (F16)
Consider boolean string handling in type conversion (F6)
Split whitespace changes into separate commit (F14)
Call out OCC rule as a new bug fix (F15)

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

meta-cla bot added the CLA Signed label Mar 27, 2026

xingbowang added a commit to xingbowang/rocksdb that referenced this pull request Mar 28, 2026

Apply PR facebook#14516 declarative incompatibility system + BDW rules

2e7c09f

hx235 reviewed Mar 31, 2026

xingbowang added 3 commits April 2, 2026 10:46

Fix optimistic_txn: enforce max_write_buffer_size_to_maintain >= writ…

4f68465

…e_buffer_size

xingbowang added 2 commits April 6, 2026 13:52

Refine declarative crashtest rule handling

57b9509

Merge upstream/main and reconcile crashtest sanitizer

7763467

xingbowang force-pushed the 2026_03_26_stress_compatibility branch from ad02bde to 7763467 Compare April 6, 2026 21:08

missed some files

e5d1072

xingbowang added 6 commits April 6, 2026 15:54

Document crashtest explicit flag sanitization

5ae9cf7

Prioritize explicit crashtest flags

36e5984

Fix UDT crashtest test fixture

8781fd8

Merge upstream/main and sync crashtest incompatibilities

9de92d1

refine

a8ccc5d

Clarify crashtest rule categories

27525e2
