Replace procedural finalize_and_sanitize with declarative incompatibility system #14516

Open
xingbowang wants to merge 12 commits into facebook:main from xingbowang:2026_03_26_stress_compatibility

@xingbowang
Contributor

Summary

Replaces the ~500-line procedural finalize_and_sanitize() in tools/db_crashtest.py with a declarative Feature Requirements System that handles flag priority correctly when users pass --extra_flags to the stress test runner.

Problem

The old finalize_and_sanitize() was a large if/else chain with implicit ordering dependencies. When users passed conflicting flags via --extra_flags, the outcome depended on which branch ran last — not on which flag was explicitly requested.

Solution

A declarative engine with three priority rules:

  1. Explicit + Explicit conflict: Two features both forced via --extra_flags that contradict each other → print a clear error and exit(1).
  2. Explicit beats random: Explicitly forced feature wins over a randomly enabled conflicting feature; the random one is disabled.
  3. Random vs random: 50/50 deterministic tiebreak (stable hash of feature pair), ensuring both paths get test coverage.
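
The three rules above can be sketched as a single decision function. This is a minimal illustration, not the code in tools/db_crashtest.py: the name `pick_loser`, the `explicit_features` argument, and the use of SHA-256 for the tiebreak are all assumptions made for the example.

```python
import hashlib
import sys


def pick_loser(feature_a, feature_b, explicit_features):
    """Return the feature to disable when feature_a and feature_b conflict."""
    a_explicit = feature_a in explicit_features
    b_explicit = feature_b in explicit_features

    # Rule 1: both were forced by the user, so there is no safe choice.
    if a_explicit and b_explicit:
        print(
            f"ERROR: {feature_a} and {feature_b} were both requested via "
            "--extra_flags but are incompatible.",
            file=sys.stderr,
        )
        sys.exit(1)

    # Rule 2: the explicit feature wins; the random one is disabled.
    if a_explicit:
        return feature_b
    if b_explicit:
        return feature_a

    # Rule 3: both random. Hash the sorted pair so the choice is stable
    # across runs and does not depend on argument order.
    first, second = sorted((feature_a, feature_b))
    digest = hashlib.sha256(f"{first}|{second}".encode()).digest()
    return first if digest[0] % 2 == 0 else second
```

For example, `pick_loser("blob_direct_write", "best_efforts_recovery", {"blob_direct_write"})` returns `"best_efforts_recovery"`, because only the first feature was requested explicitly.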

Key changes

  • FEATURE_REQUIREMENTS dict: each feature declares active_when, disable_self, and requires (the param values it needs). Conflicts are auto-detected by the engine — no need to enumerate every pair.
  • INCOMPATIBILITY_RULES list: one-way consequence rules (e.g. disable_wal=1 → atomic_flush=1). These propagate effects in a fixed-point loop.
  • finalize_and_sanitize(src_params, explicit_keys=None): new signature accepts the set of param names that came from --extra_flags, enabling priority-based conflict resolution.
  • gen_cmd() updated to extract explicit_keys from unknown_params and pass them through.
  • tools/fuzz_convergence.py and tools/test_db_crashtest.py added: regression tests for convergence (no oscillation), idempotency, and known conflict scenarios.
  • Fix: use_optimistic_txn=1 now enforces max_write_buffer_size_to_maintain >= write_buffer_size (required for OCC conflict detection against old memtable data).
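
As a rough sketch of how the explicit keys could be separated from the rest of the parameters before sanitization: the helper name `split_extra_flags` is hypothetical, and the sketch assumes the unknown params arrive as a list of `--key=value` strings. The actual parsing in gen_cmd() may differ.

```python
def split_extra_flags(unknown_args):
    """Turn ['--disable_wal=1', ...] into (overrides, explicit_keys)."""
    overrides = {}
    for arg in unknown_args:
        if not arg.startswith("--") or "=" not in arg:
            continue  # This sketch ignores anything not shaped like --key=value.
        key, value = arg[2:].split("=", 1)
        # Keep integers as ints so checks like params["disable_wal"] == 1 work.
        try:
            overrides[key] = int(value)
        except ValueError:
            overrides[key] = value
    return overrides, set(overrides)


src_params = {"disable_wal": 0, "atomic_flush": 0}
overrides, explicit_keys = split_extra_flags(["--disable_wal=1", "--db=/tmp/x"])
src_params.update(overrides)
# finalize_and_sanitize(src_params, explicit_keys) would be called at this point.
```

Merging the overrides before sanitization is what lets the engine know which values must be preserved and which may be adjusted.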

Test coverage

  • 17 unit tests in tools/test_db_crashtest.py covering convergence, idempotency, and known conflicts
  • 100,000-iteration fuzz convergence test (50 workers): 0 oscillations
  • All existing crash test modes pass

@meta-cla meta-cla bot added the CLA Signed label Mar 27, 2026
@github-actions

github-actions bot commented Mar 27, 2026

✅ clang-tidy: No findings on changed lines

Completed in 0.0s.

@xingbowang
Contributor Author

Addressed — both non-convergence warnings.warn sites replaced with sys.exit(1) + stderr error message. Also added in the same push:

  • Maintenance guide comments on FEATURE_REQUIREMENTS and INCOMPATIBILITY_RULES explaining how to add new rules
  • make check now runs test_db_crashtest.py: always on Linux (local + CI), always locally on other platforms, skipped on non-Linux CI

@meta-codesync

meta-codesync bot commented Mar 27, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98495218.

xingbowang added a commit to xingbowang/rocksdb that referenced this pull request Mar 28, 2026
@xingbowang
Contributor Author

/claude-review

@hx235
Contributor

hx235 commented Mar 31, 2026

Test coverage

@xingbowang do you have some data on how many iterations are needed for the sanitization to converge on a randomly generated command like the ones in our stress/crash test CI? Just want to ensure we are not often hitting the max-iteration limit.

dest_params["allow_resumption_one_in"] = 0


def _apply_declarative_rules(dest_params, max_iterations=20):
Contributor

not called?

Contributor

Full review from codex

Dead-path ambiguity (most important)
Evidence: advisory comment says _apply_declarative_rules exists but is not called, while rules are already applied in _apply_feature_requirements Phase 3.

Why it matters: two apparent “rule application” paths confuse maintainers and can cause bugs if someone updates the wrong one.

Suggestion:

remove _apply_declarative_rules if unused,
keep _apply_feature_requirements as single source of truth,
fix any error string referencing the wrong helper.

Contributor Author

Removed the unused alternate path and kept _apply_feature_requirements() as the single rule-application entry point. Fixed the convergence error text so it points at the right helper.

# When a conflict is detected, resolution depends on provenance:
# - Both features explicitly forced via --extra_flags: exit(1) with error
# - One explicit, one random: explicit feature wins; random feature disabled
# - Both random: 50/50 random choice of which feature to disable
Contributor

Codex review:

  1. Random-vs-random wording mismatch
    Evidence: comments describe “50/50 random”, but code uses stable hash of feature pair to choose loser deterministically.

Why it matters: behavior is deterministic per pair, not stochastic per run.

Suggestion:

either update wording to “deterministic per-pair tiebreak,”
or switch implementation to real randomness if that was intended.

Contributor Author

Agreed. The implementation is deterministic per feature pair, not stochastic per run. Updated the comments to say "deterministic per-pair tiebreak" so the wording matches the code.
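
To illustrate what "deterministic per-pair tiebreak" means in practice, here is a small sketch. The function name `stable_loser` and the choice of SHA-256 are assumptions for the example; the real `_pick_random_conflict_loser` is not shown in this thread. A content hash is used rather than Python's built-in `hash()`, because `hash()` on strings is salted per process unless `PYTHONHASHSEED` is fixed.

```python
import hashlib
import itertools


def stable_loser(feature_a, feature_b):
    """Pick the same loser for a given pair on every run, in either order."""
    first, second = sorted((feature_a, feature_b))
    digest = hashlib.sha256(f"{first}|{second}".encode()).digest()
    return first if digest[0] % 2 == 0 else second


# Each pair always resolves the same way, but different pairs land on
# different sides, so both outcomes are exercised across the feature set.
features = [f"feature_{i}" for i in range(10)]  # hypothetical feature names
pairs = list(itertools.combinations(features, 2))
first_named_loses = sum(stable_loser(a, b) == a for a, b in pairs)
print(f"{first_named_loses} of {len(pairs)} pairs disable the first-named feature")
```

The trade-off is that a given pair never exercises the other outcome unless one side is forced explicitly via --extra_flags.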

# Should never happen in practice, but safety valve
print(
f"ERROR: finalize_and_sanitize did not converge after"
f" {max_iterations} iterations in _apply_special_rules()."
Contributor

@hx235 hx235 Mar 31, 2026

Why does this say "_apply_special_rules()"? I thought the function containing this statement was a different one.

Contributor Author

I fixed it so the non-convergence error now references _apply_feature_requirements(), which is the actual fixed-point loop containing the logic.

dest_params = {k: v() if callable(v) else v for (k, v) in src_params.items()}

# Special rules first (external deps, cache computation, compression manager)
_apply_special_rules(dest_params)
Contributor

Is it possible for the later steps to effectively overwrite _apply_special_rules result? Is there any test to verify it does not happen?

Contributor Author

The special rules run in Phase 0 on every fixed-point iteration, so later phases cannot permanently override them. I also added/kept regression coverage that checks the final sanitized config still reflects those special rules, e.g. unsupported direct I/O stays disabled and remote DB paths keep blob direct write disabled through the full loop.
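
A minimal sketch of the loop shape described here, using two hypothetical rules (the parameter names `remote_db` and `use_blob_cache`, and both rule bodies, are invented for illustration). A later rule that fires once flips a value the special rule owns; the next pass re-applies the special rule and the loop then settles.

```python
import sys


def apply_special_rules(params):
    # Phase 0 (hypothetical rule): remote DB paths cannot use blob direct write.
    if params["remote_db"]:
        params["enable_blob_direct_write"] = 0


def apply_later_rules(params):
    # Later phase (hypothetical rule): fires only while blob files are off.
    if params["use_blob_cache"] and not params["enable_blob_files"]:
        params["enable_blob_files"] = 1
        params["enable_blob_direct_write"] = 1


def sanitize(params, max_iterations=20):
    params = dict(params)
    for iteration in range(1, max_iterations + 1):
        before = dict(params)
        apply_special_rules(params)  # re-applied on every pass
        apply_later_rules(params)
        if params == before:  # fixed point reached
            return params, iteration
    print(f"ERROR: did not converge after {max_iterations} iterations",
          file=sys.stderr)
    sys.exit(1)


result, iterations = sanitize({
    "remote_db": 1,
    "use_blob_cache": 1,
    "enable_blob_files": 0,
    "enable_blob_direct_write": 0,
})
```

In this sketch the override is undone only because the later rule stops firing after its first application. A later rule that rewrote the same parameter unconditionally on every pass would still win within each pass, which is the kind of case the regression tests mentioned above are meant to catch.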

return dest_params


INCOMPATIBILITY_RULES = [
Contributor

I had some trouble understanding why INCOMPATIBILITY_RULES can't be part of FEATURE_REQUIREMENTS ... can we sync offline?

Contributor Author

I clarified this in the maintenance comments. The short version is:

  • FEATURE_REQUIREMENTS is for mutual feature conflicts where there is a clear "feature A vs feature B" relationship and a meaningful disable_self(). E.g. blob_direct_write vs best_efforts_recovery: they are peer features that want incompatible recovery/WAL settings, so one feature must lose.
  • CONSEQUENCE_RULES (the earlier draft called these INCOMPATIBILITY_RULES) is for one-way normalization like disable_wal -> atomic_flush/sync/reopen adjustments or optimistic-txn write-buffer maintenance. Those are not two features fighting over ownership of one parameter, so forcing them into FEATURE_REQUIREMENTS would make the model less clear. E.g. with disable_wal=1, atomic_flush, sync, reopen, etc. are normalized. That is not two features fighting; it is a one-way consequence.
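
The distinction can be sketched with two tiny tables. The entries below are simplified and the specific `requires` values are illustrative assumptions; only the field names (active_when, disable_self, requires) and the (when, then) rule shape come from the PR description. The helper names `find_conflicts` and `apply_consequences` are hypothetical.

```python
# Peer features: each declares what it needs; conflicts are derived, not listed.
FEATURE_REQUIREMENTS = {
    "blob_direct_write": {
        "active_when": lambda p: p.get("enable_blob_direct_write") == 1,
        "disable_self": lambda p: p.update(enable_blob_direct_write=0),
        "requires": {"disable_wal": 0},
    },
    "best_efforts_recovery": {
        "active_when": lambda p: p.get("best_efforts_recovery") == 1,
        "disable_self": lambda p: p.update(best_efforts_recovery=0),
        "requires": {"disable_wal": 1, "enable_blob_direct_write": 0},
    },
}

# One-way normalization: (when, then) pairs with no winner or loser.
CONSEQUENCE_RULES = [
    (lambda p: p.get("disable_wal") == 1, {"atomic_flush": 1, "sync": 0}),
]


def find_conflicts(params):
    """Active features conflict when they require different values of a param."""
    active = [n for n, f in FEATURE_REQUIREMENTS.items() if f["active_when"](params)]
    conflicts = []
    for i, a in enumerate(active):
        for b in active[i + 1:]:
            req_a = FEATURE_REQUIREMENTS[a]["requires"]
            req_b = FEATURE_REQUIREMENTS[b]["requires"]
            clashing = sorted(
                k for k in req_a.keys() & req_b.keys() if req_a[k] != req_b[k]
            )
            if clashing:
                conflicts.append((a, b, clashing))
    return conflicts


def apply_consequences(params):
    for when, then in CONSEQUENCE_RULES:
        if when(params):
            params.update(then)
```

With both features active, `find_conflicts` reports the pair along with the clashing parameter, and the engine then has to pick a loser. `apply_consequences` never needs to pick one; it only rewrites downstream values.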

…lity system

Replaces the old procedural finalize_and_sanitize() with a data-driven
approach. Incompatibility rules are declared as data, then applied in a
fixed-point loop to handle transitive dependencies.

Key changes:
- INCOMPATIBILITY_RULES: declarative list of (when, then) rules covering
  all the same incompatibilities as the old procedural code
- Fixed-point loop: applies rules repeatedly until convergence, handling
  transitive dependencies automatically
- 10M-trial fuzz test (fuzz_convergence.py) confirms 0 oscillations

Also adds:
- tools/fuzz_convergence.py: convergence fuzzer for finalize_and_sanitize
- tools/test_db_crashtest.py: unit tests for the new system

Test: python3 tools/fuzz_convergence.py 10000000 100
  Trials: 10,000,000 | Oscillations: 0 | Elapsed: 1243.1s
…t.py

Summary:
CONTEXT: New engineers adding stress test incompatibility rules need clear
guidance on which data structure to use (FEATURE_REQUIREMENTS vs
INCOMPATIBILITY_RULES) and how to add entries correctly. The test file
also wasn't wired into make check.

WHAT:
- Add "Maintenance Guide" comment blocks at the top of both
  FEATURE_REQUIREMENTS and INCOMPATIBILITY_RULES in db_crashtest.py,
  with when-to-use guidance, field docs, and worked examples.
- Wire test_db_crashtest.py into `make check` with conditional logic:
  always on Linux, local-only on other platforms (skip non-Linux CI).

Test Plan:
- make -n check (dry run, no syntax errors)
- python3 -m unittest discover -s tools -p 'test_db_crashtest.py':
  Ran 17 tests in 21.225s — OK
@xingbowang
Contributor Author

<< @xingbowang do you have some data on how many iterations are needed to converge the sanitization on a randomly generated command like in our stress/crash test CI? Just want to ensure we are not often stuck in exceeding max iteration.

Fewer than 10. I don't think our config has a very long chain of transitive dependencies.

@xingbowang xingbowang force-pushed the 2026_03_26_stress_compatibility branch from ad02bde to 7763467 Compare April 6, 2026 21:08
@meta-codesync

meta-codesync bot commented Apr 6, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98495218.

@github-actions

github-actions bot commented Apr 6, 2026

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit e5d1072


Code Review: Replace procedural finalize_and_sanitize with declarative incompatibility system

PR: Replace procedural finalize_and_sanitize with declarative incompatibility system
Author: xingbowang
Files changed: 7 (2063 insertions, 576 deletions)


Executive Summary

This PR replaces the ~591-line procedural finalize_and_sanitize() with a declarative Feature Requirements System. It also changes gen_cmd() to merge --extra_flags before sanitization (instead of appending after), and adds unit tests + a fuzz convergence test. The approach is architecturally sound and solves a real problem (flag priority with --extra_flags).

Recommendation: Approve with minor revisions.


Findings

HIGH

F4: Behavioral Change — Extra Flags Now Sanitized
Previously, --extra_flags bypassed sanitization via gflags last-one-wins. Now they're sanitized. This is the correct fix (prevents C++ assertion failures), but should be explicitly documented as a behavioral change. Consider logging when a user-specified flag is overridden.

F5: New sys.exit(1) for Explicit+Explicit Conflicts
Two explicitly-forced conflicting features now cause exit(1) instead of silently passing conflicting flags. Correct behavior, but a new failure mode that needs documentation.

F6: Type Conversion in Unknown Parameter Parsing (Medium)
--flag=true/--flag=false remain strings but some params expect int 0/1. --flag= (empty) stays empty string. Consider handling boolean string values.
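
One possible shape for the boolean handling suggested in F6, as a sketch only: the helper name `coerce_flag_value` is hypothetical, and mapping true/false to 1/0 assumes the affected params are compared against integers during sanitization.

```python
def coerce_flag_value(raw):
    """Convert a --key=value string into the type sanitization expects."""
    lowered = raw.strip().lower()
    # Map boolean spellings onto the 0/1 integers used by the param tables.
    if lowered in ("true", "false"):
        return 1 if lowered == "true" else 0
    try:
        return int(raw)
    except ValueError:
        # Paths, names, floats, and empty strings are passed through unchanged.
        return raw
```

An empty value (`--flag=`) still comes back as an empty string here; whether that should be rejected or treated as a default is a separate decision.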

MEDIUM

F7: Exclusion Set Params Merged But Then Dropped
If --duration=100 is passed via extra flags, it's merged and sanitized but then excluded from the command. Low practical risk since these are Python orchestration params.

F10: random.choice in Fixed-Point Loop
Both _apply_special_rules (compression_type) and multiscan_shape_adjustments (prefetch memory) use random.choice() inside the loop. Both are properly guarded — the condition that triggers them becomes false after the first evaluation, preventing re-evaluation. Defended.

LOW / SUGGESTIONS

F12: Complexity Trade-off — The ~1200 lines (vs 591) are justified by: solving --extra_flags priority, auto-conflict detection, and convergence tests as safety nets. The maintenance guide comments are excellent.

F14: Whitespace-Only Changes — db/db_iter.h, blob_file_partition_strategy.h, perf_context.h should ideally be in a separate commit.

F15: New OCC Rule — The optimistic_txn_write_buffer_maintain consequence rule is a new bug fix, not just refactoring. Should be called out in PR description and release notes.

F16: fuzz_convergence.py Shebang — The #!/usr/bin/env python3 shebang is placed AFTER the copyright header, making it ineffective. Move it to line 1.

F13: Makefile Conditional — The CI_TOTAL_SHARDS guard is correct (always run on Linux, skip non-Linux CI).

Verified Correct (Initial Concerns Dismissed)

  • enable_blob_direct_write: 0 IS present in best_efforts_recovery requires
  • use_multi_cf_iterator: 0 IS present in txn_non_write_committed requires
  • Percentage redistribution guards prevent double-counting in the fixed-point loop
  • UDT+BER+disable_wal atomic_flush interaction is explicitly and correctly handled
  • The _pick_random_conflict_loser hash is deterministic and adequate for ~10 features

Action Items

  1. Document behavioral change (F4, F5) in PR description
  2. Fix shebang in fuzz_convergence.py (F16)
  3. Consider boolean string handling in type conversion (F6)
  4. Split whitespace changes into separate commit (F14)
  5. Call out OCC rule as a new bug fix (F15)

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
