Replace procedural finalize_and_sanitize with declarative incompatibility system#14516
Replace procedural finalize_and_sanitize with declarative incompatibility system#14516xingbowang wants to merge 12 commits intofacebook:mainfrom
Conversation
✅ clang-tidy: No findings on changed linesCompleted in 0.0s. |
|
Addressed — both non-convergence
|
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98495218. |
|
/claude-review |
@xingbowang do you have some data on how many iterations are needed to converge the sanitization on a randomly generated command like in our stress/crash test CI? Just want to ensure we are not often stuck in exceeding max iteration. |
tools/db_crashtest.py
Outdated
| dest_params["allow_resumption_one_in"] = 0 | ||
|
|
||
|
|
||
| def _apply_declarative_rules(dest_params, max_iterations=20): |
There was a problem hiding this comment.
Full review from codex
Dead-path ambiguity (most important)
Evidence: advisory comment says _apply_declarative_rules exists but is not called, while rules are already applied in _apply_feature_requirements Phase 3.
Why it matters: two apparent “rule application” paths confuse maintainers and can cause bugs if someone updates the wrong one.
Suggestion:
remove _apply_declarative_rules if unused,
keep _apply_feature_requirements as single source of truth,
fix any error string referencing the wrong helper.
There was a problem hiding this comment.
Removed the unused alternate path and kept _apply_feature_requirements() as the single rule-application entry point. Fixed the convergence error text so it points at the right helper.
tools/db_crashtest.py
Outdated
| # When a conflict is detected, resolution depends on provenance: | ||
| # - Both features explicitly forced via --extra_flags: exit(1) with error | ||
| # - One explicit, one random: explicit feature wins; random feature disabled | ||
| # - Both random: 50/50 random choice of which feature to disable |
There was a problem hiding this comment.
Codex review:
- Random-vs-random wording mismatch
Evidence: comments describe “50/50 random”, but code uses stable hash of feature pair to choose loser deterministically.
Why it matters: behavior is deterministic per pair, not stochastic per run.
Suggestion:
either update wording to “deterministic per-pair tiebreak,”
or switch implementation to real randomness if that was intended.
There was a problem hiding this comment.
Agreed. The implementation is deterministic per feature pair, not stochastic per run. Updated the comments to say "deterministic per-pair tiebreak" so the wording matches the code.
tools/db_crashtest.py
Outdated
| # Should never happen in practice, but safety valve | ||
| print( | ||
| f"ERROR: finalize_and_sanitize did not converge after" | ||
| f" {max_iterations} iterations in _apply_special_rules()." |
There was a problem hiding this comment.
why is "_apply_special_rules()."? I thought the function containing this statement is a different one.
There was a problem hiding this comment.
I fixed it so the non-convergence error now references _apply_feature_requirements(), which is the actual fixed-point loop containing the logic.
tools/db_crashtest.py
Outdated
| dest_params = {k: v() if callable(v) else v for (k, v) in src_params.items()} | ||
|
|
||
| # Special rules first (external deps, cache computation, compression manager) | ||
| _apply_special_rules(dest_params) |
There was a problem hiding this comment.
Is it possible for the later steps to effectively overwrite _apply_special_rules result? Is there any test to verify it does not happen?
There was a problem hiding this comment.
The special rules run in Phase 0 on every fixed-point iteration, so later phases cannot permanently override them. I also added/kept regression coverage that checks the final sanitized config still reflects those special rules, e.g. unsupported direct I/O stays disabled and remote DB paths keep blob direct write disabled through the full loop.
tools/db_crashtest.py
Outdated
| return dest_params | ||
|
|
||
|
|
||
| INCOMPATIBILITY_RULES = [ |
There was a problem hiding this comment.
I had some trouble understanding why INCOMPATIBILITY_RULES can't be part of FEATURE_REQUIREMENTS ... can we sync offline?
There was a problem hiding this comment.
I clarified this in the maintenance comments. The short version is:
FEATURE_REQUIREMENTSis for mutual feature conflicts where there is a clear "feature A vs feature B" relationship and a meaningfuldisable_self(). E.g. blob_direct_write vs best_efforts_recovery They are peer features and want incompatible recovery/WAL settings, so one feature must lose.CONSEQUENCE_RULES(the earlier draft called theseINCOMPATIBILITY_RULES) is for one-way normalization likedisable_wal -> atomic_flush/sync/reopen adjustmentsor optimistic-txn write-buffer maintenance. Those are not two features fighting over ownership of one parameter, so forcing them intoFEATURE_REQUIREMENTSwould make the model less clear. E.g. disable_wal=1. Then atomic_flush, sync, reopen, etc. are normalized. That is not two features fighting; it is a one-way consequence.
…lity system Replaces the old procedural finalize_and_sanitize() with a data-driven approach. Incompatibility rules are declared as data, then applied in a fixed-point loop to handle transitive dependencies. Key changes: - INCOMPATIBILITY_RULES: declarative list of (when, then) rules covering all the same incompatibilities as the old procedural code - Fixed-point loop: applies rules repeatedly until convergence, handling transitive dependencies automatically - 10M-trial fuzz test (fuzz_convergence.py) confirms 0 oscillations Also adds: - tools/fuzz_convergence.py: convergence fuzzer for finalize_and_sanitize - tools/test_db_crashtest.py: unit tests for the new system Test: python3 tools/fuzz_convergence.py 10000000 100 Trials: 10,000,000 | Oscillations: 0 | Elapsed: 1243.1s
…t.py Summary: CONTEXT: New engineers adding stress test incompatibility rules need clear guidance on which data structure to use (FEATURE_REQUIREMENTS vs INCOMPATIBILITY_RULES) and how to add entries correctly. The test file also wasn't wired into make check. WHAT: - Add "Maintenance Guide" comment blocks at the top of both FEATURE_REQUIREMENTS and INCOMPATIBILITY_RULES in db_crashtest.py, with when-to-use guidance, field docs, and worked examples. - Wire test_db_crashtest.py into `make check` with conditional logic: always on Linux, local-only on other platforms (skip non-Linux CI). Test Plan: - make -n check (dry run, no syntax errors) - python3 -m unittest discover -s tools -p 'test_db_crashtest.py': Ran 17 tests in 21.225s — OK
|
<< @xingbowang do you have some data on how many iterations are needed to converge the sanitization on a randomly generated command like in our stress/crash test CI? Just want to ensure we are not often stuck in exceeding max iteration. Less than 10. I don't think our config has a very long chain of transitive dependency. |
ad02bde to
7763467
Compare
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98495218. |
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit e5d1072 Code Review: Replace procedural finalize_and_sanitize with declarative incompatibility systemPR: Replace procedural finalize_and_sanitize with declarative incompatibility system Executive SummaryThis PR replaces the ~591-line procedural Recommendation: Approve with minor revisions. FindingsHIGHF4: Behavioral Change — Extra Flags Now Sanitized F5: New sys.exit(1) for Explicit+Explicit Conflicts F6: Type Conversion in Unknown Parameter Parsing (Medium) MEDIUMF7: Exclusion Set Params Merged But Then Dropped F10: random.choice in Fixed-Point Loop LOW / SUGGESTIONSF12: Complexity Trade-off — The ~1200 lines (vs 591) are justified by: solving F14: Whitespace-Only Changes — F15: New OCC Rule — The F16: fuzz_convergence.py Shebang — The F13: Makefile Conditional — The Verified Correct (Initial Concerns Dismissed)
Action Items
ℹ️ About this responseGenerated by Claude Code. Limitations:
Commands:
|
Summary
Replaces the ~500-line procedural
finalize_and_sanitize()intools/db_crashtest.pywith a declarative Feature Requirements System that handles flag priority correctly when users pass--extra_flagsto the stress test runner.Problem
The old
finalize_and_sanitize()was a large if/else chain with implicit ordering dependencies. When users passed conflicting flags via--extra_flags, the outcome depended on which branch ran last — not on which flag was explicitly requested.Solution
A declarative engine with three priority rules:
--extra_flagsthat contradict each other → print a clear error andexit(1).Key changes
FEATURE_REQUIREMENTSdict: each feature declaresactive_when,disable_self, andrequires(the param values it needs). Conflicts are auto-detected by the engine — no need to enumerate every pair.INCOMPATIBILITY_RULESlist: one-way consequence rules (e.g.disable_wal=1→atomic_flush=1). These propagate effects in a fixed-point loop.finalize_and_sanitize(src_params, explicit_keys=None): new signature accepts the set of param names that came from--extra_flags, enabling priority-based conflict resolution.gen_cmd()updated to extractexplicit_keysfromunknown_paramsand pass them through.tools/fuzz_convergence.pyandtools/test_db_crashtest.pyadded: regression tests for convergence (no oscillation), idempotency, and known conflict scenarios.use_optimistic_txn=1now enforcesmax_write_buffer_size_to_maintain >= write_buffer_size(required for OCC conflict detection against old memtable data).Test coverage
tools/test_db_crashtest.pycovering convergence, idempotency, and known conflicts