feat(ship-two-001): FALSIFY-SHIP-020 algorithm-level PARTIAL discharge (5th PARTIAL)#1005
Open
Binds AC-SHIP2-010 (inference decode throughput ≥ 100 tok/s on RTX
4090) to a new GATE-ARCH-370M-006 in the sovereign contract via a pure
f32 threshold fn + two unit tests. The compute-heavy half (`apr bench`
on a real trained 370M .apr) is deferred to AC-SHIP2-003/004
compute-dispatch; the decision rule itself is proven today.
Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  * AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0 (const floor)
  * Ship020Verdict { Pass, Fail }
  * verdict_from_decode_tps(f32) -> Ship020Verdict (fn, non-finite → Fail)
  * falsify_ship_020_decode_tps_threshold_logic (5 invariants:
    Pass boundary, Fail boundary at one f32 ULP, monotonicity in
    both directions, conservative Fail for NaN/±∞, provenance
    pinning that the const stays = 100.0)
  * falsify_ship_020_gate_arch_370m_006_has_partial_discharge_marker
    (contract parses + advertises PARTIAL_ALGORITHM_LEVEL +
    evidence_discharged_by populated + full_discharge_blocks_on
    documented + ship_blocking:true)
- contracts/model-families/llama-370m-sovereign-v1.yaml:
  * v1.5.0 → v1.6.0, stays ACTIVE
  * New GATE-ARCH-370M-006 binding AC-SHIP2-010 ↔ FALSIFY-SHIP-020
    with discharge_status: PARTIAL_ALGORITHM_LEVEL
- docs/specifications/aprender-train/ship-two-models-spec.md:
  * v2.23.0 → v2.26.0 with amendment block
  * MODEL-2 ship-gate status updated: 3/12 ACTIVE + 5/12 PARTIAL =
    8/12 touched (66.7%)
- crates/aprender-train/src/train/device.rs:
  * 2 pre-existing fmt fixes (6 lines of whitespace); restores
    `cargo fmt -p aprender-train --check` green. Pre-existing on
    origin/main; kept in this PR under Toyota Way "all defects are
    your defects" rule.
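For readers without the diff open, the threshold fn and its five invariants listed above can be sketched roughly as follows. This is a reconstruction from the names in this message, not the code in `llama_370m.rs`; the helpers `one_ulp_below` and `check_threshold_invariants` are hypothetical names introduced only for illustration.

```rust
/// Ship floor for AC-SHIP2-010: decode throughput in tok/s on RTX 4090.
pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship020Verdict {
    Pass,
    Fail,
}

/// Pure decision rule: Pass iff the measurement is finite and >= the floor.
/// NaN and both infinities fall through to the conservative Fail branch.
pub fn verdict_from_decode_tps(measured_tps: f32) -> Ship020Verdict {
    if measured_tps.is_finite() && measured_tps >= AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 {
        Ship020Verdict::Pass
    } else {
        Ship020Verdict::Fail
    }
}

/// Hypothetical helper: largest f32 strictly below a positive, normal `x`.
pub fn one_ulp_below(x: f32) -> f32 {
    f32::from_bits(x.to_bits() - 1)
}

/// Hypothetical stand-in for the unit test: panics if any invariant breaks.
pub fn check_threshold_invariants() {
    let floor = AC_SHIP2_010_MIN_DECODE_TPS_RTX4090;
    // 1. Pass boundary: the contract is >=, so the floor itself passes.
    assert_eq!(verdict_from_decode_tps(floor), Ship020Verdict::Pass);
    // 2. Fail boundary: one f32 ULP below the floor must fail.
    assert_eq!(verdict_from_decode_tps(one_ulp_below(floor)), Ship020Verdict::Fail);
    // 3. Monotonicity: everything at or above the floor passes,
    //    everything below it fails.
    for tps in [floor, 101.0, 250.0, 1.0e6] {
        assert_eq!(verdict_from_decode_tps(tps), Ship020Verdict::Pass);
    }
    for tps in [one_ulp_below(floor), 99.0, 1.0, 0.0, -5.0] {
        assert_eq!(verdict_from_decode_tps(tps), Ship020Verdict::Fail);
    }
    // 4. Non-finite inputs fail conservatively.
    for tps in [f32::NAN, f32::INFINITY, f32::NEG_INFINITY] {
        assert_eq!(verdict_from_decode_tps(tps), Ship020Verdict::Fail);
    }
    // 5. Provenance pin: the floor stays at 100.0.
    assert_eq!(floor, 100.0);
}
```

The explicit `is_finite()` check matters for +∞, which would otherwise satisfy `>= 100.0` and pass.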
Pattern lesson: v2.22.0 declared MODEL-2 non-compute PARTIAL levers
"exhausted" — re-running the counter-example survey has now falsified
that verdict three times (SHIP-019 → SHIP-017 → SHIP-020). When a
SHIP gate names a threshold / tolerance / ratio / cut-off and the
compute-heavy harness is separable from the decision function, the
threshold fn can land today at unit-test time — even when the full
end-to-end harness is blocked on compute.
Full discharge blocks on: real 370M .apr from AC-SHIP2-003/004
compute-dispatch + three independent `apr bench --tokens 128 --json`
medians on RTX 4090 host. Fixture-swap only — no decision-rule rewrite.
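A minimal sketch of what "fixture-swap only" could look like, assuming the three medians are supplied as plain f32 values and that all three must clear the floor. That aggregation rule is an assumption; this message does not say how the three medians are combined. `passes_floor` and `full_discharge_holds` are hypothetical names, and nothing here parses real `apr bench` output.

```rust
/// Same value as AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 in this PR.
pub const FLOOR_TPS: f32 = 100.0;

/// The decision rule already proven at unit-test time: finite and >= floor.
pub fn passes_floor(tps: f32) -> bool {
    tps.is_finite() && tps >= FLOOR_TPS
}

/// Assumed aggregation (not stated in the PR): every one of the three
/// independent medians must clear the floor.
pub fn full_discharge_holds(medians: [f32; 3]) -> bool {
    medians.iter().all(|&m| passes_floor(m))
}
```

Under this shape, moving from synthetic to measured values changes only the array passed in; `passes_floor` is untouched.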
Verification:
- cargo test -p aprender-train --lib models::llama_370m → 11/11 PASS
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
→ "Contract is valid. 0 error(s), 0 warning(s)."
- cargo clippy -p aprender-train --lib → green
- cargo fmt -p aprender-train --check → green
Task #150.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced Apr 22, 2026
noahgift added a commit that referenced this pull request Apr 23, 2026
…lean branch)
Clean-branch rebuild of SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on main (superseding stale PR #1014 which was stacked on feat/falsify-ship-008/006-partial-discharge branches that had not yet merged to main). Algorithm commit carries the same 7-section mutation survey as the original be6d129, re-based onto post-SHIP-002 main (commit f615148, contract v1.1.0).
Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090 (7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold verdict fn bound to the MODEL-1 teacher ship floor. Decision rule is proven today; compute-heavy half (live `apr bench` on RTX 4090) is deferred to hardware evidence collection.
Files:
- `crates/aprender-core/src/bench/ship_007.rs` (NEW): `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`, `Ship007Verdict { Pass, Fail }`, `verdict_from_decode_tps(f32) -> Ship007Verdict`, `falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
  1. boundary (30.0 exactly → Pass; contract is ≥, not >)
  2. one-ULP-below → Fail (sharpest off-by-one counter-example)
  3. clear Pass band (45 / 100 tok/s)
  4. clear Fail band (0 / 10 / 29.999999)
  5. monotonicity above floor + below floor
  6. non-finite → Fail conservatively (NaN, +∞, -∞)
  7. provenance pin binding 30.0 to spec §4.2.
- `crates/aprender-core/src/bench/mod.rs`: register `pub mod ship_007;`.
- `contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0: adds `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_007.rs + the harness test, and `full_discharge_blocks_on` live `apr bench --iterations 5 --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090 with --features cuda; median of 5 iterations must be ≥ 30.0.
- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0: annotates AC-SHIP1-007 row with PARTIAL_ALGORITHM_LEVEL v2.27.0 marker and adds v2.27.0 amendment entry.
Design: mirrors MODEL-2 SHIP-020 single-f32-threshold shape (PR #1005 not yet on main). Once both ship, the two `verdict_from_decode_tps_*` fns should be deduplicated into a single parameterized helper `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with model-specific floors pinned as module-level consts. MODEL-1 floor is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the 370M size); MODEL-2 floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).
MODEL-1 AC-SHIP1 coverage: 4/10 touched (SHIP-009 + SHIP-008 + SHIP-006 + SHIP-002) → **5/10** touched (+ SHIP-007).
Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `pv validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.
Clippy: `cargo clippy -p aprender-core --lib -- -D warnings` → clean.
Fmt: `cargo fmt --check -p aprender-core` → clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
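The deduplication proposed in the Design note could take roughly the following shape. This is a sketch of a helper that does not exist yet; only the signature `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` and the two const names and values come from the commit messages. Treating a non-finite `floor` as Fail is an added assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThresholdVerdict {
    Pass,
    Fail,
}

/// MODEL-1 floor (7B Q4_K on RTX 4090), in tok/s.
pub const AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B: f32 = 30.0;
/// MODEL-2 floor (370M sovereign on RTX 4090), in tok/s.
pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

/// Shared rule: Pass iff `measured` is finite and >= `floor`.
/// Assumption: a non-finite `floor` is a configuration error and also fails.
pub fn verdict_from_decode_tps(measured: f32, floor: f32) -> ThresholdVerdict {
    if measured.is_finite() && floor.is_finite() && measured >= floor {
        ThresholdVerdict::Pass
    } else {
        ThresholdVerdict::Fail
    }
}
```

With this shape, the same 45 tok/s measurement passes against the MODEL-1 floor and fails against the MODEL-2 floor, so each model's test only needs to pin its own const.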
noahgift added a commit that referenced this pull request Apr 23, 2026
…lean branch) (#1019)
* feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL (clean branch)
* ci: retrigger after 3 disk-guard race failures
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- `verdict_from_decode_tps(measured_tps) -> Ship020Verdict` + const `AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0` in `crates/aprender-train/src/models/llama_370m.rs`
- `contracts/model-families/llama-370m-sovereign-v1.yaml` v1.5.0 → v1.6.0 (stays ACTIVE)
Status lift
MODEL-2 ship-gate status after this PR:
3/12 ACTIVE (001, 011, 012) + 5/12 PARTIAL_ALGORITHM_LEVEL (002 via SHIP-012, 005 via SHIP-015, 007 via SHIP-017 [PR #1004], 009 via SHIP-019, 010 via SHIP-020 ← this PR) = 8/12 touched (66.7%).
Remaining 4 (003/004/006/008) all need real 370M training compute or a benchmark pipeline on RTX 4090.
Pattern lesson
v2.22.0 of the spec declared MODEL-2 non-compute PARTIAL levers "exhausted". The counter-example survey has now falsified that verdict three times: SHIP-019 → SHIP-017 → SHIP-020.
Rule (reinforced): when a SHIP gate names a threshold / tolerance / ratio / cut-off and the compute-heavy harness is separable from the decision function, the threshold fn can land today at unit-test time, even when the full end-to-end harness is blocked on compute.
Full discharge blocks on
Real 370M `.apr` from AC-SHIP2-003/004 compute-dispatch + three independent `apr bench --tokens 128 --json` medians on the RTX 4090 host. Fixture-swap only; no decision-rule rewrite.
Scope note
Also includes 6 lines of pre-existing `cargo fmt -p aprender-train --check` fixes in `crates/aprender-train/src/train/device.rs` (whitespace only). Keeps `cargo fmt -p aprender-train --check` green under the Toyota Way "all defects are your defects" rule.
Test plan
- `cargo test -p aprender-train --lib models::llama_370m` → 11/11 PASS (9 pre-existing + 2 new)
- `pv validate contracts/model-families/llama-370m-sovereign-v1.yaml` → "Contract is valid. 0 error(s), 0 warning(s)."
- `cargo clippy -p aprender-train --lib -- -D warnings` → green
- `cargo fmt -p aprender-train --check` → green
- `.apr` + `apr bench --tokens 128 --json` on RTX 4090 (blocked on AC-SHIP2-003/004 compute-dispatch; task #126 in-flight)
Task #150.
🤖 Generated with Claude Code