feat(ship-two-001): FALSIFY-SHIP-020 algorithm-level PARTIAL discharge (5th PARTIAL) by noahgift · Pull Request #1005 · paiml/aprender

noahgift · 2026-04-22T13:25:37Z

Summary

Binds AC-SHIP2-010 (inference decode throughput ≥ 100 tok/s on RTX 4090) ↔ FALSIFY-SHIP-020 via new GATE-ARCH-370M-006 in the sovereign contract
Pure f32 threshold fn verdict_from_decode_tps(measured_tps) -> Ship020Verdict + const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0 in crates/aprender-train/src/models/llama_370m.rs
Two unit tests covering 5 invariants (Pass boundary, Fail boundary at one f32 ULP, bidirectional monotonicity, conservative Fail for NaN/±∞, provenance pinning)
Sovereign contract contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE)
Spec bump v2.23.0 → v2.26.0 with amendment block
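
The decision rule described in the summary can be sketched as follows. This is a minimal illustration assuming the simplest shape consistent with the description (finite and `>=` the floor passes, everything else fails); the actual code in `llama_370m.rs` may differ. The const, enum, and function names come from the PR text, while `one_ulp_below` is a hypothetical helper added here only to illustrate the "one f32 ULP" boundary.

```rust
/// Decode-throughput floor (tok/s on RTX 4090) for AC-SHIP2-010, as stated in the PR.
pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

/// Two-valued verdict named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship020Verdict {
    Pass,
    Fail,
}

/// Pure threshold rule (assumed shape): Pass iff the measurement is finite
/// and at or above the floor. NaN and both infinities fail conservatively.
pub fn verdict_from_decode_tps(measured_tps: f32) -> Ship020Verdict {
    if measured_tps.is_finite() && measured_tps >= AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 {
        Ship020Verdict::Pass
    } else {
        Ship020Verdict::Fail
    }
}

/// Hypothetical helper (not from the PR): the largest f32 strictly below a
/// positive, finite, normal `x`, obtained by decrementing its bit pattern.
pub fn one_ulp_below(x: f32) -> f32 {
    f32::from_bits(x.to_bits() - 1)
}
```

With this shape, `verdict_from_decode_tps(100.0)` passes while `verdict_from_decode_tps(one_ulp_below(100.0))` fails, which is the sharpest boundary pair the unit tests are described as covering.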

Status lift

MODEL-2 ship-gate status after this PR:

3/12 ACTIVE (001, 011, 012) + 5/12 PARTIAL_ALGORITHM_LEVEL (002 via SHIP-012, 005 via SHIP-015, 007 via SHIP-017 [PR #1004], 009 via SHIP-019, 010 via SHIP-020 ← this PR) = 8/12 touched (66.7%).

Remaining 4 (003/004/006/008) all need real 370M training compute or a benchmark pipeline on RTX 4090.

Pattern lesson

v2.22.0 of the spec declared MODEL-2 non-compute PARTIAL levers "exhausted". The counter-example survey has now falsified that verdict three times:

SHIP-019 (v2.22.0 itself, task #117)
SHIP-017 (PR #1004, task #149)
SHIP-020 (this PR, task #150)

Rule (reinforced): when a SHIP gate names a threshold / tolerance / ratio / cut-off and the compute-heavy harness is separable from the decision function, the threshold fn can land today at unit-test time — even when the full end-to-end harness is blocked on compute.

Full discharge blocks on

Real 370M .apr from AC-SHIP2-003/004 compute-dispatch + three independent apr bench --tokens 128 --json medians on the RTX 4090 host. Fixture-swap only — no decision-rule rewrite.
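
One way to picture "fixture-swap only" is a gate that takes its three medians from a pluggable source, so a synthetic fixture can stand in today and real `apr bench` output can replace it later without touching the rule. The sketch below is an assumption-laden illustration: `MedianSource`, `SyntheticFixture`, `clears_floor`, and `gate_passes` are hypothetical names, and the PR does not say how the three medians are combined. Here every one of them must clear the floor.

```rust
/// Floor from the PR text: 100 tok/s decode on RTX 4090.
pub const FLOOR_TPS: f32 = 100.0;

/// Assumed decision rule: finite and at or above the floor.
fn clears_floor(tps: f32) -> bool {
    tps.is_finite() && tps >= FLOOR_TPS
}

/// Hypothetical abstraction over "where the three medians come from".
pub trait MedianSource {
    /// Three independent decode-throughput medians, in tok/s.
    fn medians(&self) -> [f32; 3];
}

/// Stand-in fixture usable at unit-test time, before real benchmark data exists.
pub struct SyntheticFixture(pub [f32; 3]);

impl MedianSource for SyntheticFixture {
    fn medians(&self) -> [f32; 3] {
        self.0
    }
}

/// Assumed aggregation (not stated in the PR): all three medians must pass.
pub fn gate_passes(source: &dyn MedianSource) -> bool {
    source.medians().iter().all(|&m| clears_floor(m))
}
```

A later implementation of `MedianSource` backed by parsed benchmark output would slot in without any change to `gate_passes`, which is the property the "no decision-rule rewrite" claim relies on.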

Scope note

Also includes 6 lines of pre-existing cargo fmt -p aprender-train --check fixes in crates/aprender-train/src/train/device.rs (whitespace only). Keeps cargo fmt -p aprender-train --check green under the Toyota Way "all defects are your defects" rule.

Test plan

cargo test -p aprender-train --lib models::llama_370m → 11/11 PASS (9 pre-existing + 2 new)
pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → "Contract is valid. 0 error(s), 0 warning(s)."
cargo clippy -p aprender-train --lib -- -D warnings → green
cargo fmt -p aprender-train --check → green
PMAT pre-commit quality gates → green
CI (ci / gate, workspace-test)
Full discharge: real 370M .apr + apr bench --tokens 128 --json on RTX 4090 (blocked on AC-SHIP2-003/004 compute-dispatch; task #126 in flight)

Task #150.

🤖 Generated with Claude Code

Binds AC-SHIP2-010 (inference decode throughput ≥ 100 tok/s on RTX 4090) to a new GATE-ARCH-370M-006 in the sovereign contract via a pure f32 threshold fn + two unit tests. The compute-heavy half (`apr bench` on a real trained 370M .apr) is deferred to AC-SHIP2-003/004 compute-dispatch; the decision rule itself is proven today.

Changes:

- crates/aprender-train/src/models/llama_370m.rs:
  * AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0 (const floor)
  * Ship020Verdict { Pass, Fail }
  * verdict_from_decode_tps(f32) -> Ship020Verdict (fn, non-finite → Fail)
  * falsify_ship_020_decode_tps_threshold_logic (5 invariants: Pass boundary, Fail boundary at one f32 ULP, monotonicity in both directions, conservative Fail for NaN/±∞, provenance pinning that the const stays = 100.0)
  * falsify_ship_020_gate_arch_370m_006_has_partial_discharge_marker (contract parses + advertises PARTIAL_ALGORITHM_LEVEL + evidence_discharged_by populated + full_discharge_blocks_on documented + ship_blocking:true)
- contracts/model-families/llama-370m-sovereign-v1.yaml:
  * v1.5.0 → v1.6.0, stays ACTIVE
  * New GATE-ARCH-370M-006 binding AC-SHIP2-010 ↔ FALSIFY-SHIP-020 with discharge_status: PARTIAL_ALGORITHM_LEVEL
- docs/specifications/aprender-train/ship-two-models-spec.md:
  * v2.23.0 → v2.26.0 with amendment block
  * MODEL-2 ship-gate status updated: 3/12 ACTIVE + 5/12 PARTIAL = 8/12 touched (66.7%)
- crates/aprender-train/src/train/device.rs:
  * 2 pre-existing fmt fixes (6 lines of whitespace), restoring `cargo fmt -p aprender-train --check` green. Pre-existing on origin/main; kept in this PR under the Toyota Way "all defects are your defects" rule.

Pattern lesson: v2.22.0 declared MODEL-2 non-compute PARTIAL levers "exhausted"; re-running the counter-example survey has now falsified that verdict three times (SHIP-019 → SHIP-017 → SHIP-020). When a SHIP gate names a threshold / tolerance / ratio / cut-off and the compute-heavy harness is separable from the decision function, the threshold fn can land today at unit-test time, even when the full end-to-end harness is blocked on compute.

Full discharge blocks on: real 370M .apr from AC-SHIP2-003/004 compute-dispatch + three independent `apr bench --tokens 128 --json` medians on RTX 4090 host. Fixture-swap only; no decision-rule rewrite.

Verification:

- cargo test -p aprender-train --lib models::llama_370m → 11/11 PASS
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → "Contract is valid. 0 error(s), 0 warning(s)."
- cargo clippy -p aprender-train --lib → green
- cargo fmt -p aprender-train --check → green

Task #150.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…lean branch)

Clean-branch rebuild of SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on main (superseding stale PR #1014 which was stacked on feat/falsify-ship-008/006-partial-discharge branches that had not yet merged to main). Algorithm commit carries the same 7-section mutation survey as the original be6d129, re-based onto post-SHIP-002 main (commit f615148, contract v1.1.0).

Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090 (7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold verdict fn bound to the MODEL-1 teacher ship floor. Decision rule is proven today; compute-heavy half (live `apr bench` on RTX 4090) is deferred to hardware evidence collection.

Files:

- `crates/aprender-core/src/bench/ship_007.rs` (NEW): `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`, `Ship007Verdict { Pass, Fail }`, `verdict_from_decode_tps(f32) -> Ship007Verdict`, `falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
  1. boundary (30.0 exactly → Pass; contract is ≥, not >)
  2. one-ULP-below → Fail (sharpest off-by-one counter-example)
  3. clear Pass band (45 / 100 tok/s)
  4. clear Fail band (0 / 10 / 29.999999)
  5. monotonicity above floor + below floor
  6. non-finite → Fail conservatively (NaN, +∞, -∞)
  7. provenance pin binding 30.0 to spec §4.2.
- `crates/aprender-core/src/bench/mod.rs`: register `pub mod ship_007;`.
- `contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0: adds `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_007.rs + the harness test, and `full_discharge_blocks_on` live `apr bench --iterations 5 --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090 with --features cuda; median of 5 iterations must be ≥ 30.0.
- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0: annotates AC-SHIP1-007 row with PARTIAL_ALGORITHM_LEVEL v2.27.0 marker and adds v2.27.0 amendment entry.

Design: mirrors MODEL-2 SHIP-020 single-f32-threshold shape (PR #1005 not yet on main). Once both ship, the two `verdict_from_decode_tps_*` fns should be deduplicated into a single parameterized helper `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with model-specific floors pinned as module-level consts. MODEL-1 floor is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the 370M size); MODEL-2 floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).

MODEL-1 AC-SHIP1 coverage: 4/10 touched (SHIP-009 + SHIP-008 + SHIP-006 + SHIP-002) → **5/10** touched (+ SHIP-007).

Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `pv validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.
Clippy: `cargo clippy -p aprender-core --lib -- -D warnings` → clean.
Fmt: `cargo fmt --check -p aprender-core` → clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
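
The proposed deduplication could look roughly like the sketch below. The helper signature `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` and the two const names are taken from the commit messages; the body, and in particular the choice to fail on a non-finite floor, are assumptions made for illustration rather than the project's actual code.

```rust
/// Shared verdict type, named in the design note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThresholdVerdict {
    Pass,
    Fail,
}

/// MODEL-1 floor (7B Q4_K on RTX 4090), per the commit message.
pub const AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B: f32 = 30.0;

/// MODEL-2 floor (370M sovereign on RTX 4090), per the PR description.
pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

/// Parameterized threshold rule (assumed body): Pass iff both inputs are
/// finite and the measurement is at or above the floor.
pub fn verdict_from_decode_tps(measured: f32, floor: f32) -> ThresholdVerdict {
    if measured.is_finite() && floor.is_finite() && measured >= floor {
        ThresholdVerdict::Pass
    } else {
        ThresholdVerdict::Fail
    }
}
```

Under this sketch a measurement of 45 tok/s passes against the MODEL-1 floor and fails against the MODEL-2 floor, so the model-specific behavior lives entirely in the pinned consts.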

…lean branch) (#1019)

* feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL (clean branch)
* ci: retrigger after 3 disk-guard race failures

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

This was referenced Apr 22, 2026

falsify(ship): SHIP-018 PARTIAL — humaneval pass@1 ≥30.0% threshold fn #1006

Open

feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL discharge #1014

Closed

noahgift wants to merge 1 commit into main from feat/falsify-ship-020-partial-discharge
