
feat(ship-two-001): FALSIFY-SHIP-020 algorithm-level PARTIAL discharge (5th PARTIAL) #1005

Open
noahgift wants to merge 1 commit into main from feat/falsify-ship-020-partial-discharge

Conversation

@noahgift
Contributor

Summary

  • Binds AC-SHIP2-010 (inference decode throughput ≥ 100 tok/s on RTX 4090) ↔ FALSIFY-SHIP-020 via new GATE-ARCH-370M-006 in the sovereign contract
  • Pure f32 threshold fn verdict_from_decode_tps(measured_tps) -> Ship020Verdict + const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0 in crates/aprender-train/src/models/llama_370m.rs
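  • Two unit tests: one covering 5 threshold invariants (Pass boundary, Fail boundary at one f32 ULP, bidirectional monotonicity, conservative Fail for NaN/±∞, provenance pinning) and one checking that the contract advertises the PARTIAL_ALGORITHM_LEVEL discharge marker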
  • Sovereign contract contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE)
  • Spec bump v2.23.0 → v2.26.0 with amendment block
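
For orientation, a minimal sketch of the threshold function described above. The names `verdict_from_decode_tps`, `Ship020Verdict`, and `AC_SHIP2_010_MIN_DECODE_TPS_RTX4090` come from this PR; the function body and the derives are an assumption about how such a function could be written, not a copy of `llama_370m.rs`.

```rust
/// Minimum decode throughput (tokens/s) named by AC-SHIP2-010 for RTX 4090.
pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

/// Two-valued verdict for FALSIFY-SHIP-020.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship020Verdict {
    Pass,
    Fail,
}

/// Pure decision rule (assumed body): Pass only when the measurement is
/// finite and at or above the floor. NaN and both infinities map to Fail,
/// so a broken benchmark cannot discharge the gate by accident.
pub fn verdict_from_decode_tps(measured_tps: f32) -> Ship020Verdict {
    if measured_tps.is_finite() && measured_tps >= AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 {
        Ship020Verdict::Pass
    } else {
        Ship020Verdict::Fail
    }
}
```

Because the function takes a plain `f32` and touches no I/O, it can be unit-tested without a GPU or a trained model.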

Status lift

MODEL-2 ship-gate status after this PR:

3/12 ACTIVE (001, 011, 012) + 5/12 PARTIAL_ALGORITHM_LEVEL (002 via SHIP-012, 005 via SHIP-015, 007 via SHIP-017 [PR #1004], 009 via SHIP-019, 010 via SHIP-020 ← this PR) = 8/12 touched (66.7%).

Remaining 4 (003/004/006/008) all need real 370M training compute or a benchmark pipeline on RTX 4090.

Pattern lesson

v2.22.0 of the spec declared MODEL-2 non-compute PARTIAL levers "exhausted". The counter-example survey has now falsified that verdict three times:

  1. SHIP-019 (v2.22.0 itself, task #117)
  2. SHIP-017 (PR #1004, task #149)
  3. SHIP-020 (this PR, task #150)

Rule (reinforced): when a SHIP gate names a threshold / tolerance / ratio / cut-off and the compute-heavy harness is separable from the decision function, the threshold fn can land today at unit-test time — even when the full end-to-end harness is blocked on compute.

Full discharge blocks on

Real 370M .apr from AC-SHIP2-003/004 compute-dispatch + three independent apr bench --tokens 128 --json medians on the RTX 4090 host. Fixture-swap only — no decision-rule rewrite.

Scope note

Also includes 6 lines of pre-existing cargo fmt -p aprender-train --check fixes in crates/aprender-train/src/train/device.rs (whitespace only). Keeps cargo fmt -p aprender-train --check green under the Toyota Way "all defects are your defects" rule.

Test plan

  • cargo test -p aprender-train --lib models::llama_370m → 11/11 PASS (9 pre-existing + 2 new)
  • pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → "Contract is valid. 0 error(s), 0 warning(s)."
  • cargo clippy -p aprender-train --lib -- -D warnings → green
  • cargo fmt -p aprender-train --check → green
  • PMAT pre-commit quality gates → green
  • CI (ci / gate, workspace-test)
  • Full discharge: real 370M .apr + apr bench --tokens 128 --json on RTX 4090 (blocked on AC-SHIP2-003/004 compute-dispatch; task #126 in-flight)

Task #150.

🤖 Generated with Claude Code

Binds AC-SHIP2-010 (inference decode throughput ≥ 100 tok/s on RTX
4090) to a new GATE-ARCH-370M-006 in the sovereign contract via a pure
f32 threshold fn + two unit tests. The compute-heavy half (`apr bench`
on a real trained 370M .apr) is deferred to AC-SHIP2-003/004
compute-dispatch; the decision rule itself is proven today.

Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  * AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0 (const floor)
  * Ship020Verdict { Pass, Fail }
  * verdict_from_decode_tps(f32) -> Ship020Verdict (fn, non-finite → Fail)
  * falsify_ship_020_decode_tps_threshold_logic (5 invariants:
    Pass boundary, Fail boundary at one f32 ULP, monotonicity in
    both directions, conservative Fail for NaN/±∞, provenance
    pinning that the const stays = 100.0)
  * falsify_ship_020_gate_arch_370m_006_has_partial_discharge_marker
    (contract parses + advertises PARTIAL_ALGORITHM_LEVEL +
    evidence_discharged_by populated + full_discharge_blocks_on
    documented + ship_blocking:true)

- contracts/model-families/llama-370m-sovereign-v1.yaml:
  * v1.5.0 → v1.6.0, stays ACTIVE
  * New GATE-ARCH-370M-006 binding AC-SHIP2-010 ↔ FALSIFY-SHIP-020
    with discharge_status: PARTIAL_ALGORITHM_LEVEL

- docs/specifications/aprender-train/ship-two-models-spec.md:
  * v2.23.0 → v2.26.0 with amendment block
  * MODEL-2 ship-gate status updated: 3/12 ACTIVE + 5/12 PARTIAL =
    8/12 touched (66.7%)

- crates/aprender-train/src/train/device.rs:
  * 2 pre-existing fmt fixes (6 lines of whitespace) — restores
    `cargo fmt -p aprender-train --check` green. Pre-existing on
    origin/main; kept in this PR under Toyota Way "all defects are
    your defects" rule.
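
The first four invariants listed for `falsify_ship_020_decode_tps_threshold_logic` can be sketched as a reusable checker. This is an illustrative sketch only: `one_ulp_below` and `check_threshold_invariants` are hypothetical names, the sample points are arbitrary, and the fifth invariant (provenance pinning of the const) is left out because it is a plain equality assertion.

```rust
/// Largest f32 strictly below `x`. Only valid for positive, finite `x`,
/// where decrementing the IEEE 754 bit pattern steps down by one ULP.
fn one_ulp_below(x: f32) -> f32 {
    assert!(x.is_finite() && x > 0.0, "only defined for positive finite x");
    f32::from_bits(x.to_bits() - 1)
}

/// Hypothetical checker: `passes` is any threshold decision function,
/// `floor` is the value it is supposed to enforce with `>=`.
fn check_threshold_invariants(
    passes: impl Fn(f32) -> bool,
    floor: f32,
) -> Result<(), &'static str> {
    // 1. Pass boundary: the floor itself passes (the contract is >=, not >).
    if !passes(floor) {
        return Err("floor itself must pass");
    }
    // 2. Fail boundary: the nearest representable value below the floor fails.
    if passes(one_ulp_below(floor)) {
        return Err("one ULP below the floor must fail");
    }
    // 3. Monotonicity: over ascending samples, a Pass is never followed by a Fail.
    let ascending = [0.0, floor * 0.5, one_ulp_below(floor), floor, floor * 1.5, floor * 10.0];
    let mut seen_pass = false;
    for &sample in &ascending {
        let ok = passes(sample);
        if seen_pass && !ok {
            return Err("verdict must be monotone in the measurement");
        }
        seen_pass |= ok;
    }
    // 4. Conservative Fail: non-finite measurements never pass.
    for &bad in &[f32::NAN, f32::INFINITY, f32::NEG_INFINITY] {
        if passes(bad) {
            return Err("non-finite measurement must fail");
        }
    }
    Ok(())
}
```

A checker of this shape rejects the two most likely mutants: a strict `>` comparison (caught by invariant 1) and a bare `>=` without a finiteness guard (caught by invariant 4, since `+∞ >= floor` is true).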

Pattern lesson: v2.22.0 declared MODEL-2 non-compute PARTIAL levers
"exhausted" — re-running the counter-example survey has now falsified
that verdict three times (SHIP-019 → SHIP-017 → SHIP-020). When a
SHIP gate names a threshold / tolerance / ratio / cut-off and the
compute-heavy harness is separable from the decision function, the
threshold fn can land today at unit-test time — even when the full
end-to-end harness is blocked on compute.

Full discharge blocks on: real 370M .apr from AC-SHIP2-003/004
compute-dispatch + three independent `apr bench --tokens 128 --json`
medians on RTX 4090 host. Fixture-swap only — no decision-rule rewrite.

Verification:
- cargo test -p aprender-train --lib models::llama_370m → 11/11 PASS
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
  → "Contract is valid. 0 error(s), 0 warning(s)."
- cargo clippy -p aprender-train --lib → green
- cargo fmt -p aprender-train --check → green

Task #150.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 23, 2026
…lean branch)

Clean-branch rebuild of SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on
main (superseding stale PR #1014 which was stacked on
feat/falsify-ship-008/006-partial-discharge branches that had not yet
merged to main). Algorithm commit carries the same 7-section mutation
survey as the original be6d129, re-based onto post-SHIP-002 main
(commit f615148, contract v1.1.0).

Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090
(7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold
verdict fn bound to the MODEL-1 teacher ship floor. Decision rule is
proven today; compute-heavy half (live `apr bench` on RTX 4090) is
deferred to hardware evidence collection.

Files:
- `crates/aprender-core/src/bench/ship_007.rs` (NEW) —
  `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`,
  `Ship007Verdict { Pass, Fail }`,
  `verdict_from_decode_tps(f32) -> Ship007Verdict`,
  `falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
    1. boundary (30.0 exactly → Pass; contract is ≥, not >)
    2. one-ULP-below → Fail (sharpest off-by-one counter-example)
    3. clear Pass band (45 / 100 tok/s)
    4. clear Fail band (0 / 10 / 29.999999)
    5. monotonicity above floor + below floor
    6. non-finite → Fail conservatively (NaN, +∞, -∞)
    7. provenance pin binding 30.0 to spec §4.2.

- `crates/aprender-core/src/bench/mod.rs` — register `pub mod ship_007;`.

- `contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0 — adds
  `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`,
  `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by`
  pointing at ship_007.rs + the harness test, and
  `full_discharge_blocks_on` live `apr bench --iterations 5
  --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090
  with --features cuda; median of 5 iterations must be ≥ 30.0.

- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0
  → v2.27.0 — annotates AC-SHIP1-007 row with PARTIAL_ALGORITHM_LEVEL
  v2.27.0 marker and adds v2.27.0 amendment entry.
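
The full-discharge condition above ("median of 5 iterations must be ≥ 30.0") could be evaluated roughly as follows once real measurements exist. This is a sketch under assumptions: `median_tps` and `median_meets_floor` are hypothetical helpers, and the output format of `apr bench` is not modelled here; the samples are assumed to be already parsed into `f32` tokens/s.

```rust
/// Median of throughput samples in tokens/s.
/// Returns None for an empty slice or if any sample is non-finite,
/// so a corrupt run cannot produce a usable median.
fn median_tps(samples: &[f32]) -> Option<f32> {
    if samples.is_empty() || samples.iter().any(|s| !s.is_finite()) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Even count: average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}

/// True only when a median exists and is at or above `floor`.
fn median_meets_floor(samples: &[f32], floor: f32) -> bool {
    matches!(median_tps(samples), Some(m) if m >= floor)
}
```

Using the median rather than the mean keeps a single slow or fast outlier iteration from flipping the verdict.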

Design: mirrors MODEL-2 SHIP-020 single-f32-threshold shape (PR #1005
not yet on main). Once both ship, the two `verdict_from_decode_tps_*`
fns should be deduplicated into a single parameterized helper
`verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with
model-specific floors pinned as module-level consts. MODEL-1 floor is
30.0 (7B Q4_K, bandwidth-bound at ~3.5× the 370M size); MODEL-2 floor
is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).
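
A possible shape for the deduplicated helper proposed above. The signature `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` and the two const names are taken from these commit messages; the body is an assumption, and neither crate necessarily contains this code.

```rust
/// Shared verdict type for single-threshold ship gates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThresholdVerdict {
    Pass,
    Fail,
}

/// MODEL-1 floor: 7B Q4_K decode throughput on RTX 4090 (AC-SHIP1-007).
pub const AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B: f32 = 30.0;
/// MODEL-2 floor: 370M sovereign decode throughput on RTX 4090 (AC-SHIP2-010).
pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

/// Parameterized decision rule (assumed body): Pass only for a finite
/// measurement at or above `floor`; everything else, including NaN and
/// infinities, is a conservative Fail.
pub fn verdict_from_decode_tps(measured: f32, floor: f32) -> ThresholdVerdict {
    if measured.is_finite() && measured >= floor {
        ThresholdVerdict::Pass
    } else {
        ThresholdVerdict::Fail
    }
}
```

With this shape, the same 45 tok/s measurement passes against the MODEL-1 floor and fails against the MODEL-2 floor, and each gate's provenance pin reduces to asserting its own const.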

MODEL-1 AC-SHIP1 coverage: 4/10 touched (SHIP-009 + SHIP-008 + SHIP-006
+ SHIP-002) → **5/10** touched (+ SHIP-007).

Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `pv validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.
Clippy: `cargo clippy -p aprender-core --lib -- -D warnings` → clean.
Fmt: `cargo fmt --check -p aprender-core` → clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 23, 2026
…lean branch) (#1019)

* feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL (clean branch)

* ci: retrigger after 3 disk-guard race failures

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>