Problem
StageCountAutoCarveout computes pipeline stages using the architecture family's maximum SharedMemoryCapacity, not the actual device's runtime shared memory limit. Within the SM12x family:
- SM120 (B200, RTX PRO 6000): 228 KiB shared memory per block
- SM121 (DGX Spark GB10): 99 KiB shared memory per block
When CUTLASS compiles grouped GEMM kernels for compute_120f, StageCountAutoCarveout selects stage counts that fit 228 KiB. At runtime on SM121, gemm.initialize() fails because cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize) receives a value exceeding the 99 KiB limit.
Specific failure
MoE grouped GEMM with FP4 (__nv_fp4_e2m1), tile CtaShape128x256x64B:
MoE grouped GEMM requires 102400 bytes shared memory but device supports 101376
(cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:39)
102400 bytes (100 KiB) vs 101376 bytes (99 KiB) -- exactly 1 KiB over.
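For concreteness, the failing check reduces to host-side arithmetic (constants copied from the error message above; the helper names are illustrative, not a CUTLASS API):

```cpp
// Requested dynamic SMEM: chosen at compile time against SM120's 228 KiB spec.
constexpr long kRequestedBytes = 102400;   // 100 KiB
// SM121's actual cudaDevAttrMaxSharedMemoryPerBlockOptin value.
constexpr long kDeviceLimitBytes = 101376; // 99 KiB

// Mirrors the check that gemm.initialize() effectively performs before calling
// cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize).
constexpr bool fits() { return kRequestedBytes <= kDeviceLimitBytes; }
constexpr long overshoot() { return kRequestedBytes - kDeviceLimitBytes; }
```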
Impact
In TensorRT-LLM on DGX Spark (SM121):
- 13 of 16 autotuner MoE GEMM tactics fail
- Surviving tactics are small/inefficient: 4.8 tok/s (vs 24 tok/s with llama.cpp software dequant)
- Users must bypass CUTLASS entirely (e.g., Triton MoE backend) to get reasonable performance (32-40 tok/s)
Root cause
StageCountAutoCarveout is a compile-time policy. For a given architecture target (sm_120f), it picks the maximum stages that fit the architecture's shared memory spec. But SM12x is not homogeneous -- SM121 has less than half the SMEM of SM120. There's no runtime path to reduce stages based on cudaDevAttrMaxSharedMemoryPerBlockOptin.
Current workaround (in TensorRT-LLM)
We filter candidate tile configs at the TRT-LLM level before they reach CUTLASS, keeping only tiles that fit within the device's actual SMEM. This works but pushes device-awareness to every consumer of CUTLASS.
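A minimal sketch of that filter (the `TileConfig` type and byte figures are illustrative stand-ins, not the actual TRT-LLM structures):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a candidate tile configuration; smem_bytes is the
// kernel's dynamic shared-memory requirement for that tile.
struct TileConfig {
  const char* name;
  int smem_bytes;
};

// Drop every tile whose SMEM demand exceeds the device's actual limit
// (queried once at runtime via cudaDevAttrMaxSharedMemoryPerBlockOptin).
std::vector<TileConfig> filter_tiles(std::vector<TileConfig> tiles,
                                     int device_smem_limit) {
  tiles.erase(std::remove_if(tiles.begin(), tiles.end(),
                             [&](const TileConfig& t) {
                               return t.smem_bytes > device_smem_limit;
                             }),
              tiles.end());
  return tiles;
}
```

On SM121 (limit 101376) this discards the 102400-byte CtaShape128x256x64B candidate while keeping smaller tiles; the drawback is that every CUTLASS consumer has to repeat this query-and-filter step.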
Suggested fix
A runtime-aware StageCount policy that queries cudaDevAttrMaxSharedMemoryPerBlockOptin and clamps stages to fit, or a mechanism to pass max SMEM as a runtime parameter to the grouped GEMM kernel.
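Such a policy could amount to a small clamp at initialization time. A sketch under assumed per-stage accounting (this is not an existing CUTLASS interface; real accounting would come from the collective mainloop's SharedStorage):

```cpp
#include <algorithm>

// Reduce the compile-time stage count until the total SMEM footprint fits the
// device's opt-in limit. bytes_per_stage and carveout_bytes are illustrative.
int clamp_stages(int compile_time_stages, int bytes_per_stage,
                 int carveout_bytes, int device_smem_limit) {
  int budget = device_smem_limit - carveout_bytes;
  int max_fitting = budget / bytes_per_stage;
  // Keep at least 2 stages so the pipeline can still double-buffer.
  return std::max(2, std::min(compile_time_stages, max_fitting));
}
```

With an illustrative 20480-byte per-stage footprint (5 stages matches the 102400-byte request above) and a 1 KiB carveout, SM121's 101376-byte limit clamps the kernel to 4 stages instead of failing outright, while SM120's 228 KiB keeps all 5.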
Environment
- Device: DGX Spark GB10 (SM121, cudaDevAttrMaxSharedMemoryPerBlockOptin = 101376)
- CUDA: 13.1
- CUTLASS: 4.4.x (via TensorRT-LLM 1.3.0rc10)
- Arch target: compute_120f / sm_120f
Related issues