Problem
StageCountAutoCarveout computes pipeline stages using the architecture family's maximum SharedMemoryCapacity, not the actual device's runtime shared memory limit. Within the SM12x family:
- SM120 (B200, RTX PRO 6000): 228 KiB shared memory per block
- SM121 (DGX Spark GB10): 99 KiB shared memory per block
When CUTLASS compiles grouped GEMM kernels for compute_120f, StageCountAutoCarveout selects stage counts that fit 228 KiB. At runtime on SM121, gemm.initialize() fails because cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize) receives a value exceeding the 99 KiB limit.
Specific failure
MoE grouped GEMM with FP4 (__nv_fp4_e2m1), tile CtaShape128x256x64B:
MoE grouped GEMM requires 102400 bytes shared memory but device supports 101376
(cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:39)
102400 bytes (100 KiB) vs 101376 bytes (99 KiB) -- exactly 1 KiB over.
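For concreteness, the failing check reduces to host-side arithmetic (constants copied from the error message above; the helper names are illustrative, not a CUTLASS API):

```cpp
// Requested dynamic SMEM: chosen at compile time against SM120's 228 KiB spec.
constexpr long kRequestedBytes = 102400;   // 100 KiB
// SM121's actual cudaDevAttrMaxSharedMemoryPerBlockOptin value.
constexpr long kDeviceLimitBytes = 101376; // 99 KiB

// Mirrors the check that gemm.initialize() effectively performs before calling
// cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize).
constexpr bool fits() { return kRequestedBytes <= kDeviceLimitBytes; }
constexpr long overshoot() { return kRequestedBytes - kDeviceLimitBytes; }
```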
Impact
In TensorRT-LLM on DGX Spark (SM121):
- 13 of 16 autotuner MoE GEMM tactics fail
- Surviving tactics are small/inefficient: 4.8 tok/s (vs 24 tok/s with llama.cpp software dequant)
- Users must bypass CUTLASS entirely (e.g., Triton MoE backend) to get reasonable performance (32-40 tok/s)
Root cause
StageCountAutoCarveout is a compile-time policy. For a given architecture target (sm_120f), it picks the maximum stages that fit the architecture's shared memory spec. But SM12x is not homogeneous -- SM121 has less than half the SMEM of SM120. There's no runtime path to reduce stages based on cudaDevAttrMaxSharedMemoryPerBlockOptin.
Current workaround (in TensorRT-LLM)
We filter candidate tile configs at the TRT-LLM level before they reach CUTLASS, keeping only tiles that fit within the device's actual SMEM. This works but pushes device-awareness to every consumer of CUTLASS.
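A minimal sketch of that filter (the `TileConfig` type and byte figures are illustrative stand-ins, not the actual TRT-LLM structures):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a candidate tile configuration; smem_bytes is the
// kernel's dynamic shared-memory requirement for that tile.
struct TileConfig {
  const char* name;
  int smem_bytes;
};

// Drop every tile whose SMEM demand exceeds the device's actual limit
// (queried once at runtime via cudaDevAttrMaxSharedMemoryPerBlockOptin).
std::vector<TileConfig> filter_tiles(std::vector<TileConfig> tiles,
                                     int device_smem_limit) {
  tiles.erase(std::remove_if(tiles.begin(), tiles.end(),
                             [&](const TileConfig& t) {
                               return t.smem_bytes > device_smem_limit;
                             }),
              tiles.end());
  return tiles;
}
```

On SM121 (limit 101376) this discards the 102400-byte CtaShape128x256x64B candidate while keeping smaller tiles; the drawback is that every CUTLASS consumer has to repeat this query-and-filter step.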
Suggested fix
A runtime-aware StageCount policy that queries cudaDevAttrMaxSharedMemoryPerBlockOptin and clamps stages to fit, or a mechanism to pass max SMEM as a runtime parameter to the grouped GEMM kernel.
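Such a policy could amount to a small clamp at initialization time. A sketch under assumed per-stage accounting (this is not an existing CUTLASS interface; real accounting would come from the collective mainloop's SharedStorage):

```cpp
#include <algorithm>

// Reduce the compile-time stage count until the total SMEM footprint fits the
// device's opt-in limit. bytes_per_stage and carveout_bytes are illustrative.
int clamp_stages(int compile_time_stages, int bytes_per_stage,
                 int carveout_bytes, int device_smem_limit) {
  int budget = device_smem_limit - carveout_bytes;
  int max_fitting = budget / bytes_per_stage;
  // Keep at least 2 stages so the pipeline can still double-buffer.
  return std::max(2, std::min(compile_time_stages, max_fitting));
}
```

With an illustrative 20480-byte per-stage footprint (5 stages matches the 102400-byte request above) and a 1 KiB carveout, SM121's 101376-byte limit clamps the kernel to 4 stages instead of failing outright, while SM120's 228 KiB keeps all 5.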
Environment
- Device: DGX Spark GB10 (SM121, cudaDevAttrMaxSharedMemoryPerBlockOptin = 101376)
- CUDA: 13.1
- CUTLASS: 4.4.x (via TensorRT-LLM 1.3.0rc10)
- Arch target: compute_120f / sm_120f
Related issues