# [Feature Request] Add Snake activation functor for Epilogue Visitor Tree (EVT)

**Which component requires the feature?**
CUTLASS C++
**Is your feature request related to a problem? Please describe.**
CUTLASS provides several built-in activation functors (GELU, ReLU, SiLU, etc.) that can be used inside the Epilogue Visitor Tree framework. However, there is currently no support for the Snake activation function (x + sin²(αx) / α, Ziyin et al. 2020), a parametric activation that has seen adoption in audio synthesis models (neural vocoders) and other domains where periodic inductive bias is beneficial.
Users who need Snake activation in fused GEMM epilogues currently have to fall back to unfused workflows, running the activation as a separate kernel after GEMM, which incurs additional global memory round-trips.
**Describe the solution you'd like**
Add a `SnakeOp` functor to CUTLASS's activation function library, following the same pattern as existing functors like `GELU`. The implementation would:

- Support scalar and `Array<T, N>` specializations
- Take a learnable per-channel α parameter via `Sm90RowBroadcast`
- Be composable within the SM90 EVT framework (e.g. `Sm90Compute<SnakeOp, ...>`)
I have a working implementation targeting SM90 (Hopper) that I'd be happy to contribute as a PR. The EVT tree structure is:
```cpp
// EVT tree: SnakeOp(AccFetch, RowBroadcast(alpha))
using SnakeEpilogue = Sm90EVT<
    Sm90Compute<SnakeOp, ElementOut, ElementEpi,
                cutlass::FloatRoundStyle::round_to_nearest>,
    Sm90AccFetch,
    Sm90RowBroadcast<0, TileShape, float>
>;
```
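For reference, the scalar math such a functor would compute can be sketched in plain host-side C++ (this is an illustrative sketch, not the contributed code; the `SnakeOp` name and the small-α guard threshold are assumptions, and a device implementation would use the `__sinf` fast-math intrinsic instead of `std::sin`):

```cpp
#include <cassert>
#include <cmath>

// Illustrative scalar sketch of the Snake activation,
// snake(x; alpha) = x + sin^2(alpha * x) / alpha,
// written as a callable functor in the style of CUTLASS's
// activation functors. As alpha -> 0 the function tends to the
// identity, so small alpha is guarded (threshold is illustrative).
struct SnakeOp {
  float operator()(float x, float alpha) const {
    if (std::fabs(alpha) < 1e-8f) {
      return x;  // limit of x + sin^2(alpha*x)/alpha as alpha -> 0
    }
    float s = std::sin(alpha * x);
    return x + s * s / alpha;
  }
};
```

The vectorized `Array<T, N>` specialization would simply apply this elementwise per fragment, as the existing functors do.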
**Describe alternatives you've considered**
- **Unfused approach:** Running Snake activation as a separate PyTorch/custom CUDA kernel after the GEMM. This works but requires an extra global memory read and write of the full output tensor.
- **Approximating with existing functors:** Snake cannot be accurately decomposed into existing CUTLASS activation primitives, since it requires `sin²` and a per-channel learnable parameter.
**Additional context**

**Benchmarks (H100 SXM5 80GB, SM90)**
Implicit GEMM shapes from a production neural vocoder (Conv1d k=7 mapped to GEMM as M=T, N=C_out, K=C_in×7):
| GEMM Shape (M × N × K) | cuDNN Conv + Separate Snake | CUTLASS Fused EVT | Speedup |
|---|---|---|---|
| 1200 × 768 × 5376 | 0.108 – 0.138 ms | 0.078 ms | 1.39 – 1.76× |
| 6000 × 384 × 2688 | 0.125 – 0.664 ms | 0.078 ms | 1.61 – 8.51× |
| 24000 × 192 × 1344 | 0.174 – 0.193 ms | 0.082 ms | 2.12 – 2.35× |
| 48000 × 96 × 672 | 0.168 – 0.170 ms | 0.063 – 0.064 ms | 2.65 – 2.70× |
Median speedup across all 12 shapes is approximately 2.1×, with the benefit primarily coming from eliminating the extra memory round-trip for the activation kernel.
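The shape mapping and the traffic argument above can be sketched in a few lines of C++ (the function names are illustrative; the traffic estimate counts only the one extra read and one extra write of the M × N output that an unfused activation pass performs, ignoring launch overhead):

```cpp
#include <cassert>
#include <cstdint>

// Conv1d -> implicit GEMM shape mapping used for the benchmark
// shapes above: M = T (time steps), N = C_out, K = C_in * kernel.
struct GemmShape { int64_t m, n, k; };

GemmShape conv1d_to_gemm(int64_t T, int64_t c_in, int64_t c_out,
                         int64_t kernel) {
  return {T, c_out, c_in * kernel};
}

// Extra global-memory traffic of an unfused activation pass:
// one read plus one write of the full M x N output tensor.
int64_t unfused_activation_bytes(GemmShape s, int64_t bytes_per_elem) {
  return 2 * s.m * s.n * bytes_per_elem;
}
```

For the first benchmark row (T = 1200, C_in = 768, C_out = 768, k = 7), this mapping reproduces the 1200 × 768 × 5376 shape, and the unfused pass moves an extra ~3.7 MB in fp16.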
**Implementation**
The implementation follows CUTLASS's existing functor conventions (scalar and `Array<T, N>` specializations, the `__sinf` fast-math intrinsic). I'm ready to open a PR if there's interest.
**Reference**

Ziyin, L., Hartwig, T., & Ueda, M. "Neural Networks Fail to Learn Periodic Functions and How to Fix It." NeurIPS 2020. [arXiv:2006.08195](https://arxiv.org/abs/2006.08195). Introduces x + (1/α) sin²(αx).