
Add fused GatedDeltaNet decode Triton kernel#18865

Open
Gasoonjia wants to merge 9 commits into gasoonjia/flashdecoding-pp-async-softmax from fused-deltanet-decode

Conversation

@Gasoonjia
Contributor

@Gasoonjia Gasoonjia commented Apr 14, 2026

Fuse Q/K/V split, L2 normalization, head repeat, gating computation, and delta-rule recurrent state update into a single Triton kernel for decode (T=1). Replaces ~6 small AOTI-generated kernels with one, reducing GatedDeltaNet kernel time by ~62% and improving end-to-end decode throughput by ~2% (106 -> 108.5 tok/s on A100).
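For orientation, the per-head decode-step math being fused can be sketched in NumPy. This is an illustrative reading of a standard gated delta-rule update, not the PR's exact formulation; the function name, the scalar log-gate `g`, and the write strength `beta` are assumptions.

```python
import numpy as np

def gated_delta_decode_step(S, q, k, v, g, beta, eps=1e-6):
    """One decode step (T=1) of a gated delta-rule recurrence, single head.

    S: (K, V) recurrent state; q, k: (K,); v: (V,)
    g: scalar log-gate (state decay); beta: scalar write strength.
    Illustrative reference only -- the PR fuses math like this (plus the
    Q/K/V split and head repeat) into one Triton kernel.
    """
    # L2-normalize q and k (one of the ops the kernel fuses)
    q = q / np.sqrt(np.sum(q * q) + eps)
    k = k / np.sqrt(np.sum(k * k) + eps)
    # Gating: decay the recurrent state
    S = S * np.exp(g)
    # Delta rule: correct the state toward the new (k, v) association
    v_old = k @ S                            # (V,) current readout for key k
    S = S + np.outer(k, beta * (v - v_old))
    # Read out with the query
    o = q @ S                                # (V,)
    return S, o
```

A useful sanity property of the delta rule: with `g=0` and `beta=1`, reading the state back with the just-written (normalized) key recovers `v` almost exactly.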
@pytorch-bot

pytorch-bot Bot commented Apr 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18865

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 6 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit f380b22 with merge base c48ea12:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Apr 14, 2026
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 14, 2026 07:39 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia marked this pull request as ready for review April 15, 2026 23:40
@Gasoonjia Gasoonjia requested a review from lucylq as a code owner April 15, 2026 23:40


# bf16 kernel vs fp32 reference tolerance.
MAX_ABS_TOL = 0.05

why so high?

o_base = O_ptr + bid * stride_ob + h * stride_oh

# ====== Main computation ======
if BLOCK_K >= K:

Does this get traced through?

Contributor Author


Yes, it gets traced through during Triton kernel autotuning. K is 128 in the Qwen3.5 MoE case, while BLOCK_K can be autotuned over [64, 128].
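Since `BLOCK_K` is presumably a `tl.constexpr` autotune key, `if BLOCK_K >= K` specializes at compile time, and autotuning over [64, 128] compiles both variants. The helper below is a schematic NumPy emulation of that control flow, not the kernel's actual load code; names are illustrative.

```python
import numpy as np

def load_k_dim(x, BLOCK_K):
    """Emulate the kernel's two K-dimension load paths.

    If BLOCK_K covers the whole K dimension, a single masked "load"
    suffices; otherwise the kernel would walk K in BLOCK_K-sized tiles.
    This sketch only mirrors the branch structure.
    """
    K = x.shape[0]
    if BLOCK_K >= K:
        # Single masked load: offsets past K are masked to zero.
        offs = np.arange(BLOCK_K)
        mask = offs < K
        buf = np.where(mask, np.pad(x, (0, BLOCK_K - K)), 0.0)
        return buf[:K]
    # Tiled path: process K in BLOCK_K-sized chunks.
    out = np.empty_like(x)
    for start in range(0, K, BLOCK_K):
        end = min(start + BLOCK_K, K)
        out[start:end] = x[start:end]
    return out
```

Both paths must agree on the result; only the memory-access pattern differs.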

# Qwen3.5 MoE dimensions (used across tests)
NUM_K_HEADS = 16
NUM_V_HEADS = 32
HEAD_K_DIM = 128

try with a smaller K to exercise the other branch
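One way to act on this suggestion is to pick HEAD_K_DIM values so that, against the autotune space BLOCK_K ∈ [64, 128], both sides of `if BLOCK_K >= K` are hit. The sweep-planning helper below is a hypothetical sketch; `branches_covered` and the dimension lists are not from the PR.

```python
import itertools

# Hypothetical sweep: 256 forces BLOCK_K < K for every autotune config,
# while 64 guarantees the single-load branch.
HEAD_K_DIMS = [64, 128, 256]
BLOCK_KS = [64, 128]

def branches_covered(head_k_dims, block_ks):
    """Report which kernel branches a (K, BLOCK_K) sweep would exercise."""
    hit = set()
    for K, BLOCK_K in itertools.product(head_k_dims, block_ks):
        hit.add("single_load" if BLOCK_K >= K else "tiled_loop")
    return hit
```

With only the Qwen3.5 MoE dimension (K = 128) and BLOCK_K fixed at 128, the tiled branch would never run, which is the gap the reviewer is pointing at.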


@digantdesai digantdesai left a comment


Do a sweep for prompt_len: {128, 512, 2048} and decode_len: {128} to see if this works OK with small and large states. Update the PR summary.

@Gasoonjia Gasoonjia changed the base branch from cuda-graph to gasoonjia/flashdecoding-pp-async-softmax April 23, 2026 05:41

Labels

ciflow/cuda, CLA Signed

2 participants