
Add fused GatedDeltaNet decode Triton kernel#18865

Open
Gasoonjia wants to merge 9 commits into gasoonjia/flashdecoding-pp-async-softmax from fused-deltanet-decode

Conversation

@Gasoonjia
Contributor

@Gasoonjia Gasoonjia commented Apr 14, 2026

Fuse Q/K/V split, L2 normalization, head repeat, gating computation, and delta-rule recurrent state update into a single Triton kernel for decode (T=1). Replaces ~6 small AOTI-generated kernels with one, reducing GatedDeltaNet kernel time by ~62% and improving end-to-end decode throughput by ~2% (106 -> 108.5 tok/s on A100).
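For orientation, the per-head decode-step math being fused can be sketched in NumPy. This is an illustrative reading of a standard gated delta-rule update, not the PR's exact formulation; the function name, the scalar log-gate `g`, and the write strength `beta` are assumptions.

```python
import numpy as np

def gated_delta_decode_step(S, q, k, v, g, beta, eps=1e-6):
    """One decode step (T=1) of a gated delta-rule recurrence, single head.

    S: (K, V) recurrent state; q, k: (K,); v: (V,)
    g: scalar log-gate (state decay); beta: scalar write strength.
    Illustrative reference only -- the PR fuses math like this (plus the
    Q/K/V split and head repeat) into one Triton kernel.
    """
    # L2-normalize q and k (one of the ops the kernel fuses)
    q = q / np.sqrt(np.sum(q * q) + eps)
    k = k / np.sqrt(np.sum(k * k) + eps)
    # Gating: decay the recurrent state
    S = S * np.exp(g)
    # Delta rule: correct the state toward the new (k, v) association
    v_old = k @ S                            # (V,) current readout for key k
    S = S + np.outer(k, beta * (v - v_old))
    # Read out with the query
    o = q @ S                                # (V,)
    return S, o
```

A useful sanity property of the delta rule: with `g=0` and `beta=1`, reading the state back with the just-written (normalized) key recovers `v` almost exactly.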
@pytorch-bot

pytorch-bot Bot commented Apr 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18865

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 6 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit f380b22 with merge base c48ea12:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Apr 14, 2026
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 14, 2026 07:39 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia marked this pull request as ready for review April 15, 2026 23:40
@Gasoonjia Gasoonjia requested a review from lucylq as a code owner April 15, 2026 23:40


# bf16 kernel vs fp32 reference tolerance.
MAX_ABS_TOL = 0.05

why so high?

o_base = O_ptr + bid * stride_ob + h * stride_oh

# ====== Main computation ======
if BLOCK_K >= K:

Does this get traced through?

Contributor Author


Yes, it gets traced through during Triton kernel autotuning. K is 128 in the Qwen3.5 MoE case, while BLOCK_K can be autotuned over [64, 128].
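Since `BLOCK_K` is presumably a `tl.constexpr` autotune key, `if BLOCK_K >= K` specializes at compile time, and autotuning over [64, 128] compiles both variants. The helper below is a schematic NumPy emulation of that control flow, not the kernel's actual load code; names are illustrative.

```python
import numpy as np

def load_k_dim(x, BLOCK_K):
    """Emulate the kernel's two K-dimension load paths.

    If BLOCK_K covers the whole K dimension, a single masked "load"
    suffices; otherwise the kernel would walk K in BLOCK_K-sized tiles.
    This sketch only mirrors the branch structure.
    """
    K = x.shape[0]
    if BLOCK_K >= K:
        # Single masked load: offsets past K are masked to zero.
        offs = np.arange(BLOCK_K)
        mask = offs < K
        buf = np.where(mask, np.pad(x, (0, BLOCK_K - K)), 0.0)
        return buf[:K]
    # Tiled path: process K in BLOCK_K-sized chunks.
    out = np.empty_like(x)
    for start in range(0, K, BLOCK_K):
        end = min(start + BLOCK_K, K)
        out[start:end] = x[start:end]
    return out
```

Both paths must agree on the result; only the memory-access pattern differs.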

# Qwen3.5 MoE dimensions (used across tests)
NUM_K_HEADS = 16
NUM_V_HEADS = 32
HEAD_K_DIM = 128

try with a smaller K to exercise the other branch
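One way to act on this suggestion is to pick HEAD_K_DIM values so that, against the autotune space BLOCK_K ∈ [64, 128], both sides of `if BLOCK_K >= K` are hit. The sweep-planning helper below is a hypothetical sketch; `branches_covered` and the dimension lists are not from the PR.

```python
import itertools

# Hypothetical sweep: 256 forces BLOCK_K < K for every autotune config,
# while 64 guarantees the single-load branch.
HEAD_K_DIMS = [64, 128, 256]
BLOCK_KS = [64, 128]

def branches_covered(head_k_dims, block_ks):
    """Report which kernel branches a (K, BLOCK_K) sweep would exercise."""
    hit = set()
    for K, BLOCK_K in itertools.product(head_k_dims, block_ks):
        hit.add("single_load" if BLOCK_K >= K else "tiled_loop")
    return hit
```

With only the Qwen3.5 MoE dimension (K = 128) and BLOCK_K fixed at 128, the tiled branch would never run, which is the gap the reviewer is pointing at.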


@digantdesai digantdesai left a comment


Do a sweep for prompt_len: {128, 512, 2048} and decode_len: {128} to see if this works OK with small and large states. Update the PR summary.

@Gasoonjia Gasoonjia changed the base branch from cuda-graph to gasoonjia/flashdecoding-pp-async-softmax April 23, 2026 05:41

Labels

ciflow/cuda, CLA Signed

2 participants