
[ET-VK] Add apply_rotary_emb_interleaved fused operator #19115

Open
SS-JIA wants to merge 3 commits into gh/SS-JIA/524/base from gh/SS-JIA/524/head

Conversation

Contributor

@SS-JIA SS-JIA commented Apr 24, 2026

Stack from ghstack (oldest at bottom):

Introduces `et_vk.apply_rotary_emb_interleaved`, a fused Vulkan custom operator for the "complex-number" RoPE variant used by SAM2/EdgeTAM's memory attention. This replaces a 12+-op layout-shuffle chain (`view/unbind/stack/view` -> lowers to `slice_copy + squeeze_copy + unsqueeze_copy + cat + view_copy`) with a single GPU dispatch.

**Math**: On pair-interleaved inputs where element `2k` is real and `2k+1` is imag, for each `k in [0, C/2)`:

  out[2k]   = x[2k] * cos[k] - x[2k+1] * sin[k]
  out[2k+1] = x[2k] * sin[k] + x[2k+1] * cos[k]
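
For reference, a minimal eager-mode sketch of this math (not the actual CPU kernel; the reshape of `freqs_cis` below is an assumption based on the op contract):

```python
import torch

def rotary_emb_interleaved_ref(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: [B, N, C] with real/imag pairs interleaved on the last dim.
    # freqs_cis: any rank with N * C elements, cos/sin interleaved innermost.
    B, N, C = x.shape
    fc = freqs_cis.reshape(N, C // 2, 2)
    cos, sin = fc[..., 0], fc[..., 1]        # each [N, C/2], broadcasts over B
    xr, xi = x[..., 0::2], x[..., 1::2]      # real / imag parts, each [B, N, C/2]
    out = torch.empty_like(x)
    out[..., 0::2] = xr * cos - xi * sin     # out[2k]   = x[2k]*cos[k] - x[2k+1]*sin[k]
    out[..., 1::2] = xr * sin + xi * cos     # out[2k+1] = x[2k]*sin[k] + x[2k+1]*cos[k]
    return out
```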

**Why a new op instead of reusing `et_vk.apply_rotary_emb`**: The existing LLM-oriented operator takes `(xq, xk)` pairs with separate `freqs_cos` / `freqs_sin` tensors and 4D `(B, S, H, D)` shapes optimized for LLM prefill two-texel-per-thread reuse. SAM2's memory attention passes a single 3D `(B, N, C)` tensor through RoPE (no heads dim) with a fused `[N, C/2, 2]` freqs tensor. Reusing the existing op would force runtime splits of the fused freqs and double-dispatch Q/K separately, defeating the fuse. A sibling shader is tighter for both workloads.

**Op contract**: `apply_rotary_emb_interleaved(x, freqs_cis) -> Tensor` where `x` is `[B, N, C]` and `freqs_cis` is any rank with `N*C` elements and the `cos`/`sin` values interleaved on the innermost dim. In EdgeTAM's memory attention the native shape is `[1, N, C/2, 2]`; passing it through without a reshape keeps the exported graph clean of bracketing view_copy dispatches.
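
A hypothetical call site with EdgeTAM-like shapes (the sizes are illustrative, and the `torch.ops.et_vk` dispatch path assumes a registration along the lines described below):

```python
import torch

x = torch.randn(1, 4096, 256)             # [B, N, C]
freqs_cis = torch.randn(1, 4096, 128, 2)  # native [1, N, C/2, 2]; N*C elements total
out = torch.ops.et_vk.apply_rotary_emb_interleaved(x, freqs_cis)
assert out.shape == x.shape               # output keeps the shape of x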

**Shader**: Single-dispatch kernel, one texel out per thread. Each thread reads one `x` texel (2 real/imag pairs) and the corresponding `freqs_cis` entries (2 cos/sin pairs) flat-indexed from buffer storage, writes one output texel. `x` and output support buffer + texture3d; `freqs_cis` is always buffer-storage (small tensor, flat indexing is simplest). Supports fp16 and fp32 via the `FP_T` dtype iterator in the YAML.
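
Loosely, one shader invocation computes the following (modeled in Python; the actual texel packing and freqs indexing in the GLSL kernel may differ, this only illustrates the 4-in/4-out structure per thread):

```python
def one_texel(x_row, freqs_row, t):
    # x_row: flat channel values for one (b, n) position; freqs_row: flat
    # cos/sin values for that position; t: output texel index (4 channels).
    out = [0.0] * 4
    for p in range(2):                    # two real/imag pairs per texel
        k = 2 * t + p                     # frequency index for this pair
        c, s = freqs_row[2 * k], freqs_row[2 * k + 1]
        re, im = x_row[4 * t + 2 * p], x_row[4 * t + 2 * p + 1]
        out[2 * p] = re * c - im * s
        out[2 * p + 1] = re * s + im * c
    return out
```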

**Op registration**: `Meta` kernel returns `torch.empty_like(x)` to keep the op opaque during `torch.export`. `CPU` kernel holds the reference math so non-Vulkan backends keep working. `op_registry.py` pins `freqs_cis` storage to `CONTIGUOUS_BUFFER` while leaving `x` at `CONTIGUOUS_ANY`.
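
A rough sketch of what such a registration could look like via `torch.library` (ExecuTorch's actual registration plumbing may be structured differently; the schema string and the reuse of the reference helper above are assumptions):

```python
import torch
from torch.library import Library, impl

et_vk_lib = Library("et_vk", "FRAGMENT")  # assumes the et_vk namespace already exists
et_vk_lib.define("apply_rotary_emb_interleaved(Tensor x, Tensor freqs_cis) -> Tensor")

@impl(et_vk_lib, "apply_rotary_emb_interleaved", "Meta")
def _meta(x, freqs_cis):
    # Shape/dtype propagation only; keeps the op opaque during torch.export.
    return torch.empty_like(x)

@impl(et_vk_lib, "apply_rotary_emb_interleaved", "CPU")
def _cpu(x, freqs_cis):
    # Reference math (see the eager-mode sketch above) so non-Vulkan
    # backends keep working.
    return rotary_emb_interleaved_ref(x, freqs_cis)
```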

Differential Revision: [D102360202](https://our.internmc.facebook.com/intern/diff/D102360202/)


pytorch-bot Bot commented Apr 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19115

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 3 Cancelled Jobs, 2 Unrelated Failures

As of commit 9ad2459 with merge base eef7921:

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Apr 24, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

SS-JIA pushed a commit that referenced this pull request Apr 24, 2026
Pull Request resolved: #19115

ghstack-source-id: 372548626
@exported-using-ghexport

Differential Revision: [D102360202](https://our.internmc.facebook.com/intern/diff/D102360202/)
SS-JIA pushed a commit that referenced this pull request Apr 25, 2026
Pull Request resolved: #19115

ghstack-source-id: 372777610
@exported-using-ghexport

Differential Revision: [D102360202](https://our.internmc.facebook.com/intern/diff/D102360202/)

Labels

CLA Signed, fb-exported, meta-exported
