Is this a duplicate?
Type of Bug
Performance
Component
CUB
Describe the bug
When working on #8257, I initially tried a `zip_iterator` approach for `cub::DeviceTransform`:
```cpp
auto __zip_first = ::cuda::zip_iterator{__first1, __first2};
auto __zip_last  = ::cuda::zip_iterator{__last1, __ret};
(void) __transform_dispatch(
  __policy, __zip_first, ::cuda::std::move(__zip_last), __zip_first, __swap_ranges_transform_fn{});
return __ret;
```
However, for many of the larger input types this was considerably slower than the implementation I ended up using:
```cpp
auto __zip_first = ::cuda::zip_iterator{__first1, __first2};
(void) __transform_dispatch(
  __policy,
  ::cuda::std::move(__first1),
  ::cuda::std::move(__last1),
  ::cuda::std::move(__first2),
  ::cuda::std::move(__zip_first),
  __swap_ranges_transform_fn{});
```
How to Reproduce
Run the `libcudacxx.bench.swap_ranges.basic.base` benchmark, swapping out the implementation for the `zip_iterator` approach.
The current implementation matches thrust's performance; the zipped implementation, however, gets us:
['thrust_swap_ranges.json', 'pstl_swap_ranges.json']
# base
## [1] NVIDIA RTX A6000
| T{ct} | Elements | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
| I8 | 2^16 | 7.494 us | 6.36% | 7.592 us | 7.11% | 0.099 us | 1.32% | SAME |
| I8 | 2^20 | 14.396 us | 9.41% | 14.253 us | 4.62% | -0.143 us | -0.99% | SAME |
| I8 | 2^24 | 114.209 us | 0.95% | 107.109 us | 0.58% | -7.100 us | -6.22% | FAST |
| I8 | 2^28 | 1.671 ms | 1.94% | 1.585 ms | 1.92% | -85.346 us | -5.11% | FAST |
| I16 | 2^16 | 7.787 us | 6.49% | 8.222 us | 5.74% | 0.436 us | 5.59% | SAME |
| I16 | 2^20 | 19.372 us | 7.61% | 21.961 us | 2.92% | 2.589 us | 13.36% | SLOW |
| I16 | 2^24 | 203.446 us | 0.32% | 204.764 us | 0.40% | 1.318 us | 0.65% | SLOW |
| I16 | 2^28 | 3.159 ms | 1.42% | 3.144 ms | 1.21% | -15.003 us | -0.47% | SAME |
| I32 | 2^16 | 8.289 us | 13.94% | 8.824 us | 15.04% | 0.535 us | 6.45% | SAME |
| I32 | 2^20 | 31.751 us | 2.10% | 35.951 us | 2.31% | 4.199 us | 13.22% | SLOW |
| I32 | 2^24 | 400.489 us | 0.25% | 405.749 us | 0.31% | 5.260 us | 1.31% | SLOW |
| I32 | 2^28 | 6.307 ms | 0.91% | 6.315 ms | 1.10% | 8.007 us | 0.13% | SAME |
| I64 | 2^16 | 9.821 us | 5.96% | 11.138 us | 10.67% | 1.317 us | 13.40% | SLOW |
| I64 | 2^20 | 57.347 us | 2.33% | 68.176 us | 2.36% | 10.830 us | 18.88% | SLOW |
| I64 | 2^24 | 794.324 us | 0.20% | 905.335 us | 0.39% | 111.011 us | 13.98% | SLOW |
| I64 | 2^28 | 12.636 ms | 0.51% | 14.738 ms | 1.25% | 2.102 ms | 16.64% | SLOW |
| I128 | 2^16 | 12.316 us | 6.74% | 14.150 us | 4.56% | 1.834 us | 14.89% | SLOW |
| I128 | 2^20 | 111.077 us | 0.67% | 140.295 us | 3.23% | 29.218 us | 26.30% | SLOW |
| I128 | 2^24 | 1.583 ms | 0.17% | 2.025 ms | 0.12% | 442.190 us | 27.93% | SLOW |
| I128 | 2^28 | 25.302 ms | 0.30% | 32.203 ms | 0.27% | 6.901 ms | 27.27% | SLOW |
| F32 | 2^16 | 8.248 us | 5.52% | 8.599 us | 6.49% | 0.351 us | 4.25% | SAME |
| F32 | 2^20 | 32.468 us | 1.83% | 36.484 us | 2.48% | 4.015 us | 12.37% | SLOW |
| F32 | 2^24 | 411.222 us | 3.13% | 416.424 us | 3.29% | 5.202 us | 1.27% | SAME |
| F32 | 2^28 | 6.307 ms | 0.94% | 6.312 ms | 1.12% | 4.881 us | 0.08% | SAME |
| F64 | 2^16 | 9.731 us | 7.42% | 11.290 us | 4.50% | 1.559 us | 16.02% | SLOW |
| F64 | 2^20 | 56.717 us | 1.87% | 69.159 us | 2.54% | 12.442 us | 21.94% | SLOW |
| F64 | 2^24 | 794.085 us | 0.19% | 895.293 us | 0.41% | 101.208 us | 12.75% | SLOW |
| F64 | 2^28 | 12.637 ms | 0.56% | 14.353 ms | 1.30% | 1.716 ms | 13.58% | SLOW |
Expected behavior
Both implementations should be equivalent and lead to the same performance.
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response