[BUG]: cub::DeviceTransform is much slower for cuda::zip_iterator input than without #8258

@miscco

Is this a duplicate?

Type of Bug

Performance

Component

CUB

Describe the bug

When working on #8257, I initially tried a zip_iterator approach for cub::DeviceTransform:

    auto __zip_first = ::cuda::zip_iterator{__first1, __first2};
    auto __zip_last  = ::cuda::zip_iterator{__last1, __ret};

    (void) __transform_dispatch(
      __policy, __zip_first, ::cuda::std::move(__zip_last), __zip_first, __swap_ranges_transform_fn{});
    return __ret;

However, this was considerably slower for many larger input types than what I ended up using:

    auto __zip_first = ::cuda::zip_iterator{__first1, __first2};

    (void) __transform_dispatch(
      __policy,
      ::cuda::std::move(__first1),
      ::cuda::std::move(__last1),
      ::cuda::std::move(__first2),
      ::cuda::std::move(__zip_first),
      __swap_ranges_transform_fn{});
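
To make the difference between the two call shapes concrete, here is a minimal host-only sketch. It is an illustration under assumptions, not the library code: `swap_fn`, `swap_ranges_zipped`, and `swap_ranges_separate` are hypothetical names, `swap_fn` only stands in for `__swap_ranges_transform_fn` (whose definition is not shown here), and plain loops with `std::tuple` model what dereferencing and assigning through a zipped iterator would do.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical stand-in for __swap_ranges_transform_fn (the real functor is not shown).
struct swap_fn
{
  // Shape 1: a single zipped input element arrives as one tuple.
  template <class T, class U>
  std::tuple<U, T> operator()(const std::tuple<T, U>& in) const
  {
    return {std::get<1>(in), std::get<0>(in)};
  }

  // Shape 2: two separate input elements arrive as two arguments.
  template <class T, class U>
  std::tuple<U, T> operator()(const T& a, const U& b) const
  {
    return {b, a};
  }
};

// Models the first variant: zipped input, zipped output.
template <class T>
void swap_ranges_zipped(std::vector<T>& a, std::vector<T>& b)
{
  assert(a.size() == b.size());
  for (std::size_t i = 0; i < a.size(); ++i)
  {
    const std::tuple<T, T> in{a[i], b[i]};   // read through the "zipped input"
    std::tie(a[i], b[i]) = swap_fn{}(in);    // write through the "zipped output"
  }
}

// Models the second variant: two separate inputs, zipped output.
template <class T>
void swap_ranges_separate(std::vector<T>& a, std::vector<T>& b)
{
  assert(a.size() == b.size());
  for (std::size_t i = 0; i < a.size(); ++i)
  {
    std::tie(a[i], b[i]) = swap_fn{}(a[i], b[i]);
  }
}
```

Both functions leave the two ranges in the same final state, which is why one would expect the two device-side variants to perform alike as well.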

How to Reproduce

Run the libcudacxx.bench.swap_ranges.basic.base benchmark with the implementation swapped out for the zip_iterator approach.

The current implementation performs on par with thrust. The zipped implementation, however, gives:

['thrust_swap_ranges.json', 'pstl_swap_ranges.json']
# base

## [1] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^16    |   7.494 us |       6.36% |   7.592 us |       7.11% |   0.099 us |   1.32% |   SAME   |
|   I8    |    2^20    |  14.396 us |       9.41% |  14.253 us |       4.62% |  -0.143 us |  -0.99% |   SAME   |
|   I8    |    2^24    | 114.209 us |       0.95% | 107.109 us |       0.58% |  -7.100 us |  -6.22% |   FAST   |
|   I8    |    2^28    |   1.671 ms |       1.94% |   1.585 ms |       1.92% | -85.346 us |  -5.11% |   FAST   |
|   I16   |    2^16    |   7.787 us |       6.49% |   8.222 us |       5.74% |   0.436 us |   5.59% |   SAME   |
|   I16   |    2^20    |  19.372 us |       7.61% |  21.961 us |       2.92% |   2.589 us |  13.36% |   SLOW   |
|   I16   |    2^24    | 203.446 us |       0.32% | 204.764 us |       0.40% |   1.318 us |   0.65% |   SLOW   |
|   I16   |    2^28    |   3.159 ms |       1.42% |   3.144 ms |       1.21% | -15.003 us |  -0.47% |   SAME   |
|   I32   |    2^16    |   8.289 us |      13.94% |   8.824 us |      15.04% |   0.535 us |   6.45% |   SAME   |
|   I32   |    2^20    |  31.751 us |       2.10% |  35.951 us |       2.31% |   4.199 us |  13.22% |   SLOW   |
|   I32   |    2^24    | 400.489 us |       0.25% | 405.749 us |       0.31% |   5.260 us |   1.31% |   SLOW   |
|   I32   |    2^28    |   6.307 ms |       0.91% |   6.315 ms |       1.10% |   8.007 us |   0.13% |   SAME   |
|   I64   |    2^16    |   9.821 us |       5.96% |  11.138 us |      10.67% |   1.317 us |  13.40% |   SLOW   |
|   I64   |    2^20    |  57.347 us |       2.33% |  68.176 us |       2.36% |  10.830 us |  18.88% |   SLOW   |
|   I64   |    2^24    | 794.324 us |       0.20% | 905.335 us |       0.39% | 111.011 us |  13.98% |   SLOW   |
|   I64   |    2^28    |  12.636 ms |       0.51% |  14.738 ms |       1.25% |   2.102 ms |  16.64% |   SLOW   |
|  I128   |    2^16    |  12.316 us |       6.74% |  14.150 us |       4.56% |   1.834 us |  14.89% |   SLOW   |
|  I128   |    2^20    | 111.077 us |       0.67% | 140.295 us |       3.23% |  29.218 us |  26.30% |   SLOW   |
|  I128   |    2^24    |   1.583 ms |       0.17% |   2.025 ms |       0.12% | 442.190 us |  27.93% |   SLOW   |
|  I128   |    2^28    |  25.302 ms |       0.30% |  32.203 ms |       0.27% |   6.901 ms |  27.27% |   SLOW   |
|   F32   |    2^16    |   8.248 us |       5.52% |   8.599 us |       6.49% |   0.351 us |   4.25% |   SAME   |
|   F32   |    2^20    |  32.468 us |       1.83% |  36.484 us |       2.48% |   4.015 us |  12.37% |   SLOW   |
|   F32   |    2^24    | 411.222 us |       3.13% | 416.424 us |       3.29% |   5.202 us |   1.27% |   SAME   |
|   F32   |    2^28    |   6.307 ms |       0.94% |   6.312 ms |       1.12% |   4.881 us |   0.08% |   SAME   |
|   F64   |    2^16    |   9.731 us |       7.42% |  11.290 us |       4.50% |   1.559 us |  16.02% |   SLOW   |
|   F64   |    2^20    |  56.717 us |       1.87% |  69.159 us |       2.54% |  12.442 us |  21.94% |   SLOW   |
|   F64   |    2^24    | 794.085 us |       0.19% | 895.293 us |       0.41% | 101.208 us |  12.75% |   SLOW   |
|   F64   |    2^28    |  12.637 ms |       0.56% |  14.353 ms |       1.30% |   1.716 ms |  13.58% |   SLOW   |
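
For reading the table, the `%Diff` column matches `(Cmp Time - Ref Time) / Ref Time`. The `Status` column is consistent with comparing the magnitude of `%Diff` against the smaller of the two noise values, which would explain why a 0.65% difference is still reported as SLOW. The sketch below encodes that reading. The classification rule is an assumption inferred from the rows above, not taken from the comparison tool's source, and `percent_diff`, `classify`, and `status` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

enum class status { same, fast, slow };

// Relative difference in percent, measured against the reference time.
// Both times must be given in the same unit.
inline double percent_diff(double ref_time, double cmp_time)
{
  assert(ref_time > 0.0);
  return 100.0 * (cmp_time - ref_time) / ref_time;
}

// Assumed rule (inferred from the table, not confirmed): a difference counts
// as significant only if it exceeds the smaller of the two noise percentages.
inline status classify(double pct_diff, double ref_noise_pct, double cmp_noise_pct)
{
  const double min_noise = std::fmin(ref_noise_pct, cmp_noise_pct);
  if (std::fabs(pct_diff) <= min_noise)
  {
    return status::same;
  }
  return pct_diff < 0.0 ? status::fast : status::slow;
}
```

For example, the I128 row at 2^28 elements (25.302 ms versus 32.203 ms) gives roughly 27.27%, well above either noise value.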

Expected behavior

Both implementations should be equivalent and deliver the same performance.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response