Is this a duplicate?
Type of Bug
Performance
Component
CUB
Describe the bug
When working on #8257, I initially tried a `zip_iterator` approach for `cub::DeviceTransform`:
```cpp
auto __zip_first = ::cuda::zip_iterator{__first1, __first2};
auto __zip_last  = ::cuda::zip_iterator{__last1, __ret};
(void) __transform_dispatch(
  __policy, __zip_first, ::cuda::std::move(__zip_last), __zip_first, __swap_ranges_transform_fn{});
return __ret;
```
However, for many of the larger input types this was considerably slower than the implementation I ended up using:
```cpp
auto __zip_first = ::cuda::zip_iterator{__first1, __first2};
(void) __transform_dispatch(
  __policy,
  ::cuda::std::move(__first1),
  ::cuda::std::move(__last1),
  ::cuda::std::move(__first2),
  ::cuda::std::move(__zip_first),
  __swap_ranges_transform_fn{});
```
How to Reproduce
Run the `libcudacxx.bench.swap_ranges.basic.base` benchmark, swapping out the implementation for the `zip_iterator` approach.
The current implementation matches thrust's performance; the zipped implementation, however, gets us:
['thrust_swap_ranges.json', 'pstl_swap_ranges.json']
# base
## [1] NVIDIA RTX A6000
| T{ct} | Elements | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
| I8 | 2^16 | 7.494 us | 6.36% | 7.592 us | 7.11% | 0.099 us | 1.32% | SAME |
| I8 | 2^20 | 14.396 us | 9.41% | 14.253 us | 4.62% | -0.143 us | -0.99% | SAME |
| I8 | 2^24 | 114.209 us | 0.95% | 107.109 us | 0.58% | -7.100 us | -6.22% | FAST |
| I8 | 2^28 | 1.671 ms | 1.94% | 1.585 ms | 1.92% | -85.346 us | -5.11% | FAST |
| I16 | 2^16 | 7.787 us | 6.49% | 8.222 us | 5.74% | 0.436 us | 5.59% | SAME |
| I16 | 2^20 | 19.372 us | 7.61% | 21.961 us | 2.92% | 2.589 us | 13.36% | SLOW |
| I16 | 2^24 | 203.446 us | 0.32% | 204.764 us | 0.40% | 1.318 us | 0.65% | SLOW |
| I16 | 2^28 | 3.159 ms | 1.42% | 3.144 ms | 1.21% | -15.003 us | -0.47% | SAME |
| I32 | 2^16 | 8.289 us | 13.94% | 8.824 us | 15.04% | 0.535 us | 6.45% | SAME |
| I32 | 2^20 | 31.751 us | 2.10% | 35.951 us | 2.31% | 4.199 us | 13.22% | SLOW |
| I32 | 2^24 | 400.489 us | 0.25% | 405.749 us | 0.31% | 5.260 us | 1.31% | SLOW |
| I32 | 2^28 | 6.307 ms | 0.91% | 6.315 ms | 1.10% | 8.007 us | 0.13% | SAME |
| I64 | 2^16 | 9.821 us | 5.96% | 11.138 us | 10.67% | 1.317 us | 13.40% | SLOW |
| I64 | 2^20 | 57.347 us | 2.33% | 68.176 us | 2.36% | 10.830 us | 18.88% | SLOW |
| I64 | 2^24 | 794.324 us | 0.20% | 905.335 us | 0.39% | 111.011 us | 13.98% | SLOW |
| I64 | 2^28 | 12.636 ms | 0.51% | 14.738 ms | 1.25% | 2.102 ms | 16.64% | SLOW |
| I128 | 2^16 | 12.316 us | 6.74% | 14.150 us | 4.56% | 1.834 us | 14.89% | SLOW |
| I128 | 2^20 | 111.077 us | 0.67% | 140.295 us | 3.23% | 29.218 us | 26.30% | SLOW |
| I128 | 2^24 | 1.583 ms | 0.17% | 2.025 ms | 0.12% | 442.190 us | 27.93% | SLOW |
| I128 | 2^28 | 25.302 ms | 0.30% | 32.203 ms | 0.27% | 6.901 ms | 27.27% | SLOW |
| F32 | 2^16 | 8.248 us | 5.52% | 8.599 us | 6.49% | 0.351 us | 4.25% | SAME |
| F32 | 2^20 | 32.468 us | 1.83% | 36.484 us | 2.48% | 4.015 us | 12.37% | SLOW |
| F32 | 2^24 | 411.222 us | 3.13% | 416.424 us | 3.29% | 5.202 us | 1.27% | SAME |
| F32 | 2^28 | 6.307 ms | 0.94% | 6.312 ms | 1.12% | 4.881 us | 0.08% | SAME |
| F64 | 2^16 | 9.731 us | 7.42% | 11.290 us | 4.50% | 1.559 us | 16.02% | SLOW |
| F64 | 2^20 | 56.717 us | 1.87% | 69.159 us | 2.54% | 12.442 us | 21.94% | SLOW |
| F64 | 2^24 | 794.085 us | 0.19% | 895.293 us | 0.41% | 101.208 us | 12.75% | SLOW |
| F64 | 2^28 | 12.637 ms | 0.56% | 14.353 ms | 1.30% | 1.716 ms | 13.58% | SLOW |
Expected behavior
Both implementations should be equivalent and lead to the same performance.
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response