The following PR added support for micro-benchmarking shuffle operations:
The benchmarks shipped in that PR cover several cases. All of them shuffle the same amount of data, but the number of output tasks and the partitioning differ. I'd expect all cases to take roughly the same time, since they shuffle the same amount of data, but there is some variation:
```
Benchmarking shuffle/stream/1producer_1consumer_8partitions_1000000rows_1024batch_Nonecompression:
time: [90.704 ms 91.729 ms 94.261 ms]
Benchmarking shuffle/stream/1producer_1consumer_8partitions_1000000rows_1024batch_Some(LZ4_FRAME)compression:
time: [115.04 ms 125.10 ms 139.33 ms]
Benchmarking shuffle/stream/8producer_1consumer_8partitions_1000000rows_1024batch_Nonecompression:
time: [60.487 ms 60.780 ms 61.057 ms]
Benchmarking shuffle/stream/1producer_8consumer_8partitions_1000000rows_1024batch_Nonecompression:
time: [197.23 ms 197.69 ms 198.10 ms]
Benchmarking shuffle/stream/8producer_8consumer_8partitions_1000000rows_1024batch_Nonecompression:
time: [76.776 ms 77.825 ms 80.544 ms]
```
I think there is some low-hanging fruit here performance-wise, plus some other more involved improvements:
- I'm suspicious about arrow-flight cloning data unnecessarily
- RepartitionExec might be doing something funny depending on the number of output tasks
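To illustrate the first suspicion, here is a minimal, hypothetical sketch (plain `std`, no Arrow types) of the cost difference between deep-copying a payload per consumer and sharing it behind a reference count. Arrow buffers are reference-counted in the same spirit, so a shuffle path that serializes or clones the underlying bytes once per consumer pays the deep-copy cost, while a path that shares the buffer pays almost nothing:

```rust
use std::sync::Arc;

fn main() {
    // Hypothetical payload standing in for an Arrow buffer: 8 MiB of bytes.
    let payload: Vec<u8> = vec![0u8; 8 * 1024 * 1024];

    // Deep clone: duplicates all 8 MiB for each consumer that needs the data.
    let deep_copy = payload.clone();

    // Shared handle: Arc::clone only bumps a reference count; the bytes
    // themselves are never copied, no matter how many consumers hold a handle.
    let shared = Arc::new(payload);
    let cheap_handle = Arc::clone(&shared);

    // Both views see the same logical data...
    assert_eq!(deep_copy.len(), cheap_handle.len());
    // ...but the shared version has two owners of one allocation.
    assert_eq!(Arc::strong_count(&shared), 2);
}
```

If the arrow-flight encoding path turns out to clone buffer contents where an `Arc`-style share would do, that would show up most in the multi-consumer cases above.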