s4p44: all-reduce implementation

In the slides, you mention that an all-reduce is decomposed into a reduce-scatter followed by an all-gather, so it should cost roughly twice as much as either of those ops. However, in XLA on TPU, reduce-scatter is often implemented as an all-reduce followed by a dynamic-slice. That suggests the opposite: all-reduce would have to be much faster than reduce-scatter for this lowering to make sense. Can you explain the discrepancy?
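
To make the two decompositions I'm comparing concrete, here is a minimal pure-Python sketch. It is a toy model I made up for this question: each "device" is just a list of numbers, and the function names are mine, not XLA's or JAX's APIs. It only checks that both decompositions produce the same values; it says nothing about cost or about how XLA actually lowers these ops.

```python
# Toy model of collectives: a collective maps a list of per-device inputs
# to a list of per-device outputs. Semantics only; no cost model.
# All names here are hypothetical helpers, not real XLA/JAX functions.


def all_reduce(shards):
    """Every device ends up with the elementwise sum of all inputs."""
    total = [sum(vals) for vals in zip(*shards)]
    return [list(total) for _ in shards]


def reduce_scatter(shards):
    """Device i ends up with the i-th chunk of the elementwise sum."""
    n = len(shards)
    chunk = len(shards[0]) // n  # assumes the length is divisible by n
    total = [sum(vals) for vals in zip(*shards)]
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]


def all_gather(chunks):
    """Every device ends up with the concatenation of all chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]


def dynamic_slice(vec, start, size):
    """Take `size` elements of `vec` beginning at index `start`."""
    return vec[start:start + size]


def all_reduce_via_rs_ag(shards):
    """Decomposition from the slides: reduce-scatter, then all-gather."""
    return all_gather(reduce_scatter(shards))


def reduce_scatter_via_ar_slice(shards):
    """Pattern I see in XLA on TPU: all-reduce, then a per-device slice."""
    n = len(shards)
    chunk = len(shards[0]) // n
    reduced = all_reduce(shards)
    return [dynamic_slice(reduced[i], i * chunk, chunk) for i in range(n)]


# Two devices, four elements each.
inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]

assert all_reduce_via_rs_ag(inputs) == all_reduce(inputs)
assert reduce_scatter_via_ar_slice(inputs) == reduce_scatter(inputs)
```

Both forms agree on the result, so my question is purely about which one is cheaper on the hardware and why the compiler picks the second.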