There is currently double-counting in stage duration estimates for Python UDFs (BatchEvalPython), and there will also be double-counting for Pandas/Arrow UDF execs, pending a fix to #2066.
When a Python/Pandas UDF is present, both the parent Project and the child Python eval exec (e.g. ArrowEvalPython) are unsupported execs in the same stage. The equal-weight duration estimate in stagesSummary() counts both toward unsupportedTaskDur, but the Project's WholeStageCodegen wall-clock duration already includes the time spent blocked waiting for the Python worker. Two points in the Spark code support this: the WSCG duration accumulator is updated only after the iterator finally returns false, and the WSCG execution loop waits for Python worker batches to come back.
A fix could add Python/Pandas UDF execs to execsToBeRemoved so that they are excluded from the duration estimate, since the parent Project/WSCG already captures their execution time.
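
To illustrate the effect, here is a toy model in Python, not the tool's actual code. It assumes the unsupported duration is a plain sum over unsupported execs in a stage. `ExecInfo`, `unsupported_task_dur`, and the millisecond values are hypothetical names and numbers made up for this sketch. Only `execsToBeRemoved` (modeled here as `execs_to_be_removed`) and the exec names come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecInfo:
    """Hypothetical stand-in for one exec node in a stage."""
    name: str
    duration_ms: int
    supported: bool


def unsupported_task_dur(execs, execs_to_be_removed=frozenset()):
    """Sum durations of unsupported execs, skipping any removed exec names.

    Assumption: a simplified model of the estimate, not the real stagesSummary().
    """
    return sum(
        e.duration_ms
        for e in execs
        if not e.supported and e.name not in execs_to_be_removed
    )


# Made-up stage: the Project's wall-clock time already contains the
# 900 ms it spent blocked on the Python worker.
stage = [
    ExecInfo("Project", 1000, supported=False),
    ExecInfo("ArrowEvalPython", 900, supported=False),
    ExecInfo("Filter", 100, supported=True),
]

# Current behavior in this model: the Python worker time is counted twice.
before = unsupported_task_dur(stage)

# Proposed behavior: the Python UDF execs are excluded from the estimate.
python_udf_execs = frozenset({"BatchEvalPython", "ArrowEvalPython"})
after = unsupported_task_dur(stage, python_udf_execs)

print(before, after)
```

In this model the estimate drops from 1900 ms to 1000 ms, which matches the Project's own wall-clock time.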