[BUG] Python UDF exec deduplication for duration estimates #2068

@rishic3

Description

There is currently double-counting in stage duration estimates for Python UDFs (BatchEvalPython), and there will also be double-counting for Pandas/Arrow UDF execs, pending a fix to #2066.

When a Python/Pandas UDF is present, both the parent Project and the child ArrowEvalPython are unsupported execs in the same stage. The equal-weight duration estimate in stagesSummary() counts both toward unsupportedTaskDur, but the Project's WholeStageCodegen (WSCG) wall-clock duration already includes the time spent blocked waiting for the Python worker. Two points in the Spark code support this: the WSCG duration accumulator is updated after the iterator finally returns false, and the WSCG execution loop waits for Python worker batches to come back.
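To make the double-counting concrete, here is a minimal sketch of an equal-weight estimate. It is not the actual stagesSummary() code: the function name `unsupported_task_dur`, the tuple layout, and the assumption that every exec in a stage receives an equal share of the stage's task duration are all hypothetical, chosen only to illustrate the effect.

```python
# Hypothetical model, not the tool's real implementation:
# each exec in a stage is assumed to get an equal share of the stage task duration.
def unsupported_task_dur(stage_task_dur_ms, execs):
    """execs is a list of (exec_name, is_supported) tuples for one stage."""
    if not execs:
        return 0.0
    share = stage_task_dur_ms / len(execs)
    unsupported_count = sum(1 for _, supported in execs if not supported)
    return share * unsupported_count


# Illustrative stage: the Project is unsupported because it contains the UDF,
# and the UDF exec itself is also unsupported.
stage_execs = [
    ("Scan parquet", True),
    ("Filter", True),
    ("Project", False),
    ("ArrowEvalPython", False),
]

double_counted = unsupported_task_dur(1000, stage_execs)

# Same stage if only the Project were counted as unsupported.
project_only = unsupported_task_dur(
    1000,
    [(name, supported or name == "ArrowEvalPython") for name, supported in stage_execs],
)
```

Under these assumptions the UDF exec adds a second unsupported share on top of the Project's, even though the Project's wall-clock time already covers the wait on the Python worker.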

A fix could add Python/Pandas UDF execs to execsToBeRemoved so they are excluded from the duration estimate since the parent Project/WSCG already captures their execution time.
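A rough sketch of the proposed exclusion follows. The set `EXECS_TO_BE_REMOVED` is a hypothetical stand-in for execsToBeRemoved, the exec names placed in it are examples rather than a confirmed list, and the equal-weight model (including dropping removed execs from the denominator) is an assumption, not the tool's actual logic.

```python
# Hypothetical stand-in for execsToBeRemoved; the real contents may differ.
EXECS_TO_BE_REMOVED = frozenset({"BatchEvalPython", "ArrowEvalPython"})


def unsupported_task_dur_excluding(stage_task_dur_ms, execs, excluded=EXECS_TO_BE_REMOVED):
    """Equal-weight estimate that ignores excluded execs entirely.

    execs is a list of (exec_name, is_supported) tuples for one stage.
    Assumption: excluded execs count toward neither the numerator nor the denominator.
    """
    counted = [(name, supported) for name, supported in execs if name not in excluded]
    if not counted:
        return 0.0
    share = stage_task_dur_ms / len(counted)
    return share * sum(1 for _, supported in counted if not supported)


execs_in_stage = [
    ("Scan parquet", True),
    ("Filter", True),
    ("Project", False),          # parent of the UDF, unsupported
    ("ArrowEvalPython", False),  # UDF exec, excluded from the estimate
]

with_exclusion = unsupported_task_dur_excluding(900, execs_in_stage)
without_exclusion = unsupported_task_dur_excluding(900, execs_in_stage, excluded=frozenset())
```

With the UDF exec excluded, only the Project contributes to the unsupported duration, which matches the intent that the parent Project/WSCG already accounts for the UDF's execution time.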
