[FEA] Enhance UDF-specific outputs #2066

@rishic3

Description

Motivation

It would be nice if spark-rapids-tools could produce structured output about all detected UDFs, such as name, type, SQL ID, and stage task duration, so that Aether can surface this information directly without any custom parsing.

Currently, while there are some flags on execs (e.g., "Contains UDF", "Is UDF"), different UDF types are flagged differently, and comprehensive detection requires digging into `unsupported_ops.csv`, `execs.csv`, etc. (see below). We would like spark-rapids-tools, as the source of truth, to produce a dedicated output for UDF-related info so that no special handling is needed.

Current UDF coverage (limiting to scalar UDFs for now):

| UDF Type | Detected by spark-rapids-tools? | Notes |
| --- | --- | --- |
| Java/Hive UDFs | Yes | Expression marked "Is UDF" with UDF name |
| Python UDFs | Yes | `BatchEvalPython` exec marked "Contains UDF" |
| Pandas UDFs | Indirectly | Parent `Project` marked "Contains UDF"; actual UDF exec (e.g., `ArrowEvalPython`) visible in `execs.csv` but marked supported |
| Scala UDFs (DF API) | Yes | Plan retains "UDF" token |
| Scala UDFs (Spark SQL) | No | SPARK-34388 replaces "UDF" with the registered name; see spark-rapids-tools#1271 |

Proposal

Produce a per-app `udf_report.json` in `qual_core_output/qual_metrics/<app-id>/` with:

- a list of UDF entries, one per UDF in the app, each containing:
  - the name of the UDF
  - the type of the UDF
  - the SQL ID / stage ID it appears in
- metrics for the UDFs (possibly aggregated per app), e.g., unsupported task duration
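As a rough sketch of what such a report might contain, the snippet below builds the structure described by the bullets above. All field names and values are hypothetical placeholders, not a settled schema:

```python
import json

# Hypothetical shape for the proposed per-app udf_report.json; field
# names are a sketch of the bullet points above, not a settled schema.
udf_report = {
    "appId": "app-20240101000000-0001",  # illustrative app ID
    "udfs": [
        {
            "udfName": "my_udf",         # name of the UDF
            "udfType": "Python UDF",     # type of UDF
            "sqlId": 3,                  # SQL ID it appears in
            "stageIds": [5, 6],          # stage IDs executing it
        }
    ],
    # Per-app aggregated metrics, e.g. unsupported task duration (ms)
    "metrics": {"unsupportedTaskDurationMs": 12345},
}

print(json.dumps(udf_report, indent=2))
```

With a structured file like this, consumers such as Aether could read UDF info directly instead of reconstructing it from multiple CSVs.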
