[FEA] Enhance UDF-specific outputs #2066
Description
Motivation
It would be nice if spark-rapids-tools could produce structured outputs about all detected UDFs, such as the name, type, SQL ID, stage task duration, etc., so that Aether can directly surface this information in its output without having to do any custom parsing.
Currently, while there are some flags (e.g., "Contains UDF", "Is UDF") on execs, different UDF types are flagged differently, and comprehensive detection requires digging into `unsupported_ops.csv`, `execs.csv`, etc. (see the table below). We would like spark-rapids-tools, as the source of truth, to produce a dedicated output for UDF-related info so that no special handling is needed.
Current UDF coverage (limiting to scalar UDFs for now):
| UDF Type | Detected by spark-rapids-tools? | Notes |
|---|---|---|
| Java/Hive UDFs | Yes | Expression marked "Is UDF" with UDF name |
| Python UDFs | Yes | BatchEvalPython exec marked "Contains UDF" |
| Pandas UDFs | Indirectly | Parent Project marked "Contains UDF"; the actual UDF exec (e.g., ArrowEvalPython) is visible in execs.csv but marked supported |
| Scala UDFs (DF API) | Yes | Plan retains "UDF" token |
| Scala UDFs (Spark SQL) | No | SPARK-34388 replaces "UDF" with registered name; see spark-rapids-tools#1271 |
Proposal
Produce a per-app `udf_report.json` in `qual_core_output/qual_metrics/<app-id>/`, containing:
- a list of UDF entries, one per UDF detected in the app, each with:
  - name of the UDF
  - type of the UDF
  - SQL ID / stage ID
  - metrics for the UDF (could be aggregated per app), e.g., unsupported task duration
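
For illustration, the report could look something like the sketch below. All field names and values here are hypothetical, not a final schema:

```json
{
  "appId": "app-20240101123456-0001",
  "udfs": [
    {
      "name": "my_hive_udf",
      "udfType": "Hive",
      "sqlId": 3,
      "stageIds": [5, 6],
      "unsupportedTaskDurationMs": 12345
    },
    {
      "name": "BatchEvalPython",
      "udfType": "Python",
      "sqlId": 7,
      "stageIds": [9],
      "unsupportedTaskDurationMs": 6789
    }
  ]
}
```

A flat per-UDF list like this would let Aether consume the file directly, while still leaving room to add app-level aggregates later.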