Add a UDF-specific JSON report with metadata by rishic3 · Pull Request #2070 · NVIDIA/spark-rapids-tools

rishic3 · 2026-04-03T17:21:09Z

Depends on, and includes the diff from, #2069. Closes #2066.

This adds a new per-app JSON report, udf_report.json, to serve as a centralized place for UDF-related info and metadata that we would want to surface to a user. This serves to raise awareness and motivate the user to convert UDFs to a GPU-accelerable equivalent.

The outputs include: UDF exec/name, SQL ID, Stage ID, so that a user can quickly identify where the UDF is in their app, as well as coarse metrics, to give a rough idea of the impact.

Example output:

{
  "has_udfs": true,
  "udfs": [{
    "name": "IntegerMultiplyBy2UDF",
    "exec": "Project",
    "sql_id": 2,
    "stage_id": 1
  }],
  "metrics": {
    "unsupported_task_duration_ms": 44,
    "app_task_duration_ms": 176,
    "unsupported_task_duration_pct": 25.0
  }
}
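The shape of the example output, and the arithmetic behind `unsupported_task_duration_pct`, can be sketched as below. This is only an illustration: the field names mirror the JSON above, but the class names `UdfEntrySketch` and `UdfMetricsSketch`, the `pct` helper, and the two-decimal rounding are assumptions, not the PR's actual code.

```scala
// Hypothetical model of udf_report.json; field names follow the example output.
final case class UdfEntrySketch(
    name: String,
    exec: String,
    sql_id: Long,
    stage_id: Option[Int]) // None when no stage could be resolved

final case class UdfMetricsSketch(
    unsupported_task_duration_ms: Long,
    app_task_duration_ms: Long,
    unsupported_task_duration_pct: Double)

object UdfReportShape {
  // Share of total task time spent in UDF stages, as a percentage.
  // Rounding to two decimals is an assumption made for this sketch.
  def pct(unsupportedMs: Long, appMs: Long): Double = {
    require(appMs > 0, "app task duration must be positive")
    math.round(unsupportedMs * 100.0 / appMs * 100.0) / 100.0
  }

  def metrics(unsupportedMs: Long, appMs: Long): UdfMetricsSketch =
    UdfMetricsSketch(unsupportedMs, appMs, pct(unsupportedMs, appMs))
}
```

With the numbers from the example, `UdfReportShape.pct(44L, 176L)` evaluates to `25.0`, matching the reported percentage.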

rishic3 · 2026-04-10T17:04:53Z

@greptileai

greptile-apps · 2026-04-10T17:10:43Z

Greptile Summary

This PR introduces a new per-app udf_report.json output for the qualification tool, surfacing detected UDFs with their name, exec, SQL ID, and stage ID alongside coarse timing metrics. The implementation follows existing per-app table patterns (AppQualTable / factory wiring) and is well-tested across Scala, Pandas, Python, and Java/Hive UDF variants.

Confidence Score: 5/5

Safe to merge; the only finding is a one-word documentation typo in the YAML schema.

All previously-raised concerns (sentinel -1, multi-stage undercounting, fragile assertion) have been resolved or accepted. The single remaining finding is a P2 doc-only mismatch ("type" vs "exec") in the YAML description that does not affect runtime behaviour.

core/src/main/resources/configs/reports/qualOutputTable.yaml — minor description field typo.

Important Files Changed

Filename	Overview
core/src/main/resources/configs/reports/qualOutputTable.yaml	Adds udfReportJSON schema entry; minor documentation mismatch: "type" should be "exec" in the udfs column description.
core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala	New object implementing UDF detection and metrics aggregation; logic is clear, stage_id is now Option[Int] (previously-noted sentinel issue addressed), rounding is correct.
core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/QualPerAppReportGenerator.scala	Adds AppQualUdfReportTable class and wires it into the factory; straightforward integration following existing table patterns.
core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationNoSparkSuite.scala	New UDF report tests covering Scala, Pandas, Python, Java/Hive UDFs, and the no-UDF case; helper readUdfReport is clean and the assertions look correct.
core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala	Minor copyright year update; no substantive logic changes visible in the changed region.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[QualPerAppReportGenerator] -->|"label: udfReportJSON"| B[AppQualUdfReportTable]
    B --> C[UdfReportGenerator.generateReport]
    C --> D[collectUdfs]
    C --> E[computeMetrics]
    D --> D1["Filter execInfo where udf=true\nExpand cluster-node children"]
    D1 --> D2["unsupportedExprs.nonEmpty?\n→ one UdfEntry per expr\nelse exec != Project?\n→ UdfEntry for exec\nelse → skip"]
    E --> E1["udfStageIds from udfs.flatMap(_.stage_id)"]
    E1 --> E2["sum stageInfo unsupportedTaskDur\nfor matching stage IDs"]
    E2 --> E3[UdfMetrics: dur / pct]
    B --> F["Serialization.writePretty → udf_report.json"]

Reviews (3): Last reviewed commit: "type stage id as option"

core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala

core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationNoSparkSuite.scala

Signed-off-by: Rishi Chandra <rishic@nvidia.com>

parthosa · 2026-04-13T16:56:18Z

core/src/main/resources/configs/reports/qualOutputTable.yaml

+        dataType: Boolean
+        description: >-
+          Whether any UDFs were detected in the application
+      - name: udfs


Do we need both boolean and list data types? If the udfs list is empty, is it safe to assume that has_udfs is false?

True, has_udfs is redundant. Removed.

parthosa · 2026-04-13T16:59:21Z

core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala

+    val appTaskDuration = stageInfo.map(_.stageTaskTime).sum
+    if (appTaskDuration == 0) return None
+
+    val udfStageIds = udfs.flatMap(_.stage_id).toSet


If udfStageIds is empty, we would return Some(UdfMetrics(0, appTaskDuration, 0.0)). Should we consider returning None in this case?

Good catch. Updated.
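A minimal sketch of the guarded behaviour discussed in this thread follows. `StageRow` and `MetricsRow` are hypothetical stand-ins for the tool's real stage and metrics types, and the rounding is an assumption; only the `stageTaskTime` and `unsupportedTaskDur` names and the early `None` returns come from the diff and the review.

```scala
// Hypothetical stand-ins for the tool's stage summary and metrics types.
final case class StageRow(stageId: Int, stageTaskTime: Long, unsupportedTaskDur: Long)
final case class MetricsRow(unsupportedMs: Long, appMs: Long, pct: Double)

object ComputeMetricsSketch {
  def computeMetrics(
      udfStageIds: Set[Int],
      stageInfo: Seq[StageRow]): Option[MetricsRow] = {
    val appTaskDuration = stageInfo.map(_.stageTaskTime).sum
    // No task time, or no UDF resolved to a stage: there is nothing meaningful
    // to report, so return None rather than a zero-valued metrics object.
    if (appTaskDuration == 0 || udfStageIds.isEmpty) {
      None
    } else {
      // Sum unsupported task time only over stages that contain a UDF.
      val unsupported = stageInfo
        .filter(s => udfStageIds.contains(s.stageId))
        .map(_.unsupportedTaskDur)
        .sum
      val pct = math.round(unsupported * 100.0 / appTaskDuration * 100.0) / 100.0
      Some(MetricsRow(unsupported, appTaskDuration, pct))
    }
  }
}
```

Returning `None` keeps the "no data" case distinguishable from a genuine 0% measurement.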

parthosa · 2026-04-13T17:01:02Z

core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala

+          } else if (e.exec != "Project") {
+            // Actual UDF executor (e.g., ArrowEvalPython, BatchEvalPython).
+            // Skip Project nodes with no unsupported expressions since they
+            // are just containers for child UDF execs running in Python.


Do we expect this path for Scala/Java UDF?

Nope, Scala/Java UDFs should always be expressions inside a Project exec.
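The branching discussed here can be sketched roughly as follows. `ExecNode` and `UdfRow` are hypothetical simplifications of the tool's exec info and report entry types, and using the exec name as the entry name for exec-level UDFs is an assumption; the three branches follow the diff comments and the review flowchart.

```scala
// Hypothetical, simplified view of an exec node in the qualification output.
final case class ExecNode(
    exec: String,
    udf: Boolean,
    sqlID: Long,
    stages: Set[Int],
    unsupportedExprs: Seq[String])

final case class UdfRow(name: String, exec: String, sqlId: Long, stageId: Option[Int])

object CollectUdfsSketch {
  def collectUdfs(execs: Seq[ExecNode]): Seq[UdfRow] =
    execs.filter(_.udf).flatMap { e =>
      val stageId = e.stages.headOption
      if (e.unsupportedExprs.nonEmpty) {
        // Container exec (e.g. Project) with named UDF expressions:
        // one entry per expression. Scala/Java UDFs take this path.
        e.unsupportedExprs.map(expr => UdfRow(expr, e.exec, e.sqlID, stageId))
      } else if (e.exec != "Project") {
        // Exec that is itself the UDF runner (e.g. ArrowEvalPython).
        // Assumption: the exec name doubles as the entry name.
        Seq(UdfRow(e.exec, e.exec, e.sqlID, stageId))
      } else {
        // Project with no unsupported expressions: just a container, skip it.
        Seq.empty
      }
    }
}
```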

parthosa · 2026-04-13T17:12:23Z

core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala

+        }
+
+        execs.filter(_.udf).flatMap { e =>
+          val stageId = e.stages.headOption


Does this assume that all UDF execs map to a single stage?

Related to #2070 (comment); for now we're only handling scalar UDFs so yes. I added a comment to that effect.

sayedbilalbari

Thanks @rishic3, I had a few questions.

sayedbilalbari · 2026-04-13T17:56:52Z

core/src/main/resources/configs/reports/qualOutputTable.yaml

        dataType: Long
        description: >-
          Calculated as (submissionTime - completionTime) of the given stage
+  - label: udfReportJSON


@rishic3 We will need to add tools-api support for this new report, mostly by adding the definition in qualCoreReport.yaml.
qualOutputTable.yaml is consumed by the Scala side but not by the Python side. It is an old file that we hope to deprecate eventually.

sayedbilalbari · 2026-04-13T18:20:47Z

core/src/main/resources/configs/reports/qualOutputTable.yaml

+    description: >-
+      UDF detection report containing detected UDFs and metadata
+    fileName: udf_report.json
+    fileFormat: JSON


qq: Any specific reason to keep this as a JSON file?
PS: with the API support, reading the output is equally simple for CSV and JSON (a CSV is read as a DataFrame, while a JSON file remains JSON).

sayedbilalbari · 2026-04-13T18:50:06Z

core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala

+        }
+
+        execs.filter(_.udf).flatMap { e =>
+          val stageId = e.stages.headOption


+1 on the question
From AI:
Qualification has some fallback stage inference logic for execs whose stage list is empty. That means a UDF exec can end up with stage_id: null, an arbitrary stage from an unordered_set
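To make the concern concrete, the sketch below contrasts `headOption` on a set of stage IDs with a deterministic pick. It assumes the exec's stages are held in a `Set[Int]`, which the diff does not confirm, and `minStage` is a hypothetical alternative for illustration rather than something proposed in the PR.

```scala
object StageIdSketch {
  // Mirrors the diff: take whatever element the set yields first.
  // An empty set gives None (serialized as a null stage_id); for a set with
  // several stages, which element comes first is not something to rely on.
  def firstStage(stages: Set[Int]): Option[Int] = stages.headOption

  // Hypothetical deterministic alternative: always pick the smallest stage ID.
  def minStage(stages: Set[Int]): Option[Int] =
    if (stages.isEmpty) None else Some(stages.min)
}
```

For single-stage execs, which is the scalar-UDF case the author says is handled today, both functions return the same value.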

sayedbilalbari · 2026-04-13T18:51:35Z

core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala

+    sql_id: Long,
+    stage_id: Option[Int])
+
+  case class UdfMetrics(


qq: Are there other UDF metrics that the tools do not currently extract correctly and that would be helpful to add in future iterations?

sayedbilalbari · 2026-04-13T18:54:37Z

core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala

+          val sqlId = e.sqlID
+
+          if (e.unsupportedExprs.nonEmpty) {
+            // Container exec (e.g., Project) with named UDF expressions.


qq: Can a Project not have a non-UDF unsupported expression?

sayedbilalbari · 2026-04-13T21:30:03Z

.../main/scala/com/nvidia/spark/rapids/tool/views/qualification/QualPerAppReportGenerator.scala

 }

+// UDF detection report (JSON). Reports all detected UDFs with metadata.
+class AppQualUdfReportTable(


qq: Currently this file is rendered in qual_metrics, where the unsupported ops file already lives. If I am correct, udf_report is essentially a dedicated extraction from unsupported_ops.
The code is structured so that all per-app file-writing logic lives inside QualPerAppReportGenerator (the same holds for the UDF report), which makes UdfReportGenerator more of a builder for this particular report. Should we rename it to something like UdfReportBuilder or UdfReportViewBuilder?

github-actions bot added the core_tools Scope the core module (scala) label Apr 3, 2026

parthosa assigned rishic3 Apr 3, 2026

parthosa requested review from amahussein, parthosa and sayedbilalbari April 3, 2026 17:53

rishic3 marked this pull request as ready for review April 10, 2026 17:02

greptile-apps bot reviewed Apr 10, 2026

rishic3 added 5 commits April 10, 2026 13:51

add udf reporting

94b36ad

Signed-off-by: Rishi Chandra <rishic@nvidia.com>

license header

971f52f

one more license header

2cc1ba4

scalastyle; scala.io.source -> UTF8Source

eeff42a

type stage id as option

ca9282c

rishic3 force-pushed the udf-reporting branch from 40555aa to ca9282c Compare April 10, 2026 20:51

parthosa reviewed Apr 13, 2026

sayedbilalbari requested changes Apr 13, 2026
