Add a UDF-specific JSON report with metadata #2070

Open
rishic3 wants to merge 5 commits into NVIDIA:dev from rishic3:udf-reporting

rishic3 (Contributor) commented Apr 3, 2026

Depends on, and includes the diff from, #2069. Closes #2066.

This adds a new per-app JSON report, udf_report.json, as a central place for the UDF-related information and metadata we want to surface to users. The goal is to raise awareness of UDF usage and motivate users to convert UDFs to GPU-accelerated equivalents.

The output includes the UDF exec and name, SQL ID, and stage ID, so that a user can quickly locate each UDF in their app, plus coarse task-duration metrics that give a rough idea of the UDFs' impact.

Example output:

{
  "has_udfs": true,
  "udfs": [{
    "name": "IntegerMultiplyBy2UDF",
    "exec": "Project",
    "sql_id": 2,
    "stage_id": 1
  }],
  "metrics": {
    "unsupported_task_duration_ms": 44,
    "app_task_duration_ms": 176,
    "unsupported_task_duration_pct": 25.0
  }
}
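
For readers who want to see how the example values relate, here is a minimal Scala sketch of the report shape. The names `UdfEntry` and `UdfMetrics` and the `sql_id` / `stage_id` types appear in the diff context further down; the remaining field names are taken from the JSON above, while the wrapper object, the `pct` helper, and the two-decimal rounding are assumptions made for illustration, not the PR's actual code.

```scala
object UdfReportShape {
  // Field names mirror the example JSON. stage_id is optional because an
  // exec may have no stage attached (see the review threads below).
  case class UdfEntry(name: String, exec: String, sql_id: Long, stage_id: Option[Int])

  case class UdfMetrics(
      unsupported_task_duration_ms: Long,
      app_task_duration_ms: Long,
      unsupported_task_duration_pct: Double)

  // Hypothetical helper: share of app task time spent in UDF stages,
  // rounded to two decimals. Returns 0.0 when the app has no task time.
  def pct(unsupportedMs: Long, appMs: Long): Double =
    if (appMs == 0L) 0.0
    else math.round(unsupportedMs * 10000.0 / appMs) / 100.0
}
```

With the example numbers, `UdfReportShape.pct(44L, 176L)` yields `25.0`, which matches `unsupported_task_duration_pct` in the sample output.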

@github-actions github-actions bot added the core_tools Scope the core module (scala) label Apr 3, 2026
@rishic3 rishic3 marked this pull request as ready for review April 10, 2026 17:02
rishic3 (Contributor, Author) commented Apr 10, 2026

@greptileai

greptile-apps bot commented Apr 10, 2026

Greptile Summary

This PR introduces a new per-app udf_report.json output for the qualification tool, surfacing detected UDFs with their name, exec, SQL ID, and stage ID alongside coarse timing metrics. The implementation follows existing per-app table patterns (AppQualTable / factory wiring) and is well-tested across Scala, Pandas, Python, and Java/Hive UDF variants.

Confidence Score: 5/5

Safe to merge; the only finding is a one-word documentation typo in the YAML schema.

All previously-raised concerns (sentinel -1, multi-stage undercounting, fragile assertion) have been resolved or accepted. The single remaining finding is a P2 doc-only mismatch ("type" vs "exec") in the YAML description that does not affect runtime behaviour.

core/src/main/resources/configs/reports/qualOutputTable.yaml — minor description field typo.

Important Files Changed

| Filename | Overview |
|---|---|
| core/src/main/resources/configs/reports/qualOutputTable.yaml | Adds udfReportJSON schema entry; minor documentation mismatch: "type" should be "exec" in the udfs column description. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala | New object implementing UDF detection and metrics aggregation; logic is clear, stage_id is now Option[Int] (previously-noted sentinel issue addressed), rounding is correct. |
| core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/QualPerAppReportGenerator.scala | Adds AppQualUdfReportTable class and wires it into the factory; straightforward integration following existing table patterns. |
| core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationNoSparkSuite.scala | New UDF report tests covering Scala, Pandas, Python, Java/Hive UDFs, and the no-UDF case; helper readUdfReport is clean and the assertions look correct. |
| core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala | Minor copyright year update; no substantive logic changes visible in the changed region. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[QualPerAppReportGenerator] -->|"label: udfReportJSON"| B[AppQualUdfReportTable]
    B --> C[UdfReportGenerator.generateReport]
    C --> D[collectUdfs]
    C --> E[computeMetrics]
    D --> D1["Filter execInfo where udf=true\nExpand cluster-node children"]
    D1 --> D2["unsupportedExprs.nonEmpty?\n→ one UdfEntry per expr\nelse exec != Project?\n→ UdfEntry for exec\nelse → skip"]
    E --> E1["udfStageIds from udfs.flatMap(_.stage_id)"]
    E1 --> E2["sum stageInfo unsupportedTaskDur\nfor matching stage IDs"]
    E2 --> E3[UdfMetrics: dur / pct]
    B --> F["Serialization.writePretty → udf_report.json"]

Reviews (3): Last reviewed commit: "type stage id as option"

  dataType: Boolean
  description: >-
    Whether any UDFs were detected in the application
- name: udfs
Collaborator

Do we need both boolean and list data types? If the udfs list is empty, is it safe to assume that has_udfs is false?

Contributor Author

True, has_udfs is redundant. Removed.

val appTaskDuration = stageInfo.map(_.stageTaskTime).sum
if (appTaskDuration == 0) return None

val udfStageIds = udfs.flatMap(_.stage_id).toSet
Collaborator

If udfStageIds is empty, we would return Some(UdfMetrics(0, appTaskDuration, 0.0)). Should we return None in this case instead?

Contributor Author

Good catch. Updated.
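
To make the agreed behaviour concrete, the following is a rough, self-contained sketch of a metrics function that returns `None` both when the app has no task time and when no UDF stage IDs were found. `StageInfo`, its field names, and the two-decimal rounding are simplified stand-ins chosen for illustration; the real types in the tools codebase will differ.

```scala
object UdfMetricsSketch {
  // Simplified stand-in for the tool's per-stage summary (hypothetical fields).
  case class StageInfo(stageId: Int, stageTaskTime: Long, unsupportedTaskDur: Long)

  case class UdfMetrics(unsupportedMs: Long, appMs: Long, pct: Double)

  def computeMetrics(
      udfStageIds: Set[Int],
      stageInfo: Seq[StageInfo]): Option[UdfMetrics] = {
    val appTaskDuration = stageInfo.map(_.stageTaskTime).sum
    // Both guards: nothing to divide by, or no stage attributed to a UDF.
    if (appTaskDuration == 0L || udfStageIds.isEmpty) {
      None
    } else {
      // Sum unsupported task time only over stages that contain a UDF.
      val unsupported = stageInfo
        .filter(s => udfStageIds.contains(s.stageId))
        .map(_.unsupportedTaskDur)
        .sum
      val pct = math.round(unsupported * 10000.0 / appTaskDuration) / 100.0
      Some(UdfMetrics(unsupported, appTaskDuration, pct))
    }
  }
}
```

Returning `None` lets the serializer omit the metrics object entirely, rather than emitting a 0% figure that could be misread as "UDFs present but free".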

} else if (e.exec != "Project") {
  // Actual UDF executor (e.g., ArrowEvalPython, BatchEvalPython).
  // Skip Project nodes with no unsupported expressions since they
  // are just containers for child UDF execs running in Python.
Collaborator

Do we expect this path for Scala/Java UDF?

Contributor Author

Nope, Scala/Java UDFs should always appear as expressions inside a Project exec.
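
The three-way branch discussed in this thread (and drawn in the flowchart in the bot summary) can be sketched as below. This is an illustration under assumptions, not the PR's code: `ExecInfo` is reduced to a handful of fields, `unsupportedExprs` is modelled as plain strings, and using the exec name as the entry name for Python-style execs is a guess.

```scala
object UdfCollectSketch {
  // Reduced, hypothetical version of the tool's exec record.
  case class ExecInfo(
      exec: String,
      sqlID: Long,
      stages: Set[Int],
      udf: Boolean,
      unsupportedExprs: Seq[String])

  case class UdfEntry(name: String, exec: String, sql_id: Long, stage_id: Option[Int])

  def collectUdfs(execs: Seq[ExecInfo]): Seq[UdfEntry] =
    execs.filter(_.udf).flatMap { e =>
      val stageId = e.stages.headOption
      if (e.unsupportedExprs.nonEmpty) {
        // Scala/Java path: named UDF expressions inside a container exec.
        e.unsupportedExprs.map(expr => UdfEntry(expr, e.exec, e.sqlID, stageId))
      } else if (e.exec != "Project") {
        // Python path: the exec itself is the UDF executor.
        Seq(UdfEntry(e.exec, e.exec, e.sqlID, stageId))
      } else {
        // Bare Project wrapping a child UDF exec: skip to avoid double counting.
        Seq.empty
      }
    }
}
```

In this sketch a Scala UDF surfaces once per expression under its `Project`, while a Pandas UDF surfaces once under `ArrowEvalPython` and its wrapping `Project` is dropped.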

}

execs.filter(_.udf).flatMap { e =>
  val stageId = e.stages.headOption
Collaborator

Does this assume that all UDF execs map to a single stage?

Contributor Author

Related to #2070 (comment); for now we're only handling scalar UDFs, so yes. I added a comment to that effect.

Collaborator

+1 on the question
From AI:
Qualification has some fallback stage inference logic for execs whose stage list is empty. That means a UDF exec can end up with stage_id: null, an arbitrary stage from an unordered_set
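
One possible way to address the ordering concern, sketched here purely as an illustration and not as what the PR does (the PR uses `headOption`), is to pick the stage deterministically or to report every stage. The object and function names are hypothetical, and the sketch assumes the stage list is a `Set[Int]`.

```scala
object StageSelectSketch {
  // Smallest stage ID, so the result does not depend on Set iteration order.
  // None when the exec has no stages attached.
  def primaryStage(stages: Set[Int]): Option[Int] =
    if (stages.isEmpty) None else Some(stages.min)

  // Alternative: keep all stage IDs in ascending order, which would also
  // cover execs that span more than one stage.
  def allStages(stages: Set[Int]): Seq[Int] = stages.toSeq.sorted
}
```

Either variant makes the reported stage stable across runs; which one fits depends on whether the report schema keeps a single `stage_id` or moves to a list.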

sayedbilalbari (Collaborator) left a comment

Thanks @rishic3, I had a few questions.

dataType: Long
description: >-
Calculated as (submissionTime - completionTime) of the given stage
- label: udfReportJSON
Collaborator

@rishic3 We will need to add tools-api support for this new report, mostly by adding the definition to qualCoreReport.yaml.
qualOutputTable.yaml is consumed by the Scala side but not by the Python side; it is an old file that we hope to deprecate eventually.

description: >-
  UDF detection report containing detected UDFs and metadata
fileName: udf_report.json
fileFormat: JSON
Collaborator

qq: Any specific reason to keep this as a JSON file?
PS: with the API support, reading the output is equally simple for CSV and JSON (a CSV is read as a DataFrame, while a JSON file stays as JSON).
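
As a point of comparison for the format question, here is a small sketch of what a CSV rendering of the `udfs` list could look like. It is hypothetical: the object name, column order, and empty-string encoding of a missing stage are assumptions, and it does no quoting, so it assumes names contain no commas or newlines.

```scala
object UdfCsvSketch {
  case class UdfEntry(name: String, exec: String, sql_id: Long, stage_id: Option[Int])

  val header: String = "name,exec,sql_id,stage_id"

  // One row per UDF; a missing stage becomes an empty cell.
  def toCsv(udfs: Seq[UdfEntry]): String = {
    val rows = udfs.map { u =>
      Seq(u.name, u.exec, u.sql_id.toString, u.stage_id.map(_.toString).getOrElse(""))
        .mkString(",")
    }
    (header +: rows).mkString("\n")
  }
}
```

The flat form covers the per-UDF rows, but the app-level `metrics` object from the PR description would need a second file or repeated columns, which is one trade-off to weigh against JSON.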

sql_id: Long,
stage_id: Option[Int])

case class UdfMetrics(
Collaborator

qq: Are there any other UDF metrics that tools may not be extracting correctly today and that would be helpful to add in future iterations?

val sqlId = e.sqlID

if (e.unsupportedExprs.nonEmpty) {
  // Container exec (e.g., Project) with named UDF expressions.
Collaborator

qq: Can't a Project also have a non-UDF unsupported expression?

}

// UDF detection report (JSON). Reports all detected UDFs with metadata.
class AppQualUdfReportTable(
Collaborator

qq: Currently this file is rendered in qual_metrics, where the unsupported ops file already lives. If I am correct, udf_report is essentially a dedicated extraction from unsupported_ops.
The code is structured so that all per-app file-writing logic lives inside QualPerAppReportGenerator (the same holds for the UDF report). UdfReportGenerator is more of a builder for this particular report. Should we rename it to something like UdfReportBuilder or UdfReportViewBuilder?

Labels

core_tools Scope the core module (scala)

Development

Successfully merging this pull request may close these issues.

[FEA] Enhance UDF-specific outputs

4 participants