Add a UDF-specific JSON report with metadata#2070
Add a UDF-specific JSON report with metadata#2070rishic3 wants to merge 5 commits intoNVIDIA:devfrom
Conversation
Greptile SummaryThis PR introduces a new per-app Confidence Score: 5/5Safe to merge; the only finding is a one-word documentation typo in the YAML schema. All previously-raised concerns (sentinel -1, multi-stage undercounting, fragile assertion) have been resolved or accepted. The single remaining finding is a P2 doc-only mismatch ("type" vs "exec") in the YAML description that does not affect runtime behaviour. core/src/main/resources/configs/reports/qualOutputTable.yaml — minor description field typo. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[QualPerAppReportGenerator] -->|"label: udfReportJSON"| B[AppQualUdfReportTable]
B --> C[UdfReportGenerator.generateReport]
C --> D[collectUdfs]
C --> E[computeMetrics]
D --> D1["Filter execInfo where udf=true\nExpand cluster-node children"]
D1 --> D2["unsupportedExprs.nonEmpty?\n→ one UdfEntry per expr\nelse exec != Project?\n→ UdfEntry for exec\nelse → skip"]
E --> E1["udfStageIds from udfs.flatMap(_.stage_id)"]
E1 --> E2["sum stageInfo unsupportedTaskDur\nfor matching stage IDs"]
E2 --> E3[UdfMetrics: dur / pct]
B --> F["Serialization.writePretty → udf_report.json"]
Reviews (3): Last reviewed commit: "type stage id as option" | Re-trigger Greptile |
core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/views/qualification/UdfReportGenerator.scala
Outdated
Show resolved
Hide resolved
core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationNoSparkSuite.scala
Show resolved
Hide resolved
|
Tip: Greploop — Automatically fix all review issues by running Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal. |
Signed-off-by: Rishi Chandra <rishic@nvidia.com>
| dataType: Boolean | ||
| description: >- | ||
| Whether any UDFs were detected in the application | ||
| - name: udfs |
There was a problem hiding this comment.
Do we need both boolean and list data types? If the udfs list is empty, is it safe to assume that has_udfs is false?
There was a problem hiding this comment.
True has_udfs is redundant. Removed.
| val appTaskDuration = stageInfo.map(_.stageTaskTime).sum | ||
| if (appTaskDuration == 0) return None | ||
|
|
||
| val udfStageIds = udfs.flatMap(_.stage_id).toSet |
There was a problem hiding this comment.
If udfStageIds is empty, we might return the following Some(UdfMetrics(0, appTaskDuration, 0.0)). Should we consider returning None in this case?
There was a problem hiding this comment.
Good catch. Updated.
| } else if (e.exec != "Project") { | ||
| // Actual UDF executor (e.g., ArrowEvalPython, BatchEvalPython). | ||
| // Skip Project nodes with no unsupported expressions since they | ||
| // are just containers for child UDF execs running in Python. |
There was a problem hiding this comment.
Do we expect this path for Scala/Java UDF?
There was a problem hiding this comment.
Nope, Scala/Java should always be an expression inside a project exec.
| } | ||
|
|
||
| execs.filter(_.udf).flatMap { e => | ||
| val stageId = e.stages.headOption |
There was a problem hiding this comment.
Does this assume that all UDF execs map to a single stage?
There was a problem hiding this comment.
Related to #2070 (comment); for now we're only handling scalar UDFs so yes. I added a comment to that effect.
There was a problem hiding this comment.
+1 on the question
From AI:
Qualification has some fallback stage inference logic for execs whose stage list is empty. That means a UDF exec can end up with stage_id: null, an arbitrary stage from an unordered_set
sayedbilalbari
left a comment
There was a problem hiding this comment.
Thanks @rishic3 , had a few questions
| dataType: Long | ||
| description: >- | ||
| Calculated as (submissionTime - completionTime) of the given stage | ||
| - label: udfReportJSON |
There was a problem hiding this comment.
@rishic3 We will need to add tools-api support for this new report. Mostly by adding the definition in the qualCoreReport.yaml.
The qualOutputTable.yaml is consumed by the scala but not by the python side. qualOutputTable.yaml is an old file that is eventually hoped to be deprecated.
| description: >- | ||
| UDF detection report containing detected UDFs and metadata | ||
| fileName: udf_report.json | ||
| fileFormat: JSON |
There was a problem hiding this comment.
qq: Any specific reason to keep this as a JSON file ?
PS - with the API support reading the output is equally simple for a csv/json ( reads as dataframe in case of csv but remains as a json in case of json files )
| } | ||
|
|
||
| execs.filter(_.udf).flatMap { e => | ||
| val stageId = e.stages.headOption |
There was a problem hiding this comment.
+1 on the question
From AI:
Qualification has some fallback stage inference logic for execs whose stage list is empty. That means a UDF exec can end up with stage_id: null, an arbitrary stage from an unordered_set
| sql_id: Long, | ||
| stage_id: Option[Int]) | ||
|
|
||
| case class UdfMetrics( |
There was a problem hiding this comment.
qq: Any other UDF metrics that are perhaps not being extracted in tools correctly and we can have those in future iterations that will be helpful ?
| val sqlId = e.sqlID | ||
|
|
||
| if (e.unsupportedExprs.nonEmpty) { | ||
| // Container exec (e.g., Project) with named UDF expressions. |
There was a problem hiding this comment.
qq: Can a project not have a non-udf unsupported expression ?
| } | ||
|
|
||
| // UDF detection report (JSON). Reports all detected UDFs with metadata. | ||
| class AppQualUdfReportTable( |
There was a problem hiding this comment.
qq: Currently this file is rendered in the qual_metrics, where unsupported ops file already lives. udf_report is kind of a dedicated extraction from that unsupported_ops if I am correct.
The code structure is such that all per-app file writing logic is inside - QualPerAppReportGenerator ( same for UDF ). UdfReportGenerator is more of a builder for this particular report. Should we rename it to something like UdfReportBuilder or UdfReportViewBuilder ?
Depends on, and includes the diff from, #2069. Closes #2066.
This adds a new per-app JSON report,
udf_report.json, to serve as a centralized place for UDF-related info and metadata that we would want to surface to a user. This serves to raise awareness and motivate the user to convert UDFs to a GPU-accelerable equivalent.The outputs include: UDF exec/name, SQL ID, Stage ID, so that a user can quickly identify where the UDF is in their app, as well as coarse metrics, to give a rough idea of the impact.
Example output:
{ "has_udfs": true, "udfs": [{ "name": "IntegerMultiplyBy2UDF", "exec": "Project", "sql_id": 2, "stage_id": 1 }], "metrics": { "unsupported_task_duration_ms": 44, "app_task_duration_ms": 176, "unsupported_task_duration_pct": 25.0 } }