[FEA] Extract and report Spark Connect session and operation metadata #2065

@sayedbilalbari

Parent issue: #2058

Summary

Implement full Spark Connect metadata extraction and reporting on top of the parser awareness added in earlier phases.

This is the Phase 3 work item from #2058.

Problem

Once Connect events are accepted, the tools still need to interpret them to provide user/session attribution, operation lifecycle timing, and operation-level failure metadata. This is the feature-complete Spark Connect support layer.

Scope

  • Parse Connect session lifecycle events for sessionId and userId
  • Parse Connect operation lifecycle events for operationId, jobTag, timing markers, row counts, and failure metadata
  • Correlate operation events with SQL executions/jobs via jobTag
  • Support per-session config tracking by combining session correlation with modifiedConfigs
  • Surface relevant metadata in qualification/profiling outputs where appropriate
  • Consider multi-user reporting or attribution breakdowns where the data model supports it
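For illustration, the jobTag correlation step could look roughly like the sketch below. It is not taken from the tools codebase: `ConnectOperation`, `correlate_jobs`, and the shape of the `jobs` mapping are hypothetical names chosen for this example, and it assumes each job's tags have already been extracted from the event log into a set of strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ConnectOperation:
    """Hypothetical in-memory record for one Connect operation."""
    operation_id: str
    session_id: str
    user_id: str
    job_tag: str
    job_ids: List[int] = field(default_factory=list)


def correlate_jobs(
    operations: List[ConnectOperation],
    jobs: Dict[int, Set[str]],
) -> List[int]:
    """Attach each job to the operation whose jobTag appears in the job's tags.

    `jobs` maps a job id to the set of tags recorded for that job (assumed
    to be parsed already). Returns the ids of jobs with no matching operation.
    """
    by_tag = {op.job_tag: op for op in operations}
    unmatched: List[int] = []
    for job_id in sorted(jobs):
        # Pick the first tag that belongs to a known operation, if any.
        match = next((by_tag[t] for t in jobs[job_id] if t in by_tag), None)
        if match is None:
            unmatched.append(job_id)
        else:
            match.job_ids.append(job_id)
    return unmatched
```

Because every operation carries `session_id` and `user_id`, a job matched this way inherits user/session attribution, and the same lookup could be reused to scope `modifiedConfigs` to a session. Jobs returned as unmatched would be non-Connect work or jobs whose operation events are missing from the log.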

Acceptance Criteria

  • Session lifecycle metadata is captured and queryable
  • Operation lifecycle timing can be reconstructed from Connect events
  • Operation failures/cancellations are attributable in tool output or internal models
  • Connect operations can be correlated to SQL executions/jobs through jobTag
  • Tests cover session attribution, timing reconstruction, and failure handling
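As a rough sketch of what "timing reconstruction" and "failure handling" might mean in a test, the example below folds a flat list of operation events into one timeline per operation. The event dictionaries are simplified stand-ins: the keys `event`, `operationId`, `eventTime`, and `errorMessage`, the short event names, and the function `reconstruct_timelines` are assumptions made for this example, not the actual event-log schema or tool API.

```python
from typing import Any, Dict, Iterable

# Assumed short names for the terminal operation events.
TERMINAL_STATUS = {
    "OperationFinished": "finished",
    "OperationFailed": "failed",
    "OperationCanceled": "canceled",
}


def reconstruct_timelines(
    events: Iterable[Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Build {operationId: timeline} from simplified operation events."""
    timelines: Dict[str, Dict[str, Any]] = {}
    for ev in events:
        tl = timelines.setdefault(
            ev["operationId"],
            {"start": None, "end": None, "status": "running", "error": None},
        )
        kind = ev["event"]
        if kind == "OperationStarted":
            tl["start"] = ev["eventTime"]
        elif kind in TERMINAL_STATUS:
            tl["end"] = ev["eventTime"]
            tl["status"] = TERMINAL_STATUS[kind]
            # Only failure events are expected to carry an error message.
            tl["error"] = ev.get("errorMessage")
    for tl in timelines.values():
        # Duration is only defined when both endpoints were observed.
        if tl["start"] is not None and tl["end"] is not None:
            tl["duration_ms"] = tl["end"] - tl["start"]
        else:
            tl["duration_ms"] = None
    return timelines
```

An operation with a start event but no terminal event stays in the `running` state with no duration, which is one way to represent truncated event logs without guessing an end time.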

Notes

Relevant analysis in repo:

  • core/docs/spark-connect-events-analysis.md
  • core/docs/spark-connect-operation-started-examples.md
