Tools should be able to parse Spark Connect event logs. Connect introduces 9 new SparkListener event types that provide session-level and operation-level metadata not available in standard Spark. Additionally, the existing modifiedConfigs field in SparkListenerSQLExecutionStart — which the tools currently ignore — becomes critical for Connect workloads because spark.conf.set() is the only way Connect users can set configs (they don't control server startup).
This issue tracks the overall Spark Connect event-log support effort and is the parent issue for the phase-specific implementation work below.
Part 1: Parse modifiedConfigs from SparkListenerSQLExecutionStart (Bug Fix)
Not Connect-specific — this is an existing gap in OSS Spark parsing that Connect makes more visible.
SparkListenerSQLExecutionStart has a modifiedConfigs field that captures session-level config overrides (set via spark.conf.set()). The tools already parse this event but ignore modifiedConfigs entirely.
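To make the gap concrete, here is a minimal sketch of pulling modifiedConfigs out of these events. It assumes the standard newline-delimited JSON event-log layout and uses the Event/executionId/modifiedConfigs field names from the example payload shown later in this issue; the function name is illustrative, not an existing tool API:

```python
import json

SQL_EXEC_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

def session_config_overrides(eventlog_lines):
    """Collect modifiedConfigs per executionId from an event log.

    Assumes one JSON event per line, as written by Spark's event logging.
    """
    overrides = {}
    for line in eventlog_lines:
        event = json.loads(line)
        if event.get("Event") != SQL_EXEC_START:
            continue
        configs = event.get("modifiedConfigs", {})
        if configs:
            overrides[event["executionId"]] = configs
    return overrides

# One event log line shaped like the example payload later in this issue:
log = [
    '{"Event": "' + SQL_EXEC_START + '", "executionId": 0,'
    ' "modifiedConfigs": {"spark.sql.autoBroadcastJoinThreshold": "-1"}}'
]
print(session_config_overrides(log))
```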
Why it matters more for Connect
| Aspect | spark-submit | Spark Connect |
| --- | --- | --- |
| Baseline (SharedState) | Set at spark-submit time, includes --conf | Set at server startup, shared across all sessions |
| Config mechanism | --conf flags + spark.conf.set() | spark.conf.set() only |
| modifiedConfigs importance | Low (often empty, since configs passed via --conf land in SparkListenerEnvironmentUpdate) | High — primary config visibility mechanism; only way to see per-session overrides |
What modifiedConfigs contains
Session-level configs that differ from the global baseline (SharedState.conf, frozen at startup). spark.driver.* and spark.executor.* prefixes are excluded.
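The rule just described can be sketched as a diff against the frozen baseline. This is an illustration of the described behavior, not Spark's actual implementation; session_conf and baseline are hypothetical plain dicts:

```python
# Prefixes the description above says are excluded from modifiedConfigs.
EXCLUDED_PREFIXES = ("spark.driver.", "spark.executor.")

def modified_configs(session_conf: dict, baseline: dict) -> dict:
    """Session-level entries that differ from the frozen SharedState baseline."""
    return {
        k: v
        for k, v in session_conf.items()
        if not k.startswith(EXCLUDED_PREFIXES) and baseline.get(k) != v
    }

baseline = {"spark.sql.shuffle.partitions": "200"}
session = {
    "spark.sql.shuffle.partitions": "200",         # unchanged -> excluded
    "spark.sql.autoBroadcastJoinThreshold": "-1",  # overridden -> included
    "spark.executor.memory": "8g",                 # excluded prefix
}
print(modified_configs(session, baseline))
# -> {'spark.sql.autoBroadcastJoinThreshold': '-1'}
```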
Part 2: Support Connect-Specific Events (New Feature)
Spark Connect introduces 9 new SparkListenerEvent types posted to the standard LiveListenerBus (so they appear in event logs). These enable multi-user attribution, operation-level timing, session lifecycle tracking, and error reporting at the Connect operation level.
Event Catalog
Service/Session Lifecycle
| Event | Key Fields | Purpose |
| --- | --- | --- |
| SparkListenerConnectServiceStarted | hostAddress, bindingPort | Identifies the log as a Connect server log (1 per app) |
| SparkListenerConnectSessionStarted | sessionId, userId | Marks client connection; enables multi-user attribution |
| SparkListenerConnectSessionClosed | sessionId, userId | Session end; paired with Started gives session duration |
ConnectServiceStarted (1 per server)
└── ConnectSessionStarted (1+ per server, one per client)
├── ConnectOperationStarted ──────────────────┐
│ (jobTag, sessionId, userId, │
│ statementText) │ linked via jobTag
├── ConnectOperationAnalyzed │
├── ConnectOperationReadyForExecution │
├── ConnectOperationFinished (or Failed) │
├── ConnectOperationClosed │
│ │
│ SQLExecutionStart ─────────────────────────┤
│ (executionId, physicalPlan, │ jobTags contains
│ modifiedConfigs, jobTags) │ the same jobTag
│ │
│ JobStart ──────────────────────────────────┘
│ (Properties["spark.job.tags"] linked via spark.job.tags
│ contains the same jobTag)
│
└── ConnectSessionClosed
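The linkage in the diagram above can be sketched as a grouping pass keyed on the jobTag. The three field shapes follow the diagram: operations carry a scalar jobTag, SQLExecutionStart carries a jobTags array, and JobStart carries a comma-separated Properties["spark.job.tags"] string; the function itself is illustrative, not an existing tool API:

```python
def correlate_by_job_tag(events):
    """Group Connect operations, SQL executions, and jobs by shared jobTag."""
    by_tag = {}

    def bucket(tag):
        return by_tag.setdefault(
            tag, {"operations": [], "sql_executions": [], "jobs": []}
        )

    for e in events:
        name = e["Event"]
        if "ConnectOperation" in name:
            bucket(e["jobTag"])["operations"].append(e)        # scalar jobTag
        elif name.endswith("SQLExecutionStart"):
            for tag in e.get("jobTags", []):                   # array of tags
                bucket(tag)["sql_executions"].append(e)
        elif name.endswith("JobStart"):
            tags = e["Properties"].get("spark.job.tags", "")   # comma-separated
            for tag in tags.split(","):
                if tag:
                    bucket(tag)["jobs"].append(e)
    return by_tag

sample = [
    {"Event": "SparkListenerConnectOperationStarted", "jobTag": "t1"},
    {"Event": "SparkListenerSQLExecutionStart", "jobTags": ["t1"]},
    {"Event": "SparkListenerJobStart", "Properties": {"spark.job.tags": "t1"}},
]
grouped = correlate_by_job_tag(sample)
print(len(grouped["t1"]["operations"]),
      len(grouped["t1"]["sql_executions"]),
      len(grouped["t1"]["jobs"]))  # -> 1 1 1
```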
Operation Lifecycle Timing
Connect events provide phase-level timing breakdown not available from standard Spark events:
OperationStarted ─┐
├─ analysis time (~500-700ms)
OperationAnalyzed ─┤
├─ planning time (~400ms)
ReadyForExecution ─┤
├─ execution time (~2-6s)
OperationFinished ─┤
├─ result transfer time (~250ms)
OperationClosed ─┘
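Given timestamps for each lifecycle event of one operation, the four phase durations above fall out as simple differences. This is a hypothetical helper assuming each event carries an epoch-millis timestamp; the sample numbers mirror the approximate timings in the diagram:

```python
# (start event, end event, phase label) for each interval in the diagram.
PHASES = [
    ("ConnectOperationStarted", "ConnectOperationAnalyzed", "analysis"),
    ("ConnectOperationAnalyzed", "ConnectOperationReadyForExecution", "planning"),
    ("ConnectOperationReadyForExecution", "ConnectOperationFinished", "execution"),
    ("ConnectOperationFinished", "ConnectOperationClosed", "result transfer"),
]

def phase_durations(op_timestamps):
    """op_timestamps: {event_name: epoch_millis} for one operation (one jobTag)."""
    out = {}
    for start, end, label in PHASES:
        if start in op_timestamps and end in op_timestamps:
            out[label] = op_timestamps[end] - op_timestamps[start]
    return out

print(phase_durations({
    "ConnectOperationStarted": 0,
    "ConnectOperationAnalyzed": 600,
    "ConnectOperationReadyForExecution": 1000,
    "ConnectOperationFinished": 4000,
    "ConnectOperationClosed": 4250,
}))
# -> {'analysis': 600, 'planning': 400, 'execution': 3000, 'result transfer': 250}
```

A tool would first group events by jobTag (as in the correlation model), then apply this per operation; phases are skipped when an event is missing, e.g. for failed or canceled operations.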
Sub-issues

- Parse modifiedConfigs from SparkListenerSQLExecutionStart

Suggested execution order: Phase 1 → Phase 2 → Phase 3 (see Implementation Phases below).
Example modifiedConfigs payload from an event log:

```json
{
  "Event": "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
  "executionId": 0,
  "modifiedConfigs": {
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    "spark.app.name": "saralihalli-test"
  }
}
```

See core/docs/spark-connect-modifiedConfigs-analysis.md for the full code-path trace.
Operation Lifecycle

- ConnectOperationStarted
- ConnectOperationAnalyzed
- ConnectOperationReadyForExecution
- ConnectOperationFinished
- ConnectOperationClosed
- ConnectOperationFailed
- ConnectOperationCanceled

Correlation Model

The jobTag is the universal correlation key linking Connect operations to existing Spark events. It appears in:

- SparkListenerConnectOperation*.jobTag
- SparkListenerSQLExecutionStart.jobTags (array)
- SparkListenerJobStart.Properties["spark.job.tags"] (comma-separated string)
Implementation Phases

Phase 1: modifiedConfigs parsing (Part 1)

Tracked by: #2063

- Parse modifiedConfigs from SparkListenerSQLExecutionStart

Phase 2: Connect event awareness (Part 2 — minimum)

Tracked by: #2064

- SparkListenerConnectServiceStarted (eventlog-parser.yaml)
- jobTag for operation↔SQL execution correlation

Phase 3: Connect metadata extraction (Part 2 — full)

Tracked by: #2065

- OperationFailed for error attribution
- modifiedConfigs + sessionId correlation