[SPARK-53759][PYTHON][3.5] Fix missing close in simple-worker path #55225
Closed
anblanco wants to merge 1 commit into apache:branch-3.5
Conversation
On Python 3.12+, changed GC finalization ordering can close the underlying socket before `BufferedRWPair` flushes its write buffer, causing an `EOFException` on the JVM side. This affects the simple-worker (non-daemon) path used on Windows and when `spark.python.use.daemon=false`. This PR adds an explicit `sock_file.close()` in a `finally` block to the `__main__` blocks of all three worker files, matching how PR apache#54458 solved this on master via a context manager.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
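The fix pattern described above, running the worker and then flushing and closing the buffered socket file in a `finally` block, can be sketched in miniature with in-memory streams instead of a real socket (`run_worker` and `RecordingSink` are names invented for this illustration, not Spark code):

```python
import io

class RecordingSink(io.BytesIO):
    """In-memory stand-in for the socket's write side: snapshots what
    was actually flushed to it at the moment it is closed."""
    flushed = b""
    def close(self):
        if not self.closed:
            self.flushed = self.getvalue()
        super().close()

def run_worker(sock_file, main):
    """Shape of the backported __main__ blocks: run main(), then close
    the buffered socket file in a finally block instead of relying on
    BufferedRWPair.__del__ at interpreter shutdown."""
    try:
        main(sock_file, sock_file)
    finally:
        # BufferedRWPair.close() flushes the write buffer internally.
        sock_file.close()

sink = RecordingSink()
sock_file = io.BufferedRWPair(io.BytesIO(), sink)
run_worker(sock_file, lambda infile, outfile: outfile.write(b"result"))
assert sink.flushed == b"result"  # the buffered bytes reached the peer
```

The explicit close guarantees the flush happens while the underlying stream is still alive, regardless of finalization order at interpreter exit.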
gaogaotiantian
approved these changes
Apr 7, 2026
allisonwang-db
approved these changes
Apr 8, 2026
dongjoon-hyun
requested changes
Apr 8, 2026
Member
I'm blocking this PR according to the community backporting policy. The branch-4.0/4.1 PRs are not merged yet to master, @allisonwang-db .
dongjoon-hyun
approved these changes
Apr 8, 2026
Member
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @anblanco and all.
dongjoon-hyun
pushed a commit
that referenced
this pull request
Apr 8, 2026
### What changes were proposed in this pull request?

Backport of the SPARK-53759 fix to `branch-3.5` (parent PR: #55201, targeting `branch-4.1`).

Add an explicit `sock_file.close()` in a `finally` block wrapping the `main()` call in all three worker files' `__main__` blocks on this branch (`worker.py`, `foreach_batch_worker.py`, `listener_worker.py` — the only worker files that exist on `branch-3.5`).

The simple-worker codepath (used on Windows and when `spark.python.use.daemon=false`) calls `main(sock_file, sock_file)` as the last statement. When `main()` returns, the process exits without flushing the `BufferedRWPair` write buffer. On Python 3.12+, changed GC finalization ordering ([cpython#97922](python/cpython#97922)) causes the underlying socket to close before `BufferedRWPair.__del__` can flush, resulting in data loss and an `EOFException` on the JVM side.

On master, SPARK-55665 refactored `worker.py` to use a `get_sock_file_to_executor()` context manager with `close()` in the `finally` block, which covers this structurally. Since `branch-3.5` lacks that refactoring, this backport adds explicit `sock_file.close()` calls instead. `BufferedRWPair.close()` flushes internally, so this is equivalent.

Regression tests are in a separate PR targeting master: #55223.

### Why are the changes needed?

The simple-worker path crashes on Python 3.12+ with `EOFException`:

- **Windows**: Always uses simple-worker (`os.fork()` unavailable) — this crash is reproducible on every worker-dependent operation with Python 3.12+
- **Linux/macOS**: Reproducible when `spark.python.use.daemon=false`

Every worker-dependent operation (`rdd.map()`, `createDataFrame()`, UDFs) fails with:

```
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
Caused by: java.io.EOFException
```

This crash reproduces on all currently supported PySpark releases (3.5.8, 4.0.2, 4.1.1) with Python 3.12 and 3.13.
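The equivalence claim, that `BufferedRWPair.close()` flushes the write buffer before closing the underlying streams, can be checked with a small self-contained snippet over in-memory streams (the `Sink` helper is illustrative, not Spark code):

```python
import io

class Sink(io.BytesIO):
    """Records what has been flushed to it by the time it is closed."""
    flushed = b""
    def close(self):
        if not self.closed:
            self.flushed = self.getvalue()
        super().close()

sink = Sink()
pair = io.BufferedRWPair(io.BytesIO(), sink)
pair.write(b"payload")
assert sink.getvalue() == b""   # data still sits in the write buffer
pair.close()                    # close() flushes before closing the sink
assert sink.flushed == b"payload"
```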
**Root cause**: The daemon path (`daemon.py`) wraps `worker_main()` in `try/finally` with `outfile.flush()`. The simple-worker path (`worker.py` `__main__`) does not — it relies on `BufferedRWPair.__del__` during interpreter shutdown, which [is not guaranteed](https://docs.python.org/3/reference/datamodel.html#object.__del__) and no longer works as expected on Python 3.12+ due to changed GC finalization ordering.

### Does this PR introduce _any_ user-facing change?

Yes. Resolves a crash that prevents PySpark from functioning on Windows with Python 3.12+, and on Linux/macOS with `spark.python.use.daemon=false` on Python 3.12+.

### How was this patch tested?

- Red/green verified in a local WSL build (Python 3.12.3): the unfixed build fails with `EOFException`, the fixed build passes 9/9 SimpleWorkerTests
- Regression tests are in a separate PR targeting master: #55223
- Verification matrix from the standalone reproducer ([anblanco/spark53759-reproducer](https://github.com/anblanco/spark53759-reproducer)):

| Platform | Python | Unpatched | Patched |
|----------|--------|-----------|---------|
| Windows 11 | 3.11.9 | PASS | PASS (harmless) |
| Windows 11 | 3.12.10 | **FAIL** | PASS |
| Windows 11 | 3.13.3 | **FAIL** | PASS |
| Linux (Ubuntu 24.04) | 3.12.3 | **FAIL** | PASS |

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.6 (Anthropic)

Closes #55225 from anblanco/fix/SPARK-53759-backport-3.5.

Authored-by: Antonio Blanco <antonioblancocpe@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
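The root-cause ordering problem can be reproduced in miniature without sockets or GC: if the underlying stream is closed before the `BufferedRWPair` gets a chance to flush, the buffered bytes are lost, which is what the changed finalization ordering on Python 3.12+ does to the real socket (the `Sink` helper is illustrative, not Spark code):

```python
import io

class Sink(io.BytesIO):
    """Records what has been flushed to it by the time it is closed."""
    flushed = b""
    def close(self):
        if not self.closed:
            self.flushed = self.getvalue()
        super().close()

sink = Sink()
pair = io.BufferedRWPair(io.BytesIO(), sink)
pair.write(b"payload")   # sits in the 8 KiB write buffer, not yet flushed
sink.close()             # simulate the socket being finalized first
try:
    # The pending flush is now skipped or fails (implementation-dependent):
    # the underlying stream is already gone either way.
    pair.close()
except ValueError:
    pass
assert sink.flushed == b""  # the buffered payload never reached the peer
```

On the JVM side, those lost bytes are exactly what surfaces as the `EOFException` in the stack trace above.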
Member
Merged to branch-3.5.