[SPARK-56284] Adding UDF worker specification protobuf definition#55165
sven-weber-db wants to merge 1 commit into apache:master from
Conversation
Force-pushed 046fd81 to 3b6fb12
Force-pushed 3b6fb12 to 24ecbb2
cloud-fan
left a comment
Summary
This PR fills in the previously placeholder UDFWorkerSpecification protobuf message with the full worker specification schema per SPIP SPARK-55278.
Design approach: Two proto files define a layered worker specification:
- `common.proto`: shared types for reuse by both the worker spec and the forthcoming UDF protocol: `UDFWorkerDataFormat` (data serialization format), `UDFShape`/`SparkUDFShapes` (UDF execution shapes).
- `worker_spec.proto`: the full specification: `UDFWorkerSpecification` composes `WorkerEnvironment` (lifecycle callables), `WorkerCapabilities` (data formats, UDF shapes, concurrency/reuse flags), and a `DirectWorker` (process callable + connection + timeout properties). Transport is abstracted via `WorkerConnection` (a `oneof` of Unix domain socket or TCP).
Key design decisions:
- `ProcessCallable` separates `command` (executable prefix) from `arguments`, with the engine injecting `--id` and `--connection` at invocation time.
- `oneof worker` in `UDFWorkerSpecification` and `oneof transport` in `WorkerConnection` provide extension points for future worker provisioning strategies and transport types.
- `WorkerCapabilities.supports_concurrent_udfs` is defined but explicitly deferred for future use.
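To make the invocation-time injection concrete, here is a small sketch of how an engine might assemble the final argv from a `ProcessCallable`. The helper name, module name, and socket path are hypothetical; the actual engine-side code in this PR is not shown here.

```python
def build_worker_argv(command, arguments, worker_id, connection):
    """Assemble the argv for launching a worker process.

    `command` is the executable prefix and `arguments` are the
    user-supplied extra arguments from the ProcessCallable message.
    The engine appends the reserved --id and --connection flags
    itself, so they must not appear in the user-supplied arguments.
    """
    reserved = {"--id", "--connection"}
    clash = reserved.intersection(arguments)
    if clash:
        raise ValueError(f"reserved flags supplied by user: {sorted(clash)}")
    return list(command) + list(arguments) + [
        "--id", worker_id,
        "--connection", connection,
    ]

argv = build_worker_argv(
    command=["python3", "-m", "my_worker"],  # hypothetical worker module
    arguments=["--verbose"],
    worker_id="worker-1",
    connection="/tmp/worker-1.sock",
)
```

Keeping `command` and `arguments` separate lets the engine append its reserved flags without having to parse or rewrite a user-provided command line.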
General comments:
- `udf/worker/README.md` (line 23) still says "UDFWorkerSpecification -- currently a placeholder"; it should be updated now that the specification is filled in.
- Spark Connect protos use (Required)/(Optional) annotations on field comments to clarify the application-level contract. For fields like `supported_data_formats` (where the comment says "Every worker MUST at least support ARROW"), such annotations would make the requirement immediately visible to proto consumers.
```proto
// engine-configurable maximum time (e.g. 30 seconds).
optional int32 graceful_termination_timeout_ms = 2;

// The connection this [[UDFWorker]] supports. Note that a single
```
[[UDFWorker]] is not defined anywhere — no proto message, no Scala/Java class. The same dangling reference appears at lines 149 and 159. The closest entity is DirectWorker (line 101). Should these reference DirectWorker, or is UDFWorker a planned type not yet introduced?
Good catch! [[UDFWorker]] was the name we previously used for DirectWorker. Before raising this PR, it was renamed, and I seem to have forgotten to update all references in the text to the old name. This should be fixed now. Thank you!
```proto
// ["\"echo 'Test'\""]
//
// Every executable will ALWAYS receive a
// --id argument. This argument CANNOT be part of the below list of arguments.
```
The --id argument is explicitly reserved here ("CANNOT be part of the below list of arguments"), but --connection (injected by the engine per lines 130–134) has no such restriction documented. A user including --connection in their arguments would conflict with the engine-injected value. Consider adding the same reservation for --connection.
Yes, very good point. I have updated the description to a list of restricted values including both --id and --connection. Thank you!
```proto
}
}

enum SparkUDFShapes {
```
SparkUDFShapes uses plural naming, while UDFWorkerDataFormat in the same file uses singular. Proto convention recommends singular enum names — consider SparkUDFShape.
Good catch, thank you! Fixed.
```proto
// produces iterator to a batch of rows as output.
MAP_PARTITIONS = 2;
```
Grammar — "a iterator" and missing article:
```diff
- // produces iterator to a batch of rows as output.
+ // UDF receives an iterator to a batch of rows as input and
+ // produces an iterator to a batch of rows as output.
  MAP_PARTITIONS = 2;
```
```proto
// Which types of UDFs this worker supports.
// This should list all supported Shapes.
// Of which shape a specific UDF is will be communicated
```
Awkward phrasing:
```diff
- // Of which shape a specific UDF is will be communicated
+ // The shape of a specific UDF will be communicated
```
```proto
// Maximum amount of time to wait until the worker can accept connections.
//
// The engine will use this timeout, if it does not exceed a
// engine-configurable maximum time (e.g. 30 seconds).
```
Same issue at line 119.
```diff
- // The engine will use this timeout, if it does not exceed a
+ // The engine will use this timeout, if it does not exceed an
  // engine-configurable maximum time (e.g. 30 seconds).
```
```proto
// After this time, the worker process should have terminated itself.
// Otherwise, the process will be forcefully killed using SIGKILL.
//
// The engine will use this timeout, if it does not exceed a
```
```diff
- // The engine will use this timeout, if it does not exceed a
+ // The engine will use this timeout, if it does not exceed an
```
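The timeout semantics described in these comments (the worker-requested value is honored only up to an engine-configured cap) amount to a simple minimum. A sketch in Python, assuming a hypothetical 30-second default cap:

```python
def effective_timeout_ms(requested_ms, engine_max_ms=30_000):
    """Pick the timeout the engine will actually wait.

    The spec field is optional in the proto; if unset, fall back to
    the engine maximum. Otherwise the requested value is used only
    if it does not exceed the engine-configurable maximum.
    """
    if requested_ms is None:  # optional field left unset
        return engine_max_ms
    return min(requested_ms, engine_max_ms)
```

The same rule applies to both the startup timeout and `graceful_termination_timeout_ms`.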
```proto
}
}

// Communication between the engine and worker
```
Leading space before // — inconsistent with all other message-level comments:
```diff
-  // Communication between the engine and worker
+ // Communication between the engine and worker
```
```proto
// is done using a UNIX domain socket.
//
// On [[UDFWorker]] creation, a path to a socket
// to listen on is passed as a argument.
```
```diff
- // to listen on is passed as a argument.
+ // to listen on is passed as an argument.
```
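A minimal worker-side sketch of this transport, assuming the engine passes the socket path on the command line; the one-shot echo "protocol" here is purely illustrative, not the actual UDF protocol:

```python
import os
import socket

def serve_once(socket_path):
    """Listen on the Unix domain socket path handed to the worker,
    accept a single connection, and echo one message back."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        srv.bind(socket_path)  # the worker listens; the engine connects
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            conn.sendall(conn.recv(1024))
    finally:
        srv.close()
        os.unlink(socket_path)
```

Having the worker listen and the engine connect matches the comment above: the socket path is chosen by the engine and passed to the worker at creation time.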
```proto
// ["python3", "-m"]
// ["worker.bin"]
// ["java", "worker.java"]
// ["bin/bash", "-c"]
```
Missing leading /:
```diff
- // ["bin/bash", "-c"]
+ // ["/bin/bash", "-c"]
```
Force-pushed 24ecbb2 to bd1bbf2
sven-weber-db
left a comment
Adjusted according to review comments
Force-pushed bd1bbf2 to 70f38f9
What changes were proposed in this pull request?
This PR introduces the protobuf definitions for the UDF worker specification described in SPIP SPARK-55278 and this design document.
Overall, two new .proto files are introduced:
- `common.proto`: shared types and messages between the worker specification & the new UDF protocol (to be introduced)
- `worker_spec.proto`: UDF worker specification

Why are the changes needed?
This is the first step toward a language-agnostic UDF protocol for Spark that enables UDF workers written in any language to communicate with the Spark engine through a well-defined specification and API boundary. The abstractions introduced here establish the core contract that concrete implementations (e.g., process-based or gRPC-based workers) will build on.
The worker specification introduced in this PR captures all the information Spark needs to:
Does this PR introduce any user-facing change?
No. All new APIs are marked @experimental, and there are no behavioral changes to existing code.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
Yes, in an assistive manner and for reviews.