Lance Spark is an Apache Spark connector for the Lance columnar data format, enabling Spark to read and write Lance datasets. It supports multiple Spark versions (3.4, 3.5, 4.0, 4.1) and Scala versions (2.12, 2.13).
Multi-module Maven project with shared base code and version-specific modules:
- `lance-spark-base_2.12/` - Shared source code (Scala 2.12 sources, cross-compiled for 2.13)
- `lance-spark-{3.4,3.5}_2.{12,13}/` - Spark version-specific modules
- `lance-spark-{4.0,4.1}_2.13/` - Spark 4.x modules (Scala 2.13 only)
- `lance-spark-bundle-*/` - Shaded JARs for distribution
- `integration-tests/` - PySpark integration tests (pytest, runs against local/Azurite/MinIO/LanceDB Cloud)
- `docs/` - MkDocs documentation (hosted at lance.org/integrations/spark/)
- `docker/` - Docker environment for running integration tests and notebooks
- `benchmark/` - Performance benchmarks
Source lives in `lance-spark-base_2.12/src/main/{scala,java}/`; tests live alongside in `src/test/{scala,java}/`. The `build-helper-maven-plugin` copies both source and test directories across the version modules so they don't need to be duplicated: add code once in the base module and it builds and tests against all supported Spark/Scala versions.
Key dependency: `lance-core` (Java bindings to the Lance Rust library via JNI).
Use `./mvnw` instead of `mvn` to ensure the correct Maven version.
- Format: `./mvnw spotless:apply`
- Format (check only): `./mvnw spotless:check`
- Compile: `./mvnw compile -pl lance-spark-3.5_2.13 -am`
- Test: `./mvnw test -pl lance-spark-3.5_2.13`
- Full build: `make build SPARK_VERSION=3.5 SCALA_VERSION=2.13`
Use `-pl <module> -am` to target a specific Spark/Scala version and also build the modules it depends on.
- Use Scala for new code unless Java is required for Spark API compatibility.
- Follow existing code style, enforced by Spotless (`checkstyle.xml`) and scalafmt (`.scalafmt.conf`).
- Use top-level imports; do not use fully qualified class names.
- Keep the connector logic thin; delegate to `lance-core` where possible.
- Never open a Lance dataset directly. Use `Utils.openDatasetBuilder()` to construct an `OpenDatasetBuilder` from `LanceSparkReadOptions` or `LanceSparkWriteOptions`. The builder handles storage option merging and runtime session management.
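A minimal sketch of that pattern, for illustration only; the builder's `build()` method and the `Dataset.open` anti-pattern shown in the comment are assumptions about the `lance-core` API, so defer to the actual signatures in the codebase:

```scala
// Sketch only: method names on the builder are assumptions, not the
// connector's confirmed API.
import com.lancedb.lance.Dataset

def scanDataset(options: LanceSparkReadOptions): Unit = {
  // Good: the builder merges storage options and manages the runtime session.
  val dataset: Dataset = Utils.openDatasetBuilder(options).build()
  try {
    // ... enumerate fragments, plan scan tasks, etc.
  } finally {
    dataset.close()
  }
}

// Bad: opening directly skips storage option merging and session management.
// val dataset = Dataset.open(uri, allocator)
```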
- Configuration changes and new operations must update the docs. New Spark options go in `docs/src/config.md`. New DDL/DML operations go in `docs/src/operations/`. New supported versions go in `docs/src/install.md`.
- Keep docs concise and example-driven; show the Spark SQL or DataFrame API usage, as in the sketch below.
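For example, a docs snippet for a DataFrame read might look like the following; the option keys `db` and `dataset` are placeholders rather than confirmed configuration names, so use the keys listed in `docs/src/config.md`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lance-docs-example").getOrCreate()
import spark.implicits._

// Assumes the connector registers the short format name "lance";
// the option keys below are placeholders.
val df = spark.read
  .format("lance")
  .option("db", "s3://bucket/my_db")
  .option("dataset", "events")
  .load()

df.where($"event_type" === "click").show()
```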
- New functionality must include integration tests. Tests should cover the happy path and key error cases; keep them concise rather than exhaustively covering every edge case. A sketch of the expected shape follows this list.
- Add tests to existing test files/suites when possible rather than creating new ones.
- Tests run against real Lance datasets (no mocking the storage layer).
- For Spark version-specific behavior, add tests in the appropriate version module.
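A hedged sketch of that test shape, assuming a ScalaTest `AnyFunSuite` and a local-filesystem database path; the suite name, option keys, and setup are illustrative, so mirror the conventions of the existing suites in the base module:

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Illustrative only: option keys ("db", "dataset") and setup are
// assumptions; follow the existing suites rather than this sketch.
class LanceRoundTripSuite extends AnyFunSuite {

  test("write then read back a small dataset") {
    val spark = SparkSession.builder().master("local[2]").getOrCreate()
    import spark.implicits._
    val dbPath = Files.createTempDirectory("lance_db").toString

    // Happy path: write a real Lance dataset on the local filesystem.
    Seq((1, "a"), (2, "b")).toDF("id", "value")
      .write.format("lance")
      .option("db", dbPath)
      .option("dataset", "t")
      .save()

    // Read back and assert on the contents (no mocked storage layer).
    val count = spark.read.format("lance")
      .option("db", dbPath)
      .option("dataset", "t")
      .load()
      .count()
    assert(count == 2)

    spark.stop()
  }
}
```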
- Never force-push commits. Append new commits instead so reviewers can follow changes incrementally.
- Be concise. Focus on correctness, API consistency with Spark conventions, and test coverage.
- Verify that documentation is updated for any user-facing changes.