This repository contains a DuckDB extension for querying Lance format datasets (including scan, vector search, and full-text search). The DuckDB integration is implemented in C++ (under src/) and links a Rust static library (lance_duckdb_ffi) that uses the Lance Rust crate and exports data via the Arrow C Data Interface.
All documentation in this repository (including README.md and files under docs/) must be written in English.
For any planning, implementation, review, or refactor work in this repository:
- The only hard constraints are upstream DuckDB requirements and upstream Lance requirements.
- If any repository-local guidance conflicts with DuckDB or Lance behavior/contracts, follow DuckDB/Lance.
- Any decision made inside
lance-duckdb(APIs/ABIs, FFI boundaries, internal formats, module boundaries, naming, testing conventions, or implementation strategies) is mutable and may be changed when a better design is found. - Treat repository-local rules in this document as default guidance, not immutable policy.
This project is delivered as a fully self-contained DuckDB extension artifact (statically linked in our distribution). Low-level APIs/ABIs/formats in this repository (C++/Rust FFI boundaries, internal encodings, file/IPC formats, etc.) are strictly internal implementation details:
- They are not intended to be directly consumed by end users.
- There are no external downstream users that depend on them.
- Prioritize first principles and the most direct correct design; do not optimize for migrations or compatibility unless explicitly requested.
- Compatibility commitments are driven by DuckDB/Lance behavior, not by historical repository-local decisions.
- User-facing contracts should be defined at the SQL surface.
Expose user-facing capabilities primarily through SQL, not internal helper functions.
- Default: expose features via DuckDB SQL mechanisms (replacement scan,
ATTACH ... (TYPE LANCE), standard DDL/DML, andCOPY ... (FORMAT lance)). - Internal functions may exist for implementation/composition, but should not be the primary user-facing entry points.
- Narrow exceptions are allowed when a dedicated function is the clearest SQL contract (for example,
lance_ftsand similar search entry points).
When in doubt, prefer SQL surface area that matches DuckDB idioms over extension-specific internal entry points.
Prefer DuckDB/Lance native mechanisms over extension-specific replacements.
- Before adding a new mechanism, first check whether DuckDB or Lance already provides the required hook, lifecycle, cache, profiling surface, or introspection path.
- If an upstream mechanism can satisfy the requirement with reasonable adaptation, use it instead of inventing a parallel extension-specific mechanism.
- Prefer integration with DuckDB-native observability surfaces such as profiling,
EXPLAIN, and catalog/state hooks over custom debug-only SQL functions. - Introduce new extension-specific mechanisms only when upstream surfaces cannot express the requirement cleanly, and document why reuse was insufficient.
# Initial setup (only needed once)
git submodule update --init --recursive
# Build commands (provided by DuckDB extension tooling from `extension-ci-tools`)
make
GEN=ninja make release
GEN=ninja make debug
GEN=ninja make clean
GEN=ninja make clean_all
# Rust-only checks (without a full DuckDB/CMake build)
cargo check --manifest-path Cargo.toml
cargo clippy --manifest-path Cargo.toml --all-targetsThis repository is configured with uv, so you can run formatting via:
uv run make formatPR requirement: Before submitting a PR, run uv run make format and commit any resulting formatting changes (as a separate commit if it helps review).
Commit messages follow Conventional Commits:
type[(scope)]: short description
- Common types:
feat,fix,refactor,docs,ci,test,chore. - Scope is optional. When present, use the affected module (e.g.
search,resolver,scan). - Use
!suffix for breaking changes (e.g.refactor!: ...). - PR numbers are appended automatically by GitHub on merge (e.g.
(#153)).
The release build can be slow. For fast iteration, prefer test_debug when available.
# Run all tests (builds and runs sqllogictest)
GEN=ninja make test
# Run with specific build
GEN=ninja make test_debug # Test with debug build
GEN=ninja make test_release # Test with release build
# Run DuckDB with extension for manual testing
./build/release/duckdb -c "SELECT * FROM 'test/data/test_data.lance' LIMIT 1;"
# Or load the loadable extension from a standalone DuckDB binary
duckdb -unsigned -c "LOAD 'build/release/extension/lance/lance.duckdb_extension'; SELECT * FROM 'test/data/test_data.lance' LIMIT 1;"C++ and Rust communicate through the Arrow C Data Interface:
DuckDB SQL → C++ Extension (src/) → FFI calls → Rust (rust/) → Lance crate
← Arrow C Data Interface (schema + batches) ←
See docs/rust_guidelines.md for the FFI function patterns, error propagation, and memory ownership conventions.
The extension uses two custom binary wire formats to push work from C++ into Rust:
- Filter IR (magic
LFT1): Serializes DuckDB bound filter expressions into a compact binary format that Rust deserializes into DataFusionExpr. This avoids passing expression trees through C function signatures. - Exec IR (magic
LEX1): Encodes aggregate/projection plans so that Rust can execute them entirely inside Lance/DataFusion, returning only pre-aggregated results to C++.
When modifying filter or aggregate pushdown, changes typically span both sides: C++ serialization (lance_filter_ir.cpp / lance_exec_ir.cpp) and Rust deserialization (rust/filter_ir.rs / rust/exec_ir.rs). The magic bytes and version numbers must stay in sync.
ATTACH ... (TYPE LANCE) supports two namespace backends (directory for local, REST for remote LanceDB). Both lazily discover tables and auto-create catalog entries on first access.
Tests use DuckDB's sqllogictest format in test/sql/. S3/MinIO tests (s3_*.test) are gated by LANCE_TEST_S3=1 and require a running MinIO instance.
- Extension loading: Local builds are unsigned. Always use
duckdb -unsignedwhen loading the extension manually. - Debug vs release:
D_ASSERTchecks are stripped in release builds. A test that passes in one configuration but fails in the other usually indicates an assertion violation or uninitialized-variable issue. - Rust build diagnostics: When Cargo fails inside the CMake build, error output is hard to read. Run
cargo check --manifest-path Cargo.tomlin isolation for clearer diagnostics.
- C++: See docs/cpp_guidelines.md for coding style, error handling, and naming conventions.
- Rust: See docs/rust_guidelines.md for unsafe/FFI conventions and error handling.