feat: support distributed IVF vector index builds #479

Open
jiaoew1991 wants to merge 1 commit into lance-format:main from jiaoew1991:feat/vector-index-support
@jiaoew1991
Contributor

Summary

Adds ivf_flat, ivf_pq, and ivf_sq as CREATE INDEX methods, using lance-core's multi-segment commit API (commitExistingIndexSegments) to publish per-fragment segments atomically under one logical index name.

User-facing

ALTER TABLE t CREATE INDEX v_idx USING ivf_pq (embedding)
  WITH (num_partitions=256, num_sub_vectors=16, num_bits=8, metric_type='l2');

Grammar is untouched; IndexUtils.buildIndexType learns three new cases. Replace-on-recreate is preserved via dropIndex(name) before commit (the segment commit API is additive by default).
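
To illustrate the kind of dispatch described above, here is a minimal sketch of mapping the method name after USING to an index kind. It is not the actual IndexUtils.buildIndexType code: the class IndexMethodSketch, the Kind enum, and the case-insensitive matching are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical sketch; not the real IndexUtils from lance-spark.
final class IndexMethodSketch {
    // Illustrative index kinds; the real index type enum may differ.
    enum Kind { BTREE, INVERTED, IVF_FLAT, IVF_PQ, IVF_SQ }

    // Maps the method name after USING to a kind (case-insensitive by assumption).
    static Kind parse(String method) {
        switch (method.toLowerCase(Locale.ROOT)) {
            case "btree":    return Kind.BTREE;
            case "inverted": return Kind.INVERTED;
            case "ivf_flat": return Kind.IVF_FLAT;
            case "ivf_pq":   return Kind.IVF_PQ;
            case "ivf_sq":   return Kind.IVF_SQ;
            default:
                throw new IllegalArgumentException("Unsupported index method: " + method);
        }
    }

    // Vector kinds would take the segment-commit path; scalar kinds keep the merge path.
    static boolean isVector(Kind kind) {
        return kind == Kind.IVF_FLAT || kind == Kind.IVF_PQ || kind == Kind.IVF_SQ;
    }

    public static void main(String[] args) {
        Kind kind = parse("ivf_pq");
        System.out.println(kind + " vector=" + isVector(kind));
    }
}
```

Keeping the vector/scalar decision in one predicate makes it easy to route the three new methods to the vector job while leaving the scalar jobs untouched.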

Design

  • Driver-side training: lance-core's distributed build (rust/lance/src/index/vector.rs:514) rejects per-fragment-trained centroids — it requires pre-computed IVF centroids (and a PQ codebook for IVF_PQ). The driver calls VectorTrainer.trainIvfCentroids / trainPqCodebook once; every per-fragment task uses the same artifacts. This also keeps all segments in the same query-time compatibility group.
  • Per-fragment tasks: each calls createIndex(IndexOptions.builder(...).withFragmentIds([fid]).build()) and returns an uncommitted Index.
  • Executor → driver handoff: org.lance.index.Index is not Serializable, so a small Scala case class LanceIndexHandle carries its fields as primitives and the driver rebuilds Index via the builder.
  • Commit: driver pre-drops any same-name index, then calls commitExistingIndexSegments(name, column, segments).
  • Scalar path unchanged: FragmentBasedIndexJob / RangeBasedBTreeIndexJob still use mergeIndexMetadata + manual CreateIndex transaction. In v6.0.0-beta.2 scalar per-fragment createIndex produces partial files that require the merge finalize step — the segment commit API is effectively vector-only for this release.
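
The executor → driver handoff above can be sketched as follows. The PR uses a Scala case class (LanceIndexHandle); this Java version is only an illustration of the same pattern. Segment is a stand-in for a non-Serializable library type, and its fields (uuid, name, fragmentId) are hypothetical, not the real fields of org.lance.index.Index.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class HandleSketch {
    // Stand-in for a library type that is NOT Serializable (hypothetical fields).
    static final class Segment {
        final String uuid;
        final String name;
        final int fragmentId;

        Segment(String uuid, String name, int fragmentId) {
            this.uuid = uuid;
            this.name = name;
            this.fragmentId = fragmentId;
        }
    }

    // Serializable carrier that copies the fields as plain values.
    static final class SegmentHandle implements Serializable {
        private static final long serialVersionUID = 1L;
        final String uuid;
        final String name;
        final int fragmentId;

        SegmentHandle(Segment segment) {
            this.uuid = segment.uuid;
            this.name = segment.name;
            this.fragmentId = segment.fragmentId;
        }

        // Driver side: rebuild the library object from the carried fields.
        Segment toSegment() {
            return new Segment(uuid, name, fragmentId);
        }
    }

    // Simulates shipping a handle from an executor to the driver via Java serialization.
    static SegmentHandle roundTrip(SegmentHandle handle) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(handle);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SegmentHandle) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Segment built = new Segment("seg-uuid-0", "v_idx", 7);
        Segment rebuilt = roundTrip(new SegmentHandle(built)).toSegment();
        System.out.println(rebuilt.name + " fragment " + rebuilt.fragmentId);
    }
}
```

Carrying only primitives and strings keeps the handle independent of the library's internal representation, at the cost of having to update the handle whenever the wrapped type gains a field that matters at commit time.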

Known follow-ups (not in this PR)

  • IVF_HNSW_FLAT / IVF_HNSW_SQ / IVF_HNSW_PQ (supported by lance-core; the code to add here is small)
  • IVF_RQ (same)
  • Expose more WITH-args: sample_rate, max_iters, hnsw_m, hnsw_ef_construction, etc.
  • OPTIMIZE INDEX for incremental builds over new fragments
  • Size-aware fragment grouping for unbalanced datasets

Test plan

  • All 16 AddIndexTest cases green on lance-spark-3.5_2.12 (11 existing scalar + 5 new vector)
  • All 16 AddIndexTest cases green on lance-spark-4.0_2.13
  • Manual smoke test on a non-local dataset

…/ IVF_SQ)

Adds a VectorIndexJob path in AddIndexExec that uses lance-core's
multi-segment commit API: the driver pre-trains IVF centroids (and PQ
codebook for IVF_PQ) once via VectorTrainer, then each Spark task calls
createIndex(withFragmentIds([fid])) with those shared artifacts. The
driver collects the uncommitted per-fragment segments and publishes them
atomically under one logical index name via commitExistingIndexSegments.

Pre-trained centroids are required: lance-core's distributed build path
rejects per-fragment-trained centroids (rust/lance/src/index/vector.rs
"missing precomputed IVF centroids"). Sharing centroids across segments
also keeps them in the same query-time compatibility group, avoiding
per-segment routing overhead.
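
A small sketch of why shared centroids matter, under the assumption that IVF assigns each vector to the partition of its nearest centroid (L2 here). The class and the toy centroid values are hypothetical and only illustrate the idea; this is not lance-core's implementation.

```java
// Hypothetical illustration of IVF partition assignment; not lance-core code.
final class SharedCentroidsSketch {
    // Returns the index of the centroid closest to v under squared L2 distance.
    static int nearest(float[][] centroids, float[] v) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dist = 0.0;
            for (int j = 0; j < v.length; j++) {
                double diff = centroids[i][j] - v[j];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[] query = {9f, 9f};
        // Centroids trained once on the driver and shared by every fragment.
        float[][] shared = {{0f, 0f}, {10f, 10f}};
        // Centroids a fragment might have trained on its own (same points, other order).
        float[][] perFragment = {{10f, 10f}, {0f, 0f}};
        System.out.println("shared -> partition " + nearest(shared, query));
        System.out.println("per-fragment -> partition " + nearest(perFragment, query));
    }
}
```

With shared centroids, partition i means the same region of the vector space in every segment, so a query can pick its partitions once. With independently trained centroids, the same vector can land in differently numbered partitions per segment, and each segment would need its own routing step.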

Grammar is unchanged — users write:

  ALTER TABLE t CREATE INDEX v_idx USING ivf_pq (embedding)
    WITH (num_partitions=256, num_sub_vectors=16, metric_type='l2');

Scalar BTREE / INVERTED paths are untouched; those still produce partial
per-fragment files that require mergeIndexMetadata and cannot use the
segment commit API in v6.0.0-beta.2.

Replace-on-recreate is preserved via dropIndex(name) before commit, since
commitExistingIndexSegments is additive by default.
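
The interaction between an additive commit and replace-on-recreate can be sketched with an in-memory model. AdditiveCommitSketch and its methods are hypothetical stand-ins, not the lance-core API; the sketch only shows why the pre-drop is needed when commits append.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of an additive segment commit; not the lance-core API.
final class AdditiveCommitSketch {
    private final Map<String, List<String>> segmentsByIndex = new HashMap<>();

    // Additive: new segments are appended to whatever the name already holds.
    void commitSegments(String name, List<String> segments) {
        segmentsByIndex.computeIfAbsent(name, k -> new ArrayList<>()).addAll(segments);
    }

    // Removes the index and all of its segments; a no-op if the name is unknown.
    void dropIndex(String name) {
        segmentsByIndex.remove(name);
    }

    // Replace-on-recreate: drop any same-name index first, then commit.
    void replace(String name, List<String> segments) {
        dropIndex(name);
        commitSegments(name, segments);
    }

    List<String> segments(String name) {
        return segmentsByIndex.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        AdditiveCommitSketch catalog = new AdditiveCommitSketch();
        catalog.commitSegments("v_idx", Arrays.asList("old-0", "old-1"));
        catalog.commitSegments("v_idx", Arrays.asList("new-0"));
        System.out.println("without drop: " + catalog.segments("v_idx"));
        catalog.replace("v_idx", Arrays.asList("new-0"));
        System.out.println("with drop:    " + catalog.segments("v_idx"));
    }
}
```

Without the drop, a second CREATE INDEX under the same name would leave stale segments next to the new ones; dropping first restores the replace semantics users expect.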
github-actions bot added the enhancement (New feature or request) label on Apr 25, 2026
jiaoew1991 requested a review from jackye1995 on April 25, 2026 16:30
eddyxu requested reviews from BubbleCal and hamersaw on April 29, 2026 23:35