feat: support distributed IVF vector index builds #479
Open
jiaoew1991 wants to merge 1 commit into lance-format:main
…/ IVF_SQ)
Adds a VectorIndexJob path in AddIndexExec that uses lance-core's
multi-segment commit API: the driver pre-trains IVF centroids (and PQ
codebook for IVF_PQ) once via VectorTrainer, then each Spark task calls
createIndex(withFragmentIds([fid])) with those shared artifacts. The
driver collects the uncommitted per-fragment segments and publishes them
atomically under one logical index name via commitExistingIndexSegments.
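
The flow above can be sketched roughly as follows. This is a plain-Java simulation, not the lance-core or lance-spark API: `TrainedArtifacts`, `Segment`, `buildSegment`, and `commitAll` are hypothetical stand-ins for the trained centroids, the uncommitted per-fragment index, the per-task `createIndex` call, and `commitExistingIndexSegments`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not lance-core or lance-spark types.
class DistributedBuildSketch {

    // Artifacts trained once on the driver (IVF centroids; a PQ codebook would sit here too).
    static final class TrainedArtifacts {
        final double[][] centroids;
        TrainedArtifacts(double[][] centroids) {
            this.centroids = centroids;
        }
    }

    // An uncommitted per-fragment segment that remembers which artifacts it was built from.
    static final class Segment {
        final int fragmentId;
        final TrainedArtifacts artifacts;
        Segment(int fragmentId, TrainedArtifacts artifacts) {
            this.fragmentId = fragmentId;
            this.artifacts = artifacts;
        }
    }

    // Task side: build one segment for one fragment from the shared artifacts.
    static Segment buildSegment(int fragmentId, TrainedArtifacts shared) {
        return new Segment(fragmentId, shared);
    }

    // Driver side: validate every segment first, then publish them all in a single step.
    static void commitAll(Map<String, List<Segment>> catalog, String name, List<Segment> segments) {
        if (segments.isEmpty()) {
            throw new IllegalArgumentException("no segments to commit");
        }
        TrainedArtifacts expected = segments.get(0).artifacts;
        for (Segment s : segments) {
            if (s.artifacts != expected) {
                throw new IllegalArgumentException(
                    "segment for fragment " + s.fragmentId + " uses different centroids");
            }
        }
        // Nothing is visible under the index name until this single put.
        catalog.put(name, new ArrayList<>(segments));
    }
}
```

In this sketch a segment built from a different `TrainedArtifacts` instance makes `commitAll` throw before the catalog is touched, which mirrors the all-or-nothing publish step described above.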
Pre-trained centroids are required: lance-core's distributed build path
rejects per-fragment-trained centroids (rust/lance/src/index/vector.rs
"missing precomputed IVF centroids"). Sharing centroids across segments
also keeps them in the same query-time compatibility group, avoiding
per-segment routing overhead.
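
To illustrate why shared centroids matter at query time: a partition ID only means the same thing in every segment if all segments were built against the same centroid list. The sketch below is ordinary nearest-centroid assignment under squared L2 distance, not lance-core code; `squaredL2` and `nearestPartition` are hypothetical helpers.

```java
// Illustrative only: plain nearest-centroid assignment, not lance-core's implementation.
class SharedCentroidsSketch {

    // Squared L2 distance between two vectors of equal length.
    static double squaredL2(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to v; in an IVF index this index is the partition ID.
    static int nearestPartition(double[] v, double[][] centroids) {
        int best = 0;
        for (int p = 1; p < centroids.length; p++) {
            if (squaredL2(v, centroids[p]) < squaredL2(v, centroids[best])) {
                best = p;
            }
        }
        return best;
    }
}
```

With one shared `centroids` array, a query can compute its partition ID once and probe that same ID in every segment. With per-fragment centroids the same ID would refer to unrelated regions of the vector space, so each segment would need its own routing step.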
Grammar is unchanged — users write:
ALTER TABLE t CREATE INDEX v_idx USING ivf_pq (embedding)
WITH (num_partitions=256, num_sub_vectors=16, metric_type='l2');
Scalar BTREE / INVERTED paths are untouched; those still produce partial
per-fragment files that require mergeIndexMetadata and cannot use the
segment commit API in v6.0.0-beta.2.
Replace-on-recreate is preserved via dropIndex(name) before commit, since
commitExistingIndexSegments is additive by default.
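
A minimal sketch of why the drop has to come first, assuming an additive commit simply appends segments to whatever the index name already holds. `ReplaceOnRecreateSketch` is a hypothetical in-memory catalog and its method names are chosen for the example; it does not reproduce lance-core's transaction handling.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory catalog that only mimics an additive segment commit.
class ReplaceOnRecreateSketch {
    private final Map<String, List<String>> segmentsByIndex = new HashMap<>();

    // Additive commit: new segments are appended to whatever the name already holds.
    void commitSegments(String name, List<String> segments) {
        segmentsByIndex.computeIfAbsent(name, k -> new ArrayList<>()).addAll(segments);
    }

    // Removes the index and all of its segments; a no-op if the name is unknown.
    void dropIndex(String name) {
        segmentsByIndex.remove(name);
    }

    // Replace semantics: drop first so the additive commit starts from an empty index.
    void recreate(String name, List<String> segments) {
        dropIndex(name);
        commitSegments(name, segments);
    }

    // Current segments for the name, or an empty list if the index does not exist.
    List<String> segments(String name) {
        return segmentsByIndex.getOrDefault(name, new ArrayList<>());
    }
}
```

Calling `commitSegments` twice on the same name leaves both batches in place, while `recreate` leaves only the new batch, which is the behavior users expect when they re-run CREATE INDEX with an existing name.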
Summary
Adds ivf_flat, ivf_pq, ivf_sq as CREATE INDEX methods, using lance-core's multi-segment commit API (commitExistingIndexSegments) to publish per-fragment segments atomically under one logical index name.

User-facing
Grammar is untouched; IndexUtils.buildIndexType learns three new cases. Replace-on-recreate is preserved via dropIndex(name) before commit (the segment commit API is additive by default).

Design
- lance-core's distributed build path (rust/lance/src/index/vector.rs:514) rejects per-fragment-trained centroids; it requires pre-computed IVF centroids (and a PQ codebook for IVF_PQ). The driver calls VectorTrainer.trainIvfCentroids / trainPqCodebook once; every per-fragment task uses the same artifacts. This also keeps all segments in the same query-time compatibility group.
- Each Spark task calls createIndex(IndexOptions.builder(...).withFragmentIds([fid]).build()), which returns an uncommitted Index. org.lance.index.Index is not Serializable, so a small Scala case class LanceIndexHandle carries its fields as primitives and the driver rebuilds Index via the builder.
- The driver publishes the collected segments via commitExistingIndexSegments(name, column, segments).
- FragmentBasedIndexJob / RangeBasedBTreeIndexJob still use mergeIndexMetadata + manual CreateIndex transaction. In v6.0.0-beta.2 scalar per-fragment createIndex produces partial files that require the merge finalize step, so the segment commit API is effectively vector-only for this release.

Known follow-ups (not in this PR)
- sample_rate, max_iters, hnsw_m, hnsw_ef_construction, etc.
- OPTIMIZE INDEX for incremental builds over new fragments

Test plan
- AddIndexTest cases green on lance-spark-3.5_2.12 (11 existing scalar + 5 new vector)
- AddIndexTest cases green on lance-spark-4.0_2.13
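
As a rough illustration of the handle pattern described under Design (a non-Serializable index object carried back to the driver as primitives and rebuilt there), the sketch below uses plain Java serialization. `FakeIndex` and `IndexHandle` are hypothetical stand-ins for org.lance.index.Index and LanceIndexHandle; the field names are invented, and the real handle is a Scala case class that rebuilds the index through a builder rather than a constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-ins; field names are invented for the example.
class HandleSketch {

    // Deliberately not Serializable, like the index class described in the PR.
    static final class FakeIndex {
        final String uuid;
        final String name;
        final int fragmentId;
        FakeIndex(String uuid, String name, int fragmentId) {
            this.uuid = uuid;
            this.name = name;
            this.fragmentId = fragmentId;
        }
    }

    // Serializable carrier holding only strings and primitives.
    static final class IndexHandle implements Serializable {
        private static final long serialVersionUID = 1L;
        final String uuid;
        final String name;
        final int fragmentId;

        private IndexHandle(String uuid, String name, int fragmentId) {
            this.uuid = uuid;
            this.name = name;
            this.fragmentId = fragmentId;
        }

        // Task side: copy the fields out of the non-serializable object.
        static IndexHandle from(FakeIndex index) {
            return new IndexHandle(index.uuid, index.name, index.fragmentId);
        }

        // Driver side: rebuild the index object from the carried fields.
        FakeIndex toIndex() {
            return new FakeIndex(uuid, name, fragmentId);
        }
    }

    // Simulates shipping the handle from a task to the driver via Java serialization.
    static IndexHandle roundTrip(IndexHandle handle) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(handle);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (IndexHandle) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("handle round trip failed", e);
        }
    }
}
```

Passing a `FakeIndex` directly to `writeObject` would fail with `NotSerializableException`, which is the problem the handle works around.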