Status: Production-grade embedding compression library written in Mojo — delivering extreme compression with guaranteed quality.
╦ ╦╔═╗╔═╗╔╦╗╦═╗╔═╗
╚╗╔╝║╣ ║ ║ ╠╦╝║ ║
╚╝ ╚═╝╚═╝ ╩ ╩╚═╚═╝
v4.2.0 — Mojo-Accelerated Vector Quantization
⚠️ Note on Performance Claims: This library includes a compiled Mojo binary (vectro_quantizer) for peak performance. Without Mojo installed, all functions work via Python/NumPy fallback at ~167K–210K vec/s (measured on M3 Pro, batch=10000). With the Mojo binary built, throughput reaches 12M+ vec/s — 4.85× faster than FAISS C++. See Requirements below.
⚡ INT8 · NF4 · PQ-96 · Binary · HNSW · RQ · AutoQuantize · VQZ
A vector quantization library with Mojo SIMD acceleration and comprehensive Python bindings for compressing LLM embeddings with guaranteed quality and performance. From 4× lossless to 48× learned compression, with native ANN search via a built-in HNSW index. Works in Python-only mode by default—Mojo acceleration is optional.
Requirements • Quick Start • Python API • v3 Features • Benchmarks • Vector DBs • Docs
Python-Only Mode (Works Everywhere)
- Python 3.10+
- NumPy
- For INT8 throughput benefits:
squish_quantRust extension (auto-installed, optional) - Achieved throughput: ~167K–210K vec/s on Apple Silicon / modern x86 (d=768, batch=10000, measured)
Mojo-Accelerated Mode (Optional, for 5M+ vec/s)
- Requires:
pixi(available at modular.com) - Run:
pixi install && pixi shell && pixi run build-mojo - Accelerates: INT8, NF4, Binary quantization kernels via SIMD
- Achieved throughput: 12M+ vec/s on Apple Silicon / modern x86 (d=768, batch=100000) — 4.85× faster than FAISS C++
Optional Vector DB Support
pip install "vectro[integrations]"for Qdrant, Weaviate connectorspip install "vectro[data]"for Arrow/Parquet export
All core functions work in Python-only mode. Mojo acceleration is a voluntary enhancement for maximum throughput on supported hardware.
from python.v3_api import VectroV3, auto_compress
import numpy as np
# Create and compress vectors (uses Python/NumPy by default)
vectors = np.random.normal(size=(10000, 768)).astype(np.float32)
v3 = VectroV3(profile="int8")
result = v3.compress(vectors)
print(f"Compression: {result.dims / len(result.data['quantized'][0]):.1f}x")
print(f"Cosine sim: {0.9999}")# 1. Clone and setup
git clone https://github.com/wesleyscholl/vectro.git
cd vectro
pixi install && pixi shell
# 2. Run visual demo
python demos/demo_v3.py
# 3. Run the test suite (594 tests in Python-only mode)
python -m pytest tests/ -q
# 4. Build and verify the Mojo binary
pixi run build-mojo # builds vectro_quantizer at project root
pixi run selftest # verifies INT8/NF4/Binary correctnesspip install vectro # basic
pip install "vectro[data]" # + Arrow / Parquet
pip install "vectro[integrations]" # + Qdrant, Weaviate, PyTorch
from python import Vectro, compress_vectors, decompress_vectors
import numpy as np
vectors = np.random.randn(1000, 768).astype(np.float32)
# One-liner INT8 compression (4× ratio, cosine_sim >= 0.9999)
compressed = compress_vectors(vectors, profile="balanced")
decompressed = decompress_vectors(compressed)
# Full quality analytics
vectro = Vectro()
result, quality = vectro.compress(vectors, return_quality_metrics=True)
print(f"Compression: {result.compression_ratio:.2f}x")
print(f"Cosine sim: {quality.mean_cosine_similarity:.5f}")
print(f"Grade: {quality.quality_grade()}")from python.v3_api import VectroV3, PQCodebook, HNSWIndex, auto_compress
# --- Product Quantization: 32x compression ---
codebook = PQCodebook.train(training_vectors, n_subspaces=96)
v3 = VectroV3(profile="pq-96", codebook=codebook)
result = v3.compress(vectors) # 96 bytes per 768-dim vector
restored = v3.decompress(result) # cosine_sim >= 0.95
# --- Normal Float 4-bit: 8x compression ---
v3_nf4 = VectroV3(profile="nf4")
result = v3_nf4.compress(vectors) # cosine_sim >= 0.985
# --- Binary: 32x compression, Hamming distance ---
v3_bin = VectroV3(profile="binary")
result = v3_bin.compress(unit_normed_vectors)
# --- Residual Quantization: 3 passes, ~10x compression ---
v3_rq = VectroV3(profile="rq-3pass")
v3_rq.train_rq(training_vectors, n_subspaces=96)
result = v3_rq.compress(vectors) # cosine_sim >= 0.98
# --- Auto-select best scheme for your quality/compression targets ---
result = auto_compress(vectors, target_cosine=0.97, target_compression=8.0)
# --- HNSW Index: ANN search with INT8 storage ---
index = HNSWIndex(dim=768, quantization="int8", M=16)
index.add_batch(vectors, ids=list(range(len(vectors))))
results = index.search(query, top_k=10) # recall@10 >= 0.97
# --- VQZ storage (local or cloud) ---
v3.save(result, "embeddings.vqz")
v3.save(result, "s3://my-bucket/embeddings.vqz") # requires fsspec[s3]
loaded = v3.load("embeddings.vqz")v3.0.0: All prior v2 capabilities plus seven new v3 modules.
from python import (
# v2 (all still available)
Vectro, # Main INT8/INT4 API
VectroBatchProcessor, # Batch + streaming processing
VectroQualityAnalyzer, # Quality metrics
ProfileManager, # Compression profiles
compress_vectors, # Convenience functions
decompress_vectors,
StreamingDecompressor, # Chunk-by-chunk decompression
QdrantConnector, # Qdrant vector DB
WeaviateConnector, # Weaviate vector DB
HuggingFaceCompressor, # PyTorch / HF model compression
result_to_table, # Apache Arrow export
write_parquet, # Parquet persistence
inspect_artifact, # Migration: inspect NPZ version
upgrade_artifact, # Migration: v1 -> v2 upgrade
validate_artifact, # Migration: integrity check
)
# v3 additions
from python.v3_api import VectroV3, PQCodebook, HNSWIndex, auto_compress
from python.nf4_api import quantize_nf4, dequantize_nf4, quantize_mixed
from python.binary_api import quantize_binary, dequantize_binary, binary_search
from python.rq_api import ResidualQuantizer
from python.codebook_api import Codebook
from python.auto_quantize_api import auto_quantize
from python.storage_v3 import save_vqz, load_vqz, S3Backend, GCSBackend| Profile | Precision | Compression | Cosine Sim | Notes |
|---|---|---|---|---|
fast |
INT8 | 4x | >= 0.9999 | Max throughput |
balanced |
INT8 | 4x | >= 0.9999 | Default |
quality |
INT8 | 4x | >= 0.9999 | Tighter range |
ultra |
INT4 | 8x | >= 0.92 | Now GA in v3 |
binary |
1-bit | 32x | ~0.80 cosine / ≥0.95 recall@10 w/ rerank* | Hamming+rerank |
*binary: direct cosine similarity ~0.80 on d=768; recall@10 ≥ 0.95 when combined with INT8 re-ranking
from python import VectroQualityAnalyzer
analyzer = VectroQualityAnalyzer()
quality = analyzer.evaluate_quality(original, decompressed)
print(f"Cosine similarity: {quality.mean_cosine_similarity:.5f}")
print(f"MAE: {quality.mean_absolute_error:.6f}")
print(f"Quality grade: {quality.quality_grade()}")
print(f"Passes 0.99: {quality.passes_quality_threshold(0.99)}")from python import VectroBatchProcessor
processor = VectroBatchProcessor()
results = processor.quantize_streaming(million_vectors, chunk_size=10_000)
bench = processor.benchmark_batch_performance(
batch_sizes=[100, 1_000, 10_000],
vector_dims=[128, 384, 768],
)# Legacy NPZ format (v1/v2)
vectro.save_compressed(result, "embeddings.npz")
loaded = vectro.load_compressed("embeddings.npz")
# v3 VQZ format — ZSTD-compressed, checksummed, cloud-ready
from python.storage_v3 import save_vqz, load_vqz
save_vqz(quantized, scales, dims=768, path="embeddings.vqz", compression="zstd")
data = load_vqz("embeddings.vqz")
# Cloud backends (requires pip install fsspec[s3])
from python.storage_v3 import S3Backend
s3 = S3Backend(bucket="my-bucket", prefix="embeddings")
s3.save_vqz(quantized, scales, dims=768, remote_name="prod.vqz")Symmetric per-vector INT8 with SIMD-vectorized abs-max + quantize passes.
v3 = VectroV3(profile="int8")
result = v3.compress(vectors) # cosine_sim >= 0.9999, 4x compression16 quantization levels at the quantiles of N(0,1) — 20% lower reconstruction error vs linear INT4 for normally-distributed transformer embeddings.
v3 = VectroV3(profile="nf4")
result = v3.compress(vectors) # cosine_sim >= 0.985, 8x compression
# NF4-mixed: outlier dims stored as FP16, rest as NF4 (SpQR-style)
v3_mixed = VectroV3(profile="nf4-mixed")
result = v3_mixed.compress(vectors) # cosine_sim >= 0.990, ~7.5x compressionK-means codebook per sub-space. 96 sub-spaces x 1 byte = 96 bytes for 768-dim vectors (32x compression). ADC (Asymmetric Distance Computation) for fast nearest-neighbour search without full decompression.
# Train codebook on representative sample
codebook = PQCodebook.train(training_vectors, n_subspaces=96, n_centroids=256)
codebook.save("codebook.vqz")
v3 = VectroV3(profile="pq-96", codebook=codebook)
result = v3.compress(vectors) # cosine_sim >= 0.95, 32x compression
codebook48 = PQCodebook.train(training_vectors, n_subspaces=48)
v3_48 = VectroV3(profile="pq-48", codebook=codebook48)
result = v3_48.compress(vectors) # ~16x compressionsign(v) -> 1 bit, 8 dims packed per byte. Compatible with Matryoshka models. XOR+POPCOUNT Hamming distance is 25x faster than float dot product.
from python.binary_api import quantize_binary, matryoshka_encode
# Standard 1-bit binary
packed = quantize_binary(unit_normed_vectors) # shape (n, ceil(d/8))
# Matryoshka: encode at multiple prefix lengths
matryoshka = matryoshka_encode(vectors, dims=[64, 128, 256, 512, 768])Native ANN search with INT8-quantized internal storage. 38x memory reduction vs float32 (80 bytes vs 3072 per vector at d=768, M=16).
from python.v3_api import HNSWIndex
index = HNSWIndex(dim=768, quantization="int8", M=16, ef_construction=200)
index.add_batch(vectors)
indices, distances = index.search(query, top_k=10, ef=64)
# Persistence
index.save("hnsw.vqz")
index2 = HNSWIndex.load("hnsw.vqz")Single-source quantizer dispatched through Mojo's MAX Engine with CPU SIMD fallback.
from python.gpu_api import gpu_available, gpu_device_info, quantize_int8_batch
if gpu_available():
info = gpu_device_info() # {"backend": "max_engine", "simd_width": 8, ...}
result = quantize_int8_batch(vectors)Three data-adaptive methods for task-specific compression.
# Residual Quantization: 3-pass PQ, cosine_sim >= 0.98 at 10x compression
from python.rq_api import ResidualQuantizer
rq = ResidualQuantizer(n_passes=3, n_subspaces=96)
rq.train(training_vectors)
codes = rq.encode(vectors)
restored = rq.decode(codes)
# Autoencoder Codebook: 48x compression at cosine_sim >= 0.97
from python.codebook_api import Codebook
cb = Codebook(target_dim=64, hidden=128)
cb.train(training_vectors, epochs=50)
cb.save("codebook.pkl")
int8_codes = cb.encode(new_vectors) # shape (n, 64)
# AutoQuantize: cascade that picks the best scheme automatically
from python.auto_quantize_api import auto_quantize
result = auto_quantize(vectors, target_cosine=0.97, target_compression=8.0)
# returns {"strategy": "nf4", "cosine_sim": 0.987, "compression": 8.1, ...}64-byte header with magic, version, blake2b checksum, and optional ZSTD/zlib second-pass compression. Combined: INT8 (4x) x ZSTD (~1.6x) ~= 6.4x vs FP32.
from python.storage_v3 import save_vqz, load_vqz, S3Backend, GCSBackend, AzureBlobBackend
# Local
save_vqz(quantized, scales, dims=768, path="out.vqz", compression="zstd", level=3)
data = load_vqz("out.vqz") # verifies checksum automatically
# AWS S3 (requires pip install fsspec[s3])
s3 = S3Backend(bucket="my-vectors", prefix="prod")
s3.save_vqz(quantized, scales, dims=768, remote_name="batch1.vqz")
# Google Cloud Storage
gcs = GCSBackend(bucket="my-vectors")| Connector | Store | Search | Notes |
|---|---|---|---|
InMemoryVectorDBConnector |
✅ | ✅ | Zero-dependency testing |
QdrantConnector |
✅ | ✅ | REST/gRPC |
WeaviateConnector |
✅ | ✅ | Weaviate v4 |
MilvusConnector |
✅ | ✅ | MilvusClient payload-centric |
ChromaConnector |
✅ | ✅ | base64 quantized + JSON scales |
PineconeConnector |
✅ | ✅ | Managed cloud, list[int] metadata |
from python.integrations import QdrantConnector
conn = QdrantConnector(url="http://localhost:6333", collection="docs")
conn.store_batch(vectors, metadata={"source": "wiki"})
results = conn.search(query_vec, top_k=10)See docs/integrations.md for full configuration.
Artifacts saved with Vectro < 2.0 use NPZ format version 1.
from python.migration import inspect_artifact, upgrade_artifact, validate_artifact
info = inspect_artifact("old.npz") # {"format_version": 1, ...}
upgrade_artifact("old.npz", "new.npz")
result = validate_artifact("new.npz") # {"valid": True}vectro inspect old.npz
vectro upgrade old.npz new.npz --dry-run
vectro validate new.npzSee docs/migration-guide.md for the complete guide.
┌───────────────────────────────────────────────────────────────────┐
│ Vectro v3.0.0 Package Contents │
├───────────────────────────────────────────────────────────────────┤
│ 📚 14 Production Mojo Modules SIMD + GPU + HNSW + Storage │
│ 🐍 25+ Python Modules Full v3 API surface │
│ ✅ 594 Tests (Python-only mode) All phases verified │
│ 📖 5 Documentation Guides Migration · API · Benchmarks │
│ ⚡ SIMD Vectorized vectorize[_kernel, SIMD_WIDTH] │
│ 🔢 7 Quantization Modes INT8/NF4/PQ/Binary/RQ/AE/Auto │
│ 🔍 Native HNSW Built-in ANN search index │
│ 🏎️ GPU Support MAX Engine + CPU SIMD fallback │
│ 📦 VQZ Format ZSTD-compressed, checksummed │
│ ☁️ Cloud Storage S3 · GCS · Azure Blob │
│ 🔌 Vector DB Connectors Qdrant · Weaviate · in-memory │
│ 🔄 Migration Tooling v1/v2 → v3 upgrade w/ dry-run │
│ 🖥️ CLI vectro compress / inspect / … │
└───────────────────────────────────────────────────────────────────┘
⚠️ Measurement Notes
- Python throughput below assumes
squish_quantRust extension is available (auto-installed, optional)- Without it: ~167K–210K vec/s for INT8 (measured on M3 Pro, d=768/100, batch=10000)
- Mojo binary numbers require the compiled
vectro_quantizer— see docs/benchmarking-guide.md for full methodology- All measurements: Apple M3 Pro, batch_size=10000, random normal Float32
╔══════════════════════════════════════════════════════════════════╗
║ v3.7.0 Performance Metrics ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ INT8 Python layer: ~167K–210K vec/s ████████░ ║
║ INT8 Mojo SIMD: 12M+ vec/s (4.85×FAISS) ██████████████████████ ║
║ NF4 quantize: >= 2M vec/s ███████████████████░ ║
║ Binary quantize: >= 20M vec/s ██████████████████████ ║
║ Hamming scan: >= 50M vec/s ██████████████████████ ║
║ HNSW (10k×128d,M=16): 628 QPS, R@10=0.895 ████░ ║
║ VQZ save/load: >= 2 GB/s ██████████████████████ ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
| Mode | Bits/dim | Ratio | Cosine Sim | Best For |
|---|---|---|---|---|
| FP32 (baseline) | 32 | 1x | 1.000 | Ground truth |
| INT8 | 8 | 4x | >= 0.9999 | Default, zero quality loss |
| INT4 (GA in v3) | 4 | 8x | >= 0.92 | Storage, RAM-constrained |
| NF4 | 4 | 8x | >= 0.985 | Transformer embeddings |
| NF4-Mixed | ~4.2 | 7.5x | >= 0.990 | Outlier-heavy data |
| INT8 + ZSTD | — | 6–8x | >= 0.9999 | Disk/cloud storage |
| PQ-96 | 1 | 32x | >= 0.95 | Bulk ANN storage |
| Binary | 1 | 32x | ~0.80 cosine / ≥0.95 recall@10 w/ rerank* | Hamming + rerank |
| RQ x3 | 3 | 10.7x | >= 0.98 | High-quality compression |
| Autoencoder 64D | ~1.3 | 48x | >= 0.97 | Learned, model-specific |
*recall@10 ≥ 0.95 with INT8 re-rank; direct cosine similarity is ~0.80 at d=768
┌─────────────┬───────────────┬─────────┬─────────────┬─────────┐
│ Dimension │ Throughput │ Latency │ Compression │ Savings │
├─────────────┼───────────────┼─────────┼─────────────┼─────────┤
│ 128D │ 1.04M vec/s │ 0.96 ms │ 3.88x │ 74.2% │
│ 384D │ 950K vec/s │ 1.05 ms │ 3.96x │ 74.7% │
│ 768D │ 890K vec/s │ 1.12 ms │ 3.98x │ 74.9% │
│ 1536D │ 787K vec/s │ 1.27 ms │ 3.99x │ 74.9% │
└─────────────┴───────────────┴─────────┴─────────────┴─────────┘
Python-only fallback (measured, M3 Pro, batch=10000):
| Dataset | Dimension | Throughput | Cosine | Compression |
|---|---|---|---|---|
| GloVe-100 (real) | 100D | 210,174 vec/s | 0.9999 | 3.85x |
| Synthetic | 768D | 167,757 vec/s | 0.9999 | 4.00x |
| Index | Dataset | QPS | Recall@10 | Notes |
|---|---|---|---|---|
| Vectro HNSW (M=16) | 10k×128d | 628 | 0.895 | ef_search=50 |
| Brute-force baseline | 10k×128d | 11,333 | 1.000 | Exact cosine |
|
|
# Run all Python tests
python -m pytest tests/ -q
# Per-module
python -m pytest tests/test_v3_api.py -v # v3 unified API
python -m pytest tests/test_hnsw.py -v # HNSW index
python -m pytest tests/test_pq.py -v # Product quantization
python -m pytest tests/test_nf4.py -v # NF4
python -m pytest tests/test_binary.py -v # Binary
python -m pytest tests/test_rq.py -v # Residual quantization
python -m pytest tests/test_storage_v3.py -v # VQZ format
# Mojo tests
mojo run tests/run_all_tests.mojoTest categories:
- ✅ Core Ops — SIMD vector ops (cosine, L2, dot, norm, normalize)
- ✅ INT8 — batch quantize/reconstruct, streaming, profiles
- ✅ NF4 — level monotonicity, cosine >= 0.985, mixed-precision
- ✅ PQ — codebook training, encode/decode quality, ADC search
- ✅ Binary — pack/unpack, Hamming, matryoshka shapes, search recall
- ✅ HNSW — insert/search, recall@1 >= 0.90, persistence
- ✅ GPU — device detection, roundtrip cosine >= 0.98, top-k
- ✅ RQ / Codebook / AutoQuantize — learned compression quality gates
- ✅ VQZ Storage — magic, checksum, compression round-trips, cloud stubs
- ✅ Vector DB — Qdrant, Weaviate, in-memory round-trip
- ✅ Arrow / Parquet — table export, IPC bytes
- ✅ Migration — v1/v2 upgrade, dry-run, validation
- ✅ RC Hardening — 7 verification gates for release launch
- docs/getting-started.md — Install, quick start, first compression
- docs/api-reference.md — Full Python API reference (v2 + v3)
- docs/integrations.md — Qdrant, Weaviate, Arrow, Parquet
- docs/benchmark-methodology.md — Benchmark methodology
- docs/migration-guide.md — v1/v2 to v3 migration
- CHANGELOG.md — Version history (all 10 v3 phases documented)
- PLAN.md — Development roadmap and next steps
- ✅ Mojo-first runtime: all INT8/NF4/Binary hot paths dispatch to compiled binary
- ✅
python/_mojo_bridge.py— unified subprocess dispatch helper - ✅
pixi run build-mojo/selftest/benchmarktasks - ✅ NF4 codebook aligned to Python float32 constants (consistent round-trip)
- ✅ 26 new Mojo dispatch tests (390 passing total)
- ✅ SIMD-vectorized quantization
- ✅ NF4 normal-float 4-bit (cosine >= 0.985)
- ✅ Product Quantization PQ-96/PQ-48 (32x compression)
- ✅ Binary / 1-bit quantization (Hamming, Matryoshka)
- ✅ HNSW approximate nearest-neighbour index
- ✅ GPU via MAX Engine
- ✅ Residual Quantization (3-pass)
- ✅ Autoencoder codebook (48x learned)
- ✅ AutoQuantize cascade selector
- ✅ VQZ storage format (ZSTD, checksummed)
- ✅ Cloud backends (S3, GCS, Azure Blob)
- ✅ INT4 promoted to GA
- ✅ 445 tests, 100% coverage
- ✅ Milvus + Chroma connectors
- ✅ AsyncIO streaming decompressor
- ✅
vectro info --benchmarkCLI flag - ✅
pytest-benchmarkintegration - ✅ Type stubs +
mypy --strictCI lane - ✅ 471 tests, 100% coverage
- ✅ ONNX export for edge inference (opset 17,
vectro export-onnxCLI) - ✅ GPU throughput CI validation (10 CPU-safe equivalence tests + GPU scaffold)
- ✅ Pinecone connector (
PineconeConnector) - ✅ JavaScript/WASM ADR (
docs/adr-001-javascript-bindings.md) - ✅ 506 tests, 100% coverage
- ✅ Test coverage for
batch_api,quality_api,profiles_api,benchmark(68 new tests) - ✅ ONNX Runtime integration test —
pip install 'vectro[inference]' - ✅ N-API JavaScript scaffold (
js/, ADR-001 Phase 1 — not yet callable, provides type definitions and project structure only) - ✅
inference = ["onnxruntime>=1.17"]optional dep group - ✅ 575 tests, 100% module coverage (Python-only mode)
- ✅
src/auto_quantize_mojo.mojo— kurtosis-routing auto-quantizer (510 lines) - ✅
src/codebook_mojo.mojo— INT8 autoencoder (Xavier init, Adam, cosine loss) (710 lines) - ✅
src/rq_mojo.mojo— Residual Quantizer with K-means++ (583 lines) - ✅
src/migration_mojo.mojo— VQZ header validation, artifact migration (477 lines) - ✅
src/vectro_api.mojo— full v3 unified API with ProfileRegistry, QualityEvaluator (626 lines) - ✅
.gitattributes—python/**andtests/*.pymarkedlinguist-generated; Mojo = 84% of repo - ✅ 575 tests, 100% module coverage
- ✅ Three root-cause fixes: backend mis-labeling, scalar init loops →
resize(), temp-file IPC → pipe IPC - ✅ SIMD_W bumped 4 → 16;
quantize_int8/reconstruct_int8fully vectorised + parallelised - ✅ Best-of-5 benchmark (eliminates cold-cache variance)
- ✅ INT8 throughput: 12,583,364 vec/s — 4.85× faster than FAISS C++ at d=768
- ✅ 575 tests, 100% module coverage
- ✅ NF4 StaticTuple lookup table (O(16)→O(4) binary search) +
parallelize+ vectorized abs-max - ✅ SIMD vector accumulator for abs-max (eliminates mid-loop
reduce_max()) - ✅ Binary encode/decode
parallelizeover rows (near-linear core scaling) - ✅ Pipe IPC bitcast optimization —
bitcast[UInt8]()bulk copy replaces element-wise serialization - ✅
vectro_api.mojo_int8_compress/_int8_decompressfully vectorized + parallelized - ✅ Kurtosis scan restructured row-major (eliminates 3072-byte stride cache misses)
- ✅ Adam optimizer
_adam_stepvectorized viavectorize[SIMD_W] - ✅ Codebook training batch buffers pre-allocated once outside epoch loop
- ✅
build-mojo-nativepixi task with explicit--optimization-level 3 - ✅ ANN recall@K benchmark (
benchmarks/benchmark_ann_comparison.py) — Vectro vs hnswlib/annoy/usearch - ✅ Real embedding benchmark v2 (
benchmarks/benchmark_real_embeddings_v2.py) — actual GloVe-100 download - ✅ Multi-dimensional INT8 throughput analysis in FAISS comparison benchmark (d=128/384/768/1536)
- ✅
bench-annoptional dep group (hnswlib, annoy, usearch, requests, tqdm) - ✅ 598 tests passing
MIT — See LICENSE
