
feat: add custom Lance metrics to trace read-path scan performance #460

Open

summaryzb wants to merge 1 commit into lance-format:main from summaryzb:feat/read-path-custom-metrics

Conversation

@summaryzb
Contributor

Summary

Adds custom metrics to the Lance Spark read path using Spark's DataSource V2 CustomMetric API, enabling per-task timing and counter instrumentation that surfaces on the Spark UI Scan node. Six metrics are tracked: fragments scanned, batches read, dataset open time, scanner creation time, batch load time, and a derived total scan time.

Motivation

Implements #459.

Before this PR (Spark UI Scan node, no custom metrics):

[screenshot]

After this PR (Spark UI Scan node showing the Lance custom metrics):

[screenshot]

Approach

The implementation uses Spark's CustomMetric / CustomTaskMetric API, which is the standard DataSource V2 mechanism for surfacing connector-specific metrics in the Spark UI.

Metric definitions (LanceCustomMetrics): Six CustomSumMetric inner classes define the metrics. Each has a public no-arg constructor (required by Spark's reflection-based instantiation). A static allMetrics() method returns all definitions for LanceScan.supportedCustomMetrics(). The CustomSumMetric base class handles aggregation across tasks automatically.
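
As a rough illustration of this pattern (class, constant, and description strings here are illustrative rather than copied from the PR), one definition plus the allMetrics() aggregator might look roughly like:

    import org.apache.spark.sql.connector.metric.CustomMetric;
    import org.apache.spark.sql.connector.metric.CustomSumMetric;

    public final class LanceCustomMetrics {
      public static final String FRAGMENTS_SCANNED = "fragmentsScanned";

      // Public no-arg constructor (implicit here) so Spark can instantiate it reflectively.
      public static class FragmentsScannedMetric extends CustomSumMetric {
        @Override public String name() { return FRAGMENTS_SCANNED; }
        @Override public String description() { return "number of Lance fragments scanned"; }
      }

      // Returned from LanceScan.supportedCustomMetrics(); the real class lists all six.
      public static CustomMetric[] allMetrics() {
        return new CustomMetric[] {new FragmentsScannedMetric() /* , ...five more */};
      }
    }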

Executor-side tracking (LanceReadMetricsTracker): A thread-confined accumulator that lives inside each PartitionReader. It collects per-phase nanosecond timings and counters via simple add*() methods. The currentMetricsValues() method returns a snapshot array of CustomTaskMetric instances -- Spark calls this after each next() invocation. The derived scanTimeNs metric is computed as datasetOpenTimeNs + scannerCreateTimeNs + batchLoadTimeNs.
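
A minimal sketch of that shape, assuming the method names described above (the real class may differ in detail):

    import org.apache.spark.sql.connector.metric.CustomTaskMetric;

    // Hypothetical sketch of the executor-side accumulator; thread-confined,
    // so plain long fields are enough.
    final class LanceReadMetricsTracker {
      private long fragmentsScanned;
      private long datasetOpenTimeNs;
      private long scannerCreateTimeNs;
      private long batchLoadTimeNs;

      void addFragmentsScanned(long n) { fragmentsScanned += n; }
      void addDatasetOpenTimeNs(long ns) { datasetOpenTimeNs += ns; }
      void addScannerCreateTimeNs(long ns) { scannerCreateTimeNs += ns; }
      void addBatchLoadTimeNs(long ns) { batchLoadTimeNs += ns; }

      // Called from PartitionReader.currentMetricsValues() after each next().
      CustomTaskMetric[] currentMetricsValues() {
        long scanTimeNs = datasetOpenTimeNs + scannerCreateTimeNs + batchLoadTimeNs;
        return new CustomTaskMetric[] {
            taskMetric(LanceCustomMetrics.FRAGMENTS_SCANNED, fragmentsScanned),
            taskMetric("scanTimeNs", scanTimeNs)
            // ...the remaining counters and timings
        };
      }

      private static CustomTaskMetric taskMetric(String name, long value) {
        return new CustomTaskMetric() {
          @Override public String name() { return name; }
          @Override public long value() { return value; }
        };
      }
    }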

Instrumentation points: Timing is captured at three boundaries in the scan lifecycle:

  1. LanceFragmentScanner.create() wraps Dataset.open() and fragment.newScan() with System.nanoTime() measurements, storing the durations as instance fields.
  2. LanceFragmentColumnarBatchScanner.loadNextBatch() measures each ArrowReader.loadNextBatch() call.
  3. LanceColumnarPartitionReader reads these timings from the scanner and feeds them into its LanceReadMetricsTracker.

The same pattern is applied to LanceCountStarPartitionReader (pushed-down COUNT(*)) and LanceRowPartitionReader (delegates to the columnar reader). All three reader types override currentMetricsValues() to report metrics to Spark.
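
To illustrate the shape of each instrumentation point, here is a purely hypothetical timing helper; the PR itself inlines System.nanoTime() around the calls and stores the deltas as scanner fields rather than introducing a helper like this:

    import java.util.function.Supplier;

    // Hypothetical helper: run a unit of work and capture its wall-clock duration.
    final class Timed<T> {
      final T value;
      final long elapsedNs;

      private Timed(T value, long elapsedNs) {
        this.value = value;
        this.elapsedNs = elapsedNs;
      }

      static <T> Timed<T> of(Supplier<T> work) {
        long start = System.nanoTime();
        T result = work.get();
        return new Timed<>(result, System.nanoTime() - start);
      }
    }

Each of the three boundaries (dataset open, scanner creation, batch load) is one such measurement, and the resulting elapsedNs is what the partition reader feeds into the tracker's add*() methods.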

Test Coverage

  • Metric count: allMetrics() returns exactly 6 metrics.
  • Name uniqueness: all metric names are distinct.
  • Name correctness: metric names match the string constants defined in LanceCustomMetrics.
  • Description presence and uniqueness: every metric has a non-empty, unique description.
  • Sum aggregation: CustomSumMetric.aggregateTaskMetrics() correctly sums values (including empty array).
  • Reflection instantiation: all six metric inner classes can be constructed via no-arg reflection (as Spark does at runtime).
  • Tracker initial state: all counters start at zero.
  • Tracker accumulation: repeated add*() calls accumulate correctly; derived scanTimeNs equals sum of three sub-timings.
  • Tracker bulk add: addFragmentsScanned() supports multi-fragment increments.
  • Tracker-definition consistency: task metric names match definition names (required for Spark aggregation).
  • Integration -- columnar read: a SELECT x, y query produces non-zero values for all six metrics, and scanTimeNs == datasetOpenTimeNs + scannerCreateTimeNs + batchLoadTimeNs after Spark aggregation.
  • Integration -- COUNT(*) with filter: the LanceCountStarPartitionReader path also produces non-zero values for all metrics.
  • Integration -- basic read/count completions: verifies that metrics instrumentation does not break normal query execution.
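
As a rough sketch of the definition-level checks (metric count, name uniqueness, description presence), assuming JUnit 5 and the class layout described above; the actual tests in the PR may be organized differently:

    import static org.junit.jupiter.api.Assertions.*;

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.spark.sql.connector.metric.CustomMetric;
    import org.junit.jupiter.api.Test;

    class LanceCustomMetricsTest {
      @Test
      void allMetricsAreDistinctAndComplete() {
        CustomMetric[] metrics = LanceCustomMetrics.allMetrics();
        assertEquals(6, metrics.length);

        Set<String> names = new HashSet<>();
        for (CustomMetric metric : metrics) {
          assertTrue(names.add(metric.name()), "duplicate metric name: " + metric.name());
          assertFalse(metric.description().isEmpty());
        }
      }
    }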

Change-Id: I63dd17d7e8469c27a73251d7eca3ac373d279d7f
github-actions bot added the enhancement label on Apr 20, 2026
@hamersaw
Collaborator

Some review notes on the metrics approach, split into perf concerns, naming/correctness fixes, and gaps in coverage.

Performance concerns

The overall approach is sound — thread-confined long counters, no locks, and nanoTime() overhead per batch (~50ns total) is negligible against the JNI + Arrow IPC cost it's measuring. Two things to address before merge:

  1. Per-call allocation in currentMetricsValues(). LanceReadMetricsTracker.currentMetricsValues() allocates a fresh CustomTaskMetric[6] plus 6 anonymous inner-class instances every call. Spark invokes this once per PartitionReader.next(). That's per-batch for columnar (fine), but LanceRowPartitionReader.currentMetricsValues() delegates straight through, so for the row-oriented read path it's 7 allocations per row. Cache the array and metric instances once at construction and mutate the long fields in place — Spark snapshots the value on read, so reuse is safe. A sketch of this caching approach follows the list below.

  2. Redundant field shuttle for batch load time. LanceFragmentColumnarBatchScanner.lastBatchLoadTimeNs exists only so the partition reader can pull it back out via a getter. Pass the tracker into the scanner and have it call tracker.addBatchLoadTimeNs(...) directly. Removes a field, a getter, and a per-batch coupling point.
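
A sketch of the caching approach from point 1, using a hypothetical mutable CustomTaskMetric created once per counter so currentMetricsValues() can hand back the same array every call:

    import org.apache.spark.sql.connector.metric.CustomTaskMetric;

    // Hypothetical mutable task metric: Spark reads value() when it snapshots,
    // so the reader can keep mutating the same instance between calls.
    final class MutableTaskMetric implements CustomTaskMetric {
      private final String name;
      private long value;

      MutableTaskMetric(String name) { this.name = name; }

      void add(long delta) { value += delta; }

      @Override public String name() { return name; }
      @Override public long value() { return value; }
    }

The tracker would then build one of these per metric plus a single CustomTaskMetric[] at construction, mutate the longs via add(), and return the cached array from currentMetricsValues().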

One unrelated correctness win bundled in is worth noting: LanceCountStarPartitionReader.get() previously ran computeCount() on every call — now cached behind if (currentBatch == null). Spark only calls get() once per next()==true so it likely never bit anyone, but worth keeping.

Naming / correctness fixes

  1. Format time metrics for the UI. CustomSumMetric returns the raw long, so the UI will show "47382910283" instead of "47.4 s". Subclass once with a ns-formatting aggregateTaskMetrics and use it for every time metric:

    public abstract class CustomNsTimeMetric extends CustomSumMetric {
      @Override public String aggregateTaskMetrics(long[] taskMetrics) {
        long total = 0; for (long v : taskMetrics) total += v;
        return formatDuration(total); // "1.2 s", "350 ms", "47 us"
      }
    }

    SQLMetrics.stringValue is private[sql], so reimplement the formatter — it's ~10 lines. A sketch follows this list.

  2. Drop or rename scanTimeNs. It's open + create + load summed across fragments and batches in a partition — not a wall-clock scan duration. The name invites misinterpretation. Either delete it (the components are already exposed) or rename to lanceJniTotalTimeNs. Removing it also retires the "must be updated if new phases are added" comment in currentMetricsValues() — that comment is admitting a maintenance trap that doesn't need to exist.

  3. Match Spark naming conventions. Built-in metrics use numX for counts:

    • fragmentsScanned → numFragmentsScanned
    • batchesRead → numBatchesLoaded (verb consistency with loadNextBatch / batchLoadTimeNs)
  4. Verify the JNI surface for native stats. Before adding the row/byte counters in the next section, check whether lance-jni's LanceScanner exposes a getStats()-style call. If yes, pull from there — it's authoritative. If no, file an upstream ask.
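
Circling back to the formatter note under fix 1, a rough reimplementation of the nanosecond formatter could be (thresholds and precision are a judgment call):

    static String formatDuration(long ns) {
      if (ns < 1_000L) return ns + " ns";
      if (ns < 1_000_000L) return String.format("%.1f us", ns / 1_000.0);
      if (ns < 1_000_000_000L) return String.format("%.1f ms", ns / 1_000_000.0);
      if (ns < 60_000_000_000L) return String.format("%.1f s", ns / 1_000_000_000.0);
      return String.format("%.1f min", ns / 60_000_000_000.0);
    }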

Gaps in coverage

The current set tells you "how long did Lance take" but not "where did the time go" or "did pushdown work." Proposed final metric set:

Counts (CustomSumMetric):

  • numFragmentsScanned: Lance fragments actually opened by this task.
  • numFragmentsPruned: fragments eliminated by stats / zonemap before scan. Critical for verifying pushdown — silent regressions otherwise.
  • numBatchesLoaded: Arrow batches returned from the JNI scanner.
  • numRowsScanned: rows read from storage before filter evaluation. Pair with Spark's built-in numOutputRows to compute filter selectivity.
  • numIndexLookups: times a Lance scalar/vector index was consulted. Tells you whether a query stayed on the index path.
  • bytesRead: bytes pulled from object store / disk by this task. The single most important diagnostic metric — answers IO-bound vs CPU-bound. Format with a size formatter (KB/MB/GB).

Times (using the CustomNsTimeMetric formatter from the naming fixes above):

  • datasetOpenTimeNs: total time in Utils.openDatasetBuilder(...).build(), summed across fragments in the partition.
  • scannerCreateTimeNs: total time in fragment.newScan(...), summed across fragments.
  • indexLookupTimeNs: time spent in index probes. Sourced from native stats if available.
  • ioWaitTimeNs: time blocked on storage IO inside loadNextBatch. Requires a native split — file an ask if lance-jni doesn't surface it.
  • decodeTimeNs: time spent decompressing / decoding inside loadNextBatch (the non-IO portion). Same native-source dependency as above.
  • batchLoadTimeNs: total loadNextBatch wall time. Keep even if IO/decode are split — useful as a sanity check (ioWait + decode ≈ batchLoad).

Drop: scanTimeNs (derived and misleading — see the rename/drop note in the naming fixes above).

If lance-jni doesn't yet expose bytesRead, numFragmentsPruned, numRowsScanned, numIndexLookups, ioWaitTimeNs, or decodeTimeNs, ship the Java-measurable ones now (fragment/batch counts, the three timing phases, plus numRowsScanned from VectorSchemaRoot.getRowCount()) and gate the rest on an upstream change. Don't fake the missing ones with Java-side approximations — that's worse than not having them.
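
For the Java-measurable numRowsScanned mentioned above, a fragment sketch of the batch-load hook (not a complete class; the tracker field and addRowsScanned(...) are assumed names):

    // Fragment: times the Arrow batch load and counts rows before filter evaluation.
    boolean loadNextBatch() throws IOException {
      long start = System.nanoTime();
      boolean loaded = arrowReader.loadNextBatch();
      tracker.addBatchLoadTimeNs(System.nanoTime() - start);
      if (loaded) {
        // Row count of the batch just loaded, straight from the Arrow root.
        tracker.addRowsScanned(arrowReader.getVectorSchemaRoot().getRowCount());
      }
      return loaded;
    }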

The two I'd treat as blocking are bytesRead and numFragmentsPruned. The rest is nice-to-have.

@hamersaw
Collaborator

This is awesome! Appreciate the effort here, it will be super useful. I left a comment ^^^ kind of just dumping my Claude session during review (sorry for the AI comment). My takeaways:
(1) I like the idea of formatting the nanosecond times into seconds for readability.
(2) I think we need to add a little more information to the metrics and solidify a naming scheme (ex. num* for counters, *TimeNs for durations, etc.). The actual metric values were from me iterating on this a bit.
(3) Let's update the performance docs with a breakdown of these values + descriptions. I think this could help debug a lot of performance issues.
