Skip to content

fix(target-allocator): sort labels in processTargetGroups to prevent hash collisions#4967

Draft
gracewehner wants to merge 4 commits intoopen-telemetry:mainfrom
gracewehner:fix/ta-sorted-labels
Draft

fix(target-allocator): sort labels in processTargetGroups to prevent hash collisions#4967
gracewehner wants to merge 4 commits intoopen-telemetry:mainfrom
gracewehner:fix/ta-sorted-labels

Conversation

@gracewehner
Copy link
Copy Markdown
Contributor

Description

When processTargetGroups builds labels for each target, it uses a struct-copy of the group ScratchBuilder and appends target labels. The resulting label set has two independently sorted sublists (group labels, then target labels) but is not globally sorted.

labels.Labels.Get() (in the stringlabels build, the default) uses a linear scan with sorted-order early termination: if it encounters a label name whose first byte is greater than the search key, it stops immediately and returns empty. When a group label like vendor (first byte v=118) precedes __address__ (first byte _=95), Get("__address__") sees 118 > 95 and breaks early — returning "".

This causes relabel_configs that read source_labels: [__address__] to produce identical output for all targets in the group, creating hash collisions in the allocator's map[ItemHash]*Item. The last target silently overwrites all others.

Fix

Replace the struct-copy pattern with a merge-sort that interleaves the sorted group labels and sorted target labels into globally sorted output. Uses ScratchBuilder.Overwrite() to reuse internal buffers, achieving ~0% overhead vs the buggy baseline in benchmarks.

Testing

  • Added regression test TestSortedLabelsBlackboxRelabeling that reproduces the exact customer scenario (blackbox exporter with vendor group label + 2 targets)
  • Added workaround validation tests confirming which customer workarounds are effective
  • All existing tests pass (make precommit clean — fmt, vet, lint, 1955 tests, ensure-update-is-noop)

Customer Impact

This bug silently drops scrape targets when:

  1. static_configs has group-level labels AND multiple targets
  2. relabel_configs reads __address__ (which most configs do)
  3. The group label name sorts alphabetically after __address__ (any label starting with a-z)

Workaround: Remove group labels from static_configs and add them via metric_relabel_configs instead.

Link to tracking issue

Fixes open-telemetry/opentelemetry-operator#XXXX

gracewehner and others added 4 commits April 10, 2026 17:26
… fix silent target loss

processTargetGroups used a ScratchBuilder struct-copy optimization that
produced labels with two independently sorted sublists (group labels +
target labels) but not globally sorted. labels.Labels.Get() uses binary
search assuming globally sorted data. When group labels sort after target
labels (e.g. 'vendor' > '__address__'), Get('__address__') silently returns
empty string, causing relabel rules reading source_labels: [__address__] to
produce empty results.

Both targets then get identical post-relabel labels, creating hash collisions
in the allocator's target map which silently drops one target.

The fix materializes group labels once, then creates a fresh ScratchBuilder
per target and calls Sort() after all labels are added, ensuring globally
sorted Labels that work correctly with binary search.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename groupBuilder to targetBuilder via pointer alias in the per-target
loop to clarify that the builder is being reused for target label merging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hash collisions

ScratchBuilder.Labels() serializes labels in insertion order without
sorting. When group labels (e.g. vendor=nginx) sort after target labels
(e.g. __address__), Labels.Get() hits a sorted-order early-termination
check and returns empty for __address__. This causes relabel rules using
source_labels: [__address__] to produce identical output for all targets
in the group, creating hash collisions that silently drop targets in the
allocator's buildTargetMap.

Fix: replace the struct-copy pattern with a merge-sort that interleaves
sorted group labels and sorted target labels, keeping output sorted.
Uses ScratchBuilder.Overwrite for zero-allocation after the first
iteration (~0.3% benchmark overhead vs buggy baseline).

Add regression test and workaround validation tests confirming:
- The fix produces sorted labels and unique hashes
- The buggy code collapses targets via hash collision
- Workaround: removing group labels avoids the collision
- One-target-per-group does NOT avoid the collision

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant