fix(cluster): complete Issue #905 step 2 correctness for Rand and Rou… by mochengqian · Pull Request #915 · apache/dubbo-go-pixiu

mochengqian · 2026-04-18T08:17:47Z

What this PR does:

This PR delivers Step 1 (tests + benchmarks) and Step 2 (correctness) for Issue #905, and is intended to be the first mergeable slice with a clean boundary.

It:

- fixes Rand to choose only from the healthy endpoint slice and return nil when no healthy endpoint exists
- fixes RoundRobin to return nil when no healthy endpoint exists
- makes the round-robin cursor safe for concurrent picks
- adds deterministic regression coverage for the current Rand / RoundRobin correctness issues
- extends manager-path regression coverage for missing-cluster / unhealthy-endpoint nil behavior
- adds and refines the benchmark suite, including:
  - PickEndpoint serial baseline
  - cleaner PickEndpoint parallel baseline
  - lookup-only benchmark
  - simple LB hot-path-only benchmark
  - healthy endpoint filtering cost
  - mixed CompareAndSetStore workload
  - consistent-hash resolve baseline
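
As a rough illustration of the Rand fix listed above, the following is a minimal sketch rather than the PR's actual code. The `Endpoint` type, its `Unhealthy` field, `healthyEndpoints`, `pickRand`, and the `randIntn` hook are all hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Endpoint is a hypothetical stand-in for the project's endpoint model.
type Endpoint struct {
	ID        string
	Unhealthy bool
}

// randIntn is a package-level hook (an assumption of this sketch) so that
// tests can swap in a deterministic source.
var randIntn = rand.Intn

// healthyEndpoints returns only the endpoints that are not marked unhealthy.
func healthyEndpoints(all []*Endpoint) []*Endpoint {
	healthy := make([]*Endpoint, 0, len(all))
	for _, e := range all {
		if !e.Unhealthy {
			healthy = append(healthy, e)
		}
	}
	return healthy
}

// pickRand chooses among the healthy endpoints only and returns nil when
// there are none. The random bound is the healthy count, not the total count.
func pickRand(all []*Endpoint) *Endpoint {
	healthy := healthyEndpoints(all)
	if len(healthy) == 0 {
		return nil // rand.Intn(0) would panic, so return before calling it
	}
	return healthy[randIntn(len(healthy))]
}

func main() {
	eps := []*Endpoint{{ID: "a", Unhealthy: true}, {ID: "b"}, {ID: "c"}}
	randIntn = func(n int) int { return n - 1 } // deterministic: last healthy endpoint
	fmt.Println(pickRand(eps).ID)
	fmt.Println(pickRand([]*Endpoint{{ID: "x", Unhealthy: true}}) == nil)
}
```

Replacing the random source through a variable keeps regression tests deterministic without changing the production call path.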

This PR does not start the Step 3+ runtime-consistency / snapshot refactor yet.

Which issue(s) this PR fixes:

Fixes #905

Special notes for your reviewer:

Issue #905 step progress:

1. Add tests and benchmarks
2. Fix current correctness issues
3. Tighten runtime consistency
4. Switch cluster lookup to O(1)
5. Introduce healthy endpoint snapshots
6. Optimize simple LB hot path
7. Optimize consistent-hash LB last

- This PR intentionally covers Step 1 (tests + benchmarks) and Step 2 (correctness) only.
- Runtime consistency for UpdateCluster / CompareAndSetStore, O(1) cluster lookup, healthy endpoint snapshots, simple LB hot-path optimization, and consistent-hash optimization will follow in later PRs.
- Detailed benchmark baseline numbers are posted in a separate PR comment so later steps can compare against the same baseline.
- Local validation completed:
  - `go test ./pkg/model ./pkg/server ./pkg/cluster/loadbalancer/rand ./pkg/cluster/loadbalancer/roundrobin -count=1`
  - `go test ./pkg/server ./pkg/cluster/loadbalancer/rand ./pkg/cluster/loadbalancer/roundrobin -race -gcflags=-l -count=1`
  - `go vet ./pkg/server ./pkg/cluster/loadbalancer/... ./pkg/model`
  - `go test ./pkg/server -run '^$' -bench '^BenchmarkCluster' -benchmem -count=5`

Does this PR introduce a user-facing change?:

NONE


mochengqian · 2026-04-18T08:26:18Z

Issue #905 benchmark baseline (Step 1 + Step 2)

Environment:

goos=darwin
goarch=arm64
cpu=Apple M5

Command:

go test ./pkg/server -run '^$' -bench '^BenchmarkCluster' -benchmem -count=5

Each Mean value below is the arithmetic mean across 5 runs. Range shows the min-max span across the same 5 runs.

Raw output was captured locally for reference.

PickEndpoint / Lookup

| Benchmark | Mean ns/op | Range ns/op |
|---|---|---|
| PickEndpoint serial / Rand / `clusters=1` | 16.12 | 15.94-16.29 |
| PickEndpoint serial / Rand / `clusters=32` | 33.09 | 32.98-33.32 |
| PickEndpoint serial / Rand / `clusters=256` | 113.58 | 112.80-115.30 |
| PickEndpoint serial / Rand / `clusters=1024` | 685.30 | 669.10-692.30 |
| PickEndpoint serial / RoundRobin / `clusters=1` | 13.51 | 13.29-13.96 |
| PickEndpoint serial / RoundRobin / `clusters=32` | 29.59 | 29.48-29.82 |
| PickEndpoint serial / RoundRobin / `clusters=256` | 111.00 | 110.20-112.30 |
| PickEndpoint serial / RoundRobin / `clusters=1024` | 767.10 | 675.90-1054.00 |
| Lookup serial / `clusters=1` | 3.33 | 3.32-3.35 |
| Lookup serial / `clusters=32` | 17.82 | 17.77-17.85 |
| Lookup serial / `clusters=256` | 111.68 | 111.50-112.20 |
| Lookup serial / `clusters=1024` | 744.18 | 738.00-749.60 |
| PickEndpoint parallel / Rand | 100.42 | 98.42-101.30 |
| PickEndpoint parallel / RoundRobin | 107.66 | 103.10-110.10 |
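
The lookup-only numbers grow with the cluster count, which is what a scan over a list of clusters would produce. This comment does not state how the current lookup is implemented, so the sketch below simply assumes a linear scan by name and contrasts it with a map index of the kind Step 4 (O(1) lookup) points toward. `Cluster`, `findLinear`, and `buildIndex` are hypothetical names for illustration only.

```go
package main

import "fmt"

// Cluster is a hypothetical stand-in; only the name matters for lookup.
type Cluster struct {
	Name string
}

// findLinear scans the slice until the name matches: O(n) in the cluster count.
func findLinear(clusters []*Cluster, name string) *Cluster {
	for _, c := range clusters {
		if c.Name == name {
			return c
		}
	}
	return nil
}

// buildIndex builds a name-to-cluster map once, so later lookups are O(1) on average.
func buildIndex(clusters []*Cluster) map[string]*Cluster {
	idx := make(map[string]*Cluster, len(clusters))
	for _, c := range clusters {
		idx[c.Name] = c
	}
	return idx
}

func main() {
	clusters := make([]*Cluster, 0, 1024)
	for i := 0; i < 1024; i++ {
		clusters = append(clusters, &Cluster{Name: fmt.Sprintf("cluster-%d", i)})
	}
	idx := buildIndex(clusters)

	name := "cluster-1023" // worst case for the linear scan: the last entry
	fmt.Println(findLinear(clusters, name) == idx[name])
	fmt.Println(idx["missing"] == nil)
}
```

The index has to be rebuilt or updated whenever the cluster set changes, which is one reason such a change fits naturally after the runtime-consistency step.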

LB Hot Path / Healthy Filter

| Benchmark | Mean ns/op | Range ns/op | B/op | allocs/op |
|---|---|---|---|---|
| LB hot path / Rand / `endpoints=4` | 13.72 | 13.51-13.93 | 0 | 0 |
| LB hot path / Rand / `endpoints=64` | 134.96 | 131.40-136.70 | 512 | 1 |
| LB hot path / Rand / `endpoints=512` | 860.96 | 852.00-881.30 | 4864 | 1 |
| LB hot path / RoundRobin / `endpoints=4` | 9.29 | 8.96-9.57 | 0 | 0 |
| LB hot path / RoundRobin / `endpoints=64` | 125.40 | 124.60-128.10 | 512 | 1 |
| LB hot path / RoundRobin / `endpoints=512` | 858.72 | 851.60-868.40 | 4864 | 1 |
| Healthy filter / `endpoints=8, healthy=100%` | 20.02 | 19.92-20.11 | 64 | 1 |
| Healthy filter / `endpoints=8, healthy=50%` | 19.11 | 18.98-19.45 | 64 | 1 |
| Healthy filter / `endpoints=8, healthy=0%` | 16.78 | 16.70-16.91 | 64 | 1 |
| Healthy filter / `endpoints=64, healthy=100%` | 120.60 | 120.30-121.70 | 512 | 1 |
| Healthy filter / `endpoints=64, healthy=50%` | 102.90 | 102.00-104.50 | 512 | 1 |
| Healthy filter / `endpoints=64, healthy=0%` | 87.86 | 87.38-88.61 | 512 | 1 |
| Healthy filter / `endpoints=512, healthy=100%` | 900.28 | 894.20-908.50 | 4864 | 1 |
| Healthy filter / `endpoints=512, healthy=50%` | 744.82 | 730.60-758.70 | 4864 | 1 |
| Healthy filter / `endpoints=512, healthy=0%` | 498.62 | 495.40-501.60 | 4864 | 1 |

CAS / Consistent Hash

| Benchmark | Mean ns/op | Range ns/op | B/op | allocs/op |
|---|---|---|---|---|
| CompareAndSetStore mixed | 586839.40 | 573165.00-615044.00 | 1850084 | 6417 |
| Consistent hash / RingHash | 2644.00 | 2617.00-2690.00 | 2212 | 128 |
| Consistent hash / Maglev | 1944.20 | 1921.00-1964.00 | 1828 | 99 |

Notes

The split benchmarks are intended to make later step-by-step attribution clearer:
- Step 4 should primarily move the lookup-only numbers.
- Step 5/6 should primarily move the healthy-filter and simple LB hot-path numbers.
- Step 7 should primarily move the consistent-hash numbers.
PickEndpoint serial / RoundRobin / clusters=1024 showed one isolated high value in the count=5 suite run. A dedicated rerun with:
```
go test ./pkg/server -run '^$' -bench '^BenchmarkClusterPickEndpointSerial/RoundRobin/clusters=1024$' -benchmem -count=10
```
was stable at:
- mean: 687.15 ns/op
- median: 685.75 ns/op
- range: 673.60-712.40 ns/op
I am treating the earlier 1054.00 ns/op sample as an outlier rather than a representative signal.
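
For anyone reproducing the mean / median / range figures from raw `ns/op` samples, a small helper along these lines is enough. This is a sketch: `summarize` is a hypothetical name, and the sample values in `main` are made up for illustration and are not taken from the benchmark runs in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// summary holds the statistics quoted per benchmark.
type summary struct {
	Mean, Median, Min, Max float64
}

// summarize computes mean, median, and min/max over the samples.
// The boolean is false when there are no samples.
func summarize(samples []float64) (summary, bool) {
	if len(samples) == 0 {
		return summary{}, false
	}
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	n := len(sorted)
	median := sorted[n/2]
	if n%2 == 0 {
		median = (sorted[n/2-1] + sorted[n/2]) / 2
	}
	return summary{Mean: sum / float64(n), Median: median, Min: sorted[0], Max: sorted[n-1]}, true
}

func main() {
	// Illustrative values only: four similar samples and one high outlier.
	samples := []float64{10, 12, 11, 50, 13}
	s, _ := summarize(samples)
	fmt.Printf("mean=%.2f median=%.2f range=%.2f-%.2f\n", s.Mean, s.Median, s.Min, s.Max)
}
```

With a single high sample the mean moves noticeably while the median stays close to the bulk of the runs, which is why a rerun with more samples helps decide whether a value is an outlier.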

Copilot

Pull request overview

Implements Issue #905 Step 1 (tests/benchmarks) and Step 2 (correctness) for cluster endpoint picking, focusing on fixing Rand / RoundRobin behavior with unhealthy endpoints and improving concurrency safety for round-robin picks.

Changes:

- Fix Rand and RoundRobin to pick only from healthy endpoints and return nil when none are healthy.
- Make round-robin cursor updates concurrency-safe via atomic operations.
- Add regression tests and a benchmark suite covering pick-path behavior and related hot paths.
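
The round-robin changes summarized above can be pictured with the following minimal sketch. It is written under assumptions and is not the PR's code: `Endpoint`, `Cluster`, `pickRoundRobin`, and `countConcurrentPicks` are hypothetical names, and only the `PrePickEndpointIndex` field name and its `uint32` type are taken from this PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Endpoint and Cluster are hypothetical stand-ins for the project's models.
type Endpoint struct {
	ID        string
	Unhealthy bool
}

type Cluster struct {
	Endpoints            []*Endpoint
	PrePickEndpointIndex uint32 // cursor, only touched through sync/atomic
}

// pickRoundRobin returns nil when no endpoint is healthy; otherwise it
// advances the shared cursor atomically and indexes into the healthy slice.
func pickRoundRobin(c *Cluster) *Endpoint {
	healthy := make([]*Endpoint, 0, len(c.Endpoints))
	for _, e := range c.Endpoints {
		if !e.Unhealthy {
			healthy = append(healthy, e)
		}
	}
	if len(healthy) == 0 {
		return nil
	}
	// AddUint32 returns the incremented value, so subtract 1 for a zero-based index.
	index := atomic.AddUint32(&c.PrePickEndpointIndex, 1) - 1
	return healthy[int(index%uint32(len(healthy)))]
}

// countConcurrentPicks runs picks from several goroutines and tallies results.
func countConcurrentPicks(c *Cluster, workers, picksEach int) map[string]int {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		counts = make(map[string]int)
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < picksEach; i++ {
				if e := pickRoundRobin(c); e != nil {
					mu.Lock()
					counts[e.ID]++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return counts
}

func main() {
	c := &Cluster{Endpoints: []*Endpoint{{ID: "a"}, {ID: "b"}, {ID: "c"}, {ID: "d"}}}
	fmt.Println(countConcurrentPicks(c, 8, 1000))
}
```

Because every pick takes a distinct cursor value, the picks spread evenly over a fixed healthy set even under contention, as long as the total number of picks is a multiple of the healthy count.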

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


| File | Description |
|---|---|
| pkg/server/cluster_manager_test.go | Adds/extends manager-path regression tests (missing cluster, unhealthy endpoints) and a concurrent pick race-style test. |
| pkg/server/cluster_manager_bench_test.go | Introduces a benchmark suite for `PickEndpoint`, lookup cost, LB hot path, health filtering cost, CAS workload, and consistent-hash resolve. |
| pkg/model/cluster_test.go | Adds unit tests for `GetEndpoint` health filtering, consistent-hash initialization, and `Endpoint.GetHost`. |
| pkg/model/cluster.go | Changes `PrePickEndpointIndex` type to `uint32` to support atomic increments. |
| pkg/cluster/loadbalancer/roundrobin/round_robin_test.go | Adds deterministic tests for all-unhealthy behavior and ordering/cursor handling. |
| pkg/cluster/loadbalancer/roundrobin/round_robin.go | Updates round-robin handler to return `nil` when no healthy endpoints and uses atomic cursor increments. |
| pkg/cluster/loadbalancer/rand/load_balancer_rand_test.go | Adds deterministic tests for `Rand` correctness and all-unhealthy behavior without panics. |
| pkg/cluster/loadbalancer/rand/load_balancer_rand.go | Fixes `Rand` to use healthy slice length, return `nil` on empty, and adds a test hook for deterministic randomness. |


codecov-commenter · 2026-04-19T03:00:35Z

Codecov Report

❌ Patch coverage is 80.95238% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 22.36%. Comparing base (e6be678) to head (7350a07).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/server/cluster_manager.go | 69.23% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #915      +/-   ##
===========================================
+ Coverage    22.05%   22.36%   +0.30%     
===========================================
  Files          270      270              
  Lines        20069    20083      +14     
===========================================
+ Hits          4426     4491      +65     
+ Misses       15193    15140      -53     
- Partials       450      452       +2

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `22.36% <80.95%> (+0.30%)` | ⬆️ |


Alanxtl

cool work

Alanxtl

lgtm

Similarityoung · 2026-04-26T10:49:33Z

Code review

Found 1 issue:

Data race: non-atomic read/write of PrePickEndpointIndex in carryOverRuntimeStateFrom

round_robin.go accesses PrePickEndpointIndex exclusively via atomic.AddUint32 (line 41), but the new carryOverRuntimeStateFrom function copies the field with a plain assignment. Go's memory model treats mixing atomic and non-atomic accesses to the same variable as a data race — go test -race will flag this.

Fix: use atomic.LoadUint32 / atomic.StoreUint32:

atomic.StoreUint32(&clusterConfig.PrePickEndpointIndex, atomic.LoadUint32(&oldConfig.PrePickEndpointIndex))

dubbo-go-pixiu/pkg/server/cluster_manager.go, lines 407 to 412 in 7350a07:

            }
            if oldConfig := oldConfigsByName[clusterConfig.Name]; oldConfig != nil {
                clusterConfig.PrePickEndpointIndex = oldConfig.PrePickEndpointIndex
            }
        }
    }

For reference, the atomic write path is in dubbo-go-pixiu/pkg/cluster/loadbalancer/roundrobin/round_robin.go, lines 39 to 43 in 7350a07:

        }
        // AddUint32 returns the incremented value, so subtract 1 for a zero-based index.
        index := atomic.AddUint32(&c.PrePickEndpointIndex, 1) - 1
        return endpoints[int(index%uint32(len(endpoints)))]
    }

🤖 Generated with Claude Code


sonarqubecloud · 2026-04-26T14:48:26Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

mochengqian · 2026-04-26T14:58:09Z


Good catch! I had not considered this at the level of the concurrent memory model. The issue is now fixed.

Commit ec8f94a: fix(cluster): complete Issue apache#905 step 2 correctness for Rand and RoundRobin

AlexStocks requested a review from Copilot, April 19, 2026 02:07

Copilot started reviewing on behalf of AlexStocks, April 19, 2026 02:07

Copilot AI reviewed, Apr 19, 2026 (comment threads: pkg/model/cluster.go, outdated; pkg/server/cluster_manager_bench_test.go)

mochengqian added 2 commits, April 19, 2026 10:31

Commit 4b4e048: fix(cluster): address review feedback for issue 905

Commit 3b89c87: fix(test): address sonar warnings for blank imports

Alanxtl reviewed, Apr 24, 2026 (comment thread: pkg/model/cluster.go)

Commit 7350a07: fix(cluster): preserve round-robin cursor across store refresh

Alanxtl approved these changes, Apr 25, 2026

Commit 3493243: fix(cluster): use atomic cursor carry-over

mochengqian wants to merge 5 commits into apache:develop from mochengqian:feat/issue-905

5 participants
