br: reduce crr advancer etcd retry backoff#69047
Conversation
Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>
|
@Leavrth I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
Hi @Leavrth. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThe PR adds proactive etcd watch progress management and refactors etcd client configuration. The operator consolidates gRPC backoff and keepalive settings into helper functions and a builder pattern, updating build dependencies and adding configuration tests. In the metadata client, checkpoint watches now use leader-required contexts, periodic progress requests, and idle timeout detection to prevent stalled watches, with configurable timeouts and integration tests validating timeout behavior. ChangesEtcd gRPC Configuration Centralization
Proactive Etcd Watch Progress Management
Sequence Diagram(s)sequenceDiagram
participant Client as WaitGlobalCheckpointAdvance
participant WatchSetup as waitCheckpointEvent
participant ProgressTicker as Progress Ticker
participant IdleTicker as Idle Ticker
participant EtcdWatch as etcd Watcher
Client->>WatchSetup: call with context
WatchSetup->>WatchSetup: create leader-required context
WatchSetup->>EtcdWatch: Watch(ctx, WithProgressNotify)
WatchSetup->>ProgressTicker: initialize ticker
WatchSetup->>IdleTicker: initialize idle timer
loop Watch Processing
par Progress Request Path
ProgressTicker->>EtcdWatch: RequestProgress()
and Watch Response Path
EtcdWatch->>WatchSetup: send response
WatchSetup->>IdleTicker: reset idle timer
and Idle Timeout Path
IdleTicker->>WatchSetup: timeout fires
WatchSetup->>Client: return watchIdleTimeoutError
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Tools execution failed with the following error: Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error) Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
br/pkg/task/operator/crr_checkpoint.go (1)
43-43: ⚡ Quick winDocument why the backoff cap is fixed at 3s.
Line 43 encodes a non-obvious retry/perf trade-off; please add a short comment with the stale-leader mitigation rationale so future tuning is safer.
As per coding guidelines, comments SHOULD explain non-obvious intent and important performance trade-offs.Suggested diff
-const etcdGRPCBackOffMaxDelay = 3 * time.Second +// etcdGRPCBackOffMaxDelay keeps reconnect retries short so the client can +// move off a stale PD leader quickly after leader changes. +const etcdGRPCBackOffMaxDelay = 3 * time.Second🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@br/pkg/task/operator/crr_checkpoint.go` at line 43, Add a short explanatory comment above the constant etcdGRPCBackOffMaxDelay (currently set to 3 * time.Second) that documents the non-obvious retry/performance trade-off: state that the 3s cap is chosen to limit client-side gRPC backoff so the BR operator quickly fails over from a stale etcd leader instead of waiting long exponential backoffs, improving recovery latency at the cost of more frequent retries; mention that raising this value increases time-to-recover from leader changes while lowering it may raise request churn, so future tuning should consider stale-leader sensitivity and cluster load.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@br/pkg/task/operator/crr_checkpoint.go`:
- Line 43: Add a short explanatory comment above the constant
etcdGRPCBackOffMaxDelay (currently set to 3 * time.Second) that documents the
non-obvious retry/performance trade-off: state that the 3s cap is chosen to
limit client-side gRPC backoff so the BR operator quickly fails over from a
stale etcd leader instead of waiting long exponential backoffs, improving
recovery latency at the cost of more frequent retries; mention that raising this
value increases time-to-recover from leader changes while lowering it may raise
request churn, so future tuning should consider stale-leader sensitivity and
cluster load.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 13fba38f-90e2-47a0-9166-b8c77cddd710
📒 Files selected for processing (3)
br/pkg/task/operator/BUILD.bazelbr/pkg/task/operator/crr_checkpoint.gobr/pkg/task/operator/crr_checkpoint_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #69047 +/- ##
================================================
- Coverage 76.3213% 75.9252% -0.3961%
================================================
Files 2041 2052 +11
Lines 562689 582736 +20047
================================================
+ Hits 429452 442444 +12992
- Misses 132324 137972 +5648
- Partials 913 2320 +1407
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: RidRisR, YuJuncen The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>
|
/ok-to-test |
What problem does this PR solve?
Issue Number: ref #69048
Problem Summary:
When the pd leader io delay and then pd leader is changed, the etcd client is still retry on the old pd leader.
What changed and how does it work?
reduce etcd retry backoff
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Summary by CodeRabbit
Bug Fixes
Tests