br: reduce crr advancer etcd retry backoff #69047

Merged
ti-chi-bot[bot] merged 2 commits into pingcap:master from Leavrth:reduce_etcd_backoff on Jun 17, 2026

Conversation

@Leavrth (Contributor) commented Jun 9, 2026

What problem does this PR solve?

Issue Number: ref #69048

Problem Summary:
When the PD leader suffers an I/O delay and the PD leadership then changes, the etcd client keeps retrying against the old PD leader.

What changed and how does it work?

Reduce the etcd retry backoff so that the client moves off the old PD leader sooner.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability of etcd checkpoint watching by proactively requesting watch progress and adding an idle-timeout guard when no progress is observed.
    • Enhanced etcd gRPC dialing with centralized configuration, including backoff and keepalive behavior tweaks.
  • Tests

    • Added unit tests covering the constructed etcd client configuration.
    • Added an integration test that validates timeout behavior during global checkpoint watch-progress handling.

Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>
@pantheon-ai

pantheon-ai Bot commented Jun 9, 2026

@Leavrth I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/M label (a PR that changes 30-99 lines, ignoring generated files) and removed the do-not-merge/needs-tests-checked label Jun 9, 2026
@tiprow

tiprow Bot commented Jun 9, 2026

Hi @Leavrth. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 290c24fe-4d7f-4410-bc70-a6a98c7863d7

📥 Commits

Reviewing files that changed from the base of the PR and between 97fa9a6 and 62e75b9.

📒 Files selected for processing (3)
  • br/pkg/streamhelper/advancer_cliext.go
  • br/pkg/streamhelper/export_test.go
  • br/pkg/streamhelper/integration_test.go

📝 Walkthrough

The PR adds proactive etcd watch progress management and refactors etcd client configuration. The operator consolidates gRPC backoff and keepalive settings into helper functions and a builder pattern, updating build dependencies and adding configuration tests. In the metadata client, checkpoint watches now use leader-required contexts, periodic progress requests, and idle timeout detection to prevent stalled watches, with configurable timeouts and integration tests validating timeout behavior.

Changes

Etcd gRPC Configuration Centralization

| Layer / File(s) | Summary |
| --- | --- |
| Build dependencies and test configuration (br/pkg/task/operator/BUILD.bazel) | Operator library now depends on @org_golang_google_grpc//backoff, test target shard count increases from 3 to 4, and operator test dependencies include //br/pkg/task. |
| Etcd client configuration implementation (br/pkg/task/operator/crr_checkpoint.go) | gRPC backoff import and etcdGRPCBackOffMaxDelay constant are added. Helper functions etcdGRPCBackoffConfig and etcdKeepaliveParams compute connection parameters. New newEtcdClientConfig function constructs clientv3.Config with gRPC dial options for backoff and keepalive, and dialEtcdWithCfg is refactored to call this builder. Keepalive now uses PermitWithoutStream: true and TLS errors are traced. |
| Configuration behavior tests (br/pkg/task/operator/crr_checkpoint_test.go) | Imports for time and task packages are added. New TestNewEtcdClientConfig verifies gRPC backoff max delay, keepalive time/timeout/permit values, and resulting config attributes including endpoints, auto-sync interval, dial timeout, and dial option count. |

Proactive Etcd Watch Progress Management

| Layer / File(s) | Summary |
| --- | --- |
| Watch progress helpers and configuration (br/pkg/streamhelper/advancer_cliext.go) | Adds time import and package-level metadataWatchProgressInterval and metadataWatchIdleTimeout constants. Introduces helper functions to reset idle timer, request watch progress (with optional failpoint skip), and generate standardized idle timeout errors for stalled watches. |
| Watch progress integration in waitCheckpointEvent (br/pkg/streamhelper/advancer_cliext.go) | waitCheckpointEvent is reworked to create leader-required watch context with progress notifications enabled, initialize progress and idle timers, reset idle timer upon receiving watch responses, and add select cases to periodically request watch progress and return timeout errors when no progress is observed within the configured idle interval. |
| Watch progress testing (br/pkg/streamhelper/export_test.go, br/pkg/streamhelper/integration_test.go) | SetMetadataWatchProgressForTest helper allows temporary timeout override for testing. Integration test imports time and adds TestCheckpointWatchProgressTimeout subtest that verifies WaitGlobalCheckpointAdvance returns timeout errors containing expected message when watch-progress request is skipped via failpoint. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as WaitGlobalCheckpointAdvance
    participant WatchSetup as waitCheckpointEvent
    participant ProgressTicker as Progress Ticker
    participant IdleTicker as Idle Ticker
    participant EtcdWatch as etcd Watcher
    
    Client->>WatchSetup: call with context
    WatchSetup->>WatchSetup: create leader-required context
    WatchSetup->>EtcdWatch: Watch(ctx, WithProgressNotify)
    WatchSetup->>ProgressTicker: initialize ticker
    WatchSetup->>IdleTicker: initialize idle timer
    
    loop Watch Processing
        par Progress Request Path
            ProgressTicker->>EtcdWatch: RequestProgress()
        and Watch Response Path
            EtcdWatch->>WatchSetup: send response
            WatchSetup->>IdleTicker: reset idle timer
        and Idle Timeout Path
            IdleTicker->>WatchSetup: timeout fires
            WatchSetup->>Client: return watchIdleTimeoutError
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

size/L, ok-to-test

Suggested reviewers

  • RidRisR
  • YuJuncen

Poem

A rabbit refactors with care and with grace,
Backoff and keepalive find their right place,
Then adds a keen watch that won't fall asleep,
With progress requests both steady and deep,
Now timeouts are caught and the data runs true! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective: reducing etcd retry backoff for the CRR advancer, which is achieved through the watch timeout mechanism and gRPC backoff configuration changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
br/pkg/task/operator/crr_checkpoint.go (1)

43-43: ⚡ Quick win

Document why the backoff cap is fixed at 3s.

Line 43 encodes a non-obvious retry/perf trade-off; please add a short comment with the stale-leader mitigation rationale so future tuning is safer.

Suggested diff
-const etcdGRPCBackOffMaxDelay = 3 * time.Second
+// etcdGRPCBackOffMaxDelay keeps reconnect retries short so the client can
+// move off a stale PD leader quickly after leader changes.
+const etcdGRPCBackOffMaxDelay = 3 * time.Second
As per coding guidelines, comments SHOULD explain non-obvious intent and important performance trade-offs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@br/pkg/task/operator/crr_checkpoint.go` at line 43, Add a short explanatory
comment above the constant etcdGRPCBackOffMaxDelay (currently set to 3 *
time.Second) that documents the non-obvious retry/performance trade-off: state
that the 3s cap is chosen to limit client-side gRPC backoff so the BR operator
quickly fails over from a stale etcd leader instead of waiting long exponential
backoffs, improving recovery latency at the cost of more frequent retries;
mention that raising this value increases time-to-recover from leader changes
while lowering it may raise request churn, so future tuning should consider
stale-leader sensitivity and cluster load.

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 13fba38f-90e2-47a0-9166-b8c77cddd710

📥 Commits

Reviewing files that changed from the base of the PR and between a0e180e and 97fa9a6.

📒 Files selected for processing (3)
  • br/pkg/task/operator/BUILD.bazel
  • br/pkg/task/operator/crr_checkpoint.go
  • br/pkg/task/operator/crr_checkpoint_test.go

@codecov

codecov Bot commented Jun 9, 2026

Codecov Report

❌ Patch coverage is 0% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.9252%. Comparing base (d568a85) to head (62e75b9).
⚠️ Report is 49 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #69047        +/-   ##
================================================
- Coverage   76.3213%   75.9252%   -0.3961%     
================================================
  Files          2041       2052        +11     
  Lines        562689     582736     +20047     
================================================
+ Hits         429452     442444     +12992     
- Misses       132324     137972      +5648     
- Partials        913       2320      +1407     
| Flag | Coverage Δ | |
| --- | --- | --- |
| integration | 44.8228% <0.0000%> (+5.0940%) | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ | |
| --- | --- | --- |
| dumpling | 60.4610% <ø> (ø) | |
| parser | ∅ <ø> (∅) | |
| br | 64.7928% <0.0000%> (+2.0023%) | ⬆️ |

@ti-chi-bot ti-chi-bot Bot added the approved and needs-1-more-lgtm (a PR that needs 1 more LGTM) labels Jun 16, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RidRisR, YuJuncen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the lgtm label and removed the needs-1-more-lgtm label Jun 16, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 16, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-06-16 03:04:11.606089855 +0000 UTC m=+1447552.676407235: ☑️ agreed by YuJuncen.
  • 2026-06-16 08:02:49.139067924 +0000 UTC m=+1465470.209385304: ☑️ agreed by RidRisR.

Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>
@ti-chi-bot ti-chi-bot Bot added the size/L label (a PR that changes 100-499 lines, ignoring generated files) and removed the size/M label Jun 17, 2026
@hawkingrei (Contributor)

/ok-to-test

@ti-chi-bot ti-chi-bot Bot added the ok-to-test label (a PR that is ready to be tested) Jun 17, 2026
@ti-chi-bot ti-chi-bot Bot merged commit 234eee9 into pingcap:master Jun 17, 2026
43 checks passed

Labels

approved, lgtm, ok-to-test, release-note-none, size/L

4 participants