br: reduce crr advancer etcd retry backoff by Leavrth · Pull Request #69047 · pingcap/tidb

Leavrth · 2026-06-09T05:43:02Z

What problem does this PR solve?

Issue Number: ref #69048

Problem Summary:
When the PD leader suffers I/O delay and the PD leader then changes, the etcd client keeps retrying against the old PD leader.

What changed and how does it work?

Reduce the etcd retry backoff.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

Bug Fixes
- Improved reliability of etcd checkpoint watching by proactively requesting watch progress and adding an idle-timeout guard when no progress is observed.
- Enhanced etcd gRPC dialing with centralized configuration, including backoff and keepalive behavior tweaks.
Tests
- Added unit tests covering the constructed etcd client configuration.
- Added an integration test that validates timeout behavior during global checkpoint watch-progress handling.

Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>

pantheon-ai · 2026-06-09T05:43:07Z

@Leavrth I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

tiprow · 2026-06-09T05:43:23Z

Hi @Leavrth. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-06-09T05:43:24Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 290c24fe-4d7f-4410-bc70-a6a98c7863d7

📥 Commits

Reviewing files that changed from the base of the PR and between 97fa9a6 and 62e75b9.

📒 Files selected for processing (3)

br/pkg/streamhelper/advancer_cliext.go
br/pkg/streamhelper/export_test.go
br/pkg/streamhelper/integration_test.go

📝 Walkthrough


The PR adds proactive etcd watch progress management and refactors etcd client configuration. The operator consolidates gRPC backoff and keepalive settings into helper functions and a builder pattern, updating build dependencies and adding configuration tests. In the metadata client, checkpoint watches now use leader-required contexts, periodic progress requests, and idle timeout detection to prevent stalled watches, with configurable timeouts and integration tests validating timeout behavior.

Changes

Etcd gRPC Configuration Centralization

| Layer / File(s) | Summary |
| --- | --- |
| Build dependencies and test configuration `br/pkg/task/operator/BUILD.bazel` | Operator library now depends on `@org_golang_google_grpc//backoff`, test target shard count increases from 3 to 4, and operator test dependencies include `//br/pkg/task`. |
| Etcd client configuration implementation `br/pkg/task/operator/crr_checkpoint.go` | gRPC backoff import and `etcdGRPCBackOffMaxDelay` constant are added. Helper functions `etcdGRPCBackoffConfig` and `etcdKeepaliveParams` compute connection parameters. New `newEtcdClientConfig` function constructs `clientv3.Config` with gRPC dial options for backoff and keepalive, and `dialEtcdWithCfg` is refactored to call this builder. Keepalive now uses `PermitWithoutStream: true` and TLS errors are traced. |
| Configuration behavior tests `br/pkg/task/operator/crr_checkpoint_test.go` | Imports for `time` and `task` packages are added. New `TestNewEtcdClientConfig` verifies gRPC backoff max delay, keepalive time/timeout/permit values, and resulting config attributes including endpoints, auto-sync interval, dial timeout, and dial option count. |

Proactive Etcd Watch Progress Management

| Layer / File(s) | Summary |
| --- | --- |
| Watch progress helpers and configuration `br/pkg/streamhelper/advancer_cliext.go` | Adds `time` import and package-level `metadataWatchProgressInterval` and `metadataWatchIdleTimeout` constants. Introduces helper functions to reset idle timer, request watch progress (with optional failpoint skip), and generate standardized idle timeout errors for stalled watches. |
| Watch progress integration in waitCheckpointEvent `br/pkg/streamhelper/advancer_cliext.go` | `waitCheckpointEvent` is reworked to create leader-required watch context with progress notifications enabled, initialize progress and idle timers, reset idle timer upon receiving watch responses, and add select cases to periodically request watch progress and return timeout errors when no progress is observed within the configured idle interval. |
| Watch progress testing `br/pkg/streamhelper/export_test.go`, `br/pkg/streamhelper/integration_test.go` | `SetMetadataWatchProgressForTest` helper allows temporary timeout override for testing. Integration test imports `time` and adds `TestCheckpointWatchProgressTimeout` subtest that verifies `WaitGlobalCheckpointAdvance` returns timeout errors containing expected message when watch-progress request is skipped via failpoint. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as WaitGlobalCheckpointAdvance
    participant WatchSetup as waitCheckpointEvent
    participant ProgressTicker as Progress Ticker
    participant IdleTicker as Idle Ticker
    participant EtcdWatch as etcd Watcher
    
    Client->>WatchSetup: call with context
    WatchSetup->>WatchSetup: create leader-required context
    WatchSetup->>EtcdWatch: Watch(ctx, WithProgressNotify)
    WatchSetup->>ProgressTicker: initialize ticker
    WatchSetup->>IdleTicker: initialize idle timer
    
    loop Watch Processing
        par Progress Request Path
            ProgressTicker->>EtcdWatch: RequestProgress()
        and Watch Response Path
            EtcdWatch->>WatchSetup: send response
            WatchSetup->>IdleTicker: reset idle timer
        and Idle Timeout Path
            IdleTicker->>WatchSetup: timeout fires
            WatchSetup->>Client: return watchIdleTimeoutError
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested labels

size/L, ok-to-test

Suggested reviewers

RidRisR
YuJuncen

Poem

A rabbit refactors with care and with grace,
Backoff and keepalive find their right place,
Then adds a keen watch that won't fall asleep,
With progress requests both steady and deep,
Now timeouts are caught and the data runs true! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective: reducing etcd retry backoff for the CRR advancer, which is achieved through the watch timeout mechanism and gRPC backoff configuration changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

🧹 Nitpick comments (1)

br/pkg/task/operator/crr_checkpoint.go (1)

43-43: ⚡ Quick win

Document why the backoff cap is fixed at 3s.

Line 43 encodes a non-obvious retry/perf trade-off; please add a short comment with the stale-leader mitigation rationale so future tuning is safer.

Suggested diff

-const etcdGRPCBackOffMaxDelay = 3 * time.Second
+// etcdGRPCBackOffMaxDelay keeps reconnect retries short so the client can
+// move off a stale PD leader quickly after leader changes.
+const etcdGRPCBackOffMaxDelay = 3 * time.Second

As per coding guidelines, comments SHOULD explain non-obvious intent and important performance trade-offs.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@br/pkg/task/operator/crr_checkpoint.go` at line 43, Add a short explanatory
comment above the constant etcdGRPCBackOffMaxDelay (currently set to 3 *
time.Second) that documents the non-obvious retry/performance trade-off: state
that the 3s cap is chosen to limit client-side gRPC backoff so the BR operator
quickly fails over from a stale etcd leader instead of waiting long exponential
backoffs, improving recovery latency at the cost of more frequent retries;
mention that raising this value increases time-to-recover from leader changes
while lowering it may raise request churn, so future tuning should consider
stale-leader sensitivity and cluster load.

Source: Coding guidelines


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 13fba38f-90e2-47a0-9166-b8c77cddd710

📥 Commits

Reviewing files that changed from the base of the PR and between a0e180e and 97fa9a6.

📒 Files selected for processing (3)

br/pkg/task/operator/BUILD.bazel
br/pkg/task/operator/crr_checkpoint.go
br/pkg/task/operator/crr_checkpoint_test.go

codecov · 2026-06-09T05:51:06Z

Codecov Report

❌ Patch coverage is 0% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.9252%. Comparing base (d568a85) to head (62e75b9).
⚠️ Report is 49 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #69047        +/-   ##
================================================
- Coverage   76.3213%   75.9252%   -0.3961%     
================================================
  Files          2041       2052        +11     
  Lines        562689     582736     +20047     
================================================
+ Hits         429452     442444     +12992     
- Misses       132324     137972      +5648     
- Partials        913       2320      +1407

| Flag | Coverage Δ |
| --- | --- |
| integration | `44.8228% <0.0000%> (+5.0940%)` ⬆️ |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
| --- | --- |
| dumpling | `60.4610% <ø> (ø)` |
| parser | `∅ <ø> (∅)` |
| br | `64.7928% <0.0000%> (+2.0023%)` ⬆️ |


ti-chi-bot · 2026-06-16T08:02:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RidRisR, YuJuncen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~br/OWNERS~~ [YuJuncen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-06-16T08:02:50Z

[LGTM Timeline notifier]

Timeline:

2026-06-16 03:04:11.606089855 +0000 UTC: ☑️ agreed by YuJuncen.
2026-06-16 08:02:49.139067924 +0000 UTC: ☑️ agreed by RidRisR.

Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>

hawkingrei · 2026-06-17T08:11:26Z

/ok-to-test

reduce etcd backoff

97fa9a6

Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>

ti-chi-bot Bot added the do-not-merge/needs-linked-issue, do-not-merge/needs-tests-checked, and release-note-none (denotes a PR that doesn't merit a release note) labels Jun 9, 2026

ti-chi-bot Bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the do-not-merge/needs-tests-checked label Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

seiya-annie mentioned this pull request Jun 9, 2026

[crrf]inject io hang to the pd leader, the upstream checkpoint continued to move, but the downstream checkpoint stopped for about 18 minutes #69048

Open

ti-chi-bot Bot added do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue labels Jun 9, 2026

YuJuncen approved these changes Jun 16, 2026

ti-chi-bot Bot added the approved and needs-1-more-lgtm (indicates a PR needs 1 more LGTM) labels Jun 16, 2026

RidRisR approved these changes Jun 16, 2026

ti-chi-bot Bot added the lgtm label and removed the needs-1-more-lgtm (indicates a PR needs 1 more LGTM) label Jun 16, 2026

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Jun 16, 2026

add watch timeout

62e75b9

Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>

ti-chi-bot Bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) Jun 17, 2026

ti-chi-bot Bot added the ok-to-test label (indicates a PR is ready to be tested) Jun 17, 2026

ti-chi-bot Bot merged commit 234eee9 into pingcap:master Jun 17, 2026
43 checks passed

ti-chi-bot[bot] merged 2 commits into pingcap:master from Leavrth:reduce_etcd_backoff


4 participants
