
fix: stop memory leak from orphaned CR reflector goroutines on repeated CRD discovery #2920

Open
bhope wants to merge 1 commit into kubernetes:main from bhope:fix-mem-leak
Conversation

@bhope
Member

@bhope bhope commented Apr 10, 2026

Fixes elevated, unbounded memory growth introduced in v2.18.0 when custom resource state config is in use.

Root Causes

  1. AppendToMap overwrites stop channels and appends duplicate kinds on every call (internal/discovery/types.go). Since PollForCacheUpdates calls it for every known GVK each cycle, old stop channels were silently replaced, orphaning any reflector goroutine blocking on them.
  2. CR reflectors ignore context cancellation (internal/store/builder.go). Unlike standard reflectors started with reflector.Run(b.ctx.Done()), custom resource reflectors were started with only their GVK-specific stop channel - no context cancellation path at all.
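
To make root cause 1 concrete, here is a minimal sketch of how replacing a stop channel strands a goroutine. It does not use the kube-state-metrics code: `buggyEnsureStop`, `orphanedAfterReplace`, and the `example.com/v1/Foo` key are hypothetical names, and the goroutine only stands in for a reflector.

```go
package main

import (
	"fmt"
	"time"
)

// buggyEnsureStop is a hypothetical model of the pre-fix behaviour: every
// call installs a fresh stop channel, discarding whatever was stored before.
func buggyEnsureStop(stops map[string]chan struct{}, key string) {
	stops[key] = make(chan struct{})
}

// orphanedAfterReplace reports whether a goroutine that captured the first
// stop channel is still blocked after the map entry has been replaced and
// the replacement channel has been closed.
func orphanedAfterReplace() bool {
	stops := map[string]chan struct{}{}
	const key = "example.com/v1/Foo" // hypothetical GVK key

	buggyEnsureStop(stops, key)
	captured := stops[key]
	exited := make(chan struct{})
	go func() {
		<-captured // stand-in for a reflector blocking on its stop channel
		close(exited)
	}()

	buggyEnsureStop(stops, key) // next poll cycle replaces the channel
	close(stops[key])           // "CRD deleted": only the replacement is closed

	select {
	case <-exited:
		return false
	case <-time.After(100 * time.Millisecond):
		return true // the goroutine never received a stop signal
	}
}

func main() {
	fmt.Println("goroutine orphaned:", orphanedAfterReplace()) // prints "goroutine orphaned: true"
}
```

Nothing holds a reference to the first channel after the replacement, so no later code path can close it.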

Fix

  • AppendToMap: skip the append if the kind already exists; skip make(chan struct{}) if a channel already exists for the GVK.
  • startReflector: wrap the GVK stop channel with a bridge goroutine that also selects on b.ctx.Done(), so CR reflectors stop on both CRD deletion and context cancellation.
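
The two bullets above could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: the `registry` type, `appendOnce`, `bridge`, and the helper functions are hypothetical names, and the real `AppendToMap` and `startReflector` have different signatures.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// registry is a hypothetical, simplified stand-in for the discoverer's maps.
type registry struct {
	kinds map[string][]string      // group/version -> kinds
	stops map[string]chan struct{} // "group/version/kind" -> stop channel
}

func newRegistry() *registry {
	return &registry{kinds: map[string][]string{}, stops: map[string]chan struct{}{}}
}

// appendOnce adds the kind only if it is missing and creates a stop channel
// only if none exists yet, so repeated calls change nothing.
func (r *registry) appendOnce(groupVersion, kind string) {
	known := false
	for _, k := range r.kinds[groupVersion] {
		if k == kind {
			known = true
			break
		}
	}
	if !known {
		r.kinds[groupVersion] = append(r.kinds[groupVersion], kind)
	}
	key := groupVersion + "/" + kind
	if _, ok := r.stops[key]; !ok {
		r.stops[key] = make(chan struct{})
	}
}

// bridge returns a channel that is closed as soon as either gvkStop is
// closed or ctx is cancelled.
func bridge(ctx context.Context, gvkStop <-chan struct{}) <-chan struct{} {
	combined := make(chan struct{})
	go func() {
		defer close(combined)
		select {
		case <-gvkStop:
		case <-ctx.Done():
		}
	}()
	return combined
}

// closedWithin reports whether ch is closed within d.
func closedWithin(ch <-chan struct{}, d time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(d):
		return false
	}
}

// stopsOnCancel checks the context path while the GVK channel stays open.
func stopsOnCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	combined := bridge(ctx, make(chan struct{}))
	cancel()
	return closedWithin(combined, time.Second)
}

// stopsOnGVKClose checks the CRD-deletion path with a context that never ends.
func stopsOnGVKClose() bool {
	gvkStop := make(chan struct{})
	combined := bridge(context.Background(), gvkStop)
	close(gvkStop)
	return closedWithin(combined, time.Second)
}

func main() {
	r := newRegistry()
	r.appendOnce("example.com/v1", "Foo")
	first := r.stops["example.com/v1/Foo"]
	r.appendOnce("example.com/v1", "Foo")
	fmt.Println("kinds:", len(r.kinds["example.com/v1"]))
	fmt.Println("channel kept:", first == r.stops["example.com/v1/Foo"])
	fmt.Println("stops on cancel:", stopsOnCancel())
	fmt.Println("stops on GVK close:", stopsOnGVKClose())
}
```

One trade-off of this pattern is that the bridge goroutine itself lives until one of the two signals fires, so it relies on the context eventually being cancelled.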

Also, added tests to cover idempotency and cleanup in the discovery package - verifying no duplicate kinds or channel replacement on repeated AppendToMap calls, and that RemoveFromMap closes channels so reflectors stop cleanly.
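
As a rough picture of the cleanup property those tests check, the sketch below removes a kind, closes its stop channel, and tolerates a second removal. The `registry` type, `remove`, and `isClosed` are hypothetical; whether the real `RemoveFromMap` guards against a repeated call in the same way is an assumption here.

```go
package main

import "fmt"

// registry is a hypothetical, simplified stand-in for the discoverer's maps.
type registry struct {
	kinds map[string][]string      // group/version -> kinds
	stops map[string]chan struct{} // "group/version/kind" -> stop channel
}

// remove closes and forgets the stop channel for the kind, then drops the
// kind from its group/version entry. Deleting the map entry before any later
// call is what prevents a double close.
func (r *registry) remove(groupVersion, kind string) {
	key := groupVersion + "/" + kind
	if ch, ok := r.stops[key]; ok {
		close(ch)
		delete(r.stops, key)
	}
	var kept []string
	for _, k := range r.kinds[groupVersion] {
		if k != kind {
			kept = append(kept, k)
		}
	}
	if len(kept) == 0 {
		delete(r.kinds, groupVersion)
		return
	}
	r.kinds[groupVersion] = kept
}

// isClosed reports, without blocking, whether ch has been closed.
func isClosed(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan struct{})
	r := &registry{
		kinds: map[string][]string{"example.com/v1": {"Foo", "Bar"}},
		stops: map[string]chan struct{}{"example.com/v1/Foo": ch},
	}
	r.remove("example.com/v1", "Foo")
	r.remove("example.com/v1", "Foo") // second call is a no-op, not a double close
	fmt.Println(isClosed(ch), r.kinds["example.com/v1"]) // prints "true [Bar]"
}
```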

Test Results:

TestMemoryLeakSimulation - 5 GVKs × 500 poll cycles

| Metric | Buggy (pre-fix) | Fixed (post-fix) |
| --- | --- | --- |
| Kind entries in map | 2500 | 5 |
| Stop channels live | 5 | 5 |
| Heap growth (KB) | +88 | -8 |

TestGoroutineLeakSimulation - 5 GVKs × 20 store rebuilds

| Metric | Buggy (pre-fix) | Fixed (post-fix) |
| --- | --- | --- |
| Goroutines leaked | 100 | 0 |
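
The 100 in the pre-fix column is simply 5 GVKs × 20 rebuilds. The sketch below reproduces that count with a toy model and is not the test in this PR. It assumes each store rebuild starts one goroutine per GVK and then cancels that rebuild's context, and that no CRD is ever deleted. `leakedAfterRebuilds` and `honourCtx` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	numGVKs  = 5
	rebuilds = 20
)

// leakedAfterRebuilds returns how many goroutines are still blocked after all
// rebuilds. honourCtx=false models the pre-fix reflectors (GVK stop channel
// only); honourCtx=true models the post-fix ones (stop channel or context).
func leakedAfterRebuilds(honourCtx bool) int64 {
	var started, exited int64
	gvkStop := make(chan struct{}) // never closed: no CRD is deleted in this model

	for r := 0; r < rebuilds; r++ {
		ctx, cancel := context.WithCancel(context.Background())
		var wg sync.WaitGroup
		for g := 0; g < numGVKs; g++ {
			started++
			wg.Add(1)
			go func() {
				defer wg.Done()
				defer atomic.AddInt64(&exited, 1) // runs before wg.Done
				if honourCtx {
					select {
					case <-gvkStop:
					case <-ctx.Done():
					}
					return
				}
				<-gvkStop // pre-fix: context cancellation is invisible here
			}()
		}
		cancel() // the rebuild tears down the old builder context
		if honourCtx {
			wg.Wait() // post-fix goroutines are guaranteed to exit
		}
	}
	return started - atomic.LoadInt64(&exited)
}

func main() {
	fmt.Println("pre-fix leaked: ", leakedAfterRebuilds(false)) // prints "pre-fix leaked:  100"
	fmt.Println("post-fix leaked:", leakedAfterRebuilds(true))  // prints "post-fix leaked: 0"
}
```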

Fixes #2867

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bhope
Once this PR has been reviewed and has the lgtm label, please assign catherinef-dev for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 10, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Instrumentation Apr 10, 2026
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 10, 2026
@mrueg mrueg requested a review from Copilot April 10, 2026 20:58

Copilot AI left a comment


Pull request overview

Fixes elevated/unbounded memory growth and goroutine leaks when custom resource state config is enabled by making CRD discovery idempotent and ensuring custom-resource reflectors stop on both CRD removal and context cancellation.

Changes:

  • Make CRDiscoverer.AppendToMap idempotent (no duplicate kinds; don’t replace existing stop channels).
  • Ensure custom-resource reflectors stop when either the GVK stop channel fires or the builder context is cancelled.
  • Add/extend tests covering idempotency, channel cleanup, and leak simulations.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| internal/store/builder.go | Updates CR reflector stop behavior to also honor builder context cancellation. |
| internal/store/builder_test.go | Adds unit tests around the combined stop channel behavior for CR reflectors. |
| internal/discovery/types.go | Prevents duplicate kind entries and stop-channel replacement in repeated discovery updates. |
| internal/discovery/types_test.go | Adds deterministic unit tests for Append/Remove idempotency and channel closure. |
| internal/discovery/memleak_test.go | Adds simulation-style tests intended to demonstrate pre/post fix memory & goroutine behavior. |


Comment thread internal/store/builder.go
Comment thread internal/store/builder_test.go
Comment thread internal/store/builder_test.go Outdated
Comment thread internal/discovery/memleak_test.go Outdated
Comment thread internal/discovery/memleak_test.go Outdated
@bhope
Member Author

bhope commented Apr 13, 2026

Hi @mrueg addressed the copilot suggestions and CI is now green. Ready for a review when you get a chance.

…discovery

fix gofmt error

Co-authored-by: Oleg Zaytsev <1511481+colega@users.noreply.github.com>
@jullianow

Any idea when this will be released?

@bhope
Member Author

bhope commented Apr 14, 2026

@jullianow This will be included in the upcoming release, which we are working towards. Please stay tuned. Thanks.


Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.



Comment on lines +297 to +301

    deadline := time.After(2 * time.Second)
    for i, ch := range stopChs {
        select {
        case <-ch:
        case <-deadline:

Copilot AI Apr 15, 2026


deadline := time.After(2 * time.Second) is created once and reused across the loop. That means later iterations may get less than 2s (or even time out immediately) depending on scheduling, making this test more brittle and the error message misleading. Consider using a per-iteration timeout (create time.After inside the loop) or use a single overall deadline but compare against time.Now()/time.Until() and adjust the message accordingly.

Suggested change

    - deadline := time.After(2 * time.Second)
    - for i, ch := range stopChs {
    -     select {
    -     case <-ch:
    -     case <-deadline:
    + for i, ch := range stopChs {
    +     select {
    +     case <-ch:
    +     case <-time.After(2 * time.Second):
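
The difference between the two timeout styles can be seen in a small, timing-based sketch. It is independent of the test under review: `staggered`, `waitShared`, `waitPerIteration`, and the `step` and `timeout` values are hypothetical and chosen only so that the total wait exceeds one timeout while each individual wait does not.

```go
package main

import (
	"fmt"
	"time"
)

const (
	step    = 150 * time.Millisecond // gap between successive channel closes
	timeout = 225 * time.Millisecond // longer than one step, shorter than two
)

// staggered returns n channels; channel i is closed after i*step.
func staggered(n int, step time.Duration) []chan struct{} {
	chs := make([]chan struct{}, n)
	for i := range chs {
		ch := make(chan struct{})
		chs[i] = ch
		time.AfterFunc(time.Duration(i)*step, func() { close(ch) })
	}
	return chs
}

// waitShared creates one timer for the whole loop, so the budget shrinks with
// every iteration. It returns the index that timed out, or -1 if none did.
func waitShared(chs []chan struct{}, timeout time.Duration) int {
	deadline := time.After(timeout)
	for i, ch := range chs {
		select {
		case <-ch:
		case <-deadline:
			return i
		}
	}
	return -1
}

// waitPerIteration gives every channel its own full timeout.
func waitPerIteration(chs []chan struct{}, timeout time.Duration) int {
	for i, ch := range chs {
		select {
		case <-ch:
		case <-time.After(timeout):
			return i
		}
	}
	return -1
}

func main() {
	// The last of three channels closes after 2*step, which is past the
	// shared deadline even though each individual wait is only one step.
	fmt.Println("shared deadline timed out at index:", waitShared(staggered(3, step), timeout))
	fmt.Println("per-iteration timed out at index:  ", waitPerIteration(staggered(3, step), timeout))
}
```

With a shared timer, a failure message that mentions a 2-second wait for channel `i` can be misleading, because that channel may have been given far less time.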

@mrueg
Member

mrueg commented Apr 15, 2026

> Hi @mrueg addressed the copilot suggestions and CI is now green. Ready for a review when you get a chance.

Thanks for looking into those comments.
Unfortunately I won't have access to a way to test it until mid May due to private travel.

@rexagod can you take a look?


Labels

  • `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
  • `needs-triage`: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • `size/L`: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Needs Triage

Development

Successfully merging this pull request may close these issues.

Elevated Memory Utilization (v2.18.0)

5 participants