fix: stop memory leak from orphaned CR reflector goroutines on repeated CRD discovery by bhope · Pull Request #2920 · kubernetes/kube-state-metrics

bhope · 2026-04-10T20:23:16Z

v2.18.0 introduced elevated, unbounded memory growth when custom resource state config is in use.

Root Causes

AppendToMap overwrites stop channels and appends duplicate kinds on every call (internal/discovery/types.go). Since PollForCacheUpdates calls it for every known GVK each cycle, old stop channels were silently replaced, orphaning any reflector goroutine blocking on them.
CR reflectors ignore context cancellation (internal/store/builder.go). Unlike standard reflectors started with reflector.Run(b.ctx.Done()), custom resource reflectors were started with only their GVK-specific stop channel - no context cancellation path at all.
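
To make the first root cause concrete, here is a minimal, self-contained sketch of a non-idempotent append. The `discoverer` type, its fields, and `appendBuggy` are hypothetical stand-ins invented for illustration; they are not the real `CRDiscoverer` code in internal/discovery/types.go.

```go
package main

import "fmt"

// gvk is a simplified stand-in for a group/version/kind key.
type gvk struct{ Group, Version, Kind string }

// discoverer is a hypothetical, stripped-down model of the discovery cache.
// Locking is omitted to keep the sketch short.
type discoverer struct {
	kinds   map[string][]string   // "group/version" -> kinds
	stopChs map[gvk]chan struct{} // one stop channel per GVK
}

func newDiscoverer() *discoverer {
	return &discoverer{
		kinds:   map[string][]string{},
		stopChs: map[gvk]chan struct{}{},
	}
}

// appendBuggy models the pre-fix behaviour described above: every call
// appends the kind again and replaces the GVK's stop channel.
func (d *discoverer) appendBuggy(g gvk) chan struct{} {
	key := g.Group + "/" + g.Version
	d.kinds[key] = append(d.kinds[key], g.Kind)
	ch := make(chan struct{})
	d.stopChs[g] = ch
	return ch
}

func main() {
	d := newDiscoverer()
	g := gvk{Group: "example.com", Version: "v1", Kind: "Widget"}

	// A reflector started after the first call would block on this channel.
	first := d.appendBuggy(g)

	// Simulate the remaining poll cycles re-registering the same GVK.
	for cycle := 1; cycle < 500; cycle++ {
		d.appendBuggy(g)
	}

	fmt.Println("kind entries:", len(d.kinds["example.com/v1"]))       // prints 500
	fmt.Println("first channel still tracked:", d.stopChs[g] == first) // prints false
}
```

Once the map no longer holds the first channel, nothing can close it, so a goroutine blocked on it can never be told to stop.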

Fix

AppendToMap: skip the append if the kind already exists; skip make(chan struct{}) if a channel already exists for the GVK.
startReflector: wrap the GVK stop channel with a bridge goroutine that also selects on b.ctx.Done(), so CR reflectors stop on both CRD deletion and context cancellation.
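
The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `discoverer`, `appendIdempotent`, `mergeStop`, `closesWithin`, and `demo` are hypothetical names, locking is omitted, and no client-go types are used.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type gvk struct{ Group, Version, Kind string }

// discoverer is a hypothetical, simplified model of the discovery cache.
type discoverer struct {
	kinds   map[string][]string
	stopChs map[gvk]chan struct{}
}

func newDiscoverer() *discoverer {
	return &discoverer{kinds: map[string][]string{}, stopChs: map[gvk]chan struct{}{}}
}

// appendIdempotent adds the kind and the stop channel only if they are absent,
// so repeated calls for the same GVK change nothing.
func (d *discoverer) appendIdempotent(g gvk) chan struct{} {
	key := g.Group + "/" + g.Version
	found := false
	for _, k := range d.kinds[key] {
		if k == g.Kind {
			found = true
			break
		}
	}
	if !found {
		d.kinds[key] = append(d.kinds[key], g.Kind)
	}
	if _, ok := d.stopChs[g]; !ok {
		d.stopChs[g] = make(chan struct{})
	}
	return d.stopChs[g]
}

// mergeStop returns a channel that is closed when either gvkStop is closed
// or ctx is cancelled. The bridge goroutine exits in both cases.
func mergeStop(ctx context.Context, gvkStop <-chan struct{}) <-chan struct{} {
	merged := make(chan struct{})
	go func() {
		defer close(merged)
		select {
		case <-gvkStop:
		case <-ctx.Done():
		}
	}()
	return merged
}

// closesWithin reports whether ch is closed before timeout elapses.
func closesWithin(ch <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(timeout):
		return false
	}
}

// demo exercises both behaviours and returns what it observed.
func demo() (sameCh bool, kindCount int, stopsOnCancel, stopsOnDelete bool) {
	d := newDiscoverer()
	g := gvk{Group: "example.com", Version: "v1", Kind: "Widget"}
	first := d.appendIdempotent(g)
	second := d.appendIdempotent(g)

	// Path 1: context cancellation stops the reflector.
	ctx, cancel := context.WithCancel(context.Background())
	onCancel := mergeStop(ctx, first)
	cancel()

	// Path 2: closing the GVK stop channel (CRD deletion) stops it.
	gvkStop := make(chan struct{})
	onDelete := mergeStop(context.Background(), gvkStop)
	close(gvkStop)

	return first == second, len(d.kinds["example.com/v1"]),
		closesWithin(onCancel, time.Second), closesWithin(onDelete, time.Second)
}

func main() {
	fmt.Println(demo())
}
```

In this sketch the merged channel is what would be handed to the reflector's run loop in place of the raw GVK stop channel.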

Also, added tests to cover idempotency and cleanup in the discovery package - verifying no duplicate kinds or channel replacement on repeated AppendToMap calls, and that RemoveFromMap closes channels so reflectors stop cleanly.
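
As a rough illustration of the cleanup property those tests check, the sketch below shows a removal that closes the stop channel exactly once and tolerates repeated calls. The `discoverer`, `add`, `remove`, and `isClosed` names are hypothetical and do not mirror the real `RemoveFromMap` signature.

```go
package main

import "fmt"

type gvk struct{ Group, Version, Kind string }

// discoverer is a hypothetical, simplified model of the discovery cache.
type discoverer struct {
	kinds   map[string][]string
	stopChs map[gvk]chan struct{}
}

func newDiscoverer() *discoverer {
	return &discoverer{kinds: map[string][]string{}, stopChs: map[gvk]chan struct{}{}}
}

// add registers a GVK once and returns its stop channel.
func (d *discoverer) add(g gvk) <-chan struct{} {
	if ch, ok := d.stopChs[g]; ok {
		return ch
	}
	key := g.Group + "/" + g.Version
	d.kinds[key] = append(d.kinds[key], g.Kind)
	d.stopChs[g] = make(chan struct{})
	return d.stopChs[g]
}

// remove closes and forgets the GVK's stop channel and drops its kind entry.
// Deleting the map entry after closing makes a second call a no-op, which
// avoids a "close of closed channel" panic.
func (d *discoverer) remove(g gvk) {
	if ch, ok := d.stopChs[g]; ok {
		close(ch)
		delete(d.stopChs, g)
	}
	key := g.Group + "/" + g.Version
	kept := d.kinds[key][:0]
	for _, k := range d.kinds[key] {
		if k != g.Kind {
			kept = append(kept, k)
		}
	}
	if len(kept) == 0 {
		delete(d.kinds, key)
	} else {
		d.kinds[key] = kept
	}
}

// isClosed reports whether ch has been closed, without blocking.
func isClosed(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	d := newDiscoverer()
	g := gvk{Group: "example.com", Version: "v1", Kind: "Widget"}
	stop := d.add(g)

	d.remove(g)
	d.remove(g) // second call must not panic

	fmt.Println("stop channel closed:", isClosed(stop)) // prints true
	fmt.Println("tracked channels:", len(d.stopChs))     // prints 0
}
```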

Test Results:

TestMemoryLeakSimulation - 5 GVKs × 500 poll cycles

| Metric | Buggy (pre-fix) | Fixed (post-fix) |
| --- | --- | --- |
| Kind entries in map | 2500 | 5 |
| Stop channels live | 5 | 5 |
| Heap growth (KB) | +88 | -8 |

TestGoroutineLeakSimulation - 5 GVKs × 20 store rebuilds

| Metric | Buggy (pre-fix) | Fixed (post-fix) |
| --- | --- | --- |
| Goroutines leaked | 100 | 0 |
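
One way such a goroutine count could be modelled is sketched below. This is not the PR's `TestGoroutineLeakSimulation`; it is a hypothetical simulation that assumes each store rebuild cancels the previous context and starts one goroutine per GVK, with the GVK stop channel never closed because no CRD is deleted. The `simulate` function and its constants are invented for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	gvks     = 5
	rebuilds = 20
)

// simulate starts gvks goroutines per rebuild and cancels the rebuild's
// context afterwards. It returns how many goroutines are still running.
// With honorCtx=false the goroutines wait only on the stop channel, so
// they stay blocked for the life of the process (that is the leak).
func simulate(honorCtx bool) int64 {
	var live atomic.Int64
	stop := make(chan struct{}) // never closed: models a CRD that is not deleted

	for r := 0; r < rebuilds; r++ {
		ctx, cancel := context.WithCancel(context.Background())
		var wg sync.WaitGroup
		for g := 0; g < gvks; g++ {
			live.Add(1)
			wg.Add(1)
			go func() {
				defer wg.Done()
				defer live.Add(-1) // runs before wg.Done (deferred calls are LIFO)
				if honorCtx {
					select {
					case <-stop:
					case <-ctx.Done():
					}
				} else {
					<-stop
				}
			}()
		}
		cancel()
		if honorCtx {
			wg.Wait() // every goroutine of this rebuild has exited
		}
	}
	return live.Load()
}

func main() {
	fmt.Println("leaked without context handling:", simulate(false)) // prints 100
	fmt.Println("leaked with context handling:", simulate(true))     // prints 0
}
```

The counts match the table only because the simulation uses the same 5 × 20 setup; the real test may measure leaks differently.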

Fixes #2867

k8s-ci-robot · 2026-04-10T20:23:25Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bhope
Once this PR has been reviewed and has the lgtm label, please assign catherinef-dev for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-04-10T20:23:25Z

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot

Pull request overview

Fixes elevated/unbounded memory growth and goroutine leaks when custom resource state config is enabled by making CRD discovery idempotent and ensuring custom-resource reflectors stop on both CRD removal and context cancellation.

Changes:

Make CRDiscoverer.AppendToMap idempotent (no duplicate kinds; don’t replace existing stop channels).
Ensure custom-resource reflectors stop when either the GVK stop channel fires or the builder context is cancelled.
Add/extend tests covering idempotency, channel cleanup, and leak simulations.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| internal/store/builder.go | Updates CR reflector stop behavior to also honor builder context cancellation. |
| internal/store/builder_test.go | Adds unit tests around the combined stop channel behavior for CR reflectors. |
| internal/discovery/types.go | Prevents duplicate kind entries and stop-channel replacement in repeated discovery updates. |
| internal/discovery/types_test.go | Adds deterministic unit tests for Append/Remove idempotency and channel closure. |
| internal/discovery/memleak_test.go | Adds simulation-style tests intended to demonstrate pre/post fix memory & goroutine behavior. |

bhope · 2026-04-13T20:50:23Z

Hi @mrueg addressed the copilot suggestions and CI is now green. Ready for a review when you get a chance.

jullianow · 2026-04-14T23:48:18Z

Any idea when this will be released?

bhope · 2026-04-14T23:52:46Z

@jullianow This will be included in the upcoming release, we are working towards it. Please stay tuned. Thanks.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Copilot · 2026-04-15T16:29:14Z

+	deadline := time.After(2 * time.Second)
+	for i, ch := range stopChs {
+		select {
+		case <-ch:
+		case <-deadline:


deadline := time.After(2 * time.Second) is created once and reused across the loop. That means later iterations may get less than 2s (or even time out immediately) depending on scheduling, making this test more brittle and the error message misleading. Consider using a per-iteration timeout (create time.After inside the loop) or use a single overall deadline but compare against time.Now()/time.Until() and adjust the message accordingly.

Suggested change

-	deadline := time.After(2 * time.Second)
-	for i, ch := range stopChs {
-		select {
-		case <-ch:
-		case <-deadline:
+	for i, ch := range stopChs {
+		select {
+		case <-ch:
+		case <-time.After(2 * time.Second):
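
The two options named in the comment can be compared side by side. The sketch below is illustrative only: `waitEach`, `waitAll`, and the two constants are hypothetical helpers, not code from the PR's tests.

```go
package main

import (
	"fmt"
	"time"
)

const (
	perChannel = 2 * time.Second       // budget for each channel in waitEach
	shortTotal = 50 * time.Millisecond // deliberately small overall budget
)

// waitEach gives every channel its own fresh timeout, as in the suggested change.
func waitEach(chs []chan struct{}, perCh time.Duration) error {
	for i, ch := range chs {
		select {
		case <-ch:
		case <-time.After(perCh):
			return fmt.Errorf("channel %d not closed within %v", i, perCh)
		}
	}
	return nil
}

// waitAll shares one overall deadline across all channels and says so in
// its error message. Each iteration waits only for the time remaining.
func waitAll(chs []chan struct{}, total time.Duration) error {
	deadline := time.Now().Add(total)
	for i, ch := range chs {
		timer := time.NewTimer(time.Until(deadline))
		select {
		case <-ch:
			timer.Stop()
		case <-timer.C:
			return fmt.Errorf("channel %d not closed before the overall %v deadline", i, total)
		}
	}
	return nil
}

func main() {
	closed := make(chan struct{})
	close(closed)
	open := make(chan struct{}) // never closed

	fmt.Println(waitEach([]chan struct{}{closed, closed}, perChannel))
	fmt.Println(waitAll([]chan struct{}{closed, closed}, perChannel))
	fmt.Println(waitAll([]chan struct{}{closed, open}, shortTotal))
}
```

The per-iteration form is simpler and keeps the "within 2s" message accurate for each channel; the shared-deadline form bounds the total test time instead.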

mrueg · 2026-04-15T16:36:02Z

> Hi @mrueg addressed the copilot suggestions and CI is now green. Ready for a review when you get a chance.

Thanks for looking into those comments.
Unfortunately I won't have access to a way to test it until mid May due to private travel.

@rexagod can you take a look?

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 10, 2026

k8s-ci-robot requested review from dgrisonnet and mrueg April 10, 2026 20:23

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 10, 2026

github-project-automation bot added this to SIG Instrumentation Apr 10, 2026

github-project-automation bot moved this to Needs Triage in SIG Instrumentation Apr 10, 2026

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 10, 2026

bhope force-pushed the fix-mem-leak branch from ecb6cdb to f24b776 Compare April 10, 2026 20:55

mrueg requested a review from Copilot April 10, 2026 20:58

Copilot started reviewing on behalf of mrueg April 10, 2026 20:59 View session

Copilot AI reviewed Apr 10, 2026

Comment thread internal/store/builder.go

Comment thread internal/store/builder_test.go

Comment thread internal/store/builder_test.go Outdated

Comment thread internal/discovery/memleak_test.go Outdated

Comment thread internal/discovery/memleak_test.go Outdated

bhope force-pushed the fix-mem-leak branch from 8767772 to 463d3a7 Compare April 10, 2026 21:35

fix: stop goroutine and memory leak in CR reflectors on repeated CRD …

e2a1dcf

…discovery fix gofmt error Co-authored-by: Oleg Zaytsev <1511481+colega@users.noreply.github.com>

bhope force-pushed the fix-mem-leak branch from 6c35482 to e2a1dcf Compare April 13, 2026 21:03

mrueg requested a review from Copilot April 15, 2026 16:22

Copilot started reviewing on behalf of mrueg April 15, 2026 16:22 View session

Copilot AI reviewed Apr 15, 2026

bhope wants to merge 1 commit into kubernetes:main from bhope:fix-mem-leak


5 participants
