Add hook for formatting Kubernetes Event messages#910

Open
agoose77 wants to merge 63 commits into jupyterhub:main from agoose77:feat-add-hook


agoose77 commented Mar 3, 2026

TL;DR

Note

No LLMs were used in the authoring of this PR.

2i2c is currently working on a user-story to improve the kubespawner progress messages, as part of an initiative to improve the spawn-progress page.

This PR does several things:

  1. Adds a custom decorate_progress_message that overrides the pretty-printing of event messages.
  2. Adds a kubespawner.events module for richer built-in formatting of log messages.
  3. Adds a kubespawner.events.RuleEventFormatter and other types for defining event formatting rules.

See the before and after:

Before: [image: old]

After: [image: new]

Goal

The goal is modest: to improve the human readability of spawn messages, and to allow further customisation.

Example Decoration Hook

Basic hook

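def decorate_progress_message(spawner, event, text):
    # Return both a plain-text message and an HTML variant for the progress page.
    return {
        "message": f"custom-message-{text}",
        "html_message": f"<span>{text}</span>",
    }

# Register the hook in the JupyterHub configuration.
c.KubeSpawner.decorate_progress_message = decorate_progress_message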
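
The basic hook interpolates `text` into HTML without escaping it. A slightly more defensive variant might look like the sketch below. It assumes the hook receives the same `(spawner, event, text)` arguments and that `html_message` is rendered as raw HTML; `escaped_progress_message` is a hypothetical name, not part of this PR.

```python
import html


def escaped_progress_message(spawner, event, text):
    # Hypothetical variant of the basic hook: escape the event text before
    # embedding it in markup, so characters like "<" are shown literally.
    safe_text = html.escape(text)
    return {
        "message": text,
        "html_message": "<span>{}</span>".format(safe_text),
    }


# The spawner and event arguments are unused here, so placeholders suffice.
result = escaped_progress_message(None, {}, "Pulling <img>")
print(result["html_message"])
```

It would be registered the same way, by assigning it to `c.KubeSpawner.decorate_progress_message`.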

Use the rules to define custom renderers

c.RuleEventFormatter.rules = {
    "01-container-image-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "fieldPath": r"spec\.(?P<container>initContainers|containers)\{([^}]+)\}",
            "reason": r"(?P<action>Pulling|Pulled)",
        },
        "template": "{action} {image} image for the {container} container",
    },
    "02-container-lifecycle-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "fieldPath": r"spec\.(?P<container>initContainers|containers)\{([^}]+)\}",
            "reason": r"(?P<action>Started|Killing|Created|Stopped)",
        },
        "template": "{action} the {container} container",
    },
    "03-pod-resource-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "reason": r"OutOf(?P<resource>memory|cpu|ephemeral-storage|pods)",
        },
        "template": "The node selected to run your server ran out of {resource}",
    },
    "04-scheduler-node-found-events": {
        "match": {
            "reportingComponent": r".*-user-scheduler",
            "reason": r"Scheduled",
            "message": r".*?assigned \S+ to (?P<node>\S+)",
        },
        "template": "A node ({node}) has been found to run your server",
    },
    "05-scheduler-no-nodes-events": {
        "match": {
            "reportingComponent": r".*-user-scheduler",
            "reason": r"FailedScheduling",
        },
        "template": "No existing nodes are currently able to run your server",
    },
    "06-cluster-autoscaler-events": {
        "match": {
            "reportingComponent": r"cluster-autoscaler",
            "reason": r"TriggeredScaleUp",
        },
        "template": "Launching new nodes by scaling up the cluster",
    },
    "07-node-affinity-events": {
        "match": {
            "reportingComponent": r"kubelet",
            "message": r"Predicate NodeAffinity failed.*",
            "reason": r"NodeAffinity",
        },
        "template": "It was not possible to find or launch any nodes to run your server. This is likely due to a configuration problem with the infrastructure or the JupyterHub",
    },
    "08-gke-scheduler-node-found-events": {
        "match": {
            "reportingComponent": r"gke\.io/optimize-utilization-scheduler",
            "reason": r"Scheduled",
            "message": r".*?assigned \S+ to (?P<node>\S+)",
        },
        "template": "A node ({node}) has been found to run your server",
    },
    "09-gke-scheduler-no-nodes-events": {
        "match": {
            "reportingComponent": r"gke\.io/optimize-utilization-scheduler",
            "reason": r"FailedScheduling",
        },
        "template": "No existing nodes are currently able to run your server",
    },
    "10-taint-eviction-events": {
        "match": {
            "reportingComponent": r"taint-eviction-controller",
            "reason": r"TaintManagerEviction",
            "message": r"Cancelling deletion of Pod.*",
        },
        "template": "Cancelling deletion of your server. This normally happens when a scale-up has just taken place.",
    },
}
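
To make the rule format concrete, here is a minimal sketch of how a rule of this shape could be evaluated. It is not the `RuleEventFormatter` implementation from this PR. It assumes that rules are tried in sorted key order, that every pattern must match its whole field (`re.fullmatch`), that the event is a flat dict, and that named groups fill the template. `format_event` and the sample event are hypothetical.

```python
import re

# Two rules copied from the configuration above.
RULES = {
    "04-scheduler-node-found-events": {
        "match": {
            "reportingComponent": r".*-user-scheduler",
            "reason": r"Scheduled",
            "message": r".*?assigned \S+ to (?P<node>\S+)",
        },
        "template": "A node ({node}) has been found to run your server",
    },
    "05-scheduler-no-nodes-events": {
        "match": {
            "reportingComponent": r".*-user-scheduler",
            "reason": r"FailedScheduling",
        },
        "template": "No existing nodes are currently able to run your server",
    },
}


def format_event(event, rules):
    """Return the rendered template of the first matching rule, or None."""
    for name in sorted(rules):  # assumption: key order defines priority
        rule = rules[name]
        captured = {}
        for field, pattern in rule["match"].items():
            found = re.fullmatch(pattern, event.get(field) or "")
            if found is None:
                break  # this rule does not apply; try the next one
            captured.update(found.groupdict())
        else:
            # Every field matched, so fill the template with the named groups.
            return rule["template"].format(**captured)
    return None


sample_event = {
    "reportingComponent": "hub-user-scheduler",
    "reason": "Scheduled",
    "message": "Successfully assigned jhub/jupyter-alice to node-1",
}
print(format_event(sample_event, RULES))
```

Returning `None` when nothing matches leaves the caller free to fall back to the raw Kubernetes message.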

Design Details

UI

  • Timestamps are formatted in a fixed-width, ISO 8601-like form (%Y-%m-%dT%H:%M:%SZ)
  • Timestamps and message types are pretty formatted as button-pills
  • Messages are simplified where possible
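
The fixed-width timestamp format can be sketched as follows. This is an illustration rather than the code in this PR; `format_timestamp` is a hypothetical helper and it assumes a timezone-aware `datetime` as input.

```python
from datetime import datetime, timezone


def format_timestamp(moment):
    # Convert to UTC first so the trailing "Z" is accurate, then render a
    # fixed-width string (always 20 characters for four-digit years).
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


stamp = format_timestamp(datetime(2026, 3, 3, 14, 0, 5, tzinfo=timezone.utc))
print(stamp)
```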

Constraints

  • I targeted Python 3.7 given pyproject.toml, meaning no match, :=, or removeprefix.
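
For context, `str.removeprefix` arrived in Python 3.9, `:=` in 3.8, and `match` in 3.10. A Python 3.7-compatible stand-in for `removeprefix` might look like the sketch below; `remove_prefix` is a hypothetical helper, not necessarily what this PR uses.

```python
def remove_prefix(text, prefix):
    # Mirrors str.removeprefix (Python 3.9+): strip the prefix only when
    # the text actually starts with it, otherwise return the text unchanged.
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text


print(remove_prefix("Error: ErrImagePull", "Error: "))
```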

Questions

  • Is Kubespawner making too many assumptions if we bake-in the expectation of Bootstrap?
  • Could we consider adding a start-time timestamp so that times can simply be given as "minutes since spawn" rather than UTC times?
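
To illustrate the second question, relative times could be derived as in the sketch below. It assumes a spawn start time is available alongside each event timestamp, which this PR does not currently provide; `format_elapsed` is a hypothetical helper.

```python
from datetime import datetime, timezone


def format_elapsed(spawn_start, event_time):
    # Render the time since spawn start as "+MM:SS"; negative gaps (for
    # example from clock skew) are clamped to zero.
    total_seconds = max(0, int((event_time - spawn_start).total_seconds()))
    minutes, seconds = divmod(total_seconds, 60)
    return "+{:02d}:{:02d}".format(minutes, seconds)


start = datetime(2026, 3, 3, 14, 0, 0, tzinfo=timezone.utc)
later = datetime(2026, 3, 3, 14, 2, 5, tzinfo=timezone.utc)
print(format_elapsed(start, later))
```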

agoose77 marked this pull request as ready for review on March 3, 2026 14:00
manics (Member) commented Mar 3, 2026

> This does effectively vendor some cluster-provider specifics (like the GCP scheduler). I think that's OK? But if we are vehemently against that, we can just pull those parts out.

Whatever we decide we need to be consistent in future. If we add GCP specific code we need to accept code for other clouds, including from third parties who use platforms that we can't test ourselves.

agoose77 force-pushed the feat-add-hook branch 2 times, most recently from dd73955 to cb2f0a3, on March 3, 2026 16:54
jnywong (Member) left a comment

I enjoy this feature ❤️ I like that the format_event_hook was easy to configure for basic formatting. It took me a while to understand what was going on with the default formatting, but I think there will always be room for improvements there.

In general, my main comment is to complete the test suite so that it reproduces how the sample events were generated through message.py, since I think that represents the bulk of the work in this PR. The default formatting may undergo further development, but as you say, I think we want to add some basic regression testing now and update it if needed in future.


In answer to your questions:

> Is Kubespawner making too many assumptions if we bake-in the expectation of Bootstrap?

We know Bootstrap ships with JupyterHub, so I think this is a safe assumption for now. I don't think we need to worry about this for BinderHub?

> Could we consider adding a start-time timestamp so that times can simply be given as "minutes since spawn" rather than UTC times?

I think this is a nice-to-have. Most users hopefully shouldn't have to dwell on the spawn progress screen, but if they do, then they will likely screenshot their spawn failure to send to an admin. Keeping the timestamps consistent with server side logs with raw k8s events should be useful for sysadmins for troubleshooting.

> Should I rework the built-in formatter to generate HTML at every stage — would it be useful to have e.g. image names/tags, and node names in button-tags?

At this stage, I would prefer not to have the default formatter be too flashy.

> Is there motivation to move the default message format into its own configurable, rather than requiring users to create their own hook?

Yes, out of all of these questions I think this would be one to focus on. Most people will probably want to configure an extension to the default formatter.

jnywong (Member) commented Mar 5, 2026

> This does effectively vendor some cluster-provider specifics (like the GCP scheduler). I think that's OK? But if we are vehemently against that, we can just pull those parts out.
>
> Whatever we decide we need to be consistent in future. If we add GCP specific code we need to accept code for other clouds, including from third parties who use platforms that we can't test ourselves.

I am okay with that – I don't think we need to assume full responsibility for testing code on platforms we don't have access to, but we should ensure that contributors who want this functionality include full test suites for it. I think the scope of changes in this PR is pretty cosmetic, so there doesn't seem to be huge scope for someone to introduce anything too crazy on a third party platform.

There is a small question about how to structure this as the corpus of messages to reformat grows, but I think we can cross that bridge when we get to it?

agoose77 (Author) commented

@yuvipanda rather than moving just the business logic of event formatting to the new module created in this PR, I opted to move it into a new module under a new Configurable. This helps better isolate the responsibility, makes testing easier, and also improves extensibility.

Comment on lines +655 to +659
assert "progress" in progress
assert isinstance(progress["progress"], int)
assert "message" in progress
assert isinstance(progress["message"], str)
messages.append(progress["message"])

agoose77 (Author) commented:

My LSP changes these. It passes pre-commit, so I assume OK to leave?

or "",
"message": event.get("message") or "",
"reason": event.get("reason") or "",
"type": event["type"],

agoose77 (Author) commented:

Added the type field, as it's harmless and useful!
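
The `event.get(...) or ""` pattern in the snippet above covers both a missing key and an explicit `None`, while `event["type"]` is read directly. A minimal sketch of that normalization, assuming the event is available as a plain dict; `summarize_event` is a hypothetical name, not the function in this PR:

```python
def summarize_event(event):
    # Optional fields fall back to "" whether they are absent or None.
    # "type" is read directly, so a missing type raises KeyError.
    return {
        "message": event.get("message") or "",
        "reason": event.get("reason") or "",
        "type": event["type"],
    }


print(summarize_event({"type": "Warning", "message": None, "reason": "BackOff"}))
```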

jnywong (Member) left a comment

Nice, a significant improvement over the last time I reviewed this. It makes sense to make all of the matching rulesets configurable rather than essentially hard coding them into the library itself as before. Good job!

I have minor style suggestions and comments, but I think this is gtg once addressed.

jnywong (Member) commented Apr 13, 2026

I tested a scenario in which I triggered an Error: ImagePullBackOff by entering an invalid image name into the kubespawner form, and I think the default event rules introduced by this PR make this more inscrutable than before.

Before: [screenshot, 2026-04-13 at 14:53:41]
After: [screenshot, 2026-04-13 at 14:55:31]

Could you take a close look at this? I am not sure why this rule would match the Error: ErrImagePull event.

agoose77 (Author) commented

@jnywong nice catch!

This PR adds a default formatter that was not being returned properly. I've lifted it earlier into the transforms to avoid needing to manually handle the exception case.

I've added rules for these errors:
[image]


4 participants