
v0.5.0 — STAC retry + exit on root failure


boettiger-lab-llm-agent released this 18 Apr 19:15
3756774

Retry pass for transient child failures

STAC catalog children that time out on the first attempt now get one automatic retry in a fresh pool with a longer per-child timeout.

  • New env var: `STAC_CHILD_RETRY_TIMEOUT` (default 8s).
  • A retry that succeeds clears the original error from `STAC_LOAD_ERRORS`; a retry that also fails leaves the error in place.
  • Rescues the tail-latency case observed on v0.4.0's first real-world deploy (2 of 81 collections timed out at the 5s ceiling — both were recoverable with a slightly longer timeout).
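A minimal sketch of the two-pass load, assuming an asyncio-based loader. `load_children`, `_run_pool`, and the injected `fetch` callable are hypothetical names, and a plain dict stands in for `STAC_LOAD_ERRORS` (its real type isn't described here). Only the env var names, the 5s first-pass ceiling, and the 8s retry default come from these notes. A semaphore plays the role of the pool, so "fresh pool" here just means a new semaphore; applying `STAC_FETCH_CONCURRENCY` to the retry pass is also an assumption.

```python
import asyncio
import os
from typing import Awaitable, Callable, Dict, List, Tuple

# Assumption: only the env var names and defaults come from the release notes;
# the function names and asyncio structure below are hypothetical.
MAIN_TIMEOUT = 5.0  # first-pass per-child ceiling (the 5s cited above)
RETRY_TIMEOUT = float(os.environ.get("STAC_CHILD_RETRY_TIMEOUT", "8"))
CONCURRENCY = int(os.environ.get("STAC_FETCH_CONCURRENCY", "16"))

Fetch = Callable[[str], Awaitable[dict]]


async def _run_pool(
    urls: List[str], fetch: Fetch, timeout: float, concurrency: int
) -> Tuple[Dict[str, dict], Dict[str, BaseException]]:
    """Fetch each URL with at most `concurrency` in flight and a per-child timeout."""
    sem = asyncio.Semaphore(concurrency)  # a fresh "pool" for each call
    ok: Dict[str, dict] = {}
    failed: Dict[str, BaseException] = {}

    async def one(url: str) -> None:
        async with sem:  # the timeout starts only once a slot is free
            try:
                ok[url] = await asyncio.wait_for(fetch(url), timeout)
            except Exception as exc:  # timeouts and other fetch errors
                failed[url] = exc

    await asyncio.gather(*(one(u) for u in urls))
    return ok, failed


async def load_children(
    urls: List[str],
    fetch: Fetch,
    load_errors: Dict[str, str],  # hypothetical stand-in for STAC_LOAD_ERRORS
    main_timeout: float = MAIN_TIMEOUT,
    retry_timeout: float = RETRY_TIMEOUT,
    concurrency: int = CONCURRENCY,
) -> Dict[str, dict]:
    loaded, failed = await _run_pool(urls, fetch, main_timeout, concurrency)
    for url, exc in failed.items():
        load_errors[url] = repr(exc)

    # One retry, only for children that timed out on the first attempt.
    timed_out = [u for u, e in failed.items() if isinstance(e, asyncio.TimeoutError)]
    if timed_out:
        retried, _ = await _run_pool(timed_out, fetch, retry_timeout, concurrency)
        loaded.update(retried)
        for url in retried:
            load_errors.pop(url, None)  # success clears the original error
        # A second failure leaves the original error entry untouched.
    return loaded
```

Acquiring the semaphore before starting `wait_for` means time spent queueing for a slot doesn't count against a child's timeout; whether the real server behaves the same way isn't stated in these notes.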

Exit on root-catalog failure

When the root STAC catalog JSON is unreachable at startup, the server now calls `sys.exit(1)` instead of starting uvicorn with an empty catalog. Kubernetes restarts the pod, and the next attempt runs against fresh S3 conditions.

Partial catalogs (some children failed) still serve — that's unchanged from v0.4.0. Only total failure triggers exit.
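The startup check can be sketched as follows. `load_root_or_exit`, the injected `fetch_root` callable, and the logger name are hypothetical; the only behavior taken from these notes is exiting with status 1 when the root catalog can't be fetched, before uvicorn ever starts, while child failures are tolerated.

```python
import logging
import sys
from typing import Callable

log = logging.getLogger("stac-startup")  # hypothetical logger name


def load_root_or_exit(fetch_root: Callable[[], dict]) -> dict:
    """Return the root catalog, or exit(1) so Kubernetes restarts the pod."""
    try:
        return fetch_root()
    except Exception as exc:
        log.error("root STAC catalog unreachable (%r); exiting", exc)
        sys.exit(1)  # total failure: don't serve an empty catalog


# Hypothetical startup order:
#   root = load_root_or_exit(fetch_root)
#   ...load child collections; partial failures are tolerated...
#   uvicorn.run(app, host="0.0.0.0", port=8000)
```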

Default concurrency bumped

`STAC_FETCH_CONCURRENCY` default raised from 8 to 16 so that the main pass and the retry pass together fit within the readiness-probe budget. The full-load cold-start worst case (all 63 fetches succeed on retry) drops from ~40s to ~28s.
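A back-of-the-envelope model of why doubling concurrency helps, assuming fetches run in back-to-back waves of `STAC_FETCH_CONCURRENCY` and every first-pass fetch runs all the way to the 5s timeout. It covers only the first pass (the retry pass and real scheduling are left out), so it doesn't reproduce the ~28s figure above; it just shows the first-pass bound for 63 children halving. `first_pass_worst_case` is a hypothetical helper.

```python
import math


def first_pass_worst_case(n_children: int, concurrency: int, timeout_s: float) -> float:
    """Seconds for one pass if every fetch in every wave runs to the timeout.

    Simplified model (an assumption): fetches run in back-to-back waves of
    `concurrency`, so the pass takes ceil(n / concurrency) timeouts.
    """
    waves = math.ceil(n_children / concurrency)
    return waves * timeout_s


print(first_pass_worst_case(63, 8, 5.0))   # 8 waves → 40.0
print(first_pass_worst_case(63, 16, 5.0))  # 4 waves → 20.0
```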

Upgrade notes

All changes are backwards-compatible for existing callers, and the new `STAC_CHILD_RETRY_TIMEOUT` env var is optional. One behavior change worth knowing about: pods that previously stayed up serving an empty catalog when the root fetch failed will now exit at startup and be restarted by Kubernetes. That's intentional: a k8s restart recovers faster than a broken-but-serving pod does.