
Commit 690e25a

KyleAMathews, evanob, and claude authored
fix(client): self-healing for permanently stuck expired shape handles (#4087)
## Summary

Expired shape handle entries in localStorage can get permanently stuck, preventing data from ever loading for affected shapes. This adds a self-healing retry mechanism that clears the poisoned entry and retries once, allowing automatic recovery even when a proxy strips cache-buster query parameters.

Based on #4085 by @evan-liveflow, refined with additional hardening from code review.

## Root Cause

When a shape gets a 409 (handle rotation), the client stores the old handle in `localStorage['electric_expired_shapes']`. On future requests, if a response contains that handle, the client treats it as a stale cached response and retries up to 3 times with cache-buster params.

The problem: if a proxy (e.g., phoenix_sync) strips query parameters, the cache busters are ineffective. All 3 retries fail, `FetchError(502)` is thrown to `onError`, and if `onError` doesn't retry, the stream dies. The expired entry persists in localStorage, so the next session hits the same wall, permanently.

Since the server never reuses handles (now documented as **SPEC.md S0**), the expired entry becomes a false positive once the caching layer clears, but the client has no way to discover this.

## Approach

After stale cache retries are exhausted (3 attempts), the client now:

1. **Always clears the expired entry** from localStorage: if cache busters didn't work, keeping the entry only poisons future sessions.
2. **Attempts one self-healing retry**: it resets the stream and retries without the `expired_handle` param. Since handles are never reused, the fresh response will have a new handle and won't trigger stale detection.
3. **Guards against infinite loops** via `#expiredShapeRecoveryKey` (once per shape key, reset on up-to-date).

```typescript
if (transition.exceededMaxRetries) {
  if (shapeKey) {
    expiredShapesCache.delete(shapeKey) // always clear
    if (this.#expiredShapeRecoveryKey !== shapeKey) {
      this.#expiredShapeRecoveryKey = shapeKey // remember we tried
      this.#reset() // fresh start
      throw new StaleCacheError(...) // caught internally → retry
    }
  }
  throw new FetchError(502, ...) // truly give up
}
```

### Key Invariants

- **S0**: Server handles are unique and never reused (phash2 + microsecond timestamp, SQLite UNIQUE INDEX, ETS insert_new)
- Self-healing fires at most once per shape per retry cycle (`#expiredShapeRecoveryKey` guard)
- The guard resets on up-to-date, so long-lived streams can self-heal again if the CDN misbehaves later
- The expired entry is cleared on every exhaustion, regardless of whether self-healing fires

### Non-goals

- TTL on expired cache entries: the self-healing mechanism handles the failure mode without added complexity
- Changing the `onError` contract: the fix works regardless of what the user's `onError` callback does

## Verification

```bash
cd packages/typescript-client
pnpm vitest run --config vitest.unit.config.ts  # 312 tests pass
pnpm exec tsc --noEmit                           # Clean
```

## Files changed

| File | Change |
|------|--------|
| `src/client.ts` | Self-healing logic in `#onInitialResponse`, recovery key cleared on up-to-date, updated catch block comment |
| `test/expired-shapes-cache.test.ts` | Updated 2 existing tests for self-healing flow, added test for CDN-always-stale scenario |
| `SPEC.md` | Added S0 (handle uniqueness guarantee), updated L3 loop-back entry and guard table |
| `.changeset/fix-expired-shapes-self-healing.md` | Changeset for patch release |

---

Based on #4085

---------

Co-authored-by: Evan O'Brien <evan@liveflow.io>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
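The exhaustion decision described in the Approach section can be modeled as a small pure function, which makes the "once per shape key, reset on up-to-date" guard easy to exercise in isolation. This is a hypothetical sketch, not code from `client.ts`: `decideOnExhaustion`, `onUpToDate`, and `RecoveryGuard` are illustrative names, and a plain `Map` stands in for the localStorage-backed expired-shapes cache.

```typescript
// Hypothetical model of the exhaustion decision; not code from client.ts.
// RecoveryGuard, decideOnExhaustion and onUpToDate are illustrative names,
// and a Map stands in for the localStorage-backed expired-shapes cache.

type ExhaustionAction = 'self-heal' | 'give-up'

interface RecoveryGuard {
  // Plays the role of #expiredShapeRecoveryKey.
  recoveryKey: string | null
}

function decideOnExhaustion(
  shapeKey: string | undefined,
  guard: RecoveryGuard,
  expiredShapes: Map<string, string>
): ExhaustionAction {
  if (shapeKey) {
    // Always clear the entry; keeping it would poison future sessions.
    expiredShapes.delete(shapeKey)
    if (guard.recoveryKey !== shapeKey) {
      // First exhaustion for this shape in the current cycle: self-heal once.
      guard.recoveryKey = shapeKey
      return 'self-heal'
    }
  }
  // No shape key, or self-healing was already tried: surface FetchError(502).
  return 'give-up'
}

// Reaching up-to-date ends the retry cycle and re-arms the guard.
function onUpToDate(guard: RecoveryGuard): void {
  guard.recoveryKey = null
}

const guard: RecoveryGuard = { recoveryKey: null }
const expiredShapes = new Map([['shape-a', 'handle-1']])
console.log(decideOnExhaustion('shape-a', guard, expiredShapes)) // self-heal
console.log(expiredShapes.has('shape-a')) // false
console.log(decideOnExhaustion('shape-a', guard, expiredShapes)) // give-up
onUpToDate(guard)
console.log(decideOnExhaustion('shape-a', guard, expiredShapes)) // self-heal
```

Keeping the decision separate from the fetch loop means the guard's two exits (one self-heal, then give up until the next up-to-date) can be checked without any network or storage mocks.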
1 parent b449f70 commit 690e25a

13 files changed

Lines changed: 510 additions & 95 deletions


.changeset/fix-expired-shapes-self-healing.md

Lines changed: 5 additions & 0 deletions

```diff
@@ -0,0 +1,5 @@
+---
+'@electric-sql/client': patch
+---
+
+Fix permanently stuck expired shape handles in localStorage by adding self-healing retry. When stale cache retries are exhausted (3 attempts with cache busters), the client now clears the expired entry from localStorage and retries once without the `expired_handle` parameter. Since the server never reuses handles (documented as SPEC.md S0), the fresh response will have a new handle and bypass stale detection. This prevents shapes from being permanently unloadable when a proxy strips cache-buster query parameters.
```
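The changeset's two claims, that cache busters fail behind a proxy that strips query parameters and that the self-healing retry omits `expired_handle`, can be illustrated with the standard `URL` API. In this sketch only the `expired_handle` and `cache_buster` parameter names come from the commit; `buildShapeUrl`, `strippingProxyCacheKey`, and the `proxy.example.com` URL are hypothetical.

```typescript
// Hypothetical sketch; only the `expired_handle` and `cache_buster` parameter
// names come from the commit. buildShapeUrl, strippingProxyCacheKey and the
// example URL are illustrative assumptions, not @electric-sql/client APIs.

function buildShapeUrl(
  base: string,
  shapeKey: string,
  expiredShapes: Map<string, string>,
  cacheBuster?: string
): URL {
  const url = new URL(base)
  const expiredHandle = expiredShapes.get(shapeKey)
  if (expiredHandle !== undefined) {
    // Sent only while the client still has an expired entry for this shape.
    url.searchParams.set('expired_handle', expiredHandle)
  }
  if (cacheBuster !== undefined) {
    // Only effective if every caching layer keys on the full query string.
    url.searchParams.set('cache_buster', cacheBuster)
  }
  return url
}

// A misconfigured proxy that builds its cache key from the path alone.
function strippingProxyCacheKey(url: URL): string {
  return url.origin + url.pathname
}

const base = 'https://proxy.example.com/v1/shape'
const expiredShapes = new Map([['shape-a', 'handle-1']])

const retry1 = buildShapeUrl(base, 'shape-a', expiredShapes, 'buster-1')
const retry2 = buildShapeUrl(base, 'shape-a', expiredShapes, 'buster-2')
// Different URLs, same proxy cache key: every retry hits the same stale entry.
console.log(retry1.href !== retry2.href) // true
console.log(strippingProxyCacheKey(retry1) === strippingProxyCacheKey(retry2)) // true

// Self-healing: the entry is deleted, so the retry carries no expired_handle.
expiredShapes.delete('shape-a')
const selfHeal = buildShapeUrl(base, 'shape-a', expiredShapes)
console.log(selfHeal.searchParams.has('expired_handle')) // false
console.log(strippingProxyCacheKey(selfHeal) === strippingProxyCacheKey(retry1)) // true
```

The last line shows that, if the proxy ignores the whole query string, the self-healing request also maps to the stale cache entry. Recovery in that case comes from the client no longer treating the cached handle as expired, which is the situation the `#pendingSelfHealCheck` warning in `client.ts` reports.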

examples/burn/assets/package.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -50,7 +50,7 @@
     "eslint-plugin-prettier": "^5.4.0",
     "eslint-plugin-react-hooks": "^4.6.0",
     "eslint-plugin-react-refresh": "^0.4.6",
-    "prettier": "^3.2.4",
+    "prettier": "^3.6.2",
     "typescript": "^5.2.2",
     "vite": "^6.2.3"
   }
```

examples/redis/package.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -40,7 +40,7 @@
     "eslint-config-prettier": "^9.1.0",
     "eslint-plugin-prettier": "^5.1.3",
     "glob": "^10.3.10",
-    "prettier": "^3.3.2",
+    "prettier": "^3.6.2",
     "shx": "^0.3.4",
     "tsup": "^8.0.1",
     "tsx": "^4.19.1",
```

package.json

Lines changed: 2 additions & 2 deletions

```diff
@@ -33,9 +33,9 @@
   "lint-staged": {
     "*.{js,jsx,ts,tsx}": [
       "eslint --fix",
-      "prettier --write"
+      "node_modules/.bin/prettier --write"
     ],
-    "*.{json,css,md,yml,yaml}": "prettier --write"
+    "*.{json,css,md,yml,yaml}": "node_modules/.bin/prettier --write"
   },
   "pnpm": {
     "patchedDependencies": {
```

packages/experimental/package.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -19,7 +19,7 @@
     "eslint-plugin-prettier": "^5.1.3",
     "glob": "^10.3.10",
     "pg": "^8.12.0",
-    "prettier": "^3.3.2",
+    "prettier": "^3.6.2",
     "shx": "^0.3.4",
     "tsup": "^8.0.1",
     "typescript": "^5.5.2",
```

packages/react-hooks/package.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -26,7 +26,7 @@
     "glob": "^10.3.10",
     "jsdom": "^25.0.0",
     "pg": "^8.12.0",
-    "prettier": "^3.3.2",
+    "prettier": "^3.6.2",
     "react": "^18.3.1",
     "react-dom": "^18.3.1",
     "shx": "^0.3.4",
```

packages/start/package.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -35,7 +35,7 @@
     "eslint": "^8.57.0",
     "eslint-config-prettier": "^9.1.0",
     "eslint-plugin-prettier": "^5.1.3",
-    "prettier": "^3.3.2",
+    "prettier": "^3.6.2",
     "shx": "^0.3.4",
     "tsup": "^8.0.1",
     "typescript": "^5.5.2",
```

packages/typescript-client/SPEC.md

Lines changed: 37 additions & 16 deletions

```diff
@@ -66,6 +66,26 @@ Any ──markMustRefetch─► Initial (offset = -1)
 - `response` on Paused delegates to `previousState`, preserving the Paused wrapper for `accepted` and `stale-retry` transitions; `ignored` returns `this`
 - `response`/`messages`/`sseClose` on Error return `this` (ignored)
 
+## Server Assumptions
+
+Properties of the sync service that the client state machine depends on.
+
+### S0: Shape handles are unique and never reused
+
+The server generates handles as `{phash2_hash}-{microsecond_timestamp}`. Uniqueness
+is enforced by monotonic timestamps, a SQLite `UNIQUE INDEX` on the handle column,
+and ETS `insert_new` checks. Even after server restarts, old handles persist in
+SQLite and new ones receive fresh timestamps, so collisions cannot occur.
+
+**Implication for expired shapes cache**: Once a handle is marked expired (after a
+409 response), the server will never issue that handle again. If a response contains
+an expired handle, it must be coming from a caching layer (browser HTTP cache,
+CDN, or proxy) — not from the server itself.
+
+**Source**: `packages/sync-service/lib/electric/shapes/shape.ex` (`generate_id/1`),
+`packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex`
+(`shapes_handle_idx`).
+
 ## Invariants
 
 Properties that must hold after every state transition. Checked automatically by
@@ -346,25 +366,26 @@ This is enforced by the path-specific guards listed below. Live requests
 
 Six sites in `client.ts` recurse or loop to issue a new fetch:
 
-| # | Site | Line | Trigger | URL changes because | Guard |
-| --- | --------------------------------------- | ---- | ---------------------------------------------------------- | ----------------------------------------------------------------------------------- | ------------------------------------------------------- |
-| L1 | `#requestShape``#requestShape` | 940 | Normal completion after `#fetchShape()` | Offset advances from response headers | `#checkFastLoop` (non-live) |
-| L2 | `#requestShape` catch → `#requestShape` | 874 | Abort with `FORCE_DISCONNECT_AND_REFRESH` or `SYSTEM_WAKE` | `isRefreshing` flag changes `canLongPoll`, affecting `live` param | Abort signals are discrete events |
-| L3 | `#requestShape` catch → `#requestShape` | 886 | `StaleCacheError` thrown by `#onInitialResponse` | `StaleRetryState` adds `cache_buster` param | `maxStaleCacheRetries` counter in state machine |
-| L4 | `#requestShape` catch → `#requestShape` | 924 | HTTP 409 (shape rotation) | `#reset()` sets offset=-1 + new handle; or request-scoped cache buster if no handle | New handle from 409 response or unique retry URL |
-| L5 | `#start` catch → `#start` | 782 | Exception + `onError` returns retry opts | Params/headers merged from `retryOpts` | `#maxConsecutiveErrorRetries` (50) |
-| L6 | `fetchSnapshot` catch → `fetchSnapshot` | 1975 | HTTP 409 on snapshot fetch | New handle via `withHandle()`; or local retry cache buster if same/no handle | `#maxSnapshotRetries` (5) + cache buster on same handle |
+| # | Site | Line | Trigger | URL changes because | Guard |
+| --- | --------------------------------------- | ---- | ---------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
+| L1 | `#requestShape``#requestShape` | 940 | Normal completion after `#fetchShape()` | Offset advances from response headers | `#checkFastLoop` (non-live) |
+| L2 | `#requestShape` catch → `#requestShape` | 874 | Abort with `FORCE_DISCONNECT_AND_REFRESH` or `SYSTEM_WAKE` | `isRefreshing` flag changes `canLongPoll`, affecting `live` param | Abort signals are discrete events |
+| L3 | `#requestShape` catch → `#requestShape` | 886 | `StaleCacheError` thrown by `#onInitialResponse` | `StaleRetryState` adds `cache_buster` param; after max retries, self-healing clears expired entry + resets stream | `maxStaleCacheRetries` counter + `#expiredShapeRecoveryKey` (once per shape) |
+| L4 | `#requestShape` catch → `#requestShape` | 924 | HTTP 409 (shape rotation) | `#reset()` sets offset=-1 + new handle; or request-scoped cache buster if no handle | New handle from 409 response or unique retry URL |
+| L5 | `#start` catch → `#start` | 782 | Exception + `onError` returns retry opts | Params/headers merged from `retryOpts` | `#maxConsecutiveErrorRetries` (50) |
+| L6 | `fetchSnapshot` catch → `fetchSnapshot` | 1975 | HTTP 409 on snapshot fetch | New handle via `withHandle()`; or local retry cache buster if same/no handle | `#maxSnapshotRetries` (5) + cache buster on same handle |
 
 ### Guard mechanisms
 
-| Guard | Scope | How it works |
-| ----------------------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `#checkFastLoop` | Non-live `#requestShape` only | Detects N requests at same offset within a time window. First: clears caches + resets. Persistent: exponential backoff → throws FetchError(502). |
-| `maxStaleCacheRetries` | Stale response path (L3) | State machine counts stale retries. Throws FetchError(502) after 3 consecutive stale responses. |
-| `#maxSnapshotRetries` | Snapshot 409 path (L6) | Counts consecutive snapshot 409s. Adds cache buster when handle unchanged. Throws FetchError(502) after 5. |
-| `#maxConsecutiveErrorRetries` | `#start` onError retry (L5) | Counts consecutive error retries. Sends error to subscribers and tears down after 50. Reset on successful message batch. |
-| Pause lock | `#requestShape` entry | Returns immediately if paused. Prevents fetches during snapshots. |
-| Up-to-date exit | `#requestShape` entry | Returns if `!subscribe` and `isUpToDate`. Breaks loop for one-shot syncs. |
+| Guard | Scope | How it works |
+| ----------------------------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `#checkFastLoop` | Non-live `#requestShape` only | Detects N requests at same offset within a time window. First: clears caches + resets. Persistent: exponential backoff → throws FetchError(502). |
+| `maxStaleCacheRetries` | Stale response path (L3) | State machine counts stale retries. After 3 consecutive stale responses, clears expired entry and attempts one self-healing retry. Throws FetchError(502) if self-healing also fails. |
+| `#expiredShapeRecoveryKey` | Self-healing (L3 extension) | Records shape key after first self-healing attempt. Second exhaustion on same key skips self-healing → FetchError(502). Cleared on up-to-date. |
+| `#maxSnapshotRetries` | Snapshot 409 path (L6) | Counts consecutive snapshot 409s. Adds cache buster when handle unchanged. Throws FetchError(502) after 5. |
+| `#maxConsecutiveErrorRetries` | `#start` onError retry (L5) | Counts consecutive error retries. Sends error to subscribers and tears down after 50. Reset on successful message batch. |
+| Pause lock | `#requestShape` entry | Returns immediately if paused. Prevents fetches during snapshots. |
+| Up-to-date exit | `#requestShape` entry | Returns if `!subscribe` and `isUpToDate`. Breaks loop for one-shot syncs. |
 
 ### Coverage gaps
 
```
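The L3 row and the `maxStaleCacheRetries` guard can be traced end to end with a small model of one shape behind a caching layer. This is a hypothetical sketch, not the client's state machine: `runStaleRetryLoop` and `LoopResult` are illustrative, a `Map` stands in for the expired-shapes cache, `serve` plays the CDN, and the model assumes exhaustion happens on the stale response that follows `maxStaleCacheRetries` cache-buster retries (the commit message's "retries up to 3 times"). It leaves out `#expiredShapeRecoveryKey`: once the entry is deleted, a second exhaustion needs the handle to be marked expired again (for example by a later 409), which this model does not simulate.

```typescript
// Hypothetical model of the L3 loop for one shape; not the client's code.
// Assumptions: exhaustion happens on the stale response after
// `maxStaleCacheRetries` cache-buster retries, a Map stands in for the
// expired-shapes cache, and `serve` returns the handle the CDN sends back.

interface LoopResult {
  requests: number // total requests issued, including the first one
  acceptedHandle: string
  selfHealed: boolean
}

function runStaleRetryLoop(
  shapeKey: string,
  serve: (request: number) => string,
  expiredShapes: Map<string, string>,
  maxStaleCacheRetries = 3
): LoopResult {
  let staleRetries = 0
  let selfHealed = false
  // After exhaustion the entry is gone, so request maxStaleCacheRetries + 2
  // can never be classified as stale: the loop always returns by then.
  for (let request = 1; request <= maxStaleCacheRetries + 2; request++) {
    const handle = serve(request)
    // S0: the server never reissues a handle, so a match means a cached response.
    const isStale = expiredShapes.get(shapeKey) === handle
    if (!isStale) {
      return { requests: request, acceptedHandle: handle, selfHealed }
    }
    if (staleRetries < maxStaleCacheRetries) {
      staleRetries++ // retry with a fresh cache_buster
      continue
    }
    // Retries exhausted: clear the entry and reset before one more request.
    expiredShapes.delete(shapeKey)
    staleRetries = 0
    selfHealed = true
  }
  throw new Error('unreachable: the entry is cleared after exhaustion')
}

// A CDN that ignores cache busters and always serves the expired handle.
const alwaysStale = runStaleRetryLoop(
  'shape-a',
  () => 'handle-1',
  new Map([['shape-a', 'handle-1']])
)
console.log(alwaysStale.requests, alwaysStale.acceptedHandle, alwaysStale.selfHealed) // 5 handle-1 true
```

With a cache that always serves the expired handle, the model terminates after one self-healing request by accepting that cached handle rather than failing permanently. With a cache that honors the buster, it accepts the fresh handle on the second request and never self-heals.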

packages/typescript-client/package.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -25,7 +25,7 @@
     "glob": "^10.3.10",
     "jsdom": "^26.1.0",
     "pg": "^8.12.0",
-    "prettier": "^3.3.2",
+    "prettier": "^3.6.2",
     "shx": "^0.3.4",
     "tsup": "^8.0.1",
     "typescript": "^5.5.2",
```

packages/typescript-client/src/client.ts

Lines changed: 69 additions & 7 deletions

```diff
@@ -623,6 +623,8 @@ export class ShapeStream<T extends Row<unknown> = Row>
   #fastLoopMaxCount = 5
   #pendingRequestShapeCacheBuster?: string
   #maxSnapshotRetries = 5
+  #expiredShapeRecoveryKey: string | null = null
+  #pendingSelfHealCheck: { shapeKey: string; staleHandle: string } | null = null
   #consecutiveErrorRetries = 0
   #maxConsecutiveErrorRetries = 50
 
@@ -914,10 +916,11 @@ export class ShapeStream<T extends Row<unknown> = Row>
       }
 
       if (e instanceof StaleCacheError) {
-        // Received a stale cached response from CDN with an expired handle.
-        // The #staleCacheBuster has been set in #onInitialResponse, so retry
-        // the request which will include a random cache buster to bypass the
-        // misconfigured CDN cache.
+        // Two paths throw StaleCacheError:
+        // 1. Normal stale-retry: response handle matched expired handle,
+        // #staleCacheBuster set to bypass CDN cache on next request.
+        // 2. Self-healing: stale retries exhausted, expired entry cleared,
+        // stream reset — retry without expired_handle param.
         return this.#requestShape()
       }
 
@@ -1248,6 +1251,25 @@ export class ShapeStream<T extends Row<unknown> = Row>
       ? expiredShapesCache.getExpiredHandle(shapeKey)
       : null
 
+    // If this response is the first one after a self-healing retry, check
+    // whether the proxy/CDN returned the exact handle we just marked expired.
+    // If so, the client is about to accept stale data silently — loudly warn
+    // so operators can detect and fix the proxy misconfiguration.
+    if (this.#pendingSelfHealCheck) {
+      const { shapeKey: healedKey, staleHandle } = this.#pendingSelfHealCheck
+      this.#pendingSelfHealCheck = null
+      if (shapeKey === healedKey && shapeHandle === staleHandle) {
+        console.warn(
+          `[Electric] Self-healing retry received the same handle "${staleHandle}" that was just marked expired. ` +
+            `This means your proxy/CDN is serving a stale cached response and ignoring cache-buster query params. ` +
+            `The client will proceed with this stale data to avoid a permanent failure, but it may be out of date until the cache refreshes. ` +
+            `Fix: configure your proxy/CDN to include all query parameters (especially 'handle' and 'offset') in its cache key. ` +
+            `For more information visit the troubleshooting guide: ${TROUBLESHOOTING_URL}`,
+          new Error(`stack trace`)
+        )
+      }
+    }
+
     const transition = this.#syncState.handleResponseMetadata({
       status,
       responseHandle: shapeHandle,
@@ -1262,6 +1284,12 @@ export class ShapeStream<T extends Row<unknown> = Row>
 
     this.#syncState = transition.state
 
+    // Clear recovery guard on 204 (no-content), since the empty body means
+    // #onMessages won't run to clear it via the up-to-date path.
+    if (status === 204) {
+      this.#expiredShapeRecoveryKey = null
+    }
+
     if (transition.action === `accepted` && status === 204) {
       this.#consecutiveErrorRetries = 0
     }
@@ -1270,6 +1298,38 @@ export class ShapeStream<T extends Row<unknown> = Row>
       // Cancel the response body to release the connection before retrying.
       await response.body?.cancel()
       if (transition.exceededMaxRetries) {
+        if (shapeKey) {
+          // Clear the expired entry — keeping it only poisons future sessions.
+          expiredShapesCache.delete(shapeKey)
+
+          // Try one self-healing retry per shape: reset the stream and
+          // retry without the expired_handle param. Since handles are never
+          // reused (see SPEC.md S0), the fresh response will have a new
+          // handle and won't trigger stale detection.
+          if (this.#expiredShapeRecoveryKey !== shapeKey) {
+            console.warn(
+              `[Electric] Stale cache retries exhausted (${this.#maxStaleCacheRetries} attempts). ` +
+                `Clearing expired handle entry and attempting self-healing retry without the expired_handle parameter. ` +
+                `For more information visit the troubleshooting guide: ${TROUBLESHOOTING_URL}`,
+              new Error(`stack trace`)
+            )
+            this.#expiredShapeRecoveryKey = shapeKey
+            // Arm a post-self-heal check: if the next response comes back
+            // with the same handle we just marked expired, the proxy/CDN is
+            // still serving stale data and we'll warn loudly instead of
+            // accepting it silently.
+            if (shapeHandle) {
+              this.#pendingSelfHealCheck = {
+                shapeKey,
+                staleHandle: shapeHandle,
+              }
+            }
+            this.#reset()
+            throw new StaleCacheError(
+              `Expired handle entry evicted for self-healing retry`
+            )
+          }
+        }
         throw new FetchError(
           502,
           undefined,
@@ -1351,6 +1411,7 @@ export class ShapeStream<T extends Row<unknown> = Row>
         shapeKey,
         this.#syncState.liveCacheBuster
       )
+      this.#expiredShapeRecoveryKey = null
     }
   }
 
@@ -1770,9 +1831,10 @@ export class ShapeStream<T extends Row<unknown> = Row>
   #reset(handle?: string) {
     this.#syncState = this.#syncState.markMustRefetch(handle)
     this.#connected = false
-    // releaseAllMatching intentionally doesn't fire onReleased — it's called
-    // from within the running stream loop (#requestShape's 409 handler), so
-    // the stream is already active and doesn't need a resume signal.
+    // releaseAllMatching intentionally doesn't fire onReleased — every caller
+    // (#requestShape's 409 handler, #checkFastLoop, and stale-retry
+    // self-healing in #onInitialResponse) runs inside the active stream loop,
+    // so the stream is already active and doesn't need a resume signal.
     this.#pauseLock.releaseAllMatching(`snapshot`)
   }
 
```
17781840
