All notable changes to S3 Proxy will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- OTLP metrics: removed superfluous metrics: Dropped `cache.cache_hit_rate_percent`, `cache.ram_cache_hit_rate_percent`, `cache.total_requests`, `cache.ram_cache_max_size`, `cache.metadata_cache_max_entries`, `request_metrics.requests_per_second`, and `request_metrics.max_concurrent_requests`. Hit rates are derivable from hits/misses in CloudWatch metric math; max sizes and max concurrent are config constants that don't belong in time-series data; `requests_per_second` was a cumulative average (total/uptime) that trends toward zero over time, not a real rate.
- OTLP metrics: added health/error signals: Added `cache.corruption_metadata_total`, `cache.corruption_missing_range_total`, `cache.disk_full_events_total`, `cache.lock_timeout_total`, `cache.write_failures_total`, `cache.etag_mismatches_total`, `cache.range_invalidations_total`, and `cache.incomplete_uploads_evicted`. These were tracked internally but never exported, making cache health invisible in CloudWatch.
- OTLP metrics: added `cache.s3_requests_saved`: Exports the headline value-add metric (disk hits + metadata hits) that was shown in the dashboard but missing from OTLP.
- Dashboard: horizontally aligned flow rows: HEAD and GET flow rows (RAM card, arrows, Disk card, S3 card) now align across columns. Restructured the HTML to emit elements in row order inside a CSS grid rather than two independent flex columns, so each tier sits on the same horizontal baseline regardless of content height differences between columns.
- Dashboard: info-box isolation: Clicking ⓘ on a stat no longer expands identically-named stats in the other column (e.g. "Hit Rate" in HEAD was also opening "Hit Rate" in GET). Help state is now tracked by unique element ID instead of label text.
- Dashboard: per-prefix hit/miss stats: The Bucket and Prefix Overrides table now shows accurate per-prefix HEAD and GET hit rates. Previously, prefix rows duplicated the bucket-level totals. Stats are now tracked separately per prefix (keyed by `bucket/prefix`) and populated from a new `prefix_cache_stats` map in `MetricsManager`.
- Dashboard: fixed double-counting in whole-proxy HEAD/GET totals: `update_statistics` was called from internal cache lookup functions (`get_cached_response`, `get_range_data`, the HEAD miss path, the disk miss path, and a store path), causing the top-level counters to fire multiple times per HTTP request. Moved all `update_statistics` calls exclusively to the `http_proxy.rs` request handlers, consistent with where `record_bucket_cache_access` already lived. Whole-proxy totals now match bucket-level totals.
- Dashboard: fixed HEAD/GET miscounting in bucket and prefix stats: Several `record_bucket_cache_access` call sites in `http_proxy.rs` hardcoded `is_head = false`, causing HEAD cache hits to be counted as GET hits in the per-bucket and per-prefix stats. Fixed the RAM cache hit path, buffered range path, and coalescing waiter paths to pass `method == Method::HEAD`. Whole-proxy totals and bucket/prefix totals now agree on the HEAD/GET split.
- HEAD TTL prefix override not applied: `resolve_settings` in `cache.rs` passed the full cache key path (e.g. `/bucket/many/prefix/key`) to `BucketSettingsManager::resolve`, but prefix matching in `cascade` compared against just the object key. The bucket name prefix was never stripped, so `prefix_overrides` entries in `_settings.json` never matched. Fixed by stripping `/{bucket}/` from the path before calling `resolve`, consistent with how the tests and documentation define prefix patterns (e.g. `many/10x100M/`).
- Prefix override validation rejected valid prefixes: `BucketSettings::validate()` required all `prefix_overrides` entries to start with `/`, but after the path-stripping fix the object key has no leading slash. Removed the leading-slash requirement; only empty prefixes are now rejected.
- Dashboard: Disk Metadata hit rate inflated in HEAD column: `headDiskHitRate` used `headDiskHits + headMisses` as the denominator, but `metadata_cache.misses` counts all RAM misses (requests that reached disk), not just disk misses. The correct denominator is `headRamMisses` (all RAM misses = disk hits + S3 fetches). Also fixed `headTotal` (was `ramHits + diskHits + ramMisses`, double-counting disk hits), `headRamHitRate` (same double-count), and the S3 Fetch count (was showing RAM misses instead of `ramMisses - diskHits`).
- Dashboard: HEAD column inflated by GET metadata lookups: `metadata_cache.hits`, `misses`, and `disk_hits` were shared between HEAD requests (`get_head_cache_entry_unified`) and GET metadata prefetches (`get_metadata_cached`). Added separate `head_hits`, `head_misses`, and `head_disk_hits` counters incremented only from the HEAD path. The HEAD dashboard column now uses these HEAD-specific counters; the generic counters remain for internal tracking.
- IP distribution never activated: The background DNS refresh task was never started in `main.rs`, so `ip_distribution_enabled: true` had no effect; the `ConnectionPoolManager` always had an empty distributor and every request fell back to hostname-based forwarding. Added the background task using `pool_check_interval` (default 10s).
- DNS refresh could not bootstrap itself: `refresh_dns` only iterated endpoints already in `resolved_ips`, but `resolved_ips` was only populated by `refresh_endpoint_dns`. On startup it was always empty, making the refresh loop a no-op even if called. Added `register_endpoint`, which performs an immediate DNS resolve and seeds `resolved_ips`. Called from `try_forward_request` on the first miss for any new hostname.
- Health tracker failures not cleared on DNS refresh: When `refresh_endpoint_dns` restored IPs after a DNS cycle, stale failure counts for those IPs persisted in `IpHealthTracker`. A previously excluded IP could be immediately re-excluded on its first request after restoration. `S3Client::refresh_dns` now calls `health_tracker.clear()` after each successful pool refresh.
- Health check falsely reporting `Degraded` at startup: The connection pool health check marked the system `Degraded` whenever `ip_distributors` was empty, which is the normal state before any request arrives. It now only reports `Degraded` when an endpoint is registered but has zero IPs (DNS resolution failed for a known endpoint). An empty distributor at startup is `Healthy`.
- Dashboard: flow-chart layout for cache statistics: Replaced the flat 4-card grid with a two-column flow-chart showing HEAD and GET request paths separately. Each column shows RAM → Disk → S3 Fetch with hit/miss counts, hit rates, and flow arrows. Per-column totals and overall hit rate at the bottom. Overall Statistics section shows total requests, cached objects, Total Cache Size, Write Cache, S3 savings, and uptime.
- Dashboard: metadata cache disk hits counter: A new `disk_hits` metric tracks metadata lookups that missed RAM but were served from unexpired `.meta` files on disk. Previously these were invisible: counted as a RAM miss with no corresponding hit anywhere.
- Dashboard: per-bucket table HEAD TTL column: Added a HEAD TTL column to the Per-Bucket Cache Settings table. Previously only GET TTL was shown despite HEAD TTL being available in the API response and detail view.
- Dashboard: click-to-expand help text: Replaced hover-over `title` tooltips with ⓘ icons that toggle inline help text on click. Works on mobile and is more discoverable than hover tooltips.
- Dashboard: stale refreshes help text: Updated to clarify that stale refreshes count as RAM misses but may still be disk hits, rather than the previous incorrect "does not affect hit rate" wording.
- Dashboard: bucket overrides section redesign: Renamed "Per-Bucket Cache Settings" to "Bucket and Prefix Overrides". Flattened bucket-level and prefix-level overrides into one table showing Bucket, Prefix, HEAD hit rate ("x% of y"), GET hit rate ("x% of y"), with a "Settings" button that expands to show TTLs and cache flags inline. Removed redundant "Cache Statistics" and "Application Logs" h2 headings.
- Per-bucket cache hit/miss recording for HEAD requests: HEAD cache hits and misses now call `record_bucket_cache_access` and `update_statistics`, fixing the per-bucket counters that were always zero for HEAD-heavy workloads.
- Rate-limited S3 forwarding error logs: All "Failed to forward request to S3" error paths now route through a single rate-limited helper that emits at most one log line per 60 seconds with an occurrence count and the most recent request's URI, method, and error. Previously, 10+ call sites used direct `error!()` calls that spammed logs during S3 connectivity issues. The helper uses `try_lock` on a `Mutex` to store the latest example without blocking the hot path.
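The count-everything / emit-rarely / never-block shape of that helper can be sketched with std primitives. Everything here is illustrative: the type name `RateLimitedLog` and method `record` are hypothetical, not the proxy's actual API.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical sketch of a rate-limited error reporter: every occurrence is
/// counted, the latest example is stored via `try_lock` (never blocking the
/// hot path), and a summary is emitted at most once per interval.
struct RateLimitedLog {
    count: AtomicU64,
    last_emit: Mutex<Option<Instant>>,
    latest: Mutex<Option<String>>,
    interval: Duration,
}

impl RateLimitedLog {
    fn new(interval: Duration) -> Self {
        Self {
            count: AtomicU64::new(0),
            last_emit: Mutex::new(None), // None = never emitted, emit is due
            latest: Mutex::new(None),
            interval,
        }
    }

    /// Returns Some(summary) when an emit is due, None when suppressed.
    fn record(&self, example: String) -> Option<String> {
        self.count.fetch_add(1, Ordering::Relaxed);
        // Best effort: skip storing the example if another thread holds the lock.
        if let Ok(mut latest) = self.latest.try_lock() {
            *latest = Some(example);
        }
        let mut last = self.last_emit.try_lock().ok()?;
        let due = last.map_or(true, |t| t.elapsed() >= self.interval);
        if due {
            *last = Some(Instant::now());
            let n = self.count.swap(0, Ordering::Relaxed);
            let example = self.latest.try_lock().ok().and_then(|mut e| e.take());
            return Some(format!("{} occurrences; latest: {:?}", n, example));
        }
        None
    }
}
```

The real helper would hand the summary to `error!()`; here it is returned so the behavior is observable.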
- Cached objects counter not reconciled when size drift is zero: The daily validation scan only called `update_size_from_validation` (which corrects `cached_objects`) when the scanned size differed from the tracked size. With accumulator-based size tracking producing zero drift, the object count was never corrected; it accumulated double-counts from multi-instance consolidation and OOM restarts (1,076k tracked vs 691k actual). Now always reconciles `cached_objects` from the validation scan's `.meta` file count regardless of size drift.
- Parallel validation scan: Replaced the sequential `WalkDir` with parallel L1 shard directory traversal using rayon. The previous approach used a single-threaded directory walk feeding into parallel file processing via `par_bridge()`, bottlenecked by sequential NFS `readdir` calls. Now enumerates the L1 directories upfront and walks each in parallel, overlapping NFS round-trips across rayon threads. Measured improvement from ~35 min to ~10 min for 691k objects on EFS.
- Stale HEAD-only metadata cleanup during daily validation: During the daily consistency validation scan, HEAD-only `.meta` files (no cached object data) that have been expired for more than 1 day are now automatically removed. These entries have no body data to serve and waste disk space and NFS I/O on every scan.
- Consolidation lock: try-lock instead of acquire-with-retry: Per-key metadata lock acquisition in `consolidate_object_with_files` now uses a single non-blocking attempt (`try_acquire_lock`) instead of exponential backoff with up to 5 retries (`acquire_lock`). If the lock is held by another instance, the key is skipped and retried next cycle. Eliminates ~150ms worst-case backoff per contended key, improving throughput under multi-instance contention.
- Journal cleanup: file-level skip optimization: `cleanup_consolidated_entries` now tracks which journal files were seen during discovery. Files not in the discovery set (e.g., created after discovery started) are skipped entirely without any I/O. Files where no entries were consolidated are also skipped. Only files with consolidated entries are re-read and rewritten. Reduces cleanup I/O from O(total_journal_size) to O(files_with_consolidated_entries).
- Discovery tracks per-file entry counts: `discover_pending_cache_keys_indexed_capped` now counts total parseable entries per journal file during the discovery pass (no extra I/O; it piggybacks on the existing read). This metadata is passed to cleanup for file-level optimization decisions.
- Dashboard: S3 Requests Saved counter: New "S3 Requests Saved" metric in the "Overall Statistics" dashboard section shows the total number of GET and HEAD requests served from cache instead of forwarding to S3. Displayed below the existing "S3 Transfer Saved" (bytes) counter.
- OOM kills from discovery reading all journal files past the key cap: The `discover_pending_cache_keys_indexed_capped` change to track per-file entry counts removed the inner-loop `break` when the key cap was reached. Instead of stopping mid-file at 5000 keys, discovery continued parsing and deserializing every JSON line in every journal file to get accurate entry counts. With 840 MB of stale journal files from previous crashes, this allocated hundreds of MB of `JournalEntry` objects per 5-second cycle, causing RSS to grow to 30 GB before the OOM kill. Restored the inner-loop `break`; entry counts for partially read files will be incomplete, but cleanup falls through to entry-by-entry matching for those files. Also added `cleanup_dead_instance_journals()` during `initialize()` to remove journal files from dead PIDs (checked via `kill(pid, 0)`), preventing stale journal accumulation after crashes.
- Cached objects counter not incrementing for the HEAD-then-consolidate pattern: The counter only incremented when consolidation created a new `.meta` file. When the HEAD handler created a HEAD-only `.meta` (no ranges) first, consolidation saw the file already existed and skipped the increment. Now checks whether the metadata had zero ranges before consolidation and counts adding the first range as a new cached object.
- Redundant NFS stat in consolidation: Removed the `metadata_path.exists()` call that preceded `load_or_create_metadata`; the range count is now checked from the already-loaded metadata, eliminating one NFS round-trip per consolidated key.
- Consolidation "zero progress" under high small-object load: The 30s cycle deadline previously wrapped `buffer_unordered().collect()`, discarding all completed work when the deadline fired. With 100k+ pending keys, every cycle discovered all keys, processed none before the 30s timeout, and discarded the batch, resulting in zero progress indefinitely. Replaced with incremental `stream.next()` polling against a deadline so completed keys are counted, cleaned up, and logged even when the deadline fires.
- Unbounded journal discovery causing NFS I/O waste: Added a `max_keys_per_cycle` (default 5000) cap to discovery. Stops reading journal files once the cap is reached (including mid-file), reducing discovery from O(100k) NFS reads to O(5k). Without this, discovery of 66k+ keys consumed 20s+ of the 30s deadline, leaving only seconds for actual key processing (~192 keys/cycle). With the cap, cycles process up to 5000 discovered keys within the deadline.
- Discovery eating into the processing deadline: Moved the 30s deadline to before the discovery phase so the total cycle (discovery + processing + cleanup) is bounded. Previously discovery ran outside the deadline and could consume most of the wall-clock time.
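The key difference between timeout-wrapping a whole batch and polling incrementally is that the latter keeps work completed before the deadline. A minimal synchronous sketch (the real code polls an async `stream.next()`; `process_until_deadline` and its arguments are hypothetical names):

```rust
use std::time::Instant;

/// Sketch of deadline-aware incremental processing: instead of wrapping the
/// whole batch in one timeout (which discards completed work when it fires),
/// check the deadline per item and stop there. Items past the deadline are
/// left for the next cycle; items already processed stay counted.
fn process_until_deadline<T>(
    items: impl IntoIterator<Item = T>,
    deadline: Instant,
    mut process: impl FnMut(T),
) -> (usize, usize) {
    let mut done = 0;
    let mut deferred = 0;
    for item in items {
        if Instant::now() >= deadline {
            deferred += 1; // retried next cycle, not discarded work
            continue;
        }
        process(item);
        done += 1; // counted even if the deadline fires on a later item
    }
    (done, deferred)
}
```

With a generous deadline everything completes; with a deadline already in the past, nothing is processed but the deferral is visible instead of silently discarded.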
- Noisy `trust_dns_proto` warnings: Suppressed `trust_dns_proto` WARN logs (e.g., "failed to associate send_message response to the sender") by setting the crate to ERROR level. These are benign DNS multiplexing artifacts under high concurrency.
- Cached objects counter: NFS contention and batching: Moved `increment_cached_objects` from per-key (one NFS lock + read + write per new object) to a single batched call after the cycle completes. Reduces NFS lock operations from N to 1 per cycle. The counter still only increments for genuinely new objects (first `.meta` file creation); re-consolidation of existing objects does not double-count.
- HEAD handler overwriting consolidated ranges via the NFS attribute cache: `store_head_cache_entry_unified` used `metadata_path.exists()` to decide whether to update or create a `.meta` file. On NFS, `exists()` can return `false` due to attribute caching even when the file was recently written by consolidation on another instance. This caused the HEAD handler to create a HEAD-only `.meta` (empty ranges) that overwrote the consolidated version, losing cached range data. Replaced with a direct `read_from_disk` attempt that bypasses the NFS attribute cache.
- MetadataCache caching HEAD-only entries without ranges: The HEAD handler stored metadata with empty ranges in the MetadataCache (RAM). Any subsequent range lookup hitting that RAM entry within the 5s refresh window would see `ranges=0` and miss, even if the disk `.meta` had been updated by consolidation with range data. Fixed by not caching empty-ranges metadata in RAM; HEAD-only entries are written to disk but not stored in the MetadataCache. Metadata is only cached in RAM once consolidation adds ranges, ensuring range lookups always benefit from the cache.
- Dashboard: Moved "Total Cached Objects" from "Disk Cache: Object Ranges" to "Overall Statistics" section.
- Log levels: Downgraded consolidation cycle deadline, S3 request retry, and per-key consolidation failure messages from WARN to INFO — these are expected operational behavior under load, not error conditions. Rate-limited the "Request limit exceeded, returning 429" warning to once per minute to reduce log noise during burst traffic.
- Range clamping for oversized range requests: When a client requests a range larger than the object (e.g., `Range: bytes=0-52428799` for a 10-byte object), S3 returns only the available bytes. The proxy now clamps the cache range end to match the actual data received instead of failing with a validation error.
- S3 forward error logging: Rate-limited "Failed to forward request to S3" errors to once per minute with an occurrence count. Added the URI and method to the error message for debugging.
- Validation metadata: Fixed `metadata_files_scanned` always reporting 0 in `validation.json`.
- Removed the `max_keys_per_run` cap: The consolidator now processes all discovered pending cache keys each cycle, relying on the 30-second `consolidation_cycle_timeout` as the sole backpressure mechanism. The previous 50-key-per-cycle limit was removed. Note: under very high small-object load (100k+ pending keys), this caused the timeout to fire before any keys completed; addressed in 1.9.4 with discovery capping and incremental streaming.
- `KEY_CONCURRENCY_LIMIT` raised from 8 to 64: Increases parallelism for NFS-latency-bound per-key consolidation, overlapping I/O round-trips for higher throughput.
- HashSet cleanup optimization: `cleanup_consolidated_entries` now uses a `HashSet` for O(1) per-entry matching instead of an O(m) linear scan, reducing cleanup from O(n·m) to O(n).
- Consolidation cycle O(N²) journal scan eliminated: `discover_pending_cache_keys` now builds a `HashMap<cache_key, Vec<PathBuf>>` index in a single pass over all journal files. Each key's consolidation then reads only the files that contain entries for that key, instead of re-scanning all journal files for every key.
- Unused `mut` warning in `initialize()`: Removed the spurious `mut` on the `state` binding in the `Ok` arm of `load_size_state()`.
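The single-pass index can be sketched as below. The function name and the pre-parsed `(file, key)` pairs are illustrative stand-ins; the real discovery parses journal lines from disk in file order.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Sketch of the single-pass journal index: map each cache key to the journal
/// files that mention it, so per-key consolidation reads only those files
/// instead of re-scanning every journal file for every key.
fn build_key_index(
    entries: &[(PathBuf, String)], // (journal file, cache key), in file order
) -> HashMap<String, Vec<PathBuf>> {
    let mut index: HashMap<String, Vec<PathBuf>> = HashMap::new();
    for (file, key) in entries {
        let files = index.entry(key.clone()).or_default();
        // Files are scanned one at a time, so duplicates for a key within the
        // same file arrive adjacently; listing each file once is enough.
        if files.last() != Some(file) {
            files.push(file.clone());
        }
    }
    index
}
```

Lookup is then O(1) per key, and consolidation for a key touches only `index[key]` files.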
- Startup scan skip when validation is fresh: On warm restart (validation ran within 23h), Phase 2 now loads the cache size from `size_state.json` instead of walking all `.meta` files on EFS. Eliminates the slow initialization scan on every proxy restart.
- Cold startup reconciles size_state.json: On cold restart (validation stale >23h), the full metadata scan result is written to `size_state.json` via `update_size_from_validation`. Eviction decisions use accurate size immediately rather than waiting for the next daily validation.
- Dashboard: Total Cached Objects metric: A new "Total Cached Objects" counter in the "Disk Cache: Object Ranges" dashboard section shows the number of distinct S3 objects (unique cache keys) currently stored on disk. Tracked in `SizeState.cached_objects`, incremented when consolidation writes a new metadata file, decremented when eviction deletes a metadata file, and recalculated from the `.meta` file count on startup (upgrade path) and during daily validation scans.
- RwLock for ConnectionPoolManager: Replaced `Mutex` with `RwLock` across `S3Client`, `CustomHttpsConnector`, and all downstream consumers. The hot path (`get_distributed_ip`, `get_hostname_for_ip`) now acquires a read lock, eliminating per-request serialization. Write locks are only taken for DNS refresh and IP exclusion.
- Idle timeout 30s → 55s: Default `idle_timeout` increased to 55s to align with S3's ~60s server-side timeout, reducing premature connection eviction from hyper's pool.
- TCP keepalive via socket2: New connections apply `SO_KEEPALIVE` (idle=15s, interval=5s, retries=3) before the TLS handshake. Dead connections are detected at the TCP layer before hyper tries to reuse them.
- TCP receive buffer tuning: `SO_RCVBUF` is set to 256KB by default on new connections for improved large-object throughput. Configurable via `tcp_recv_buffer_size`.
- IP health tracking with automatic exclusion: A new `IpHealthTracker` records consecutive failures per IP. After 3 failures (configurable via `ip_failure_threshold`), the IP is removed from the round-robin distributor. DNS refresh (every 60s) restores excluded IPs automatically.
- Eager IpDistributor initialization: `endpoint_overrides` distributors are now initialized at construction time instead of lazily on first request, enabling `get_distributed_ip(&self)` without `&mut self`.
- Shadow connection pool: Removed `ConnectionPool`, `Connection`, `HealthMetrics`, `PerformanceMetrics`, `ConnectionPriority`, `ConnectionSelectionCriteria`, `LoadBalancingStrategy`, `DnsResolutionCache`, `IpAddressInfo`, and all associated methods (`get_connection`, `get_or_create_connection`, `release_connection`, `select_best_ip`, `calculate_ip_score`, `get_expired_connections`, `cleanup_idle_connections`, `close_all_connections`, `monitor_connection_health`, `get_multiple_connections`, `get_health_metrics`). These tracked phantom state disconnected from hyper's actual connection pool.
- New config fields: `keepalive_idle_secs`, `keepalive_interval_secs`, `keepalive_retries`, `tcp_recv_buffer_size`, `ip_failure_threshold`
- Dependency: `socket2 = "0.5"` (TCP socket option configuration)
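The new fields might look like this in the config file. The field names and the stated defaults (15s/5s/3, 256 KiB, threshold 3) come from this release's entries, but the exact nesting and comments are illustrative; the shipped `config.example.yaml` is authoritative.

```yaml
# Illustrative values only; field names are the ones added in this release.
keepalive_idle_secs: 15       # SO_KEEPALIVE idle time before the first probe
keepalive_interval_secs: 5    # interval between keepalive probes
keepalive_retries: 3          # failed probes before the connection is dropped
tcp_recv_buffer_size: 262144  # SO_RCVBUF in bytes (256 KiB default)
ip_failure_threshold: 3       # consecutive failures before an IP is excluded
```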
- Revalidation 403/401 no longer invalidates cache: When S3 returns 403 Forbidden or 401 Unauthorized during TTL-expiry revalidation, the proxy returns the error to the client without removing cached data. A credentials failure is not a data change — cached data remains valid for other authorized callers.
- Non-streaming PUT missing S3 response headers in cache: The non-streaming PUT cache path stored an empty `response_headers` map in metadata. S3 response headers (`x-amz-server-side-encryption`, `x-amz-version-id`, checksums, etc.) are now captured and stored, matching the signed PUT handler behavior. Checksum headers from the request are merged as a fallback.
- PUT write-cache ignores bucket-level `put_ttl` override: The non-streaming PUT cache path (`store_write_cache_entry`) used the global `put_ttl` instead of resolving per-bucket settings. Bucket-level `put_ttl` overrides in `_settings.json` now apply correctly.
- ETag-based cache revalidation: TTL-expired objects now send `If-None-Match` (ETag) alongside `If-Modified-Since` during revalidation. Closes the stale-data window when two writes to the same key occur within one second (identical `Last-Modified` timestamps).
- HEAD-triggered range invalidation: When a HEAD response returns a different ETag or content-length than cached, all cached ranges for that key are cleared immediately. Prevents serving stale range data after object overwrites.
- PUT response ETag capture: The non-streaming PUT handler now captures ETag from S3 response headers instead of request headers, ensuring correct ETag is stored in cache metadata.
- PERF logging actually moved to DEBUG: Fixed the v1.7.9 sed command that failed to match the multiline `info!(\n "PERF` pattern. All 9 PERF log lines now correctly use the `debug!` macro.
- Idle consolidation detection: Consolidation cycle skips entirely when no pending work (zero accumulator deltas and no journal files). Reduces metadata IOPS to near-zero during idle periods.
- Backpressure-aware TeeStream: Replaced `try_send` (non-blocking, drops chunks when the channel is full) with `send().await` via a stored future in `poll_next`. When the cache write channel is full, the stream applies backpressure to the S3 response, slowing the client download to match disk write speed. Guarantees zero dropped chunks; every range is fully cached on first download. Client speed is unaffected when disk can keep up.
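The semantic difference between the two send modes is the whole fix, and it can be shown with a bounded std channel. This is not the TeeStream code; the real version does the blocking asynchronously by storing the `send()` future inside `poll_next`, but a `sync_channel` exhibits the same drop-vs-backpressure contrast.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Sketch: `try_send` drops data when the channel is full, while a blocking
/// `send` applies backpressure to the producer until the consumer drains.
fn demo_backpressure() -> (bool, Vec<u8>) {
    let (tx, rx) = sync_channel::<u8>(1); // capacity 1 = a "full" cache channel
    tx.send(1).unwrap(); // fills the only slot

    // Non-blocking path: the chunk is silently lost (the old behavior).
    let dropped = matches!(tx.try_send(2), Err(TrySendError::Full(_)));

    // Backpressure path: spawn a consumer, then block in send() until it
    // drains a slot (the new behavior; no chunk is ever lost).
    let consumer = thread::spawn(move || {
        let mut got = Vec::new();
        while let Ok(v) = rx.recv() {
            got.push(v);
        }
        got
    });
    tx.send(3).unwrap(); // blocks briefly until the consumer reads `1`
    drop(tx); // close the channel so the consumer loop ends
    (dropped, consumer.join().unwrap())
}
```

Chunk `2` never reaches the consumer (it was `try_send`-dropped), while chunk `3` always does.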
- Large file cache regression: Reverted the signed range and unsigned range cache write paths from `IncrementalRangeWriter` (chunk-by-chunk with RwLock contention) back to buffered accumulation (`Vec` + a single `store_range`). Fixes 5GB files caching 0-32% of ranges on first download. Root cause: hundreds of concurrent `tokio::spawn` tasks contending for the `DiskCacheManager` write lock during commit, causing task starvation. The full GET path retains `IncrementalRangeWriter` (single task, no contention).
- PERF logging moved to DEBUG level: Request-level PERF timing lines now log at DEBUG instead of INFO. Set `log_level: "debug"` to enable. Reduces log noise in production while keeping the diagnostic capability available.
- IP distribution enabled by default: `ip_distribution_enabled` now defaults to `true`. Per-IP connection pools are active out of the box.
- Cache backpressure logging: Replaced per-chunk "Cache channel full" warnings with a single summary at stream end showing the dropped chunks/bytes count. Size-mismatch commit failures downgraded to DEBUG (the backpressure warning already covers them). The error message now explains the root cause.
- Per-IP connection pool distribution: A new `ip_distribution_enabled` config option rewrites request URI authorities to individual S3 IP addresses, causing hyper to create separate connection pools per IP. Distributes load across all DNS-resolved IPs using round-robin selection. Preserves TLS SNI and the Host header for SigV4 compatibility. Falls back to hostname-based routing when no IPs are available.
- IP distribution observability: Per-IP connection counts in the health check endpoint, info-level logging for IP lifecycle events (DNS refresh, health exclusion, exclusion expiry), and debug-level logging of the selected IP per request.
- IP distribution configuration: `max_idle_per_ip` (default 10, range 1-100) controls idle connections per IP pool. Works with both DNS-resolved IPs and static `endpoint_overrides`.
- Always stream S3 responses: Removed the 1 MiB streaming threshold; all S3 responses now stream regardless of size. Default `allow_streaming` changed from `false` to `true`. Eliminates buffering delay for all response sizes.
- Signed range streaming: Signed range requests (AWS CLI) now stream S3 responses directly to client instead of buffering the entire range in memory. This was the root cause of ~200 MB/s cache miss throughput — each 8 MiB range was fully downloaded before any bytes reached the client.
- Request-level PERF timing: INFO-level `PERF` log lines on every GET data path showing a timing breakdown (`ram_lookup_ms`, `metadata_ms`, `disk_open_ms`, `stream_setup_ms`, `s3_fetch_ms`, `data_load_ms`). Grep with `journalctl -u s3-proxy | grep PERF` to diagnose throughput bottlenecks.
- Dashboard active requests: Read active connections counter directly from the atomic instead of cached metrics, so the dashboard shows real-time values immediately.
- Concurrent requests metric: Active requests counter (`active_requests / max_concurrent_requests`) exposed on the dashboard header, `/api/system-info`, `/metrics` JSON, and OTLP/CloudWatch.
- Streaming decompression for cache hits: `stream_range_data` now uses `FrameDecoder` with chunked reads in a `spawn_blocking` task, yielding decompressed data through an mpsc channel instead of materializing the full range in memory.
- Stream-to-disk caching for cache misses: Background cache writes use the new `IncrementalRangeWriter` to compress and write chunks as they arrive from S3, eliminating full-range accumulation in memory.
- Async file I/O: `load_range_data` uses `tokio::fs::read()` instead of blocking `std::fs::read()`, preventing NFS latency from stalling tokio worker threads.
- Connection pool default increase: `max_idle_per_host` default raised from 10 to 100, validation range widened from 1–50 to 1–500.
- Streaming chunk size: Default chunk size increased from 512 KiB to 1 MiB for better throughput.
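The chunked-yield pattern behind the streaming decompression change can be sketched with std primitives. This is not the proxy's code: the real path wraps a `FrameDecoder` in `spawn_blocking` with a tokio mpsc channel, while this sketch uses a plain `Read` source, a thread, and a bounded sync channel to show the same memory-bounding idea.

```rust
use std::io::Read;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Sketch of chunked streaming: a blocking reader thread yields fixed-size
/// chunks through a small bounded channel instead of materializing the whole
/// range in memory. A blocked `send` is backpressure on the reader.
fn stream_chunks<R: Read + Send + 'static>(
    mut reader: R,
    chunk_size: usize,
) -> Receiver<Vec<u8>> {
    let (tx, rx) = sync_channel(4); // small buffer bounds peak memory use
    thread::spawn(move || {
        let mut buf = vec![0u8; chunk_size];
        loop {
            match reader.read(&mut buf) {
                Ok(0) | Err(_) => break, // EOF or read error ends the stream
                Ok(n) => {
                    if tx.send(buf[..n].to_vec()).is_err() {
                        break; // receiver dropped, stop producing
                    }
                }
            }
        }
    });
    rx
}
```

The consumer just iterates the receiver; at most `4 * chunk_size` bytes are ever buffered, regardless of range size.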
- Per-request memory documentation: `config.example.yaml` and `docs/CONFIGURATION.md` document per-request memory usage (~5 MiB) with a sizing formula and example calculations for `max_concurrent_requests`.
- Configurable log retention: Separate `access_log_retention_days` and `app_log_retention_days` settings (default 30, range 1–365) allow independent control over access log and application log disk usage.
- Background log cleanup task: Spawns a periodic cleanup task at startup (`log_cleanup_interval`, default 24h, range 1h–7d). Runs immediately on startup, then at each interval. Deletes expired files, removes empty date-partition directories, logs results, and continues on I/O errors. Application log cleanup is hostname-scoped for safe multi-instance shared storage.
- Access log file rotation: `access_log_file_rotation_interval` (default 5m, range 1m–60m) consolidates access log flushes within a time window into the same file, reducing small-file proliferation under low traffic.
- Dead code cleanup: Removed ~4,000 lines of unreachable code across 20 modules: 147 public functions, 2 enums, and cascading private functions/types/imports that were only called by the removed code. No behavioral changes; all removed code was verified unreachable from both `main.rs` and the test suite.
- `start_max_lifetime_task`: Removed the unwired background task for connection max lifetime enforcement. Updated CONNECTION_POOLING.md to reflect that the `max_lifetime` config is accepted but not actively enforced (hyper's idle timeout handles connection rotation).
- Access log `source_region` field: Added `source_region` as the 25th field in S3 server access log records, matching the current AWS S3 log format spec. Always emits `-` since the proxy cannot determine the request origin region (PrivateLink, Direct Connect, and non-AWS IPs are also `-` in real S3 logs).
- Updated `bytes` 1.11.0 → 1.11.1 (CVE-2026-25541: integer overflow in `BytesMut::reserve`)
- Updated `time` 0.3.44 → 0.3.47 (CVE-2026-25727: stack exhaustion in RFC 2822 parsing)
- `endpoint_overrides` config option: Static hostname-to-IP mappings that bypass DNS resolution for S3 endpoints. Useful for S3 PrivateLink deployments where the proxy cannot use DNS to resolve S3 endpoints to PrivateLink ENI IPs (e.g., on-prem without Route 53 Resolver inbound endpoints). Works for both HTTP (connection pool) and HTTPS (TCP passthrough) traffic. Load-balances across multiple IPs per hostname.
- Updated the PrivateLink documentation in GETTING_STARTED.md and CONFIGURATION.md to document `endpoint_overrides` as an alternative to Route 53 Resolver
- Added an `endpoint_overrides` example to `config/config.example.yaml`
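A hostname-to-IPs mapping of this kind might look like the following. The exact schema is an assumption here (the shipped `config/config.example.yaml` is authoritative); the hostname and IPs are placeholders.

```yaml
# Hypothetical shape; see config/config.example.yaml for the shipped schema.
endpoint_overrides:
  s3.eu-west-1.amazonaws.com:
    - 10.0.1.11   # e.g. PrivateLink ENI IPs, load-balanced across entries
    - 10.0.2.12
```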
- Multipart upload part isolation: Parts from concurrent multipart uploads to the same S3 key with different upload IDs no longer overwrite each other. Parts are now stored in upload-specific directories (`mpus_in_progress/{upload_id}/part{N}.bin`) instead of the shared `ranges/` directory. On CompleteMultipartUpload, parts are moved to their final `ranges/` location with byte-offset names. Cleanup (abort/expiration) is simplified to a single `remove_dir_all()`.
- Removed the `range_file_path` field from the `CachedPartInfo` struct (the path is now deterministic from upload_id + part_number)
- Simplified `cleanup_multipart_upload()` and `cleanup_incomplete_multipart_cache()` to a single directory removal
- Simplified incomplete upload eviction in `cache_size_tracker`, `cache.rs`, and `write_cache_manager`
- Path-style AP/MRAP alias SigV4 signature preservation: Path-style access point alias requests (e.g., `--endpoint-url http://s3-accesspoint.eu-west-1.amazonaws.com` with the alias in the path) are now forwarded to S3 without host or path rewriting. Previously, the proxy reconstructed a virtual-hosted upstream host and stripped the alias from the path, which broke the AWS SigV4 signature (signed for the original host/path). S3 handles path-style AP routing natively; the alias in the first path segment provides correct cache key namespacing without rewriting.
- Journal consolidation TtlRefresh/AccessUpdate validation: Object-level journal operations (TtlRefresh, AccessUpdate) are now validated by checking metadata file existence instead of range file existence. Previously, these operations used dummy range coordinates (0-0) which never matched actual range files, causing consolidation to skip them entirely.
- Test suite JournalConsolidator initialization: Removed erroneous `CacheManager.initialize()` calls from ~25 test files that don't set up a JournalConsolidator. Fixed `eviction_buffer_test` to use `new_with_shared_storage` with the correct `max_cache_size_limit`. Fixed flock-based lock release assertions in `global_eviction_lock_test`.
- Cache key namespace collision: Access point cache key folders now include the AWS reserved suffixes (`-s3alias` for regional APs, `.mrap` for MRAPs) to prevent collision with S3 bucket names. Previously, bare AP/MRAP identifiers could match bucket names, causing cross-namespace cache collisions.
- Path-style AP alias support: Requests with Host `s3-accesspoint.{region}.amazonaws.com` and an AP alias (ending in `-s3alias`) in the first path segment are now detected. The proxy reconstructs the correct upstream host, strips the alias from the forwarded path, and uses the alias as the cache key folder.
- Path-style MRAP alias support: Requests with Host `accesspoint.s3-global.amazonaws.com` and an MRAP alias (ending in `.mrap`) in the first path segment are now detected. The proxy reconstructs the upstream host (stripping `.mrap` from the hostname), strips the alias from the forwarded path, and uses the alias as the cache key folder.
- AP/MRAP documentation updates: Updated `docs/CACHING.md` with the reserved suffix approach, path-style alias detection, and the known ARN-vs-alias cache key divergence limitation. Updated `docs/GETTING_STARTED.md` with AP alias and MRAP alias usage examples.
- Graceful shutdown now fully wired: The cache manager and connection pool were never registered with the shutdown coordinator, making cache lock release (Step 3) and connection pool closure (Step 4) dead code during shutdown. Both are now wired via `set_cache_manager()` and `set_connection_pool()`.
- HTTP/HTTPS/TCP proxy accept loops are shutdown-aware: All proxy `start()` methods now accept a `ShutdownSignal` and use `tokio::select!` to break the accept loop on shutdown. Previously, these infinite loops were killed by task cancellation with no cleanup.
- HTTP proxy drains in-flight connections on shutdown: After stopping the accept loop, the HTTP proxy waits up to 5 seconds for active connections (tracked via an `active_connections` counter) to complete before returning.
- Health and metrics servers are shutdown-aware: Both servers now accept a `ShutdownSignal` and break their accept loops on shutdown, matching the existing dashboard server pattern.
- Background tasks stop cleanly on shutdown: The cache hit update buffer flush and journal consolidation background tasks now listen for the shutdown signal and break their loops. The cache hit buffer performs a final flush before stopping.
- Process waits for shutdown coordinator to complete: `main()` now awaits the shutdown coordinator task instead of using a `tokio::select!` that could exit before teardown finished.
- Shutdown coordinator type alignment: `ShutdownCoordinator` now uses `Arc<CacheManager>` and `Arc<Mutex<ConnectionPoolManager>>` to match the actual types used throughout the system, instead of the previously mismatched `Arc<RwLock<...>>` wrappers.
- Access point and MRAP cache key collisions: Cache keys for S3 Access Point and Multi-Region Access Point (MRAP) requests are now prefixed with the access point identifier extracted from the Host header. Regional AP requests (`{name}-{account_id}.s3-accesspoint.{region}.amazonaws.com`) use `{name}-{account_id}/` as the prefix. MRAP requests (`{mrap_alias}.accesspoint.s3-global.amazonaws.com`) use `{mrap_alias}/` as the prefix. Previously, all access point requests with the same object path produced identical cache keys regardless of which access point they came from, causing cross-access-point data collisions. Regular path-style and virtual-hosted-style requests are unaffected.
- Access point documentation: Updated `docs/CACHING.md` with access point cache key prefixing details. Updated `docs/GETTING_STARTED.md` with DNS routing, `--endpoint-url` usage, and hosts file / Route 53 configuration for access points and MRAPs.
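As an illustration of the Host-header prefixing rule above, a minimal sketch in Rust; `cache_key_prefix` is a hypothetical helper name, not the proxy's actual API:

```rust
/// Illustrative sketch (not the proxy's real code): derive the cache key
/// namespace prefix from the Host header for AP / MRAP requests.
fn cache_key_prefix(host: &str) -> Option<String> {
    // Regional AP: {name}-{account_id}.s3-accesspoint.{region}.amazonaws.com
    if host.contains(".s3-accesspoint.") && host.ends_with(".amazonaws.com") {
        let ap_id = host.split('.').next()?;
        return Some(format!("{ap_id}/"));
    }
    // MRAP: {mrap_alias}.accesspoint.s3-global.amazonaws.com
    if let Some(alias) = host.strip_suffix(".accesspoint.s3-global.amazonaws.com") {
        return Some(format!("{alias}/"));
    }
    // Regular path-style / virtual-hosted requests: no AP prefix.
    None
}

fn main() {
    assert_eq!(
        cache_key_prefix("myap-123456789012.s3-accesspoint.eu-west-1.amazonaws.com").as_deref(),
        Some("myap-123456789012/")
    );
    assert_eq!(
        cache_key_prefix("abcd1234efgh.mrap.accesspoint.s3-global.amazonaws.com").as_deref(),
        Some("abcd1234efgh.mrap/")
    );
    assert_eq!(cache_key_prefix("my-bucket.s3.eu-west-1.amazonaws.com"), None);
}
```

The MRAP alias value shown is made up; real aliases are assigned by AWS.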
- S3 error responses no longer cached: The streaming GET path attempted to cache S3 error responses (403, 500, etc.) as if they were object data, causing "data size mismatch" errors in `store_range`. The error body (~8 KB of XML) was collected and passed to `store_range`, which rejected it due to a size mismatch with the expected range. The proxy now checks `status.is_success()` before setting up the TeeStream cache channel, matching the existing behavior in the buffered response path.
- Object-level cache expiration: Expiration is now tracked at the object level (`NewCacheMetadata.expires_at`) instead of per-range (`RangeSpec.expires_at`). All cached ranges of the same object share a single freshness state. Simplifies expiration checks and TTL refresh after 304 responses.
- Removed per-range expires_at: The `expires_at` field, `is_expired()`, and `refresh_ttl()` methods are removed from `RangeSpec`. Eviction fields (`last_accessed`, `access_count`, `frequency_score`) remain per-range.
- Expiration check API: `check_range_expiration(cache_key, start, end)` replaced by `check_object_expiration(cache_key)`. `refresh_range_ttl(cache_key, start, end, ttl)` replaced by `refresh_object_ttl(cache_key, ttl)`.
- Metadata read failures treated as expired (security fix): If the proxy cannot read or deserialize metadata during an expiration check, it now treats the cached data as expired and forwards the request to S3. Previously, metadata read errors were silently treated as "not expired," which could serve stale or unauthorized data — particularly dangerous with `get_ttl=0` buckets.
- Correct TTL in all metadata creation paths: Metadata created by the hybrid metadata writer and journal consolidator now uses the resolved per-bucket TTL instead of a ~100-year sentinel. Journal entries carry `object_ttl_secs` so consolidation creates metadata with the correct `expires_at`. Orphan recovery uses `Duration::ZERO` (force revalidation) since the original TTL is unknown.
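The fail-closed behavior of the security fix above can be sketched as follows; `Metadata` and `is_expired` are illustrative stand-ins, not the proxy's actual types:

```rust
use std::time::{Duration, SystemTime};

/// Illustrative stand-in for the cached object metadata.
struct Metadata {
    expires_at: SystemTime,
}

/// Fail-closed expiration check: any metadata read or deserialize error is
/// treated as "expired", so the request is revalidated against S3 instead
/// of being served from possibly stale or unauthorized cached data.
fn is_expired(meta: Result<Metadata, std::io::Error>, now: SystemTime) -> bool {
    match meta {
        Ok(m) => now >= m.expires_at,
        // Fail closed: unreadable or corrupt metadata counts as expired.
        Err(_) => true,
    }
}

fn main() {
    let now = SystemTime::now();
    let fresh = Metadata { expires_at: now + Duration::from_secs(60) };
    assert!(!is_expired(Ok(fresh), now));
    let read_error = std::io::Error::new(std::io::ErrorKind::NotFound, "missing .meta");
    assert!(is_expired(Err(read_error), now));
}
```

The old behavior mapped the error arm to `false` ("not expired"), which is exactly what made it unsafe for `get_ttl=0` buckets.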
- Zero-TTL bypass removal: Removed three bypass blocks in `http_proxy.rs` that skipped cache lookup when `get_ttl=0` or `head_ttl=0`. Zero-TTL requests now go through the normal cache flow with immediate expiration and conditional revalidation via `If-Modified-Since`, enabling 304 bandwidth savings.
- Full-object GET expiration checking: Added an expiration check to the full-object GET path (previously missing). Cached full-object data is now validated with S3 before serving when expired, matching the existing range request behavior.
- TTL refresh after 304 uses resolved per-bucket TTL: The TTL refresh after a 304 Not Modified response now uses the resolved per-bucket `get_ttl` instead of the global `config.cache.get_ttl`. This prevents zero-TTL bucket data from being refreshed with the global TTL (e.g., 10 years).
- Documentation: Updated the `docs/CACHING.md` "Zero TTL Revalidation" section to describe the correct behavior. Added a "Settings Apply at Cache-Write Time" subsection. Added a cache-write-time note to `docs/CONFIGURATION.md`.
- Bucket-level cache settings: Per-bucket and per-prefix cache configuration via `_settings.json` files at `cache_dir/metadata/{bucket}/_settings.json`. Configure TTLs, read/write caching, compression, and RAM cache eligibility per bucket with hot reload (no proxy restart). Settings cascade: Prefix → Bucket → Global.
- Zero TTL revalidation: `get_ttl: "0s"` caches data on disk but revalidates with S3 on every request. Saves bandwidth on 304 Not Modified responses.
- Read cache control: `read_cache_enabled: false` makes the proxy act as a pure pass-through for GET requests (no disk or RAM caching). Supports an allowlist pattern with global `read_cache_enabled: false` and per-bucket overrides.
- Per-bucket metrics: `bucket_cache_hit_count` and `bucket_cache_miss_count` counters for buckets with `_settings.json` files.
- Dashboard bucket stats table: Sortable table with per-bucket hit/miss stats, resolved settings, and expandable prefix overrides. `/api/bucket-stats` API endpoint.
- JSON schema: `docs/bucket-settings-schema.json` for IDE validation of `_settings.json` files.
- Example settings files: Six example configurations in `docs/examples/`.
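The Prefix → Bucket → Global cascade can be sketched as an option-chaining resolution; `Layer` and `resolve_get_ttl` are hypothetical names for illustration, not the proxy's actual settings types:

```rust
/// Illustrative sketch of the settings cascade. In the real proxy these
/// layers come from _settings.json files; here they are plain structs.
#[derive(Clone, Copy)]
struct Layer {
    get_ttl_secs: Option<u64>,
}

/// Most specific layer wins: prefix, then bucket, then the global default.
fn resolve_get_ttl(prefix: Option<Layer>, bucket: Option<Layer>, global_secs: u64) -> u64 {
    prefix
        .and_then(|l| l.get_ttl_secs)
        .or_else(|| bucket.and_then(|l| l.get_ttl_secs))
        .unwrap_or(global_secs)
}

fn main() {
    let bucket = Some(Layer { get_ttl_secs: Some(300) });
    let prefix = Some(Layer { get_ttl_secs: Some(0) }); // zero TTL: revalidate every request
    assert_eq!(resolve_get_ttl(prefix, bucket, 3600), 0); // prefix wins
    assert_eq!(resolve_get_ttl(None, bucket, 3600), 300); // bucket next
    assert_eq!(resolve_get_ttl(None, None, 3600), 3600); // global fallback
}
```

A setting left unset at one layer (the `None` arms) falls through rather than overriding, which is what makes the allowlist pattern above work.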
- Dashboard renamed: "S3 Proxy Dashboard" → "S3 Hybrid Cache".
- Compression control: Per-bucket `compression_enabled` setting. `CompressionAlgorithm::None` variant for uncompressed storage.
- TTL overrides: Removed the `ttl_overrides` YAML config and `TtlOverride` struct. Replaced by bucket-level cache settings.
- Percentage metrics renamed: `cache_hit_rate` → `cache_hit_rate_percent`, `ram_cache_hit_rate` → `ram_cache_hit_rate_percent`, `success_rate` → `success_rate_percent` in `/metrics` JSON and OTLP. Makes units explicit in metric names.
- RAM cache always compresses: The RAM cache now uses LZ4 compression regardless of the global `compression.enabled` flag, saving memory for compressible data even when disk compression is disabled.
- OTLP_METRICS.md updated: Replaced stale placeholder metric names with the actual metric names matching the `/metrics` JSON API.
- Dead code: `extract_path_from_cache_key` in the RAM cache (unused after the compression change).
- Request metrics always zero: `record_request()` was never called from the HTTP proxy, so `request_metrics.total_requests`, `successful_requests`, `failed_requests`, `average_response_time_ms`, and `requests_per_second` were always 0 in `/metrics` JSON, OTLP, and CloudWatch. Added a `record_request()` call at the end of `handle_request()` with the actual elapsed time and success/failure status.
- Per-tier cache metrics in /metrics and OTLP: The `/metrics` JSON endpoint and OTLP export now include RAM cache stats (`ram_cache_hits`, `ram_cache_misses`, `ram_cache_evictions`, `ram_cache_max_size`), metadata cache stats (`metadata_cache_hits`, `metadata_cache_misses`, `metadata_cache_entries`, `metadata_cache_max_entries`, `metadata_cache_evictions`, `metadata_cache_stale_refreshes`), and `bytes_served_from_cache`. These match the dashboard's per-tier breakdown — all three surfaces (dashboard, `/metrics`, OTLP) now use the same data source.
- OTLP metric names match /metrics JSON API: OTLP gauge names now use the JSON field path from the `/metrics` endpoint (e.g. `cache.cache_hits`, `coalescing.waits_total`, `request_metrics.total_requests`). Removed the `cache_type` dimension on `cache.size` — each size field is a separate metric (`cache.total_cache_size`, `cache.read_cache_size`, `cache.write_cache_size`, `cache.ram_cache_size`). All metrics share only the resource attributes (`host.name`, `service.name`, `service.version`).
- RAM cache serving corrupted data for non-compressible files: `compress_data_content_aware_with_fallback` returned `was_compressed = false` for content types that skip compression (zip, jpg, etc.), even though the data was wrapped in LZ4 frame format with uncompressed blocks. On RAM cache retrieval, the `compressed: false` flag caused the LZ4 frame bytes to be served directly to clients without decompression, triggering `AWS_ERROR_S3_RESPONSE_CHECKSUM_MISMATCH`. The flag now correctly returns `true` whenever data is in frame format, since `FrameDecoder` is always needed to unwrap it.
- Broken conditional request validation in S3 client: `parse_http_date()` in `s3_client.rs` always returned `SystemTime::now()` instead of parsing the date string, making `If-Modified-Since` and `If-Unmodified-Since` comparisons meaningless. Now uses the `httpdate` crate (already used in `cache.rs`).
- Dead per-IP request metrics in S3 client: Removed `connection_ip` from `S3Response`, `record_request_success()`, `record_request_failure()`, and `extract_ip_from_error()`. These fed per-IP health metrics in the pool manager, but the feedback loop was broken — Hyper's opaque connection pool prevented accurate IP attribution, so success metrics were always attributed to `0.0.0.0` and failure metrics to `127.0.0.1`. The pool manager's DNS resolution and IP selection for new connections (via `CustomHttpsConnector`) continue to work correctly.
- Real OTLP metrics export: Replaced the placeholder OTLP exporter with a working implementation using the OpenTelemetry SDK. Exports cache, request, connection pool, compression, coalescing, and process metrics to any OTLP-compatible collector (CloudWatch Agent, Prometheus, OpenTelemetry Collector) via HTTP protobuf. Enable with `metrics.otlp.enabled: true` and point `endpoint` to your collector.
- OpenTelemetry dependencies on Linux/macOS: OpenTelemetry crates were accidentally scoped under `[target.'cfg(windows)'.dependencies]`, preventing compilation on non-Windows platforms. Moved to the main `[dependencies]` section. Added the `rt-tokio` feature to `opentelemetry_sdk` for async periodic export.
- Dashboard disk cache misses always showing 0: The disk cache miss count was calculated as `get_misses - ram_misses`, but RAM misses include both "miss RAM, hit disk" and "miss RAM, miss disk" cases, making `ram_misses >= get_misses` and the subtraction always 0. Disk misses now correctly use `get_misses` directly, since every overall cache miss is also a disk miss.
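A tiny worked example of why the old subtraction always bottomed out at zero; the counter values are made up for illustration:

```rust
fn main() {
    // Made-up counters: 100 GETs missed RAM; of those, 60 then hit disk
    // and 40 missed disk too (went to S3).
    let get_misses: u64 = 40; // overall cache misses
    let ram_misses: u64 = 100; // includes the 60 "miss RAM, hit disk" cases

    // Old (buggy) computation: ram_misses >= get_misses always holds,
    // so the saturating subtraction is always 0.
    assert_eq!(get_misses.saturating_sub(ram_misses), 0);

    // Fixed: every overall miss is also a disk-tier miss.
    let disk_misses = get_misses;
    assert_eq!(disk_misses, 40);
}
```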
- BREAKING: LZ4 frame format migration: All cached data now uses LZ4 frame format with a content checksum (xxHash-32) for integrity verification on every cache read. The existing cache must be flushed before upgrading (`rm -rf cache_dir/*`). Old block-format `.bin` files are not compatible with the new frame decoder.
- Simplified versionId handling: Requests with `?versionId=` bypass the cache entirely (no cache read, no cache write). Removes the previous version-matching logic that compared cached `x-amz-version-id` headers. The bypass metric reason is unified to `versioned_request`.
- Non-compressible data uses frame format: Content-aware compression now wraps non-compressible data (JPEG, PNG, etc.) in LZ4 frame format with uncompressed blocks instead of storing raw bytes. All `.bin` files use frame format regardless of compressibility.
- Compression when globally disabled: When `compression.enabled: false`, data is still wrapped in LZ4 frame format with uncompressed blocks for integrity checksums.
- Signed DELETE cache invalidation: `aws s3 rm` (signed DELETE with SigV4) now invalidates the proxy cache on success. Previously, only unsigned DELETE requests triggered cache invalidation.
- `--compression-enabled` CLI flag: Use the `COMPRESSION_ENABLED` env var or `compression.enabled` config option instead.
- `CompressionAlgorithm::None` variant: All cached data uses LZ4 frame format. The `None` variant is removed; metadata records `Lz4` for all entries.
- `get_cached_version_id()` method: Dead code after the versionId bypass simplification.
- Proxy identification header: Adds a `Referer` header (`s3-hybrid-cache/{version} ({hostname})`) to requests forwarded to S3. Appears in S3 Server Access Logs for usage tracking and per-instance debugging. Skips injection when the header already exists or is included in the SigV4 `SignedHeaders`. Configurable via `server.add_referer_header` (default: `true`).
- Cache hit/miss statistics accuracy: Coalescing waiter paths now correctly record cache hits when serving from cache and cache misses only when falling back to S3. Previously, all requests entering the coordination path were counted as misses regardless of outcome.
- Validation scan: streaming parallel processing: The daily validation scan no longer collects all `.meta` file paths into a `Vec` before processing. It uses `WalkDir` as a streaming iterator with rayon's `par_bridge()` to process files in parallel as they are discovered. Memory usage is O(rayon_threads) instead of O(total_files). Atomic counters accumulate results lock-free. Progress is logged every 100K files. Scales to PB-sized caches with hundreds of millions of metadata files.
- Coalescing waiters re-fetching from S3: After a fetcher completed, waiters called `forward_get_head_to_s3_and_cache`, which always goes to S3 — defeating the purpose of coalescing. Waiters now try the cache first via `serve_from_cache_or_s3`, only falling back to S3 on a cache miss. Part-number waiters now try `lookup_part` before falling back. This eliminates redundant S3 fetches and the associated size over-counting from duplicate `store_range` calls.
- Size tracking: persistent dedup across flush windows: The `add_range` dedup `HashSet` was cleared every 5 seconds on flush, allowing the same range to be counted again in the next window. The dedup set now persists until the daily validation scan resets it.
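The persistent dedup described above can be sketched as follows; the `SizeAccumulator` shape and method names here are illustrative, not the proxy's actual API:

```rust
use std::collections::HashSet;

/// Illustrative sketch: a (cache_key_hash, start, end) set that survives
/// flush windows instead of being cleared every 5 seconds.
struct SizeAccumulator {
    pending_delta: i64,
    seen: HashSet<(u64, u64, u64)>, // persists across flushes
}

impl SizeAccumulator {
    fn add_range(&mut self, key_hash: u64, start: u64, end: u64, bytes: i64) {
        // Only the first sighting of a range counts toward the size delta.
        if self.seen.insert((key_hash, start, end)) {
            self.pending_delta += bytes;
        }
    }

    fn flush(&mut self) -> i64 {
        // The real code writes the delta to a per-instance delta file here.
        // Crucially, `seen` is NOT cleared; only the daily scan resets it.
        std::mem::take(&mut self.pending_delta)
    }
}

fn main() {
    let mut acc = SizeAccumulator { pending_delta: 0, seen: HashSet::new() };
    acc.add_range(7, 0, 1023, 1024);
    assert_eq!(acc.flush(), 1024);
    // Same range re-added in a later flush window: no double count.
    acc.add_range(7, 0, 1023, 1024);
    assert_eq!(acc.flush(), 0);
}
```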
- RAM cache auto-disabled when get_ttl=0: When `get_ttl` is set to `0s`, the RAM data cache is automatically disabled during config loading. The RAM cache has no TTL check on the hit path and would serve stale data, bypassing the per-request S3 validation that `get_ttl=0` requires. The MetadataCache (for `.meta` object metadata) remains active regardless.
- Cross-instance size over-counting: Before adding to the size accumulator, the proxy checks whether the range file already exists on disk. If another instance already cached the same range on shared storage, the size increment is skipped. Reduces stampede over-counting from 23× to near-accurate. The `exists()` check is essentially free on NFS with `lookupcache=pos` (positive lookups are cached).
- Stale range data after PUT overwrite: When an object is overwritten via PUT, the proxy now invalidates all cached range data (RAM and disk) for that cache key. Previously, old range files from prior GET requests survived the overwrite and could be served to clients, causing checksum mismatches. The fix adds prefix-based RAM cache invalidation (`invalidate_by_prefix`) to remove all `{cache_key}:range:*` entries, and ensures the metadata cache is refreshed after storing new PUT data.
- Stampede size tracking (same instance): Range request waiters now recompute cache overlap after the fetcher completes instead of reusing the stale overlap from before the wait. This prevents waiters from re-fetching from S3 and double-counting size via `accumulator.add()` for data already cached by the fetcher.
- Stampede size tracking (cross instance): `SizeAccumulator` now deduplicates range writes within each flush window (~5 seconds) using a `(cache_key_hash, start, end)` set. When multiple instances write the same range to shared storage, only the first write per flush window increments the size delta. The dedup set is cleared on flush. Existing `add()` and `subtract()` paths are unchanged.
- Dashboard property tests: Updated tests to match the current API — removed the reference to the deleted `cache_effectiveness` field, added a `DashboardConfig` parameter to `ApiHandler::new`, and added `cache_stats_refresh_ms` and `logs_refresh_ms` fields to `SystemInfoResponse` initializers.
Closed off edge cases around part uploads and downloads, accelerated cache hits for Get Part requests, handled potential signing of the Range header, and optimized parallel requests for the same cache miss.
- Download coordination (coalescing): When multiple concurrent requests arrive for the same uncached resource, only one request fetches from S3 while the others wait. Covers full-object GETs, range requests (signed and unsigned), and part-number requests. Waiters serve from cache after the fetcher completes, reducing redundant S3 fetches. Configurable via `download_coordination.enabled` (default: `true`) and `download_coordination.wait_timeout_secs` (default: 30s).
- Coalescing metrics: New metrics track download coordination effectiveness: `waits_total`, `cache_hits_after_wait_total`, `timeouts_total`, `s3_fetches_saved_total`, `average_wait_duration_ms`, `fetcher_completions_success`, `fetcher_completions_error`. Exposed via the `/metrics` endpoint.
- Part ranges storage: Multipart object parts now store exact byte ranges (`part_ranges: HashMap<u32, (u64, u64)>`) instead of assuming uniform part sizes. Enables accurate cache lookups for objects with variable-sized parts.
- CompleteMultipartUpload filtering: During multipart completion, only parts referenced in the request are retained. Unreferenced cached parts are deleted. ETag validation ensures cached parts match the request.
- Content-Range parsing: GET responses with a `partNumber` parameter now parse the `Content-Range` header to store accurate byte ranges for external objects (not uploaded through the proxy).
- ETag mismatch handling: When storing a range with a different ETag than existing cached data, the proxy now invalidates existing ranges and caches the new data instead of returning an error. This handles object overwrites gracefully.
- Range modification documentation: Updated config comment to clarify dual-mode design: range consolidation applies only to unsigned requests; signed requests preserve exact Range headers for signature validity.
- Request delay behavior: Removed the 5-second sleep and 503 retry mechanism for concurrent part requests. Replaced by InFlightTracker-based download coordination which is more efficient and doesn't block requests.
- Dashboard disk cache hit rate: Disk cache stats now subtract RAM cache hits/misses from the overall totals, showing disk-tier-only performance. Previously the disk section displayed combined RAM+disk numbers.
- Dashboard overall stats: Removed redundant "Cache Hit Rate" from overall statistics section. RAM and disk hit rates are shown separately in their respective sections.
Stabilized multi-instance size tracking, fixed over-eviction race conditions, added streaming disk cache and parallel NFS operations for performance, improved dashboard accuracy, and reduced log noise. Shared-storage cache coordination is fully operational.
- Dead code cleanup: Removed 2 unused modules (`streaming_tee`, `performance_logger`), 4 unused functions, 1 deprecated method with zero callers, 9 unused struct fields, and their associated test file. Fixed an incorrect `#[allow(dead_code)]` on `DiskCacheManager.write_cache_enabled` (the field is actually used). Zero behavior change.
- Lock file cleanup logging: Downgraded "Failed to remove lock file on drop" from `warn!` to `debug!` when the error is `NotFound`. On shared NFS storage, another instance may have already cleaned up the lock file — this is expected, not an error.
- Range validation with zero content_length: When cached metadata has `content_length: 0` (not yet populated), the proxy passed `Some(0)` to range parsing, which rejected every range as "Start position exceeds content length". It now treats `content_length == 0` as unknown and skips range validation, forwarding to S3 instead.
- Range parse error logging: Downgraded "Invalid range specification" from `warn!` to `debug!` since the proxy correctly forwards these to S3. Added `content_length` context to the log. Added `cache_key` to the forwarding-to-S3 debug message.
- Eviction stale file handle recovery (complete): Extended ESTALE recovery to cover both `open()` and `lock_exclusive()` calls during batch eviction lock acquisition. Previously only `open()` was retried; now the full open+lock sequence is retried once on stale NFS file handles.
- Eviction log deduplication: Downgraded inner batch eviction lock failure messages (`disk_cache` and `BATCH_EVICTION`) from `warn!` to `debug!`. The top-level `EVICTION_ERROR` remains at `warn!`, eliminating triple-logging of the same error.
- Cache initialization coordinator size mismatch: The `CacheConfig` passed to `CacheInitializationCoordinator` had a hardcoded 1 GB `max_cache_size` instead of reading the actual configured value from `inner.statistics.max_cache_size_limit`. This caused incorrect "Cache over capacity" warnings at startup when the configured limit differed from 1 GB.
- Log noise reduction (continued): Downgraded two remaining range-miss `warn!` messages to `debug!`: "Range file missing (will fetch from S3)" in `disk_cache.rs` and "Range spec not found for streaming" in `http_proxy.rs`. These are normal cache miss scenarios with graceful fallback, not operational concerns.
- Eviction stale file handle recovery: Batch delete lock acquisition now recovers from stale NFS file handles (ESTALE/os error 116) by deleting the stale lock file and retrying once, preventing eviction from getting stuck when lock files have invalid handles on shared storage.
- Log noise reduction: Downgraded "Range file missing" (fetching from S3), "Range file missing for streaming" (falling back to buffered), "Failed to create stream for range" (fallback), and "Eviction freed no ranges" from `warn!` to `debug!`. These are normal cache miss / eviction scenarios with graceful recovery, not operational concerns.
- Dashboard: Disk Revalidated metric: Renamed "Stale Refreshes" to "Disk Revalidated" and changed from raw count to percentage of total metadata lookups. Updated tooltip to accurately describe the TTL-based revalidation mechanism.
- RAM Cache Range Fix — Streaming Path: The streaming path (`serve_range_from_cache`) bypassed the RAM cache entirely for ranges >= `disk_streaming_threshold` (1 MiB). It never checked RAM, never promoted disk hits to RAM, and never recorded RAM hit/miss statistics. Added `get_range_from_ram_cache` and `promote_range_to_ram_cache` methods to CacheManager. The streaming path now checks the RAM cache before disk I/O (serving hits as buffered 206 responses), collects streamed chunks on disk hits and promotes them to the RAM cache after completion (skipping promotion for ranges exceeding `max_ram_cache_size`), and records RAM cache hits/misses for dashboard statistics from both streaming and buffered paths.
- Dashboard: Metadata Cache Hit/Miss Counters: The dashboard was using `head_hits`/`head_misses` from CacheManager statistics (which were never incremented for HEAD hits) instead of the MetadataCache's own hit/miss counters. Switched to `metadata_cache.metrics()` counters, which are correctly tracked.
- Streaming Disk Cache Hit Counter: The streaming range cache hit path (`serve_range_from_cache`) was not calling `update_statistics`, so disk cache hits for ranges >= `disk_streaming_threshold` (1 MiB) were not counted. The dashboard showed a near-zero hit rate despite hundreds of streaming hits per second in the logs.
- Dashboard: RAM Metadata Cache Card: Renamed title from "Metadata Cache" to "RAM Metadata Cache", updated subtitle to "In-memory cache for .meta objects (HEAD + GET)", corrected tooltips to say "metadata lookups" instead of "HEAD requests". Added "Cached Entries" line showing current/max entries.
- Over-Eviction Race Condition: Sequential evictions read stale `size_state.json`, causing the cache to drop to ~37% instead of the target 80%. Both eviction paths now update `size_state.json` directly under the eviction lock via `flush_and_apply_accumulator`. Write-path eviction (`evict_if_needed`) now re-reads the size after lock acquisition and uses a configurable trigger threshold.
- Dashboard: Tooltip Descriptions on All Stats: Every stat item in the cache statistics dashboard now shows a descriptive tooltip on hover, explaining what the metric means and how it's calculated.
- Dashboard: Configurable Refresh Intervals: JavaScript now reads `cache_stats_refresh_ms` and `logs_refresh_ms` from the `/api/system-info` endpoint, so YAML config values actually drive the dashboard refresh behavior instead of hardcoded 5s/10s.
- Dashboard: RAM Cache Subtitle: Changed from "Metadata and data" to "Object range data (GET responses)" to accurately reflect that the RAM cache stores GET response body data, not metadata.
- Dashboard: Metadata Cache Card Title: Changed from "Disk Cache: Object Metadata" to "Metadata Cache" with subtitle "HEAD request hit/miss tracking" — it's a RAM cache, not disk, and the hit/miss stats track HEAD requests specifically.
- Dashboard: Total Disk Size Label: Changed "Read Cache Size" to "Total Disk Size" — the underlying value (`size_state.total_size`) includes both read and write cache. Write cache size is shown separately as a subset.
- Dashboard: Write Cache Merged into Disk Cache Card: Removed the separate Write Cache tile. Write cache size now displays inside the "Disk Cache: Object Ranges" card alongside total disk size.
- Dashboard: Write Cache Description: Updated from "PUT operations (multipart uploads)" to accurately describe that write cache holds MPUs in progress and PUT objects not yet read via GET.
- Dashboard: Renamed WriteCacheStats.entries to evicted_uploads: The API field was misleadingly named `entries` but actually contained `incomplete_uploads_evicted`. Renamed to `evicted_uploads` for accuracy.
- Dashboard: Stale Refreshes Displayed: Metadata cache stale refreshes are now shown in the UI (previously API-only).
- Dashboard: Overall Stats Labels: "Total Requests" renamed to "Cache Requests (GET + HEAD)" with combined count. "Cache Effectiveness" renamed to "Cache Hit Rate" with clarifying tooltip that it reflects GET operations only.
- Dashboard Documentation Rewrite: Fixed the concurrent connection limit (50, not 10), removed the Docker section, documented `/api/logs` query parameters and `/api/system-info` response fields, and added text search feature documentation.
- Metadata Pass-Through in handle_range_request: Load metadata once via `get_metadata_cached()` and pass it through the call chain (`has_cached_ranges`, `find_cached_ranges`, `serve_range_from_cache`). NFS reads per cache hit reduced from ~5 to ~1.
- Skip Full-Object Cache Check for Large Files: When `content_length` exceeds `full_object_check_threshold` (default 64 MiB), skip the full-object cache check and proceed directly to the range-specific lookup. Avoids scanning hundreds of cached ranges unnecessarily.
- Connection Pool max_idle_per_host Default 1→10: Keeps more idle TLS connections alive to S3, reducing handshake overhead during burst cache misses.
- Consolidation Cycle Timeout: The per-key processing phase in `run_consolidation_cycle()` enforces a configurable timeout (default 30s). On timeout, it logs the unprocessed key count and proceeds to delta collection and eviction. Unprocessed keys retry next cycle.
- Streaming Range Data from Disk Cache: Cached ranges at or above `disk_streaming_threshold` (default 1 MiB) are streamed in 512 KiB chunks instead of loaded fully into memory. LZ4-compressed ranges are decompressed first, then streamed. RAM cache hits continue to serve from memory.
- Logging: Demoted 9 High-Volume INFO Sites to DEBUG
  - Per-chunk "Range stored (hybrid)" in `disk_cache.rs`
  - Per-entry "SIZE_TRACK: Add COUNTED/SKIPPED" and "SIZE_TRACK: Remove COUNTED" in `calculate_size_delta()`
  - Per-key "Object metadata journal consolidation completed"
  - Per-entry "Removing journal entry for evicted range" and "Removing stale journal entry"
  - Per-call "Atomic size subtract", "Atomic size add", and "Atomic size add (non-blocking)"
- Consolidation: KEY_CONCURRENCY_LIMIT Increased from 4 to 8
  - Processes up to 8 cache keys concurrently via `buffer_unordered(8)`
  - Reduces wall-clock consolidation time when many keys have few entries each
- Eviction: Batched Journal Writes by Cache Key
  - `write_eviction_journal_entries()` groups entries by `cache_key` using a HashMap
  - New `append_range_entries_batch()` method writes all entries for a key in a single file operation
  - On batch failure for a key, logs a warning and continues with the remaining keys
  - Produces identical journal format to individual `append_range_entry()` calls
- Dead Code: Removed `calculate_size_delta()` and Related Tests
  - Removed `calculate_size_delta()` function (superseded by accumulator-based size tracking in v1.1.33)
  - Removed `ConsolidationResult::success_with_size_delta()` constructor (never called)
  - Removed `create_journal_entry_with_size()` helper and 12 `test_calculate_size_delta_*` unit tests
  - Removed `prop_calculate_size_delta_correctness` property test
  - Cleaned up stale comments referencing `calculate_size_delta`
- Updated `docs/CACHING.md`: Eviction triggers section describes accumulator-based size tracking instead of the journal-based approach
- Updated `docs/ARCHITECTURE.md`: Module organization table matches actual `src/` contents; accumulator-based size tracking section verified
- Updated `docs/CONFIGURATION.md`: Cache size tracking section describes the accumulator-based approach with per-instance delta files
- Consolidation: Parallel Cache Key Processing
  - Consolidation cycle now processes up to 4 cache keys concurrently via `buffer_unordered(4)`
  - Reduces wall-clock consolidation time when individual keys hit NFS latency spikes
  - Per-key locks are independent flock-based locks — no contention between concurrent keys
- Eviction: Re-read Size After Acquiring Global Lock
  - `enforce_disk_cache_limits_internal()` re-reads `current_size` after acquiring the global eviction lock
  - Skips eviction if a previous instance's eviction already brought the cache under the limit
  - Prevents over-eviction caused by stale size snapshots taken before lock acquisition
- Eviction: Immediate Accumulator Flush After Eviction
  - Calls `size_accumulator.flush()` after eviction completes but before releasing the global eviction lock
  - Ensures the eviction subtract delta is written to a delta file promptly
  - Next consolidation cycle collects the delta and updates `size_state.json` before another instance can evict
- Logging: Reduced SIZE_ACCUM Verbosity
  - `SIZE_ACCUM add` and `SIZE_ACCUM subtract` log level changed from INFO to DEBUG
  - `SIZE_ACCUM flush`, `collect`, and `collect_total` remain at INFO
  - Reduces log volume by thousands of lines per download test
- **Size Tracking: NFS Stale Read in Delta Collection**
  - Root cause of the 120 MiB (8.6%) size tracking gap identified: NFS stale reads during cross-instance delta file collection
  - Changed from an additive read-modify-write of a single per-instance delta file to append-only unique files per flush
  - Each `flush()` creates `delta_{instance}_{sequence}.json`; no existing file is ever read, which eliminates the stale read race
  - `collect_and_apply_deltas()` unchanged; it already iterates all `delta_*.json` files and deletes them after reading
  - Directory stays bounded: ~3 files per instance between collections (5s flush interval, 5s consolidation interval)
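The append-only scheme can be sketched with plain `std::fs`. This is a simplified model, not the project's code: the real delta files hold JSON, while here each file holds a bare integer, and the function names are illustrative.

```rust
use std::fs;
use std::path::Path;

/// Append-only flush: each call writes a brand-new uniquely named file,
/// so no read-modify-write race with the consolidator is possible.
fn flush_delta(dir: &Path, instance: &str, seq: u64, delta: i64) -> std::io::Result<()> {
    let path = dir.join(format!("delta_{}_{}.json", instance, seq));
    fs::write(path, delta.to_string()) // never reads an existing file
}

/// Consolidator side: read every delta_*.json, sum, delete after reading.
fn collect_deltas(dir: &Path) -> std::io::Result<i64> {
    let mut total = 0i64;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let name = path.file_name().unwrap().to_string_lossy().into_owned();
        if name.starts_with("delta_") && name.ends_with(".json") {
            total += fs::read_to_string(&path)?.parse::<i64>().unwrap_or(0);
            fs::remove_file(&path)?; // delete, don't reset to zero
        }
    }
    Ok(total)
}
```

Because a flush never touches a file the consolidator might be reading or deleting, a stale NFS read of "the" delta file cannot lose an update; at worst a new file is collected one cycle later.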
- **Eviction Performance: Decoupled Eviction from Consolidation Cycle**
  - Eviction now runs as a detached `tokio::spawn` task instead of blocking the consolidation cycle
  - `AtomicBool` guard (`eviction_in_progress`) prevents concurrent eviction spawns using `compare_exchange` with `SeqCst` ordering
  - `scopeguard` resets the guard on all exit paths (success, error, panic)
  - Consolidation cycle releases the global lock immediately, eliminating 100+ second lock holds during eviction
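The guard idea can be shown with std only. A sketch, not the project's code: the real version runs the body inside a detached `tokio::spawn` task and uses the `scopeguard` crate, which is emulated here with a small `Drop` type.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static EVICTION_IN_PROGRESS: AtomicBool = AtomicBool::new(false);

/// Resets the flag when dropped, on every exit path (success, error,
/// panic) -- the role `scopeguard` plays in the real code.
struct ResetGuard;
impl Drop for ResetGuard {
    fn drop(&mut self) {
        EVICTION_IN_PROGRESS.store(false, Ordering::SeqCst);
    }
}

/// Returns false if another eviction is already running.
fn try_spawn_eviction(evict: impl FnOnce()) -> bool {
    if EVICTION_IN_PROGRESS
        .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
        .is_err()
    {
        return false; // another eviction won the race
    }
    let _guard = ResetGuard; // flag is reset no matter how we exit
    evict(); // in the real code this body is a detached async task
    true
}
```

`compare_exchange(false, true, ..)` makes the check-and-set a single atomic step, so two consolidation cycles can never both observe "no eviction running" and both spawn one.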
- **Eviction Performance: Parallel NFS File Deletes**
  - `batch_delete_ranges()` now uses `tokio::fs::remove_file` and `tokio::fs::metadata` (async) instead of `std::fs` (sync)
  - File deletes execute concurrently via `futures::stream::buffer_unordered` with a concurrency limit of 32
  - Object-level eviction processes up to 8 objects concurrently via `buffer_unordered`
  - Per-object metadata lock remains held for the entire batch delete operation
- **Eviction Performance: Early Exit Check**
  - Eviction loop stops processing objects once `total_bytes_freed >= bytes_to_free`
  - Avoids unnecessary file deletes when actual file sizes exceed `compressed_size` estimates
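The early-exit loop reduces to a few lines. A minimal sketch with hypothetical types: candidates are (object id, estimated size) pairs rather than the real `RangeEvictionCandidate`.

```rust
/// Delete candidates until enough bytes are freed, then stop early.
fn evict_until_freed(candidates: &[(u32, u64)], bytes_to_free: u64) -> (u64, Vec<u32>) {
    let mut freed = 0u64;
    let mut deleted = Vec::new();
    for &(id, size) in candidates {
        if freed >= bytes_to_free {
            break; // early exit: actual sizes may exceed the estimates
        }
        // ... delete the object's range files here ...
        freed += size;
        deleted.push(id);
    }
    (freed, deleted)
}
```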
- **Size Tracking: Delta File Race Condition**
  - Root cause: Consolidator reset delta files to zero after reading, but an instance could flush a new delta between the read and the reset, causing the new value to be overwritten with zero (lost deltas)
  - Fix: Consolidator now DELETES delta files after reading instead of resetting them to zero
  - Flush now uses additive writes: reads the existing delta file, adds the new delta, and writes back; handles a missing file (deleted by the consolidator) gracefully by starting from zero
  - `reset_all_delta_files()` (validation scan) now deletes files instead of resetting them to zero
  - Removed dead `atomic_update_size_delta(0, 0)` call that was meant to increment consolidation_count but was skipped by an early return
- **SIZE_ACCUM Logging**: INFO-level logging on every accumulator add, subtract, flush, and collect operation for production traceability
  - `SIZE_ACCUM add`/`subtract`: logs each individual size change with byte count and instance ID
  - `SIZE_ACCUM flush`: logs delta values being flushed to disk
  - `SIZE_ACCUM collect`: logs per-file delta values read by the consolidator, plus a total summary
- **Size Tracking: Replaced Journal-Based Tracking with In-Memory Accumulator**
  - Root cause: Journal-based size tracking suffered from timing gaps between when data is written and when its size is counted, causing drift in multi-instance environments
  - Solution: An in-memory `AtomicI64` accumulator tracks size at write/eviction time with zero NFS overhead
  - Size changes recorded immediately via `fetch_add`/`fetch_sub` operations
  - Each instance flushes its accumulated delta to a per-instance file (`size_tracking/delta_{instance_id}.json`) every consolidation cycle
  - Consolidator reads all delta files under the global lock, sums them into `size_state.json`, and resets the delta files
  - Journal entries continue to be processed for metadata updates only (no longer used for size tracking)
  - Daily validation scan corrects any drift and resets all delta files
  - Graceful shutdown flushes the pending accumulator delta to disk
  - New `SizeAccumulator` struct in `journal_consolidator.rs` with `add()`, `subtract()`, `add_write_cache()`, `subtract_write_cache()`, `flush()`, and `reset()` methods
  - `store_range()` increments the accumulator after a successful HybridMetadataWriter write
  - `perform_eviction_with_lock()` decrements the accumulator using `compressed_size` from `RangeEvictionCandidate`
  - `write_multipart_journal_entries()` increments the accumulator for MPU completion ranges
  - `run_consolidation_cycle()` flushes the accumulator at cycle start and collects deltas under the global lock
  - `consolidate_object()` no longer calls `calculate_size_delta()` for size state updates
  - `update_size_from_validation()` resets all delta files after correcting drift
  - `shutdown()` flushes the accumulator before the final consolidation cycle
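The core of the accumulator is a handful of atomic operations. A minimal sketch of the idea only: the real `SizeAccumulator` also tracks the write cache separately and persists the flushed value to a per-instance delta file.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

struct SizeAccumulator {
    delta: AtomicI64, // pending size change since the last flush
}

impl SizeAccumulator {
    fn new() -> Self {
        Self { delta: AtomicI64::new(0) }
    }

    /// Record a cache write; no lock, no NFS round trip.
    fn add(&self, bytes: u64) {
        self.delta.fetch_add(bytes as i64, Ordering::Relaxed);
    }

    /// Record an eviction.
    fn subtract(&self, bytes: u64) {
        self.delta.fetch_sub(bytes as i64, Ordering::Relaxed);
    }

    /// Atomically take the pending delta, leaving zero behind; the caller
    /// writes the returned value to the per-instance delta file.
    fn flush(&self) -> i64 {
        self.delta.swap(0, Ordering::Relaxed)
    }
}
```

`swap(0, ..)` in `flush()` is what makes the flush safe against concurrent `add`/`subtract` calls: any change that races with the flush simply lands in the next flush.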
- **Size Tracking: Global Consolidation Lock to Prevent Multi-Instance Race Conditions**
  - Root cause: Multiple instances could run consolidation cycles simultaneously, processing the same journal entries due to NFS caching delays in journal cleanup
  - Even with per-cache-key locking, instances would process the same cache_key sequentially (not simultaneously), causing duplicate size counting
  - Solution: Added a global consolidation lock using flock-based file locking
  - Only one instance can run a consolidation cycle at a time across all instances
  - Lock file: `{cache_dir}/locks/global_consolidation.lock`
  - Uses non-blocking `try_lock_exclusive()`; if the lock is held, the instance skips the cycle
  - Lock automatically released when the cycle completes (via scopeguard RAII)
  - Added a `GlobalConsolidationLock` struct for lock metadata (debugging)
  - Added the `scopeguard` dependency for RAII-based lock release
- **Size Tracking: Fix Over-Counting from Duplicate Journal Entries**
  - Root cause: `calculate_size_delta()` was counting ALL valid journal entries, but `apply_journal_entries()` skips Add entries where the range already exists in metadata
  - In multi-instance environments, the same range can have multiple journal entries (from retries or multiple instances), causing size to be counted multiple times
  - Solution: Only count the size delta for entries that actually affect size:
    - Add entries that were actually applied (not skipped because the range was already in metadata)
    - All Remove entries (file was deleted)
  - Changed `apply_journal_entries()` to return `size_affecting_entries` instead of an empty vector
  - Changed `consolidate_object()` to use `size_affecting_entries` for `calculate_size_delta()`
- **Size Tracking: Fix Double Subtraction on Eviction**
  - Root cause: Eviction was subtracting bytes_freed twice:
    - Directly via `atomic_subtract_size_with_retry()` after eviction
    - Via Remove journal entries processed by consolidation
  - Solution: Removed the direct subtraction; consolidation handles all size updates via journal entries
  - This maintains the single-writer pattern where consolidation is the only component updating size_state.json
- **Size Tracking: Debug Logging for metadata_written Flag**
  - Added INFO-level logging to trace size tracking decisions
  - Logs each Add entry: COUNTED (`metadata_written=false`) or SKIPPED (`metadata_written=true`)
  - Logs each Remove entry with its size
  - Summary log shows add_counted, add_skipped, remove_counted, total_delta
  - Purpose: Diagnose why v1.1.28 still shows ~7% under-reporting
- **Size Tracking: metadata_written Flag for Accurate Tracking**
  - Root cause: v1.1.27 used the metadata diff (size_after - size_before), but HybridMetadataWriter writes to .meta immediately, so ranges are already present when consolidation runs, yielding delta = 0
  - Solution: Added a `metadata_written: bool` field to JournalEntry
  - When HybridMetadataWriter succeeds (hybrid mode): `metadata_written: true`, and consolidation skips size counting (already in .meta)
  - When falling back to journal-only: `metadata_written: false`, and consolidation counts the size
  - Remove operations are always counted (range being deleted)
  - This correctly handles all scenarios without NFS lock overhead
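The counting rule above is a small pure function. A sketch with simplified types: the real `JournalEntry` carries more fields, but the `metadata_written` decision is as shown.

```rust
#[derive(Clone, Copy)]
enum Op {
    Add,
    Remove,
}

struct JournalEntry {
    op: Op,
    size: i64,
    metadata_written: bool,
}

/// Count an Add only when the hybrid writer did NOT already put the range
/// into .meta; Removes are always counted.
fn size_delta(entries: &[JournalEntry]) -> i64 {
    entries
        .iter()
        .map(|e| match e.op {
            Op::Add if e.metadata_written => 0, // already in .meta, skip
            Op::Add => e.size,                  // journal-only fallback path
            Op::Remove => -e.size,              // range being deleted
        })
        .sum()
}
```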
- **Size Tracking: Fixed Negative Size Delta Bug in v1.1.26**
  - Root cause: v1.1.26 calculated size_delta from journal entries, but Add entries are cleaned up after consolidation while Remove entries are created later during eviction
  - When eviction runs, Remove entries subtract size but the corresponding Add entries are already gone
  - Result: size_delta goes negative, total_size clamps to 0, and the cache appears empty
  - Fix: Calculate size_delta from the metadata diff (size_after - size_before) instead of journal entries
  - This correctly handles:
    - Skipped Adds (range already in metadata from HybridMetadataWriter): delta = 0
    - Applied Adds (new range): delta = +size
    - Removes (range deleted): delta = -size
  - The metadata-based diff is the source of truth for actual changes, avoiding cross-cycle imbalances
- **Size Tracking: Reverted to Journal-Based Approach (v1.1.19-style)**
  - Removed direct size tracking from `store_range_data()`; eliminates per-write NFS lock attempts
  - Removed the `size_tracked` field from `JournalEntry`; no longer needed
  - Size delta now calculated from journal entries with per-cycle deduplication
  - Deduplication uses a HashSet keyed by (start, end) to handle multiple instances caching the same range
  - This restores download performance (removes the ~30% throughput degradation from v1.1.23-v1.1.25)
  - May over-report size when multiple instances cache the same range across consolidation cycles
  - Over-reporting is safe (eviction triggers early) versus under-reporting (disk fills)
- **Size Tracking Performance**: Replaced retry logic with a non-blocking try-lock for direct size adds
  - v1.1.24 used 3 retries with exponential backoff (100-400ms delays), causing 30-90 second consolidation cycles
  - Now uses non-blocking `try_lock_exclusive()`, which returns immediately if the lock is busy
  - If the lock is busy, sets `size_tracked=false` and lets consolidation handle size tracking
  - Eliminates lock contention performance degradation in multi-instance deployments
- **Size Tracking Double-Counting**: Fixed bug where v1.1.23's direct size tracking caused double-counting
  - Root cause: v1.1.23 added a direct `atomic_add_size_with_retry()` call in `store_range_data()`, but consolidation also adds size via `atomic_update_size_delta()` when processing journal entries
  - With `WriteMode::JournalOnly`, both paths add size = 2x actual size
  - Observed: Tracked 1.71 GiB, actual 6 KB (empty after eviction); eviction loops forever
  - Fix: Added a `size_tracked: bool` field to `JournalEntry` (defaults to false for backward compatibility)
  - When the direct add succeeds, `size_tracked: true` is set on the journal entry
  - Consolidation skips the size delta for entries with `size_tracked: true` to avoid double-counting
- **Size Tracking - Add Path Not Updating Size**: Fixed bug where caching new ranges did not update size tracking
  - Root cause: HybridMetadataWriter writes ranges to the `.meta` file immediately, then creates a journal entry
  - When consolidation runs, the range is already in metadata, so `size_before == size_after` and `size_delta = 0`
  - This caused tracked size to under-report by ~20-25% (e.g., 1.40 GiB tracked vs 1.75 GiB actual)
  - Fix: DiskCacheManager now calls `atomic_add_size_with_retry()` directly after storing a range
  - This mirrors how eviction works (direct subtract); both add and subtract now bypass journal-based tracking
  - Added `atomic_add_size()` and `atomic_add_size_with_retry()` methods to JournalConsolidator
  - Added a `journal_consolidator` field to DiskCacheManager for direct size updates
- **Eviction Size Tracking - Missing Code Path**: Fixed bug where eviction triggered via `evict_if_needed()` did not update size tracking
  - Root cause: v1.1.21 added subtract code to `enforce_disk_cache_limits_internal()` but missed `evict_if_needed()`
  - `evict_if_needed()` is called from http_proxy.rs and range_handler.rs before caching new data
  - When eviction was triggered via this path, `perform_eviction_with_lock()` ran but the size was never subtracted
  - Fix: Added the same `atomic_subtract_size_with_retry()` call to `evict_if_needed()` after eviction completes
- **Eviction Size Tracking**: Fixed bug where eviction did not reduce tracked size
  - Root cause: Eviction updates metadata directly (removes ranges from the .meta file), then writes Remove journal entries
  - When consolidation processes the Remove entries, the ranges are already gone from metadata
  - Result: size_before = size_after = 0, so size_delta = 0 (no reduction tracked)
  - Fix: Directly call `atomic_subtract_size_with_retry(bytes_freed)` after eviction completes
  - This bypasses the journal-based approach for eviction, since eviction already knows the exact bytes freed
  - Consolidation still handles Add entries for size increases; eviction handles size decreases directly
- **Size Tracking Accuracy - Metadata-Based Calculation**: Complete rewrite of the size delta calculation to use metadata comparison instead of journal entries
  - Root cause: Journal-based size tracking was fundamentally flawed in multi-instance deployments
  - Multiple instances create journal entries for the same range (shared storage, same file path)
  - Previous fixes (v1.1.18, v1.1.19) tried to deduplicate journal entries but couldn't handle all edge cases
  - New approach: Calculate size_delta = (sum of compressed_size after) - (sum of compressed_size before)
  - This measures the actual change in metadata, not journal entry counts
  - Eliminates all double-counting issues regardless of how many instances write journal entries
  - Simplified `apply_journal_entries()` by removing complex size tracking logic
- **Size Tracking Double-Counting in Multi-Instance Deployments**: Fixed bug where the same range could be counted multiple times for size tracking
  - Root cause: When multiple instances cache the same range, each creates a journal entry; the v1.1.18 fix counted ALL journal entries for size, even duplicates
  - Example: Instances A and B both cache range X, producing two journal entries, so the size is counted twice
  - Fix: Track which ranges have been counted in each consolidation cycle using a HashSet
  - Only the first journal entry for each (start, end) range pair is counted for the size delta
  - This applies to both Add and Remove operations to prevent over/under-counting
  - Fixes the ~350 MB over-reporting observed after the v1.1.18 deployment
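The per-cycle deduplication amounts to a first-wins pass over the entries. A sketch with simplified entry tuples of (start, end, size); the real code walks full journal entries.

```rust
use std::collections::HashSet;

/// Count each (start, end) range at most once per consolidation cycle,
/// no matter how many instances wrote a journal entry for it.
fn dedup_size_delta(entries: &[(u64, u64, i64)]) -> i64 {
    let mut seen: HashSet<(u64, u64)> = HashSet::new();
    let mut delta = 0i64;
    for &(start, end, size) in entries {
        if seen.insert((start, end)) {
            delta += size; // first entry for this range: count it
        } // later duplicates from other instances are ignored
    }
    delta
}
```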
- **Size Tracking Missed Ranges Written by Hybrid Mode**: Fixed bug where ranges written immediately by HybridMetadataWriter were not counted in size tracking
  - Root cause: In hybrid mode, metadata is written directly to the `.meta` file, then a journal entry is created for redundancy
  - When consolidation ran, it found the range "already exists" in metadata and skipped adding it to `applied_entries`
  - Since the size delta is calculated only from `applied_entries`, these ranges were never counted
  - Fix: Add journal entries to `applied_entries` for size tracking even when the range already exists in metadata
  - The presence of a journal entry proves the size hasn't been tracked yet (entries are removed after consolidation)
  - This fixes the ~120MB discrepancy observed after the v1.1.17 deployment
- **Full Object Caching Bypassed Journal System (Actual Fix)**: Fixed critical bug where `store_full_object_as_range_new()` wrote directly to disk without creating journal entries
  - Root cause: The v1.1.15 CHANGELOG claimed this was fixed, but the actual code still bypassed the journal system entirely
  - Two issues fixed:
    - `range_spec.file_path` used only the filename instead of the full relative path (e.g., `object_0-1023.bin` instead of `bucket/XX/YYY/object_0-1023.bin`)
    - No journal entries were created after storing metadata, so the consolidator never tracked the size
  - Impact: Full object GET responses cached via this path were never counted in size tracking
  - Fix: Now computes the proper relative path and calls `write_multipart_journal_entries()` after storing metadata
  - This is the actual fix for the 273MB discrepancy observed after the v1.1.16 deployment (which only fixed multipart uploads)
- **Multipart Upload Completion Bypassed Journal System**: Fixed critical bug where CompleteMultipartUpload wrote metadata directly without creating journal entries
  - Root cause: `finalize_multipart_upload()` in `signed_put_handler.rs` wrote metadata and range files directly, bypassing the journal system
  - Impact: Multipart uploads were never counted in size tracking, causing size_state.json to under-report
  - Observed: 273MB discrepancy between actual disk usage (1.46GB) and tracked size (1.28GB)
  - Fix: Added a `write_multipart_journal_entries()` method to JournalConsolidator, called after CompleteMultipartUpload creates metadata
  - This ensures all multipart upload ranges are tracked via journal entries for consolidation to process
- **Full Object Caching Bypassed Journal System**: Fixed critical bug where `store_full_object_as_range_new()` wrote directly to disk without creating journal entries
  - Root cause: This method wrote range files and metadata directly, bypassing `DiskCacheManager::store_range()`, which creates journal entries for size tracking
  - Impact: Full object GET responses cached via this path were never counted in size tracking, causing size_state.json to under-report by hundreds of MB
  - Observed: `du` showed 2.0GB actual disk usage while size_state.json showed 1.39GB tracked (~711MB under-counted)
  - Fix: Now uses `DiskCacheManager::store_range()`, which properly writes journal entries for consolidation to process
  - Affected code paths: GET response caching, PUT body caching, write cache entry storage
- **Remove Journal Entries Not Processed**: Fixed bug where Remove journal entries from eviction were not being processed for size tracking
  - Root cause 1: `validate_journal_entries_with_staleness()` checked whether the range file exists, but Remove entries have intentionally deleted files
  - Root cause 2: `apply_journal_entries()` only added Remove entries to `applied_entries` if the range was found in metadata
  - Fix: Remove operations now bypass the file existence check (the file is intentionally deleted) and always count for size tracking
  - This bug caused size state to show 2.2GB tracked when the disk was actually 861MB after eviction
- **Journal-Based Size Tracking for Eviction**: Eviction now writes Remove journal entries instead of directly updating size state
  - Previous approach: Eviction called `atomic_subtract_size_with_retry()`, which required locking and could race with consolidation
  - New approach: Eviction writes Remove entries to the journal; consolidation processes them and updates size state
  - Benefits: Single writer to size_state.json (consolidation only), eliminates race conditions, no lock contention
  - Added a `write_eviction_journal_entries()` method to JournalConsolidator
  - Consolidation already handles Remove operations via `calculate_size_delta()`
- **Consolidation vs Eviction Race Condition**: Fixed race condition where consolidation's size state update could overwrite eviction's update
  - Root cause: Consolidation did a non-atomic read-modify-write without holding the `size_state.lock`
  - Sequence: Consolidation reads 2GB → Eviction subtracts 500MB (writes 1.5GB) → Consolidation adds +10MB to the stale 2GB → Consolidation writes 2.01GB, overwriting eviction's 1.5GB
  - This caused size state to show 1.6GB tracked when the disk was actually empty (all data evicted)
  - Fix: Added an `atomic_update_size_delta()` method that uses the same file locking as `atomic_subtract_size()`
  - Consolidation now uses this atomic method, ensuring sequential consistency with eviction
- **Size State Race Condition During Eviction**: Fixed race condition where size state was updated AFTER releasing the eviction lock
  - Previous behavior: Release eviction lock → Update size state
  - This allowed another instance to acquire the lock and read stale size state before the first instance updated it
  - New behavior: Update size state → Release eviction lock
  - This ensures sequential consistency: each eviction sees the result of the previous one
- **Critical: Size Tracking Discrepancy (47MB actual vs 1.5GB tracked)**: Fixed bug in journal consolidation that caused massive size tracking inflation
  - Root cause: `validate_journal_entries_with_staleness()` was adding entries with missing range files but recent timestamps to `valid_entries`
  - These entries were then processed by `apply_journal_entries()`, which calculated the size delta from them
  - But the range files didn't exist on disk (e.g., due to NFS caching delays), so size was counted for non-existent data
  - Fix: Entries with missing range files but recent timestamps are now kept in the journal for retry (not added to `valid_entries`)
  - Only entries with existing range files are processed for the size delta
  - Stale entries (missing files + old timestamps) are still removed from the journal
- **Consolidation Loop Deadlock During Idle Eviction**: Fixed deadlock where the consolidation loop would hang when triggering eviction during idle periods
  - Root cause: `enforce_disk_cache_limits()` was calling `consolidate_object()` for pre-eviction journal consolidation, but it was being invoked from within the consolidation loop itself
  - When called from the consolidation loop, consolidation has just finished, so pre-eviction consolidation is redundant and can block
  - Added an `enforce_disk_cache_limits_skip_consolidation()` variant that skips pre-eviction consolidation
  - `maybe_trigger_eviction()` now uses this variant to avoid the deadlock
  - Other callers (maintenance operations) still do pre-eviction consolidation for accurate access times
- **Eviction Not Triggering During Idle Periods**: Fixed bug where eviction would not trigger when the cache was over capacity but no new data was being added
  - Previously, eviction was only checked when `size_delta > 0` (cache grew), so idle periods with an over-capacity cache would never trigger eviction
  - Now eviction is checked at the end of EVERY consolidation cycle (every 5 seconds), regardless of activity
  - Modified `maybe_trigger_eviction()` to accept an optional `known_size` parameter to avoid redundant NFS reads
  - This keeps the cache within capacity limits even during read-only workloads or idle periods
- **Critical: Lost Updates in Size State**: Fixed race condition where concurrent evictions from multiple instances caused lost updates to size state
  - The previous fix (v1.1.4) did a read-modify-write without locking, so multiple instances could read the same value, subtract their bytes_freed, and overwrite each other
  - Added an `atomic_subtract_size()` function that uses file locking (`size_state.lock`) to ensure an atomic read-modify-write
  - This prevents size inflation when multiple instances evict concurrently
- **Critical: TOCTOU Race in Eviction Lock**: Fixed race condition where multiple threads could acquire the eviction lock simultaneously
  - The v1.1.5 fix had a TOCTOU (time-of-check-to-time-of-use) bug: threads checked `is_some()` and then released the mutex before setting `Some`
  - Now the entire check-and-set operation is atomic within a single mutex guard scope
  - Restructured to drop the mutex guard before async metrics recording to satisfy Rust's `Send` requirements
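The fix boils down to keeping the check and the set inside one guard scope. A sketch with a hypothetical token in place of the real lock-file handle:

```rust
use std::sync::Mutex;

struct EvictionState {
    lock_holder: Mutex<Option<u64>>, // stands in for eviction_lock_file
}

impl EvictionState {
    /// Check-and-set happens inside a single mutex guard scope, closing
    /// the TOCTOU window where two threads both saw `None` and proceeded.
    fn try_acquire(&self, token: u64) -> bool {
        let mut holder = self.lock_holder.lock().unwrap();
        if holder.is_some() {
            return false; // already held within this instance
        }
        *holder = Some(token); // set while still holding the guard
        true
    }

    fn release(&self) {
        *self.lock_holder.lock().unwrap() = None;
    }
}
```

The buggy version dropped the guard between `is_some()` and the assignment; with the guard held across both, a second thread blocks until the first has finished writing `Some`.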
- **Critical: Concurrent Eviction Race Condition**: Fixed bug where multiple threads within the same instance could all acquire the eviction lock simultaneously
  - The `flock`-based lock was per-file-descriptor, not per-process: each thread opened a new file descriptor and got its own lock
  - This caused multiple concurrent evictions to run, each reading stale size state and writing back incorrect values
  - Fix: Added a check at the start of `try_acquire_global_eviction_lock()` to return `false` if `eviction_lock_file` is already `Some`
  - This ensures only one thread per instance can hold the eviction lock at a time
- **Critical: Eviction Not Updating Size State**: Fixed bug where eviction triggered from `monitor_and_enforce_cache_limits()` did not update the size state
  - Two code paths could trigger eviction: (1) `JournalConsolidator::maybe_trigger_eviction()` and (2) `CacheManager::monitor_and_enforce_cache_limits()`
  - Only path (1) was updating the size state after eviction, causing size inflation when eviction happened via path (2)
  - Observed behavior: After heavy downloads filled the cache, two back-to-back evictions occurred but only the first one's `bytes_freed` was subtracted from the size state
  - Fix: Moved the size state update into `enforce_disk_cache_limits()` so ALL eviction paths update the size state
  - Removed the duplicate size state update from `maybe_trigger_eviction()` to prevent double-counting
- **Critical: Size Tracking Double-Counting Bug**: Fixed bug where cache size was inflated because journal entries already present in metadata were still counted in the size delta
  - Previously, `calculate_size_delta()` was called on ALL valid journal entries, including entries already consolidated in a previous cycle
  - Now the size delta is only calculated from entries that were actually applied (new entries not already in metadata)
  - This caused `size_state.json` to show sizes much higher than actual disk usage (e.g., 2.5GB tracked vs 565MB actual)
  - Root cause: Journal entries remain in journal files until cleanup and were being re-counted on each consolidation cycle
- **Multi-Instance Size Consistency (Complete Fix)**: Removed in-memory size state entirely; disk is now the single source of truth
  - Previously, each instance maintained its own in-memory `size_state` and could overwrite the shared disk file with stale values during consolidation
  - This caused size drops of a few hundred MB when one instance's stale in-memory state overwrote another instance's recent updates
  - Now all size operations (`get_current_size()`, `get_write_cache_size()`, `get_size_state()`) read directly from the shared `size_state.json` file
  - Eliminates race conditions where instances could see different sizes or overwrite each other's updates
  - `get_current_size()` and `get_write_cache_size()` are now async: they read from disk instead of in-memory state
  - Callers must use `.await` when calling these methods
  - This ensures all instances see the same size values from the shared disk file
- **Multi-Instance Size Consistency**: Dashboard and metrics now read cache size from the shared `size_state.json` file instead of in-memory state
  - All instances now show the same cache size value
  - `get_size_state()` reads from disk for multi-instance consistency
  - `get_current_size()` remains in-memory for hot paths (eviction checks)
- **Dashboard Timestamp Display**: Fixed "Invalid Date" display for the Last Consolidation timestamp
  - JavaScript now correctly handles Rust's SystemTime serialization format (`secs_since_epoch`)
- **Journal-Based Size Tracking**: Size tracking is now handled by the JournalConsolidator instead of a separate delta buffer system
  - Size deltas are calculated from Add/Remove operations in journal entries during consolidation
  - Size state is persisted to `size_tracking/size_state.json` after each consolidation cycle (every 5s)
  - Eviction is triggered automatically by the consolidator when the cache exceeds capacity
  - Consolidation interval changed from 30s to 5s for near-realtime size tracking
  - Use `shared_storage.consolidation_interval` to control the frequency (default: 5s)
- **Removed `shared_storage.enabled` Config Option**: Journal-based metadata writes and distributed eviction locking are now always enabled
  - The `shared_storage.enabled` config option has been removed
  - All deployments (single-instance and multi-instance) use the same code path
  - This simplifies the codebase and ensures consistent behavior
- **Consolidation Loop Timing**: Changed from burst catch-up to delay behavior when consolidation takes longer than the interval
  - Prevents rapid back-to-back consolidation cycles after long evictions
- **Deprecated Size Tracking Config**: Removed the `size_tracking_flush_interval` and `size_tracking_buffer_size` config options
  - These were part of the buffered delta system, which has been replaced by journal-based tracking
- **Dead Code Cleanup**: Removed ~200 lines of unused eviction lock methods (`write_global_eviction_lock`, `read_global_eviction_lock`)
  - These were superseded by flock-based locking via `try_acquire_global_eviction_lock()`
**Breaking Change - Cache Directory Migration Required**

This release changes the size tracking architecture. A fresh cache directory is recommended:

1. Stop all proxy instances
2. Clear the cache directory: `rm -rf /path/to/cache/*`
3. Update configuration:
   - Remove `shared_storage.enabled` if present (no longer supported)
   - Remove `size_tracking_flush_interval` if present (no longer supported)
   - Remove `size_tracking_buffer_size` if present (no longer supported)
4. Deploy the new version
5. Start proxy instances

Old files that will be automatically cleaned up:

- `size_tracking/checkpoint.json` - replaced by `size_state.json`
- `size_tracking/delta-*.log` - no longer used

New files created:

- `size_tracking/size_state.json` - contains total_size, write_cache_size, and the last_consolidation timestamp
- **Dashboard Object Metadata Hit Rate**: Fixed the hit rate calculation to use HEAD request hits/misses instead of RAM metadata cache hits/misses. Now accurately reflects S3 HEAD request cache performance.
- **Removed NFS Propagation Delays**: Removed the 50ms and 500ms delays in checkpoint sync that were workarounds for NFS visibility issues. With the `lookupcache=pos` mount option, `sync_all()` is sufficient for cross-instance file visibility. Reduces checkpoint sync latency by ~550ms.
- **NFS Mount Requirements**: Added critical documentation for multi-instance deployments requiring the `lookupcache=pos` mount option on NFS volumes. This caches positive lookups (file exists) but not negative lookups (file not found), ensuring new files from other instances are immediately visible while maintaining good cache hit performance. Added to CONFIGURATION.md and GETTING_STARTED.md.
- **Archived Investigation**: Moved INVESTIGATION-JOURNAL-CONSOLIDATION-BUG.md to archived/docs/ after successful resolution of all 5 journal consolidation bugs.
- **Reduced Log Noise**: Downgraded benign race condition logs from WARN/ERROR to INFO:
  - "Failed to read journal file for cleanup": expected when another instance already deleted the file
  - "Failed to break stale lock": expected when the lock file was already removed by another instance
  - These race conditions are harmless on shared NFS storage and don't indicate real problems
- **S3 Request Retry on Transient Failures**: Added retry logic (up to 2 retries with backoff) for S3 range requests that fail with connection errors like `SendRequest`. Previously, a single transient failure would return `BadGateway` to the client. Now the proxy retries before giving up.
- **Stale Journal Lock File Cleanup**: Journal consolidation now cleans up orphaned `.journal.lock` files that remain after fresh journal files are deleted. These lock files accumulated during high-concurrency downloads but are now automatically removed.
- **Critical: Cleanup vs Append Race Condition (Bug 5)**: Fixed race condition where journal cleanup could overwrite entries being appended concurrently. The v1.0.9 mutex only protected appends from each other, not from cleanup operations. Now uses file-level locking (`flock`) with a "fresh journal on lock contention" strategy:
  - Append tries a non-blocking lock on the primary journal file
  - If the lock is busy (cleanup in progress), the append creates a fresh journal file with a timestamp suffix
  - Cleanup acquires an exclusive lock before its read-modify-write
  - Appends never block, so cache writes stay fast during consolidation
  - Fresh journal files are automatically discovered by the consolidator and deleted when empty
  - Expected to reduce orphaned ranges from ~0.8% to 0%
- **Critical: Journal Append Race Condition (Bug 4)**: Fixed thread-safety issue in `append_range_entry()` where concurrent appends within the same instance could overwrite each other. When multiple threads read the journal file simultaneously, appended their entries, and wrote back, the last writer would overwrite entries from the other threads. Added a `tokio::sync::Mutex` to serialize journal appends within each instance.
  - Evidence: Orphaned range `5GB-1:1216348160-1224736767` was stored at 13:41:29.912 with "Range stored (hybrid)" logged, but no journal entry existed: it was overwritten by a concurrent append within milliseconds.
  - Expected to reduce orphaned ranges from ~0.5% to 0% in high-concurrency scenarios.
- Critical: Non-Atomic Metadata Write in Journal Consolidator: Fixed race condition where consolidation could read empty/corrupted metadata files. The `write_metadata_to_disk()` function used `tokio::fs::write()`, which is NOT atomic on NFS - readers could see empty or partial files during the write. Now uses the atomic write pattern (temp file + rename) like `hybrid_metadata_writer.rs`, ensuring readers always see complete, valid JSON.
  - Error symptom: "Failed to parse metadata file: EOF while parsing a value at line 1 column 0"
  - Reduced orphaned ranges from 1.2% to an expected 0%
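The temp-file-plus-rename pattern is small enough to show in full. This is a minimal std-only sketch of the core idea, assuming a POSIX filesystem where `rename()` atomically replaces the target; it is not the proxy's `hybrid_metadata_writer.rs`.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Atomic write: write to a temp file, sync it, then rename over the target.
// Readers see either the old file or the new one, never a partial write.
fn write_atomic(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("meta.tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?; // commit bytes to disk before the rename publishes them
    fs::rename(&tmp, path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("example.meta");
    write_atomic(&path, br#"{"ranges":[]}"#)?;
    assert_eq!(fs::read(&path)?, br#"{"ranges":[]}"#.to_vec());
    fs::remove_file(&path)?;
    Ok(())
}
```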
- Critical: Multi-Instance Consolidation Race Condition: Fixed race condition where multiple proxy instances consolidating the same cache key simultaneously caused entries to be lost. The lock was acquired AFTER reading journal entries, allowing all instances to read the same entries before any acquired the lock. Now the lock is acquired BEFORE reading entries, ensuring only one instance processes each cache key at a time.
  - Reduced orphaned ranges from 0.7% to 0% in multi-instance deployments
  - Instances that can't acquire the lock skip the cache key (another instance is handling it)
- Critical: Journal Consolidation Losing Ranges: Fixed bug where 12% of cached ranges were "orphaned" (range files existed but weren't tracked in metadata). The `cleanup_instance_journals()` function was truncating ALL journal files after consolidation, but `validate_journal_entries()` filtered out entries whose range files weren't yet visible due to NFS attribute caching. This caused journal entries to be permanently lost before they could be consolidated.
  - Added `consolidated_entries` field to `ConsolidationResult` to track which entries were actually processed
  - New `cleanup_consolidated_entries()` method removes only the specific entries that were successfully consolidated
  - Entries with missing range files (due to NFS caching delays) are now preserved and retried on the next consolidation cycle
  - Deprecated `cleanup_instance_journals()`, which truncated everything unconditionally
  - Cache hit rate improved from ~88% to ~99% on repeat downloads
- Critical: NFS Directory Entry Caching Bug: Removed `.exists()` checks before reading metadata files. The `.exists()` calls caused NFS to cache directory entries, making newly created files invisible to other instances even after journal consolidation. This caused a 40%+ cache miss rate on repeat downloads. Now reads files directly, avoiding directory lookups entirely.
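The fix amounts to reading the file and treating "not found" as a cache miss, rather than probing first. A minimal sketch (function name illustrative):

```rust
use std::fs;
use std::io::ErrorKind;

// Read the metadata file directly and treat NotFound as a cache miss,
// rather than calling .exists() first (which primes the NFS
// directory-entry cache and can hide newly created files).
fn read_metadata(path: &str) -> Option<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Some(bytes),
        Err(e) if e.kind() == ErrorKind::NotFound => None, // plain cache miss
        Err(e) => {
            eprintln!("metadata read error: {e}"); // transient I/O problem
            None
        }
    }
}

fn main() {
    assert!(read_metadata("/nonexistent/dir/file.meta").is_none());
}
```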
- Delta File Archiving: Delta files are now archived with timestamps before truncation during checkpoint consolidation. Archives are kept for the last 20 checkpoints per instance to aid troubleshooting of size tracking discrepancies.
- Range Storage Logging: Added INFO-level logging for successful range storage operations to diagnose cache write failures.
- Dashboard Cleanup: Removed "Stale Refreshes" counter (internal metric not useful to users).
- Dashboard Statistics Accuracy: Separated HEAD and GET hit/miss counters. "Object Metadata" now shows HEAD request statistics only, "Object Ranges" shows GET request statistics only. Previously both sections showed combined stats, making it appear that GET requests were missing cache when only HEAD metadata was missing.
- Distributed Lock Reliability: Replaced file rename-based locking with `flock()` for both eviction and checkpoint locks. The rename approach failed on NFS due to attribute caching, causing 75+ lock errors per minute and allowing multiple instances to evict simultaneously. `flock()` provides reliable distributed locking on NFSv4 without consistency issues.
- Lock Mechanism: Eviction and checkpoint locks now use persistent files with `flock()` instead of temp file + rename atomicity.
- No Delays Needed: Removed NFS propagation delays (50ms, 100ms, 200ms, 500ms) since `flock()` is atomic and works immediately.
- Eviction Not Triggered During Read-Only Workloads: Fixed cache staying over capacity (230%) indefinitely when no new writes occur. Checkpoint sync now triggers eviction check every 30 seconds if cache exceeds limit, ensuring capacity is enforced even during read-only workloads.
- Eviction Lock NFS Propagation: Added 100ms delay after sync_all() before rename to account for NFS propagation time, reducing lock acquisition failures.
- Reduced Log Noise: Changed delta buffer flush and cache size recovery logs from INFO to DEBUG level.
- Error Severity: Reduced eviction lock rename failures from ERROR to WARN since they're automatically retried and don't affect functionality.
- Terminology: Changed EFS-specific references to NFS (applies to all network filesystems, not just EFS).
First stable 1.0.0 release with production-ready multi-instance cache coordination and comprehensive bug fixes.
- Write Cache Critical Bug: Fixed PUT operations storing incorrect file paths (filename only instead of full sharded path), causing "Failed to slice cached range data" errors on GET requests after PUT.
- Cross-Instance Size Tracking: Implemented near-realtime multi-instance cache size synchronization with 30-second checkpoint consolidation, randomized coordination, and NFS consistency handling.
- NFS Consistency: Added `flush()` and `sync_all()` to all critical file operations (checkpoint, delta, eviction lock) to ensure data visibility across instances on network filesystems.
- Eviction Lock Failures: Fixed "No such file or directory" errors during eviction lock acquisition by ensuring temp files are synced before rename.
- Dashboard Log Parser: Fixed log viewer to handle tracing's inconsistent spacing (single space for ERROR, double space for INFO/WARN/DEBUG).
- Checkpoint Interval: Reduced from 5 minutes to 30 seconds for better cross-instance accuracy.
- Checkpoint Coordination: Added randomized delay (0-5s) and lock-based coordination so only one instance consolidates per interval.
- Logging Format: Added `.compact()` format to tracing configuration.
- Reduced Log Noise: Checkpoint operations use DEBUG level, sync only logs significant changes (>10 MB).
- Dead Code: Removed unused `acquire_global_eviction_lock()` and `is_eviction_lock_stale()` methods.
- EFS Consistency for Checkpoint Writes: Added `flush()` and `sync_all()` to checkpoint file writes to ensure data is fully committed to EFS before rename, preventing other instances from reading stale checkpoint data.
- EFS Consistency for Delta Files: Added `sync_all()` to delta file writes to ensure data is committed before checkpoint consolidation reads and truncates the files.
- Cross-Instance Delta Timing: Added 5-second wait after acquiring the checkpoint lock to ensure all instances have flushed their deltas before consolidation reads them, accounting for the random delay spread (0-5 seconds).
- EFS Propagation Delay: Increased checkpoint read delay from 100ms to 500ms to account for EFS eventual consistency when other instances update the checkpoint file.
- Checkpoint Interval: Reduced from 60 seconds to 30 seconds for better cross-instance size accuracy with acceptable EFS I/O overhead.
- Reduced Log Noise: Changed checkpoint lock acquisition/skip from INFO to DEBUG level, and only log checkpoint sync when size changes by >10 MB.
- Dead Code: Removed unused `acquire_global_eviction_lock()` method that was replaced by `try_acquire_global_eviction_lock()`.
- Write Cache Range Storage Bug: Fixed critical bug where PUT operations stored only the filename instead of the full sharded relative path in RangeSpec, causing "Failed to slice cached range data" errors on subsequent GET requests. Now correctly stores paths like `bucket/XX/YYY/object_0-1023.bin`.
- Cross-Instance Size Tracking: Implemented near-realtime cross-node cache size synchronization. Checkpoint process now consolidates deltas from ALL instances every minute (down from 5 minutes), providing accurate size tracking across the cluster without filesystem scanning.
- Checkpoint Coordination: Added randomized delay (0-5 seconds) and lock-based coordination to ensure only ONE instance writes the consolidated checkpoint per minute, preventing wasted work and race conditions. All instances re-read the checkpoint to stay synchronized.
- EFS Consistency for Delta Files: Fixed critical race condition where delta files were being truncated before data was visible on EFS. Added `sync_all()` after delta flush to ensure data is committed to disk before checkpoint consolidation reads and truncates the files.
- Eviction Lock EFS Consistency: Fixed eviction lock failures on EFS/NFS by adding `sync_all()` before rename to ensure the temp file is flushed to disk before the atomic rename operation.
- Dashboard Log Parser: Fixed dashboard log viewer to handle tracing's inconsistent spacing (single space for ERROR, double space for INFO/WARN/DEBUG) by using `trim_start()` after timestamp extraction.
- Checkpoint Interval: Reduced default checkpoint interval from 5 minutes to 1 minute for near-realtime cross-instance size accuracy.
- Logging Format: Added `.compact()` format to tracing configuration for more consistent log formatting.
- Cache Size Limit vs Current Size Confusion: Fixed multiple places in the code that were using `total_cache_size` (current usage) when they should have been using `max_cache_size_limit` (configured limit). This affected:
  - Post-initialization eviction check
  - Write cache capacity calculation
  - `evict_if_needed()` threshold calculation
  - `enforce_disk_cache_limits()` check
  - `get_maintenance_recommendations()` utilization calculation
  - Write cache max size recalculation
- Startup Over-Capacity Message: Changed "eviction needed" to "eviction will take place the next time data is cached" for clarity.
- Dashboard Disk Cache Size Display: Fixed dashboard showing size/size instead of size/limit. Added `max_cache_size_limit` field to `CacheStatistics` to track the configured limit separately from `total_cache_size` (current usage).
- Scalable Cache Size Tracking: Replaced filesystem walks with the size tracker for all cache size checks. Previously, `evict_if_needed()`, `get_cache_size_stats()`, `enforce_disk_cache_limits()`, and `get_maintenance_recommendations()` all walked the filesystem to calculate cache size, which doesn't scale to billions of files. Now all of these functions use the incremental size tracker (updated on every cache write/delete, corrected daily by a validation scan).
- Eviction Now Triggers Correctly: Fixed eviction not triggering because it was using stale checkpoint data. The size tracker's in-memory `current_size` is now used directly, updated in real time as ranges are stored.
- MetricsManager Cache Size Tracker: Wired the size tracker to MetricsManager so `cache_size` metrics are populated.
- Dashboard Field Fix: Fixed reference to non-existent `max_cache_size_limit` field in the dashboard (now uses `total_cache_size`).
- Missing HybridMetadataWriter Getter: Added `get_hybrid_metadata_writer()` method to CacheManager for background orphan recovery.
- Cache Size Tracking for Range Storage: Fixed critical bug where cache size was not being tracked when storing range data through `CacheManager` methods (`store_full_object_as_range_new`, `store_write_cache_entry`, `complete_multipart_upload`). The size tracker was only being updated in `DiskCacheManager.store_range()` but not in the `CacheManager` code paths. This caused the size tracker to show incorrect values (e.g., 702 MB when actual disk usage was 1.5 GB), preventing eviction from triggering.
- Size Tracker Wiring Logging: Added INFO-level logging when the size tracker is wired up to the disk cache manager, and WARN-level logging when the size tracker is not available for range storage operations.
- Eviction Lock Logging: Added INFO-level logging for eviction lock operations to diagnose lock acquisition issues. Logs now show when locks are acquired, when existing locks are found (with elapsed time and timeout), and when stale locks are forcibly acquired.
- Dashboard Uptime Auto-Refresh: System info (including uptime) now refreshes automatically every 5 seconds along with other dashboard metrics.
- Distributed Eviction Over-Eviction: In shared storage mode, after acquiring the eviction lock, proxies now re-check cache size using the size tracker before proceeding. This prevents over-eviction when multiple proxies detect over-capacity simultaneously - the second proxy will see the cache is already under target and skip eviction.
- Dashboard Log Text Filter: Text filter now searches server-side across all log entries, not just the already-displayed entries. Previously, filtering for "eviction" with level=All would only search the 100 most recent INFO entries; now it searches all entries matching the criteria.
- Cache Eviction Bug: Fixed critical bug where cache eviction never triggered because `total_cache_size` was used for both the configured limit and current usage. Added separate `max_cache_size_limit` field to store the configured limit, ensuring eviction triggers correctly when the cache exceeds capacity.
- RAM Cache Excluded from Disk Total: `total_cache_size` now only includes disk cache (read + write), not RAM cache, since RAM is separate and doesn't count against the disk limit.
- Range Spec Journal Fallback: In shared storage mode, `load_range_data_from_new_storage` now checks journals as a fallback when a range is not found in the metadata file, fixing "Range spec not found" warnings caused by race conditions.
- Dashboard Size Display: Dashboard now shows cache size with limit (e.g., "1.5 GiB / 1.2 GiB") for both disk cache and RAM cache.
- Coordinator Max Cache Size: Cache initialization coordinator now uses the actual configured `max_cache_size` instead of a hardcoded 1 GB, fixing misleading "Cache over capacity" warnings
- Non-Destructive Metadata Error Handling: Metadata files are no longer deleted when read/parse errors occur
  - JSON parse failures now retry up to 3 times with 50ms delays (handles partial reads during in-progress writes)
  - Empty file and I/O errors treated as cache miss without deletion
  - Prevents race condition where multiple proxies delete valid in-progress metadata writes
  - Orphan recovery system handles truly corrupt files over time
- Range File Rename Before Journal Write: In shared storage mode, range files are now renamed to their final path BEFORE writing the journal entry. This eliminates the race condition where another proxy's consolidator could read a journal entry referencing a file that doesn't exist yet. If journal write fails after rename, the orphan recovery system will clean up the range file.
- Additional Journal Parse Warnings Downgraded: All "Failed to parse journal entry" warnings in journal_manager.rs now debug level (4 additional locations)
- Downgraded Journal Warnings to Debug: "Failed to read journal file" (stale file handle) and "Journal entry references non-existent range file" warnings are now debug level since they're expected during concurrent writes on shared storage
- Range Response Metadata Retry: Added retry logic (5 attempts with increasing delays: 20-80ms) when retrieving cached metadata for range responses, reducing "Could not retrieve cached metadata" warnings during concurrent writes
- Metadata Read Retry with Delay: `get_metadata_from_disk()` now retries up to 3 times with a 10ms delay for transient errors (empty file, parse errors, I/O errors) before falling back to journal lookup
- Journal Fallback for Corrupted Metadata Files: `get_metadata_from_disk()` now tries journal lookup when the `.meta` file exists but fails to parse (EOF error, empty file, or corruption during concurrent writes)
- Journal Metadata Lookup for Range Responses: `get_metadata_from_disk()` now checks pending journal entries when the `.meta` file doesn't exist, eliminating "Could not retrieve cached metadata for range response" warnings during the journal consolidation window
- Emergency Eviction on ENOSPC: When disk write fails with "No space left on device", triggers cache eviction (80% target) and retries once before giving up
- Post-Eviction Capacity Check: After eviction completes, verifies sufficient space exists before allowing new cache writes
- Hard Capacity Check: Blocks new cache writes when disk usage exceeds configured max capacity
- Journal Metadata Propagation: Journal entries now include an `object_metadata` field, ensuring response headers are preserved when metadata files are created by journal consolidation
- Range Response Content-Length: Fixed incorrect `content_length` override that caused "Start position exceeds content length" warnings
- Journal Lookup Race Condition: Added fallback journal lookup in `find_cached_ranges()` to check pending journal entries during the consolidation window
- Orphaned Range Recovery: Integrated with BackgroundRecoverySystem for scalable sharded scanning of orphaned .bin files
- Buffered Access Logging: Access logs are now buffered in RAM and flushed periodically
  - Reduces disk I/O on shared storage (EFS/NFS) by batching writes
  - Configurable flush interval (`access_log_flush_interval`, default: 5s)
  - Configurable buffer size (`access_log_buffer_size`, default: 1000 entries)
  - Force flush on graceful shutdown to minimize data loss
  - Maintains existing S3-compatible log format and date-partitioned directory structure
- Buffered Size Delta Tracking: Cache size deltas are now buffered and written to per-instance files
  - Eliminates lock contention between proxy instances on shared storage
  - Each instance writes to its own delta file (`size_tracking/delta-{instance_id}.log`)
  - Configurable flush interval (`size_tracking_flush_interval`, default: 5s)
  - Configurable buffer size (`size_tracking_buffer_size`, default: 10000 deltas)
  - Recovery reads all instance delta files and sums them with the checkpoint
  - Stale delta files from crashed instances are cleaned up based on age
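The recovery arithmetic above reduces to "checkpoint plus all deltas". A sketch with illustrative values and names:

```rust
// Recovered cache size = last checkpoint + sum of every instance's delta log.
// (Deltas are signed: writes are positive, deletions/evictions negative.)
fn recovered_size(checkpoint: i64, instance_deltas: &[Vec<i64>]) -> i64 {
    checkpoint + instance_deltas.iter().flatten().sum::<i64>()
}

fn main() {
    // Checkpoint of 1000 bytes; instance A logged +300 and -50, instance B +200.
    let deltas = vec![vec![300, -50], vec![200]];
    assert_eq!(recovered_size(1000, &deltas), 1450);
}
```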
- Removed `AccessLogWriter`: Replaced with `AccessLogBuffer` for buffered writes
- Removed synchronous delta methods: `try_append_delta()` and `try_append_write_cache_delta()` replaced with buffered `SizeDeltaBuffer`
- Shared Storage Optimization: Significantly reduced disk I/O for EFS/NFS deployments
- Access logs: Up to 99% reduction in write operations (1000 entries per flush vs per-request)
- Size tracking: Eliminated per-operation disk writes and lock contention
- Improved throughput for high-traffic multi-instance deployments
- Presigned URL Expiration Rejection: Proxy now detects and rejects expired AWS SigV4 presigned URLs before cache lookup
  - Parses `X-Amz-Date` and `X-Amz-Expires` from query parameters
  - Checks expiration locally without S3 API calls
  - Returns 403 Forbidden immediately for expired URLs
  - INFO-level logging with expiration details (seconds expired, signed time, validity duration)
  - Prevents serving cached data with expired access credentials
  - Example: `cargo run --release --example presigned_url_demo`
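Once `X-Amz-Date` (format `YYYYMMDDTHHMMSSZ`) is parsed to epoch seconds, the check is simple arithmetic; the parsing itself is omitted here, and treating the boundary second as still valid is an assumption of this sketch, not a statement about the proxy's behavior.

```rust
// A presigned URL is rejected once the current time passes
// signed-at + X-Amz-Expires (all values in epoch seconds).
fn presigned_url_expired(signed_at_epoch: u64, expires_secs: u64, now_epoch: u64) -> bool {
    now_epoch > signed_at_epoch + expires_secs
}

fn main() {
    // Signed at t=1000 with a 900-second validity window.
    assert!(!presigned_url_expired(1000, 900, 1900)); // within window
    assert!(presigned_url_expired(1000, 900, 1901)); // expired: return 403
}
```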
- Presigned URL Support: Added comprehensive documentation in CACHING.md
- Explains how presigned URLs interact with caching
- Documents two TTL strategies: Long TTL (performance) vs Zero TTL (security)
- Clarifies cache key generation (path only, excludes query parameters)
- Security considerations for time-limited access control
- Early rejection behavior for expired presigned URLs
- Per-Instance Part Request Deduplication: Prevents duplicate S3 requests when concurrent part requests arrive for the same object
  - When a part request arrives but multipart metadata is missing (object cached via regular GET), the proxy checks if this instance is already fetching any part for the same object
  - If an active fetch is in progress: waits 5 seconds, then returns HTTP 503 with a `Retry-After: 5` header
  - Maximum 3 deferrals (15 seconds total wait) before forwarding to S3 anyway
  - The in-flight request populates multipart metadata (`parts_count`, `part_size`) from S3 response headers
  - Active fetches automatically expire after 60 seconds (stale timeout) to handle edge cases
  - This is per-instance coordination only - no cross-instance state sharing required
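The deferral policy can be sketched as a tiny decision function (names illustrative, not the proxy's types):

```rust
// Defer with a 503 up to three times while another fetch for the same object
// is in flight, then forward to S3 anyway to avoid indefinite blocking.
#[derive(Debug, PartialEq)]
enum PartAction {
    Defer503,    // wait 5s, return 503 with Retry-After: 5
    ForwardToS3, // no active fetch, or deferral budget exhausted
}

fn part_action(active_fetch_for_object: bool, deferrals: u32) -> PartAction {
    const MAX_DEFERRALS: u32 = 3;
    if active_fetch_for_object && deferrals < MAX_DEFERRALS {
        PartAction::Defer503
    } else {
        PartAction::ForwardToS3
    }
}

fn main() {
    assert_eq!(part_action(true, 0), PartAction::Defer503);
    assert_eq!(part_action(true, 3), PartAction::ForwardToS3); // 15s elapsed
    assert_eq!(part_action(false, 0), PartAction::ForwardToS3);
}
```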
- Simplified Concurrent Part Request Handling: Replaced complex cross-instance metadata population coordination with simpler per-instance request deduplication
  - Previous approach attempted RAM-based cross-instance coordination, which doesn't work with shared storage
  - New approach: each instance independently tracks its own active S3 fetches
  - More reliable and predictable behavior in multi-instance deployments
- Part Requests Incorrectly Served from Range Cache: Fixed critical bug where part requests without multipart metadata were incorrectly served from cached ranges
  - Previously, when `lookup_part` returned a cache miss, the code fell through to range handling, which served the full object data instead of the specific part
  - This returned incorrect data (full object) with wrong headers (no `x-amz-mp-parts-count`, wrong `Content-Range`)
  - Now, part requests that miss the cache go directly to S3, bypassing range handling entirely
  - Part requests are only served from cache when multipart metadata (`parts_count`, `part_size`) is known
- Part 1 Not Included in Deduplication: Fixed bug where part 1 requests bypassed deduplication logic
  - Previously, part 1 was treated as a "single-part object" when multipart metadata was missing
  - Now ALL part requests without multipart metadata go through deduplication
  - Ensures only one S3 request is made regardless of which part number arrives first
  - `ActivePartFetch` struct tracks cache_key, part_number, start time, and deferral count
  - `handle_missing_multipart_metadata` registers the active fetch and defers concurrent requests
  - ALL part requests without multipart metadata now go through deduplication (including part 1)
  - After 3 deferrals, forwards to S3 to prevent indefinite blocking
  - `complete_part_fetch` and `fail_part_fetch` methods clean up tracking after the S3 response
  - 503 responses include a `Retry-After: 5` header for client retry guidance
- Dead Code Removal: Cleaned up legacy and unused code
  - Removed deprecated no-op methods: `refresh_metadata_expiration()`, `merge_overlapping_ranges()`
  - Removed unused journal cleanup methods: `cleanup_processed_entries()`, `cleanup_invalid_entries()`
  - Fixed test expectations for shared storage default configuration
  - All functionality preserved, no breaking changes
- Shared Storage Enabled by Default: Multi-instance coordination is now enabled by default
  - `shared_storage.enabled` now defaults to `true` instead of `false`
  - Provides better safety for multi-instance deployments out of the box
  - Single-instance deployments can set `shared_storage.enabled: false` to disable coordination overhead
- Faster Cross-Instance Cache Visibility: Reduced default journal consolidation interval from 30s to 5s
  - Improves cache hit rates in multi-instance deployments with shared storage (EFS, FSx)
  - When one instance caches data, other instances see it within ~5s instead of ~30s
  - Configurable via `shared_storage.consolidation_interval` (valid range: 1-60 seconds)
  - Tradeoff: Slightly more frequent consolidation I/O for significantly better cache utilization
- Consolidation Interval Range: Changed valid range from 5-300 seconds to 1-60 seconds
  - Allows sub-5-second consolidation for latency-sensitive workloads
  - Upper bound reduced since consolidation is fast (~3ms for 200+ entries)
- Faster Cross-Instance Cache Visibility: Reduced default journal consolidation interval from 30s to 5s
  - Improves cache hit rates in multi-instance deployments with shared storage (EFS, FSx)
  - When one instance caches data, other instances see it within ~5s instead of ~30s
  - Configurable via `shared_storage.consolidation_interval` (valid range: 5-300 seconds)
  - Tradeoff: Slightly more frequent consolidation I/O for significantly better cache utilization
- Versioned Request Handling: Requests with a `versionId` query parameter now properly validate against the cached version
  - If the cached object has a matching `x-amz-version-id`: serve from cache (cache hit)
  - If the cached object has a different version: bypass cache, forward to S3, do NOT cache the response
  - If no cached object exists: bypass cache, forward to S3, do NOT cache the response
  - Prevents serving wrong-version data when requesting specific object versions
  - Prevents cache pollution with version-specific data that may not be the "current" version
  - New metrics: `versioned_request_mismatch` and `versioned_request_no_cache` for monitoring
- Version ID Cache Correctness: Previously, versioned GET requests would incorrectly use cached data regardless of version
  - Old behavior: `GET /bucket/object?versionId=v2` could return cached data from version v1
  - New behavior: Only serves from cache if `x-amz-version-id` in cached metadata matches the requested `versionId`
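The validation rules above can be sketched as a decision function (types and names illustrative):

```rust
// Serve from cache only on an exact x-amz-version-id match; otherwise
// forward to S3 and skip caching the response.
#[derive(Debug, PartialEq)]
enum VersionedAction {
    ServeFromCache,
    ForwardNoCache,
}

fn versioned_action(requested: &str, cached_version: Option<&str>) -> VersionedAction {
    match cached_version {
        Some(v) if v == requested => VersionedAction::ServeFromCache,
        _ => VersionedAction::ForwardNoCache, // version mismatch or nothing cached
    }
}

fn main() {
    assert_eq!(versioned_action("v2", Some("v2")), VersionedAction::ServeFromCache);
    assert_eq!(versioned_action("v2", Some("v1")), VersionedAction::ForwardNoCache);
    assert_eq!(versioned_action("v2", None), VersionedAction::ForwardNoCache);
}
```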
- Zero-Copy Cache Writes: Eliminated unnecessary data copy in CacheWriter when compression is disabled
  - Previously: `data.to_vec()` copied every chunk even without compression
  - Now: writes directly from the original slice when no compression is needed
  - Reduces memory allocations and CPU usage during cache-miss streaming
  - Improves throughput for large file transfers
- JournalOnly Mode for Range Writes: Changed cache-miss range metadata writes from `WriteMode::Hybrid` to `WriteMode::JournalOnly`
  - Eliminates lock contention on shared storage (EFS) during large file transfers
  - Journal consolidator merges entries asynchronously without blocking streaming
  - Addresses 4x performance gap (80 MB/s vs 300 MB/s) caused by metadata lock contention
  - Each 8 MB range write no longer acquires an exclusive lock on the metadata file
- DiskCacheManager Lock: Changed from `Mutex` to `RwLock` for improved parallel read performance
  - Cache lookups (reads) now use `.read().await`, allowing concurrent access
  - Cache mutations (writes) use `.write().await` for exclusive access
  - Significantly improves throughput for parallel range requests on cache hits
  - Addresses performance gap where HTTPS (bypassing the proxy) was faster than HTTP for parallel downloads
- Read Methods: Changed read-only methods to take `&self` instead of `&mut self`
  - `load_range_data`, `get_cache_entry`, `get_full_object_as_range` now use `&self`
  - `decompress_data`, `decompress_with_algorithm` in CompressionHandler now use `&self`
  - Enables true parallel reads through RwLock
- AccessTracker Module: Removed redundant `src/access_tracker.rs` module
  - Time-bucketed access logs in the `access_tracking/` directory no longer used
  - Functionality consolidated into the journal system (`CacheHitUpdateBuffer`)
- BatchFlushCoordinator: Removed from RAM cache module
  - RAM cache access tracking now handled by the journal system at the DiskCacheManager level
  - Removed `batch_update_range_access` function from DiskCacheManager
  - Removed `set_flush_channel`, `start_flush_coordinator`, `shutdown_flush_coordinator` methods
- RAM Cache AccessTracker: Simplified RAM cache by removing the internal AccessTracker
  - `record_disk_access`, `should_verify`, `record_verification`, `pending_disk_updates` are now no-ops
  - Access tracking for disk metadata handled by `record_range_access` in DiskCacheManager
- Unified Access Tracking: All cache-hit access tracking now uses the journal system
  - Range accesses recorded via `DiskCacheManager::record_range_access()`
  - Buffered in `CacheHitUpdateBuffer`, flushed periodically, consolidated to metadata
  - Eliminates duplicate tracking systems and reduces code complexity
- Journal-Based Metadata Updates: New system for atomic metadata updates on shared storage
  - Eliminates race conditions when multiple proxy instances share cache storage (EFS/NFS)
  - RAM-buffered cache-hit updates with periodic flush to per-instance journal files
  - Background consolidation applies journal entries to metadata files with full-duration locking
  - Lock acquisition with exponential backoff and jitter for contention handling
  - New `CacheHitUpdateBuffer` for buffering TTL refresh and access count updates
  - New journal operations: `TtlRefresh` and `AccessUpdate` for incremental metadata changes
- Shared Storage Mode: Cache-hit updates now route through the journal system
  - TTL refreshes buffered in RAM, flushed every 5 seconds to the instance journal
  - Access count updates buffered similarly, applied during consolidation
  - Single-instance mode retains direct write behavior for performance
- Journal Consolidation: Enhanced to handle new operation types
  - `TtlRefresh` updates range expiration without replacing range data
  - `AccessUpdate` increments access count and updates the last_accessed timestamp
  - Conflict resolution skips incremental operations (they don't carry full range data)
- Lock Manager: Added `acquire_lock_with_retry` with configurable backoff
  - Exponential backoff with jitter prevents thundering herd
  - Configurable max retries, initial/max backoff, jitter factor
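An illustrative backoff schedule: exponential growth from an initial delay, capped at a maximum, with a jitter fraction. The real implementation presumably randomizes the jitter; a fixed fraction keeps this sketch deterministic, and all parameter values are examples.

```rust
use std::time::Duration;

// Delay for the Nth retry: initial * 2^attempt, capped at max_ms, scaled by
// a jitter fraction (0.0 = none). The shift is clamped to avoid overflow.
fn backoff_delay(attempt: u32, initial_ms: u64, max_ms: u64, jitter: f64) -> Duration {
    let base = initial_ms.saturating_mul(1u64 << attempt.min(20)).min(max_ms);
    Duration::from_millis((base as f64 * (1.0 + jitter)) as u64)
}

fn main() {
    // 10ms initial delay, 500ms cap, no jitter for reproducible assertions.
    assert_eq!(backoff_delay(0, 10, 500, 0.0), Duration::from_millis(10));
    assert_eq!(backoff_delay(3, 10, 500, 0.0), Duration::from_millis(80));
    assert_eq!(backoff_delay(10, 10, 500, 0.0), Duration::from_millis(500)); // capped
}
```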
- Race Condition: Concurrent metadata updates on shared storage no longer corrupt data
- Conflict Resolution: TtlRefresh/AccessUpdate operations no longer overwrite existing range data
- Journal RAM Buffering: Access tracking now buffers entries in RAM before flushing to disk
  - Entries buffered in memory with periodic flush (every 5 seconds by default)
  - Dramatically reduces disk I/O on shared storage (EFS/NFS)
  - Buffer auto-flushes when reaching 10,000 entries
  - Force flush available for shutdown/testing scenarios
- Removed Immediate Metadata Updates: Access tracking no longer updates metadata files on every access
  - Metadata updates now happen only during consolidation (every ~60 seconds)
  - Eliminates per-access disk writes, improving throughput significantly
  - Access counts and timestamps still accurately tracked via journal consolidation
- Simplified Access Tracking Directory Structure:
  - Removed per-instance `.access.{instance_id}` files from metadata directories
  - All access logs now stored in `access_tracking/{time_bucket}/{instance_id}.log`
  - Cleaner separation between metadata and access tracking data
- Reduced Disk I/O: Up to 99% reduction in disk writes for high-traffic workloads
- Lower Latency: Access recording completes in <1ms (RAM buffer only)
- Better Shared Storage Performance: Optimized for EFS/NFS with batched writes
- RAM Metadata Cache: New in-memory cache for `NewCacheMetadata` objects
  - Reduces disk I/O by caching frequently accessed metadata in RAM
  - LRU eviction with configurable max entries (default: 10,000)
  - Per-key locking prevents concurrent disk reads for the same key
  - Stale file handle recovery with configurable retry logic
  - Configurable via the `metadata_cache` section in config
- Unified HEAD/GET metadata storage: HEAD and GET requests now share a single `.meta` file
  - Independent TTLs: HEAD expiry doesn't affect cached ranges, range expiry doesn't affect HEAD validity
  - New fields in `NewCacheMetadata`: `head_expires_at`, `head_last_accessed`, `head_access_count`
  - HEAD access tracking via the journal system with the format `bucket/key:HEAD`
- Directory rename: `objects/` directory renamed to `metadata/`
  - Cache will be wiped on upgrade (no migration needed)
  - All metadata files now stored in `metadata/{bucket}/{XX}/{YYY}/`
- Legacy HEAD cache: Removed separate `head_cache/` directory and associated code
  - Removed `HeadRamCacheEntry`, `HeadAccessStats`, `HeadPendingUpdate`, `HeadAccessTracker` structs
  - Removed HEAD-specific methods from `RamCache` and `ThreadSafeRamCache`
  - Removed HEAD cache scanning from `CacheSizeTracker`
  - Kept `HeadCacheEntry` as the return type for backward compatibility
- Cache bypass headers support: Clients can now explicitly bypass the cache using standard HTTP headers
- `Cache-Control: no-cache`: Bypass cache lookup but cache the response for future requests
- `Cache-Control: no-store`: Bypass cache lookup and do not cache the response
- `Pragma: no-cache`: HTTP/1.0-compatible cache bypass (same behavior as `no-cache`)
- Case-insensitive header parsing with support for multiple directives
- `no-store` takes precedence when both `no-cache` and `no-store` are present
- `Cache-Control` takes precedence over `Pragma` when both headers are present
- Headers are stripped before forwarding requests to S3
- Configurable via `cache_bypass_headers_enabled` option (enabled by default)
- Metrics tracking for bypass reasons (`no-cache`, `no-store`, `pragma`)
- INFO-level logging for cache bypass events
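The precedence rules above can be sketched as a small decision function. This is a minimal illustration of the described behavior, not the proxy's actual implementation; the function name and return values are invented for the example.

```python
def bypass_decision(headers: dict) -> str:
    """Classify a request per the bypass rules described above.

    Returns "serve-from-cache", "no-cache" (skip lookup, still cache
    the response), or "no-store" (skip lookup, don't cache).
    """
    # Case-insensitive header-name lookup.
    lower = {k.lower(): v for k, v in headers.items()}

    # Parse Cache-Control as a comma-separated directive list,
    # matching directives case-insensitively.
    cc = lower.get("cache-control", "")
    directives = {d.strip().lower() for d in cc.split(",") if d.strip()}

    if "no-store" in directives:   # no-store wins over no-cache
        return "no-store"
    if "no-cache" in directives:
        return "no-cache"
    if directives:                 # Cache-Control present without a bypass
        return "serve-from-cache"  # directive: it takes precedence over Pragma

    # HTTP/1.0 fallback, consulted only when Cache-Control is absent.
    if lower.get("pragma", "").strip().lower() == "no-cache":
        return "no-cache"
    return "serve-from-cache"
```

For example, `Cache-Control: no-cache, no-store` resolves to `no-store`, while `Cache-Control: max-age=60` alongside `Pragma: no-cache` serves from cache because `Cache-Control` takes precedence.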
- Unified disk cache eviction: Replaced dual-mode eviction with unified range-level eviction
- Removed arbitrary 3-range threshold that determined eviction mode
- All ranges now treated as independent eviction candidates
- Consistent LRU/TinyLFU sorting across all objects regardless of range count
- Metadata file deleted only when all ranges evicted
- Empty directories cleaned up automatically after eviction
- Efficient batching: one metadata update per object during eviction
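The unified eviction pass described above can be sketched as follows. The data model and function are hypothetical, and plain LRU stands in for the LRU/TinyLFU scoring; the point is that every range is an independent candidate and metadata is dropped only when an object has no ranges left.

```python
def evict_ranges(objects: dict, bytes_to_free: int):
    """Range-level eviction sketch (hypothetical data model).

    `objects` maps object key -> list of (range_id, last_access, size).
    All ranges compete in one LRU-ordered candidate list, regardless of
    how many ranges each object holds (no 3-range mode threshold).
    Returns (evicted (key, range_id) pairs, keys whose metadata file
    should be deleted because every range was evicted).
    """
    # Flatten all ranges across all objects into one candidate list.
    candidates = [
        (last_access, key, range_id, size)
        for key, ranges in objects.items()
        for range_id, last_access, size in ranges
    ]
    candidates.sort()  # oldest access first (plain LRU in this sketch)

    freed, evicted = 0, []
    remaining = {key: len(ranges) for key, ranges in objects.items()}
    for last_access, key, range_id, size in candidates:
        if freed >= bytes_to_free:
            break
        evicted.append((key, range_id))
        freed += size
        remaining[key] -= 1

    # Metadata is deleted only once all of an object's ranges are gone.
    delete_metadata = [key for key, n in remaining.items() if n == 0]
    return evicted, delete_metadata
```

Batching the metadata rewrite per object (rather than per evicted range) follows naturally from grouping the evicted pairs by key before flushing.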
- Web-based monitoring dashboard with real-time cache statistics and log viewing
- Accessible at `localhost:8081` (configurable port)
- Real-time cache hit rates, sizes, and eviction statistics
- Application log viewer with filtering and auto-refresh
- System information display (hostname, version, uptime)
- No authentication required for internal monitoring
- Minimal performance impact (<10MB memory, supports 10 concurrent users)
- Dashboard configuration options in `config.example.yaml`
- Configurable refresh intervals for cache stats and logs
- Adjustable maximum log entries display
- Bind address and port configuration
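The dashboard options above might look something like this in `config.example.yaml`. The key names are illustrative guesses (only the `8081` default port is stated); consult the shipped example config for the actual schema.

```yaml
# Hypothetical sketch of the dashboard section.
dashboard:
  bind_address: 127.0.0.1
  port: 8081                 # default shown above
  stats_refresh_seconds: 5   # illustrative
  logs_refresh_seconds: 10   # illustrative
  max_log_entries: 500       # illustrative
```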
- Moved deployment-related files to non-public directory for better organization
- Updated documentation to reflect dashboard functionality
- Enhanced repository structure and cleanup
- Old debugging scripts from `old-or-reference/` directory
- Multi-tier caching (RAM + disk) with intelligent HEAD metadata caching
- Sub-millisecond HEAD response times from RAM cache
- Streaming response architecture for large files
- Unified TinyLFU eviction algorithm across GET and HEAD entries
- Intelligent range request optimization with merging
- Content-aware LZ4 compression with per-entry metadata
- Connection pooling with IP load balancing
- Write-through caching for single and multipart object uploads
- Multi-instance shared cache coordination
- OpenTelemetry Protocol (OTLP) metrics export
- Comprehensive test suite with property-based testing
- Docker deployment support
- Health check and metrics endpoints
- HTTP (Port 80): Full caching with range optimization
- HTTPS (Port 443): TCP passthrough (no caching)
- S3-compatible access logs and structured application logs
- Configurable TTL overrides per bucket/prefix
- Distributed eviction coordination for multi-instance deployments
- Performance optimizations for large-scale deployments
- Basic S3 proxy functionality
- Simple caching implementation
- Core HTTP/HTTPS proxy server
- Basic configuration system