Technical architecture overview and design principles for S3 Proxy.
- Core Principles
- System Architecture
- Module Organization
- Key Design Decisions
- Request Flow
- Performance Characteristics
- Security Considerations
- Observability
Client-Initiated Requests Only:
- Proxy only responds to client requests
- Cannot initiate requests to S3 (no AWS credentials)
- Cannot sign requests (relies on client-signed requests)
- Acts as an intelligent cache between client and S3
Streaming-First (TeeStream):
- All S3 responses stream directly to client via TeeStream
- Simultaneous caching in background
- Eliminates buffering and memory pressure
- Constant memory usage regardless of file size
Range-Based Storage:
- All cached data stored as ranges (see the sketch after this list)
- PUT operations: Stored as range 0-N
- GET operations: Full objects or partial ranges
- Multipart uploads: Parts assembled into ranges
- No data copying on TTL transitions
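To make the range model concrete, the metadata for one object can be pictured as a map of byte ranges. The type and field names below are invented for illustration, not the proxy's actual definitions:

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

/// Illustrative shape of range-keyed metadata: a full PUT of N bytes is
/// simply the single range 0..=N-1, and partial GETs add further entries.
struct CachedObject {
    etag: String,
    /// Ranges keyed by start offset; value is (inclusive end offset, range file).
    ranges: BTreeMap<u64, (u64, PathBuf)>,
}

impl CachedObject {
    /// A byte span is servable from cache if one stored range covers it.
    fn covers(&self, start: u64, end: u64) -> bool {
        self.ranges
            .range(..=start)
            .next_back()
            .map_or(false, |(_, &(range_end, _))| range_end >= end)
    }
}
```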
┌─────────────────────┐ ┌─────────────────────┐
│ S3 Client (HTTP) │ │ S3 Client (HTTPS) │
│ - AWS CLI │ │ - AWS CLI │
│ - S3 SDK │ │ - S3 SDK │
└──────────┬──────────┘ └──────────┬──────────┘
│ │
▼ ▼
┌──────────────────────────────────────────────────────────────────┐
│ S3 Proxy (1..N) │
│ ┌───────────────────────────┐ ┌───────────────────────────┐ │
│ │ HTTP Handler (Port 80) │ │ HTTPS Handler (Port 443) │ │
│ │ - Caching │ │ - TCP Passthrough │ │
│ │ - Range Merging │ │ - No Caching │ │
│ │ - Streaming │ │ - Direct to S3 │ │
│ └─────────────┬─────────────┘ └─────────────┬─────────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌───────────────────────────┐ │ │
│ │ RAM Cache │ │ │
│ │ - Metadata + ranges │ │ │
│ │ - Compression │ │ │
│ │ - Eviction │ │ │
│ └─────────────┬─────────────┘ │ │
└────────────────┼────────────────────────────────┼────────────────┘
│ │
▼ │
┌───────────────────────────────┐ │
│ Shared Disk Cache (NFS) │ │
│ - Metadata + ranges │ │
│ - Compression │ │
│ - LRU/TinyLFU-like eviction │ │
│ - Fixed or elastic size │ │
│ - Journaled writes │ │
└───────────────┬───────────────┘ │
│ │
└────────────────┬────────────────┘
│
▼
┌─────────────────────────┐
│ Amazon S3 (HTTPS) │
└─────────────────────────┘
src/
├── main.rs # Entry point, server initialization
├── lib.rs # Library exports, module declarations
│
│ # Cache layer
├── cache.rs # Unified cache manager
├── cache_types.rs # Cache data structures
├── cache_writer.rs # Async cache writer for streaming
├── cache_size_tracker.rs # Cache size tracking (delegates to consolidator)
├── cache_initialization_coordinator.rs # Coordinated cache initialization
├── cache_validator.rs # Cache integrity validation and file scanning
├── capacity_manager.rs # Cache capacity checks and bypass decisions
├── disk_cache.rs # Disk cache with streaming support
├── ram_cache.rs # RAM cache with TinyLFU for range data
├── metadata_cache.rs # RAM cache for NewCacheMetadata objects
├── write_cache_manager.rs # Write-cache capacity, eviction, and incomplete upload cleanup
│
│ # Journal and consolidation
├── journal_manager.rs # Per-instance journal file management
├── journal_consolidator.rs # Background consolidation + size tracking + eviction
├── hybrid_metadata_writer.rs # Journal-based metadata writes
├── metadata_lock_manager.rs # Lock management for shared storage
├── cache_hit_update_buffer.rs # RAM buffer for cache-hit metadata updates
│
│ # Recovery
├── background_recovery.rs # Background orphan detection and prioritized recovery
├── orphaned_range_recovery.rs # Scan and recover range files missing from metadata
│
│ # HTTP proxy
├── http_proxy.rs # HTTP proxy with streaming
├── inflight_tracker.rs # Download coordination: coalesces concurrent cache misses
├── signed_put_handler.rs # Signed PUT, multipart upload handling and caching
├── signed_request_proxy.rs # SigV4 request forwarding and TLS connection management
├── presigned_url.rs # Presigned URL parsing and expiration checking
├── aws_chunked_decoder.rs # AWS chunked transfer encoding decode/encode
├── s3_client.rs # S3 client wrapper with streaming
├── tee_stream.rs # TeeStream for simultaneous streaming/caching
├── range_handler.rs # Range request parsing and merging
│
│ # HTTPS / TCP proxy
├── https_proxy.rs # HTTPS proxy server (TCP passthrough mode)
├── https_connector.rs # Custom HTTPS connector with connection pool integration
├── tcp_proxy.rs # TCP tunneling with SNI extraction and IP load balancing
│
│ # Networking and compression
├── compression.rs # LZ4 compression with content-aware detection
├── connection_pool.rs # IP distribution, DNS resolution, and health tracking
│
│ # Observability
├── logging.rs # Access and application logging
├── log_sampler.rs # Rate-based log sampling for high-frequency operations
├── metrics.rs # Metrics collection
├── otlp.rs # OpenTelemetry Protocol export
├── health.rs # Health check endpoints and system status monitoring
├── dashboard.rs # Web dashboard with cache stats, system info, and log viewer
│
│ # Configuration and infrastructure
├── config.rs # Configuration management
├── error.rs # Error types
├── permissions.rs # Directory permission validation on startup
└── shutdown.rs # Graceful shutdown coordination
Problem: Large files (500MB+) caused AWS SDK throughput timeouts when the entire response was buffered in memory.
Solution: TeeStream architecture
- All streaming S3 responses stream directly to client via TeeStream
- Data simultaneously sent to background task for caching
- No buffering of entire response in memory
Implementation:
```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use bytes::Bytes;
use futures::{Stream, StreamExt};
use tokio::sync::mpsc;

// `Result` is the crate's one-parameter alias with a fixed error type.
pub struct TeeStream<S> {
    inner: S,
    sender: mpsc::Sender<Result<Bytes>>,
}

impl<S: Stream<Item = Result<Bytes>> + Unpin> Stream for TeeStream<S> {
    type Item = Result<Bytes>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        match self.inner.poll_next_unpin(cx) {
            Poll::Ready(Some(Ok(chunk))) => {
                // Send to cache writer (non-blocking): a full channel drops the
                // chunk from caching, never delaying the client.
                let _ = self.sender.try_send(Ok(chunk.clone()));
                Poll::Ready(Some(Ok(chunk)))
            }
            // ... error handling: errors and end-of-stream pass through as-is
            other => other,
        }
    }
}
```
Benefits:
- Eliminates timeout issues
- Constant memory usage (64KB chunks)
- Sub-100ms first byte latency
- No cache performance regression
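In practice the stream is wired to a background cache-writer task roughly as follows. This is a sketch, not the code in http_proxy.rs: the channel capacity and the write_chunk_to_cache helper are invented for illustration.

```rust
use bytes::Bytes;
use futures::Stream;
use tokio::sync::mpsc;

// `Result` is the same one-parameter crate alias used by TeeStream above.
fn tee_to_cache<S>(s3_body: S) -> TeeStream<S>
where
    S: Stream<Item = Result<Bytes>> + Unpin,
{
    let (tx, mut rx) = mpsc::channel::<Result<Bytes>>(64);

    // Background cache writer: drains mirrored chunks as they stream past.
    // The client-facing stream never waits on this task.
    tokio::spawn(async move {
        while let Some(Ok(chunk)) = rx.recv().await {
            write_chunk_to_cache(&chunk).await; // hypothetical helper
        }
        // Channel closed: the response finished (or errored); finalize here.
    });

    TeeStream { inner: s3_body, sender: tx }
}
```

Because the hot path uses try_send, a slow cache writer causes that object's caching pass to be abandoned rather than stalling the client.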
RAM Cache Integration: Both streaming and buffered paths check RAM cache before disk I/O. On a RAM hit, the streaming path serves data directly from memory as a buffered response. On a RAM miss, the streaming path collects chunks during disk streaming and promotes the range to RAM cache after completion (skipping promotion for ranges exceeding max_ram_cache_size).
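A sketch of that promotion decision, with an invented RamCache handle standing in for the real ram_cache.rs API:

```rust
use bytes::Bytes;

struct RamCache; // stand-in for the TinyLFU cache in ram_cache.rs
impl RamCache {
    fn insert(&self, _key: String, _data: Vec<u8>) { /* illustrative no-op */ }
}

/// After a disk hit finishes streaming, promote the collected chunks to RAM
/// unless the range exceeds the configured maximum.
fn maybe_promote(ram: &RamCache, key: String, chunks: Vec<Bytes>, max_ram_cache_size: usize) {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    if total <= max_ram_cache_size {
        // Concatenate once so the RAM entry is a single contiguous buffer.
        ram.insert(key, chunks.concat());
    }
    // Oversized ranges are skipped, so one huge object cannot flush the
    // frequency-tracked working set.
}
```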
Problem: Flat directory structure doesn't scale beyond 10K files per directory.
Solution: Bucket-first hash-based sharding with BLAKE3
cache_dir/
├── metadata/{bucket}/{XX}/{YYY}/
└── ranges/{bucket}/{XX}/{YYY}/
Why BLAKE3:
- 10x faster than SHA-256
- Cryptographically secure (prevents collision attacks)
- Excellent distribution properties
- Native Rust implementation
Implementation:
```rust
use std::path::{Path, PathBuf};

fn get_sharded_path(base_dir: &Path, cache_key: &str, suffix: &str) -> Result<PathBuf> {
    let (bucket, object_key) = parse_cache_key(cache_key)?;
    let hash = blake3::hash(object_key.as_bytes());
    let hash_hex = hash.to_hex();
    let level1 = &hash_hex[0..2]; // 256 first-level directories
    let level2 = &hash_hex[2..5]; // 4,096 second-level directories
    // Filename derived from hash + suffix; the exact naming here is illustrative.
    let filename = format!("{}{}", hash_hex, suffix);
    Ok(base_dir.join(bucket).join(level1).join(level2).join(filename))
}
```
Capacity:
- 1,048,576 leaf directories per bucket (256 × 4,096)
- 10.5B files per bucket maximum (10K files/directory)
- ~2.6B files per bucket as a comfortable operating point (≈2,500 files per directory, well under the 10K ceiling)
- Unlimited buckets
RAM Cache (TinyLFU):
- Range data from both streaming and buffered paths
- Sub-millisecond response times
- Frequency-based eviction
- Configurable size limits
- Streaming path checks RAM before disk, promotes to RAM after disk hits
Disk Cache (LZ4 Compressed):
- All object data stored as ranges
- Content-aware compression (see the sketch after this list)
- Streaming read/write support
- TTL-based expiration
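The detection heuristic sketched here is an assumption about what content-aware means (skip formats that are already entropy-coded), and lz4_flex stands in for whichever LZ4 binding compression.rs actually uses:

```rust
/// Returns the (possibly compressed) bytes and whether LZ4 was applied.
fn maybe_compress(content_type: &str, data: &[u8]) -> (Vec<u8>, bool) {
    // Already-compressed formats gain nothing from LZ4 and waste CPU.
    const SKIP_PREFIXES: &[&str] = &[
        "image/", "video/", "audio/", "application/zip", "application/gzip",
    ];
    if SKIP_PREFIXES.iter().any(|p| content_type.starts_with(p)) {
        return (data.to_vec(), false);
    }
    // Length-prefixed LZ4 block, decodable with decompress_size_prepended.
    (lz4_flex::compress_prepend_size(data), true)
}
```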
Cache Coordination:
- File locking for multi-instance coordination
- Atomic cache operations
- Consistent metadata management
Problem: Traditional distributed caches require cluster membership, leader election, or inter-node communication, adding operational complexity.
Solution: Coordination exclusively through shared storage
- Instances have no knowledge of each other
- All coordination via file-based locking on shared volume
- No cluster configuration, membership protocols, or network discovery
Benefits:
- Ephemeral Nodes: Instances can be added, removed, or replaced at any time without affecting other instances
- Simple Operations: No cluster reconfiguration when scaling up/down
- Fault Isolation: One instance failure has no impact on others
- Easy Recovery: Replace failed instance by starting a new one pointing to same shared storage
Coordination Mechanisms:
- File locks for write coordination (metadata updates, eviction); see the locking sketch after this list
- Per-instance journal files for cache-hit updates (no write conflicts)
- Distributed eviction lock prevents over-eviction during scale events
- Journal consolidator merges updates from all instances asynchronously
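As an illustration of file-based coordination, a critical section can be guarded by an advisory lock on a file under locks/. This sketch uses the fs2 crate as a stand-in for the proxy's actual lock manager:

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

use fs2::FileExt; // assumed dependency; any flock-style lock works the same way

/// Run `critical` while holding an exclusive lock on a file on the shared
/// volume. Instances never talk to each other; this file is the only
/// coordination point.
fn with_shared_lock<T>(lock_path: &Path, critical: impl FnOnce() -> T) -> io::Result<T> {
    let file = OpenOptions::new().create(true).write(true).open(lock_path)?;
    file.lock_exclusive()?; // blocks until no other instance holds the lock
    let result = critical();
    file.unlock()?; // also released implicitly when `file` drops
    Ok(result)
}
```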
Deployment Model:
Instance 1 ──┐
Instance 2 ──┼── Shared Storage (NFS)
Instance N ──┘ ├── metadata/
├── ranges/
├── journals/
└── locks/
Each instance operates independently - the shared storage is the only integration point.
Problem: Direct disk writes for access logs cause high I/O contention on shared NFS filesystems, especially with multiple proxy instances. Journal-based size tracking suffered from timing gaps between when data is written and when size is counted.
Solution:
- Access logs buffered in memory, flushed every 5 seconds or every 1,000 entries
- In-memory `AtomicI64` accumulator tracks size at write/eviction time with zero NFS overhead
- Per-instance delta files flushed periodically, summed by the consolidator under a global lock
Implementation:
┌─────────────────────────────────────────────────────────────────┐
│ S3 Proxy Instance │
│ ┌─────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Request Handler │ │ Cache Operations │ │
│ │ │ │ │ │
│ │ log_access() ─────┼────┼──► AccessLogBuffer │ │
│ │ │ │ - entries: Vec<AccessLogEntry>│ │
│ │ │ │ - flush every 5s │ │
│ │ │ │ │ │
│ │ │ │ store_range() ─────────────────┼─┤
│ │ │ │ │ │ │
│ │ │ │ ▼ │ │
│ │ │ │ SizeAccumulator.add() │ │
│ │ │ │ - AtomicI64 fetch_add │ │
│ │ │ │ - Zero NFS overhead │ │
│ └─────────────────────┘ └─────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ JournalConsolidator (background task, every 5s) ││
│ │ - Flushes accumulator to delta_{instance_id}.json ││
│ │ - Acquires global lock ││
│ │ - Sums all delta files → updates size_state.json ││
│ │ - Resets delta files to zero ││
│ │ - Processes journal entries for metadata updates only ││
│ │ - Triggers eviction when cache exceeds capacity ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
│
▼ (every 5 seconds)
┌─────────────────────────────────────────────────────────────────┐
│ Shared Storage (NFS) │
│ │
│ logs/access/YYYY/MM/DD/ │
│ └── {timestamp}-{hostname} ← Access logs (per-instance) │
│ │
│ cache/metadata/_journals/ │
│ └── {instance_id}.journal ← Per-instance journal entries │
│ │
│ cache/size_tracking/ │
│ ├── size_state.json ← Authoritative size state │
│ └── delta_{instance_id}.json ← Per-instance delta files │
└─────────────────────────────────────────────────────────────────┘
Accumulator-Based Size Tracking:
- Size recorded immediately at write/eviction time via atomic operations (sketched after this list)
- Each instance flushes accumulated delta to per-instance file every 5 seconds
- Consolidator sums all delta files under a global lock, updates `size_state.json`
- Journal entries processed for metadata updates only (not size tracking)
- Eviction triggered automatically when cache exceeds capacity
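The accumulator itself is only an atomic counter; a minimal sketch with assumed names:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// In-memory size delta for this instance. Writes add, evictions subtract;
/// no NFS I/O happens on this path.
struct SizeAccumulator {
    delta: AtomicI64,
}

impl SizeAccumulator {
    fn add(&self, bytes: i64) {
        self.delta.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Atomically take and reset the delta; called by the consolidator every
    /// 5 seconds before writing delta_{instance_id}.json.
    fn drain(&self) -> i64 {
        self.delta.swap(0, Ordering::Relaxed)
    }
}
```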
Performance Impact:
- Disk writes reduced by ~1000x (from per-request to periodic batch)
- NFS contention eliminated (per-instance files, no shared file writes)
- Maximum data loss on crash: 5 seconds of logs and size deltas
Recovery:
- On startup, consolidator loads `size_state.json`
- If missing, starts at 0; daily validation will correct the total
- In-memory accumulator starts at zero; delta files are summed on next consolidation cycle
GET Request:
- Request Validation: Parse and validate the incoming request
- Cache Lookup: Check RAM cache for metadata
- Range Analysis: Determine required byte ranges
- Cache Hit Path: Serve from cache if available
- Download Coordination: On cache miss, coalesce with any in-flight fetch for the same resource
- Cache Miss Path: Forward to S3, stream response
- Background Caching: Store response data asynchronously
- Response Assembly: Merge ranges if needed (see the merging sketch below)
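The merge itself is the classic interval merge; this sketch treats byte-adjacent ranges as contiguous (range_handler.rs may differ in detail):

```rust
/// Collapse overlapping or byte-adjacent (start, end) ranges, inclusive ends.
fn merge_ranges(mut ranges: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    ranges.sort_unstable_by_key(|r| r.0);
    let mut merged: Vec<(u64, u64)> = Vec::with_capacity(ranges.len());
    for (start, end) in ranges {
        match merged.last_mut() {
            // 0-4 and 5-9 merge because 5 <= 4 + 1.
            Some(last) if start <= last.1 + 1 => last.1 = last.1.max(end),
            _ => merged.push((start, end)),
        }
    }
    merged
}
```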
PUT Request:
- Request Forwarding: Stream request body to S3
- Response Capture: Capture S3 response headers
- Cache Storage: Store object data and metadata
- Cache Invalidation: Remove conflicting cache entries (see the sketch after this list)
- Response Return: Return S3 response to client
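The storage and invalidation steps reduce to roughly the following; the Cache handle and method names are invented for illustration:

```rust
use bytes::Bytes;

struct Cache; // stand-in for the unified cache manager in cache.rs
impl Cache {
    async fn invalidate(&self, _key: &str) {}
    async fn store_range(&self, _key: &str, _start: u64, _end: u64, _data: Bytes) {}
}

/// On PUT, drop any stale entry first, then store the uploaded body as the
/// single range 0..=len-1 (the "PUT stored as range 0-N" rule above).
async fn cache_put(cache: &Cache, key: &str, body: Bytes) {
    cache.invalidate(key).await;
    if !body.is_empty() {
        let end = body.len() as u64 - 1;
        cache.store_range(key, 0, end, body).await;
    }
}
```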
Upload Part (UploadPart):
- Request Forwarding: Stream part data to S3
- Part Caching: Store part data in temporary location
- Tracker Update: Record part metadata in upload tracker
- Response Return: Return S3 response to client
Complete Multipart Upload (CompleteMultipartUpload):
- Request Parsing: Parse XML body to extract requested parts and ETags
- Request Forwarding: Forward completion request to S3
- ETag Validation: Verify cached part ETags match request ETags
- Part Filtering: Retain only parts listed in the request, delete unreferenced parts
- Cache Finalization: Build `part_ranges` from filtered parts with cumulative byte offsets (see the sketch after this list)
- Graceful Degradation: On ETag mismatch or missing parts, skip caching and clean up
- Response Return: Return S3 response to client
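Building `part_ranges` is a cumulative-offset walk over the validated parts; the field names here are illustrative:

```rust
/// Cached part after ETag validation and filtering (illustrative shape).
struct CachedPart {
    size: u64,
}

/// Part N begins where part N-1 ended, so the assembled object can later be
/// served through the same inclusive byte-range machinery as ordinary GETs.
/// (S3 parts are at least 5 MiB except the last, so size > 0 in practice.)
fn build_part_ranges(parts: &[CachedPart]) -> Vec<(u64, u64)> {
    let mut offset = 0u64;
    parts
        .iter()
        .map(|part| {
            let range = (offset, offset + part.size - 1);
            offset += part.size;
            range
        })
        .collect()
}
```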
Multi-Instance Safety:
- Each upload isolated in its own `mpus_in_progress/{upload_id}/` directory
- Missing parts trigger cleanup without affecting other uploads
- Prevents incomplete cache entries that could serve corrupted data
- RAM Cache Hit: < 1ms
- Disk Cache Hit: 5-50ms (depending on size)
- Cache Miss: S3 latency + minimal overhead
- Streaming: No throughput degradation for large files
- Compression: 2-10x space savings with minimal CPU overhead
- Connection Pooling: Reduced connection establishment overhead
- Multi-Instance: Shared cache coordination via file locking
- Cache Size: Supports petabyte-scale caches
- Concurrent Requests: Configurable request limits
Your Responsibility: You are responsible for securing both network access to the proxy and file system access to the shared cache storage. Any client that can reach the proxy over the network can read any cached object whose TTL has not expired, because cache hits are served without contacting S3. Similarly, anyone with access to the cache storage volume can read cached data directly. Restrict both access paths to the clients and systems authorized to access the cached data.
HTTP Traffic is Unencrypted: All communication between clients and the proxy uses HTTP (port 80) for caching functionality. This means:
- Trusted Network Required: Deploy only in secured network environments (VPCs, internal networks, isolated subnets)
- Data in Transit: S3 data flows unencrypted between client and proxy (proxy-to-S3 communication always uses HTTPS)
- Network Controls: Use security groups, firewalls, or network segmentation to restrict proxy access to authorized clients only
The proxy is a shared cache. It does not authenticate clients or authorize requests — S3 handles both, but only when requests reach S3. With TTL > 0, cache hits bypass S3 entirely.
What this means in practice:
- Any client with network access to the proxy can read any cached object until its TTL expires
- A user whose S3 access was revoked can still retrieve cached data until TTL expires
- Different users share the same cached responses — there is no per-user isolation
This is the same security model as any shared cache (CDN edge cache, Mountpoint for Amazon S3's local cache, or a shared NFS export of downloaded files). The proxy does not weaken S3's security — it requires you to control who can access the cache, just as you would for any local copy of S3 data.
Two access paths to secure:
- Network access to the proxy — restrict to authorized clients using security groups, firewalls, or network segmentation
- File system access to the cache volume — restrict to authorized proxy instances only, using file permissions and mount controls
Mitigation - Always-Revalidate Mode (TTL=0): For environments requiring per-request authentication and authorization enforcement, TTL values can be set to zero:
```yaml
cache:
  get_ttl: "0s"
  head_ttl: "0s"
  actively_remove_cached_data: false  # Required - must use lazy expiration
```
Behavior with TTL>0 (expired): When a cached object's TTL has elapsed, the next request triggers the same conditional revalidation flow described below. See Cache Revalidation in the Caching documentation for full details, including the 304, 200, and 403 outcomes.
Behavior with TTL=0:
- Cache still stores data (useful for range merging, bandwidth savings)
- Every request triggers S3 revalidation via conditional requests (`If-Modified-Since` + `If-None-Match`); see the sketch below
- Client's original request headers (including `Authorization`) are forwarded to S3
- S3 validates the requesting client's IAM credentials on every request
- S3 returns 304 Not Modified if unchanged → cached data served, bandwidth saved
- S3 returns 200 OK if changed → fresh data fetched and cached
- S3 returns 403 Forbidden if client lacks access → error returned, cache unchanged
Important: TTL=0 requires actively_remove_cached_data: false (lazy expiration). With active expiration enabled, cached data would be immediately deleted after storage, defeating the purpose.
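A sketch of how the conditional request can be assembled. Function and variable names are invented, and the http crate types stand in for whatever the proxy uses; the client's own headers, including `Authorization`, pass through untouched:

```rust
use http::Request;

/// Attach cached validators to the client's request before forwarding.
/// S3 then both re-authorizes the caller and answers 304 when unchanged.
fn build_revalidation<B>(client_req: Request<B>, etag: &str, last_modified: &str) -> Request<B> {
    let (mut parts, body) = client_req.into_parts();
    // Validators come from the cached metadata for this object.
    parts.headers.insert("if-none-match", etag.parse().expect("valid header value"));
    parts.headers.insert("if-modified-since", last_modified.parse().expect("valid header value"));
    Request::from_parts(parts, body)
}
```

Since a SigV4 signature covers only the headers listed in SignedHeaders, attaching validators the client did not itself sign does not invalidate the client's signature.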
Use cases:
- Per-request IAM authentication and authorization: Ensures every client's credentials are validated by S3, preventing unauthorized access to cached data
- Maximum freshness guarantees: Every request confirms data hasn't changed
- Compliance scenarios: Audit trail of S3 validating each access
Trade-offs:
- Every request incurs S3 round-trip latency
- Bandwidth savings only from 304 responses (no data transfer when unchanged)
- Higher S3 request costs
Bucket Owner Validation Not Supported: The `x-amz-expected-bucket-owner` header cannot be validated for cached responses:
- S3 Does Not Return Owner: GetObject and HeadObject responses do not include bucket owner information
- No Owner Stored in Cache: Since owner is never received, it cannot be stored or validated
- Cached Responses Bypass Validation: Requests with `x-amz-expected-bucket-owner` served from cache will succeed regardless of the header value
- Cache Miss Behavior: Only on a cache miss does S3 validate the header (returning 403 on mismatch)
This is an inherent limitation of the S3 API design, not a proxy implementation choice.
Cache Storage: Cached data is stored on disk with LZ4 compression but no encryption:
- Plaintext Storage: Object data stored in readable format (compressed but not encrypted)
- Metadata Exposure: Object keys, sizes, and access patterns visible in cache metadata
- File System Access: Anyone with file system access can read cached data directly
- Encryption at Rest: Use storage-level encryption if encryption at rest is required
Before deploying, ensure:
- Network access to the proxy is restricted to clients authorized to access all objects that may be cached
- File system access to the cache volume is restricted to authorized proxy instances
- TTL values reflect your tolerance for serving stale data after access revocation (use TTL=0 for per-request authorization)
Appropriate Use Cases:
- Internal networks where all clients are authorized to access the same S3 data
- Development and testing environments
- Single-tenant deployments with controlled network access
- On-premises environments with network segmentation between trust zones
Not Recommended For:
- Multi-tenant environments where clients should not see each other's data
- Public-facing or untrusted network deployments
- Environments requiring per-object access control between clients (unless using TTL=0)
Metrics:
- Cache hit/miss rates
- Response times
- Throughput statistics
- Error rates
- Resource utilization
Logging:
- S3-compatible access logs
- Structured application logs
- Performance metrics
- Error tracking
Health Checks:
- HTTP health endpoint
- Cache system status
- S3 connectivity validation
- Resource availability checks