Releases: lablup/backend.ai

25.17.0

23 Nov 15:07

Features

Agent and Multi-Agent Support

Laid the groundwork for running multiple agent instances on a single host. This is an early-stage effort focusing on configuration structure and runtime architecture refactoring. Full resource isolation between agents is planned for future releases.

  • Add support for array of tables syntax in config sample generator (#6311)
  • Add support for multiple agents in agent server config (#6315)
  • Update Agent server RPC functions to include agent ID for agent runtime with multiple agents (#6320)
  • Change agent config field names and serialization aliases to use internal-addr naming (#6697)
  • Add AgentEtcdClientView for clean handling of etcd clients for multi agents (#6721)
  • Add custom resource allocation in agent server config (#6724)
  • Extract agent common resources to AgentRuntime (#6728)
  • Move ownership of resources from agent to a separate component in agent runtime (#6766)
  • Add resource isolation options for multi-agent setup (#6770)
  • Store installed images information to Redis in Agent (#6834)
  • Implement pickle based Kernel registry recovery which can replace existing kernel registry load and save functions (#6489)
  • Add agent-id label for session Docker containers (#6870)

Health Check and Dependency Verification

Added health check endpoints and dependency verification CLI commands across all Backend.AI components. Operators can now diagnose connectivity issues with external services (database, Redis, etcd) before and during runtime.

  • Implement health check infrastructure for component monitoring (#6732)
  • Add health checker system for all components (#6736)
  • Add dependency management system for manager (#6753)
  • Add dependency verification system for web component (#6757)
  • Add dependency verification for storage proxy (#6760)
  • Add dependency verification for App Proxy Coordinator and Worker (#6767)
  • Add dependency verification CLI in agent (#6775)
  • Add dependency health checking infrastructure (#6781)
  • Integrate HealthProbe across all components with real connectivity checks (#6836)

Artifact and Reservoir Registry

Enhanced artifact management with verification plugins, real-time download progress tracking via Redis, and support for gated HuggingFace models. Also enabled delegation-based imports when artifacts are unavailable in local reservoir registries.

  • Implement artifact_verifier type plugin in storage-proxy (#6258)
  • Fix limit, search parameters not working in reservoir registry's scan_artifact API (#6488)
  • Separate DB source from artifact repository layers (#6490)
  • Re-import available artifacts only when necessary based on digest (#6501)
  • Add extra column to Artifact model to store gated information for huggingface models (#6620)
  • Collect artifact verification results to artifact_revisions table (#6662)
  • Track Artifact download progress through redis (#6663)
  • Create artifact download progress query REST API (#6666)
  • Extend reservoir registry artifact import API to perform import delegation when the artifact is not available in the remote reservoir (#6672)
  • Track Reservoir registry artifact download progress (#6673)
  • Add metadata field to artifact verifier interface (#6676)
  • Add missing id, registry_id fields to ArtifactRegistry GQL Node (#6750)

Notification System

Introduced a notification center that allows administrators to configure webhook channels and define event-based notification rules. Included REST/GraphQL APIs for channel and rule management, along with CLI tools for validation and delivery testing.

  • Implement notification system with channels, rules, and event processing (#6635)
  • Implement notification center with REST/GraphQL APIs for managing channels and rules (#6653)
  • Add notification validation API and notification CLI (#6657)
  • Add notification center with webhook channel support (#6668)
  • Implement notification message type system and validation APIs (#6677)

Background Task Infrastructure

Began migrating background tasks to a retriable pattern with initial support for image operations and session commits. This is an ongoing effort, and further tasks will be migrated in subsequent releases.

  • Add image purge/rescan background tasks and modernize task system (#6597)
  • Improve bgtask infrastructure with repository pattern and type adapters (#6606)
  • Migrate commit session to retriable background task pattern (#6625)

Model Service and Routing

Introduced route health checking with a 3-state model (healthy/unhealthy/degraded) and automatic eviction for unhealthy endpoints. Also added Prometheus integration via service discovery for model service metrics collection.

  • Add model service route synchronization to service discovery (#6832)
  • Add periodic service discovery sync for model service routes (#6833)
  • Implement 3-state route health check with configurable eviction (#6839)
  • Add missing and newly introduced fields to service field specifications (#6714)
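
The 3-state model with eviction could look roughly like the sketch below. The release notes do not say how a route moves between states, so the rule used here (any failure means degraded, a configurable number of consecutive failures means unhealthy) and all names such as `RouteTracker` and `unhealthy_after` are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class RouteHealth(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class RouteTracker:
    """Hypothetical per-route tracker; not the actual Backend.AI class."""

    # Assumed knob: consecutive failures before a route counts as unhealthy.
    unhealthy_after: int = 3
    # Assumed knob: whether unhealthy routes are evicted at all.
    evict_unhealthy: bool = True
    failures: int = 0

    def record(self, ok: bool) -> RouteHealth:
        """Record one health check result and return the new state."""
        self.failures = 0 if ok else self.failures + 1
        return self.status

    @property
    def status(self) -> RouteHealth:
        if self.failures == 0:
            return RouteHealth.HEALTHY
        if self.failures < self.unhealthy_after:
            return RouteHealth.DEGRADED
        return RouteHealth.UNHEALTHY

    @property
    def should_evict(self) -> bool:
        """Only unhealthy routes are eviction candidates."""
        return self.evict_unhealthy and self.status is RouteHealth.UNHEALTHY
```

Keeping a middle state lets a router continue sending traffic to a route that has failed once without immediately evicting it.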

Session and Resource Management

Parallelized session termination for improved performance and added automatic cleanup for sessions associated with lost agents. Also introduced an async file deletion API for vfolders.

  • Parallelize session termination and add lost agent cleanup (#6826)
  • Implement async file deletion API in vfolder (#6861)
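
As a generic illustration of parallelized termination, the sketch below runs per-session termination concurrently with asyncio and a concurrency limit. `terminate_session` is a placeholder and the `limit` parameter is an assumption; neither reflects the actual Backend.AI implementation.

```python
import asyncio


async def terminate_session(session_id: str) -> str:
    # Placeholder for the real per-session termination work.
    await asyncio.sleep(0.01)
    return session_id


async def terminate_all(session_ids: list[str], limit: int = 8) -> list[str]:
    """Terminate sessions concurrently, at most `limit` at a time.

    Returns the ids that terminated without raising, in input order.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(session_id: str) -> str:
        async with semaphore:
            return await terminate_session(session_id)

    # return_exceptions=True keeps one failure from cancelling the rest.
    results = await asyncio.gather(
        *(guarded(s) for s in session_ids), return_exceptions=True
    )
    return [r for r in results if isinstance(r, str)]


if __name__ == "__main__":
    print(asyncio.run(terminate_all(["s1", "s2", "s3"])))
```

Compared with a sequential loop, total time is bounded by the slowest batch rather than the sum of all terminations.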

Infrastructure and Configuration

Added flexible bind/advertise address configuration for app-proxy components and restructured Valkey client by separating monitor and operation clients for better resource management.

  • Add read committed transaction support in ExtendedAsyncSAEngine, enabling higher throughput for read-heavy workloads by reducing transaction isolation overhead (#6665)
  • Replace time() with Redis TIME command in ValkeyScheduleClient (#6695)
  • Support multiple Apollo Router endpoints with load balancing (#6703)
  • Support bind, advertised address configuration options for app-proxy coordinator and worker components (#6631)
  • Separate monitor and operation clients in Valkey client (#6829)
  • Ensure normal URLs are called even if the protocol is included in the host of HostPortPair, preventing network error in app-proxy communication (#6813)
  • delete-dev.sh now supports interactive confirmation and non-interactive -y/--yes flag (#6815)

Improvements

  • Change Action Processor arguments to immutable types and make them contravariant to prevent memory leaks and improve type safety (#6596)
  • Introduce Source-based structure in AuthRepository, decoupling database access for easier testing (#6641)
  • Make the error_code method in BackendAIError an instance method, making it easier to inject or modify the error code from outside the class and improving flexibility when handling errors (#6722)
  • Move kernel registry ownership to agent runtime (#6730)
  • Use resources functions directly in AbstractAgent (#6763)

Fixes

Session a...

25.15.3

05 Dec 03:56
ad94c06

Features

  • Support bind, advertised address configuration options for app-proxy coordinator and worker components (#6631)
  • Add missing and newly introduced fields to service field specifications (#6714)
  • Parallelize session termination and add lost agent cleanup (#6826)

Fixes

  • Add AppProxy setup and initialization workflow to the TUI installer (replacing WSProxy) (#6228)
  • Fix Pydantic validation error from incorrect slot_type type in mock plugin (#6692)
  • Fix domain admin users seeing vfolder hosts from projects they were not members of. They now only see hosts for projects they belong to (#6694)
  • Model Card resolver now returns the actual error message when it fails, instead of showing a generic "Unknown error" string (#6702)
  • Add missing advertise_address info in app proxy status response (#6772)
  • Allow zero values in DecimalType conversion (#6783)
  • Disallow dots ('.') in model service names (#6800)
  • Fix auto-scaling functionality for inference services when using framework-based scaling rules. Metrics collection and rule comparison logic have been corrected to ensure proper scaling behavior (#6801)
  • Use the kernel’s occupied slots when calculating the agent’s resources (#6817)
  • Explicitly wrap slot key with SlotName() to prevent validation failure when initializing AgentInfo (#6841)
  • Apply http client pool in app proxy worker (#6851)
  • Add missing cache invalidation for resource preset (#6852)

Full Changelog

Check out the full changelog until this release (25.15.3).

Full Commit Logs

Check out the full commit logs between release (25.15.2) and (25.15.3).

25.17.0rc3

20 Nov 09:26
a5cc355

25.17.0rc3 Pre-release

Features

  • Add support for array of tables syntax in config sample generator (#6311)
  • Add support for multiple agents in agent server config (#6315)
  • Update Agent server RPC functions to include agent ID for agent runtime with multiple agents (#6320)
  • Add read committed transaction support in ExtendedAsyncSAEngine, enabling higher throughput for read-heavy workloads by reducing transaction isolation overhead (#6665)
  • Replace time() with Redis TIME command in ValkeyScheduleClient (#6695)
  • Change agent config field names and serialization aliases to use internal-addr naming (#6697)
  • Support multiple Apollo Router endpoints with load balancing (#6703)
  • Add missing and newly introduced fields to service field specifications (#6714)
  • Add AgentEtcdClientView for clean handling of etcd clients for multi agents (#6721)
  • Add custom resource allocation in agent server config (#6724)
  • Extract agent common resources to AgentRuntime (#6728)
  • Implement health check infrastructure for component monitoring (#6732)
  • Add health checker system for all components (#6736)
  • Add missing id, registry_id fields to ArtifactRegistry GQL Node (#6750)
  • Add dependency management system for manager (#6753)
  • Add dependency verification system for web component (#6757)
  • Add dependency verification for storage proxy (#6760)
  • Move ownership of resources from agent to a separate component in agent runtime (#6766)
  • Add dependency verification for App Proxy Coordinator and Worker (#6767)
  • Add resource isolation options for multi-agent setup (#6770)
  • Add dependency verification CLI in agent (#6775)
  • Add dependency health checking infrastructure (#6781)
  • Ensure normal URLs are called even if the protocol is included in the host of HostPortPair, preventing network error in app-proxy communication (#6813)
  • delete-dev.sh now supports interactive confirmation and non-interactive -y/--yes flag (#6815)
  • Parallelize session termination and add lost agent cleanup (#6826)
  • Separate monitor and operation clients in Valkey client (#6829)
  • Add model service route synchronization to service discovery (#6832)
  • Add periodic service discovery sync for model service routes (#6833)
  • Store installed images information to Redis in Agent (#6834)
  • Integrate HealthProbe across all components with real connectivity checks (#6836)
  • Implement 3-state route health check with configurable eviction (#6839)

Improvements

  • Introduce Source-based structure in AuthRepository, decoupling database access for easier testing (#6641)
  • Make the error_code method in BackendAIError an instance method, making it easier to inject or modify the error code from outside the class and improving flexibility when handling errors (#6722)
  • Move kernel registry ownership to agent runtime (#6730)
  • Use resources functions directly in AbstractAgent (#6763)

Fixes

  • Fix app proxy to properly handle redirect parameter in HTTP protocol auth flow by appending the redirect path to the generated proxy URL (#6686)
  • Fix Pydantic validation error from incorrect slot_type type in mock plugin (#6692)
  • Fix domain admin users seeing vfolder hosts from projects they were not members of. They now only see hosts for projects they belong to (#6694)
  • Model Card resolver now returns the actual error message when it fails, instead of showing a generic "Unknown error" string (#6702)
  • Fix storage proxy client to handle non-JSON error responses instead of crashing on parse failures (#6712)
  • Support service-definition.toml override with optional fields (image, arch, resource) (#6751)
  • Add missing advertise_address info in app proxy status response (#6772)
  • Allow zero values in DecimalType conversion (#6783)
  • Disallow dots ('.') in model service names (#6800)
  • Fix auto-scaling functionality for inference services when using framework-based scaling rules. Metrics collection and rule comparison logic have been corrected to ensure proper scaling behavior (#6801)
  • Use the kernel’s occupied slots when calculating the agent’s resources (#6817)
  • Make agent installed image sync match by canonical name and architecture instead of digest, avoiding mismatches when the container image driver changes the digest (#6838)
  • Explicitly wrap slot key with SlotName() to prevent validation failure when initializing AgentInfo (#6841)

Miscellaneous

  • Move EndpointLifecycle enum to a shared common package for improved reusability (#6637)
  • Add a debug log when the app proxy worker gets a server disconnected error, making unexpected disconnections easier to trace and diagnose (#6735)

Full Changelog

Check out the full changelog until this release (25.17.0rc3).

Full Commit Logs

Check out the full commit logs between release (25.17.0rc2) and (25.17.0rc3).

25.17.0rc2

09 Nov 16:18
dc2c320

25.17.0rc2 Pre-release

Fixes

  • Fix reservoir artifact import API response blocking when using delegation (#6683)

Full Changelog

Check out the full changelog until this release (25.17.0rc2).

Full Commit Logs

Check out the full commit logs between release (25.17.0rc1) and (25.17.0rc2).

25.17.0rc1

09 Nov 12:05
e9bbd89

25.17.0rc1 Pre-release

Features

  • Implement artifact_verifier type plugin in storage-proxy (#6258)
  • Fix limit, search parameters not working in reservoir registry's scan_artifact API (#6488)
  • Implement pickle based Kernel registry recovery which can replace existing kernel registry load and save functions (#6489)
  • Separate DB source from artifact repository layers (#6490)
  • Re-import available artifacts only when necessary based on digest (#6501)
  • Add image purge/rescan background tasks and modernize task system (#6597)
  • Improve bgtask infrastructure with repository pattern and type adapters (#6606)
  • Add extra column to Artifact model to store gated information for huggingface models (#6620)
  • Migrate commit session to retriable background task pattern (#6625)
  • Support bind, advertised address configuration options for app-proxy coordinator and worker components (#6631)
  • Implement notification system with channels, rules, and event processing (#6635)
  • Implement notification center with REST/GraphQL APIs for managing channels and rules (#6653)
  • Add notification validation API and notification CLI (#6657)
  • Collect artifact verification results to artifact_revisions table (#6662)
  • Track Artifact download progress through redis (#6663)
  • Create artifact download progress query REST API (#6666)
  • Add notification center with webhook channel support (#6668)
  • Extend reservoir registry artifact import API to perform import delegation when the artifact is not available in the remote reservoir (#6672)
  • Track Reservoir registry artifact download progress (#6673)
  • Add metadata field to artifact verifier interface (#6676)
  • Implement notification message type system and validation APIs (#6677)

Improvements

  • Change Action Processor arguments to immutable types and make them contravariant to prevent memory leaks and improve type safety (#6596)

Fixes

  • Update gpu_allocated legacy metric fields to consider all accelerator devices: cuda.devices and cuda.shares as well as MIG variants and other NPUs (Known issue: all resources visible to each user and group MUST use a consistent fraction mode) (#2404)
  • Refresh VAST cluster info cache rather than keep the cache alive forever (#6428)
  • Adjust reservoir download API client timeout and add proper connection termination handling (#6627)
  • Remove DoPullReservoirRegistryEvent, and the event handler (#6680)

Documentation Updates

  • Add Entry Point, Event, and Background Task architecture documentation (#6594)
  • Document adapter and Querier patterns in API/GraphQL/Repository READMEs (#6656)

Full Changelog

Check out the full changelog until this release (25.17.0rc1).

Full Commit Logs

Check out the full commit logs between release (25.16.0) and (25.17.0rc1).

25.15.2

07 Nov 10:46
5e25f6a

Features

  • Support project name filter when resolving user nodes (#6298)

Fixes

  • Fix GQL agent_summary_list resolver (#6389)

  • Add timestamp tracking for route health status to enable staleness detection.

    Route health checks now store both status and check timestamp, automatically marking health data older than 5 minutes as unhealthy. This prevents routing traffic to routes with stale health information and improves overall system reliability. (#6423)

  • Refresh VAST cluster info cache rather than keep the cache alive forever (#6428)

  • Change keypair query to include keypairs of users with no group membership (#6455)

  • Resolve deadlock occurring due to incorrect use of semaphore in specific image rescan scenarios (#6469)

  • Fix artifact revision not found error caused by get_artifact_revision_readme (#6485)

  • Add missing PRE_ENQUEUE_HOOK and POST_ENQUEUE_HOOK call in Sokovan scheduler (#6584)

Miscellaneous

  • Expose SESSION_PRIORITY_* constants from the manager package to the common package for consistent priority handling across components (#6459)

Full Changelog

Check out the full changelog until this release (25.15.2).

Full Commit Logs

Check out the full commit logs between release (25.15.1) and (25.15.2).

25.16.0

31 Oct 18:29
91fb690

Features

Resilience Framework

  • Add resilience framework with Policy interface and Resilience executor for composable fault-tolerance patterns (#6203)
  • Apply resilience framework to all Valkey clients with ContextVar-based operation tracking, replacing legacy decorators with composable policies (#6205)
  • Apply resilience framework (MetricPolicy + RetryPolicy with exponential backoff) to all manager repositories for improved observability and fault tolerance (#6210)
  • Apply Resilience framework to agent client with exponential backoff retry policy for improved reliability in agent RPC communications (#6211)
  • Apply Resilience framework to appproxy (wsproxy) client with exponential backoff retry policy for improved reliability in app proxy communications (#6213)
  • Apply Resilience framework to storage proxy client with exponential backoff retry policy for improved reliability in storage proxy communications (#6214)
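
A rough sketch of how composable policies with an executor might fit together is shown below. The names `Policy`, `Resilience`, and `RetryPolicy` come from the notes above, but every signature, parameter, and the `CountingPolicy` stand-in for a metrics policy are assumptions; this is not the Backend.AI API.

```python
import asyncio
from typing import Any, Awaitable, Callable, Protocol

Operation = Callable[[], Awaitable[Any]]


class Policy(Protocol):
    """Hypothetical interface: a policy wraps one async operation."""

    async def execute(self, operation: Operation) -> Any: ...


class RetryPolicy:
    """Retry with exponential backoff (parameter names are assumptions)."""

    def __init__(self, max_retries: int = 3, base_delay: float = 0.1,
                 factor: float = 2.0) -> None:
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.factor = factor

    async def execute(self, operation: Operation) -> Any:
        delay = self.base_delay
        for attempt in range(self.max_retries + 1):
            try:
                return await operation()
            except Exception:
                if attempt == self.max_retries:
                    raise
                await asyncio.sleep(delay)
                delay *= self.factor  # wait longer before each new attempt


class CountingPolicy:
    """Stand-in for a metrics policy: counts calls and failures."""

    def __init__(self) -> None:
        self.calls = 0
        self.failures = 0

    async def execute(self, operation: Operation) -> Any:
        self.calls += 1
        try:
            return await operation()
        except Exception:
            self.failures += 1
            raise


class Resilience:
    """Executor that nests policies; policies[0] is the outermost."""

    def __init__(self, policies: list[Policy]) -> None:
        self.policies = policies

    async def execute(self, operation: Operation) -> Any:
        wrapped = operation
        for policy in reversed(self.policies):
            wrapped = self._bind(policy, wrapped)
        return await wrapped()

    @staticmethod
    def _bind(policy: Policy, inner: Operation) -> Operation:
        async def run() -> Any:
            return await policy.execute(inner)
        return run
```

With `Resilience([CountingPolicy(), RetryPolicy()])`, the counter sees one call per logical operation while retries happen inside it; reversing the order would count every attempt instead.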

Artifact and Reservoir Registry

  • Implement delegate_import GQL mutation to trigger remote reservoir registry import (#6221)
  • Implement periodic polling and synchronization for remote reservoir (#6222)
  • Introduce artifact import pipeline to enable customizable storage configuration for each stage (#6223)
  • Add vfs_storages DB table, and GQL model and mutations (#6229)
  • Support artifact download from the reservoir registry's vfs_storage (#6236)

GraphQL and Real-time Updates

  • Replace Apollo Router with Hive Router (MIT License) to enable GQL Subscription support in federated environments without licensing fees, and add GQL Subscription support in webserver and backend for real-time updates (#6234)
  • Add GraphQL subscription schedulingEventsBySession for real-time session scheduling events (#6239)
  • Add GraphQL subscription backgroundTaskEvents for real-time background task events (#6243)
  • Support project name filter when resolving user nodes (#6298)

RBAC and Access Control

  • Add Global scope to simplify access control management for superadmin and monitor users (#6004)
  • Add superadmin and monitor role fixtures and migrate existing admin and monitor data to RBAC DB (#6006)

Session Scheduling

  • Implement scaling group filtering with injectable rules for session scheduling. The new ScalingGroupFilter applies configurable filter rules (public/private access, session type support) to determine eligible scaling groups before session creation, replacing the previous validation-only approach. This enables more flexible and extensible scaling group selection logic. (#6424)
  • Remove redundant agent count validation for multi-node cluster sessions (#6276)
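
The idea of filtering scaling groups with injectable rules can be sketched as below. Only the `ScalingGroupFilter` name and the two rule categories (public/private access, session type support) come from the notes above; the data classes, field names, and rule signatures are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScalingGroup:
    """Hypothetical view of a scaling group."""
    name: str
    is_public: bool
    allowed_session_types: frozenset[str]


@dataclass(frozen=True)
class SessionRequest:
    """Hypothetical request context used by the rules."""
    session_type: str
    # Names of private groups the requester may use.
    private_access: frozenset[str] = frozenset()


Rule = Callable[[ScalingGroup, SessionRequest], bool]


def access_rule(group: ScalingGroup, request: SessionRequest) -> bool:
    """Public groups are open; private ones need explicit access."""
    return group.is_public or group.name in request.private_access


def session_type_rule(group: ScalingGroup, request: SessionRequest) -> bool:
    """The group must support the requested session type."""
    return request.session_type in group.allowed_session_types


class ScalingGroupFilter:
    """Keeps only the groups that pass every injected rule."""

    def __init__(self, rules: list[Rule]) -> None:
        self.rules = list(rules)

    def filter(self, groups: list[ScalingGroup],
               request: SessionRequest) -> list[ScalingGroup]:
        return [g for g in groups
                if all(rule(g, request) for rule in self.rules)]
```

Because rules are plain callables passed in at construction time, a new eligibility condition can be added without touching the filter itself.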

Model Service and Runtime

  • Add support for Modular MAX and SGLang runtime variants (#6237)
  • Implement API layer for model deployment with pagination and filtering support (#6394)

Authentication

  • Implement JWT authentication module for GraphQL Federation (#6410)
  • Apply JWT authentication to webserver and manager (#6421)

Configuration Management

  • Add domain-level app configuration GraphQL API (#6295)
  • Add domain-level app configuration support for frontend (#6401)
  • Add allow-auto-quota-scope-creation configuration option to control whether virtual folders can be created in quota scopes that don't yet exist. Administrators can now prevent automatic quota scope creation by disabling this option, requiring quota scopes to be explicitly created before use. (#6263)
  • Container commit timeout can now be customized through configuration settings (#6346)

Web and Proxy

  • Implement WebSocket connectionParams to HTTP headers conversion (#6379)
  • Add connection error handling to storage proxy client (#6319)

Metrics and Monitoring

  • Add error code labels to layer operation metrics (#6322)

Improvements

  • Integrate Pydantic validators with Agent server configuration (#6172)
  • Introduce Service, Repository layer pattern in Image GQL Object resolvers (#6238)
  • Set Image ID as Redis key instead of Canonical when storing installed agents per image (#6249)
  • Introduce action-processor pattern to Image batch load resolvers for decoupling Redis and DB access in GQL API layer (#6269)
  • Remove unused agent life cycle event handler as it is replaced by handle_agent_terminated and handle_agent_started in AgentEventHandler (#6299)
  • Replace the use of the raw_labels field in the Image GQL object with the existing labels field (#6309)
  • Introduce separate DTOs for permission groups (PermissionGroupExtendedData, PermissionGroupData) to conditionally load relationship data and prevent lazy loading errors (#6451)
  • Add a Kernel registry recovery abstract class to refactor Agent by detaching kernel registry load and save logic (#6482)

Fixes

Session Management

  • Raise correct exception in SessionTransitionData.main_kernel (#6202)
  • Improve the performance of VFolder queries by optimizing group lookups, resulting in faster VFolder detail panel popups and quicker session creation page loading (#6300)
  • Session commits now fail immediately with a clear error when attempting to exceed quota limits. This improves user experience by providing faster feedback instead of attempting a commit that would ultimately fail (#6304)
  • Prevent re-terminating sessions already in terminal states (#6353)
  • Session type validation is now properly enforced when creating sessions within scaling groups (#6354)
  • Add missing PRE_ENQUEUE_HOOK and POST_ENQUEUE_HOOK call in Sokovan scheduler (#6584)

Agent and Resource Management

  • Align occupied_slots values between Agent summary and Agent type queries (#6257)
  • Decouple agent batch loading resolvers that previously depended on the from_row function (#6280)
  • Fix GQL agent_summary_list resolver (#6389)

Storage and Configuration

  • Fix logging directory not auto-generated (#6225)
  • Clients can now correctly fetch null quota scopes when a scope does not exist, instead of encountering an error (#6227)
  • Fix a permission issue where Storage Proxy would fail to access its glide socket after changing process uid/gid. Users running Storage Proxy with non-root credentials will no longer encounter "Permission Denied" errors during startup. (#6284)
  • Fix an issue where the Storage Proxy would unnecessarily attempt to create an XFS backend lockfile at /tmp/backendai-xfs-file-lock even when XFS storage was not configured. This could cause permission errors if the Storage Proxy lacked access to the /tmp directory, preventing the service from starting properly. (#6285)
  • Prevent artifact deletion when download and archive storage are identical (#6330)
  • Fixed a timing issue in vfolder deletion by ensuring the DELETE-ONGOING status is set before sending the storage deletion request (#6486)

Artifact and Reservoir Registry

  • Cleanup previous stages' files in artifact import pipeline (#6251)
  • Fix ReservoirDownloadStep failing with a connection reset error when the artifact size is too large in a vfs_storage type remote reservoir registry (#6254)
  • Artifact revision not found error caused by get_artifact_revision_readme ([#6485](https://...

25.16.0rc3

31 Oct 02:18
e0e45cc

25.16.0rc3 Pre-release

Features

  • Add Global scope to simplify access control management for superadmin and monitor users (#6004)
  • Add superadmin and monitor role fixtures and migrate existing admin and monitor data to RBAC DB (#6006)
  • Add allow-auto-quota-scope-creation configuration option to control whether virtual folders can be created in quota scopes that don't yet exist. Administrators can now prevent automatic quota scope creation by disabling this option, requiring quota scopes to be explicitly created before use. (#6263)
  • Remove redundant agent count validation for multi-node cluster sessions (#6276)
  • Add domain-level app configuration GraphQL API (#6295)
  • Support project name filter when resolving user nodes (#6298)
  • Add connection error handling to storage proxy client (#6319)
  • Add error code labels to layer operation metrics (#6322)
  • Container commit timeout can now be customized through configuration settings (#6346)
  • Implement WebSocket connectionParams to HTTP headers conversion (#6379)
  • Implement API layer for model deployment with pagination and filtering support (#6394)
  • Add domain-level app configuration support for frontend (#6401)
  • Implement JWT authentication module for GraphQL Federation (#6410)
  • Apply JWT authentication to webserver and manager (#6421)
  • Implement scaling group filtering with injectable rules for session scheduling. The new ScalingGroupFilter applies configurable filter rules (public/private access, session type support) to determine eligible scaling groups before session creation, replacing the previous validation-only approach. This enables more flexible and extensible scaling group selection logic. (#6424)

Improvements

  • Introduce Service, Repository layer pattern in Image GQL Object resolvers (#6238)
  • Set Image ID as Redis key instead of Canonical when storing installed agents per image (#6249)
  • Introduce action-processor pattern to Image batch load resolvers for decoupling Redis and DB access in GQL API layer (#6269)
  • Remove unused agent life cycle event handler as it is replaced by handle_agent_terminated and handle_agent_started in AgentEventHandler (#6299)
  • Replace the use of the raw_labels field in the Image GQL object with the existing labels field (#6309)
  • Introduce separate DTOs for permission groups (PermissionGroupExtendedData, PermissionGroupData) to conditionally load relationship data and prevent lazy loading errors (#6451)
  • Add a Kernel registry recovery abstract class to refactor Agent by detaching kernel registry load and save logic (#6482)

Fixes

  • Add AppProxy setup and initialization workflow to the TUI installer (replacing WSProxy) (#6228)

  • Upgrade aiotools (1.9.2 -> 2.2.3) for improved structured concurrency, and refactor server initialization and shutdown procedures so that only the exact error is displayed instead of excessive exception stack traces (#6250)

  • Align occupied_slots values between Agent summary and Agent type queries (#6257)

  • Fix typo in apollo router config serialization alias (sapollo-router -> apollo-router) (#6266)

  • Decoupled agent batch loading resolvers that previously depended on the from_row function. (#6280)

  • Fix a permission issue where Storage Proxy would fail to access its glide socket after changing process uid/gid. Users running Storage Proxy with non-root credentials will no longer encounter "Permission Denied" errors during startup. (#6284)

  • Fix an issue where the Storage Proxy would unnecessarily attempt to create an XFS backend lockfile at /tmp/backendai-xfs-file-lock even when XFS storage was not configured. This could cause permission errors if the Storage Proxy lacked access to the /tmp directory, preventing the service from starting properly. (#6285)

  • Improve the performance of VFolder queries by optimizing group lookups, resulting in faster VFolder detail panel popups and quicker session creation page loading (#6300)

  • Session commits now fail immediately with a clear error when attempting to exceed quota limits. This improves user experience by providing faster feedback instead of attempting a commit that would ultimately fail (#6304)

  • Prevent artifact deletion when download and archive storage are identical (#6330)

  • Fix model service extra mounts in client SDK to omit unset mount type fields, ensuring compatibility with the manager API (#6347)

  • Prevent re-terminating sessions already in terminal states (#6353)

  • Session type validation is now properly enforced when creating sessions within scaling groups (#6354)

  • Fix GQL agent_summary_list resolver (#6389)

  • Add timestamp tracking for route health status to enable staleness detection.

    Route health checks now store both status and check timestamp, automatically marking health data older than 5 minutes as unhealthy. This prevents routing traffic to routes with stale health information and improves overall system reliability. (#6423)

  • Change keypair query to include keypairs of users with no group membership (#6455)

  • Fix VFolderAlreadyExists HTTP status code, changing it from 400 Bad Request to 409 Conflict to match the semantic meaning of resource conflicts (#6465)

  • Resolve deadlock occurring due to incorrect use of semaphore in specific image rescan scenarios (#6469)

  • Remove useless README not found warning log (#6473)

  • Fix artifact revision not found error caused by get_artifact_revision_readme (#6485)

  • Fixed a timing issue in vfolder deletion by ensuring the DELETE-ONGOING status is set before sending the storage deletion request (#6486)

Documentation Updates

  • Add English documentation for Sokovan orchestration layer covering session scheduling, deployment management, and routing architecture. (#6446)
  • Add comprehensive README documentation for Actions, Services, and Repositories layers in the manager, covering architecture patterns, design principles, resilience patterns with Prometheus metrics, and best practices for each layer. (#6449)
  • Add comprehensive architecture and component documentation (#6466)
  • Expand CONTRIBUTING.md with comprehensive pull request guidelines including workflow, sizing best practices, and review process. (#6481)

Miscellaneous

  • Add warning logs in storage implementations when used or limit bytes are under 0 (#6359)
  • Expose SESSION_PRIORITY_* constants from the manager package to the common package for consistent priority handling across components (#6459)

Full Changelog

Check out the full changelog until this release (25.16.0rc3).

Full Commit Logs

Check out the full commit logs between release (25.16.0rc2) and (25.16.0rc3).

25.15.1

23 Oct 14:21
a41e6eb

Features

  • Container commit timeout can now be customized through configuration settings (#6346)

Fixes

  • Raise correct exception in SessionTransitionData.main_kernel (#6202)
  • Fix logging directory not auto-generated (#6225)
  • Decoupled agent batch loading resolvers that previously depended on the from_row function. (#6280)
  • Fix a permission issue where Storage Proxy would fail to access its glide socket after changing process uid/gid. Users running Storage Proxy with non-root credentials will no longer encounter "Permission Denied" errors during startup. (#6284)
  • Fix an issue where the Storage Proxy would unnecessarily attempt to create an XFS backend lockfile at /tmp/backendai-xfs-file-lock even when XFS storage was not configured. This could cause permission errors if the Storage Proxy lacked access to the /tmp directory, preventing the service from starting properly. (#6285)
  • Improve the performance of VFolder queries by optimizing group lookups, resulting in faster VFolder detail panel popups and quicker session creation page loading (#6300)
  • Session commits now fail immediately with a clear error when attempting to exceed quota limits. This improves user experience by providing faster feedback instead of attempting a commit that would ultimately fail (#6304)
  • Fix model service extra mounts in client SDK to omit unset mount type fields, ensuring compatibility with the manager API (#6347)
  • Prevent re-terminating sessions already in terminal states (#6353)
  • Session type validation is now properly enforced when creating sessions within scaling groups (#6354)

Full Changelog

Check out the full changelog until this release (25.15.1).

Full Commit Logs

Check out the full commit logs between release (25.15.0) and (25.15.1).

25.16.0rc2

15 Oct 13:51

25.16.0rc2 Pre-release

Features

  • Add support for Modular MAX and SGLang runtime variants (#6237)
  • Add GraphQL subscription backgroundTaskEvents for real-time background task events (#6243)

Fixes

  • Cleanup previous stages' files in artifact import pipeline (#6251)
  • Fix ReservoirDownloadStep failing with a connection reset error when the artifact size is too large in a vfs_storage type remote reservoir registry (#6254)

Documentation Updates

  • Add Artifact API descriptions (#6104)
  • Add Artifact concept documentation (#6105)

Full Changelog

Check out the full changelog until this release (25.16.0rc2).

Full Commit Logs

Check out the full commit logs between release (25.16.0rc1) and (25.16.0rc2).