Releases: lablup/backend.ai
25.17.0
Features
Agent and Multi-Agent Support
Laid the groundwork for running multiple agent instances on a single host. This is an early-stage effort focusing on configuration structure and runtime architecture refactoring. Full resource isolation between agents is planned for future releases.
- Add support for array of tables syntax in config sample generator (#6311)
- Add support for multiple agents in agent server config (#6315)
- Update Agent server RPC functions to include agent ID for agent runtime with multiple agents (#6320)
- Change agent config field names and serialization aliases to use internal-addr naming (#6697)
- Add AgentEtcdClientView for clean handling of etcd clients for multi agents (#6721)
- Add custom resource allocation in agent server config (#6724)
- Extract agent common resources to AgentRuntime (#6728)
- Move ownership of resources from agent to a separate component in agent runtime (#6766)
- Add resource isolation options for multi-agent setup (#6770)
- Store installed images information to Redis in Agent (#6834)
- Implement pickle based Kernel registry recovery which can replace existing kernel registry load and save functions (#6489)
- Add agent-id label for session Docker containers (#6870)
Health Check and Dependency Verification
Added health check endpoints and dependency verification CLI commands across all Backend.AI components. Operators can now diagnose connectivity issues with external services (database, Redis, etcd) before and during runtime.
- Implement health check infrastructure for component monitoring (#6732)
- Add health checker system for all components (#6736)
- Add dependency management system for manager (#6753)
- Add dependency verification system for web component (#6757)
- Add dependency verification for storage proxy (#6760)
- Add dependency verification for App Proxy Coordinator and Worker (#6767)
- Add dependency verification CLI in agent (#6775)
- Add dependency health checking infrastructure (#6781)
- Integrate HealthProbe across all components with real connectivity checks (#6836)
Artifact and Reservoir Registry
Enhanced artifact management with verification plugins, real-time download progress tracking via Redis, and support for gated HuggingFace models. Also enabled delegation-based imports when artifacts are unavailable in local reservoir registries.
- Implement `artifact_verifier` type plugin in storage-proxy (#6258)
- Fix `limit`, `search` parameters not working in reservoir registry's `scan_artifact` API (#6488)
- Separate DB source from artifact repository layers (#6490)
- Re-import available artifacts only when necessary based on digest (#6501)
- Add `extra` column to Artifact model to store gated information for `huggingface` models (#6620)
- Collect artifact verification results to `artifact_revisions` table (#6662)
- Track Artifact download progress through redis (#6663)
- Create artifact download progress query REST API (#6666)
- Extend reservoir registry artifact import API to perform import delegation when the artifact is not available in the remote reservoir (#6672)
- Track Reservoir registry artifact download progress (#6673)
- Add `metadata` field to artifact verifier interface (#6676)
- Add missing `id`, `registry_id` fields to `ArtifactRegistry` GQL Node (#6750)
Notification System
Introduced a notification center that allows administrators to configure webhook channels and define event-based notification rules. Included REST/GraphQL APIs for channel and rule management, along with CLI tools for validation and delivery testing.
- Implement notification system with channels, rules, and event processing (#6635)
- Implement notification center with REST/GraphQL APIs for managing channels and rules (#6653)
- Add notification validation API and notification CLI (#6657)
- Add notification center with webhook channel support (#6668)
- Implement notification message type system and validation APIs (#6677)
Background Task Infrastructure
Began migrating background tasks to a retriable pattern with initial support for image operations and session commits. This is an ongoing effort, and further tasks will be migrated in subsequent releases.
- Add image purge/rescan background tasks and modernize task system (#6597)
- Improve bgtask infrastructure with repository pattern and type adapters (#6606)
- Migrate commit session to retriable background task pattern (#6625)
Model Service and Routing
Introduced route health checking with a 3-state model (healthy/unhealthy/degraded) and automatic eviction for unhealthy endpoints. Also added Prometheus integration via service discovery for model service metrics collection.
- Add model service route synchronization to service discovery (#6832)
- Add periodic service discovery sync for model service routes (#6833)
- Implement 3-state route health check with configurable eviction (#6839)
- Add missing and newly introduced fields to service field specifications (#6714)
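The 3-state route health model can be pictured with a small sketch. The notes only state that routes are healthy, degraded, or unhealthy and that unhealthy routes are evicted; how a route moves between states is not described here, so the consecutive-failure counting and the `degraded_after` / `unhealthy_after` thresholds below are assumptions made for illustration, not Backend.AI's actual configuration keys.

```python
from dataclasses import dataclass
from enum import Enum


class RouteHealth(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class RouteHealthTracker:
    # Hypothetical thresholds; the real option names are not given in these notes.
    degraded_after: int = 1
    unhealthy_after: int = 3
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> RouteHealth:
        """Update the failure streak with one check result and return the new state."""
        if check_passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.state

    @property
    def state(self) -> RouteHealth:
        if self.consecutive_failures >= self.unhealthy_after:
            return RouteHealth.UNHEALTHY
        if self.consecutive_failures >= self.degraded_after:
            return RouteHealth.DEGRADED
        return RouteHealth.HEALTHY

    @property
    def should_evict(self) -> bool:
        # Only unhealthy routes are candidates for eviction in this sketch.
        return self.state is RouteHealth.UNHEALTHY
```

In this sketch a single successful check resets the streak, so a degraded route returns to healthy without being evicted.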
Session and Resource Management
Parallelized session termination for improved performance and added automatic cleanup for sessions associated with lost agents. Also introduced an async file deletion API for vfolders.
- Parallelize session termination and add lost agent cleanup (#6826)
- Implement async file deletion API in vfolder (#6861)
Infrastructure and Configuration
Added flexible bind/advertise address configuration for app-proxy components and restructured Valkey client by separating monitor and operation clients for better resource management.
- Add read committed transaction support in `ExtendedAsyncSAEngine`, enabling higher throughput for read-heavy workloads by reducing transaction isolation overhead (#6665)
- Replace time() with Redis TIME command in ValkeyScheduleClient (#6695)
- Support multiple Apollo Router endpoints with load balancing (#6703)
- Support bind, advertised address configuration options for app-proxy coordinator and worker components (#6631)
- Separate monitor and operation clients in Valkey client (#6829)
- Ensure normal URLs are called even if the protocol is included in the host of `HostPortPair`, preventing network error in app-proxy communication (#6813)
- delete-dev.sh now supports interactive confirmation and non-interactive -y/--yes flag (#6815)
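As an illustration of the `HostPortPair` fix (#6813), the following sketch shows one way to build a base URL that tolerates a scheme already embedded in the host value. The function name `build_base_url` and its `default_scheme` parameter are hypothetical and are not taken from the Backend.AI codebase.

```python
from urllib.parse import urlsplit


def build_base_url(host: str, port: int, default_scheme: str = "http") -> str:
    """Return scheme://host:port, accepting hosts with or without a scheme prefix."""
    if "://" in host:
        # The host already carries a protocol, e.g. "https://proxy.example.com".
        parts = urlsplit(host)
        scheme = parts.scheme
        hostname = parts.hostname or ""
    else:
        scheme, hostname = default_scheme, host
    return f"{scheme}://{hostname}:{port}"
```

Without such normalization, naive string formatting would produce a malformed URL such as `http://https://proxy.example.com:10200`.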
Improvements
- Change Action Processor arguments to immutable types and make them contravariant to prevent memory leaks and improve type safety (#6596)
- Introduce Source-based structure in `AuthRepository` decoupling database access for easier testing (#6641)
- Make `error_code` method in `BackendAIError` as instance method making injection or modification of the error code from outside the class easier, improving flexibility when handling errors (#6722)
- Move kernel registry ownership to agent runtime (#6730)
- Use resources functions directly in AbstractAgent (#6763)
Fixes
Session a...
25.15.3
Features
- Support bind, advertised address configuration options for app-proxy coordinator and worker components (#6631)
- Add missing and newly introduced fields to service field specifications (#6714)
- Parallelize session termination and add lost agent cleanup (#6826)
Fixes
- Add AppProxy setup and initialization workflow to the TUI installer (replacing WSProxy) (#6228)
- Fix Pydantic validation error from incorrect `slot_type` type in mock plugin (#6692)
- Fix domain admin users seeing vfolder hosts from projects they were not members of. They now only see hosts for projects they belong to (#6694)
- Model Card resolver now returns the actual error message when it fails, instead of showing a generic "Unknown error" string (#6702)
- Add missing advertise_address info in app proxy status response (#6772)
- Allow zero values in DecimalType conversion (#6783)
- Disallow dot('.') usage in model service name (#6800)
- Fix auto-scaling functionality for inference services when using framework-based scaling rules. Metrics collection and rule comparison logic have been corrected to ensure proper scaling behavior (#6801)
- Use the kernel’s occupied slots when calculating the agent’s resources (#6817)
- Explicitly wrap slot key with `SlotName()` to prevent validation failure when initializing `AgentInfo` (#6841)
- Apply http client pool in app proxy worker (#6851)
- Add missing cache invalidation for resource preset (#6852)
Full Changelog
Check out the full changelog until this release (25.15.3).
Full Commit Logs
Check out the full commit logs between release (25.15.2) and (25.15.3).
25.17.0rc3
Features
- Add support for array of tables syntax in config sample generator (#6311)
- Add support for multiple agents in agent server config (#6315)
- Update Agent server RPC functions to include agent ID for agent runtime with multiple agents (#6320)
- Add read committed transaction support in `ExtendedAsyncSAEngine`, enabling higher throughput for read-heavy workloads by reducing transaction isolation overhead (#6665)
- Replace time() with Redis TIME command in ValkeyScheduleClient (#6695)
- Change agent config field names and serialization aliases to use internal-addr naming (#6697)
- Support multiple Apollo Router endpoints with load balancing (#6703)
- Add missing and newly introduced fields to service field specifications (#6714)
- Add AgentEtcdClientView for clean handling of etcd clients for multi agents (#6721)
- Add custom resource allocation in agent server config (#6724)
- Extract agent common resources to AgentRuntime (#6728)
- Implement health check infrastructure for component monitoring (#6732)
- Add health checker system for all components (#6736)
- Add missing `id`, `registry_id` fields to `ArtifactRegistry` GQL Node (#6750)
- Add dependency management system for manager (#6753)
- Add dependency verification system for web component (#6757)
- Add dependency verification for storage proxy (#6760)
- Move ownership of resources from agent to a separate component in agent runtime (#6766)
- Add dependency verification for App Proxy Coordinator and Worker (#6767)
- Add resource isolation options for multi-agent setup (#6770)
- Add dependency verification CLI in agent (#6775)
- Add dependency health checking infrastructure (#6781)
- Ensure normal URLs are called even if the protocol is included in the host of `HostPortPair`, preventing network error in app-proxy communication (#6813)
- delete-dev.sh now supports interactive confirmation and non-interactive -y/--yes flag. (#6815)
- Parallelize session termination and add lost agent cleanup (#6826)
- Separate monitor and operation clients in Valkey client (#6829)
- Add model service route synchronization to service discovery (#6832)
- Add periodic service discovery sync for model service routes (#6833)
- Store installed images information to Redis in Agent (#6834)
- Integrate HealthProbe across all components with real connectivity checks (#6836)
- Implement 3-state route health check with configurable eviction (#6839)
Improvements
- Introduce Source-based structure in `AuthRepository` decoupling database access for easier testing (#6641)
- Make `error_code` method in `BackendAIError` as instance method making injection or modification of the error code from outside the class easier, improving flexibility when handling errors (#6722)
- Move kernel registry ownership to agent runtime (#6730)
- Use resources functions directly in AbstractAgent (#6763)
Fixes
- Fix app proxy to properly handle redirect parameter in HTTP protocol auth flow by appending the redirect path to the generated proxy URL (#6686)
- Fix Pydantic validation error from incorrect `slot_type` type in mock plugin (#6692)
- Fix domain admin users seeing vfolder hosts from projects they were not members of. They now only see hosts for projects they belong to (#6694)
- Model Card resolver now returns the actual error message when it fails, instead of showing a generic "Unknown error" string (#6702)
- Fix storage proxy client to handle non-JSON error responses instead of crashing on parse failures (#6712)
- Support service-definition.toml override with optional fields (image, arch, resource) (#6751)
- Add missing advertise_address info in app proxy status response (#6772)
- Allow zero values in DecimalType conversion (#6783)
- Disallow dot('.') usage in model service name (#6800)
- Fix auto-scaling functionality for inference services when using framework-based scaling rules. Metrics collection and rule comparison logic have been corrected to ensure proper scaling behavior (#6801)
- Use the kernel’s occupied slots when calculating the agent’s resources (#6817)
- Make agent installed image sync to match by canonical name and architecture instead of digest, preventing digest change by container image driver (#6838)
- Explicitly wrap slot key with `SlotName()` to prevent validation failure when initializing `AgentInfo` (#6841)
Miscellaneous
- Move `EndpointLifecycle` enum to a shared common package for improved reusability (#6637)
- Add debug log when app proxy worker got server disconnected error making tracing and diagnosing unexpected disconnections easier (#6735)
Full Changelog
Check out the full changelog until this release (25.17.0rc3).
Full Commit Logs
Check out the full commit logs between release (25.17.0rc2) and (25.17.0rc3).
25.17.0rc2
Fixes
- Reservoir artifact import API response is blocking when using delegation (#6683)
Full Changelog
Check out the full changelog until this release (25.17.0rc2).
Full Commit Logs
Check out the full commit logs between release (25.17.0rc1) and (25.17.0rc2).
25.17.0rc1
Features
- Implement `artifact_verifier` type plugin in storage-proxy (#6258)
- Fix `limit`, `search` parameters not working in reservoir registry's `scan_artifact` API (#6488)
- Implement pickle based Kernel registry recovery which can replace existing kernel registry load and save functions (#6489)
- Separate DB source from artifact repository layers (#6490)
- Re-import available artifacts only when necessary based on digest (#6501)
- Add image purge/rescan background tasks and modernize task system (#6597)
- Improve bgtask infrastructure with repository pattern and type adapters (#6606)
- Add `extra` column to Artifact model to store gated information for `huggingface` models (#6620)
- Migrate commit session to retriable background task pattern (#6625)
- Support bind, advertised address configuration options for app-proxy coordinator and worker components (#6631)
- Implement notification system with channels, rules, and event processing (#6635)
- Implement notification center with REST/GraphQL APIs for managing channels and rules (#6653)
- Add notification validation API and notification CLI (#6657)
- Collect artifact verification results to `artifact_revisions` table (#6662)
- Track Artifact download progress through redis (#6663)
- Create artifact download progress query REST API (#6666)
- Add notification center with webhook channel support (#6668)
- Extend reservoir registry artifact import API to perform import delegation when the artifact is not available in the remote reservoir (#6672)
- Track Reservoir registry artifact download progress (#6673)
- Add `metadata` field to artifact verifier interface (#6676)
- Implement notification message type system and validation APIs (#6677)
Improvements
- Change Action Processor arguments to immutable types and make them contravariant to prevent memory leaks and improve type safety (#6596)
Fixes
- Update `gpu_allocated` legacy metric fields to consider all accelerator devices, including both `cuda.devices` and `cuda.shares`, but also MIG variants and other NPUs as well (Known issue: all resources visible to each user and group MUST use a consistent fraction mode) (#2404)
- Refresh VAST cluster info cache rather than keep the cache alive forever (#6428)
- Adjust reservoir download API client timeout and add proper connection termination handling (#6627)
- Remove `DoPullReservoirRegistryEvent`, and the event handler (#6680)
Documentation Updates
- Add Entry Point, Event, and Background Task architecture documentation (#6594)
- Document adapter and Querier patterns in API/GraphQL/Repository READMEs (#6656)
Full Changelog
Check out the full changelog until this release (25.17.0rc1).
Full Commit Logs
Check out the full commit logs between release (25.16.0) and (25.17.0rc1).
25.15.2
Features
- Support project name filter when resolving user nodes (#6298)
Fixes
- Fix GQL agent_summary_list resolver (#6389)
- Add timestamp tracking for route health status to enable staleness detection. Route health checks now store both status and check timestamp, automatically marking health data older than 5 minutes as unhealthy. This prevents routing traffic to routes with stale health information and improves overall system reliability. (#6423)
- Refresh VAST cluster info cache rather than keep the cache alive forever (#6428)
- Change keypair query to include keypairs of users with no group membership (#6455)
- Resolve deadlock occurring due to incorrect use of semaphore in specific image rescan scenarios (#6469)
- Artifact revision not found error caused by `get_artifact_revision_readme` (#6485)
- Add missing `PRE_ENQUEUE_HOOK` and `POST_ENQUEUE_HOOK` call in Sokovan scheduler (#6584)
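The staleness rule from the route health fix (#6423) can be sketched as follows. Only the 5-minute limit comes from the release note; the `RouteHealthRecord` type and the `is_route_usable` function are hypothetical names used for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional

STALENESS_LIMIT_SEC = 5 * 60  # from the release note: data older than 5 minutes


@dataclass(frozen=True)
class RouteHealthRecord:
    healthy: bool
    checked_at: float  # Unix timestamp of the last health check


def is_route_usable(record: RouteHealthRecord, now: Optional[float] = None) -> bool:
    """Treat a route as unhealthy when its last check is stale or reported failure."""
    now = time.time() if now is None else now
    if now - record.checked_at > STALENESS_LIMIT_SEC:
        return False  # stale data is never trusted, whatever status was stored
    return record.healthy
```

Storing the check timestamp next to the status is what makes this possible: a status on its own cannot tell a fresh result from one left behind by a checker that has stopped running.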
Miscellaneous
- Expose `SESSION_PRIORITY_*` constants from the manager package to the common package for consistent priority handling across components (#6459)
Full Changelog
Check out the full changelog until this release (25.15.2).
Full Commit Logs
Check out the full commit logs between release (25.15.1) and (25.15.2).
25.16.0
Features
Resilience Framework
- Add resilience framework with Policy interface and Resilience executor for composable fault-tolerance patterns (#6203)
- Applied resilience framework to all Valkey clients with ContextVar-based operation tracking, replacing legacy decorators with composable policies (#6205)
- Apply resilience framework (MetricPolicy + RetryPolicy with exponential backoff) to all manager repositories for improved observability and fault tolerance (#6210)
- Apply Resilience framework to agent client with exponential backoff retry policy for improved reliability in agent RPC communications (#6211)
- Apply Resilience framework to appproxy (wsproxy) client with exponential backoff retry policy for improved reliability in app proxy communications (#6213)
- Apply Resilience framework to storage proxy client with exponential backoff retry policy for improved reliability in storage proxy communications (#6214)
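A retry policy with exponential backoff, as mentioned in the items above, might look roughly like the sketch below. It is not the actual Policy interface from the resilience framework; the class shape, parameter names, and defaults are assumptions chosen to show the backoff arithmetic.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RetryPolicy:
    """Hypothetical retry policy; names and defaults are illustrative only."""

    def __init__(
        self, max_attempts: int = 3, base_delay: float = 0.1, max_delay: float = 5.0
    ) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay

    def delay_for(self, attempt: int) -> float:
        # The delay doubles with every failed attempt and is capped at max_delay.
        return min(self.base_delay * (2 ** attempt), self.max_delay)

    async def execute(self, op: Callable[[], Awaitable[T]]) -> T:
        for attempt in range(self.max_attempts):
            try:
                return await op()
            except Exception:
                if attempt == self.max_attempts - 1:
                    raise  # attempts exhausted: surface the last error
                await asyncio.sleep(self.delay_for(attempt))
        raise AssertionError("unreachable")
```

A production policy would normally retry only on errors known to be transient and add jitter to the delay. Both are left out here for brevity.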
Artifact and Reservoir Registry
- Implement `delegate_import` GQL mutation to trigger remote reservoir registry import (#6221)
- Implement periodic polling and synchronization for remote reservoir (#6222)
- Introduce artifact import pipeline to enable customizable storage configuration for each stage (#6223)
- Add `vfs_storages` DB table, and GQL model and mutations (#6229)
- Support artifact download from the reservoir registry's `vfs_storage` (#6236)
GraphQL and Real-time Updates
- Replace Apollo Router with Hive Router (MIT License) to enable GQL Subscription support in federated environments without licensing fees, and add GQL Subscription support in webserver and backend for real-time updates (#6234)
- Add GraphQL subscription `schedulingEventsBySession` for real-time session scheduling events (#6239)
- Add GraphQL subscription `backgroundTaskEvents` for real-time background task events (#6243)
- Support project name filter when resolving user nodes (#6298)
RBAC and Access Control
- Add Global scope to simplify access control management for superadmin and monitor users (#6004)
- Add superadmin and monitor role fixtures and migrate existing admin and monitor data to RBAC DB (#6006)
Session Scheduling
- Implement scaling group filtering with injectable rules for session scheduling. The new ScalingGroupFilter applies configurable filter rules (public/private access, session type support) to determine eligible scaling groups before session creation, replacing the previous validation-only approach. This enables more flexible and extensible scaling group selection logic. (#6424)
- Remove redundant agent count validation for multi-node cluster sessions (#6276)
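The idea of filtering scaling groups with injectable rules can be sketched as below. The data classes, rule functions, and `SimpleScalingGroupFilter` are hypothetical stand-ins and do not mirror the real `ScalingGroupFilter` API. They only illustrate applying a configurable list of rules (public/private access, session type support) before session creation.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Sequence


@dataclass(frozen=True)
class ScalingGroup:
    name: str
    is_public: bool
    allowed_session_types: FrozenSet[str]


@dataclass(frozen=True)
class SessionRequest:
    session_type: str
    allow_private: bool = False


# A rule decides whether one scaling group is eligible for one request.
Rule = Callable[[ScalingGroup, SessionRequest], bool]


def public_access_rule(sg: ScalingGroup, req: SessionRequest) -> bool:
    return sg.is_public or req.allow_private


def session_type_rule(sg: ScalingGroup, req: SessionRequest) -> bool:
    return req.session_type in sg.allowed_session_types


class SimpleScalingGroupFilter:
    def __init__(self, rules: Sequence[Rule]) -> None:
        self._rules = tuple(rules)

    def filter(
        self, groups: Iterable[ScalingGroup], request: SessionRequest
    ) -> List[ScalingGroup]:
        # A group is eligible only if every injected rule accepts it.
        return [sg for sg in groups if all(rule(sg, request) for rule in self._rules)]
```

Because the rules are passed in rather than hard-coded, a new eligibility condition can be added by writing one more function, without modifying the filter itself.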
Model Service and Runtime
- Add support for Modular MAX and SGLang runtime variants (#6237)
- Implement API layer for model deployment with pagination and filtering support (#6394)
Authentication
- Implement JWT authentication module for GraphQL Federation (#6410)
- Apply JWT authentication to webserver and manager (#6421)
Configuration Management
- Add domain-level app configuration GraphQL API (#6295)
- Add domain-level app configuration support for frontend (#6401)
- Add `allow-auto-quota-scope-creation` configuration option to control whether virtual folders can be created in quota scopes that don't yet exist. Administrators can now prevent automatic quota scope creation by disabling this option, requiring quota scopes to be explicitly created before use. (#6263)
- Container commit timeout can now be customized through configuration settings (#6346)
Web and Proxy
- Implement WebSocket connectionParams to HTTP headers conversion (#6379)
- Add connection error handling to storage proxy client (#6319)
Metrics and Monitoring
- Add error code labels to layer operation metrics (#6322)
Improvements
- Integrate Pydantic validators with Agent server configuration (#6172)
- Introduce Service, Repository layer pattern in `Image` GQL Object resolvers (#6238)
- Set Image ID as Redis key instead of Canonical when storing installed agents per image (#6249)
- Introduce action-processor pattern to `Image` batch load resolvers for decoupling Redis and DB access in GQL API layer (#6269)
- Remove unused agent life cycle event handler as it is replaced by `handle_agent_terminated` and `handle_agent_started` in `AgentEventHandler` (#6299)
- Replace the use of the `raw_labels` field in the Image GQL object with the existing `labels` field (#6309)
- Introduced separate DTOs for permission groups (`PermissionGroupExtendedData`, `PermissionGroupData`) to conditionally load relationship data and prevent lazy loading errors (#6451)
- Add Kernel registry recovery abstract class for refactor Agent by detaching kernel registry load and save logic (#6482)
Fixes
Session Management
- Raise correct exception in `SessionTransitionData.main_kernel` (#6202)
- Improve the performance of VFolder queries by optimizing group lookups, resulting in faster VFolder detail panel popups and quicker session creation page loading (#6300)
- Session commits now fail immediately with a clear error when attempting to exceed quota limits. This improves user experience by providing faster feedback instead of attempting a commit that would ultimately fail (#6304)
- Prevent re-terminating sessions already in terminal states (#6353)
- Session type validation is now properly enforced when creating sessions within scaling groups (#6354)
- Add missing `PRE_ENQUEUE_HOOK` and `POST_ENQUEUE_HOOK` call in Sokovan scheduler (#6584)
Agent and Resource Management
- Align `occupied_slots` values between Agent summary and Agent type queries (#6257)
- Decoupled agent batch loading resolvers that previously depended on the `from_row` function. (#6280)
- Fix GQL agent_summary_list resolver (#6389)
Storage and Configuration
- Fix logging directory not auto-generated (#6225)
- Clients can now correctly fetch null quota scopes when a scope does not exist, instead of encountering an error (#6227)
- Fix a permission issue where Storage Proxy would fail to access its glide socket after changing process uid/gid. Users running Storage Proxy with non-root credentials will no longer encounter "Permission Denied" errors during startup. (#6284)
- Fix an issue where the Storage Proxy would unnecessarily attempt to create an XFS backend lockfile at `/tmp/backendai-xfs-file-lock` even when XFS storage was not configured. This could cause permission errors if the Storage Proxy lacked access to the `/tmp` directory, preventing the service from starting properly. (#6285)
- Prevent artifact deletion when download and archive storage are identical (#6330)
- Fixed a timing issue in vfolder deletion by ensuring the DELETE-ONGOING status is set before sending the storage deletion request (#6486)
Artifact and Reservoir Registry
- Cleanup previous stages' files in artifact import pipeline (#6251)
- `ReservoirDownloadStep` fails with connection reset error when artifact size is too large in `vfs_storage` type remote reservoir registry (#6254)
- Artifact revision not found error caused by `get_artifact_revision_readme` ([#6485](https://...
25.16.0rc3
Features
- Add Global scope to simplify access control management for superadmin and monitor users (#6004)
- Add superadmin and monitor role fixtures and migrate existing admin and monitor data to RBAC DB (#6006)
- Add `allow-auto-quota-scope-creation` configuration option to control whether virtual folders can be created in quota scopes that don't yet exist. Administrators can now prevent automatic quota scope creation by disabling this option, requiring quota scopes to be explicitly created before use. (#6263)
- Remove redundant agent count validation for multi-node cluster sessions (#6276)
- Add domain-level app configuration GraphQL API (#6295)
- Support project name filter when resolving user nodes (#6298)
- Add connection error handling to storage proxy client (#6319)
- Add error code labels to layer operation metrics (#6322)
- Container commit timeout can now be customized through configuration settings (#6346)
- Implement WebSocket connectionParams to HTTP headers conversion (#6379)
- Implement API layer for model deployment with pagination and filtering support (#6394)
- Add domain-level app configuration support for frontend (#6401)
- Implement JWT authentication module for GraphQL Federation (#6410)
- Apply JWT authentication to webserver and manager (#6421)
- Implement scaling group filtering with injectable rules for session scheduling. The new ScalingGroupFilter applies configurable filter rules (public/private access, session type support) to determine eligible scaling groups before session creation, replacing the previous validation-only approach. This enables more flexible and extensible scaling group selection logic. (#6424)
Improvements
- Introduce Service, Repository layer pattern in `Image` GQL Object resolvers (#6238)
- Set Image ID as Redis key instead of Canonical when storing installed agents per image (#6249)
- Introduce action-processor pattern to `Image` batch load resolvers for decoupling Redis and DB access in GQL API layer (#6269)
- Remove unused agent life cycle event handler as it is replaced by `handle_agent_terminated` and `handle_agent_started` in `AgentEventHandler` (#6299)
- Replace the use of the `raw_labels` field in the Image GQL object with the existing `labels` field (#6309)
- Introduced separate DTOs for permission groups (`PermissionGroupExtendedData`, `PermissionGroupData`) to conditionally load relationship data and prevent lazy loading errors (#6451)
- Add Kernel registry recovery abstract class for refactor Agent by detaching kernel registry load and save logic (#6482)
Fixes
- Add AppProxy setup and initialization workflow to the TUI installer (replacing WSProxy) (#6228)
- Upgrade aiotools (1.9.2 -> 2.2.3) for improved structured concurrency and refactor server initialization and shutdown procedures to avoid excessive exception stack traces but display only the exact error cleanly (#6250)
- Align `occupied_slots` values between Agent summary and Agent type queries (#6257)
- Fix typo in apollo router config serialization alias (`sapollo-router` -> `apollo-router`) (#6266)
- Decoupled agent batch loading resolvers that previously depended on the `from_row` function. (#6280)
- Fix a permission issue where Storage Proxy would fail to access its glide socket after changing process uid/gid. Users running Storage Proxy with non-root credentials will no longer encounter "Permission Denied" errors during startup. (#6284)
- Fix an issue where the Storage Proxy would unnecessarily attempt to create an XFS backend lockfile at `/tmp/backendai-xfs-file-lock` even when XFS storage was not configured. This could cause permission errors if the Storage Proxy lacked access to the `/tmp` directory, preventing the service from starting properly. (#6285)
- Improve the performance of VFolder queries by optimizing group lookups, resulting in faster VFolder detail panel popups and quicker session creation page loading (#6300)
- Session commits now fail immediately with a clear error when attempting to exceed quota limits. This improves user experience by providing faster feedback instead of attempting a commit that would ultimately fail (#6304)
- Prevent artifact deletion when download and archive storage are identical (#6330)
- Fix model service extra mounts in client SDK to omit unset mount `type` fields, ensuring compatibility with the manager API (#6347)
- Prevent re-terminating sessions already in terminal states (#6353)
- Session type validation is now properly enforced when creating sessions within scaling groups (#6354)
- Fix GQL agent_summary_list resolver (#6389)
- Add timestamp tracking for route health status to enable staleness detection. Route health checks now store both status and check timestamp, automatically marking health data older than 5 minutes as unhealthy. This prevents routing traffic to routes with stale health information and improves overall system reliability. (#6423)
- Change keypair query to include keypairs of users with no group membership (#6455)
- Fix `VFolderAlreadyExists` HTTP status code from 400 Bad Request to 409 Conflict to match semantic meaning of resource conflicts (#6465)
- Resolve deadlock occurring due to incorrect use of semaphore in specific image rescan scenarios (#6469)
- Remove useless README not found warning log (#6473)
- Artifact revision not found error caused by `get_artifact_revision_readme` (#6485)
- Fixed a timing issue in vfolder deletion by ensuring the DELETE-ONGOING status is set before sending the storage deletion request (#6486)
Documentation Updates
- Add English documentation for Sokovan orchestration layer covering session scheduling, deployment management, and routing architecture. (#6446)
- Add comprehensive README documentation for Actions, Services, and Repositories layers in the manager, covering architecture patterns, design principles, resilience patterns with Prometheus metrics, and best practices for each layer. (#6449)
- Add comprehensive architecture and component documentation (#6466)
- Expand CONTRIBUTING.md with comprehensive pull request guidelines including workflow, sizing best practices, and review process. (#6481)
Miscellaneous
- Add warning logs in storage implementations when used or limit bytes are under 0 (#6359)
- Expose `SESSION_PRIORITY_*` constants from the manager package to the common package for consistent priority handling across components (#6459)
Full Changelog
Check out the full changelog until this release (25.16.0rc3).
Full Commit Logs
Check out the full commit logs between release (25.16.0rc2) and (25.16.0rc3).
25.15.1
Features
- Container commit timeout can now be customized through configuration settings (#6346)
Fixes
- Raise correct exception in `SessionTransitionData.main_kernel` (#6202)
- Fix logging directory not auto-generated (#6225)
- Decoupled agent batch loading resolvers that previously depended on the `from_row` function. (#6280)
- Fix a permission issue where Storage Proxy would fail to access its glide socket after changing process uid/gid. Users running Storage Proxy with non-root credentials will no longer encounter "Permission Denied" errors during startup. (#6284)
- Fix an issue where the Storage Proxy would unnecessarily attempt to create an XFS backend lockfile at `/tmp/backendai-xfs-file-lock` even when XFS storage was not configured. This could cause permission errors if the Storage Proxy lacked access to the `/tmp` directory, preventing the service from starting properly. (#6285)
- Improve the performance of VFolder queries by optimizing group lookups, resulting in faster VFolder detail panel popups and quicker session creation page loading (#6300)
- Session commits now fail immediately with a clear error when attempting to exceed quota limits. This improves user experience by providing faster feedback instead of attempting a commit that would ultimately fail (#6304)
- Fix model service extra mounts in client SDK to omit unset mount `type` fields, ensuring compatibility with the manager API (#6347)
- Prevent re-terminating sessions already in terminal states (#6353)
- Session type validation is now properly enforced when creating sessions within scaling groups (#6354)
Full Changelog
Check out the full changelog until this release (25.15.1).
Full Commit Logs
Check out the full commit logs between release (25.15.0) and (25.15.1).
25.16.0rc2
Features
- Add support for Modular MAX and SGLang runtime variants (#6237)
- Add GraphQL subscription `backgroundTaskEvents` for real-time background task events (#6243)
Fixes
- Cleanup previous stages' files in artifact import pipeline (#6251)
- `ReservoirDownloadStep` fails with connection reset error when artifact size is too large in `vfs_storage` type remote reservoir registry (#6254)
Documentation Updates
Full Changelog
Check out the full changelog until this release (25.16.0rc2).
Full Commit Logs
Check out the full commit logs between release (25.16.0rc1) and (25.16.0rc2).