Skip to content

Persist deployment chat history and replay as request context #11410

@jopemachine

Description

@jopemachine

Objective

The ./bai deployment chat CLI added in BA-5528 is currently stateless: every invocation is a one-shot request that loses prior turns, so multi-turn conversations against a deployed OpenAI-compatible inference endpoint (vLLM, SGLang, NIM, TGI in messages mode) cannot keep their context across CLI calls.

Persist per-deployment chat transcripts on the client side and automatically replay the most recent K messages as messages[] context on each chat-completions request, while keeping disk usage bounded and letting the user opt out or wipe the transcript.

Scope

  • Add ~/.backend.ai/deployment_chat/history.json storage, isolated from the existing cache.json (auto-managed endpoint metadata) and config.json (user-managed token/model).
  • Inject the most recent K user/assistant turns into the request messages[] array before the new user content; default K configurable via --history-limit, with K=0 skipping context for the turn.
  • Persist a round only when both user and assistant content are present, so the file never holds half-recorded conversations that would skew future context.
  • Cap persisted messages per deployment with FIFO truncation to bound disk usage.
  • Add ./bai deployment chat-history show / clear subcommands for inspection and reset.
  • Type the chat-completions response via a Pydantic model (ChatCompletionResponse) so the path used for history bookkeeping (choices[0].message.content) is strict-validated at the SDK boundary, while runtime-specific extras (usage, system_fingerprint, vLLM/NIM telemetry) ride through via extra=allow.

Acceptance Criteria

  • Two consecutive ./bai deployment chat <id> <text> calls share context (the second call sees the first round in messages[]).
  • --history-limit 0 skips context for that turn while still recording the round; chat-history clear <id> wipes the persisted transcript without touching cache or config.
  • Disk usage per deployment never exceeds the configured cap; corrupt or schema-mismatched history.json is ignored with a warning instead of crashing.
  • Tool-call-only or empty-choices responses do not corrupt history (the round is dropped); streaming-chunk shape (delta) raises ValidationError at the SDK boundary instead of silently passing through.
  • Unit tests cover storage round-trip, FIFO truncation, loader resilience, and the response model's assistant_content extraction including OpenAI-compat edge cases.

JIRA Issue: BA-5903

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No fields configured for Task.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions