Objective
The ./bai deployment chat CLI added in BA-5528 is currently stateless: every invocation is a one-shot request that loses prior turns, so multi-turn conversations against a deployed OpenAI-compatible inference endpoint (vLLM, SGLang, NIM, TGI in messages mode) cannot keep their context across CLI calls.
Persist per-deployment chat transcripts on the client side and automatically replay the most recent K messages as messages[] context on each chat-completions request, while keeping disk usage bounded and letting the user opt out or wipe the transcript.
Scope
- Add ~/.backend.ai/deployment_chat/history.json storage, isolated from the existing cache.json (auto-managed endpoint metadata) and config.json (user-managed token/model).
- Inject the most recent K user/assistant turns into the request messages[] array before the new user content; default K configurable via --history-limit, with K=0 skipping context for the turn.
- Persist a round only when both user and assistant content are present, so the file never holds half-recorded conversations that would skew future context.
- Cap persisted messages per deployment with FIFO truncation to bound disk usage.
- Add ./bai deployment chat-history show / clear subcommands for inspection and reset.
- Type the chat-completions response via a Pydantic model (ChatCompletionResponse) so the path used for history bookkeeping (choices[0].message.content) is strict-validated at the SDK boundary, while runtime-specific extras (usage, system_fingerprint, vLLM/NIM telemetry) ride through via extra=allow.
Acceptance Criteria
- Two consecutive ./bai deployment chat <id> <text> calls share context (the second call sees the first round in messages[]).
- --history-limit 0 skips context for that turn while still recording the round; chat-history clear <id> wipes the persisted transcript without touching cache or config.
- Disk usage per deployment never exceeds the configured cap; corrupt or schema-mismatched history.json is ignored with a warning instead of crashing.
- Tool-call-only or empty-choices responses do not corrupt history (the round is dropped); streaming-chunk shape (delta) raises ValidationError at the SDK boundary instead of silently passing through.
- Unit tests cover storage round-trip, FIFO truncation, loader resilience, and the response model's assistant_content extraction including OpenAI-compat edge cases.
JIRA Issue: BA-5903