Objective
The ./bai deployment chat CLI added in BA-5528 is currently stateless: every invocation is a one-shot request that loses prior turns, so multi-turn conversations against a deployed OpenAI-compatible inference endpoint (vLLM, SGLang, NIM, TGI in messages mode) cannot keep their context across CLI calls.
Persist per-deployment chat transcripts on the client side and automatically replay the most recent K messages as messages[] context on each chat-completions request, while keeping disk usage bounded and letting the user opt out or wipe the transcript.
Scope
- Add ~/.backend.ai/deployment_chat/history.json storage, isolated from the existing cache.json (auto-managed endpoint metadata) and config.json (user-managed token/model).
- Inject the most recent K user/assistant turns into the request messages[] array before the new user content; default K configurable via --history-limit, with K=0 skipping context for the turn.
- Persist a round only when both user and assistant content are present, so the file never holds half-recorded conversations that would skew future context.
- Cap persisted messages per deployment with FIFO truncation to bound disk usage.
- Add ./bai deployment chat-history show / clear subcommands for inspection and reset.
- Type the chat-completions response via a Pydantic model (ChatCompletionResponse) so the path used for history bookkeeping (choices[0].message.content) is strict-validated at the SDK boundary, while runtime-specific extras (usage, system_fingerprint, vLLM/NIM telemetry) ride through via extra=allow.
Acceptance Criteria
- Two consecutive ./bai deployment chat <id> <text> calls share context (the second call sees the first round in messages[]).
- --history-limit 0 skips context for that turn while still recording the round; chat-history clear <id> wipes the persisted transcript without touching cache or config.
- Disk usage per deployment never exceeds the configured cap; corrupt or schema-mismatched history.json is ignored with a warning instead of crashing.
- Tool-call-only or empty-choices responses do not corrupt history (the round is dropped); streaming-chunk shape (delta) raises ValidationError at the SDK boundary instead of silently passing through.
- Unit tests cover storage round-trip, FIFO truncation, loader resilience, and the response model's assistant_content extraction including OpenAI-compat edge cases.
JIRA Issue: BA-5903