Duplicate: https://github.com/orgs/home-assistant/discussions/1681 — for now, we are looking forward to wider adoption of the OpenResponses spec, so we don't have to maintain quirks for each provider.
Describe the feature
Home Assistant needs a built-in conversation integration that can connect to any OpenAI-compatible API endpoint — not just api.openai.com. The current OpenAI integration explicitly rejects third-party endpoints, and the Ollama integration speaks a different API format (/api/chat vs the standard /v1/chat/completions). Users running llama.cpp, vLLM, or any other server exposing the OpenAI-compatible API — which has become the de facto standard for LLM inference — are forced to rely on HACS custom components.
Supporting only Ollama as the local LLM path is increasingly short-sighted. Ollama is a convenience wrapper around llama.cpp. The serious local inference community has moved past it because llama.cpp directly offers capabilities Ollama cannot expose: separate multimodal vision encoder loading (--mmproj), independent key/value cache quantization (--cache-type-k, --cache-type-v), Flash Attention, and immediate access to experimental optimizations like TurboQuant — 3-bit KV cache compression presented at ICLR 2026, already running on Apple Silicon in community forks. These aren't niche features. They determine whether a 32B parameter model fits on consumer hardware at all.
llama.cpp has 85k+ GitHub stars and one of the most active open-source communities in AI. vLLM is the production GPU serving standard. Both expose the OpenAI-compatible API. Frigate 0.17 already supports llama.cpp this way, and 0.18 is adding a dedicated provider. Home Assistant is the odd one out in the self-hosted AI stack.
The integration would need: a configurable base URL, an optional API key (local servers typically don't require one), context size configuration (llama.cpp doesn't expose this in API responses), and tool calling support for Assist device control.
Use cases
Combined vision + voice on a single local server. I run Qwen3-VL 32B on a dedicated Apple M5 MacBook Air via llama.cpp with Metal GPU acceleration. This single model serves both Frigate security camera GenAI analysis (vision) and Home Assistant voice assistant duties (tool calling) — no cloud APIs, no subscriptions, no data leaving the network. Frigate connects natively via OPENAI_BASE_URL. Home Assistant cannot connect without a HACS workaround.
Running models Ollama can't serve. Multimodal vision-language models like Qwen3-VL require loading a separate vision encoder file alongside the main weights. llama.cpp handles this with --mmproj. Ollama has no clean mechanism for this, which is why the llama.cpp community runs these models directly.
Memory-constrained deployments. On a 32GB Apple Silicon machine, fitting a 32B model requires compressing the KV cache to 4-bit (--cache-type-k q4_0 --cache-type-v q4_0) and enabling Flash Attention. These are llama.cpp flags with no Ollama equivalent. When TurboQuant lands in mainline llama.cpp, it will push this further to 3-bit, opening 32B+ models to even more consumer hardware. Users on this cutting edge will never be Ollama users.
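A back-of-envelope calculation shows why cache precision decides what fits. The model shape below is an assumption (a generic 32B-class model with grouped-query attention), not Qwen3-VL's exact configuration, and the quantized figures ignore per-block scale overhead:

```python
# Rough KV cache sizing. Shape values are illustrative assumptions for a
# 32B-class GQA model; quantized sizes omit per-block scale overhead.

def kv_cache_gib(n_layers: int, n_ctx: int, n_kv_heads: int,
                 head_dim: int, bits_per_elem: float) -> float:
    # K and V each hold n_ctx * n_kv_heads * head_dim values per layer,
    # hence the factor of 2.
    elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return elems * bits_per_elem / 8 / 2**30

shape = dict(n_layers=64, n_ctx=32768, n_kv_heads=8, head_dim=128)
fp16 = kv_cache_gib(**shape, bits_per_elem=16)  # full precision
q4   = kv_cache_gib(**shape, bits_per_elem=4)   # --cache-type-k/v q4_0
q3   = kv_cache_gib(**shape, bits_per_elem=3)   # TurboQuant-style 3-bit
```

Under these assumptions a 32K context costs 8 GiB at fp16 but 2 GiB at 4-bit, and a further ~25% saving at 3-bit; on a 32 GB machine already holding 32B quantized weights, that difference is the whole margin.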
The broader local LLM community. Anyone running vLLM on an Nvidia GPU, llama.cpp on Apple Silicon or CPU, LocalAI, LM Studio's server mode, or any OpenAI-compatible inference backend hits this same wall. The OpenAI-compatible API is the lingua franca of local LLM deployment. Every major tool in the ecosystem speaks it — except Home Assistant.
Anything else?
Related discussion: #137087 — proposes a "LiteLLM Conversation" integration with virtual integrations for llama.cpp, OpenRouter, etc. That approach would solve this cleanly.
Relevant links:
llama.cpp (85k+ stars): https://github.com/ggml-org/llama.cpp
TurboQuant KV cache compression (ICLR 2026): ggml-org/llama.cpp#20969
Frigate's OpenAI-compatible provider docs: https://docs.frigate.video/configuration/genai/genai_config/#openai
HACS integration currently required as workaround: https://github.com/skye-harris/hass_local_openai_llm
Community thread requesting this: https://community.home-assistant.io/t/add-api-base-url-option-for-openai-integration-to-support-multiple-endpoints/737818