Summary
When proxying OpenAI-compatible `/v1/chat/completions` to a vLLM backend that serves Gemma 4 (and similar) multimodal models, the gateway’s request schema validation rejects several audio content shapes that vLLM accepts on a direct connection. Only `input_audio` (OpenAI-style) is accepted through the proxy today.
Motivation
- vLLM (e.g. Gemma 4 E4B with audio) successfully handles:
  - `{"type": "input_audio", "input_audio": {"data": "<base64>", "format": "wav"|"mp3"}}` — works through Envoy AI Gateway.
  - `{"type": "audio_url", "audio_url": {"url": "<https or data: URI>"}}` — works direct to vLLM, but the gateway responds with `400 — malformed request: failed to parse JSON for /v1/chat/completions`.
  - Hugging Face–style `{"type": "audio", "url": "..."}` — vLLM may return 501 / not implemented on the OpenAI path; lower priority unless upstream standardizes it.
- Clients and docs often use `audio_url` or remote URLs; aligning the gateway with vLLM reduces “works on localhost, fails behind ellm” surprises.
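For concreteness, the three content-part shapes above can be written as minimal JSON fixtures (a sketch; the base64 string and model name below are placeholders, not real values):

```python
import json

# OpenAI-style part: accepted both directly by vLLM and through the gateway.
input_audio_part = {
    "type": "input_audio",
    "input_audio": {"data": "UklGRg==", "format": "wav"},  # placeholder base64
}

# vLLM-style part: accepted by vLLM directly, rejected by the gateway today.
audio_url_part = {
    "type": "audio_url",
    "audio_url": {"url": "https://example.com/sample.wav"},
}

# Hugging Face-style part: vLLM's OpenAI entrypoint may reject this with 501.
hf_audio_part = {"type": "audio", "url": "https://example.com/sample.wav"}

# A minimal chat request embedding one of the audio parts.
request_body = {
    "model": "gemma-4-e4b",  # placeholder model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this clip."},
                audio_url_part,
            ],
        }
    ],
}

print(json.dumps(request_body, indent=2))
```

Posting `request_body` directly to vLLM succeeds; the same bytes through the gateway trigger the 400 described below.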
Desired behavior
- Parse and forward OpenAI-compatible chat messages whose `content` array includes `audio_url` parts matching the structure vLLM expects (nested `audio_url.url` for HTTPS or `data:audio/...;base64,...`).
- Optionally document which multimodal part types are supported per route/backend (OpenAI vs provider-specific).
- Clear 400 error messages that distinguish schema-validation failures from true JSON parse errors (where feasible).
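The intended acceptance rule can be sketched as follows (a Python stand-in for illustration only; the actual gateway is not Python, and the shapes are taken from the examples above):

```python
def validate_audio_part(part: dict) -> bool:
    """Return True if `part` is an audio content part the gateway should forward.

    Accepts both the OpenAI-style `input_audio` shape and the vLLM-style
    `audio_url` shape (HTTPS URL or data: URI).
    """
    if part.get("type") == "input_audio":
        inner = part.get("input_audio")
        return (
            isinstance(inner, dict)
            and "data" in inner
            and inner.get("format") in ("wav", "mp3")
        )
    if part.get("type") == "audio_url":
        inner = part.get("audio_url")
        if not isinstance(inner, dict):
            return False
        url = inner.get("url", "")
        return url.startswith("https://") or url.startswith("data:audio/")
    # Hugging Face-style {"type": "audio", ...} deliberately not accepted here.
    return False
```

The point is that `audio_url` parts pass the same structural check as `input_audio` parts rather than being rejected before reaching the backend.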
Current behavior (observed)
- `input_audio` + base64: 200 through gateway → vLLM.
- `audio_url` (public URL or data URI): 400 `malformed request: failed to parse JSON` at gateway.
- Same payloads 200 when posted directly to vLLM OpenAI server (same model).
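Since the rejected `audio_url` shape also covers data: URIs, here is how such a URI is typically built from raw bytes (a sketch; the 4-byte input is a truncated placeholder, not a valid WAV file):

```python
import base64

def to_audio_data_uri(audio_bytes: bytes, mime: str = "audio/wav") -> str:
    """Encode raw audio bytes as a URI of the form data:audio/...;base64,..."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Embed the data: URI in the vLLM-style content part that the gateway rejects.
part = {"type": "audio_url", "audio_url": {"url": to_audio_data_uri(b"RIFF")}}
print(part["audio_url"]["url"])  # data:audio/wav;base64,UklGRg==
```

Both this data: URI form and a plain HTTPS URL succeed against vLLM directly but fail at the gateway.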
Environment (reference)
- Backend: vLLM OpenAI API, Gemma 4 E4B–class model with `--limit-mm-per-prompt` including audio.
- Gateway: Envoy AI Gateway–style OpenAI route (e.g. NRP ellm).
Non-goals / follow-ups
- Full Hugging Face `type: "audio"` support is only worth it if the project wants parity with HF chat templates; vLLM’s OpenAI entrypoint may still reject it with 501.
Thank you for considering this — happy to help test or provide minimal JSON fixtures.