feat(openai): accept audio_url (and align multimodal parts) for vLLM chat completions

## Summary

When proxying OpenAI-compatible **`/v1/chat/completions`** to a **vLLM** backend that serves **Gemma 4** (and similar) multimodal models, the gateway’s request schema validation rejects several audio content shapes that **vLLM accepts** on a direct connection. Only **`input_audio`** (OpenAI-style) is accepted through the proxy today.

## Motivation

- **vLLM** (e.g. Gemma 4 E4B with audio) successfully handles:
  - `{"type": "input_audio", "input_audio": {"data": "<base64>", "format": "wav"|"mp3"}}` — works through Envoy AI Gateway.
  - `{"type": "audio_url", "audio_url": {"url": "<https or data: URI>"}}` — **works direct to vLLM**, but the gateway responds with **`400` — `malformed request: failed to parse JSON for /v1/chat/completions`**.
  - Hugging Face–style `{"type": "audio", "url": "..."}` — vLLM may return **501** / not implemented on the OpenAI path; lower priority unless upstream standardizes it.

- Clients and docs often use **`audio_url`** or remote URLs; aligning the gateway with vLLM reduces “works on localhost, fails behind ellm” surprises.

## Desired behavior

1. **Parse and forward** OpenAI-compatible chat messages whose `content` array includes **`audio_url`** parts matching the structure vLLM expects (nested `audio_url.url` for HTTPS or `data:audio/...;base64,...`).
2. Optionally document which multimodal part types are supported per route/backend (OpenAI vs provider-specific).
3. Clear **400** messages that distinguish schema validation from true JSON parse errors (if feasible).

## Current behavior (observed)

- **`input_audio`** + base64: **200** through gateway → vLLM.
- **`audio_url`** (public URL or data URI): **400** `malformed request: failed to parse JSON` at gateway.
- Same payloads **200** when posted directly to vLLM OpenAI server (same model).

## Environment (reference)

- Backend: vLLM OpenAI API, Gemma 4 E4B–class model with `--limit-mm-per-prompt` including audio.
- Gateway: Envoy AI Gateway–style OpenAI route (e.g. NRP `ellm`).

## Non-goals / follow-ups

- Full Hugging Face **`type: "audio"`** support is only worth it if the project wants parity with HF chat templates; vLLM’s OpenAI entrypoint may still reject it with 501.

Thank you for considering this — happy to help test or provide minimal JSON fixtures.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(openai): accept audio_url (and align multimodal parts) for vLLM chat completions #2035

Summary

Motivation

Desired behavior

Current behavior (observed)

Environment (reference)

Non-goals / follow-ups

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

feat(openai): accept audio_url (and align multimodal parts) for vLLM chat completions #2035

Description

Summary

Motivation

Desired behavior

Current behavior (observed)

Environment (reference)

Non-goals / follow-ups

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions