Skip to content

feat(openai): accept audio_url (and align multimodal parts) for vLLM chat completions #2035

@groundsada

Description

@groundsada

Summary

When proxying OpenAI-compatible /v1/chat/completions to a vLLM backend that serves Gemma 4 (and similar) multimodal models, the gateway’s request schema validation rejects several audio content shapes that vLLM accepts on a direct connection. Only input_audio (OpenAI-style) is accepted through the proxy today.

Motivation

  • vLLM (e.g. Gemma 4 E4B with audio) successfully handles:

    • {"type": "input_audio", "input_audio": {"data": "<base64>", "format": "wav"|"mp3"}} — works through Envoy AI Gateway.
    • {"type": "audio_url", "audio_url": {"url": "<https or data: URI>"}}works direct to vLLM, but the gateway responds with 400malformed request: failed to parse JSON for /v1/chat/completions.
    • Hugging Face–style {"type": "audio", "url": "..."} — vLLM may return 501 / not implemented on the OpenAI path; lower priority unless upstream standardizes it.
  • Clients and docs often use audio_url or remote URLs; aligning the gateway with vLLM reduces “works on localhost, fails behind ellm” surprises.

Desired behavior

  1. Parse and forward OpenAI-compatible chat messages whose content array includes audio_url parts matching the structure vLLM expects (nested audio_url.url for HTTPS or data:audio/...;base64,...).
  2. Optionally document which multimodal part types are supported per route/backend (OpenAI vs provider-specific).
  3. Clear 400 messages that distinguish schema validation from true JSON parse errors (if feasible).

Current behavior (observed)

  • input_audio + base64: 200 through gateway → vLLM.
  • audio_url (public URL or data URI): 400 malformed request: failed to parse JSON at gateway.
  • Same payloads 200 when posted directly to vLLM OpenAI server (same model).

Environment (reference)

  • Backend: vLLM OpenAI API, Gemma 4 E4B–class model with --limit-mm-per-prompt including audio.
  • Gateway: Envoy AI Gateway–style OpenAI route (e.g. NRP ellm).

Non-goals / follow-ups

  • Full Hugging Face type: "audio" support is only worth it if the project wants parity with HF chat templates; vLLM’s OpenAI entrypoint may still reject it with 501.

Thank you for considering this — happy to help test or provide minimal JSON fixtures.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions