Add Streaming Audio Support for TTS Entities #2277
**Describe the feature**

Add real-time streaming audio support for TTS (Text-to-Speech) engines in the voice assistant pipeline. Currently, all TTS providers must buffer the complete audio file before playback can begin, adding significant latency. Many modern TTS APIs (Cartesia, ElevenLabs, OpenAI) support streaming responses where audio chunks arrive progressively, but Home Assistant's TTS framework doesn't support this. This feature would allow TTS integrations to stream audio chunks as they arrive from the API, enabling the voice assistant to start speaking immediately without waiting for the entire response to be generated.

**Example commands**

"Tell me about the history of Prague" (long response = greatest latency improvement)

Current behavior: 3-5 second delay before speech starts (waiting for the complete audio).

**Use cases**

1. More Natural Conversations
2. Better User Experience for Long Responses
3. Enable Modern TTS Providers
4. Reduced Hardware Requirements
5. Competitive with Commercial Assistants

**Anything else?**

- Language considerations:
- Hardware setups:
- Integration with other voice services:
- Technical implementation notes:
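To make the latency argument above concrete, here is a minimal sketch contrasting buffered playback (wait for the whole clip) with streaming playback (hand each chunk to the player as it arrives). This is illustrative only, not Home Assistant code: `fetch_tts_stream`, the chunk sizes, and the timings are all hypothetical stand-ins for a real streaming TTS API.

```python
import asyncio
from typing import AsyncIterator

CHUNK = b"\x00" * 320  # hypothetical chunk of 16-bit 16 kHz mono audio


async def fetch_tts_stream(text: str) -> AsyncIterator[bytes]:
    """Stand-in for a streaming TTS API: yields chunks as they are synthesized."""
    for _ in range(5):
        await asyncio.sleep(0.01)  # simulated per-chunk synthesis time
        yield CHUNK


async def play_buffered(text: str) -> tuple[float, bytes]:
    """Current behavior: the full clip must arrive before playback starts."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    audio = b"".join([chunk async for chunk in fetch_tts_stream(text)])
    first_audio_delay = loop.time() - start  # playback can only begin now
    return first_audio_delay, audio


async def play_streaming(text: str) -> tuple[float, bytes]:
    """Proposed behavior: forward each chunk to the player as it arrives."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    first_audio_delay = 0.0
    played = b""
    async for chunk in fetch_tts_stream(text):
        if not played:
            first_audio_delay = loop.time() - start  # speech starts at chunk 1
        played += chunk  # in practice: write the chunk to the media player
    return first_audio_delay, played


async def main() -> tuple[float, float]:
    buffered_delay, a = await play_buffered("Tell me about the history of Prague")
    streaming_delay, b = await play_streaming("Tell me about the history of Prague")
    assert a == b  # same audio either way; only time-to-first-sound differs
    return buffered_delay, streaming_delay
```

By construction, the buffered path waits for all five synthesis steps before any sound can play, while the streaming path starts after the first, which is exactly the perceived-latency win the request describes.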
Replies: 2 comments 3 replies
Support for streaming the TTS response is already in core, but the TTS integrations need to adopt it; some already do (ElevenLabs, Nabu Cloud and Wyoming).
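For what "adopting" streaming might look like from an integration's side, here is a hedged sketch: a provider that exposes synthesis as an async iterator of audio chunks rather than one buffered clip. The class and method names (`StreamingTTSProvider`, `synthesize`, `collect`) are illustrative assumptions, not the actual Home Assistant TTS entity API; a real integration would yield chunks from a chunked HTTP or WebSocket response.

```python
import asyncio
from typing import AsyncIterator


class StreamingTTSProvider:
    """Hypothetical provider wrapping a streaming TTS HTTP API."""

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        # A real integration would open a chunked/streaming response here
        # (e.g. a provider's streaming endpoint) and yield each audio chunk
        # as soon as it arrives, instead of accumulating the whole clip.
        for word in text.split():
            yield word.encode() + b"\x00"  # placeholder "audio" per word


async def collect(provider: StreamingTTSProvider, text: str) -> list[bytes]:
    """Consumer draining the stream, standing in for the media player."""
    return [chunk async for chunk in provider.synthesize(text)]
```

The key design point is the interface shape: once the framework accepts an async iterator, providers that can only return complete audio can still yield it as a single chunk, which keeps the change backward compatible.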
This is a strong and well-explained feature request, especially in terms of improving perceived latency for voice responses. Allowing audio playback to start immediately instead of waiting several seconds would significantly enhance the natural flow of conversations.

Real-time streaming is already a proven approach in other ecosystems, where progressive delivery improves responsiveness and user satisfaction. A similar concept can be seen in media streaming add-ons like torrentio, where content becomes usable as soon as data starts arriving rather than after full buffering. Applying that same idea to TTS pipelines feels both logical and overdue.

The suggested backward-compatible design is also practical. Existing TTS providers could continue returning complete audio, while newer integrations take advantage of streaming capabilities. This would especially benefit lower-end hardware by reducing memory usage and improving overall performance.

Overall, adding streaming TTS support would meaningfully improve the voice assistant experience and help close the gap with commercial platforms that already offer near-instant speech responses.