Feature Description
Hey Braden, absolutely love Whispering! Been using it 10 hours a day for what feels like years.
In the meantime, I’ve been building a voice-to-text component for my own application and hit a massive performance breakthrough I wanted to share, hoping it might be useful as an optional toggle for your cloud users.
The Latency Benchmark: I noticed that Whispering (on my Dell XPS 17 hardware) currently can have a delay of roughly 0.5–1 second before recording actually starts, and another ~1 second after stopping before the transcription appears. In my custom implementation, recording starts with zero perceptible latency, and after releasing the hotkey, the final text appears in the input field in just a few hundred milliseconds.
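For reference, here is a minimal sketch of how one might measure that stop-to-text latency (the helper name is my own; `performance.now()` is a global in both browsers and recent Node versions):

```javascript
// Tiny stopwatch for measuring stop-to-text latency.
function makeLatencyTimer() {
  let t0 = 0;
  return {
    start() { t0 = performance.now(); },       // call when the hotkey is released
    stop() { return performance.now() - t0; }  // call when text lands in the input field
  };
}
```

Usage: call `start()` right before `mediaRecorder.stop()`, then `stop()` once the transcription text arrives, and log the elapsed milliseconds.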
Here are the two architectural tricks I used to achieve this "mind-reading" speed with the Groq API:
1. Direct Frontend Fetch (Bypassing Backend IPC)
Instead of passing the audio blob to a backend server (or through the Tauri/Rust bridge), my JavaScript executes a direct fetch to api.groq.com right inside the mediaRecorder.onstop event. This completely eliminates the IPC serialization overhead and backend routing delays.
2. Skipping FFmpeg (Browser WebM Is Already Compressed)
I saw the "Compress audio before transcription" toggle. With the browser recording backend, the native MediaRecorder already produces a highly compressed WebM (Opus) blob in RAM. Sending this blob directly to Groq skips all local disk I/O and FFmpeg processing time. For short dictations, running FFmpeg locally actually takes longer than just uploading the WebM blob!
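To illustrate, here is a rough sketch of the in-memory recording setup I mean (assuming a browser environment with MediaRecorder support; the helper names and mime-type candidate list are my own, not Whispering's code):

```javascript
// Pick the first container/codec the recorder supports, or "" to let the
// browser choose its default.
function pickMimeType(candidates, isSupported) {
  return candidates.find(isSupported) ?? "";
}

const CANDIDATES = ["audio/webm;codecs=opus", "audio/webm", "audio/ogg;codecs=opus"];

// Records mic audio straight into RAM as a compressed blob: no FFmpeg, no disk.
async function recordToBlob() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMimeType(CANDIDATES, (t) => MediaRecorder.isTypeSupported(t));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : {});
  const chunks = [];
  recorder.ondataavailable = (e) => { if (e.data.size > 0) chunks.push(e.data); };
  const stopped = new Promise((resolve) => { recorder.onstop = resolve; });
  recorder.start();
  return {
    stop: async () => {
      recorder.stop();
      await stopped;
      stream.getTracks().forEach((t) => t.stop());
      // Already Opus-compressed in memory; ready to upload as-is
      return new Blob(chunks, { type: recorder.mimeType || "audio/webm" });
    },
  };
}
```

The returned blob goes straight into the FormData upload below, with no intermediate file.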
Here is the exact vanilla JavaScript logic I use in my app to achieve this. It’s incredibly simple:
```javascript
// Inside your mediaRecorder.onstop callback (audioBlob is the recorded WebM blob)
const formData = new FormData();

// The browser's native WebM blob is already highly compressed and Groq-compatible
formData.append("file", audioBlob, "recording.webm");
formData.append("model", "whisper-large-v3-turbo");
formData.append("response_format", "verbose_json");
formData.append("temperature", "0.0"); // For deterministic results

// Direct fetch from the frontend to Groq, bypassing the backend entirely
const response = await fetch("https://api.groq.com/openai/v1/audio/transcriptions", {
  method: "POST",
  headers: { "Authorization": `Bearer ${YOUR_GROQ_KEY}` },
  body: formData,
});

const data = await response.json();
// data.text is ready to be pasted in ~200-300ms!
```
I completely understand that your current transcribe-rs architecture is beautifully designed to support local offline models and advanced features like silence trimming. However, maybe this "Direct Frontend Fetch" could be added as an opt-in "Fast-Path" setting specifically for users of cloud APIs like Groq who want absolute minimum latency.
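Concretely, the opt-in branch could look something like this sketch (all names here, the settings shape and the helper functions, are purely illustrative, not Whispering's actual API; the stubs stand in for the real implementations):

```javascript
// Stand-in for the direct frontend fetch to api.groq.com
async function transcribeViaDirectFetch(audioBlob, apiKey) {
  return { text: "(direct fetch result)", path: "fast" };
}

// Stand-in for the existing Tauri/transcribe-rs pipeline
async function transcribeViaBackend(audioBlob) {
  return { text: "(backend result)", path: "backend" };
}

// Route Groq cloud users through the fast path only when they opt in;
// everyone else keeps the existing backend path.
async function transcribe(audioBlob, settings) {
  const useFastPath = settings.cloudFastPath && settings.provider === "groq";
  return useFastPath
    ? transcribeViaDirectFetch(audioBlob, settings.groqApiKey)
    : transcribeViaBackend(audioBlob);
}
```

Since the toggle defaults to off, local models and features like silence trimming would be completely unaffected.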
Hope this insight is helpful for the project!
Relevant Platforms
All Platforms
How important is this feature to you?
Critical for my use case
Willing to Contribute?
No, but I can test it
Discord Link
No response
Checklist