I use the open source Handy [1] app with Parakeet V3 for STT when talking to coding agents and I’ve yet to see anything that beats this setup in terms of speed/accuracy. I get near instant transcription, and the slight accuracy drop is immaterial when talking to AIs that can “read between the lines”.
I tried incorporating this Voxtral C implementation into Handy but got very slow transcriptions on my M1 Max MacBook 64GB.
[1] https://github.com/cjpais/Handy
I’ll have to try the other implementations mentioned here.
Big fan of Salvatore's voxtral.c and flux2.c projects - hope they continue to get optimized, as it'd be great to have lean options without external deps. Unfortunately voxtral.c is currently too slow for real-world use (AMD 7800X3D with BLAS) when adding Voice Input support to llms-py [1].
In the end Omarchy's new support for voxtype.io provided the nicest UX, followed by Whisper.cpp, and despite being slower, OpenAI's Whisper is still a solid local transcription option.
Also very impressed with both the performance and price of Mistral's new Voxtral Transcription API [2] - really fast/instant and really cheap ($0.003/min), IMO best option in CPU/disk-constrained environments.
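For anyone evaluating it, here's a rough sketch of what a curl call can look like - the endpoint path and model name below are from memory rather than pulled from the docs, so treat them as assumptions and verify against [2]:
# endpoint path and model name are assumptions - check [2] for the current values
curl -s https://api.mistral.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -F file=@recording.wav \
  -F model=voxtral-mini-latest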
[1] https://llmspy.org/docs/features/voice-input
[2] https://docs.mistral.ai/models/voxtral-mini-transcribe-26-02
Hi! This model is great, but it is too big for local inference. Whisper medium ("base" IMHO is not usable for most things, and "large" is too large) is a better deal for many environments, even if the transcription quality is noticeably lower (and even if it does not have a real online mode). But... it's time for me to check the new Qwen 0.6 transcription model. If it works as well as their benchmarks claim, it could be the target for very serious optimizations and a no-deps inference chain designed from the start for CPU execution, not just for MPS, since many times you want to install such transcription systems on servers rented online via Hetzner and similar vendors. So I'm going to look at it next, and if it delivers, it will really be time for big optimizations targeting the Intel, AMD and ARM instruction sets specifically, potentially also considering 8-bit quants if the performance remains good.
Same experience here with Whisper: medium is often not good enough. The large-turbo model, however, is pretty decent, and on Apple silicon it's fast enough for real-time conversations. Adding the prompt parameter can also help with transcription quality, especially when using domain-specific vocabulary. In general Whisper.cpp is better at transcribing full phrases than at streaming.
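For example, a minimal sketch of passing domain vocabulary via whisper.cpp's initial prompt - binary and model file names depend on your build/download (newer builds call the binary whisper-cli), and the vocabulary string is just an illustration:
# bias the transcription toward domain terms via the initial prompt
./main -m models/ggml-large-v3-turbo.bin -f meeting.wav \
  --prompt "Kubernetes, kubelet, etcd, Istio, Envoy sidecar"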
And not to forget: for many use cases more than just English is needed. Unfortunately most STT/ASR and TTS models right now focus on English plus 0-10 other languages. Being able to add more languages or domain-specific vocabulary with reasonable effort would be a huge plus for any STT or TTS.
+1 for voxtype with the Whisper base model; it is quite fast and accurate.
One thing I keep looking for is transcribing while I'm talking. I feel like I need that visual feedback. Does voxtype support that?
(I wasn't able to find anything at a glance.)
Handy claims to have an overlay, but it seems to not work on my system.
Not sure how it works on other OSes, but in Omarchy [1] you hold down `Super + Ctrl + X` to start recording and release it to stop. While it's recording you'll see a red voice-recording icon in the top bar, so it's clear when it's recording.
Although as llms-py is a local web app, I had to build my own visual indicator [2], which also displays a red microphone next to the prompt when it's recording. It supports both tap on/off and hold-down recording modes. When using voxtype I'm just using the tool for transcription (i.e. not Omarchy's OS-wide dictation feature), like:
$ voxtype transcribe /path/to/audio.wav
If you're interested the Python source code to support multiple voice transcription backends is at: [3]
[1] https://learn.omacom.io/2/the-omarchy-manual/107/ai
[2] https://llmspy.org/docs/features/voice-input
[3] https://github.com/ServiceStack/llms/blob/main/llms/extensio...
Ah, the thing I really want is to see the words I'm speaking being transcribed as I speak them (i.e. in realtime). For some reason I rarely see that feature.
The more things change…
hahaha! plus ça change indeed.
(I keep coming back to this one so I've got half a dozen messages on HN asking for the exact same thing!).
It's a shame: Whisper is not great at actual streaming, but it's so prevalent that everyone uses it anyway.
I'm hoping one of these might become a de facto realtime standard so we can actually get our realtime streaming API (and yep, I'd be perfectly happy with something that just writes to stdout, but all the tools end up batching because it's simpler!)
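For what it's worth, the closest thing I know of to that stdout-streaming shape is whisper.cpp's stream example, which prints partial transcriptions to the console as you talk - the flags below are from memory, so double-check them against its README (newer builds name the binary whisper-stream):
./stream -m models/ggml-base.en.bin -t 8 --step 500 --length 5000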
I am using a window manager with Waybar. Voxtype can display a status icon on Waybar [1], it is enough for me to know what is going on.
[1] https://github.com/peteonrails/voxtype/blob/main/docs/WAYBAR...
This was a breeze to install on Linux. However, I haven't managed to get realtime transcription working yet, à la Whisper.cpp's stream or Moonshine.
--from-mic only supports Mac. I'm able to capture audio with ffmpeg, but adapting the ffmpeg example to use mic capture hasn't worked yet:
ffmpeg -f pulse -channels 1 -i 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin
It's possible my system is simply under spec for the default model.
I'd like to be able to use this with the voxtral-q4.gguf quantized model from here: https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf
I am interested in a way to capture audio not only from the mic but also from one of the monitor sources, so you could pipe the audio you are hearing from the web directly into one of these solutions for real-time transcription. Did anyone manage to do that?
I can, for example, capture audio from that with Audacity or OBS Studio and do it later, so it should be possible to do it in real time too assuming my machine can keep up.
Change `-i 1` to `-i default`, or to one of your monitor sources; look them up with `pactl list short sources`.
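E.g. something like this - the source name is hypothetical (substitute whatever pactl lists for you), and the 16 kHz mono raw-PCM output is my guess at what voxtral's --stdin expects, so adjust as needed:
pactl list short sources
# monitor sources end in ".monitor"; downmix to mono and pipe raw PCM into voxtral
ffmpeg -f pulse -i alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
  -ac 1 -ar 16000 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin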
Does it work if you use ffmpeg to feed it audio from a file? I personally would try file->ffmpeg->voxtral then mic->ffmpeg->file, and then try to glue together mic->ffmpeg->voxtral.
(But take with a grain of salt; I haven't tried it yet.)
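Concretely, the three stages I'd try, reusing the flags from the command quoted above (same caveats apply):
# 1. file -> ffmpeg -> voxtral: known-good audio, tests the decode/pipe path
ffmpeg -i sample.wav -ac 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin
# 2. mic -> ffmpeg -> file: tests the capture path; play it back to check it isn't silence
ffmpeg -f pulse -i default -ac 1 mic-test.wav
# 3. mic -> ffmpeg -> voxtral: glue the two together
ffmpeg -f pulse -i default -ac 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin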
Recording audio with FFMPEG, and transcribing a file that’s piped from FFMPEG both work.
Given that it took 19.64 mins to transcribe the 11 second sample wav, it’s possible I just didn’t wait long enough :)
Ah. In that case... Yeah. Is it using GPU, and does the whole model fit in your (V)RAM?
This is a CPU implementation only.
Oh, that's interesting. The readme talks about GPU acceleration on Apple Silicon and I didn't see anything explicit for other platforms, so I assumed it needed a GPU everywhere. But it does BLAS acceleration, which a web search seems to agree is just a CPU-optimized math library. That's great; it should really increase the places where it's useful :)
Funny, this and the Rust runtime implementation are neck and neck on the frontpage right now.
Cool project!
There is also a MLX implementation: https://github.com/awni/voxmlx
I'm very interested in speech to text - particularly for tricky dialects and specialized terminology - but I'm still confused about the best place to start in order to train models on a huge database of voice samples I own.
Any ideas from the HN crowd currently involved in speech-to-text models?
It seems so bizarre that we need a nearly 9 GB model to do something you could do over 20 years ago with ~200 MB.
From a cybersecurity perspective, this project is impressive not just for performance, but for transparency.
Finally, a plain and simple C lib to run open-weight LLMs?