If you want fast speech-to-text for Ubuntu, start with the CDNsun speech-to-text-for-ubuntu project on GitHub.
We think this matters because modern developer work is increasingly language-heavy. We write prompts, bug reports, architecture notes, issue summaries, and rough instructions all day. In that kind of workflow, voice input, voice typing, and push-to-talk dictation become useful only when they are fast enough to keep up.
That is the important shift here. This project is not just about speech recognition on Linux. It is about making local voice input on Ubuntu practical enough for everyday work.
Why speed matters more than feature count
For speech-to-text, usability is mostly latency.
A system can have reasonable accuracy and still feel bad if every utterance takes too long to process. In real desktop work, the difference between roughly two seconds and roughly four seconds is the difference between staying in flow and falling back to the keyboard.
That is why this release is interesting. The newer client-server architecture keeps the transcription engine loaded in memory, so each new request avoids repeated startup overhead. On ordinary CPU-only hardware, that can make dictation for Ubuntu desktop feel quick enough to use throughout the day.
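To see why a resident server helps, here is a minimal sketch of the cold-start versus warm-call difference. The model class and its load time are stand-ins, not the project's actual engine:

```python
import time

class FakeModel:
    """Stand-in for a real speech model; loading is the expensive part."""
    def __init__(self):
        time.sleep(0.2)  # simulate loading model weights from disk

    def transcribe(self, audio: bytes) -> str:
        return "transcribed text"

# Cold path: load the model for every request (one-shot behavior).
start = time.perf_counter()
FakeModel().transcribe(b"...")
cold = time.perf_counter() - start

# Warm path: load once, reuse for every request (client-server behavior).
model = FakeModel()
start = time.perf_counter()
model.transcribe(b"...")
warm = time.perf_counter() - start

print(cold > warm)  # every warm call skips the load entirely
```

The real model load is far slower than 0.2 seconds, which is exactly why paying it once per session instead of once per utterance changes how the tool feels.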
What changed in the new version
The project now uses a more practical local split:
- servers/key_listener.py
- servers/speech_to_text_server.py
- scripts/speech_to_text_client.py
The file list is not the point by itself. The important part is the behavior. The speech-to-text server stays running, the model stays loaded, and the client sends short recordings over a local Unix socket. That is what makes low-latency speech-to-text on Ubuntu much more realistic.
In other words, the architecture matters because it improves the experience, not because client-server is fashionable.
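The shape of that split can be sketched in a few lines. The socket path and the message framing below are illustrative only, not the project's actual protocol:

```python
import os
import socket
import threading

SOCKET_PATH = "/tmp/stt_demo.sock"  # illustrative path, not the project's

# The server binds a local Unix socket and stays resident, standing in
# for the long-running speech_to_text_server.py with its loaded model.
if os.path.exists(SOCKET_PATH):
    os.unlink(SOCKET_PATH)
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(SOCKET_PATH)
srv.listen(1)

def serve_once():
    """Accept one request, read the recording, answer with a transcript."""
    conn, _ = srv.accept()
    conn.recv(65536)                  # short recording from the client
    conn.sendall(b"recognized text")  # where the loaded model would answer
    conn.close()

threading.Thread(target=serve_once, daemon=True).start()

# Client side: connect, send audio, read the transcript back.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(SOCKET_PATH)
cli.sendall(b"\x00" * 1024)  # pretend this is recorded audio
reply = cli.recv(65536).decode()
cli.close()
print(reply)
```

Because both ends live on the same machine and the socket never leaves the kernel, the per-request overhead is essentially just the transcription itself.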
Why this fits developer workflows
A lot of developer work is now, at its core, text production under time pressure. We describe intent, explain constraints, sketch solutions, and give tools enough context to do useful work. In those moments, speaking can be faster than typing.
That is especially true when we do not need perfect prose on the first pass. For prompts, notes, rough drafts, and intermediate instructions, a good-enough transcription that arrives quickly is often more valuable than a slower system chasing perfect output.
That makes this a good fit for several common Linux use cases:
- speech-to-text for Ubuntu when you want local transcription instead of a cloud-only path
- voice typing on Linux when you want text inserted into the active app
- push-to-talk dictation when you prefer explicit control over always-on listening
- local voice input when speed, privacy, or offline-friendly operation matter
- dictate into any app on Linux when you want a general desktop workflow, not a single-purpose note app
Push-to-talk dictation is more useful than it sounds
One practical strength of this setup is that it supports a push-to-talk workflow instead of always-on listening. That matches real work better.
We press a button, say what we need, release it, and get text back. That is usually a better fit for prompts, short notes, commands, and message drafting than a permanently listening assistant.
The project also supports primary and secondary language or model paths. In practice, that lets us set up two different workflows, for example:
- one button for maximum speed
- one button for higher accuracy
or:
- one button for English
- one button for another working language
That flexibility makes the tool easier to keep using because real workflows are mixed and context-dependent.
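The two-button idea boils down to a small mapping from trigger key to transcription profile. The key names (F16/F17), model names, and languages below are examples for illustration, not the project's actual configuration:

```python
# Illustrative mapping from push-to-talk trigger to transcription settings.
# Key names and model identifiers are examples, not the project's config.
PROFILES = {
    "F16": {"model": "small-fast", "language": "en"},       # speed-first button
    "F17": {"model": "large-accurate", "language": "de"},   # accuracy or second language
}

def settings_for(trigger_key: str) -> dict:
    """Pick the transcription profile for the key that was held down."""
    return PROFILES.get(trigger_key, PROFILES["F16"])  # default to the fast path

print(settings_for("F17")["model"])
```

Keeping the choice at the trigger level means switching profiles costs nothing mid-session: you simply hold a different button.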
Clipboard behavior matters more than it seems
The optional clipboard-first behavior is a small detail with a big usability payoff.
If automatic typing into the active window gets interrupted by a focus change, losing the dictated text is frustrating. Copying the recognized text to the clipboard first gives us a recovery path. That makes the whole workflow feel more dependable.
Typing itself uses xdotool. In practice, clipboard support works on both X11 and Wayland, while automatic typing tends to be more reliable on X11. That is worth stating plainly because voice input on Ubuntu desktop depends not only on transcription quality, but also on how the Linux desktop handles simulated input.
KDE, GNOME, X11, and Wayland, briefly
For readers outside the Linux world, a short definition helps.
KDE and GNOME are desktop environments. They shape the overall desktop experience, including settings, panels, shortcuts, and window management.
X11 and Wayland are display server protocols. They affect how applications interact with the desktop, including clipboard behavior, focus handling, and input simulation.
That matters because a dictation workflow on Ubuntu is not only about speech recognition. It is also about how recognized text gets into the active application.
Getting started, the practical version
The basic setup has three moving parts:
- a key listener that reacts to the push-to-talk trigger
- a speech-to-text server that stays running locally
- a client that sends audio for transcription and returns text
If you want a straightforward Ubuntu setup, the flow looks like this.
First, install the system packages. For the most up-to-date package list, check the project README on GitHub before you paste this into a terminal.
sudo apt update
sudo apt install -y \
git \
python3 \
python3-venv \
libsndfile1 \
xdotool \
xclip \
wl-clipboard \
input-remapper \
alsa-utils \
python3-evdev \
evtest
Then clone the repository and create the Python virtual environment. The commands below use /home/david as the home directory; substitute your own.
cd /home/david
git clone https://github.com/CDNsun/speech-to-text-for-ubuntu.git
cd /home/david/speech-to-text-for-ubuntu
python3 -m venv /home/david/venv
/home/david/venv/bin/pip install --upgrade pip
/home/david/venv/bin/pip install -r /home/david/speech-to-text-for-ubuntu/requirements.txt
After that, start the local speech-to-text server.
/home/david/venv/bin/python3 /home/david/speech-to-text-for-ubuntu/servers/speech_to_text_server.py
And in another terminal, start the key listener.
sudo python3 /home/david/speech-to-text-for-ubuntu/servers/key_listener.py
From there, the remaining work is mostly configuration.

Configuration
- Use input-remapper to map a convenient key or mouse button to a trigger such as F16 or F17.
- Review the key listener configuration so the device path, trigger key, display-related settings, and client path match your machine.
- Review the server configuration so the language path, model choice, compute settings, and thread count match your hardware and priorities.
- Test dictation in a normal text field before treating it as part of your daily workflow.
That is the important operational idea: once the server is already running and the transcription engine is already loaded, voice-to-text on Ubuntu starts to feel like an input method instead of a batch process.
What this changes in daily work
The main gain is not just technical. It is behavioral.
When local speech-to-text feels slow, we hesitate before using it. We shorten what we wanted to say, or we skip it and type instead. Once the round-trip is fast enough, that hesitation drops away.
That is what makes this project useful. We can hold a button, dictate a prompt or note, and get usable text back quickly enough to stay in the flow of work. We can keep the workflow local. We can use clipboard-first behavior for safety. We can adapt it to KDE or GNOME, and to X11 or Wayland, with realistic expectations about how those environments differ.
It does not replace the keyboard. It gives Ubuntu users a faster option for the moments when speaking is the better input method.
Final thoughts
The interesting part of this update is that the workflow seems to cross a practical threshold. It is no longer only a Linux speech recognition demo. It is becoming a credible push-to-talk dictation setup for real Ubuntu desktop work.
If that matches the way you work, start with the repository here: CDNsun speech-to-text-for-ubuntu on GitHub.

