Voice Dictation on Linux

linux whisper fedora nvidia gpu speech-to-text dictation

Voice Dictation on Linux: From CPU Whisper to GPU-Accelerated Real-Time Speech-to-Text

2026-03-19

I wanted voice dictation for coding on my Fedora 44 GNOME Wayland system. Not cloud-based — fully local, private, and fast enough to feel real-time. This documents the full journey from first attempt to working solution, including the dead ends.

The Hardware

  • Laptop: Hybrid graphics — Intel UHD (TigerLake-H) + NVIDIA RTX 3050 Ti Mobile (4GB VRAM)
  • OS: Fedora 44, GNOME Wayland
  • NVIDIA driver: 590.48.01 (proprietary, via RPM Fusion)

Starting Point: ibus-speech-to-text (CPU)

Fedora ships ibus-speech-to-text, an IBus input method engine that uses pywhispercpp for Whisper inference. It integrates natively with GNOME’s input method switching (Super+Space). Out of the box it uses CPU inference.

The problem: Even on a modern i7, CPU inference with the small.en model takes ~8 seconds for a 6-second audio clip. You’d finish a sentence, wait, wait some more, and then the text would appear. Unusable for dictation.

The GPU Question: CUDA vs Vulkan

The RTX 3050 Ti supports both CUDA and Vulkan. The obvious choice seemed like CUDA, but:

                       CUDA                                      Vulkan
Extra packages needed  3-5 GB CUDA toolkit                       Nothing (driver includes it)
Setup complexity       Add NVIDIA repo, install toolkit,         Just works with the proprietary driver
                       match driver versions
whisper.cpp support    Mature                                    Mature (since ggml added Vulkan backend)

Decision: Vulkan. Simpler, lighter, no extra downloads. The NVIDIA proprietary driver already provides Vulkan support.

Vulkan Device Selection on Hybrid Laptops

On a hybrid Intel + NVIDIA laptop, the system exposes two Vulkan devices. By default, ggml picks device 0 — which is the Intel iGPU. To force the NVIDIA dGPU:

GGML_VK_VISIBLE_DEVICES=1

This hides the Intel GPU entirely from ggml’s perspective. Set it system-wide via ~/.config/environment.d/ so it’s available to all user-session processes.

Gotcha: systemd’s environment-d-generator does NOT follow directory symlinks. If ~/.config/environment.d/ itself is a symlink, the env var won’t load. The directory must be a real directory.
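A minimal setup sketch (the file name ggml.conf is my choice — systemd reads every *.conf file in the directory), including a guard against the symlink gotcha:

```shell
# Create environment.d as a real directory (not a symlink) and drop in
# the ggml device override. Any *.conf file name works here.
mkdir -p "$HOME/.config/environment.d"

cat > "$HOME/.config/environment.d/ggml.conf" <<'EOF'
GGML_VK_VISIBLE_DEVICES=1
EOF

# Guard against the gotcha above: the generator skips symlinked directories.
if [ -L "$HOME/.config/environment.d" ]; then
    echo "WARNING: environment.d is a symlink; the variable will not load" >&2
fi
```

Log out and back in (or restart the user session) for the variable to reach new processes.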

The Benchmark: CPU vs GPU

Built whisper.cpp v1.8.2 from source with Vulkan + SDL2 enabled, then benchmarked all model sizes on a 6-second audio clip (3 runs averaged):

Model      Size     CPU Time   GPU Time   Speedup
tiny.en    75 MB    1.1s       0.12s      9x
base.en    142 MB   2.3s       0.05s      46x
small.en   487 MB   8.0s       0.10s      78x
medium     1.5 GB   27.4s      0.26s      106x

The results are dramatic. The small.en model goes from borderline unusable (8 seconds — longer than the audio itself) to imperceptible (0.1 seconds). Even the medium model at 0.26s is well under the real-time threshold.

The speedup grows with model size because larger models expose more parallelism: the GPU spreads each large matrix multiplication across thousands of cores at once, while the CPU works through them with only a handful of SIMD lanes.

Dead End: Patching ibus-stt’s Shared Libraries

My first approach was to replace the pywhispercpp shared libraries that ibus-speech-to-text links against with Vulkan-enabled builds from whisper.cpp. This worked — ibus-stt ran on the GPU — but:

  1. Fragile: Any package update to pywhispercpp overwrites the .so files and you’re back to CPU
  2. Dangerous: Replacing libs while the process is running causes a SIGSEGV — the process has old libs memory-mapped while the files change on disk
  3. Maintenance burden: You’re fighting the package manager

Lesson: Don’t patch system packages. Build standalone.

The Final Architecture: whisper-stream + ydotool

Instead of patching the system, I built a self-contained pipeline:

keyboard shortcut (toggle)
  └→ whisper-stream (SDL2 mic capture → Vulkan GPU inference → text on stdout)
       └→ incremental diff (only new text, no repeats)
            └→ ydotool type (uinput keystroke injection → focused window)
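The "toggle" half of the pipeline above can be sketched as a shell function bound to the keyboard shortcut. The pidfile path and the DICTATION_CMD placeholder are my own names — substitute the real whisper-stream → diff → ydotool wrapper:

```shell
# Toggle function for a keyboard shortcut: first call starts the
# dictation pipeline in the background, second call stops it.
# DICTATION_CMD is a placeholder for the real pipeline command.
toggle_dictation() {
    local pidfile="${XDG_RUNTIME_DIR:-/tmp}/dictation.pid"
    local pipeline="${DICTATION_CMD:-sleep 600}"
    if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        # Running: stop dictation
        kill "$(cat "$pidfile")"
        rm -f "$pidfile"
        echo "stopped"
    else
        # Not running: start dictation, remember its PID
        $pipeline >/dev/null 2>&1 &
        echo $! > "$pidfile"
        echo "started"
    fi
}
```

The kill -0 check makes the toggle robust against a stale pidfile left behind by a crashed pipeline.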

whisper-stream

The whisper-stream example from whisper.cpp does real-time streaming inference — it captures audio from the microphone via SDL2, runs it through the Whisper model, and prints transcriptions to stdout.

In VAD mode (--step 0), it only transcribes when speech is detected. The output is structured:

[Start speaking]
### Transcription 0 START | t0 = 0 ms | t1 = 3709 ms
[00:00:00.000 --> 00:00:04.000]   Hello, this is a test.
### Transcription 0 END
### Transcription 1 START | t0 = 0 ms | t1 = 8493 ms
[00:00:00.000 --> 00:00:08.000]   Hello, this is a test. And here is more text.
### Transcription 1 END

Important detail: each transcription block contains the FULL sliding window, not just new text. Block 1 repeats everything from block 0 plus the new words. The wrapper script diffs consecutive blocks and only types the new suffix.
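Extracting the bare text from this format is straightforward, since only the transcription lines start with a timestamp range. A sketch of the filter the wrapper uses before diffing:

```shell
# Keep only the timestamped lines from whisper-stream's VAD-mode output
# and strip the "[hh:mm:ss.mmm --> hh:mm:ss.mmm]" prefix, leaving the
# bare transcription text. Marker lines (###, [Start speaking]) are dropped.
extract_text() {
    sed -n 's/^\[[0-9:.]* --> [0-9:.]*\][[:space:]]*//p'
}
```

Run live, this is simply inserted into the pipe: whisper-stream ... | extract_text | ...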

Why ydotool, Not wtype

wtype is the Wayland equivalent of xdotool type, but it requires the virtual-keyboard-v1 Wayland protocol. GNOME doesn’t support this protocol — it only works on wlroots-based compositors like Sway and Hyprland.

ydotool works at the kernel level via /dev/uinput, creating a virtual input device. It’s compositor-agnostic — works on GNOME, KDE, Sway, anything.

It requires a daemon (ydotoold) running with access to /dev/uinput.
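One common way to give a regular user access to /dev/uinput is a udev rule plus group membership. Group and file names here are conventions, not ydotool requirements — adjust to your distribution:

```
# /etc/udev/rules.d/99-uinput.rules  (assumes an "input" group exists)
KERNEL=="uinput", GROUP="input", MODE="0660", OPTIONS+="static_node=uinput"
```

Then add your user to the group (sudo usermod -aG input "$USER"), reload udev rules, and re-login before starting ydotoold.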

Why VAD Mode (--step 0) Is the Only Viable Piping Mode

With --step N (e.g., --step 1500), whisper-stream uses ANSI terminal escape codes (\e[2K — clear line) to do in-place terminal updates. This looks nice in a terminal but is completely unparseable in a pipe — you get raw escape sequences mixed with text and [BLANK_AUDIO] tokens flooding stdout during silence.

VAD mode (--step 0) produces clean, structured output with ### Transcription markers and only emits blocks when speech is actually detected. This is the only mode suitable for piping into another program.

The Sliding Window Repeat Problem

Because whisper-stream re-processes overlapping audio windows, each transcription block contains text that was already in the previous block. Without deduplication, the same sentence gets typed repeatedly.

The fix: track the previous block’s text, and when the new block starts with the same prefix, only type the new suffix. If the new block doesn’t start with the previous text, it’s treated as a fresh utterance.

The Missing Space Problem

When new text chunks are injected by ydotool, they append directly to whatever was previously typed with no separator. If you pause speaking and resume, the first word of the new chunk merges with the last word of the previous chunk. Fix: always prepend a space before each typed chunk.
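The two fixes above — suffix diffing and the always-prepended space — can be sketched as one shell function. The function name is mine; it sits between the block parser and ydotool type:

```shell
# Given the previous block's text and the new block's text, emit only
# what should be typed: the new suffix when the new block extends the
# previous one, otherwise the whole new block as a fresh utterance.
# Exactly one leading space is prepended so chunks don't merge.
new_text_to_type() {
    local prev="$1" new="$2" out
    if [ -n "$prev" ] && [ "${new#"$prev"}" != "$new" ]; then
        out="${new#"$prev"}"    # new block starts with prev: keep the suffix
    else
        out="$new"              # fresh utterance: keep everything
    fi
    # trim leading whitespace, then prepend exactly one space
    out="${out#"${out%%[![:space:]]*}"}"
    [ -n "$out" ] && printf ' %s' "$out"
}
```

In the live pipeline the result is handed straight to ydotool type.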

Bilingual: German + English

The .en models (e.g., ggml-small.en.bin) are English-only. The base models without the .en suffix support 99 languages including German and English. With --language auto, whisper-stream auto-detects the language per transcription block. Switching mid-conversation between German and English just works.

The medium model (1.5 GB, multilingual) handles both languages well. At 0.26s GPU inference it’s still comfortably real-time.
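A plausible bilingual invocation, for reference (the binary and model paths are from my build layout; check whisper-stream --help for the flags your build supports):

```
# multilingual model (no .en suffix), VAD mode, per-block language auto-detect
./build/bin/whisper-stream -m models/ggml-medium.bin --step 0 --language auto
```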

Building whisper-stream with Vulkan

git clone --branch v1.8.2 https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build \
  -DGGML_VULKAN=ON \
  -DWHISPER_SDL2=ON \
  -DWHISPER_BUILD_EXAMPLES=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target whisper-stream

This produces whisper-stream plus the shared libraries it needs (libwhisper.so.1, libggml.so, libggml-base.so, libggml-cpu.so, libggml-vulkan.so). Put them somewhere and set LD_LIBRARY_PATH accordingly.
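One tidy way to do the "somewhere" is a tiny launcher script. This sketch assumes the binary and every lib*.so* from the build tree have been copied into ~/.local/opt/whisper-stream (both paths are my choice):

```shell
# Launcher wrapper: export LD_LIBRARY_PATH so the Vulkan-enabled
# libwhisper/libggml libraries are found, then exec the real binary.
# Assumes the binary and its lib*.so* files live in ~/.local/opt/whisper-stream.
mkdir -p "$HOME/.local/bin"
cat > "$HOME/.local/bin/whisper-stream" <<'EOF'
#!/bin/sh
LIBDIR="$HOME/.local/opt/whisper-stream"
export LD_LIBRARY_PATH="$LIBDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
exec "$LIBDIR/whisper-stream" "$@"
EOF
chmod +x "$HOME/.local/bin/whisper-stream"
```

With ~/.local/bin on PATH, the wrapper is then invoked like the plain binary.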

What I’d Do Differently

  1. Skip the ibus-stt patching entirely — modifying system shared libraries was a waste of time. Go straight to standalone whisper-stream.
  2. Start with VAD mode — --step N produces ANSI-escaped terminal output unsuitable for piping. Don’t even try to parse it.
  3. Test wtype on GNOME early — would have discovered the Wayland protocol incompatibility before writing the pipeline around it.

Open Questions

  • Is there a way to get whisper-stream to output only new text instead of the full sliding window? This would eliminate the need for diffing in the wrapper script.
  • Could a quantized model (e.g., medium Q5) give similar accuracy with faster load times?
  • ydotool type injects characters very fast — is there a benefit to throttling it to a more natural typing speed?