
dictée

Speaking is just easier.

Speak freely, type instantly — 100% local voice dictation for Linux with 25+ languages, 5 translation backends, speaker diarization, and real-time visual feedback. Text appears right where your cursor is.


📚 New: the full dictee wiki is now online — 24 pages covering installation, configuration, all 4 ASR backends (with Parakeet-TDT and Canary-1B deep-dives), post-processing, diarization, troubleshooting, and developer guide. Available in 🇬🇧 English and 🇫🇷 French.

What is dictee? · Quick start · Features · Installation · Configuration · Usage · Post-processing · Limitations · Roadmap · Wiki


What is dictee?

dictee is a complete voice dictation system for Linux. Press a shortcut, speak, and the text is typed directly into the active application — any application, any window, any text field.

Transcription is performed 100% locally by default: no audio ever leaves your machine unless you explicitly choose a cloud translation backend.

  • 🔒 100% local by default — Parakeet, Canary, faster-whisper and Vosk all run offline on your hardware
  • 🌍 25+ languages — with native punctuation and capitalization (Parakeet-TDT)
  • 🔀 4 ASR backends — switch instantly depending on language, latency and hardware
  • 🎨 Visual feedback — KDE Plasma widget, system tray, or fullscreen animation

Quick start

Three steps to go from zero to dictation in under two minutes:

1. Install

curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash

2. Configure

The first-run wizard walks you through backend selection, model download and keyboard shortcut binding. Re-run anytime with dictee --setup.

First-run setup wizard

3. Speak

Press your shortcut (default F9), speak, release. The transcription appears at your cursor.

Plasmoid widget recording

For detailed install paths (manual .deb/.rpm, GPU prerequisites, AUR, from source), see Installation below or the wiki's Installation and GPU-Setup pages.


Features

4 ASR backends

| Backend | Languages | Model size | Warm latency | Notes |
|---|---|---|---|---|
| Parakeet-TDT 0.6B v3 | 25 | ~2.5 GB | ~0.8s CPU · ~0.16s GPU | Default, native punctuation |
| Canary-1B v2 | 25 | ~5 GB | ~0.7s GPU | Built-in translation (25 ↔ EN, 48 pairs) |
| faster-whisper | 99 | ~500 MB–3 GB | ~0.3s | Wide language coverage |
| Vosk | 20+ | ~50 MB | ~1.5s | Lightweight, strict offline |

Each backend runs as a systemd user service with the same Unix socket protocol — switching is transparent. → ASR-Backends wiki
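As a sketch of what one such unit might look like (the unit name, binary path, and socket location below are assumptions for illustration, not the files dictee actually ships):

```ini
; ~/.config/systemd/user/dictee-asr.service  (hypothetical example)
[Unit]
Description=dictee ASR backend (illustrative unit)

[Service]
; Every backend speaks the same protocol on a per-user Unix socket,
; so swapping ExecStart swaps the engine without touching the client.
ExecStart=/usr/bin/dictee-asr-backend --socket %t/dictee/asr.sock
Restart=on-failure

[Install]
WantedBy=default.target
```

Because the socket path and wire protocol stay identical across backends, the dictation client never needs to know which engine is answering.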

5 translation backends

| Backend | Privacy | Speed | Quality | Languages |
|---|---|---|---|---|
| Canary-1B | 🔒 Local | Built-in | Excellent | 4 |
| LibreTranslate | 🔒 Local | 0.1–0.3s | Good | 30+ |
| Ollama | 🔒 Local | 2–3s | Excellent | Any (LLM) |
| Google Translate | 🌐 Cloud | 0.2–0.7s | Excellent | 130+ |
| Bing Translator | 🌐 Cloud | 1.7–2.2s | Very good | 100+ |

Translation wiki · Ollama-Setup

Post-processing pipeline

A 12-step configurable pipeline transforms raw ASR output before it hits your cursor:

  • Regex rules + dictionary — 7 languages, ASR variants, voice commands → Rules-and-Dictionary
  • LLM correction — optional fluency polish via local Ollama (first / last / hybrid position) → LLM-Correction
  • Numbers & dates — cardinal, ordinal, versions, decimals, French times → Numbers-Dates-Continuation
  • Continuation buffer — continue a sentence across dictations with last-word memory
  • Short-text keepcaps — per-language exceptions for acronyms and names (new in v1.3)

Post-Processing-Overview
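To make the dictionary and voice-command steps concrete, here is a stand-alone sed sketch of the kind of rewriting they perform. The rules below are invented examples for illustration only, not the shipped rule set; dictee applies its rules internally, not via sed:

```shell
#!/bin/sh
# Invented examples: a dictionary entry (proper-name casing) and
# a voice-command rule (spoken "comma" becomes punctuation).
raw="send it to john smith comma then sign off"

out=$(printf '%s' "$raw" \
  | sed -E 's/\bjohn smith\b/John Smith/' \
  | sed -E 's/ comma/,/')

echo "$out"
# -> send it to John Smith, then sign off
```

Real rules live in per-language files and can be tested interactively with dictee-test-rules.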

Speaker diarization (Meetings)

Answer "who spoke when?" in multi-speaker recordings via NVIDIA's Sortformer model. Up to 4 speakers, ideal for meeting notes and interviews. Triggered via Meeting mode or dictee --meeting. → Diarization wiki
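The exact output format is documented in the Diarization wiki; conceptually, a diarized transcript tags each segment with a speaker label, along these lines (formatting illustrative, not the literal output):

```text
[Speaker 1] Let's start with the action items from last week.
[Speaker 2] Sure. The packaging work is done, CI is still pending.
[Speaker 1] Okay, let's keep CI for the next sprint.
```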

Speaker diarization output

Speaker diarization — speaker labels

3 visual interfaces

  • KDE Plasma 6 widget — native QML plasmoid, 5 animation styles, live state → Plasmoid-Widget
  • System tray icon — PyQt6, works on GNOME/XFCE/Sway (AppIndicator fallback) → Tray-Icon
  • animation-speech (external) — fullscreen overlay on wlr-layer-shell compositors

All three share state via a filesystem watcher — any change is reflected instantly across interfaces (multi-user safe with UID suffix).
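As a rough sketch of the mechanism (the directory and file name below are assumptions for illustration, not dictee's real state files):

```shell
#!/bin/sh
# Hypothetical per-user state file; the UID suffix keeps concurrent
# logged-in users from clobbering each other's state.
state_file="${XDG_RUNTIME_DIR:-/tmp}/dictee-state-$(id -u)"

# Writing a new state string is all an interface has to do...
echo "recording" > "$state_file"

# ...and every other interface, watching the file (e.g. via inotify),
# re-renders as soon as the content changes.
cat "$state_file"
```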

KDE Plasma plasmoid

System tray menu

animation-speech (fullscreen overlay)

animation-speech is a standalone project that provides a fullscreen visual animation during recording, with cancellation via the Escape key. It works on any Wayland compositor supporting wlr-layer-shell (KDE Plasma, Sway, Hyprland…).

animation-speech demo — click to watch on YouTube

sudo dpkg -i animation-speech_1.2.0_all.deb

Download: animation-speech releases

Note: animation-speech is not compatible with GNOME (no wlr-layer-shell support). GNOME users can rely on dictee-tray for visual feedback. Contributions for a GNOME Shell extension are welcome — see the plasmoid source for reference architecture.


Installation

One-liner (recommended)

Auto-detects distro and GPU, adds the NVIDIA CUDA repo if needed, installs the right package:

curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash

Supported: Ubuntu, Debian, Fedora, openSUSE, Arch Linux. Other distros fall back to the tarball path.

Options (after --):

# Force CPU (skip GPU detection)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --cpu

# Force GPU (CUDA)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --gpu

# Pin a specific version
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --version 1.3.0

# Non-interactive
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --non-interactive

Manual install

Download from Releases.

Ubuntu / Debian (CPU):

sudo apt install ./dictee-cpu_1.3.0_amd64.deb

Ubuntu / Debian (GPU): requires the NVIDIA CUDA APT repo — see GPU-Setup for the one-time setup, then:

sudo apt install ./dictee-cuda_1.3.0_amd64.deb

Fedora / openSUSE (CPU):

sudo dnf install ./dictee-cpu-1.3.0-1.x86_64.rpm

Fedora / openSUSE (GPU): add the CUDA repo first (see GPU-Setup), then dictee-cuda-1.3.0-1.x86_64.rpm.

Arch Linux (AUR): a PKGBUILD is provided in the repo root (x86_64 + aarch64). Clone the repo and run makepkg -si.

aarch64 / Jetson: no pre-built package — build from source. CUDA limited to NVIDIA Jetson boards.

Other distros (tarball):

tar xzf dictee-1.3.0_amd64.tar.gz
cd dictee-1.3.0
sudo ./install.sh

From source: cargo build --release --features sortformer then sudo ./install.sh. See Developer-Guide for full Cargo features and package build scripts.


Configuration

First launch triggers a setup wizard (backend, model, shortcuts).

First-run setup wizard

Reconfigure anytime from the application menu, tray icon, Plasma widget, or by running:

dictee --setup

Full configuration panel

Backend switching (one-liner)

# Show current backends
dictee-switch-backend status

# Switch ASR (parakeet · canary · whisper · vosk)
dictee-switch-backend asr canary

# Switch translation (canary · libretranslate · ollama · google · bing)
dictee-switch-backend translate ollama

The tray and plasmoid include backend sub-menus — no terminal required.

For detailed configuration (all ASR backends, translation matrix, plasmoid settings, keyboard shortcuts on tiling WMs), see the wiki:


Usage

# Simple dictation — transcribe and type
dictee

# Dictate + translate (default: system language → English)
dictee --translate
dictee --translate --ollama            # 100% local via Ollama

# Change target language
DICTEE_LANG_TARGET=es dictee --translate   # → Spanish

# Meeting mode (diarization, up to 4 speakers)
dictee --meeting

# Cancel ongoing dictation
dictee --cancel

# Test post-processing rules live
dictee-test-rules                       # interactive
dictee-test-rules --loop                # continuous loop
dictee-test-rules --wav file.wav        # from audio file

→ Full command reference: CLI-Reference wiki


Post-processing

dictee runs a configurable 12-step pipeline after transcription and before paste:

  1. ASR variants normalization
  2. Dictionary substitution
  3. Numbers & dates conversion
  4. Continuation buffer merge
  5. Regex rules (pre-LLM)
  6. LLM correction (optional, first position)
  7. Regex rules (post-LLM)
  8. Short-text exceptions (keepcaps)
  9. Extended match mode
  10. Final capitalization
  11. Translation (optional)
  12. Paste / inject

Configure via dictee --setup → Post-processing tab, or test rules live with dictee-test-rules.
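As an illustration of what steps 3 (numbers & dates) and 10 (final capitalization) do to a raw transcript, with sed standing in for dictee's internal rules (the substitutions are invented examples):

```shell
#!/bin/sh
# Step 3: spoken numbers and version strings become digits.
# Step 10: the sentence start is capitalized.
raw="version two point five ships in ten days"

out=$(printf '%s' "$raw" \
  | sed -E 's/\btwo point five\b/2.5/' \
  | sed -E 's/\bten\b/10/' \
  | sed -E 's/^version/Version/')

echo "$out"
# -> Version 2.5 ships in 10 days
```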

Regex rules editor

Regex rules with integrated test panel

→ Deep dives: Post-Processing-Overview · Rules-and-Dictionary · LLM-Correction · Numbers-Dates-Continuation


Known limitations

  • Diarization + Parakeet on an 8 GB GPU is capped around 10–15 min of audio. Parakeet-TDT loads the full mel-spectrogram in one pass (~185 MB VRAM per minute, so roughly 2.8 GB at the 15-minute mark, on top of the model weights), which overflows consumer GPUs past ~15 min. Workarounds: split the file, disable diarization, or use the CPU backend. Auto-chunking is planned for the v1.3 final release. → Diarization wiki
  • AMD / Intel GPUs are not currently supported — dictee falls back to CPU.
  • No real-time streaming — Parakeet-TDT and Canary require the full utterance; only Nemotron (EN-only, via Rust binary) streams natively.

For bug reports and workarounds, see Troubleshooting.


Roadmap

v1.3.0 (current) — Short-text keepcaps exceptions (7 languages), extended match mode, LibreTranslate purge models, continuation + translate fixes, version-number dictation, multi-user safe (UID suffix on state files), plasmoid cross-process toggles (LLM / Short / Meeting), 682 postprocess tests + 148 pipeline tests, theme-aware banner.

v1.4+ (planned)

  • Chunked diarization — process files > 15 min via transcribe-diarize-batch (prototype validated: 54 min in 122 s)
  • Hotword boosting — bias ASR decoding toward custom names (shallow fusion on TDT logits, Parakeet only)
  • Whisper translate — multi-target translation via task="translate" (EN-only, offline)
  • Moonshine CPU backend
  • CLI speech-to-text — pipe audio, get text
  • VAD — hands-free dictation without push-to-talk
  • Streaming transcription with live text display
  • Built-in overlay — replace external animation-speech
  • AppImage / Flatpak packaging
  • COSMIC / GNOME Shell applets (contributions welcome!)

→ Full history: Changelog wiki


Credits

The transcription engine builds on parakeet-rs by Enes Altun — Rust library for NVIDIA Parakeet inference via ONNX Runtime. The Rust Canary implementation was originally ported from onnx-asr by Ivan Stupakov and is now fully self-contained. Parakeet and Canary ONNX models are provided by NVIDIA (downloaded separately from HuggingFace, not redistributed by this project).

Keyboard input simulation uses dotool by geb (GPL-3.0).

License

This project is distributed under the GPL-3.0-or-later license (see LICENSE).

The original parakeet-rs code by Enes Altun is under the MIT license (see LICENSE-MIT). dotool is bundled under GPL-3.0.
