
dictée

Speaking is just easier.

Speak freely, type instantly — 100% local voice dictation for Linux with 25+ languages, 5 translation backends, speaker diarization, and real-time visual feedback. Text appears right where your cursor is.


📚 New: the full dictee wiki is now online — 24 pages covering installation, configuration, all 4 ASR backends (with Parakeet-TDT and Canary-1B deep-dives), post-processing, diarization, troubleshooting, and developer guide. Available in 🇬🇧 English and 🇫🇷 French.

What is dictee? · Quick start · Features · Installation · Configuration · Usage · Post-processing · Limitations · Roadmap · Wiki


What is dictee?

dictee is a complete voice dictation system for Linux. Press a shortcut, speak, and the text is typed directly into the active application — any application, any window, any text field.

Transcription is performed 100% locally by default: no audio ever leaves your machine unless you explicitly choose a cloud translation backend.

  • 🔒 100% local by default — Parakeet, Canary, faster-whisper and Vosk all run offline on your hardware
  • 🌍 25+ languages — with native punctuation and capitalization (Parakeet-TDT)
  • 🔀 4 ASR backends — switch instantly depending on language, latency and hardware
  • 🎨 Visual feedback — KDE Plasma widget, system tray, or fullscreen animation

Quick start

Three steps to go from zero to dictation in under two minutes:

1. Install

curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash

2. Configure

The first-run wizard walks you through backend selection, model download and keyboard shortcut binding. Re-run anytime with dictee --setup.

First-run setup wizard

3. Speak

Press your shortcut (default F9), speak, release. The transcription appears at your cursor.

Plasmoid widget recording

For detailed install paths (manual .deb/.rpm, GPU prerequisites, AUR, from source), see Installation below or the wiki's Installation and GPU-Setup pages.


Features

4 ASR backends

| Backend | Languages | Model size | Warm latency | Notes |
|---|---|---|---|---|
| Parakeet-TDT 0.6B v3 | 25 | ~2.5 GB | ~0.8s CPU · ~0.16s GPU | Default, native punctuation |
| Canary-1B v2 | 25 | ~5 GB | ~0.7s GPU | Built-in translation (25 ↔ EN, 48 pairs) |
| faster-whisper | 99 | ~500 MB–3 GB | ~0.3s | Wide language coverage |
| Vosk | 20+ | ~50 MB | ~1.5s | Lightweight, strict offline |

Each backend runs as a systemd user service with the same Unix socket protocol — switching is transparent. → ASR-Backends wiki
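As a sketch of what one such unit might look like (the unit name, binary path, and socket location below are assumptions for illustration, not the files dictee actually ships):

```ini
; ~/.config/systemd/user/dictee-asr.service  (hypothetical example)
[Unit]
Description=dictee ASR backend (illustrative unit)

[Service]
; Every backend speaks the same protocol on a per-user Unix socket,
; so swapping ExecStart swaps the engine without touching the client.
ExecStart=/usr/bin/dictee-asr-backend --socket %t/dictee/asr.sock
Restart=on-failure

[Install]
WantedBy=default.target
```

Because the socket path and wire protocol stay identical across backends, the dictation client never needs to know which engine is answering.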

5 translation backends

| Backend | Privacy | Speed | Quality | Languages |
|---|---|---|---|---|
| Canary-1B | 🔒 Local | Built-in | Excellent | 4 |
| LibreTranslate | 🔒 Local | 0.1–0.3s | Good | 30+ |
| Ollama | 🔒 Local | 2–3s | Excellent | Any (LLM) |
| Google Translate | 🌐 Cloud | 0.2–0.7s | Excellent | 130+ |
| Bing Translator | 🌐 Cloud | 1.7–2.2s | Very good | 100+ |

Translation wiki · Ollama-Setup

Post-processing pipeline

A 12-step configurable pipeline transforms raw ASR output before it hits your cursor:

  • Regex rules + dictionary — 7 languages, ASR variants, voice commands → Rules-and-Dictionary
  • LLM correction — optional fluency polish via local Ollama (first / last / hybrid position) → LLM-Correction
  • Numbers & dates — cardinal, ordinal, versions, decimals, French times → Numbers-Dates-Continuation
  • Continuation buffer — continue a sentence across dictations with last-word memory
  • Short-text keepcaps — per-language exceptions for acronyms and names (new in v1.3)

Post-Processing-Overview
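To make the dictionary and voice-command steps concrete, here is a stand-alone sed sketch of the kind of rewriting they perform. The rules below are invented examples for illustration only, not the shipped rule set; dictee applies its rules internally, not via sed:

```shell
#!/bin/sh
# Invented examples: a dictionary entry (proper-name casing) and
# a voice-command rule (spoken "comma" becomes punctuation).
raw="send it to john smith comma then sign off"

out=$(printf '%s' "$raw" \
  | sed -E 's/\bjohn smith\b/John Smith/' \
  | sed -E 's/ comma/,/')

echo "$out"
# -> send it to John Smith, then sign off
```

Real rules live in per-language files and can be tested interactively with dictee-test-rules.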

Speaker diarization (Meetings)

Answer "who spoke when?" in multi-speaker recordings via NVIDIA's Sortformer model. Up to 4 speakers, ideal for meeting notes and interviews. Triggered via Meeting mode or dictee --meeting. → Diarization wiki
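The exact output format is documented in the Diarization wiki; conceptually, a diarized transcript tags each segment with a speaker label, along these lines (formatting illustrative, not the literal output):

```text
[Speaker 1] Let's start with the action items from last week.
[Speaker 2] Sure. The packaging work is done, CI is still pending.
[Speaker 1] Okay, let's keep CI for the next sprint.
```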

Speaker diarization output

Speaker diarization — speaker labels

3 visual interfaces

  • KDE Plasma 6 widget — native QML plasmoid, 5 animation styles, live state → Plasmoid-Widget
  • System tray icon — PyQt6, works on GNOME/XFCE/Sway (AppIndicator fallback) → Tray-Icon
  • animation-speech (external) — fullscreen overlay on wlr-layer-shell compositors

All three share state via a filesystem watcher — any change is reflected instantly across interfaces (multi-user safe with UID suffix).
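As a rough sketch of the mechanism (the directory and file name below are assumptions for illustration, not dictee's real state files):

```shell
#!/bin/sh
# Hypothetical per-user state file; the UID suffix keeps concurrent
# logged-in users from clobbering each other's state.
state_file="${XDG_RUNTIME_DIR:-/tmp}/dictee-state-$(id -u)"

# Writing a new state string is all an interface has to do...
echo "recording" > "$state_file"

# ...and every other interface, watching the file (e.g. via inotify),
# re-renders as soon as the content changes.
cat "$state_file"
```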

KDE Plasma plasmoid

System tray menu

animation-speech (fullscreen overlay)

animation-speech is a standalone project that provides a fullscreen visual animation during recording, with cancellation via the Escape key. It works on any Wayland compositor supporting wlr-layer-shell (KDE Plasma, Sway, Hyprland…).

animation-speech demo — click to watch on YouTube

sudo dpkg -i animation-speech_1.2.0_all.deb

Download: animation-speech releases

Note: animation-speech is not compatible with GNOME (no wlr-layer-shell support). GNOME users can rely on dictee-tray for visual feedback. Contributions for a GNOME Shell extension are welcome — see the plasmoid source for reference architecture.


Installation

One-liner (recommended)

Auto-detects distro and GPU, adds the NVIDIA CUDA repo if needed, installs the right package:

curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash

Supported: Ubuntu, Debian, Fedora, openSUSE, Arch Linux. Other distros fall back to the tarball path.

Options (after --):

# Force CPU (skip GPU detection)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --cpu

# Force GPU (CUDA)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --gpu

# Pin a specific version
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --version 1.3.0

# Non-interactive
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --non-interactive

Manual install

Download from Releases.

Ubuntu / Debian (CPU):

sudo apt install ./dictee-cpu_1.3.0_amd64.deb

Ubuntu / Debian (GPU): requires the NVIDIA CUDA APT repo — see GPU-Setup for the one-time setup, then:

sudo apt install ./dictee-cuda_1.3.0_amd64.deb

Fedora / openSUSE (CPU):

sudo dnf install ./dictee-cpu-1.3.0-1.x86_64.rpm

Fedora / openSUSE (GPU): add the CUDA repo first (see GPU-Setup), then dictee-cuda-1.3.0-1.x86_64.rpm.

Arch Linux (AUR): a PKGBUILD is provided in the repo root (x86_64 + aarch64). Clone the repo and run makepkg -si.

aarch64 / Jetson: no pre-built package — build from source. CUDA limited to NVIDIA Jetson boards.

Other distros (tarball):

tar xzf dictee-1.3.0_amd64.tar.gz
cd dictee-1.3.0
sudo ./install.sh

From source: cargo build --release --features sortformer then sudo ./install.sh. See Developer-Guide for full Cargo features and package build scripts.


Configuration

First launch triggers a setup wizard (backend, model, shortcuts).

First-run setup wizard

Reconfigure anytime from the application menu, tray icon, Plasma widget, or by running:

dictee --setup

Full configuration panel

Backend switching (one-liner)

# Show current backends
dictee-switch-backend status

# Switch ASR (parakeet · canary · whisper · vosk)
dictee-switch-backend asr canary

# Switch translation (canary · libretranslate · ollama · google · bing)
dictee-switch-backend translate ollama

The tray and plasmoid include backend sub-menus — no terminal required.

For detailed configuration (all ASR backends, translation matrix, plasmoid settings, keyboard shortcuts on tiling WMs), see the wiki:


Usage

# Simple dictation — transcribe and type
dictee

# Dictate + translate (default: system language → English)
dictee --translate
dictee --translate --ollama            # 100% local via Ollama

# Change target language
DICTEE_LANG_TARGET=es dictee --translate   # → Spanish

# Meeting mode (diarization, up to 4 speakers)
dictee --meeting

# Cancel ongoing dictation
dictee --cancel

# Test post-processing rules live
dictee-test-rules                       # interactive
dictee-test-rules --loop                # continuous loop
dictee-test-rules --wav file.wav        # from audio file

→ Full command reference: CLI-Reference wiki


Post-processing

dictee runs a configurable 12-step pipeline after transcription and before paste:

  1. ASR variants normalization
  2. Dictionary substitution
  3. Numbers & dates conversion
  4. Continuation buffer merge
  5. Regex rules (pre-LLM)
  6. LLM correction (optional, first position)
  7. Regex rules (post-LLM)
  8. Short-text exceptions (keepcaps)
  9. Extended match mode
  10. Final capitalization
  11. Translation (optional)
  12. Paste / inject

Configure via dictee --setup → Post-processing tab, or test rules live with dictee-test-rules.
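As an illustration of what steps 3 (numbers & dates) and 10 (final capitalization) do to a raw transcript, with sed standing in for dictee's internal rules (the substitutions are invented examples):

```shell
#!/bin/sh
# Step 3: spoken numbers and version strings become digits.
# Step 10: the sentence start is capitalized.
raw="version two point five ships in ten days"

out=$(printf '%s' "$raw" \
  | sed -E 's/\btwo point five\b/2.5/' \
  | sed -E 's/\bten\b/10/' \
  | sed -E 's/^version/Version/')

echo "$out"
# -> Version 2.5 ships in 10 days
```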

Regex rules editor

Regex rules with integrated test panel

→ Deep dives: Post-Processing-Overview · Rules-and-Dictionary · LLM-Correction · Numbers-Dates-Continuation


Known limitations

  • Diarization + Parakeet on an 8 GB GPU is capped around 10–15 min of audio. Parakeet-TDT loads the full mel-spectrogram in one pass (~185 MB VRAM per minute, so roughly 2.8 GB at the 15-minute mark, on top of the model weights), which overflows consumer GPUs past ~15 min. Workarounds: split the file, disable diarization, or use the CPU backend. Auto-chunking is planned for the v1.3 final release. → Diarization wiki
  • AMD / Intel GPUs are not currently supported — dictee falls back to CPU.
  • No real-time streaming — Parakeet-TDT and Canary require the full utterance; only Nemotron (EN-only, via Rust binary) streams natively.

For bug reports and workarounds, see Troubleshooting.


Roadmap

v1.3.0 (current) — Short-text keepcaps exceptions (7 languages), extended match mode, LibreTranslate purge models, continuation + translate fixes, version-number dictation, multi-user safe (UID suffix on state files), plasmoid cross-process toggles (LLM / Short / Meeting), 682 postprocess tests + 148 pipeline tests, theme-aware banner.

v1.4+ (planned)

  • Chunked diarization — process files > 15 min via transcribe-diarize-batch (prototype validated: 54 min in 122 s)
  • Hotword boosting — bias ASR decoding toward custom names (shallow fusion on TDT logits, Parakeet only)
  • Whisper translate — multi-target translation via task="translate" (EN-only, offline)
  • Moonshine CPU backend
  • CLI speech-to-text — pipe audio, get text
  • VAD — hands-free dictation without push-to-talk
  • Streaming transcription with live text display
  • Built-in overlay — replace external animation-speech
  • AppImage / Flatpak packaging
  • COSMIC / GNOME Shell applets (contributions welcome!)

→ Full history: Changelog wiki


Credits

The transcription engine builds on parakeet-rs by Enes Altun — Rust library for NVIDIA Parakeet inference via ONNX Runtime. The Rust Canary implementation was originally ported from onnx-asr by Ivan Stupakov and is now fully self-contained. Parakeet and Canary ONNX models are provided by NVIDIA (downloaded separately from HuggingFace, not redistributed by this project).

Keyboard input simulation uses dotool by geb (GPL-3.0).

License

This project is distributed under the GPL-3.0-or-later license (see LICENSE).

The original parakeet-rs code by Enes Altun is under the MIT license (see LICENSE-MIT). dotool is bundled under GPL-3.0.
