Building On-Device AI: How We Built a Real-Time Translator with Whisper
A deep dive into how we built Interpreter — a speech-to-speech translation app that runs entirely on-device using Whisper and Piper AI models, with sub-5-second latency.
Building AI that runs on a phone — without any server calls — is one of the most interesting engineering challenges in mobile development today. This is the story of how we built Interpreter, our real-time voice translation app.
The Problem
Most translation apps rely on cloud APIs. This creates three problems:
- Latency — every translation round-trips to a server
- Privacy — your voice data leaves your device
- Offline — no internet, no translation
We wanted to solve all three.
The Stack
We chose Kotlin Multiplatform (KMP) for Android and iOS code sharing, with the following AI components:
- Whisper via sherpa-onnx for speech-to-text
- Google ML Kit Translate for on-device translation (40+ language pairs)
- Piper TTS (also via sherpa-onnx) for text-to-speech
The key insight: all three components can run fully on-device.
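To make the three-stage flow concrete, here is a minimal sketch of how the stages could be chained in shared Kotlin code. The interface and class names (`SpeechToText`, `Translator`, `TextToSpeech`, `InterpreterPipeline`) are hypothetical and not taken from the Interpreter codebase, and the stub implementations stand in for the real sherpa-onnx and ML Kit bindings, whose APIs are deliberately not reproduced here.

```kotlin
// Hypothetical stage interfaces. In a KMP project these could live in shared
// code, with platform-specific implementations wrapping the actual engines.
interface SpeechToText {
    fun transcribe(pcm: FloatArray, sampleRate: Int): String
}

interface Translator {
    fun translate(text: String, source: String, target: String): String
}

interface TextToSpeech {
    fun synthesize(text: String): FloatArray
}

// Chains the three stages: audio in, translated audio out.
class InterpreterPipeline(
    private val stt: SpeechToText,
    private val translator: Translator,
    private val tts: TextToSpeech,
) {
    fun run(pcm: FloatArray, sampleRate: Int, source: String, target: String): FloatArray {
        val transcript = stt.transcribe(pcm, sampleRate)
        // Skip translation and synthesis when nothing was recognized.
        if (transcript.isBlank()) return FloatArray(0)
        val translated = translator.translate(transcript, source, target)
        return tts.synthesize(translated)
    }
}

// Stubs for illustration only; they do no real inference.
class FakeStt : SpeechToText {
    override fun transcribe(pcm: FloatArray, sampleRate: Int): String =
        if (pcm.isEmpty()) "" else "hello"
}

class FakeTranslator : Translator {
    override fun translate(text: String, source: String, target: String): String =
        "[$source->$target] $text"
}

class FakeTts : TextToSpeech {
    // Returns one silent sample per character, just to produce some output.
    override fun synthesize(text: String): FloatArray = FloatArray(text.length)
}

fun main() {
    val pipeline = InterpreterPipeline(FakeStt(), FakeTranslator(), FakeTts())
    val audio = pipeline.run(FloatArray(16_000), 16_000, "en", "es")
    println("Synthesized ${audio.size} samples")
}
```

Keeping each stage behind a small interface is one way to let the pipeline logic stay platform-independent while the engine bindings differ per platform.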
Performance Results
After switching from whisper-small to a quantized whisper-base model, we measured the following:
| Stage | Latency |
|---|---|
| Speech-to-text | ~2 s |
| Translation | ~100 ms |
| Text-to-speech | ~1.5 s |
| End-to-end | ~4 s |
That is sub-5-second end-to-end latency on a mid-range Android device.
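As a quick sanity check on the budget, the three per-stage figures can be summed and compared against the 5-second target. The sketch below uses the approximate numbers reported above; the `StageLatency` type and `endToEndMillis` function are hypothetical names for illustration. The stages add up to about 3.6 s, so the ~4 s end-to-end figure presumably also covers overhead that is not itemized here, such as audio capture and playback (an assumption, not something we measured separately in this post).

```kotlin
// Hypothetical helper type for a per-stage latency measurement.
data class StageLatency(val stage: String, val millis: Long)

// Sums the per-stage latencies into a single end-to-end estimate.
fun endToEndMillis(stages: List<StageLatency>): Long = stages.sumOf { it.millis }

fun main() {
    // Approximate figures from the results reported above.
    val stages = listOf(
        StageLatency("STT", 2_000),
        StageLatency("Translation", 100),
        StageLatency("TTS", 1_500),
    )
    val total = endToEndMillis(stages)
    val budget = 5_000L
    println("total=$total ms, headroom=${budget - total} ms")
    // prints "total=3600 ms, headroom=1400 ms"
}
```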
Lessons Learned
Model size and quantization matter. Moving from whisper-small to a quantized whisper-base reduced model file size by 60% with minimal accuracy loss for supported languages.
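For readers new to quantization, the idea can be sketched with symmetric int8 quantization of a weight array: each 32-bit float is stored as one signed byte plus a shared scale factor. This is a generic illustration with hypothetical names (`QuantizedTensor`, `quantizeInt8`, `dequantize`), not the scheme used to produce the Whisper models in Interpreter.

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

// Hypothetical container: int8 values plus the scale needed to recover floats.
class QuantizedTensor(val values: ByteArray, val scale: Float)

// Symmetric per-tensor quantization: maps [-maxAbs, maxAbs] onto [-127, 127].
fun quantizeInt8(weights: FloatArray): QuantizedTensor {
    val maxAbs = weights.maxOfOrNull { abs(it) } ?: 0f
    // Avoid division by zero for an empty or all-zero tensor.
    val scale = if (maxAbs == 0f) 1f else maxAbs / 127f
    val q = ByteArray(weights.size) { i ->
        (weights[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return QuantizedTensor(q, scale)
}

// Recovers approximate float weights from the quantized form.
fun dequantize(t: QuantizedTensor): FloatArray =
    FloatArray(t.values.size) { i -> t.values[i].toInt() * t.scale }

fun main() {
    val weights = floatArrayOf(-1.0f, -0.5f, 0f, 0.25f, 1.0f)
    val q = quantizeInt8(weights)
    val restored = dequantize(q)
    // Each weight now takes 1 byte instead of 4, at the cost of rounding error.
    val maxError = weights.indices.maxOf { abs(weights[it] - restored[it]) }
    println("bytes: ${weights.size * 4} -> ${q.values.size}, max error: $maxError")
}
```

The rounding error per weight is bounded by half the scale, which is why accuracy loss tends to stay small when the weight range is not dominated by a few outliers.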
On-device ≠ slow. Modern mobile NPUs are surprisingly capable. The key is choosing the right model architecture for the task.
KMP is production-ready. Sharing business logic across Android and iOS via Kotlin Multiplatform significantly reduced duplicated code.
We’re continuing to improve Interpreter. If you’re building something similar or want to discuss on-device AI architectures, get in touch.