Real-Time Text-to-Speech on RP2040

ECE 4760 Final Project, Fall 2025

Team: Yayun Zhao (yz3545), Qingyin Zhong (qz425), Hongming Yang (hy665)

Project Summary: This project demonstrates a real-time text-to-speech (TTS) system running entirely on the RP2040 microcontroller. The system converts input text into intelligible speech audio with low latency, leveraging dual-core parallelism and efficient embedded audio processing. The design is tailored for resource-constrained environments, making it ideal for embedded, IoT, and audio applications.

  • End-to-end TTS: text input → speech output
  • Dual-core (RP2040) real-time audio synthesis
  • Multiple voice profiles, adjustable pitch, speed, and timbre
  • SD card dictionary and phoneme storage
  • DAC output over the hardware SPI peripheral (MCP4822)

What you'll learn: How to architect and optimize a real-time TTS pipeline on embedded hardware, and the trade-offs between algorithmic complexity, memory, and audio quality.

System Overview

  • Input: ASCII text (via serial or SD card)
  • Processing: Text parsing → phoneme mapping → audio synthesis
  • Output: Real-time audio via an SPI DAC (MCP4822) driving a speaker
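
As a rough illustration of the output stage, the sketch below packs one signed 16-bit PCM sample into the 16-bit command word the MCP4822 expects (a 4-bit configuration nibble followed by a 12-bit sample). The function name, the choice of channel A, and the 1x gain setting are assumptions made for this example and are not taken from the project code.

```c
#include <stdint.h>

/* MCP4822 16-bit write command layout (per the Microchip datasheet):
 *   bit 15    : channel select (0 = DAC A, 1 = DAC B)
 *   bit 14    : don't care
 *   bit 13    : gain select (1 = 1x, 0 = 2x)
 *   bit 12    : shutdown control (1 = output active)
 *   bits 11-0 : 12-bit sample
 */
#define DAC_CHANNEL_B 0x8000u
#define DAC_GAIN_1X   0x2000u
#define DAC_ACTIVE    0x1000u

/* Hypothetical helper: convert a signed 16-bit PCM sample into a command
 * word for channel A at 1x gain (both settings are assumptions). */
uint16_t mcp4822_word_from_pcm16(int16_t pcm)
{
    /* Shift the signed range [-32768, 32767] up to [0, 65535],
     * then keep the top 12 bits for the 12-bit DAC. */
    uint16_t offset = (uint16_t)((int32_t)pcm + 32768);
    uint16_t code12 = (uint16_t)(offset >> 4);
    return (uint16_t)(DAC_GAIN_1X | DAC_ACTIVE | code12);
}
```

On the RP2040 the resulting word would typically be sent with the Pico SDK's `spi_write16_blocking` once the SPI peripheral is set to 16-bit frames; that wiring is omitted here so the sketch stays hardware-independent.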

The system supports three implementation pathways, each with distinct trade-offs in latency, memory, and audio quality.
[System Block Diagram Placeholder]

What you'll learn: The overall data flow and hardware-software co-design for embedded TTS.

Implementation Pathways Comparison

Pathway A: Real-Time Synthesis

Method:
  • Core Usage: Dual-Core
  • Vowels: Synthesized via Formant synthesis (F1/F2/F3 adjustments).
  • Consonants: Approximated with random white noise (e.g., S, T, K, F, Sh).
  • Voiced Consonants: Simplified synthesis using a source-filter model.
Pros:
  • No PCM storage needed (only dictionary).
  • Minimal storage footprint.
  • Fully editable.
Cons:
  • Poorest audio quality, sounds robotic.
  • Rough consonant sounds.
  • High computational load.

Pathway B: Hybrid Synthesis

Method:
  • Core Usage: Single-Core + Interrupt
  • Vowels: Real-time Formant synthesis (tunable timbre).
  • Consonants/Voiced: Read pre-generated PCM from SD card.
  • Splicing: Vowel loudness is reduced for smooth transitions.
Pros:
  • Very low latency (small read volume).
  • Better audio quality than Pathway A.
  • Retains timbre adjustment capability.
Cons:
  • Vowels can still sound slightly robotic.
  • Requires SD card file management.

Pathway C: Sample-Based TTS

Method:
  • Core Usage: Single-Core + Interrupt
  • All phonemes are pre-generated as 16-bit mono PCM files on SD card.
  • "Sentence Preloading": Analyzes text to pre-load only necessary phonemes, ensuring no playback delay.
Pros:
  • Best audio quality, sounds natural.
  • Smooth splicing.
Cons:
  • Large storage requirement.
  • Cannot pre-buffer all phonemes.
  • Requires complex preloading logic.
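
The "Sentence Preloading" step of Pathway C can be pictured as a deduplication pass over the sentence's phoneme sequence: each distinct phoneme is fetched from the SD card once, before playback starts. The sketch below is a minimal version of that idea. The numeric phoneme IDs, the 64-entry limit, and the function name are assumptions for illustration only; the actual project may organize this differently.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PHONEMES 64 /* assumed upper bound on distinct phoneme IDs */

/* Hypothetical planner: write each phoneme ID that occurs in seq to
 * load_list exactly once, in order of first use. load_list must have room
 * for MAX_PHONEMES entries. Returns the number of IDs written. */
size_t plan_preload(const uint8_t *seq, size_t seq_len, uint8_t *load_list)
{
    uint64_t seen = 0; /* bit i is set once phoneme i has been scheduled */
    size_t count = 0;

    for (size_t i = 0; i < seq_len; i++) {
        uint8_t id = seq[i];
        if (id >= MAX_PHONEMES) {
            continue; /* ignore IDs outside the assumed range */
        }
        uint64_t bit = (uint64_t)1 << id;
        if ((seen & bit) == 0) {
            seen |= bit;
            load_list[count++] = id;
        }
    }
    return count;
}
```

For a sentence whose phoneme IDs are 3, 7, 3, 12, 7, 3, the planner schedules three SD card reads (3, 7, 12) instead of six, which is the saving that makes preloading worthwhile when RAM cannot hold every phoneme.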

Key takeaway: Pathway A is ideal for extreme resource constraints. Pathway B offers a good balance of quality and performance. Pathway C provides the best audio quality at the cost of storage and complexity.

Q&A Section