Andrew Usher

Give Your Web App a Voice

11 min read

Introduction

What if your website could literally talk to your users? Not through pre-recorded audio files or complicated audio libraries, but using the voices already built into their browser?

The Web Speech API makes this possible. With just a few lines of JavaScript, you can transform any text into spoken words, customize the voice, pitch, and speed, and even track which word is being spoken in real-time. Try it out:

[Interactive demo: type some text and hear your browser speak it]

Pretty cool, right? This technology opens up incredible possibilities: accessibility tools that read content aloud, language learning apps with pronunciation practice, interactive storytelling with character voices, and creative experiments you haven’t even imagined yet.

In this article, we’ll explore the Speech Synthesis API from the ground up. We’ll start with the basics, progressively build up to advanced patterns, and create plenty of interactive demos you can play with along the way. By the end, you’ll have all the tools you need to give your web apps a voice.

Browser Compatibility Note: The Speech Synthesis API is supported in modern browsers (Chrome, Safari, Edge, Firefox), but voice availability and behavior can vary significantly across platforms. iOS Safari, for example, has more limited voice options than desktop Chrome. We’ll explore these differences throughout this article.

Your First Words: The Basics

The Speech Synthesis API consists of two main pieces: the speechSynthesis object (the controller) and SpeechSynthesisUtterance (the thing being spoken). Think of it like a music player: speechSynthesis is the play/pause/stop controls, while SpeechSynthesisUtterance is the track you want to play.

Here’s the absolute simplest example:

// Your browser's built-in text-to-speech
const utterance = new SpeechSynthesisUtterance("Hello, world!");
speechSynthesis.speak(utterance);

That’s it! Just two lines of code. Let’s break down what’s happening:

  1. Create an utterance: new SpeechSynthesisUtterance("Hello, world!") creates a speech request containing the text you want spoken.
  2. Speak it: speechSynthesis.speak(utterance) adds your utterance to the speech queue and starts speaking it.

The speech synthesis system uses a queue, which means if you call speak() multiple times, each utterance will be spoken in order. Think of it like a playlist - one finishes, then the next begins.
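
For example, queueing a few utterances in a row plays them back to back:

// Each call to speak() appends to the queue; utterances play in order
speechSynthesis.speak(new SpeechSynthesisUtterance("First."));
speechSynthesis.speak(new SpeechSynthesisUtterance("Second."));
speechSynthesis.speak(new SpeechSynthesisUtterance("Third, once the others finish."));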

[Interactive playground: edit the text and hear how it sounds]

Go ahead, modify the text in the playground above and hear how it sounds. The default voice varies by platform, but we’ll learn how to choose specific voices later.

Customizing the Voice: Parameters

The default voice is fine, but what if you want to make speech faster, slower, higher, or lower? The SpeechSynthesisUtterance object has three properties you can adjust to customize how the speech sounds.

Pitch

Controls how high or low the voice sounds. The value ranges from 0 to 2, with 1 being the default.

  • 0.5 = Deep, low-pitched voice
  • 1.0 = Normal pitch (default)
  • 1.5 = Higher-pitched, squeakier voice

Rate

Controls the speed of speech. Values range from 0.1 to 10, though most useful values are between 0.5 and 2.

  • 0.5 = Half speed (good for language learning)
  • 1.0 = Normal speed (default)
  • 1.5 = 1.5x speed (like a podcast on fast-forward)

Volume

Controls how loud the voice is. Values range from 0 (silent) to 1 (full volume).

  • 0 = Silent
  • 0.5 = Half volume
  • 1.0 = Full volume (default)

Play with these parameters in the interactive demo below. Notice how different combinations can create wildly different effects:

[Interactive demo: adjust pitch, rate, and volume and hear the result]

Using Parameters in Code

Here’s how you set these parameters in vanilla JavaScript:

const utterance = new SpeechSynthesisUtterance("I sound different now!");
utterance.pitch = 1.5;  // Higher pitched
utterance.rate = 0.8;   // Slower
utterance.volume = 0.9; // Slightly quieter
speechSynthesis.speak(utterance);

And here’s a React component that lets users control these parameters:

import { useState } from 'react';

function SpeechDemo() {
  const [pitch, setPitch] = useState(1);
  const [rate, setRate] = useState(1);
  const [volume, setVolume] = useState(1);

  const speak = (text: string) => {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.pitch = pitch;
    utterance.rate = rate;
    utterance.volume = volume;
    speechSynthesis.speak(utterance);
  };

  // UI controls here...
}

Browser Note: Most browsers support the full range of pitch and rate values, but some mobile browsers may clamp these values more aggressively. Safari on iOS, for example, limits how high or low the pitch can go compared to desktop Chrome.
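
If you need consistent behavior across devices, one option is to clamp values to a conservative range before speaking. Here’s a small helper sketch (the exact “safe” ranges here are an assumption, not documented limits):

// Clamp parameters to ranges that behave consistently on most platforms
function safeUtterance(text, { pitch = 1, rate = 1, volume = 1 } = {}) {
  const clamp = (value, min, max) => Math.min(max, Math.max(min, value));
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.pitch = clamp(pitch, 0.5, 1.5);  // conservative subset of 0-2
  utterance.rate = clamp(rate, 0.5, 2);      // conservative subset of 0.1-10
  utterance.volume = clamp(volume, 0, 1);
  return utterance;
}

speechSynthesis.speak(safeUtterance("I sound the same everywhere", { rate: 5 }));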

Voice Selection

So far we’ve been using the browser’s default voice. But most browsers actually provide multiple voices across different languages. Some sound robotic, some sound surprisingly natural, and the selection varies wildly depending on your operating system and browser.

You can get all available voices using speechSynthesis.getVoices():

// Get all available voices
const voices = speechSynthesis.getVoices();

voices.forEach(voice => {
  console.log(voice.name, voice.lang);
});

Try browsing through all the voices available on your device. The number and quality will vary - I get 80+ voices on macOS Chrome, but only a handful on iOS Safari:

[Interactive demo: browse the voices available on your device]

Each SpeechSynthesisVoice object has several properties:

  • name: The voice’s name (e.g., “Alex”, “Samantha”, “Google US English”)
  • lang: The language code (e.g., “en-US”, “es-ES”, “ja-JP”)
  • localService: Boolean indicating if the voice is on-device (true) or requires network (false)
  • default: Boolean indicating if this is the system’s default voice
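
For example, you can filter the list to pick a voice for a particular language, preferring on-device voices so speech keeps working offline. A sketch (the available voices differ per platform):

// Pick a voice for a language, preferring one that doesn't need the network
function pickVoice(lang = 'en-US') {
  const voices = speechSynthesis.getVoices();
  const candidates = voices.filter(v => v.lang === lang);
  return candidates.find(v => v.localService) || candidates[0] || null;
}

const utterance = new SpeechSynthesisUtterance("Testing a specific voice");
utterance.voice = pickVoice(); // null falls back to the browser default
speechSynthesis.speak(utterance);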

The voiceschanged Gotcha

Here’s something that trips up a lot of developers: voices load asynchronously in most browsers. If you try to get voices immediately when your page loads, you’ll often get an empty array:

// Wrong: voices might not be loaded yet!
const voices = speechSynthesis.getVoices(); // Often returns []
console.log(voices.length); // 0 😢

The solution is to wait for the voiceschanged event:

// Right: wait for the event
speechSynthesis.addEventListener('voiceschanged', () => {
  const voices = speechSynthesis.getVoices();
  console.log('Voices loaded:', voices.length); // 80 🎉
});

In React, you’d typically handle this in a useEffect:

useEffect(() => {
  const loadVoices = () => {
    const availableVoices = speechSynthesis.getVoices();
    setVoices(availableVoices);
  };

  // Load voices immediately (works in some browsers)
  loadVoices();

  // Also listen for the voiceschanged event (required in others)
  speechSynthesis.addEventListener('voiceschanged', loadVoices);

  return () => {
    speechSynthesis.removeEventListener('voiceschanged', loadVoices);
  };
}, []);

This approach covers both bases: it tries to load voices immediately (works in Firefox and Safari) and also listens for the event (required in Chrome and Edge).
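
Outside React, you can wrap the same both-bases logic in a promise. A minimal sketch:

// Resolve with voices immediately if available, otherwise wait for the event
function getVoicesAsync() {
  return new Promise((resolve) => {
    const voices = speechSynthesis.getVoices();
    if (voices.length > 0) {
      resolve(voices);
      return;
    }
    speechSynthesis.addEventListener(
      'voiceschanged',
      () => resolve(speechSynthesis.getVoices()),
      { once: true }
    );
  });
}

getVoicesAsync().then(voices => console.log('Voices loaded:', voices.length));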

Lifecycle Events: Controlling Playback

Speech isn’t just fire-and-forget. The SpeechSynthesisUtterance object fires events throughout its lifecycle, allowing you to track when speech starts, ends, encounters errors, or even which word is currently being spoken.

The Event Lifecycle

Each utterance can emit several events:

  • onstart - Fired when speech begins
  • onend - Fired when speech completes
  • onerror - Fired if something goes wrong
  • onpause - Fired when speech is paused
  • onresume - Fired when speech resumes after pause
  • onboundary - Fired at word/sentence boundaries (not supported in all browsers)

Here’s a basic example in vanilla JavaScript:

const utterance = new SpeechSynthesisUtterance("Track my lifecycle!");

utterance.onstart = () => console.log('Started speaking');
utterance.onend = () => console.log('Finished speaking');
utterance.onerror = (event) => console.error('Error:', event.error);

speechSynthesis.speak(utterance);

Interactive Demo

Try the interactive demo below. Click “Speak” and watch the event log populate in real-time. Try pausing and resuming to see how those events fire:

[Interactive demo: speak, pause, and resume while watching the event log]

The onboundary event is particularly interesting - it fires at word boundaries, giving you the character index and length of each word. You can use this to highlight words as they’re spoken, create karaoke-style effects, or track reading progress. Unfortunately, not all browsers support it (Firefox and Safari notably don’t).
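
Here’s a sketch of that pattern, logging each word as it’s spoken. Note that charLength is missing in some implementations, so this falls back to searching for the next space:

const text = "Watch each word get logged as it is spoken.";
const utterance = new SpeechSynthesisUtterance(text);

utterance.onboundary = (event) => {
  if (event.name !== 'word') return; // some browsers also fire sentence boundaries
  const start = event.charIndex;
  let end = event.charLength ? start + event.charLength : text.indexOf(' ', start);
  if (end === -1) end = text.length;
  console.log('Speaking:', text.slice(start, end));
};

speechSynthesis.speak(utterance);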

Building a Reusable Hook

Rather than wiring up all these events every time, let’s create a reusable React hook. This is exactly what all the interactive demos in this article use:

// useSpeechSynthesis.ts
import { useCallback, useEffect, useState } from 'react';

interface SpeakOptions {
  voice?: SpeechSynthesisVoice;
  pitch?: number;
  rate?: number;
  volume?: number;
  onStart?: () => void;
  onEnd?: () => void;
  onError?: (event: SpeechSynthesisErrorEvent) => void;
}

export function useSpeechSynthesis() {
  const [voices, setVoices] = useState<SpeechSynthesisVoice[]>([]);
  const [speaking, setSpeaking] = useState(false);
  const [paused, setPaused] = useState(false);

  // Load voices (handling the async gotcha)
  useEffect(() => {
    const loadVoices = () => {
      setVoices(speechSynthesis.getVoices());
    };
    loadVoices();
    speechSynthesis.addEventListener('voiceschanged', loadVoices);
    return () => {
      speechSynthesis.removeEventListener('voiceschanged', loadVoices);
    };
  }, []);

  const speak = useCallback((text: string, options: SpeakOptions = {}) => {
    speechSynthesis.cancel(); // Clear the queue before speaking new text
    const utterance = new SpeechSynthesisUtterance(text);

    // Apply options (checking against undefined so 0 is a valid volume)
    if (options.voice) utterance.voice = options.voice;
    if (options.pitch !== undefined) utterance.pitch = options.pitch;
    if (options.rate !== undefined) utterance.rate = options.rate;
    if (options.volume !== undefined) utterance.volume = options.volume;

    // Attach event handlers
    utterance.onstart = () => {
      setSpeaking(true);
      options.onStart?.();
    };
    utterance.onend = () => {
      setSpeaking(false);
      setPaused(false);
      options.onEnd?.();
    };
    utterance.onerror = (event) => {
      console.error('Speech error:', event);
      setSpeaking(false);
      setPaused(false);
      options.onError?.(event);
    };
    utterance.onpause = () => setPaused(true);
    utterance.onresume = () => setPaused(false);

    speechSynthesis.speak(utterance);
  }, []);

  const pause = useCallback(() => speechSynthesis.pause(), []);
  const resume = useCallback(() => speechSynthesis.resume(), []);

  const cancel = useCallback(() => {
    speechSynthesis.cancel();
    setSpeaking(false);
    setPaused(false);
  }, []);

  return { speak, cancel, pause, resume, voices, speaking, paused };
}

Now using speech synthesis becomes much simpler:

// In your component
const { speak, voices, speaking, cancel } = useSpeechSynthesis();

// Speak with custom options
speak("Hello!", {
  voice: voices[0],
  pitch: 1.2,
  rate: 1.0,
  onEnd: () => console.log('Done!')
});

This hook handles voice loading, state management, and provides a clean API for all our needs. All the interactive demos in this article use this same hook - we’re not reinventing the wheel for each one!

Creative Use Cases

Now that you understand the fundamentals, let’s explore some creative applications. The Speech Synthesis API opens up possibilities that go far beyond simple text-to-speech.

Interactive Storytelling

Imagine a choose-your-own-adventure story where different characters have distinct voices. By switching between voices and adjusting parameters, you can create immersive, dynamic narratives:

// Character voice switching example (voice names vary by platform)
const voices = speechSynthesis.getVoices(); // assumes voices have already loaded
const narrator = voices.find(v => v.name.includes('Alex'));
const character = voices.find(v => v.name.includes('Samantha'));

function speakDialogue(text, isNarrator) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.voice = isNarrator ? narrator : character;
  utterance.pitch = isNarrator ? 1.0 : 1.3;
  speechSynthesis.speak(utterance);
}

// Usage
speakDialogue("Once upon a time...", true);
speakDialogue("Help! A dragon!", false);

You could take this further by:

  • Using the onboundary event to highlight text as it’s spoken
  • Synchronizing animations with speech events
  • Letting users skip ahead by canceling the current utterance (see the sketch after this list)
  • Creating voice-activated choices using the Speech Recognition API
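
Here’s what the skip-ahead idea might look like, reusing speakDialogue from above (skipButton is a hypothetical button element):

const script = [
  { text: "Once upon a time...", isNarrator: true },
  { text: "Help! A dragon!", isNarrator: false },
  { text: "And so the quest began.", isNarrator: true },
];
let lineIndex = 0;

// Hypothetical skip control: cancel whatever is speaking, jump to the next line
skipButton.addEventListener('click', () => {
  if (lineIndex >= script.length) return;
  speechSynthesis.cancel(); // stops the current utterance immediately
  const { text, isNarrator } = script[lineIndex++];
  speakDialogue(text, isNarrator);
});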

Language Learning

The Speech Synthesis API is perfect for language learning applications. By controlling the rate and selecting native voices for different languages, you can create pronunciation practice tools:

// Language learning helper (assumes voices have already loaded)
const voices = speechSynthesis.getVoices();

function pronunciationPractice(word, language = 'es-ES') {
  const voice = voices.find(v => v.lang === language);

  // Slow version for learning
  const slow = new SpeechSynthesisUtterance(word);
  slow.voice = voice;
  slow.rate = 0.6;

  // Normal speed version
  const normal = new SpeechSynthesisUtterance(word);
  normal.voice = voice;
  normal.rate = 1.0;

  // Speak slow first, then normal (attach onend before speaking)
  slow.onend = () => speechSynthesis.speak(normal);
  speechSynthesis.speak(slow);
}

// Try it
pronunciationPractice("¡Hola! ¿Cómo estás?", "es-ES");

This pattern works great for:

  • Vocabulary flashcards with audio
  • Accent comparison (compare English voice saying Spanish words vs native Spanish voice)
  • Pronunciation drills that repeat words at adjustable speeds
  • Interactive lessons that respond to user progress

Creative & Experimental

Speech synthesis can be an artistic medium. By randomizing parameters and using the event system creatively, you can create generative audio art:

// Generative poetry reader with random parameters
function readPoetically(text) {
  const lines = text.split('\n');

  lines.forEach((line, i) => {
    const utterance = new SpeechSynthesisUtterance(line);

    // Random voice parameters for artistic effect
    utterance.pitch = 0.8 + Math.random() * 0.8;  // 0.8-1.6
    utterance.rate = 0.7 + Math.random() * 0.6;   // 0.7-1.3

    // Add delay between lines
    setTimeout(() => {
      speechSynthesis.speak(utterance);
    }, i * 2000);
  });
}

// Read a poem with varying voice characteristics
const poem = `Roses are red
Violets are blue
This poem sounds weird
Because pitch is askew`;

readPoetically(poem);

Other creative ideas:

  • Voice-based games: Speak clues in a mystery game, or have enemies taunt players
  • Data sonification: “Speak” numbers from charts to make data more accessible
  • Generative music: Use speech as a rhythmic or melodic element
  • Interactive art: Create installations that respond to user input with synthesized speech

The key is experimentation. Try combining speech with other web APIs - like the Web Audio API for effects, Canvas for visualizations, or Gamepad API for voice-controlled games.

Browser Compatibility & Gotchas

The Speech Synthesis API is widely supported, but with significant differences in implementation quality and available features. Let’s dig into the details so you know what to expect.

Support Matrix

Here’s a breakdown of feature support across major browsers:

Feature               | Chrome | Firefox | Safari     | Edge | iOS Safari      | Android Chrome
Basic synthesis       | ✅     | ✅      | ✅         | ✅   | ✅              | ✅
Voice selection       | ✅     | ✅      | ⚠️ Limited  | ✅   | ⚠️ Very limited | ✅
Full pitch/rate range | ✅     | ✅      | ✅         | ✅   | ⚠️ Clamped      | ⚠️ Clamped
onboundary event      | ✅     | ❌      | ❌         | ✅   | ❌              | ✅
pause()/resume()      | ✅     | ✅      | ✅         | ✅   | ⚠️ Buggy        | ✅

✅ = Fully supported  ⚠️ = Partially supported or has quirks  ❌ = Not supported

Platform-Specific Quirks

iOS Safari

iOS Safari has the most limitations:

  • Very few voices: Often just 2-3 voices available (compared to 80+ on desktop)
  • Requires user interaction: Speech won’t work until the user has interacted with the page (click, tap, etc.)
  • No background playback: Speech stops when the app is backgrounded
  • Parameter clamping: Pitch and rate values are more restricted than on desktop
  • Pause/resume issues: The pause() and resume() methods can be unreliable

Because of the interaction requirement, always trigger speech from a user event:

// iOS-friendly pattern: trigger speech from a user event
button.addEventListener('click', () => {
  const utterance = new SpeechSynthesisUtterance("Hello!");
  speechSynthesis.speak(utterance);
});

Android

Android’s implementation varies based on the system TTS engine:

  • System-dependent voices: Voice quality and selection depend on what TTS engines the user has installed
  • Google voices are common: Most Android devices have Google TTS pre-installed
  • Generally good support: Most features work as expected on modern Android versions

Desktop Browsers

Desktop browsers generally have the best support:

  • Chrome/Edge: Excellent support, extensive voice libraries (Google voices + system voices)
  • Firefox: Good support, but lacks onboundary event
  • Safari: Good support, but limited to system voices (high quality, but fewer options)

Common Gotchas

1. Voice Loading Timing

We covered this earlier, but it’s worth repeating: voices load asynchronously in most browsers. Always use the voiceschanged event:

speechSynthesis.addEventListener('voiceschanged', () => {
  const voices = speechSynthesis.getVoices();
  // Now you can use voices
});

2. Queue Behavior

The speech queue can sometimes get stuck, especially when rapidly calling speak() multiple times. Always cancel before speaking new text:

// Clear the queue if things get stuck
speechSynthesis.cancel();

// Then speak your new text
const utterance = new SpeechSynthesisUtterance("New text");
speechSynthesis.speak(utterance);

3. User Interaction Requirements

Many mobile browsers (especially iOS Safari) require user interaction before allowing speech synthesis. This is similar to autoplay restrictions for video and audio:

// This might not work on page load
speechSynthesis.speak(new SpeechSynthesisUtterance("Hello!"));

// This will work after a user click
button.addEventListener('click', () => {
  speechSynthesis.speak(new SpeechSynthesisUtterance("Hello!"));
});

4. Long Text Truncation

Some browsers (notably Chrome on some platforms) may cut off text after ~200-300 characters. The workaround is to chunk your text:

// Workaround: chunk long text at word boundaries
function speakLongText(text) {
  // Up to ~200 characters per chunk, breaking at whitespace so words stay intact
  const chunks = text.match(/.{1,200}(?:\s|$)/g) || [text];

  chunks.forEach((chunk, i) => {
    const utterance = new SpeechSynthesisUtterance(chunk);
    if (i === chunks.length - 1) {
      utterance.onend = () => console.log('Complete!');
    }
    speechSynthesis.speak(utterance);
  });
}

5. Rate Limits

Some browsers may rate-limit or restrict speech synthesis if called too frequently. Be mindful of how often you’re triggering speech, especially in response to user input.
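
One simple guard is to debounce speech that’s triggered by rapid input. A sketch (input is a hypothetical text field):

// Debounce: only speak after the user has stopped typing for 300ms
let speakTimer;

function speakDebounced(text, delay = 300) {
  clearTimeout(speakTimer);
  speakTimer = setTimeout(() => {
    speechSynthesis.cancel(); // replace anything still speaking
    speechSynthesis.speak(new SpeechSynthesisUtterance(text));
  }, delay);
}

input.addEventListener('input', () => speakDebounced(input.value));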

Feature Detection & Fallbacks

Always check for browser support before using the API:

if ('speechSynthesis' in window) {
  // Use speech synthesis
  const utterance = new SpeechSynthesisUtterance(text);
  speechSynthesis.speak(utterance);
} else {
  // Fallback: show text in a modal, use audio files, etc.
  showTextFallback(text);
}

You can also check for specific features:

// Check if onboundary is supported (a rough check: some browsers
// define the property but never actually fire the event)
const utterance = new SpeechSynthesisUtterance();
const hasBoundary = 'onboundary' in utterance;

if (hasBoundary) {
  // Use word highlighting features
} else {
  // Skip word-by-word tracking
}

The key takeaway: test your implementation across different platforms, especially if you’re targeting mobile users. What works perfectly on desktop Chrome might need adjustments for iOS Safari.

Wrapping Up

We’ve covered a lot of ground, from the basics of creating your first utterance to advanced patterns with event handling, voice selection, and creative applications. Here’s what we explored:

  • The fundamentals: How speechSynthesis and SpeechSynthesisUtterance work together
  • Customization: Adjusting pitch, rate, and volume to create different effects
  • Voice selection: Browsing and choosing from available voices (and handling the async loading quirk)
  • Events: Tracking speech lifecycle with onstart, onend, onboundary, and other events
  • Creative applications: Interactive storytelling, language learning, and experimental uses
  • Browser quirks: Platform-specific limitations and workarounds

The Speech Synthesis API is just one half of the Web Speech API. The other half is the Speech Recognition API, which does the opposite - it listens to spoken words and converts them to text. Combine both, and you can create fully voice-interactive applications.
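
As a taste of what that looks like, here’s a minimal echo sketch combining the two (the Speech Recognition API is prefixed in Chromium-based browsers and not available everywhere):

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (SpeechRecognition) {
  const recognition = new SpeechRecognition();
  recognition.onresult = (event) => {
    const heard = event.results[0][0].transcript;
    speechSynthesis.speak(new SpeechSynthesisUtterance(`You said: ${heard}`));
  };
  recognition.start(); // prompts for microphone permission
}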

This technology is mature enough for production use, but remember to:

  • Test across multiple browsers and devices
  • Provide fallbacks for unsupported browsers
  • Consider accessibility implications
  • Respect user preferences (some users may find unexpected speech jarring)

If you liked this article and think others should read it, please share it on Twitter!