eSpeak Formant Synthesizer

Last update : November 2, 2014

eSpeak

eSpeak is a compact multi-platform multi-language open source speech synthesizer using a formant synthesis method.

eSpeak is derived from the “Speak” speech synthesizer for British English for Acorn RISC OS computers, developed by Jonathan Duddington in 1995. He is still the author of the current eSpeak version 1.48.12, released on November 1, 2014. The sources are available on SourceForge.

eSpeak provides two methods of formant synthesis : the original eSpeak synthesizer and a Klatt synthesizer. It can also be used as a front end for MBROLA diphone voices. eSpeak can be used as a command-line program or as a shared library. On Windows, a SAPI5 version is also installed. eSpeak supports SSML (Speech Synthesis Markup Language) and uses an ASCII representation of phoneme names which is loosely based on the Kirshenbaum system.

In formant synthesis, voiced speech (vowels and sonorant consonants) is created using formants. Unvoiced consonants are created using pre-recorded sounds. Voiced consonants are created as a mixture of a formant-based voiced sound and a pre-recorded unvoiced sound. The eSpeakEditor can generate formant files for individual vowels and voiced consonants, based on a sequence of keyframes which define how the formant peaks (peaks in the frequency spectrum) vary during the sound. A sequence of formant frames can be created with a modified version of Praat, a free scientific software package for the analysis of speech in phonetics. The Praat formant frames, saved in a spectrum.dat file, can be converted to formant keyframes with eSpeakEdit.

To use eSpeak on the command line, type

espeak "Hello world"

Many command-line options are available, for instance to load the text from a file, to adjust the volume, the pitch, the speed or the gaps between words, or to select a voice or a language.
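The following invocations illustrate some of these options (a sketch, assuming espeak is on the PATH; file names are placeholders) :

```shell
# Read the text to speak from a file
espeak -f input.txt

# Amplitude (0-200), pitch (0-99) and speed in words per minute
espeak -a 150 -p 60 -s 120 "Hello world"

# Add a gap between words (units of 10 ms)
espeak -g 5 "Hello world"

# Select the French voice and write the result to a WAV file
espeak -v fr -w bonjour.wav "Bonjour le monde"
```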

To use the MBROLA voices in the Windows SAPI5 GUI or at the command line, they have to be installed during the setup of the program. The setup can be rerun to add additional voices. To list the available voices, type

espeak --voices

eSpeak uses a master phoneme file containing the utility phonemes, the consonants and a schwa. The file is named phonemes (without extension) and is located in the espeak/phsource program folder. The vowels are defined in language-specific phoneme files in text format; these files can also redefine consonants if needed. The language-specific phoneme text files are located in the same espeak/phsource folder and must be referenced in the phonemes master file (see the example for Luxembourgish).

....
phonemetable lb base
include ph_luxembourgish

In addition to the specific phoneme file ph_luxembourgish (without extension), the following files are required to add a new language, e.g. Luxembourgish :

lb file (without extension) in the folder espeak/espeak-data/voices : a text file which in its simplest form contains only 2 lines :

name luxembourgish
language lb

lb_rules file (without extension) in the folder espeak/dictsource : a text file which contains the spelling-to-phoneme translation rules.
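The rules file groups its entries by the initial letters of the match. The extract below is a hypothetical illustration of the general format described in the eSpeak dictionary documentation (pre-condition, match, post-condition, phonemes) and is not taken from an actual Luxembourgish rules file :

```
// rules are grouped by the first letter(s) of the match
.group a
       a        a      // the letter "a" is spoken as phoneme [a]
       au       aU     // "au" becomes the diphthong [aU]
.group s
       sch      S      // "sch" becomes [S] (the "sh" sound)
```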

lb_list file (without extension) in the folder espeak/dictsource : a text file which contains pronunciations for special words (numbers, symbols, names, …).
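An lb_list file simply pairs a word with its pronunciation in eSpeak's phoneme mnemonics. The entries below are hypothetical illustrations of the format, not actual Luxembourgish data :

```
// word    pronunciation (eSpeak phoneme mnemonics)
moien      m'OI@n
ok         'oUk'eI
```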

The eSpeakEditor (espeakedit.exe) compiles the lb_ files into an lb_dict file (without extension) in the folder espeak/espeak-data and adds the new phonemes to the files phontab, phonindex and phondata in the same folder. These compiled files are used by eSpeak for the speech synthesis. The file phondata-manifest lists the type of data that has been compiled into the phondata file. The files dict_log and dict_phonemes provide information about the phonemes used in the lb_rules and lb_dict files.

eSpeak applies tunes to model intonation depending on punctuation (questions, statements, attitudes, interaction). The tunes (s.. = full stop, c.. = comma, q.. = question, e.. = exclamation) used for a language can be specified with a tunes statement in the voice file.

tunes s1  c1  q1a  e1

The named tunes are defined in the text file espeak/phsource/intonation (without extension) and must be compiled for use by eSpeak with the espeakedit.exe program (menu : Compile intonation data).

meSpeak.js

Three years ago, Matthew Temple ported the eSpeak program from C++ to JavaScript using Emscripten : speak.js. Based on this JavaScript project, Norbert Landsteiner from Austria created the meSpeak.js text-to-speech web library. The latest version is 1.9.6, released in February 2014.

meSpeak.js is supported by most browsers. It introduces loadable voice modules. The typical usage of the meSpeak.js library is shown below :

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Bonjour le monde</title>
  <script type="text/javascript" src="mespeak.js"></script>
  <script type="text/javascript">
    meSpeak.loadConfig("mespeak_config.json");
    meSpeak.loadVoice("voices/fr.json");
    function speakIt() {
      meSpeak.speak("Bonjour le monde");
    }
  </script>
</head>
<body>
  <h1>Try meSpeak.js</h1>
  <button onclick="speakIt();">Speak It</button>
</body>
</html>


The mespeak_config.json file contains the data of the phontab, phonindex, phondata and intonations files as well as the default configuration values (amplitude, pitch, …). This data is encoded as a base64 octet stream. The voice .json file includes the id of the voice, the dictionary used and the corresponding binary data (base64 encoded) of these two files. Various desktop and online Base64 encoders and decoders are available on the net to create the required .json files (base64decode.org, motobit.com, activexdev.com, …).

meSpeak can mix multiple parts (different languages or voices) in a single utterance. meSpeak supports the Web Audio API (AudioContext) for playing the internally generated wav files; Flash is used as a fallback.
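A sketch of such a mixed utterance, to be run in a browser page that has loaded mespeak.js and its config; the speakMultipart() call and the voice ids shown are assumptions to be checked against the meSpeak.js documentation :

```javascript
// assumes mespeak.js and mespeak_config.json have already been loaded
meSpeak.loadVoice("voices/en/en.json");
meSpeak.loadVoice("voices/fr.json");

// speak several parts with different voices as one utterance
meSpeak.speakMultipart([
  { text: "Hello world.", voice: "en/en" },
  { text: "Bonjour le monde.", voice: "fr", pitch: 60 }
]);
```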


Google Text-to-Speech (TTS) support

Last update : April 30, 2011

On November 16, 2009, Google announced on their official blog that English text-to-speech was added to the translation tools. Google used eSpeak, an open source speech synthesizer, for this service.

In May 2010, Google Translate added audio translations for more languages, including Afrikaans, Albanian, Catalan, Chinese (Mandarin), Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Haitian Creole, Hindi, Hungarian, Icelandic, Indonesian, Italian, Latvian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Spanish, Swahili, Swedish, Turkish, Vietnamese and Welsh.

The speech audio is in MP3 format and is queried via a simple HTTP GET (REST) request. For English, an example URL is:

http://translate.google.com/translate_tts?tl=en&q=how are you?

The TTS web service restricts the text to 100 characters, and the service returns 404 (Not Found) if the request includes a Referer header.
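Such a request can be sketched with curl, which URL-encodes the query parameters and sends no Referer header by default (the file name is a placeholder) :

```shell
# Fetch the spoken text as MP3 (text must stay under 100 characters)
curl --get \
     --data-urlencode "tl=en" \
     --data-urlencode "q=how are you?" \
     -o howareyou.mp3 \
     "http://translate.google.com/translate_tts"
```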

On December 3, 2010, Google acquired Phonetic Arts, a company specialised in speech synthesis. Phonetic Arts Limited delivers technology that generates natural, expressive speech; its products include Phonetic Morpher, Phonetic LipSync and Phonetic Synthesizer. Phonetic Arts, formerly known as Tayvin 356 Limited, was founded in 2006 and is based in Cambridge, UK. The Phonetic Arts technology generates natural computer speech from small samples of recorded voice and should improve the voice output quality of Google's text-to-speech applications.

Google provides not only speech output tools, but also speech input tools (Voice Search, Voice Input, Voice Actions), mainly in relation to the mobile phone OS Android.

Version 11 of the Google Chrome browser includes the HTML5 Speech Input API.

An amusing application of the Google TTS system is the Google Translate Beatbox.