Installation de Festival TTS sur Debian

Last update : April 9, 2015

Suite à l’installation du système Festival TTS sur mon MacBook Air il y a trois mois, je viens de l’installer sur mon laptop avec système d’exploitation Debian 7 Linux. Les différentes archives du système Festival ont été téléchargées et décomprimées avec ARC.

Décompression d'un archive Festival

Décompression d’une archive Festival

J’ai suivi ensuite la même procédure que sur Mac OSX, à savoir

  • création d’un répertoire Festival-TTS sur le desktop avec les sous-répertoires festival, speech_tools et festvox
  • compilation des programmes dans l’ordre speech_tools, festival, festvox
  • installation des voix et dictionnaires dans les sous-répertoires lib/voices et lib/dicts du répertoire festival
Configuration du

Configuration du programme speech_tools

La compilation du programme speech_tools s’est arrêtée avec les messages d’erreur

/usr/bin/ld: cannot find -lcurses
/usr/bin/ld: cannot find -lncurses

L’installation de la bibliothèque libncurses5-dev a réglé ce problème. La suite de la compilation s’est passée sans autres erreurs, abstraction faite de plusieurs avertissements concernant des variables spécifiées, mais non utilisées .

Il a été possible de démarrer le programme Festival avec la commande

/Desktop/Festival-TTS/festival/bin/festival

mais la synthèse d’une phrase de test

festival> (SayText "Hello, how are you")

a produit l’erreur

Linux: can't open /dev/dsp

Parmi les remèdes trouvés sur le net, j’ai opté pour la solution

apt-get install oss-compat
modprobe snd-pcm-oss

qui a été couronnée de succès.

Il ne restait plus que la configuration des différents chemins d’accès pour mettre le système Festival tout à fait opérationnel. Les commandes suivantes ont été ajoutées au script ~/.bashrc :

FESTIVALDIR="/home/mbarnig/Desktop/Festival-TTS/festival"
FESTVOXDIR="/home/mbarnig/Desktop/Festival-TTS/festvox"
ESTDIR="/home/mbarnig/Desktop/Festival-TTS/speech_tools"
PATH="$FESTIVALDIR/bin:$PATH"
PATH="$ESTDIR/bin:$PATH"
export PATH
export FESTIVALDIR
export FESTVOXDIR
export ESTDIR
Pat

Variables d’environnement et PATH du système Festival

Ca marche!

Lancement

Lancement du programme Festival TTS sur Debian Linux Wheezy

Le chargement respectivement la compilation d’une nouvelle voix, comme le luxembourgeois, ne réussit que si la voix anglaise kal_diphon est présente, si non une erreur “unbound variable rfs_info” se produit.

Character encoding for Festival TTS files

Today, the dominant character encoding for the World Wide Web and for text files is UTF-8 (Universal Character Set + Transformation Format 8 bits). UTF-8 uses one to 4 bytes to encode one symbol (1,112,064 valid code points in the Unicode code space). The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.

The Festival TTS package doesn’t  support UTF-8. The development of Festival started about 20 years ago when UTF-8 was only known by a few people of the Open Group for Unix Systems. Festival only supports one byte character encoding. All files created with the Festvox tools to develop new languages or voices for Festival are in US-ASCII format.

Problems appear if we need to use non-ASCII characters in Festival, for example the characters é è ë à ä ö ü for the luxembourgish language. In UTF-8 these characters are encoded with two bytes (16 bits), which yields errors in Festival. There exist however a series of 8-bit character encoding standards defined by ECMA, IEC and ISO. These standards are known as ISO-8859-x, where x is one of 15 parts. The listed luxembourgish specific characters are included in the ISO-8859 parts 1, 2, 3, 4, 9, 10, 14, 15 and 16. The  preferred ISO-8859 standard for the luxembourgish language is part 15, which includes the euro sign and provides the coverage of all the french and german letters. ISO 8859-15 encodes what it refers to as Latin alphabet no. 9.

Most text editors and other tools used today to write scripts and program code use UTF-8 as default format. On Windows I use Notepad++ to edit my files. Changing the encoding is easy. On my Mac OSX 10.10.2 (Yosemite) I use TextEdit, Terminal and Xcode to edit my files for Festival TTS. I changed the following preferences to encode my scripts and programs in ISO-8859-15.

TextEdit

I first checked the list of the encoding formats to show in the selection menu of the preference window.

List of character encoding formats in TextEdit

List of character encoding formats in TextEdit

The Occidental (ISO Latin 9) format corresponds to the ISO-8859-15 standard.

TextEdit

Preferences selected in TextEdit

Terminal (bash)

Same procedure for the OSX terminal. I checked the list of the encoding formats to show in the selection menu of the preference window.

List of character encoding formats in the OSX Terminal

List of character encoding formats in the OSX Terminal

In the OSX terminal preference window I also selected the Occidental (ISO Latin 9) format corresponding to the ISO-8859-15 standard.

Terminal

Preferences selected for the OSX Terminal

XCode 6.2

An ISO-8859-15 encoded text file (for example created witt TextEdit) is not recognized as such by Xcode 6.2. Characters as “é à è ö ä ü” are displayed as “È ‡ Ë ˆ ‰ ¸ “. The indicated Xcode decoding is Western (Mac OS Roman), even if the default text encoding set in Preferences > Text Editing is set to Western (ISO Latin 9). The characters are displayed as expected if the text is reinterpreted with the File Inspector to ISO Latin 9. Files created with Xcode are encoded always in UTF-8, the default text encoding setting seems to take no effect.

The Xcode window to select potential encoding formats uses a different presentation, but displays the same formats as those in TextEdit or OSX Terminal.

Xcode

List of encoding formats in Xcode

Again I defined the default text encoding as Occidental (ISO Latin 9) alias ISO-8859-15.

Xcode

Preferences selected for Xcode

File conversion

To check the text encoding of a file we can use the command file

mbarnig$ file -I name.ext

Some examples are shown hereafter :

file

Show file encoding with the command $ file -I

To convert an UTF-8 file in the ISO-8859-15 format we can use the command iconv

mbarnig$ iconv -f utf8 -t ISO-8859-1 utf8.txt > iso.txt

An example is shown hereafter :

iconv

File format conversion with the command $ iconv

The encoding formats available in iconv are listed with the option -l or –list :

iconv

Encoding formats available in iconv

Environment Variables

To show the environment variables we can use the command locale :

mbarnig$ locale
locale

Environment variables in OSX locale

LANG=                         ; native language + local customization if no LC_ variables
LC_COLLATE=”C”       ; character collation
LC_CTYPE=”C”           ; character handling functions
LC_MESSAGES=”C”   ; affirmative and negative responses
LC_MONETARY=”C”   ; monetary related numeric formatting
LC_NUMERIC=”C”      ; numeric formatting
LC_TIME=”C”              ; date and timing formatting
LC_ALL=                     ; value to overwrite all LC_ variables

“C” stands for simplest locale (character set ASCII, single byte character encoding, language US-english, …)

To show all the available locales we use the option -a

mbarnig$ locale -a
locale -a

List of locales defined in OSX 10.10.2

To change the locales we use the export function :

mbarnig$ export LANG=fr_CH
mbarnig$ export LC_ALL=fr_CH.ISO8859-15
locate_iso

Modification of locales in OSX 10.10.2 with the export command

The export functions can be included in the /Users/mbarnig/.bash_profile file.

Links

Speech Utterance

Last update : April 2, 2015

Utterance Definition

In linguistics an utterance is a unit of speech, without having a precise definition. It’s a bit of spoken language. It could be anything from “Baf!” to a full sentence or a long speech. The corresponding unit in written language is text.

Phonetically an utterance is a unit of speech bounded (preceded and followed) by silence. Phonemes, phones, morphemes,  words etc are all considered items of an utterance.

In orthography, an utterance begins with a capital letter and ends in a period, question mark, or exclamation point.

In Speech Synthesis (TTS) the text that you wish to be spoken is contained within an utterance object (example : SpeechSynthesisUtterance). The Festival TTS system uses the utterance as the basic object for synthesis. Speech synthesis is the process that applies a set of programs to an utterance.

The main stages to convert textual input to speech output are :

  1. Conversion of the input text to tokens
  2. Conversion of tokens to words
  3. Conversion of words to strings of phonemes
  4. Addition of prosodic information
  5. Generation of a waveform

In Festival each stage is executed in several steps. The number of steps and what actually happens may vary and is dependent on the particular language and voice selected. Each of the steps is achieved by a Festival module which will typically add new information to the utterance structure. Swapping of modules is possible.

Festival provides six synthesizer modules :

  • 2 diphone engines : MBROLA and diphone
  • 2 unit selection engines : clunits and multisyn
  • 2 HMM engines : clustergen and HTS

Festival Utterance Architecture

A very simple utterance architecture is the string model where the high level items are replaced sequentially by lower level items, from tokens to phones. The disadvantage of this architecture is the loss of information about higher levels.

Another architecture is the multi-level table model with one hierarchy. The problem is that there are no explicit connections between levels.

Festival uses a Heterogeneous Relation Graph (HRG). This model is defined as follows :

  • Utterances consist of a set of items, representing things like tokens, words, phones,
  • Each item is related by one or more relations to other items.
  • Each item contains a set of features, having each a name and a value.
  • Relations define lists, trees or lattices of items.

The stages and steps to build an utterance in Festival, described in the following chapters, are related to the us-english language and to the clustergen voice cmu_us_slt_cg.

To explore the architecture (structure) of an utterance in Festival, I will analyse the relation-trees created by the synthesis of the text string “253”.

festival> (voice_cmu_us_slt_cg)
cmu_us_slt_cg
festival> (set! utter (SayText "253"))
#<Utterance 0x104c20720>
festival> (utt.relationnames utter)
(Token
 Word
 Phrase
 Syllable
 Segment
 SylStructure
 IntEvent
 Intonation
 Target
 HMMstate
 segstate
 mcep
 mcep_link
 Wave)
festival> (utt.relation_tree utter 'Token)
((("253"
   ((id "_1")
    (name "253")
    (whitespace "")
    (prepunctuation "")
    (token_pos "cardinal")))
  (("two"
    ((id "_2")
     (name "two")
     (pos_index 1)
     (pos_index_score 0)
     (pos "cd")
     (phr_pos "cd")
     (phrase_score -0.69302821)
     (pbreak_index 1)
     (pbreak_index_score 0)
     (pbreak "NB"))))
  (("hundred"
    ((id "_3")
     (name "hundred")
     (pos_index 1)
     (pos_index_score 0)
     (pos "cd")
     (phr_pos "cd")
     (phrase_score -0.692711)
     (pbreak_index 1)
     (pbreak_index_score 0)
     (pbreak "NB"))))
  (("fifty"
    ((id "_4")
     (name "fifty")
     (pos_index 8)
     (pos_index_score 0)
     (pos "nn")
     (phr_pos "n")
     (phrase_score -0.69282991)
     (pbreak_index 1)
     (pbreak_index_score 0)
     (pbreak "NB"))))
  (("three"
    ((id "_5")
     (name "three")
     (pos_index 1)
     (pos_index_score 0)
     (pos "cd")
     (phr_pos "cd")
     (pbreak_index 0)
     (pbreak_index_score 0)
     (pbreak "B")
     (blevel 3))))))
festival> (utt.relation_tree utter 'Word)
((("two"
   ((id "_2")
    (name "two")
    (pos_index 1)
    (pos_index_score 0)
    (pos "cd")
    (phr_pos "cd")
    (phrase_score -0.69302821)
    (pbreak_index 1)
    (pbreak_index_score 0)
    (pbreak "NB"))))
 (("hundred"
   ...
   ...
    (blevel 3)))))
festival> (utt.relation_tree utter 'Phrase)
((("B" ((id "_6") (name "B")))
  (("two"
    ((id "_2")
     (name "two")
     (pos_index 1)
     (pos_index_score 0)
     (pos "cd")
     (phr_pos "cd")
     (phrase_score -0.69302821)
     (pbreak_index 1)
     (pbreak_index_score 0)
     (pbreak "NB"))))
  (("hundred"
    ...
    ...
     (blevel 3))))))
festival> (utt.relation_tree utter 'Syllable)
((("syl" ((id "_7") (name "syl") (stress 1))))
 (("syl" ((id "_10") (name "syl") (stress 1))))
 (("syl" ((id "_14") (name "syl") (stress 0))))
 (("syl" ((id "_19") (name "syl") (stress 1))))
 (("syl" ((id "_23") (name "syl") (stress 0))))
 (("syl" ((id "_26") (name "syl") (stress 1)))))
festival> (utt.relation_tree utter 'Segment)
((("pau" ((id "_30") (name "pau") (end 0.15000001))))
 (("t" ((id "_8") (name "t") (end 0.25016451))))
 (("uw" ((id "_9") (name "uw") (end 0.32980475))))
 (("hh" ((id "_11") (name "hh") (end 0.39506164))))
 (("ah" ((id "_12") (name "ah") (end 0.48999402))))
 (("n" ((id "_13") (name "n") (end 0.56175226))))
 (("d" ((id "_15") (name "d") (end 0.59711802))))
 (("r" ((id "_16") (name "r") (end 0.65382934))))
 (("ax" ((id "_17") (name "ax") (end 0.67743915))))
 (("d" ((id "_18") (name "d") (end 0.75765681))))
 (("f" ((id "_20") (name "f") (end 0.86216313))))
 (("ih" ((id "_21") (name "ih") (end 0.93317086))))
 (("f" ((id "_22") (name "f") (end 1.0023116))))
 (("t" ((id "_24") (name "t") (end 1.0642071))))
 (("iy" ((id "_25") (name "iy") (end 1.1534019))))
 (("th" ((id "_27") (name "th") (end 1.2816957))))
 (("r" ((id "_28") (name "r") (end 1.3449684))))
 (("iy" ((id "_29") (name "iy") (end 1.5254952))))
 (("pau" ((id "_31") (name "pau") (end 1.6754951)))))
festival> (utt.relation_tree utter 'SylStructure)
((("two"
   ((id "_2")
    (name "two")
    (pos_index 1)
    (pos_index_score 0)
    (pos "cd")
    (phr_pos "cd")
    (phrase_score -0.69302821)
    (pbreak_index 1)
    (pbreak_index_score 0)
    (pbreak "NB")))
  (("syl" ((id "_7") (name "syl") (stress 1)))
   (("t" ((id "_8") (name "t") (end 0.25016451))))
   (("uw" ((id "_9") (name "uw") (end 0.32980475))))))
 (("hundred"
   ((id "_3")
    (name "hundred")
    (pos_index 1)
    (pos_index_score 0)
    (pos "cd")
    (phr_pos "cd")
    (phrase_score -0.692711)
    (pbreak_index 1)
    (pbreak_index_score 0)
    (pbreak "NB")))
  (("syl" ((id "_10") (name "syl") (stress 1)))
   (("hh" ((id "_11") (name "hh") (end 0.39506164))))
   (("ah" ((id "_12") (name "ah") (end 0.48999402))))
   (("n" ((id "_13") (name "n") (end 0.56175226)))))
  (("syl" ((id "_14") (name "syl") (stress 0)))
   (("d" ((id "_15") (name "d") (end 0.59711802))))
   (("r" ((id "_16") (name "r") (end 0.65382934))))
   (("ax" ((id "_17") (name "ax") (end 0.67743915))))
   (("d" ((id "_18") (name "d") (end 0.75765681))))))
 (("fifty"
   ((id "_4")
    (name "fifty")
    (pos_index 8)
    (pos_index_score 0)
    (pos "nn")
    (phr_pos "n")
    (phrase_score -0.69282991)
    (pbreak_index 1)
    (pbreak_index_score 0)
    (pbreak "NB")))
  (("syl" ((id "_19") (name "syl") (stress 1)))
   (("f" ((id "_20") (name "f") (end 0.86216313))))
   (("ih" ((id "_21") (name "ih") (end 0.93317086))))
   (("f" ((id "_22") (name "f") (end 1.0023116)))))
  (("syl" ((id "_23") (name "syl") (stress 0)))
   (("t" ((id "_24") (name "t") (end 1.0642071))))
   (("iy" ((id "_25") (name "iy") (end 1.1534019))))))
 (("three"
   ((id "_5")
    (name "three")
    (pos_index 1)
    (pos_index_score 0)
    (pos "cd")
    (phr_pos "cd")
    (pbreak_index 0)
    (pbreak_index_score 0)
    (pbreak "B")
    (blevel 3)))
  (("syl" ((id "_26") (name "syl") (stress 1)))
   (("th" ((id "_27") (name "th") (end 1.2816957))))
   (("r" ((id "_28") (name "r") (end 1.3449684))))
   (("iy" ((id "_29") (name "iy") (end 1.5254952)))))))
festival> (utt.relation_tree utter 'IntEvent)
((("L-L%" ((id "_32") (name "L-L%"))))
 (("H*" ((id "_33") (name "H*"))))
 (("H*" ((id "_34") (name "H*"))))
 (("H*" ((id "_35") (name "H*")))))
festival> (utt.relation_tree utter 'Intonation)
((("syl" ((id "_26") (name "syl") (stress 1)))
  (("L-L%" ((id "_32") (name "L-L%")))))
 (("syl" ((id "_7") (name "syl") (stress 1)))
  (("H*" ((id "_33") (name "H*")))))
 (("syl" ((id "_10") (name "syl") (stress 1)))
  (("H*" ((id "_34") (name "H*")))))
 (("syl" ((id "_19") (name "syl") (stress 1)))
  (("H*" ((id "_35") (name "H*"))))))
festival> (utt.relation_tree utter 'Target)
((("t" ((id "_8") (name "t") (end 0.25016451)))
  (("0" ((id "_36") (f0 101.42016) (pos 0.1)))))
 (("uw" ((id "_9") (name "uw") (end 0.32980475)))
  (("0" ((id "_37") (f0 121.11904) (pos 0.25)))))
 (("hh" ((id "_11") (name "hh") (end 0.39506164)))
  (("0" ((id "_38") (f0 119.19957) (pos 0.30000001)))))
 (("ah" ((id "_12") (name "ah") (end 0.48999402)))
  (("0" ((id "_39") (f0 123.81679) (pos 0.44999999)))))
 (("d" ((id "_15") (name "d") (end 0.59711802)))
  (("0" ((id "_40") (f0 117.02986) (pos 0.60000002)))))
 (("ax" ((id "_17") (name "ax") (end 0.67743915)))
  (("0" ((id "_41") (f0 110.17942) (pos 0.85000008)))))
 (("f" ((id "_20") (name "f") (end 0.86216313)))
  (("0" ((id "_42") (f0 108.59299) (pos 1.0000001)))))
 (("ih" ((id "_21") (name "ih") (end 0.93317086)))
  (("0" ((id "_43") (f0 115.24371) (pos 1.1500001)))))
 (("t" ((id "_24") (name "t") (end 1.0642071)))
  (("0" ((id "_44") (f0 108.76601) (pos 1.3000002)))))
 (("iy" ((id "_25") (name "iy") (end 1.1534019)))
  (("0" ((id "_45") (f0 102.23844) (pos 1.4500003)))))
 (("th" ((id "_27") (name "th") (end 1.2816957)))
  (("0" ((id "_46") (f0 99.160072) (pos 1.5000002)))))
 (("iy" ((id "_29") (name "iy") (end 1.5254952)))
  (("0" ((id "_47") (f0 90.843689) (pos 1.7500002))))
  (("0" ((id "_48") (f0 88.125809) (pos 1.8000003))))))
festival> (utt.relation_tree utter 'HMMstate)
((("pau_1" ((id "_49") (name "pau_1") (statepos 1) (end 0.050000001)*
 (("pau_2" ((id "_50") (name "pau_2") (statepos 2) (end 0.1))))
 (("pau_3" ((id "_51") (name "pau_3") (statepos 3) (end 0.15000001)*
 (("t_1" ((id "_52") (name "t_1") (statepos 1) (end 0.16712391))))
 (("t_2" ((id "_53") (name "t_2") (statepos 2) (end 0.23217295))))
 (("t_3" ((id "_54") (name "t_3") (statepos 3) (end 0.25016451))))
 (("uw_1" ((id "_55") (name "uw_1") (statepos 1) (end 0.2764155))))
 (("uw_2" ((id "_56") (name "uw_2") (statepos 2) (end 0.3001706))))
 (("uw_3" ((id "_57") (name "uw_3") (statepos 3) (end 0.32980475))))
 (("hh_1" ((id "_58") (name "hh_1") (statepos 1) (end 0.3502973))))
 ...
 ...
 (("iy_1" ((id "_100") (name "iy_1") (statepos 1) (end 1.3995106))))
 (("iy_2" ((id "_101") (name "iy_2") (statepos 2) (end 1.4488922))))
 (("iy_3" ((id "_102") (name "iy_3") (statepos 3) (end 1.5254952))))
 (("pau_1" ((id "_103") (name "pau_1") (statepos 1) (end 1.5754951)*
 (("pau_2" ((id "_104") (name "pau_2") (statepos 2) (end 1.6254952)*
 (("pau_3" ((id "_105") (name "pau_3") (statepos 3) (end 1.6754951)*
festival> (utt.relation_tree utter 'segstate)
((("pau" ((id "_30") (name "pau") (end 0.15000001)))
  (("pau_1" ((id "_49") (name "pau_1") (statepos 1) (end 0.050000001)
  (("pau_2" ((id "_50") (name "pau_2") (statepos 2) (end 0.1))))
  (("pau_3" ((id "_51") (name "pau_3") (statepos 3) (end 0.15000001)*
 (("t" ((id "_8") (name "t") (end 0.25016451)))
  (("t_1" ((id "_52") (name "t_1") (statepos 1) (end 0.16712391))))
  (("t_2" ((id "_53") (name "t_2") (statepos 2) (end 0.23217295))))
  (("t_3" ((id "_54") (name "t_3") (statepos 3) (end 0.25016451)))))
 (("uw" ((id "_9") (name "uw") (end 0.32980475)))
  (("uw_1" ((id "_55") (name "uw_1") (statepos 1) (end 0.2764155))))
  (("uw_2" ((id "_56") (name "uw_2") (statepos 2) (end 0.3001706))))
  (("uw_3" ((id "_57") (name "uw_3") (statepos 3) (end 0.32980475))))
 ...
 ...
 (("iy" ((id "_29") (name "iy") (end 1.5254952)))
  (("iy_1" ((id "_100") (name "iy_1") (statepos 1) (end 1.3995106))))
  (("iy_2" ((id "_101") (name "iy_2") (statepos 2) (end 1.4488922))))
  (("iy_3" ((id "_102") (name "iy_3") (statepos 3) (end 1.5254952))*
 (("pau" ((id "_31") (name "pau") (end 1.6754951)))
  (("pau_1" ((id "_103") (name "pau_1") (statepos 1) (end 1.5754951)*
  (("pau_2" ((id "_104") (name "pau_2") (statepos 2) (end 1.6254952)*
  (("pau_3" ((id "_105") (name "pau_3") (statepos 3) (end 1.6754951)*
festival> (utt.relation_tree utter 'mcep)
((("pau_1"
   ((id "_106")
    (frame_number 0)
    (name "pau_1")
    (clustergen_param_frame 19315))))
 (("pau_1"
   ((id "_107")
    (frame_number 1)
    (name "pau_1")
    (clustergen_param_frame 19315))))
 (("pau_1"
   ((id "_108")
    (frame_number 2)
    (name "pau_1")
    (clustergen_param_frame 19315))))
 (("pau_1"
   ((id "_109")
    (frame_number 3)
    (name "pau_1")
    (clustergen_param_frame 19315))))
 ...
 ...
 (("t_1"
   ((id "_137")
    (frame_number 31)
    (name "t_1")
    (clustergen_param_frame 26089))))
 (("t_1"
   ((id "_138")
    (frame_number 32)
    (name "t_1")
    (clustergen_param_frame 26085))))
 (("t_1"
   ((id "_139")
    (frame_number 33)
    (name "t_1")
    (clustergen_param_frame 26085))))
 (("t_2"
   ((id "_140")
    (frame_number 34)
    (name "t_2")
    (clustergen_param_frame 26642))))
...
...
 (("uw_1"
   ((id "_157")
    (frame_number 51)
    (name "uw_1")
    (clustergen_param_frame 27595))))
 ...
 (("pau_3"
   ((id "_438")
    (frame_number 332)
    (name "pau_3")
    (clustergen_param_frame 22148))))
 (("pau_3"
   ((id "_439")
    (frame_number 333)
    (name "pau_3")
    (clustergen_param_frame 22148))))
 (("pau_3"
   ((id "_440")
    (frame_number 334)
    (name "pau_3")
    (clustergen_param_frame 22148))))
 (("pau_3"
   ((id "_441")
    (frame_number 335)
    (name "pau_3")
    (clustergen_param_frame 22365)))))
festival> (utt.relation_tree utter 'mcep_link)
((("pau_1" ((id "_49") (name "pau_1") (statepos 1) (end 0.050000001).
  (("pau_1"
    ((id "_106")
     (frame_number 0)
     (name "pau_1")
     (clustergen_param_frame 19315))))
  (("pau_1"
    ((id "_107")
     (frame_number 1)
     (name "pau_1")
     (clustergen_param_frame 19315))))
  (("pau_1"
    ((id "_108")
     (frame_number 2)
     (name "pau_1")
     (clustergen_param_frame 19315))))
  ...
  ...
  (("pau_3"
    ((id "_439")
     (frame_number 333)
     (name "pau_3")
     (clustergen_param_frame 22148))))
  (("pau_3"
    ((id "_440")
     (frame_number 334)
     (name "pau_3")
     (clustergen_param_frame 22148))))
  (("pau_3"
    ((id "_441")
     (frame_number 335)
     (name "pau_3")
     (clustergen_param_frame 22365))))))
festival> (utt.relation_tree utter 'Wave)
((("0" ((id "_442") (wave "[Val wave]")))))
festival>

Notes :
* some parentheses have been deleted in the display for formating reasons
… some content has been deleted to reduce the size of the analyzed code

Results of the code analysis

The number of items created for the string “253” are shown in the following table :

number item id’s
1 token 1
4 word 2-5
1 phrase 6
6 syllable 7, 10, 14, 19, 23, 26
19 segment 8-9, 11-13, 15-18, 20-22, 24-25, 27-31
4 intevent 32-35
13 target 36-48
57 hmmstate 49-105
336 mcep 106-441
1 wave 442

The features associated to the different items are presented in the next table :

item features
token name, whitespace, prepunctuation, token_pos
word name, pos_index, pos_index_score, pos, phr_pos, phrase_score, pbreak_index, pbreak_index_score, pbreak, blevel
phrase name
syllable name, stress
segment name, end
intevent name
target f0, pos
hmmstate name, statepos, end
mcep name, frame_number, clustergen_param_frame
wave Val

The last table shows the relations between the different items in the HRG :

item daughter leaf relation
token word x Token
word syllable SylStructure
phrase word x Phrase
syllable segment x (except silence) SylStructure
syllable intevent x Intonation
segment target x Target
segment hmmstate x segstate
segment mcep x mcep_link

Relations between utterance items

To better understand the relations between utterance items, I use a second example :

festival>
(set! utter (SayText "253 and 36"))
(utt.relation.print utter 'Token)
Tok

Festival SayText

There are 3 tokens. The Token relation is a list of trees where each root is the white space separated tokenized object from the input character string and where the daughters are the list of words associated with the tokens. Most often it is a one to one relationship, but in the case of digits a token is associated with several words. The following command shows the Token tree with the daughters :

(utt.relation_tree utter 'Token)
Festival Token_tree

Festival Token_tree

We can check that the word list corresponds to the Token tree list :

(utt.relation.print utter 'Word)
Festival Word List

Festival Word List

To access the second word of the first token we can use two methods :

(item.name (item.daughter2 (utt.relation.first utter 'Token)))

or

(item.name (item.next (utt.relation.first utter 'Word)))
tok

Festival access methods to word item

TTS stages and steps

In the next chapters the different stages and steps executed to synthesize a text string are described with more details. In the first step a simple and a complex utterance of type Text are created :

(set! simple_utt (Utterance Text 
"The quick brown fox jumps over the lazy dog"))
(set! complex_utt (Utterance Text
"Mr. James Brown Jr. attended flight No AA4101 to Boston on
Friday 02/13/2014."))

The complex utterance named complex_utt is used in the following examples.

1. Text-to-Token Conversion

Text

Text is a string of characters in ASCII or ISO-8850 format. Written (raw) text usually contains also numbers, names, abbreviations, symbols, punctuation etc which must be translated into spoken text. This process is called Tokenization. Other terms used are (lexical) Pre-Processing, Text Normalization or Canonicalization. To access the items and features of the defined utterance named complex_utt in Festival we use the following modules :

festival> (Initialize complex_utt) ; Initialize utterance
festival> (utt.relationnames complex_utt) ; show created relations
Festival Utterance Initialization

Festival Utterance Initialization

The result nil indicates that there exist not yet a relation inside the text-utterance.

Tokens

The second step is the Tokenization which consists in the conversion of the input text to tokens. A token is a sequence of characters where whitespace and punctuation are eliminated. The following Festival command is used to convert raw text to tokens and to show them :

festival> (Text complex_utt) ; convert text to tokens
festival> (utt.relationnames complex_utt) ; check new relations
festival> (utt.relation.print complex_utt 'Token) ; display tokens
Festival Text Module to convert raw text to tokens

Festival Text Module to convert raw text to tokens

There are several methods to access individual tokens :

festival> (utt.relation.first complex_utt 'Token) ; returns 1st token
festival> (utt.relation.last complex_utt 'Token) ; returns last token
festival> (utt.relation_tree complex_utt 'Token) ; returns token tree
Festival Token Access

Festival Token Access

This utt.relation_tree method can also be applied to other relations than ‘Tokens.

2. Token-to-Word Conversion

Words

In linguistics, a word is the smallest element that may be uttered in isolation with semantic or pragmatic content. To convert the isolated tokens to words, we use the Festival commands :

festival> (Token complex_utt) ; token to word conversion
festival> (utt.relationnames complex_utt) ; check new relations
festival> (utt.relation.print complex_utt 'Word) ; display words
Festival Token Module to convert tokens to words

Festival Token Module to convert tokens to words

The rules to perform the token to word conversion are specified in the Festival script token.scm.

POS

Part-of-Speech (POS) Tagging is related to the Token-to-Word conversion. POS is also called grammatical tagging or word-category disambiguation. It’s the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition, as well as its context (identification of words as nouns, verbs, adjectives, adverbs, etc.)

To do the POS tagging, we use the commands

festival> (POS complex_utt) ; Part of Speech tagging
festival> (utt.relationnames complex_utt) ; check new relations
festival> (utt.relation.print complex_utt 'Word) ; display words

The relation check shows that no new relation was created with the POS method. There are however new features which have been added to the ‘Word relation.

Festival POS Module to tag the words

Festival POS Module to tag the words

The new features are :

  • pos_index n
  • pos_index_score m
  • pos xx

Phrase

The last step of the Token-to-Word conversion is the phrasing. This process determines the places where phrase boundaries should be inserted. Prosodic phrasing in TTS makes the whole speech more understandable. The phrasing is launched with the following commands :

festival> (Phrasify complex_utt) ;
festival> (utt.relationnames complex_utt) ; check new relations
festival> (utt.relation.print complex_utt 'Phrase) ; display breaks
Festival Phrasify Module to insert boundaries

Festival Phrasify Module to insert boundaries

The result can be seen in new attributes in the Word relation:

festival> (utt.relation.print complex_utt 'Word)
  • phr_pos xx
  • phrase_score nn
  • pbreak_index n
  • pbreak_index_score m
  • pbreak yy  (B for small breaks, BB is for big breaks, NB for no break)
  • blevel p
utt6a

Festival Word list after phrasing (click to enlarge)

3. Word-to-Phoneme Conversion

The command

festival> (Word complex_utt)

generates 3 new relations : syllables, segments and SylStructure.

utt7

Festival relations generated by the Word method

Segment

Segments and phones are synonyms.

festival> (utt.relation.print complex_utt 'Segment)
utt9

Festival segments = phones

Syllable

Consonants and vowels combine to make syllables. They are often considered the phonological building blocks of words, but there is no universally accepted definition for a syllable. An approximate definition is : a syllable is a vowel sound together with some of the surrounding consonants closely associated with it. The general structure of a syllable consists of three segments :

  • Onset : a consonant or consonant cluster
  • Nucleus : a sequence of vowels or syllabic consonant
  • Coda : a sequence of consonants

Nucleus and coda are grouped together as a Rime. Prominent syllables are called accented; they are louder, longer and have a different pitch.

The following Festival command shows the syllables of the defined utterance.

festival> (utt.relation.print complex_utt 'Syllable)
utt8

Festival syllables

SylStructure

Words, segments and syllables are related in the HRG trought the SylStructure. The command

festival> (utt.relation.print complex_utt 'SylStructure)

prints these related items.

utt10 (click to enlarge)

Festival SylStructure  (click to enlarge)

4. Prosodic Information Addition

Besides the phrasing with break indices, additional prosodic components can be added to speech synthesis to improve the voice quality. Some of these elements are :

  • pitch accents (stress)
  • final boundary tones
  • phrasal tones
  • F0 contour
  • tilt
  • duration

Festival supports ToBI, a framework for developing community-wide conventions for transcribing the intonation and prosodic structure of spoken utterances in a language variety.

The process

festival> (Intonation complex_utt)

generates two additional relations : IntEvent and Intonation

utt11

Festival prosodic relations

IntEvent

The command

festival> (utt.relation.print complex_utt 'IntEvent)

prints the IntEvent items.

utt12

Festival IntEvent items

The following types are listed :

  • L-L% : low boundary tone
  • H* : peak accent
  • !H* : downstep high
  • L+H* : bitonal accent, rising peak

Intonation

The command

festival> (utt.relation.print complex_utt 'Intonation)

prints the Intonation items.

utt13

Festival Intonation items

Only the syllables with stress are displayed.

Duration

The process

festival> (Duration complex_utt)

creates no new relations and I have not seen any new items or features in other relations.

utt14

utt14

Target

The last process in the prosodic stage

festival> (Int_Targets complex_utt)

generates the additional relation Target.

utt15

Festival relations after the Int_Targets process

The command

festival> (utt.relation.print complex_utt 'Target)

prints the target items.

Festival clustergen targets

Festival clustergen targets

The unique target features are the segment name and the segment end time.

5. Waveform Generation

Wave

The process

festival> (Wave_Synth complex_utt)
festival> (utt.relation.print complex_utt 'Wave)

generates five new relations :

  • HMMstate
  • segstate
  • mcep
  • mcep_links
  • Wave
Festival

Festival Wave relations for clustergen voice

In the next chapters we use the method

(utt.relation.print complex_utt 'Relation)

to display the relations and features specific to the diphone voice.

Relation ‘HMMstate

HMMstates for clustergen voice

HMMstates for Festival clustergen voice

Relation ‘segstate

segstates for

segstates for Festival clustergen voice

Relation ‘mcep

mcep

mcep features for Festival clustergen voice

Relation ‘mcep_links

mcep_link

mcep_links relation for Festival clustergen voice

Relation ‘Wave

wave

Wave relation for Festival clustergen voice

Diphone Voice Utterance

If we use a diphone voice (e.g. the default kal_diphone voice) instead of the clustergen voice, the last step of the prosodic stage (No 4) and the complete wave-synthesis stage (No 5) provide different relations and features.

We use the Festival method “SayText”, a combination of the above presented processes

  • Initialize utt
  • Text utt
  • Token utt
  • POS utt
  • Phrasify utt
  • Word utt
  • Intonation utt
  • Duration utt
  • Int_Targets utt
  • Wave_Synt utt

to create the same complex utterance as in the first example :

festival>
(set! complex_utt (SayText "Mr. James Brown Jr. attended flight 
No AA4101 to Boston on Friday 02/13/2014."))
(utt.relationnames complex_utt)

Here are the results :

clun

Utterance relations for a Festival diphone voice

In the next chapters we use the method

(utt.relation.print complex_utt 'Relation)

to display the relations and features specific to the diphone voice.

Relation ‘Target

utt16

Relation Target for diphone voice

Relation ‘Unit

utt18

Relation Unit for diphone voice (click to enlarge)

Relation ‘SourceCoef

utt19

Relation ‘SourceCoef for diphone voice

Relation ‘fo

utt20

Relation f0 for diphone voice

Relation ‘TargetCoef

utt21

Relation TargetCoef for diphone voice

Relation ‘US_map

utt22

Relation US_map for diphone voice

Relation ‘Wave

utt23

Relation Wave for diphone voice

Playing and saving the diphone voice utterance :

utt24

Playing and saving a synthesized Festival utterance

Diphone voice utterance shown in Audacity :

utt_wave

Display of a synthesized Festival utterance

Clunits Voice Utterance

What is true for the diphone voice is also ture for a clunits voice. The last step of the prosodic stage (No 4) and the complete wave-synthesis stage (No 5) generate different relations and features. As an example we use a swedish clunits voice :

clu1

Relations for Festival clunits voice

Relation ‘Target

clu4

Relation Target for Festival clunits voice  (click to enlarge)

Relation ‘Unit

clu3

Relation unit for Festival clunits voice  (click to enlarge)

Relation ‘SourceSegments

clu2

Relation SourceSegments for Festival clunits voice

That’s all.

Festival TTS Scheme

Last update : April 23, 2015

Scheme

The Scheme programmimg language is a dialect of Lisp designed to be more consistent. Originally specified in 1958, Lisp is the second-oldest high-level programming language;  Fortran is one year older. Lisp has a distinctive, fully parenthesized Polish Notation. The name LISP derives from LISt Processing. All program code in Lisp is written as s-expressions, or parenthesized lists. Lots of brackets comes to most people’s mind when they look at s-expressions. Lisp was invented by John McCarthy in 1958 while he was at the Massachusetts Institute of Technology (MIT). Lisp became quickly the favored programming language for artificial intelligence (AI) research. Another dialect of Lisp is Common Lisp (CL).

Scheme was invented by Guy Lewis Steele Jr. and Gerald Jay Sussman. The Scheme language is standardized in the official IEEE standard and in a de facto standard called the Revisedn Report on the Algorithmic Language Scheme (RnRS). The latest standard R7RS was finalized in 2013. An improper list of Scheme resources is available at the schemers.org website.

The Festival TTS system uses an enhanced version of Scheme, based on George Carrette’s SIOD (Scheme in one Defun).

In Scheme, an expression is an atom or a list. Atoms can be symbols, numbers, strings or special types like functions, arrays, hash tables. A list consist of a left parenthesis, a number of expressions and a right parenthesis. Comments are started by a semicolon and run until end of line. Scheme is case sensitive.

Each member of a list (except set!) is evaluated. The 1st item is treated as function and applied using the remainder of the lists as arguments to the function. Strings are bounded by the double-quote character.

"This is a string with embedded \"...\" quotes"

Double quotes in a string are escaped with a backslash ( \ ).

Scheme Core Functions

The core functions are :

  • (set! SYMBOL VALUE) ; attribute a value to a variable
  • (define (FUNCNAME ARG0 ARG1 …) . BODY) ; declare the boy of a function with arguments
  • (if TEST TRUECASE [FALSECASE] ) ; if the value of TEST is non-nil, return value of evaluated truecase-expression, otherwise, if present, return evaluated falsecase-expressionor nil
  • (cond (TEST0 . BODY) (TEST1 . BODY) …) ; multipe if statement
  • (begin . BODY) ; returns the value of the last s-expression in a list
  • (or . DISJ) ; return the value of the 1st non-nil disjunct
  • (and : CONJ) ; return the value of the 1st nil conjunct or the value of the last conjunct
  • (car EXPR) ; return the 1st item of a list or nil for an atom or empty list
  • (cdr EXPR) ; return the rest of a list or nil for an atom or empty list
  • (cons EXPR0 EXPR1) ; build a new list with car of EXPR0 and crd of EXPR1
  • (list . BODY) ; form a list from each of the arguments
  • (append . BODY) ; join each of the arguments into a single list
  • (nth N LIST) ; return Nth member of list (1st item is 0th member)
  • (nth_cdr N LIST) ; return Nth cdr list
  • (last LIST) ; return last crd of a list
  • (reverse list) ; return the list in inverse order
  • (member ITEM LIST) ; returns the cdr in LIST whose car is ITEM or nil if not found
  • (assoc ITEM ALIST) ; standardlist format for representing value pairs
  • (intern “abc”) ; convert a string to a symbol
  • (parse-number “3.14”) ; convert a string to a number
  • (mapcar fct list1 list2 …) ; returns a list which is the result of applying the function fcn to the elements of each of the lists specified
  • (number? x) ; returns true if x is a number
  • (symbol? x) ; returns true if x is a symbol
  • (quote x) ;  special form that returns x without evaluating it. Commonly written in abbreviated format as ‘x (apostrophe x = short hand for quote). The (quote something) form returns that something as itself no matter what the something is (symbol, list …). Self evaluated data like numbers and strings need not to be quoted.

If the 1st statement of a function is a string, it is treated as a documentation string. The string will be printed when help is requested for that function symbol. To request help, press the keys <Esc> and <h> simultaneously after entering the name of the function.

Help

Scheme Help Function

s-expressions

A Scheme s-expression is a construct that returns a value. Typical s-expressions are

3
( 1 2 3)
(a ( b c ) d)
(( a b ) ( d e ))

Some examples of simple arithmetic scheme s-expressions are presented hereafter :

(+ 2 3 4)   ; add the arguments
(set! a 3)  ; define the variable a to 3
(* a 5)     ; multiply the arguments
(/ 8 a)     ; divide the 1st argument by the 2nd argument
(- 12 a)    ; substract the 2nd argument from the 1st argument
(cons 1 2)  ; join 2 arguments as a pair

The following figure shows the results :

Scheme

Scheme arithmetic expressions

Here are some more examples of scheme s-expressions :

; define a list with 3 fruits
(set! fruits '(apples pears bananas))
; use the 1st item of the list
(car fruits) 
; use the remaining items following the 1st item of the list
(cdr fruits)
; define a list with 2 vegetables
(set! vegetables '(carrots beans)) 
; combine the fruit and vegetable lists
(append fruits vegetables) 
; get the number of items in the list
(length fruits)
; get the number of items in the combined list
(length (append fruits vegetables))
; join 2 lists as a pair
(cons vegetables fruits)

The next figure shows the results :

Scheme list expressions

Scheme list expressions

Regular expressions

A regular expression is a sequence of characters that forms a search pattern, mainly for use in string matching (filters). A regular expression is sometimes called a rational expression and abbreviated as regex or regexp. The concept was formalized in the 1950s by the American mathematician Stephen Kleene who created a regular language. Today regular expressions are so useful in computing that the various systems to specify regular expressions have evolved to provide both a basic and extended standard for the grammar and syntax.

Festival Scheme uses a regex implementation based on Henry Spencer‘s regex code. In general all characters match themselves, except for the following ones which have special interpretations (meta-characters), if they are not preceded by a backslash :

.  *  +  ?  [  ]  (  )  |  ^  $  \
  • .    matches any character in the regex
  • *    matches zero or more occurences of the preceding item in the regex
  • +   matches one or more occurences of the preceding item in the regex
  • ?   matches zero or one occurence of the preceding item in the regex
  • [  ]   defines a range of characters
  • (  )  defines a section (scope, block, capturing group)
  • |     or operator ; separates alternatives
  • ^   if specified first negates the class
  • $   matches the ending position of a string
  • \  backslash to escape meta-characters

Examples :

  • a.c   > abc, acc, adc
  • a*c   > c, aac, ac, aaaaaaaac
  • a+c  > ac, aaac
  • ab?c  > ac, abc
  • [a-z] any lower case letter
  • [a-zA-Z] any lower or upper case letter
  • gr(e|a)y  > grey, gray
  • ^abc  > any characters other than a, b or c

Boolean Values

Scheme provides a special unique object called false, whose written representation is #f and who is the result of a condition expression (if or cond). This is the only value that counts as false, all others count as true. For this reason there is no need to provide a boolean object true, but for clarity, Scheme provides an object written #t which can be used as true value. In Festival’s Scheme (SIOD), #f and #t are unbound variables. In SIOD the variable t stands for true and nil stands for false.

Named and unnamed (anonymous = lambda) functions

Here is an example of a new (named) function :

(define (add a b) (+ a b))
(add 7 5)
Scheme  functions

Scheme functions

Scheme provides a lot of useful string functions, for example

(string-equal ATOM1 ATOM2)
(string-append STR1 STR2)
(string-before STR SUBSTR)
(string-after STR SUBSTR)
(string-length SYMBOL)
(string-matches STR REGEX)
(Symbolexplode SYMBOL)
(member_string STR LIST)

A function that returns either true (t) or false (nil) is called a predicate.

Instead of a named function, we can create unnamed (anonymous) functions using the special lambda form.

(lambda (arg1 arg2 ...) form1 form2 ...)

This function is simply defined where it’s used; for example, to square the items of a list we can apply the following construct

(set! mylist '(1 3 6 8))
(mapcar (lambda (x) (* x x)) mylist)
Lambda

Scheme lambda example

Conditionals

Scheme provides a very powerful nested conditional expression with the syntax

( cond < clause 1 > <c lause 2 > < clause 3 > ... )

where < clause > is of the form

( < predicate > < form > ...)

The last clause may be an else clause which has the syntax

( < else > < form 1> < form 2 > ...)

Cond is a special form where each clause is processed until the predicate expression of the clause evaluates true. Then each subform in the predicate is evaluated with the value of the last one becoming the value of the cond form.

A simple example is :

(cond (( > a b ) 'greater)
      (( < a b ) 'less)
      (t 'equal))
cond

Scheme cond example

A more complex example is show hereafter :

(cond
((string-matches name "[1-9][0-9]+")
  (mbarnig_lb::number token name))
((string-matches name "[A-Z]+")
  (mbarnig_lb::upper_case_abbr token name))
((string-matches  name "[a-z]+")
  (mbarnig_lb::lower_case_abbr token name))
(t (list name))) ; when no specific rules apply do the general ones

Another special conditional form is

(and subform1 subform2 subform3 ...)

This form causes the evaluation of its subforms in order, from left to right, continuing if and only if the subform returns a non-null value.

System commands

Here are some examples of scheme system commands :

(quit) or (exit) ; close the program
(pwd) ; show the pathname of the current directory
(cd DIRECTORY)
(getenv NAME)
(setenv NAME VALUE)
(set_backtrace t) ; display a backtrace when a scheme error occurs
(unwind-protect …) ; catch errors and continue normally

Hooks

A hook in Scheme terms is a position within the program code where a user may specify his own customization. There a number of places in Festival where hooks are used. A hook variable contains either a function or a list of functions that are to be applied at some point in the processing. A list of defined hooks  in Festival is shown below :

  • after_analysis_hooks : functions applied after analysis, before synthesis
  • default_after_analysis_hooks : default functions applied after analysis
  • before_synth_hooks : functions applied on synthesized utterances
  • default_before_synth_hooks : default functions applied on synthesized utterances
  • after_synth_hooks : functions applied after synthesis to manipulate waveforms
  • default_after_synth_hooks : default functions applied after synthesis
  • diphone_module_hooks : functions applied at the start of the diphone module
  • tts_hooks : functions applied during text to speech
  • xxml_hooks : functions applied before tts_hooks
  • xxml_token_hooks : functions applied to each token
Festival hooks

Festival hooks

Links

A list with links to websites providing additional informations about Scheme is provided hereafter :

Festival uniphone voice creation

Referring to my recent post about Festival, I am glad to announce that I was successful in building and testing a new uniphone voice (english) with my own prompt recordings. The goal is to set up my system to create a luxembourgish synthetic voice for the Festival package.

The list of the different steps is shown below :

• (creation of the voice directory mbarnig_en_marco)
• $FESTVOXDIR/src/unitsel/setup_clunits mbarnig en marco uniphone
• (define a phoneset)
• (define a lexicon)
• festival -b festvox/build_clunits.scm '(build_prompts_waves 
"etc/uniphone.data")'
• (uncomment the line USE_SOX=1 in the script prompt_them)
• ./bin/prompt_them etc/uniphone.data 
• ./bin/make_labs prompt-wav/*.wav 
• festival -b festvox/build_clunits.scm '(build_utts 
"etc/uniphone.data")' 
• (copy etc/uniphone.data into etc/txt.done.data)
• ./bin/make_pm_wave wav/*.wav
• ./bin/make_mcep wav/*.wav
• festival -b festvox/build_clunits.scm '(build_clunits 
"etc/uniphone.data")'
• (copy data in Festival voice directory)
• festival> (voice_mbarnig_en_marco_clunits)
• festival> (SayText "Hello Marco, how are you?")

1. Voice Folder

First I created a new voice folder mbarnig_en_marco inside the Festival_TTS/festvox/ directory and opened a terminal window inside this new folder.

2. Clunits Setup

I launched the script

$FESTVOXDIR/src/unitsel/setup_clunits mbarnig en marco uniphone

to construct the voice folder structure and copy there the template files for voice building.
The arguments of the setup script setup_clunits are :

  • institution : mbarnig
  • language : en
  • speaker : marco
  • standard prompt list : uniphone
setup

Festival : setup_clunits

The following folders are created inside the voice directory mbarnig_en_marco :

  1. bin
  2. cep
  3. emu
  4. etc
  5. f0
  6. festival
  7. festvox
  8. group
  9. lab
  10. lar
  11. lpc
  12. mcep
  13. phr
  14. pm
  15. pm_lab
  16. prompt_cep
  17. prompt_lab
  18. prompt_utt
  19. prompt_wav
  20. recording
  21. scratch
  22. syl
  23. versions
  24. wav
  25. wrd

The following programs are copied into the 1sr folder (mbarnig_en_marco/bin)  :

  1. add_noise
  2. contour_powernormalize
  3. do_build
  4. find_db_duration
  5. find_num_available_cpu
  6. find_powercontours
  7. find_poerfactors
  8. get_lars
  9. get_wavs
  10. make_cmm
  11. make_dist
  12. make_f0
  13. make_labs
  14. make_lpc
  15. make_mcep
  16. make_pm
  17. make_pm_fix
  18. make_pm_pmlab
  19. make_pm_wave
  20. make_pmlab_pm
  21. make_samples
  22. prompt_them
  23. prune_middle_silence
  24. prune_silence
  25. reduce_prompts
  26. simple_powernormalize
  27. sphinx_lab
  28. sphinxtrain
  29. synthfile
  30. traintest
  31. ws

The following files are created inside the 4th folder (mbarnig_en_marco/etc) :

  • emu_f0.tpl
  • emu_hier.tpl
  • emu_lab.tpl
  • emu_pm.tpl
  • uniphone.data
  • voice.defs
  • ws_festvox.conf

The following sub-folders (most empty) are created inside the 6th folder (mbarnig_en_marco/festival) :

  • clunits, including a file all.desc
  • coeffs
  • disttabs
  • dur
  • f0
  • feats
  • phrbrk
  • trees
  • utts

The following scripts are created inside the 7th folder (mbarnig_en_marco/festvox) :

  • build_clunits.scm
  • build_st.scm
  • mbarnig_en_marco_clunits.scm
  • mbarnig_en_marco_duration.scm
  • mbarnig_en_marco_durdata.scm
  • mbarnig_en_marco_f0model.scm
  • mbarnig_en_marco_intonation.scm
  • mbarnig_en_marco_lexicon.scm
  • mbarnig_en_marco_other.scm
  • mbarnig_en_marco_phoneset.scm
  • mbarnig_en_marco_phrasing.scm
  • mbarnig_en_marco_tagger.scm
  • mbarnig_en_marco_tokenizer.scm

The other listed folders are empty.

The file uniphone.data in the mbarnig_en_marco/etc folder contains the following minimal prompt-set :

( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )

These 3 sentences contain each of the english phonemes once. The prompt list is coded in the standard Festival data-format. The spaces after the left parantheses are required.

The file voice.defs in the mbarnig_en_marco/etc folder contains the following parameters :

FV_INST=mbarnig
FV_LANG=en
FV_NAME=marco
FV_TYPE=clunits
FV_VOICENAME=$FV_INST"_"$FV_LANG"_"$FV_NAME
FV_FULLVOICENAME=$FV_VOICENAME"_"FV_TYPE

The file ws_festvox.conf in the mbarnig_en_marco/etc folder is automatically generated by WaveSurfer.

Phoneset Definition

The phoneset for the new voice is defined in the script mbarnig_en_marco_phoneset.scm. Referring to the english Festival radio phoneset, I modified the phoneset-script as follows :

;;; Phoneset for mbarnig_en_marco
;;;
(defPhoneSet
mbarnig_en_marco
;;; Phone Features
(;; vowel or consonant
(vc + -)
;; vowel length: short long dipthong schwa
(vlng s l d a 0)
;; vowel height: high mid low
(vheight 1 2 3 0)
;; vowel frontness: front mid back
(vfront 1 2 3 0)
;; lip rounding
(vrnd + - 0)
;; consonant type: stop fricative affricate nasal lateral approximant
(ctype s f a n l r 0)
;; place of articulation: labial alveolar palatal labio-dental
;; dental velar glottal
(cplace l a p b d v g 0)
;; consonant voicing
(cvox + - 0)
)
;; Phone set members
(
;; Note these features were set by awb so they are wrong !!!
(aa + l 3 3 - 0 0 0) ;; father
(ae + s 3 1 - 0 0 0) ;; fat
(ah + s 2 2 - 0 0 0) ;; but
(ao + l 3 3 + 0 0 0) ;; lawn
(aw + d 3 2 - 0 0 0) ;; how
(ax + a 2 2 - 0 0 0) ;; about
(axr + a 2 2 - r a +)
(ay + d 3 2 - 0 0 0) ;; hide
(b - 0 0 0 0 s l +)
(ch - 0 0 0 0 a p -)
(d - 0 0 0 0 s a +)
(dh - 0 0 0 0 f d +)
(dx - a 0 0 0 s a +) ;; ??
(eh + s 2 1 - 0 0 0) ;; get
(el + s 0 0 0 l a +)
(em + s 0 0 0 n l +)
(en + s 0 0 0 n a +)
(er + a 2 2 - r 0 0) ;; always followed by r (er-r == axr)
(ey + d 2 1 - 0 0 0) ;; gate
(f - 0 0 0 0 f b -)
(g - 0 0 0 0 s v +)
(hh - 0 0 0 0 f g -)
(hv - 0 0 0 0 f g +)
(ih + s 1 1 - 0 0 0) ;; bit
(iy + l 1 1 - 0 0 0) ;; beet
(jh - 0 0 0 0 a p +)
(k - 0 0 0 0 s v -)
(l - 0 0 0 0 l a +)
(m - 0 0 0 0 n l +)
(n - 0 0 0 0 n a +)
(nx - 0 0 0 0 n d +) ;; ???
(ng - 0 0 0 0 n v +)
(ow + d 2 3 + 0 0 0) ;; lone
(oy + d 2 3 + 0 0 0) ;; toy
(p - 0 0 0 0 s l -)
(r - 0 0 0 0 r a +)
(s - 0 0 0 0 f a -)
(sh - 0 0 0 0 f p -)
(t - 0 0 0 0 s a -)
(th - 0 0 0 0 f d -)
(uh + s 1 3 + 0 0 0) ;; full
(uw + l 1 3 + 0 0 0) ;; fool
(v - 0 0 0 0 f b +)
(w - 0 0 0 0 r l +)
(y - 0 0 0 0 r p +)
(z - 0 0 0 0 f a +)
(zh - 0 0 0 0 f p +)
(pau - 0 0 0 0 0 0 -)
(h# - 0 0 0 0 0 0 -)
(brth - 0 0 0 0 0 0 -)
)
)
(PhoneSet.silences '(pau))

(define (mbarnig_en_marco::select_phoneset)
 "(mbarnig_en_marco::select_phoneset)
Set up phone set for mbarnig_en_marco."
 (Parameter.set 'PhoneSet 'mbarnig_en_marco)
 (PhoneSet.select 'mbarnig_en_marco)
)

(define (mbarnig_en_marco::reset_phoneset)
 "(mbarnig_en_marco::reset_phoneset)
Reset phone set for mbarnig_en_marco."
 t
)

(provide 'mbarnig_en_marco_phoneset)

Lexicon Creation

Without a lexicon, the result of a building command generates unknown words messages

uni 8

Festival : unknown words without lexicon

The lexicon for the new voice is defined in the script mbarnig_en_marco_lexicon.scm. Referring the the english Festival cmu lexicon, I modified the lexicon-script as follows :

;;; Lexicon, LTS and Postlexical rules for mbarnig_en_marco
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;
;;; CMU lexicon for US English
;;;

;;; Load any necessary files here
(require 'postlex)
(setup_cmu_lex)
(define (mbarnig_en_marco::select_lexicon)
 "(mbarnig_lx_marco::select_lexicon)
Set up the lexicon for mbarnig_lx_marco."
(lex.select "cmu")

;; Post lexical rules
(set! postlex_rules_hooks (list postlex_apos_s_check))
(set! postlex_vowel_reduce_cart_tree nil) ; no reduction
)


(define (mbarnig_lx_marco::reset_lexicon)
 "(mbarnig_lx_marco::reset_lexicon)
Reset lexicon information."
 t
)

(provide 'mbarnig_lx_marco_lexicon)

Building Prompts

The second command

festival -b festvox/build_clunits.scm '(build_prompts_waves 
"etc/uniphone.data")'

generates synthesized waveforms to act as prompts and timing cues. The nearest available voice (in this case kal_diphone) is used for synthesizing. The generated files are also used in aligning the spoken data. The -b option (–batch) avoids switching in the interactive Festival mode.

Festival

Festival : build_prompts_waves

The following files are created :

  • folder prompt-lab : files uniph_0001.lab, uniph_0002.lab and uniph_0003.lab
  • folder prompt-utt : files uniph_0001.utt, uniph_0002.utt and uniph_0003.utt
  • folder prompt-wav : files uniph_0001.wav, uniph_0002.wav and uniph_0003.wav

The uniph_xxxx.lab files have the following type of content :

#
0.1100 100 pau
0.2200 100 ax
0.3300 100 hh
0.4400 100 ow
0.5500 100 l
0.6600 100 jh
0.7700 100 oy
0.8800 100 w
0.9900 100 aa
1.1000 100 z
1.2100 100 r
1.3200 100 iy
1.4850 100 p
1.6500 100 ih
1.8150 100 ng
1.9250 100 pau

The uniph_xxxx.utt files have the following type of content :

EST_File utterance
DataType ascii
version 2
EST_Header_End
Features max_id 77 ; type Text ; 
iform "\"a whole joy was reaping.\"" ;
Stream_Items
1 id _1 ; name a ; whitespace "" ; prepunctuation "" ;
2 id _2 ; name whole ; whitespace " " ; prepunctuation "" ;
3 id _3 ; name joy ; whitespace " " ; prepunctuation "" ;
4 id _4 ; name was ; whitespace " " ; prepunctuation "" ;
5 id _5 ; name reaping ; punc . ; whitespace " " ; 
prepunctuation "" ;
6 id _10 ; name reaping ; pbreak B ; pos nil ;
7 id _11 ; name . ; pbreak B ; pos punc ;
8 id _9 ; name was ; pbreak NB ; pos nil ;
9 id _8 ; name joy ; pbreak NB ; pos nil ;
10 id _7 ; name whole ; pbreak NB ; pos nil ;
11 id _6 ; name a ; pbreak NB ; pos dt ;
12 id _12 ; name B ;
13 id _13 ; name syl ; stress 0 ;
14 id _15 ; name syl ; stress 1 ;
15 id _19 ; name syl ; stress 1 ;
16 id _22 ; name syl ; stress 1 ;
17 id _26 ; name syl ; stress 1 ;
18 id _29 ; name syl ; stress 0 ;
19 id _33 ; name pau ; dur_factor 1 ; end 0.11 ;source_end 0.101815 ;
20 id _14 ; name ax ; dur_factor 1 ; end 0.22 ;source_end 0.235802 ;
21 id _16 ; name hh ; dur_factor 1 ; end 0.33 ;source_end 0.322177 ;
22 id _17 ; name ow ; dur_factor 1 ; end 0.44 ;source_end 0.493926 ;
23 id _18 ; name l ; dur_factor 1 ; end 0.55 ;source_end 0.626926 ;
24 id _20 ; name jh ; dur_factor 1 ; end 0.66 ;source_end 0.732624 ;
25 id _21 ; name oy ; dur_factor 1 ; end 0.77 ;source_end 0.900228 ;
26 id _23 ; name w ; dur_factor 1 ; end 0.88 ;source_end 1.06616 ;
27 id _24 ; name aa ; dur_factor 1 ; end 0.99 ;source_end 1.20716 ;
28 id _25 ; name z ; dur_factor 1 ; end 1.1 ;source_end 1.33726 ;
29 id _27 ; name r ; dur_factor 1 ; end 1.21 ;source_end 1.46326 ;
30 id _28 ; name iy ; dur_factor 1 ; end 1.32 ;source_end 1.58507 ;
31 id _30 ; name p ; dur_factor 1.5 ; end 1.485 ;source_end 1.71307 ;
32 id _31 ; name ih ; dur_factor 1.5 ; end 1.65 ;source_end 1.83979 ;
33 id _32 ; name ng ; dur_factor 1.5 ; end 1.815 source_end 2.02441 ;
34 id _34 ; name pau ; dur_factor 1 ; end 1.925 ;source_end 2.36643 ;
35 id _35 ; name Accented ;
36 id _36 ; name Accented ;
37 id _37 ; name Accented ;
38 id _38 ; name Accented ;
39 id _56 ; f0 110 ; pos 1.815 ;
40 id _54 ; f0 126.571 ; pos 1.31 ;
41 id _55 ; f0 116.286 ; pos 1.32 ;
42 id _53 ; f0 128.571 ; pos 1.11 ;
43 id _50 ; f0 130 ; pos 1.09 ;
44 id _51 ; f0 118.571 ; pos 1.1 ;
45 id _52 ; f0 118.571 ; pos 1.101 ;
46 id _49 ; f0 132 ; pos 0.78 ;
47 id _46 ; f0 132.286 ; pos 0.76 ;
48 id _47 ; f0 122 ; pos 0.77 ;
49 id _48 ; f0 122 ; pos 0.771 ;
50 id _45 ; f0 134.286 ; pos 0.56 ;
51 id _42 ; f0 135.714 ; pos 0.54 ;
52 id _43 ; f0 124.286 ; pos 0.55 ;
53 id _44 ; f0 124.286 ; pos 0.551 ;
54 id _41 ; f0 137.714 ; pos 0.23 ;
55 id _40 ; f0 127.714 ; pos 0.22 ;
56 id _39 ; f0 130 ; pos 0.11 ;
57 id _57 ; name pau-ax ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 9 ; end 0.172053 ; num_frames 17 ;
58 id _58 ; name ax-hh ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 5 ; end 0.288115 ; num_frames 11 ;
59 id _59 ; name hh-ow ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 2 ; end 0.406552 ; num_frames 11 ;
60 id _60 ; name ow-l ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 7 ; end 0.559239 ; num_frames 14 ;
61 id _61 ; name l-jh ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 5 ; end 0.673416 ; num_frames 10 ;
62 id _62 ; name jh-oy ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 4 ; end 0.82104 ; num_frames 13 ;
63 id _63 ; name oy-w ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 6 ; end 1.01148 ; num_frames 17 ;
64 id _64 ; name w-aa ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 4 ; end 1.1311 ; num_frames 11 ;
65 id _65 ; name aa-z ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 6 ; end 1.25207 ; num_frames 11 ;
66 id _66 ; name z-r ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 7 ; end 1.39807 ; num_frames 14 ;
67 id _67 ; name r-iy ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 5 ; end 1.52951 ; num_frames 12 ;
68 id _68 ; name iy-p ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 4 ; end 1.64963 ; num_frames 11 ;
69 id _69 ; name p-ih ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 5 ; end 1.78517 ; num_frames 13 ;
70 id _70 ; name ih-ng ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 4 ; end 1.91617 ; num_frames 12 ;
71 id _71 ; name ng-pau ; sig "[Val wave]" ; coefs "[Val track]" ; 
middle_frame 9 ; end 2.19542 ; num_frames 27 ;
72 id _72 ; name coef ; coefs "[Val track]" ; 
frame "[Val wavevector]" ;
73 id _73 ; name f0 ; f0 "[Val track]" ;
74 id _74 ; coefs "[Val track]" ; residual "[Val wave]" ;
75 id _75 ;
76 id _76 ; map "[Val ivector]" ;
77 id _77 ; wave "[Val wave]" ;
End_of_Stream_Items
Relations
Relation Token ; ()
6 11 1 0 0 0
1 1 0 6 2 0
7 10 2 0 0 0
2 2 0 7 3 1
8 9 3 0 0 0
3 3 0 8 4 2
9 8 4 0 0 0
4 4 0 9 5 3
10 6 5 0 11 0
11 7 0 0 0 10
5 5 0 10 0 4
End_of_Relation
Relation Word ; ()
1 11 0 0 2 0
2 10 0 0 3 1
3 9 0 0 4 2
4 8 0 0 5 3
5 6 0 0 0 4
End_of_Relation
Relation Phrase ; ()
2 11 1 0 3 0
3 10 0 0 4 2
4 9 0 0 5 3
5 8 0 0 6 4
6 6 0 0 0 5
1 12 0 2 0 0
End_of_Relation
Relation Syllable ; ()
1 13 0 0 2 0
2 14 0 0 3 1
3 15 0 0 4 2
4 16 0 0 5 3
5 17 0 0 6 4
6 18 0 0 0 5
End_of_Relation
Relation Segment ; ()
1 19 0 0 2 0
2 20 0 0 3 1
3 21 0 0 4 2
4 22 0 0 5 3
5 23 0 0 6 4
6 24 0 0 7 5
7 25 0 0 8 6
8 26 0 0 9 7
9 27 0 0 10 8
10 28 0 0 11 9
11 29 0 0 12 10
12 30 0 0 13 11
13 31 0 0 14 12
14 32 0 0 15 13
15 33 0 0 16 14
16 34 0 0 0 15
End_of_Relation
Relation SylStructure ; ()
8 20 7 0 0 0
7 13 1 8 0 0
1 11 0 7 2 0
10 21 9 0 11 0
11 22 0 0 12 10
12 23 0 0 0 11
9 14 2 10 0 0
2 10 0 9 3 1
14 24 13 0 15 0
15 25 0 0 0 14
13 15 3 14 0 0
3 9 0 13 4 2
17 26 16 0 18 0
18 27 0 0 19 17
19 28 0 0 0 18
16 16 4 17 0 0
4 8 0 16 5 3
22 29 20 0 23 0
23 30 0 0 0 22
20 17 5 22 21 0
24 31 21 0 25 0
25 32 0 0 26 24
26 33 0 0 0 25
21 18 0 24 0 20
5 6 0 20 6 4
6 7 0 0 0 5
End_of_Relation
Relation IntEvent ; ()
1 35 0 0 2 0
2 36 0 0 3 1
3 37 0 0 4 2
4 38 0 0 0 3
End_of_Relation
Relation Intonation ; ()
5 35 1 0 0 0
1 14 0 5 2 0
6 36 2 0 0 0
2 15 0 6 3 1
7 37 3 0 0 0
3 16 0 7 4 2
8 38 4 0 0 0
4 17 0 8 0 3
End_of_Relation
Relation Target ; ()
12 56 1 0 0 0
1 19 0 12 2 0
13 55 2 0 0 0
2 20 0 13 3 1
14 54 3 0 0 0
3 21 0 14 4 2
15 51 4 0 16 0
16 52 0 0 17 15
17 53 0 0 0 16
4 23 0 15 5 3
18 50 5 0 0 0
5 24 0 18 6 4
19 47 6 0 20 0
20 48 0 0 21 19
21 49 0 0 0 20
6 25 0 19 7 5
22 46 7 0 0 0
7 26 0 22 8 6
23 43 8 0 24 0
24 44 0 0 25 23
25 45 0 0 0 24
8 28 0 23 9 7
26 42 9 0 0 0
9 29 0 26 10 8
27 40 10 0 28 0
28 41 0 0 0 27
10 30 0 27 11 9
29 39 11 0 0 0
11 33 0 29 0 10
End_of_Relation
Relation Unit ; grouped 1 ;
1 57 0 0 2 0
2 58 0 0 3 1
3 59 0 0 4 2
4 60 0 0 5 3
5 61 0 0 6 4
6 62 0 0 7 5
7 63 0 0 8 6
8 64 0 0 9 7
9 65 0 0 10 8
10 66 0 0 11 9
11 67 0 0 12 10
12 68 0 0 13 11
13 69 0 0 14 12
14 70 0 0 15 13
15 71 0 0 0 14
End_of_Relation
Relation SourceCoef ; ()
1 72 0 0 0 0
End_of_Relation
Relation f0 ; ()
1 73 0 0 0 0
End_of_Relation
Relation TargetCoef ; ()
1 74 0 0 2 0
2 75 0 0 0 1
End_of_Relation
Relation US_map ; ()
1 76 0 0 0 0
End_of_Relation
Relation Wave ; ()
1 77 0 0 0 0
End_of_Relation
End_of_Relations
End_of_Utterance

Recording Prompts

Before launching the next command

./bin/prompt_them etc/uniphone.data

to start the automatic recording of the prompts, I uncommented the line USE_SOX=1 in the prompt_them script to use the SOX package on the Mac instead of the na_play / na_record programs.

Festival

Festival : recording prompts

Festival plays the synthesized prompt before each record and calculates the recording duration, based on the synthesis. The recorded audio files are saved into the wav folder.

As the recording in the required format 16.000 Hz, mono 16 bits was not possible, I did a manual recording with the Audacity app and replaced the audio files in the wav folder.

Audacity

Audacity app to record prompts

Labeling

The labeling of the spoken prompts is done by matching the synthesized prompts with the spoken ones.

./bin/make_labs prompt_wav/*.wav
Festival make_labs

Festival : make_labs

The following files are created :

  • folder cep : files uniph_0001.cep, uniph_0002.cep and uniph_0003.cep
  • folder lab : files uniph_0001.lab, uniph_0002.lab and uniph_0003.lab
  • folder prompt-cep : files uniph_0001.cep, uniph_0002.cep and uniph_0003.cep

The uniph_xxxx.cep files have the following type of content :

EST_File Track
DataType binary
ByteOrder 01
NumFrames 425
NumChannels 24
EqualSpace 0
BreaksPresent true
CommentChar ;

Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
...
Channel_20 melcep_d_9
Channel_21 melcep_d_10
Channel_22 melcep_d_11
Channel_23 melcep_d_N
EST_Header_End
..........

The uniph_xxxx.cep files have the following content :

separator ;
nfields 1
#
0.01500 26 pau
0.12500 26 ax
0.23500 26 hh
0.37000 26 ow
0.53000 26 l
0.65000 26 jh
0.87000 26 oy
1.08500 26 w
1.19500 26 aa
1.39500 26 z
1.45000 26 r
1.55500 26 iy
1.74500 26 p
1.91000 26 ih
2.07500 26 ng
2.12500 26 pau

The correct labeling can be checked with the WaveSurfer app.

WafeSurfer . checking labels

WafeSurfer . checking labels

The uniph_xxxx.cep files have the following type of content :

EST_File Track
DataType binary
ByteOrder 01
NumFrames 385
NumChannels 24
EqualSpace 0
BreaksPresent true
CommentChar ;

Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
Channel_5 melcep_6
...
Channel_19 melcep_d_8
Channel_20 melcep_d_9
Channel_21 melcep_d_10
Channel_22 melcep_d_11
Channel_23 melcep_d_N
EST_Header_End
....

Creating Utterances

After labeling the utterance structure is created with the command

festival -b festvox/build_clunits.scm '(build_utts
"etc/uniphone.data")'
Festival build_utts

Festival : build_utts

The 3 files uniph_0001.utt, uniph_0002.utt and uniph_0003.utt are saved in the folder festival/utts. They have the following type of content :

EST_File utterance
DataType ascii
version 2
EST_Header_End
Features max_id 94 ; type Text ; 
iform "\"a whole joy was reaping.\"" ; 
filename prompt-utt/uniph_0001.utt ; 
fileid uniph_0001 ;
Stream_Items
1 id _1 ; name a ; whitespace "" ; prepunctuation "" ;
2 id _2 ; name whole ; whitespace " " ; prepunctuation "" ;
3 id _3 ; name joy ; whitespace " " ; prepunctuation "" ;
...
1 76 0 0 0 0
End_of_Relation
Relation Phrase ; ()
2 11 1 0 3 0
3 10 0 0 4 2
4 9 0 0 5 3
5 8 0 0 6 4
6 6 0 0 0 5
1 77 0 2 0 0
End_of_Relation
End_of_Relations
End_of_Utterance

txt.done.data

The next scripts are looking for the txt.done.data file instead of the uniphone.data file. Copying the uniphone.data file and renaming it to txt.done.data solves this problem.

Extracting pitchmarks

The simplest way to extract the pitchmarks from the records is to use the command

./bin/make_pm_wave wav/*.wav

without tuning any parameters.

Festival make_pm

Festival : make_pm

The 3 files uniph_0001.pm, uniph_0002.pm and uniph_0003.pm are saved in the folder pm. They have the following type of content :

EST_File Track
DataType ascii
NumFrames 271
NumChannels 0
NumAuxChannels 0
EqualSpace 0
BreaksPresent true
EST_Header_End
0.016750 1
0.023312 1
0.030125 1
0.037250 1
0.044625 1
.....
2.070750 1
2.081512 1
2.092275 1
2.103038 1
2.113800 1
2.124563 1

Find Mel Frequency Cepstral Coefficients

In the next stage the Mel Frequency Cepstral Coefficients are defined synchronously with the pitch periods

./bin/make_mcep wav/*.wav
Festival make_mcep

Festival : make_mcep

The 3 files uniph_0001.mcep, uniph_0002.mcep and uniph_0003.mcep are saved in the folder mcep. They have the following type of content :

EST_File Track
DataType binary
ByteOrder 01
NumFrames 271
NumChannels 12
EqualSpace 0
BreaksPresent true
CommentChar ;

Channel_0 melcep_1
Channel_1 melcep_2
Channel_2 melcep_3
Channel_3 melcep_4
Channel_4 melcep_5
Channel_5 melcep_6
Channel_6 melcep_7
Channel_7 melcep_8
Channel_8 melcep_9
Channel_9 melcep_10
Channel_10 melcep_11
Channel_11 melcep_N
EST_Header_End
........

Building Synthesizer

Building the cluster unit selection synthesizer is the main part of the voice creation. It’s done with the command

festival -b festvox/build_clunits.scm '(build_clunits 
"etc/uniphone.data")'
Festival : build synthesizer (click to enlarge)

Festival : build synthesizer (click to enlarge)

The following files are created :

  • folder festival/clunits : file mbarnig_en_marco.catalogue
  • folder festival/feats : 41 files  [phoneme-name].feats
  • folder festival/trees : 41 files [phoneme-name].tree
  • folder festival/trees : file mbarnig_en_marco.tree

The file mbarnig_en_marco.catalogue has the following type of content :

EST_File index
DataType ascii
NumEntries 46
IndexName mbarnig_en_marco
EST_Header_End
pau_5 uniph_0001 0.000000 0.007500 0.015000
ax_0 uniph_0001 0.015000 0.070000 0.125000
hh_0 uniph_0001 0.125000 0.180000 0.235000
ow_0 uniph_0001 0.235000 0.302500 0.370000
l_0 uniph_0001 0.370000 0.450000 0.530000
jh_0 uniph_0001 0.530000 0.590000 0.650000
oy_0 uniph_0001 0.650000 0.760000 0.870000
.....
er_0 uniph_0003 1.150000 1.205000 1.260000
m_0 uniph_0003 1.260000 1.320000 1.380000
ay_0 uniph_0003 1.380000 1.480000 1.580000
k_0 uniph_0003 1.580000 1.615000 1.650000
pau_0 uniph_0003 1.650000 1.650000 1.650000

A xx.feats file has the following type of content :

0 w - r 0 0 0 0 l + z - f 0 0 0 0 a + 0.11000001 128.271 130.7258 
126.721 1 coda coda onset 1 1 0 0 1 1 single oy + 0 2 d 3 + 0 0 0 0 
content content content

A xx.tree file has the following type of content :

((((0 0)) 0))
;; Right cluster 0 (0%) mean ranking 2 mean distance 0

The mbarnig_en_marco.tree file has the following type of content :

;; Autogenerated list of selection trees
;; db_dir "./"
;; db_dir "."
;; name mbarnig_en_marco
;; index_name mbarnig_en_marco
;; f0_join_weight 0
;; join_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5)
;; trees_dir "festival/trees/"
;; catalogue_dir "festival/clunits/"
;; coeffs_dir "mcep/"
;; coeffs_ext ".mcep"
;; clunit_name_feat lisp_mbarnig_en_marco::clunit_name
;; join_method windowed
;; continuity_weight 5
;; optimal_coupling 1
;; extend_selections 2
;; pm_coeffs_dir "mcep/"
;; pm_coeffs_ext ".mcep"
;; sig_dir "wav/"
;; sig_ext ".wav"
;; disttabs_dir "festival/disttabs/"
;; utts_dir "festival/utts/"
;; utts_ext ".utt"
;; dur_pen_weight 0
;; f0_pen_weight 0
;; get_stds_per_unit t
;; ac_left_context 0.8
;; ac_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5)
;; feats_dir "festival/feats/"
;; feats (occurid p.name p.ph_vc p.ph_ctype p.ph_vheight p.ph_vlng 
p.ph_vfront p.ph_vrnd p.ph_cplace p.ph_cvox n.name n.ph_vc 
n.ph_ctype n.ph_vheight n.ph_vlng n.ph_vfront n.ph_vrnd n.ph_cplace 
n.ph_cvox segment_duration seg_pitch p.seg_pitch n.seg_pitch 
R:SylStructure.parent.stress seg_onsetcoda n.seg_onsetcoda 
p.seg_onsetcoda R:SylStructure.parent.accented pos_in_syl 
syl_initial syl_final R:SylStructure.parent.lisp_cg_break 
R:SylStructure.parent.R:Syllable.p.lisp_cg_break 
R:SylStructure.parent.position_type pp.name pp.ph_vc pp.ph_ctype 
pp.ph_vheight pp.ph_vlng pp.ph_vfront pp.ph_vrnd pp.ph_cplace 
pp.ph_cvox n.lisp_is_pau p.lisp_is_pau 
R:SylStructure.parent.parent.gpos 
R:SylStructure.parent.parent.R:Word.p.gpos 
R:SylStructure.parent.parent.R:Word.n.gpos)
;; wagon_field_desc "festival/clunits/all.desc"
;; wagon_progname "$ESTDIR/bin/wagon"
;; wagon_cluster_size 20
;; prune_reduce 0
;; cluster_prune_limit 40
;; files (uniph_0001 uniph_0002 uniph_0003)
(set! clunits_selection_trees '(
("k" ((((0 0)) 0)))
("ay" ((((0 0)) 0)))
("m" ((((0 0)) 0)))
("er" ((((0 0)) 0)))
("zh" ((((0 0)) 0)))
("ae" ((((0 0)) 0)))
("ch" ((((0 0)) 0)))
("eh" ((((0 0)) 0)))
....
("jh" ((((0 0)) 0)))
("l" ((((0 0)) 0)))
("ow" ((((0 0)) 0)))
("hh" ((((0 0)) 0)))
("ax" ((((0 0)) 0)))
("pau"
((((0 67.455) (1 100) (2 66.73) (3 100) (4 67.43) (5 100)) 83.6025)))
))

Install new voice

In the last step I created the folder

Festival-TTS/festival/lib/voices/english/mbarnig_en_marco_clunits/

and copied the following folders from the voice directory

Festival_TTS/festvox/mbarnig_en_marco/

into this folder :

  • festival
  • festvox
  • mcep
  • wav

We can now check if the voice is recognized

festival> (voice.list)
Festival : voice.list

Festival : voice.list

load the new voice

festival> (voice_mbarnig_en_marco_clunits)

and test the synthesis

festival> (SayText "Hello Marco, how are you?")
Festival : SayText

Festival : SayText

It works as expected.

The following folders in the voice folder Festival-TTS/festvox/mbarnig_en_marco remained empty :

  • emu, including 2 empty subfolders lab_hlb and pm_hlb
  • f0
  • group
  • lar
  • lpc
  • phr
  • pm_lab
  • recording
  • scratch, including 2 empty subfolders lab and wav
  • syl
  • versions
  • wrd

Festival Text-to-Speech Package

Last update : April 22, 2015

Festival

The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Alan W. Black is now professor in the Language Technology Institute at Carnegie Mellon University where substantial contributions have been provided to Festival. The program is written in C++.

To set-up a complete Festival Environment on OS X (Yosemite 10.10.2), four packages are required :

  1. Festival-2.4 (file festival-2.4-release.tar)
  2. Edinburgh Speech-Tools (file speech_tools-2.4-release.tar)
  3. Festvox (file festvox-2.7.0-release.tar.gz)
  4. Languages (example file : english festvox_kallpc16k.tar.gz)

To compile and install the packages, I got some guidance from a Linguistic Mystic (alias Will Styler). After unzipping, the files have been moved into a common folder Festival-TTS on the desktop with the following names :

  • festival
  • speech-tools
  • festvox

The language files are installed in the festival folder in the sub-folders lib/voices/english.

The packages have been compiled in the following sequence :

mbarnig$ cd Desktop/Festival-TTS/speech_tools
mbarnig$ ./configure
mbarnig$ make
mbarnig$ make test
mbarnig$ make install
mbarnig$ cd Desktop/Festival-TTS/festival
mbarnig$ ./configure
mbarnig$ make
mbarnig$ make install
mbarnig$ cd Desktop/Festival-TTS/festvox
mbarnig$ ./configure
mbarnig$ make

At the end the voice folder with the language files was moved to the festival/lib directory.

After updating Xcode to version 6.1.1 and installing the audiotools for Xcode 6.1, I checked that afplay is working :

afplay check

afplay check

I checked also that the festival/lib/siteinit.scm file contains the  following statements :

  • (Parameter.set ‘Audio_Required_Format ‘riff)
  • (Parameter.set ‘Audio_Method ‘Audio_Command)
  • (Parameter.set ‘Audio_Command “afplay $FILE”)

The following files have been downloaded from the festvox website, unzipped and moved to the festival/lib/dicts folder :

  • festlex_CMU.tar.gz
  • festlex_OALD.tar.gz
  • festlex_POSLEX.tar.gz

I added finally the following statements to the .bash_profile file located in the homefolder (/Users/mbarnig) :

  • export FESTIVALDIR=”/Users/mbarnig/Desktop/Festival-TTS/festival”
  • export PATH=”$FESTIVALDIR/bin:$PATH”
  • export ESTDIR=”/Users/mbarnig/Desktop/Festival-TTS/speech_tools”
  • export PATH=”$ESTDIR/bin:$PATH”
  • export FESTVOXDIR=”/Users/mbarnig/Desktop/Festival-TTS/festvox”

The festival tool can now be started in the terminal window with the command

mbarnig$ $FESTIVALDIR/bin/festival
Festival

Festival version 2.4

All seems to be working great!

Festival embeds a basic small Scheme (Lisp) interpreter (SIOD : Scheme In One Defun 3.0) written by George Carrett.

Festival works in two fundamental modes, command mode and text-to-speech (tts) mode. If Festival is started without arguments (or with the option  –command), it enters the default command mode (prompt = festival>). Information included in paranthesis is treated as commands and is interpreted by the Scheme interpreter. The following commands are accepted:

festival> 
> (intro)   :  short spoken introduction
> (voice.list)   : list of available voices
> (set! utt1 (Utterance Text "Hello world"))   : 
           create an utterance and save it in a variable
> (utt.synth utt1)    : synthesize utterance to get a waveform
> (utt.play utt1) : send the synthesized waveform to the audio device
> (SayText "Good morning, welcome to Festival")   : 
           speak text (combination of the 3 preceding commands)
> (tts "myfile" nil)    : speak file instead of text
> (manual nil)  : show the content of the manual
> (manual "Accessing an utterance")  : show the section "utterance"
> (PhoneSet.list)   : show the currently defined phonesets
> (tts "doremi.xml" 'singing)  : an XML based mode for specifying 
           songs, both notes and duration
> (quit)   : exit

If Festival is started with the –tts option, it enters tts-mode. Information (in files or through standard input) is treated as text to be rendered as speech.

Other options available at the start of Festival are :

--language LANG   : set the default language to LANG.
--server   : enter server mode where Festival waits for clients on a 
    known port (default port : 1314); connected clients may send 
    commands (or text) to the server and expect waveforms back.
--script scriptfile  : run scriptfile as a Festival script file.
--heap NUMBER   : to increase the scheme heap.
--batch  : after processing file arguments do not become interactive.
--interactive  : after processing file arguments become interactive.

Script mode :

festival mbarnig$  examples/saytime
festival mbarnig$  text2wave myfile.txt -o myfile.wav

An updated Festival System Documentation with 34 chapters, edited in December 2014, is available at the festvox website.

The following Festival voices are available :

  • festvox_cmu_us_ahw_cg
  • festvox_cmu_us_aup_cg
  • festvox_cmu_us_awb_cg
  • festvox_cmu_us_axb_cg
  • festvox_cmu_us_bdl_cg
  • festvox_cmu_us_clb_cg
  • festvox_cmu_us_fem_cg
  • festvox_cmu_us_gka_cg
  • festvox_cmu_us_jmk_cg
  • festvox_cmu_us_ksp_cg
  • festvox_cmu_us_rms_cg
  • festvox_cmu_us_rxr_cg
  • festvox_cmu_us_slt_cg
  • festvox_kallpc16k
  • festvox_rablpc16k
  • Leopold : AustrianGerman
  • IMS German Festival
  • OGIgerman by CSLU
  • Swedish by SOL
  • Hindi

Hindi and German are examples of Festival languages/voices with different phone-features in the phone-set as in the standard us and english phone-sets.

Edinburgh Speech Tools

The Edinburgh Speech Tools Library is a collection of C++ class, functions and related programs for manipulating objects used in speech processing. It includes support for reading and writing waveforms, parameter files (LPC, Ceptra, F0) in various formats and converting between them. It also includes support for linguistic type objects and support for various label files and ngrams (with smoothing). In addition to the library a number of programs are included. An intonation library which includes a pitch tracker, smoother and labelling system (using the Tilt Labelling system), a classification and regression tree (CART) building program called wagon. Also there is growing support for various speech recognition classes such as decoders and HMMs.

An introduction to the Edinburgh Speech Tools is provided by Festvox.

Festvox

The Festvox project aims to make the building of new synthetic voices for Festival more systemic and better documented, by offering the following resources :

Festival Variables

Festival provides a list of variables available for general use. This list is automatically generated from the documentation strings of the variables defined in the source code. A variable can be displayed with the print command at the festival prompt. Some examples are shown hereafter :

festival>
> (print festival_version) ; current version of the system
> (print *ostype*) ; operation system that Festival is running on
> (print lexdir) ; default directory of the lexicons
> (print SynthTypes) ; list of synthesis types and functions
> (print token.letter_pos) ; POS tag for individual letters
> (print token.punctuation) ; characters treated as punctuation
> (print voice-path) ; list of folders to look for voices
> (print voice_default) ; function to load the default voice
Festival Variables

Festival Variables

Festival Functions

Festival provides a list of functions available for general use. This list is automatically generated from the documentation strings of the functions defined in the source code. A function is called at the Festival prompt. Some examples are shown hereafter :

festival>
> (pwd) ; return current directory
> (lex.list) ; list names of all currently defined lexicons
> (voice.list) ; list all potential voices in the system
> (lex.lookup WORD FEATURES) ; lookup word in current lexicon
> (lex.compile ENTRYFILE COMPILEFILE) ; compile lexical entries
> (PhoneSet.list) ; list all currently defined PhoneSets
> (quit) ; exit from Festival
Festival Functions

Festival Functions

Utterance Access Methods

Festival provides a number of standard functions that allow to access parts of an utterance, to traverse through it and to extract features.

Three utterances access methods are of particular interest :

  1. (utt.feat UTT FEATNAME)
    returns the value of feature FEATNAME in UTT
  2. (item.feat ITEM FEATNAME)
    returns the value of feature FEATNAME in ITEM
  3. (utt.features UTT RELATIONNAME FUNCLIST)
    returns vectors of feature values for each item, listed in FUNCLIST and related
    in RELATIONNAME in UTT

FEATNAME may be a

  • feature name ; example : (item.feat sylb ‘stress)
  • feature function name ; example : (item.feat sylb ‘pos_in_word)
  • pathname ; examples : (item.feat sylb ‘nn.stress)
    (item.feat sylb ‘R:SylStructure.parent.word)

Notes :
sylb is a syllable item
R: is a relation operator

RELATIONNAME may be ‘Token, ‘Word, ‘Phrase, ‘Segment, ‘Syllable, etc

FUNCLIST is a list of items ; example : ‘(name pos)

Some examples are shown hereafter :

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! tok (utt.relation.first utter 'Token))
(utt.feat utter 'type)
(item.feat tok 'nn.name)
(item.feat tok 'R:Token.daughter1.name)
(utt.features utter 'Word '(name pos p.pos n.pos))
feats

Utterance access methods

More informations about feature functions as FEATNAME are provided in the next chapter.

Festival Feature Functions

Festival provides a list of basic feature functions available as FEATNAME in utterances. Most are only available for specific items. Some examples are shown hereafter, related to the corresponding items :

Token item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! tok (utt.relation.first utter 'Token))
(item.name tok) ; first token
(item.feat tok 'name) ; first token
(item.feat tok 'n.name) ; second token
(item.feat tok 'nn.name) ; third token
(item.feat tok 'whitespace)
(item.feat tok 'prepunctuation)
Utterance Token

Utterance Token

Word item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! wrd (item.next (utt.relation.first utter 'Word)))
(item.name wrd) ; second word
(item.feat wrd 'p.name) ; first word
(item.feat wrd 'cap)
(item.feat wrd 'word_duration)
Utterance Word

Utterance Word

Segment item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! seg (item.prev (item.prev (utt.relation.last utter 'Segment))))
(item.name seg) ; third last segment
(item.feat seg 'n.name) ; second last segment
(item.feat seg 'seg_pitch)
(item.feat seg 'segment_end)
(item.feat seg 'R:SylStructure.parent.parent.name)
Utterance Segment

Utterance Segment

Syllable item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! sylb (utt.relation.first utter 'Syllable))
(item.features sylb) ; first syllable
(item.feat sylb 'asyl_out)
(item.feat sylb 'syl_midpitch)
(utt.features utter 'Syllable '(stress))
(item.feat sylb 'nn.stress) ; stress of third syllable
(item.feat sylb 'R:SylStructure.parent.name)
(item.feat sylb 'R:SylStructure.daughter1.name)
(item.feat sylb 'R:SylStructure.daughter2.name)
Utterance Syllable

Utterance Syllable

SylStructure item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! sylst (item.prev (utt.relation.last utter 'SylStructure)))
(item.features sylst)
(item.feat sylst 'pos_index)
(item.feat sylst 'phrase_score)
Utterance SylStructure

Utterance SylStructure

Intonation item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! inton (utt.relation.first utter 'Intonation))
(item.features inton)
(item.feat inton 'id)
Utterance Intonation

Utterance Intonation

Dumping features

Extracting basic features from a set of utterances is useful for most of the training techniques for TTS voice building. Festival provides a script dumpfeats in the festival/examples folder which does this task. The results can be saved in a single feature file or in separate files for each utterance. An example is shown below, the dumpfeats script was copied in the festival folder of my test voice mbarnig_lb_voxcg :

mbarnig$ ./dumpfeats -feats "(name p.name n.name)"
-relation Segment -output myfeats.txt utts/*.utt
Festival dumpfeats

Festival dumpfeats

Links

A list of links to websites with additional informations about the Festival package is shown hereafter :

Mary TTS (Text To Speech)

Last update : January 5, 2017

MaryTTS is an open-source, multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of DFKI’s Language Technology Lab and the Institute of Phonetics at Saarland University. It is now maintained by the Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz GmbH).

Mary stands for Modular Architecture for Research in sYynthesis. The earliest version of MaryTTS was developed around 2000 by Marc Schröder. The current stable version is 5.2, released on September 15, 2016.

I installed Mary TTS on my Windows, Linux and Mac computers. On the Mac (OSX 10.10 Yosemite), version 5.1.2 of Mary TTS was placed on the desktop in the folders marytts-5.1.2 and marytts-builder-5.1.2. The Mary TTS Server is started first by opening a terminal window in the folder marytts-5.1 with the following command :

marytss-5.1.2 mbarnig$ bin/marytts-server.sh

To start the Mary TTS client with the related GUI, a second terminal window is opened in the same folder  with the command :

marytss-5.1.2 mbarnig$ bin/marytts-client.sh

On Windows , the related scripts are marytts-server.bat and marytts-client.bat.

As the development version 5.2 of Mary TTS supports more languages and comes with toolkits for quickly adding support for new languages and for building unit selection and HMM-based synthesis voices, I downloaded a snapshot-zip-file from Github with the most recent source code. After unzipping, the source code was placed in the folder marytts-master on the desktop.

To compile Mary TTS from source on the Mac, the latest JAVA development version (jdk-8u31-macosx-x64.dmg) and Apache Maven (apache-maven-3.2.5-bin.tar.gz), a software project management and comprehension tool, are required.

On Mac, Java is installed in

/Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home/

and Maven is installed in

/usr/local/apache-maven/apache-maven-3.2.5

It is important to set the environment variables $JAVA_HOME, $M2_HOME and the $PATH to the correct values (export in /Users/mbarnig/.bash-profile).

The succesful installation of Java and Maven can be verified with the commands :

mbarnig$ java -version
mbarnig$ mvn --version
marytts-maven-java

Mary TTS : Maven and Java versions

This looks good!

In the next step I compiled the Mary TTS source code by running the command

marytts-master mbarnig$ mvn install

in the top-level marytts-master folder. This build the system, run unit and integration tests, packaged the code and installed it in the following folders :

marytts-master/target/marytts-5.2.SNAPSHOT
marytss-master/target/marytss-builder-5.2-SNAPSHOT

The build took 2:55 minutes and was succesful, without errors or warnings.

mary

Results of building MARYTTS 5.2 SNAPSHOT

The following modules have been compiled :

  • MaryTTS
  • marytts-common
  • marytts-signalproc
  • marytts-runtime
  • marytts-lang-de, en, te, tr, ru, it, fr, sv, lx (lx is a pseudo locale for a test language)
  • marytts-languages
  • marytts-client
  • marytts-builder
  • marytts-redstart
  • marytts-transcription
  • marytts-assembly with the sub-modules assembly-builder and assembly-runtime
  • voice_cmu_slt_hsmm

After checking the whole file structure, I started the Mary TTS 5.2 server

marytts-snapshot-server

Mary TTS snapshot 5.2 Server

and the Mary TTS 5.2 client

marytts-snapshot-client

Mary TTS Snapshot 5.2 client

did some trials with text to audio conversion in the GUI window

marytts-gui-client

Mary TTS Client GUI

launched the Mary TTS 5.2 component installer

Mary TTS Component Installer

Mary TTS Component Installer

and finally installed some french, german and english available voices.

marytts-installer

Mary TTS Voice Installer GUI

In the next step I will try to create my own voices and develop a voice for the luxembourgish language.

In January 2017, I updated my systems with the stable MaryTTS version 5.2 which supports the luxembourgish language.

eSpeak Formant Synthesizer

Last update : November 2, 2014

eSpeak

eSpeak is a compact multi-platform multi-language open source speech synthesizer using a formant synthesis method.

eSpeak is derived from the “Speak” speech synthesizer for British English for Acorn Risc OS computers, developed by Jonathan Duddington in 1995. He is still the author of the current eSpeak version 1.48.12 released on November 1, 2014. The sources are available on Sourceforge.

eSpeak provides two methods of formant synthesis : the original eSpeak synthesizer and a Klatt synthesizer. It can also be used as a front end for MBROLA diphone voices. eSpeak can be used as a command-line program or as a shared library. On Windows, a SAPI5 version is also installed. eSpeak supports SSML (Speech Synthesis Marking Language) and uses an ASCII representation of phoneme names which is loosely based on the Kirshenbaum system.

In formant synthesis, voiced speech (vowels and sonorant consonants) is created by using formants. Unvoiced consonants are created by using pre-recorded sounds. Voiced consonants are created as a mixture of a formant-based voiced sound in combination with a pre-recorded unvoiced sound. The eSpeakEditor allows to generate formant files for individual vowels and voiced consonants, based on a sequence of keyframes which define how the formant peaks (peaks in the frequency spectrum) vary during the sound. A sequence of formant frames can be created with a modified version of Praat, a free scientific computer software package for the analysis of speech in phonetics. The Praat formant frames, saved in a spectrum.dat file, can be converted to formant keyframes with eSpeakEdit.

To use eSpeak on the command line, type

espeak "Hello world"

There are plenty of command line options available, for instance to load from file, to adjust the volume, the pitch, the speed or the gaps between words, to select a voice or a language, etc.

To use the MBROLA voices in the Windows SAPI5 GUI or at the command line, they have to be installed during the setup of the program. It’s possible to rerun the setup to add additional voices. To list the available voices type

espeak --voices

eSpeak uses a master phoneme file containing the utility phonemes, the consonants and a schwa. The file is named phonemes (without extension) and located in the espeak/phsource program folder. The vowels are defined in the language specific phoneme files in text format. These files can also redefine consonants if you wish. The language specific phoneme text-files are located in the same espeak/phsource folder and must be referenced in the phonemes master file (see example for luxembourgish).

....
phonemetable lb base
include ph_luxembourgish

In addition to the specific phoneme file ph_luxembourgish (without extension), the following files are requested to add a new language, e.g. luxembourgish :

lb file (without extension) in the folder espeak/espeak-data/voices : a text file which in its simplest form contains only 2 lines :

name luxembourgish
language lb

lb_rules file (without extension) in the folder espeak/dictsource : a text file which contains the spelling-to-phoneme translation rules.

lb_list file (without extension) in the folder espeak/dictsource : a text file which contains pronunciations for special words (numbers, symbols, names, …).

The eSpeakEditor (espeakedit.exe) allows to compile the lb_ files into an lb_dict file (without extension) in the folder espeak/espeak-data and to add the new phonemes into the files phontab, phonindex and phondata in the same folder. These compiled files are used by eSpeak for the speech synthesis. The file phondata-manifest lists the type of data that has been compiled into the phondata file. The files dict_log and dict_phonemes provide informations about the phonemes used in the lb_rules and lb_dict files.

eSpeak applies tunes to model intonations depending on punctuation (questions, statements, attitudes, interaction). The tunes (s.. = full-stop, c.. = comma, q.. = question, e.. = exclamation) used for a language can be specified by using a tunes statement in the voice file.

tunes s1  c1  q1a  e1

The named tunes are defined in the text file espeak/phsource/intonation (without extension) and must be compiled for use by eSpeak with the espeakedit.exe program (menu : Compile intonation data).

meSpeak.js

Three years ago, Matthew Temple ported the eSpeak program from C++ to JavaScript using Emscripten : speak.js. Based on this Javascript project, Norbert Landsteiner from Austria created the meSpeak.js text-to-speech web library. The latest version is 1.9.6 released in February 2014.

meSpeak.js is supported by most browsers. It introduces loadable voice modules. The typical usage of the meSpeak.js library is shown below :

<!DOCTYPE html>
<html lang="en">
<head>
 <title>Bonjour le monde</title>
 <script type="text/javascript" src="mespeak.js"></script>
 <script type="text/javascript">
 meSpeak.loadConfig("mespeak_config.json");
 meSpeak.loadVoice("voices/fr.json");
 function speakIt() {
 meSpeak.speak("Bonjour le monde");
 }
 </script>
</head>
<body>
<h1>Try meSpeak.js</h1>
<button onclick="speakIt();">Speak It</button>
</body>
</html>

Click here to test this example.

The mespeak_config.json file contains the data of the phontab, phonindex, phondata and intonations files and the default configuration values (amplitude, pitch, …). This data is encoded as base64 octed stream. The voice.json file includes the id of the voice, the dictionary used and the corresponding binary data (base64 encoded) of these two files. There are various desktop or online Base64 Decoders and Encoders available on the net to create the required .json files (base64decode.org, motobit.com, activexdev.com, …).

meSpeak cam mix multiple parts (diiferent languages or voices) in a single utterance.meSpeak supports the Web Audio API (AudioContext) with internal wav files, Flash is used as a fallback.

Links

A list with links to websites providing additional informations about eSpeak and meSpeak follows :

Google text to speech (TTS) with processing

Referring to the post about Google STT, this post is related to Google speech synthesis with processing. Amnon Owed presented in November 2011 processing code snippets to make use of Google’s text-to-speech webservice. The idea was born in the processing forum.

The sketch makes use of the Minim library that comes with Processing. Minim is an audio library that uses the JavaSound API, a bit of Tritonus, and Javazoom’s MP3SPI to provide an easy to use audio library for people developing in the Processing environment. The author of Minim is Damien Di Fede (ddf), a creative coder and composer interested in interactive audio art and music games. In November 2009, Damien was joined by Anderson Mills who proposed and co-developed the UGen Framework for the library.

I use the Minim 2.1.0 beta version with this new UGen Framework. I installed the Minim library in the libraries folder in my sketchbook and deleted the integrated 2.0.2 version in the processing (2.0b8) folder modes/java/libraries.

Today I run succesful trials with the english, french and german Google TTS engine. I am impressed by the results.

Google Text-to-Speech (TTS) support

Last update : 30 April 2011

On november 16th, 2009, Google announced on their official blog that english text-to-speech was added to the translation tools.  Google used eSpeak, which is an open source software speech synthesizer for this service.

In may 2010,  Google Translate added more audio translations languages, including Afrikaans, Albanian, Catalan, Chinese (Mandarin), Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Haitian Creole, Hindi, Hungarian, Icelandic, Indonesian, Italian, Latvian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Spanish, Swahili, Swedish, Turkish, Vietnamese and Welsh.

The speech audio is in MP3 format and is queried via a simple HTTP GET (REST) request. For english, an example url is:

http://translate.google.com/translate_tts?tl=en&q=how are you?

The TTS web service is restricting the text to 100 characters and the service returns 404 (Not Found) if the request includes a Referer header.

December 3, 2010, Google acquired Phonetic Arts, a company specialised in speech synthesis. Phonetic Arts Limited delivers technology that generates natural expressive speech. The products include Phonetic Morpher,  Phonetic LipSync  and Phonetic Synthesizer. Phonetic Arts, formerly known as Tayvin 356 Limited, was founded in 2006 and is based in Cambridge, UK.  The Phonetic Arts technology generates natural computer speech from small samples of recorded voice and should improve the voice output quality of Googles text-to-speech applications.

Google does not only provide speech output tools, but also speech input tools (Voice Search, Voice Input, Voice Actions), mainly in relation with the mobile phone OS Android.

Version 11 of the Google Chrome browser includes the HTML5 Speech Input API.

An amusing application of the Google TTS system is the Google Translate Beatbox.