Character encoding for Festival TTS files

Today, the dominant character encoding for the World Wide Web and for text files is UTF-8 (Universal Character Set Transformation Format, 8-bit). UTF-8 uses one to four bytes to encode each of the 1,112,064 valid code points in the Unicode code space. The first 128 Unicode characters correspond one-to-one with ASCII and are encoded as a single byte with the same binary value as in ASCII, so any valid ASCII text is also valid UTF-8.
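The variable width is easy to observe from a shell. The following is a minimal sketch, assuming a POSIX shell with printf and wc; the helper name utf8_len is made up for this illustration. The characters are written as octal byte escapes so that the result does not depend on the encoding in which the script itself is saved.

```shell
#!/bin/sh
# utf8_len is a hypothetical helper: it prints the number of bytes
# produced by the printf format string given as first argument.
utf8_len() {
    printf "$1" | wc -c | tr -d ' '
}

utf8_len 'A'                  # U+0041, plain ASCII
utf8_len '\303\251'           # U+00E9, the letter é
utf8_len '\342\202\254'       # U+20AC, the euro sign
utf8_len '\360\237\230\200'   # U+1F600, an emoji outside the basic plane
```

The four calls should print 1, 2, 3 and 4, one value per line.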

The Festival TTS package does not support UTF-8. The development of Festival started about 20 years ago, when UTF-8 was known only to a few people at the Open Group for Unix systems. Festival supports only single-byte character encodings. All files created with the Festvox tools to develop new languages or voices for Festival are in US-ASCII format.

Problems appear if we need to use non-ASCII characters in Festival, for example the characters é è ë à ä ö ü for the Luxembourgish language. In UTF-8 each of these characters is encoded with two bytes (16 bits), which causes errors in Festival. There is, however, a series of 8-bit character encoding standards defined by ECMA, IEC and ISO. These standards are known as ISO-8859-x, where x is one of 15 parts. The Luxembourgish characters listed above are all included in ISO-8859 parts 1, 3, 9, 14, 15 and 16. The preferred ISO-8859 standard for the Luxembourgish language is part 15, which includes the euro sign and covers all the French and German letters. ISO 8859-15 encodes what it refers to as Latin alphabet no. 9.
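The difference between the two encodings can be made visible by dumping the bytes. This is only a sketch, assuming iconv, od and xargs are available; hex_bytes is a hypothetical helper name, and the input characters are again written as octal escapes.

```shell
#!/bin/sh
# hex_bytes is a hypothetical helper: it prints stdin as
# space-separated hexadecimal byte values on one line.
hex_bytes() {
    od -An -tx1 | xargs echo
}

# "é" in UTF-8 occupies two bytes
printf '\303\251' | hex_bytes
# The same character converted to ISO-8859-15 occupies one byte
printf '\303\251' | iconv -f UTF-8 -t ISO-8859-15 | hex_bytes
# The euro sign: three bytes in UTF-8, one byte in ISO-8859-15
printf '\342\202\254' | iconv -f UTF-8 -t ISO-8859-15 | hex_bytes
```

The three lines of output should be c3 a9, then e9, then a4. A tool that reads one byte per character, as Festival does, sees a single symbol only in the second and third case.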

Most text editors and other tools used today to write scripts and program code use UTF-8 as the default format. On Windows I use Notepad++ to edit my files; changing the encoding there is easy. On my Mac (OS X 10.10.2 Yosemite) I use TextEdit, Terminal and Xcode to edit my files for Festival TTS. I changed the following preferences to encode my scripts and programs in ISO-8859-15.

TextEdit

I first checked the list of encoding formats to show in the selection menu of the preferences window.

List of character encoding formats in TextEdit

The Occidental (ISO Latin 9) format corresponds to the ISO-8859-15 standard.

Preferences selected in TextEdit

Terminal (bash)

The procedure is the same for the OS X Terminal: I checked the list of encoding formats to show in the selection menu of the preferences window.

List of character encoding formats in the OSX Terminal

In the OSX terminal preference window I also selected the Occidental (ISO Latin 9) format corresponding to the ISO-8859-15 standard.

Preferences selected for the OSX Terminal

Xcode 6.2

An ISO-8859-15 encoded text file (for example, one created with TextEdit) is not recognized as such by Xcode 6.2. Characters such as “é à è ö ä ü” are displayed as “È ‡ Ë ˆ ‰ ¸”. Xcode reports the decoding as Western (Mac OS Roman), even if the default text encoding in Preferences > Text Editing is set to Western (ISO Latin 9). The characters are displayed as expected once the text is reinterpreted as ISO Latin 9 with the File Inspector. Files created with Xcode are always encoded in UTF-8; the default text encoding setting seems to have no effect.
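The substitution can be reproduced outside Xcode by decoding the same bytes in two ways. This is a sketch under two assumptions: that the local iconv knows Mac OS Roman under the name MACINTOSH (check the list of supported names if it does not), and that the output is viewed in a UTF-8 terminal.

```shell
#!/bin/sh
# Bytes E9 E0 E8 F6 E4 FC are "é à è ö ä ü" in ISO-8859-15,
# written here as octal escapes for printf.
iso_bytes='\351\340\350\366\344\374'

# Decoded with the encoding the file was written in
printf "$iso_bytes" | iconv -f ISO-8859-15 -t UTF-8; echo
# Decoded as Mac OS Roman, which is what Xcode assumed
# (the charset name MACINTOSH is an assumption about the iconv build)
printf "$iso_bytes" | iconv -f MACINTOSH -t UTF-8; echo
```

The first line should show the intended letters and the second the same wrong symbols that Xcode displays, which suggests the file content is intact and only the decoding is wrong.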

The Xcode window to select potential encoding formats uses a different presentation, but displays the same formats as those in TextEdit or OSX Terminal.

List of encoding formats in Xcode

Again I defined the default text encoding as Occidental (ISO Latin 9) alias ISO-8859-15.

Preferences selected for Xcode

File conversion

To check the text encoding of a file we can use the command file:

mbarnig$ file -I name.ext

Some examples are shown below:

Show file encoding with the command $ file -I
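The file command guesses the encoding from the content. A complementary check, sketched here with a hypothetical helper named is_utf8, is to ask iconv whether the file decodes cleanly as UTF-8. An accented letter stored as a single ISO-8859-15 byte normally forms an invalid UTF-8 sequence, so the check fails for such files. The two demo files are throwaway examples created by the script itself.

```shell
#!/bin/sh
# is_utf8 is a hypothetical helper: it succeeds if the whole file
# can be decoded as UTF-8 (iconv exits non-zero on invalid input).
is_utf8() {
    iconv -f UTF-8 -t UTF-16 "$1" > /dev/null 2>&1
}

# Demo on two temporary files holding "café" in each encoding.
tmpdir=$(mktemp -d)
printf 'caf\303\251\n' > "$tmpdir/utf8.txt"   # é as UTF-8 bytes C3 A9
printf 'caf\351\n'     > "$tmpdir/iso.txt"    # é as ISO-8859-15 byte E9

for f in "$tmpdir/utf8.txt" "$tmpdir/iso.txt"; do
    if is_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not UTF-8 (possibly an ISO-8859 variant)"
    fi
done
rm -rf "$tmpdir"
```

Note that a pure ASCII file passes this check as well, since ASCII is a subset of UTF-8; such a file needs no conversion for Festival.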

To convert a UTF-8 file to the ISO-8859-15 format we can use the command iconv:

mbarnig$ iconv -f UTF-8 -t ISO-8859-15 utf8.txt > iso.txt

An example is shown below:

File format conversion with the command $ iconv
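A UTF-8 file can contain characters that do not exist in ISO-8859-15, in which case iconv stops with a non-zero exit status and may leave a truncated output file behind. The following sketch wraps the conversion so that the target file is only written when the whole input could be converted. The helper name to_latin9 and the .tmp suffix are made up for this illustration.

```shell
#!/bin/sh
# to_latin9 is a hypothetical helper: convert a UTF-8 file to
# ISO-8859-15 and keep the result only if iconv reports success.
to_latin9() {
    src=$1
    dst=$2
    if iconv -f UTF-8 -t ISO-8859-15 "$src" > "$dst.tmp" 2> /dev/null; then
        mv "$dst.tmp" "$dst"
    else
        # Discard the partial output and report the failure
        rm -f "$dst.tmp"
        echo "cannot convert $src to ISO-8859-15" >&2
        return 1
    fi
}
```

With the file names used above, a call would look like to_latin9 utf8.txt iso.txt. When the conversion fails, the offending character has to be replaced by hand before the file can be used with Festival.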

The encoding formats available in iconv are listed with the option -l or --list:

Encoding formats available in iconv

Environment Variables

To show the environment variables we can use the command locale:

mbarnig$ locale

Environment variables in OSX locale

LANG=              ; native language + local customization if no LC_ variables
LC_COLLATE="C"     ; character collation
LC_CTYPE="C"       ; character handling functions
LC_MESSAGES="C"    ; affirmative and negative responses
LC_MONETARY="C"    ; monetary related numeric formatting
LC_NUMERIC="C"     ; numeric formatting
LC_TIME="C"        ; date and time formatting
LC_ALL=            ; value to override all LC_ variables

"C" stands for the simplest locale (ASCII character set, single-byte character encoding, US English language, …)

To show all the available locales we use the option -a:

mbarnig$ locale -a

List of locales defined in OSX 10.10.2

To change the locale we use the export command:

mbarnig$ export LANG=fr_CH
mbarnig$ export LC_ALL=fr_CH.ISO8859-15

Modification of locales in OSX 10.10.2 with the export command

The export commands can be included in the /Users/mbarnig/.bash_profile file so that they are applied in every new login shell.
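When the same profile is shared between machines, the requested locale may not be installed everywhere. A possible guard, given here only as a sketch, looks the name up in the output of locale -a before exporting it. The function name use_locale is hypothetical, and unlike the two commands above it sets LANG and LC_ALL to the same value.

```shell
# Sketch for ~/.bash_profile: switch to an ISO-8859-15 locale only
# if it is installed. use_locale is a hypothetical helper name.
use_locale() {
    # -x matches the whole line, -F treats the name as a fixed string
    if locale -a 2> /dev/null | grep -qxF "$1"; then
        export LANG="$1"
        export LC_ALL="$1"
    else
        echo "locale $1 is not installed, keeping current settings" >&2
        return 1
    fi
}

# "|| true" keeps a missing locale from aborting scripts run with set -e
use_locale fr_CH.ISO8859-15 || true
```

Running locale afterwards shows whether the change took effect.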

Links