Today, the dominant character encoding for the World Wide Web and for text files is UTF-8 (Universal Character Set Transformation Format, 8 bits). UTF-8 uses one to four bytes to encode each of the 1,112,064 valid code points in the Unicode code space. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as in ASCII, so valid ASCII text is also valid UTF-8-encoded Unicode.
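The variable length is easy to observe in a terminal. The following is a minimal sketch, assuming a POSIX shell with printf and wc. The helper name utf8_bytes is made up for this illustration, and the characters are written as octal escapes of their UTF-8 bytes so that the result does not depend on the encoding of the terminal or of the script file.

```shell
# utf8_bytes (hypothetical helper): count the bytes produced by a printf
# format string whose octal escapes spell out a UTF-8 sequence.
utf8_bytes() {
    printf "$1" | wc -c | tr -d ' '
}

utf8_bytes 'A'                  # U+0041, same single byte as in ASCII
utf8_bytes '\303\251'           # é, U+00E9
utf8_bytes '\342\202\254'       # euro sign, U+20AC
utf8_bytes '\360\235\204\236'   # musical G clef, U+1D11E
```

The four calls print 1, 2, 3 and 4.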
The Festival TTS package doesn’t support UTF-8. The development of Festival started about 20 years ago, when UTF-8 was known only to a few people at the Open Group for Unix Systems. Festival supports only single-byte character encodings. All files created with the Festvox tools to develop new languages or voices for Festival are in US-ASCII format.
Problems appear if we need to use non-ASCII characters in Festival, for example the characters é è ë à ä ö ü for the Luxembourgish language. In UTF-8 these characters are encoded with two bytes (16 bits), which causes errors in Festival. There is, however, a series of 8-bit character encoding standards defined by ECMA, IEC and ISO. These standards are known as ISO-8859-x, where x is one of 15 parts. The listed Luxembourgish-specific characters are included in the ISO-8859 parts 1, 2, 3, 4, 9, 10, 14, 15 and 16. The preferred ISO-8859 standard for the Luxembourgish language is part 15, which includes the euro sign and covers all the French and German letters. ISO 8859-15 encodes what it refers to as Latin alphabet no. 9.
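To see that such characters fit into a single byte in ISO-8859-15, the output of iconv can be piped through od. This is only a sketch, assuming that the local iconv knows the charset name ISO-8859-15 (the available names are listed by iconv -l). The helper name latin9_hex is hypothetical.

```shell
# latin9_hex (hypothetical helper): convert a UTF-8 byte sequence, given as
# printf octal escapes, to ISO-8859-15 and print the resulting bytes in hex.
latin9_hex() {
    printf "$1" | iconv -f UTF-8 -t ISO-8859-15 | od -An -tx1 | tr -d ' \n'
    echo
}

latin9_hex '\303\251'       # é: two bytes in UTF-8, prints e9
latin9_hex '\303\274'       # ü: two bytes in UTF-8, prints fc
latin9_hex '\342\202\254'   # euro sign: three bytes in UTF-8, prints a4
```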
Most text editors and other tools used today to write scripts and program code use UTF-8 as the default format. On Windows I use Notepad++ to edit my files, where changing the encoding is easy. On my Mac with OS X 10.10.2 (Yosemite) I use TextEdit, Terminal and Xcode to edit my files for Festival TTS. I changed the following preferences to encode my scripts and programs in ISO-8859-15.
I first checked the list of the encoding formats to show in the selection menu of the preference window.
The Occidental (ISO Latin 9) format corresponds to the ISO-8859-15 standard.
Same procedure for the OSX terminal. I checked the list of the encoding formats to show in the selection menu of the preference window.
In the OSX terminal preference window I also selected the Occidental (ISO Latin 9) format corresponding to the ISO-8859-15 standard.
An ISO-8859-15 encoded text file (for example, one created with TextEdit) is not recognized as such by Xcode 6.2. Characters such as “é à è ö ä ü” are displayed as “È ‡ Ë ˆ ‰ ¸”. The decoding indicated by Xcode is Western (Mac OS Roman), even if the default text encoding in Preferences > Text Editing is set to Western (ISO Latin 9). The characters are displayed as expected if the text is reinterpreted as ISO Latin 9 with the File Inspector. Files created with Xcode are always encoded in UTF-8; the default text encoding setting seems to have no effect.
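The garbled display can be reproduced outside Xcode by decoding ISO-8859-15 bytes as Mac OS Roman. A sketch, assuming that the local iconv offers the charset name MACINTOSH (check with iconv -l) and that the terminal displays UTF-8:

```shell
# The bytes 0xE9 0xE0 0xE8 are "é à è" in ISO-8859-15. Here they are
# written as octal escapes (\351 \340 \350) and decoded with two charsets.
printf '\351\340\350\n' | iconv -f ISO-8859-15 -t UTF-8   # prints éàè
printf '\351\340\350\n' | iconv -f MACINTOSH -t UTF-8     # prints È‡Ë
```

The bytes in the file are identical in both cases; only the table used to interpret them differs.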
The Xcode window to select potential encoding formats uses a different presentation, but displays the same formats as those in TextEdit or OSX Terminal.
Again I defined the default text encoding as Occidental (ISO Latin 9) alias ISO-8859-15.
To check the text encoding of a file, we can use the file command:
mbarnig$ file -I name.ext
Some examples are shown hereafter :
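As a complement to file, which guesses the encoding, one can test whether a file is at least valid UTF-8 by letting iconv decode it. This is a rough sketch: is_utf8 is a hypothetical helper name, and the two sample files are created in a temporary directory only for the demonstration.

```shell
# is_utf8 (hypothetical helper): succeed if iconv can decode the file as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

tmp=$(mktemp -d)
printf 'caf\303\251\n' > "$tmp/utf8.txt"   # "café" stored as UTF-8
printf 'caf\351\n'     > "$tmp/iso.txt"    # "café" stored as ISO-8859-15

for f in "$tmp/utf8.txt" "$tmp/iso.txt"; do
    if is_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not UTF-8, possibly an ISO-8859 file"
    fi
done
rm -r "$tmp"
```

A pure ASCII file passes this check as well, since ASCII text is valid UTF-8.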
To convert a UTF-8 file to the ISO-8859-15 format, we can use the iconv command:
mbarnig$ iconv -f UTF-8 -t ISO-8859-15 utf8.txt > iso.txt
An example is shown hereafter :
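For use in scripts, the conversion can be wrapped so that a failed run leaves no truncated file behind. The sketch below reuses the file names utf8.txt and iso.txt; to_latin9 is a hypothetical helper. It relies on iconv returning a non-zero exit status when a character has no ISO-8859-15 equivalent.

```shell
# to_latin9 (hypothetical helper): convert SRC (UTF-8) to DST (ISO-8859-15).
# On failure the partial output file is deleted and 1 is returned.
to_latin9() {
    if iconv -f UTF-8 -t ISO-8859-15 "$1" > "$2"; then
        echo "converted $1 -> $2"
    else
        rm -f "$2"
        echo "conversion of $1 failed" >&2
        return 1
    fi
}

# Demonstration with a throwaway input file: "é" followed by a newline.
demo=$(mktemp -d)
printf '\303\251\n' > "$demo/utf8.txt"
to_latin9 "$demo/utf8.txt" "$demo/iso.txt"
wc -c "$demo/utf8.txt" "$demo/iso.txt"    # 3 bytes before, 2 bytes after
```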
The encoding formats available in iconv are listed with the option -l or --list:
To show the locale environment variables, we can use the locale command:
LANG= ; native language + local customization if no LC_ variables
LC_COLLATE="C" ; character collation
LC_CTYPE="C" ; character handling functions
LC_MESSAGES="C" ; affirmative and negative responses
LC_MONETARY="C" ; monetary related numeric formatting
LC_NUMERIC="C" ; numeric formatting
LC_TIME="C" ; date and time formatting
LC_ALL= ; value to overwrite all LC_ variables
“C” stands for the simplest locale (ASCII character set, single-byte character encoding, US English language, …).
To show all the available locales we use the option -a
mbarnig$ locale -a
To change the locales, we use the export command:
mbarnig$ export LANG=fr_CH
mbarnig$ export LC_ALL=fr_CH.ISO8859-15
The export commands can be included in the /Users/mbarnig/.bash_profile file.
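A locale name that is not installed will not take effect as intended, so a guard in .bash_profile may be worthwhile. A possible fragment, assuming that the locale name fr_CH.ISO8859-15 is spelled exactly as it appears in the output of locale -a (the variable name wanted is arbitrary):

```shell
# Fragment for ~/.bash_profile: export the locale only if it is installed.
wanted="fr_CH.ISO8859-15"
if locale -a 2>/dev/null | grep -Fqx "$wanted"; then
    export LC_ALL="$wanted"
else
    echo "locale $wanted is not installed, keeping the current settings" >&2
fi
```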