Festival Text-to-Speech Package

Last update : April 22, 2015

Festival

The Festival Speech Synthesis System is a general multi-lingual speech synthesis system originally developed by Alan W. Black at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Alan W. Black is now a professor in the Language Technologies Institute at Carnegie Mellon University, where substantial contributions have been made to Festival. The program is written in C++.

To set up a complete Festival environment on OS X (Yosemite 10.10.2), four packages are required :

  1. Festival-2.4 (file festival-2.4-release.tar)
  2. Edinburgh Speech-Tools (file speech_tools-2.4-release.tar)
  3. Festvox (file festvox-2.7.0-release.tar.gz)
  4. Languages (example file : english festvox_kallpc16k.tar.gz)

To compile and install the packages, I got some guidance from a Linguistic Mystic (alias Will Styler). After unpacking, the folders were moved into a common folder Festival-TTS on the desktop with the following names :

  • festival
  • speech_tools
  • festvox

The language files are installed in the festival folder in the sub-folders lib/voices/english.

The packages have been compiled in the following sequence :

mbarnig$ cd ~/Desktop/Festival-TTS/speech_tools
mbarnig$ ./configure
mbarnig$ make
mbarnig$ make test
mbarnig$ make install
mbarnig$ cd ~/Desktop/Festival-TTS/festival
mbarnig$ ./configure
mbarnig$ make
mbarnig$ make install
mbarnig$ cd ~/Desktop/Festival-TTS/festvox
mbarnig$ ./configure
mbarnig$ make

At the end the voice folder with the language files was moved to the festival/lib directory.

After updating Xcode to version 6.1.1 and installing the audiotools for Xcode 6.1, I checked that afplay is working :

afplay check

I also checked that the festival/lib/siteinit.scm file contains the following statements :

  • (Parameter.set 'Audio_Required_Format 'riff)
  • (Parameter.set 'Audio_Method 'Audio_Command)
  • (Parameter.set 'Audio_Command "afplay $FILE")

The following files have been downloaded from the festvox website, unzipped and moved to the festival/lib/dicts folder :

  • festlex_CMU.tar.gz
  • festlex_OALD.tar.gz
  • festlex_POSLEX.tar.gz

Finally, I added the following statements to the .bash_profile file located in the home folder (/Users/mbarnig) :

  • export FESTIVALDIR="/Users/mbarnig/Desktop/Festival-TTS/festival"
  • export PATH="$FESTIVALDIR/bin:$PATH"
  • export ESTDIR="/Users/mbarnig/Desktop/Festival-TTS/speech_tools"
  • export PATH="$ESTDIR/bin:$PATH"
  • export FESTVOXDIR="/Users/mbarnig/Desktop/Festival-TTS/festvox"
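
A quick way to verify that the three variables are visible in a new shell is a small check script. The following Python sketch is only an illustration; the function name check_festival_env is hypothetical and not part of Festival.

```python
import os

# Variables exported in .bash_profile in the setup described above.
FESTIVAL_VARS = ("FESTIVALDIR", "ESTDIR", "FESTVOXDIR")


def check_festival_env(env=None):
    """Return a dict {variable: problem} for every variable that is
    unset or does not point to an existing directory."""
    env = os.environ if env is None else env
    problems = {}
    for name in FESTIVAL_VARS:
        path = env.get(name)
        if not path:
            problems[name] = "not set"
        elif not os.path.isdir(path):
            problems[name] = "not a directory: " + path
    return problems


if __name__ == "__main__":
    for name, problem in check_festival_env().items():
        print(name, ":", problem)
```

An empty result means that all three folders were found.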

The festival tool can now be started in the terminal window with the command

mbarnig$ $FESTIVALDIR/bin/festival
Festival version 2.4

All seems to be working great!

Festival embeds a small Scheme (Lisp) interpreter (SIOD : Scheme In One Defun 3.0) written by George Carrette.

Festival works in two fundamental modes, command mode and text-to-speech (tts) mode. If Festival is started without arguments (or with the option --command), it enters the default command mode (prompt = festival>). Input enclosed in parentheses is treated as a command and is interpreted by the Scheme interpreter. Examples of accepted commands :

festival> 
> (intro)   :  short spoken introduction
> (voice.list)   : list of available voices
> (set! utt1 (Utterance Text "Hello world"))   : 
           create an utterance and save it in a variable
> (utt.synth utt1)    : synthesize utterance to get a waveform
> (utt.play utt1) : send the synthesized waveform to the audio device
> (SayText "Good morning, welcome to Festival")   : 
           speak text (combination of the 3 preceding commands)
> (tts "myfile" nil)    : speak file instead of text
> (manual nil)  : show the content of the manual
> (manual "Accessing an utterance")  : show the section "utterance"
> (PhoneSet.list)   : show the currently defined phonesets
> (tts "doremi.xml" 'singing)  : an XML based mode for specifying 
           songs, both notes and duration
> (quit)   : exit

If Festival is started with the --tts option, it enters tts mode. Input (from files or from standard input) is treated as text to be rendered as speech.

Other options available at the start of Festival are :

--language LANG   : set the default language to LANG.
--server   : enter server mode where Festival waits for clients on a 
    known port (default port : 1314); connected clients may send 
    commands (or text) to the server and expect waveforms back.
--script scriptfile  : run scriptfile as a Festival script file.
--heap NUMBER   : to increase the scheme heap.
--batch  : after processing file arguments do not become interactive.
--interactive  : after processing file arguments become interactive.
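
As an illustration of the server mode, the following Python sketch builds a SayText command and sends it to a Festival server on the default port 1314. It is a minimal sketch under assumptions: the helper names are hypothetical, the server is assumed to run on the same machine (started with festival --server), and the reply is returned as raw bytes without interpreting the server's reply framing.

```python
import socket


def scheme_string(text):
    """Quote text as a Scheme string literal (escape backslashes and quotes)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def say_text_command(text):
    """Build the same (SayText "...") command that is typed at the prompt."""
    return "(SayText " + scheme_string(text) + ")"


def send_command(command, host="localhost", port=1314, timeout=5.0):
    """Send one Scheme command to a running Festival server.

    Returns the first chunk of the reply as raw bytes; parsing the
    reply is deliberately left out of this sketch.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("utf-8") + b"\n")
        return conn.recv(4096)


if __name__ == "__main__":
    print(send_command(say_text_command("Good morning, welcome to Festival")))
```

The escaping step matters as soon as the text itself contains double quotes, which would otherwise end the Scheme string early.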

Script mode :

festival mbarnig$  examples/saytime
festival mbarnig$  text2wave myfile.txt -o myfile.wav
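
The text2wave call shown above can also be launched from another program. A minimal Python sketch, assuming that text2wave is reachable through the PATH set in .bash_profile; the function names are hypothetical.

```python
import subprocess


def text2wave_command(text_file, wave_file):
    """Argument list equivalent to: text2wave myfile.txt -o myfile.wav"""
    return ["text2wave", text_file, "-o", wave_file]


def synthesize(text_file, wave_file):
    """Run text2wave and raise CalledProcessError if it fails."""
    subprocess.run(text2wave_command(text_file, wave_file), check=True)


if __name__ == "__main__":
    synthesize("myfile.txt", "myfile.wav")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.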

An updated Festival System Documentation with 34 chapters, edited in December 2014, is available at the festvox website.

The following Festival voices are available :

  • festvox_cmu_us_ahw_cg
  • festvox_cmu_us_aup_cg
  • festvox_cmu_us_awb_cg
  • festvox_cmu_us_axb_cg
  • festvox_cmu_us_bdl_cg
  • festvox_cmu_us_clb_cg
  • festvox_cmu_us_fem_cg
  • festvox_cmu_us_gka_cg
  • festvox_cmu_us_jmk_cg
  • festvox_cmu_us_ksp_cg
  • festvox_cmu_us_rms_cg
  • festvox_cmu_us_rxr_cg
  • festvox_cmu_us_slt_cg
  • festvox_kallpc16k
  • festvox_rablpc16k
  • Leopold : AustrianGerman
  • IMS German Festival
  • OGIgerman by CSLU
  • Swedish by SOL
  • Hindi

Hindi and German are examples of Festival languages/voices whose phone-sets use different phone features than the standard US and English phone-sets.

Edinburgh Speech Tools

The Edinburgh Speech Tools Library is a collection of C++ classes, functions and related programs for manipulating objects used in speech processing. It includes support for reading and writing waveforms and parameter files (LPC, cepstra, F0) in various formats and for converting between them. It also includes support for linguistic objects, various label files and n-grams (with smoothing). In addition to the library, a number of programs are included: an intonation library with a pitch tracker, smoother and labelling system (using the Tilt labelling system), and a classification and regression tree (CART) building program called wagon. There is also growing support for speech recognition classes such as decoders and HMMs.

An introduction to the Edinburgh Speech Tools is provided by Festvox.

Festvox

The Festvox project aims to make the building of new synthetic voices for Festival more systematic and better documented, by offering the following resources :

Festival Variables

Festival provides a list of variables available for general use. This list is automatically generated from the documentation strings of the variables defined in the source code. A variable can be displayed with the print command at the festival prompt. Some examples are shown hereafter :

festival>
> (print festival_version) ; current version of the system
> (print *ostype*) ; operating system that Festival is running on
> (print lexdir) ; default directory of the lexicons
> (print SynthTypes) ; list of synthesis types and functions
> (print token.letter_pos) ; POS tag for individual letters
> (print token.punctuation) ; characters treated as punctuation
> (print voice-path) ; list of folders to look for voices
> (print voice_default) ; function to load the default voice

Festival Variables

Festival Functions

Festival provides a list of functions available for general use. This list is automatically generated from the documentation strings of the functions defined in the source code. A function is called at the Festival prompt. Some examples are shown hereafter :

festival>
> (pwd) ; return current directory
> (lex.list) ; list names of all currently defined lexicons
> (voice.list) ; list all potential voices in the system
> (lex.lookup WORD FEATURES) ; lookup word in current lexicon
> (lex.compile ENTRYFILE COMPILEFILE) ; compile lexical entries
> (PhoneSet.list) ; list all currently defined PhoneSets
> (quit) ; exit from Festival

Festival Functions

Utterance Access Methods

Festival provides a number of standard functions that give access to parts of an utterance, traverse it and extract features.

Three utterance access methods are of particular interest :

  1. (utt.feat UTT FEATNAME)
    returns the value of feature FEATNAME in UTT
  2. (item.feat ITEM FEATNAME)
    returns the value of feature FEATNAME in ITEM
  3. (utt.features UTT RELATIONNAME FUNCLIST)
    returns, for each item of the relation RELATIONNAME in UTT, a vector of
    the values of the features listed in FUNCLIST

FEATNAME may be a

  • feature name ; example : (item.feat sylb 'stress)
  • feature function name ; example : (item.feat sylb 'pos_in_word)
  • pathname ; examples : (item.feat sylb 'nn.stress)
    (item.feat sylb 'R:SylStructure.parent.word)

Notes :
sylb is a syllable item
R: is a relation operator that switches to the named relation

RELATIONNAME may be 'Token, 'Word, 'Phrase, 'Segment, 'Syllable, etc.

FUNCLIST is a list of feature names ; example : '(name pos)
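
To make the pathname idea more concrete, the following Python sketch imitates how a path such as n.name or nn.name walks along a chain of items before a feature is read. It is a toy model, not Festival code: the Item class and the feat function are hypothetical, only the steps n, nn, p and pp are modelled, and returning "0" when the path runs off the end of the chain is an assumption about the behaviour.

```python
class Item:
    """Toy stand-in for a Festival item: a feature dict plus neighbours."""

    def __init__(self, **feats):
        self.feats = feats
        self.prev = None
        self.next = None


def link(items):
    """Chain the items into a doubly linked list, like one relation."""
    for left, right in zip(items, items[1:]):
        left.next, right.prev = right, left
    return items


def _move(item, direction, count):
    # Follow the prev/next pointer `count` times, stopping at the end.
    for _ in range(count):
        if item is None:
            return None
        item = getattr(item, direction)
    return item


STEPS = {
    "n": lambda it: _move(it, "next", 1),
    "nn": lambda it: _move(it, "next", 2),
    "p": lambda it: _move(it, "prev", 1),
    "pp": lambda it: _move(it, "prev", 2),
}


def feat(item, path):
    """Resolve a dotted path: all parts but the last are navigation steps."""
    *steps, name = path.split(".")
    for step in steps:
        item = STEPS[step](item)
        if item is None:
            return "0"  # assumed default for a missing item
    return item.feats.get(name, "0")


tokens = link([Item(name="Hello"), Item(name="Marco"), Item(name="how")])
print(feat(tokens[0], "nn.name"))
```

Paths that start with R: additionally jump to the same item in another relation, which this toy model leaves out.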

Some examples are shown hereafter :

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! tok (utt.relation.first utter 'Token))
(utt.feat utter 'type)
(item.feat tok 'nn.name)
(item.feat tok 'R:Token.daughter1.name)
(utt.features utter 'Word '(name pos p.pos n.pos))

Utterance access methods

More information about feature functions used as FEATNAME is provided in the next section.

Festival Feature Functions

Festival provides a list of basic feature functions available as FEATNAME in utterances. Most are only available for specific items. Some examples are shown hereafter, related to the corresponding items :

Token item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! tok (utt.relation.first utter 'Token))
(item.name tok) ; first token
(item.feat tok 'name) ; first token
(item.feat tok 'n.name) ; second token
(item.feat tok 'nn.name) ; third token
(item.feat tok 'whitespace)
(item.feat tok 'prepunctuation)

Utterance Token

Word item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! wrd (item.next (utt.relation.first utter 'Word)))
(item.name wrd) ; second word
(item.feat wrd 'p.name) ; first word
(item.feat wrd 'cap)
(item.feat wrd 'word_duration)

Utterance Word

Segment item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! seg (item.prev (item.prev (utt.relation.last utter 'Segment))))
(item.name seg) ; third last segment
(item.feat seg 'n.name) ; second last segment
(item.feat seg 'seg_pitch)
(item.feat seg 'segment_end)
(item.feat seg 'R:SylStructure.parent.parent.name)

Utterance Segment

Syllable item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! sylb (utt.relation.first utter 'Syllable))
(item.features sylb) ; first syllable
(item.feat sylb 'asyl_out)
(item.feat sylb 'syl_midpitch)
(utt.features utter 'Syllable '(stress))
(item.feat sylb 'nn.stress) ; stress of third syllable
(item.feat sylb 'R:SylStructure.parent.name)
(item.feat sylb 'R:SylStructure.daughter1.name)
(item.feat sylb 'R:SylStructure.daughter2.name)

Utterance Syllable

SylStructure item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! sylst (item.prev (utt.relation.last utter 'SylStructure)))
(item.features sylst)
(item.feat sylst 'pos_index)
(item.feat sylst 'phrase_score)

Utterance SylStructure

Intonation item

festival>
(set! utter (SayText "Hello Marco, how are you?"))
(set! inton (utt.relation.first utter 'Intonation))
(item.features inton)
(item.feat inton 'id)

Utterance Intonation

Dumping features

Extracting basic features from a set of utterances is useful for most of the training techniques for TTS voice building. Festival provides a script dumpfeats in the festival/examples folder which does this task. The results can be saved in a single feature file or in separate files for each utterance. In the example below, the dumpfeats script was copied into the festival folder of my test voice mbarnig_lb_voxcg :

mbarnig$ ./dumpfeats -feats "(name p.name n.name)" \
    -relation Segment -output myfeats.txt utts/*.utt
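
The resulting feature file can then be loaded for further processing. The following Python sketch assumes that the file holds one item per line with whitespace-separated values in the order given to -feats; this layout is an assumption to check against your own myfeats.txt, and the sample values in the demo are made up.

```python
def parse_feats(text, names):
    """Turn dumped feature lines into a list of {feature name: value} dicts."""
    rows = []
    for number, line in enumerate(text.splitlines(), start=1):
        values = line.split()
        if not values:
            continue  # skip empty lines
        if len(values) != len(names):
            raise ValueError("line %d: expected %d values, got %d"
                             % (number, len(names), len(values)))
        rows.append(dict(zip(names, values)))
    return rows


if __name__ == "__main__":
    # Made-up sample lines, only to show the shape of the result.
    sample = "pau 0 h\nh pau e\n"
    for row in parse_feats(sample, ["name", "p.name", "n.name"]):
        print(row)
```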

Festival dumpfeats

Synology Photostation PostgreSQL Database

Last update : November 17, 2015

The Synology DSM 5.0 operating system uses the PostgreSQL database, version 9.3, for Photostation 6.0. The database can be administered with phpPgAdmin.

The Synology phpPgAdmin package created by Nigel Barnes (alias Pernod 70) was updated on March 31, 2014 to work with the new Synology DSM version 5.1. The new package version is 5.1.0-002; the sources are available at Github.

The following configuration files are used :

by phpPgAdmin

  • /usr/syno/synoman/phpsrc/phpPgAdmin/conf/config.inc.php
  • /usr/syno/synoman/phpsrc/phpPgAdmin/conf/config.inc.php-dist

by PostgreSQL

  • /etc/postgresql/pg_hba.conf
  • /etc/postgresql/pg_ident.conf
  • /etc/postgresql/postgresql.conf
  • /etc.defaults/postgresql/pg_hba.conf
  • /etc.defaults/postgresql/pg_ident.conf
  • /etc.defaults/postgresql/postgresql.conf

config.inc.php

I named this configuration file phpPgAdmin in the Synology Config File Editor. The original content is shown below :


<?php

 /**
 * Central phpPgAdmin configuration. As a user you may modify the
 * settings here for your particular configuration.
 *
 * $Id: config.inc.php-dist,v 1.55 2008/02/18 21:10:31 xzilla Exp $
 */

 // An example server. Create as many of these as you wish,
 // indexed from zero upwards.

 // Display name for the server on the login screen
 $conf['servers'][0]['desc'] = 'PostgreSQL by Synology';

 // Hostname or IP address for server. Use '' for UNIX domain socket.
 // use 'localhost' for TCP/IP connection on this computer
 $conf['servers'][0]['host'] = '';

 // Database port on server (5432 is the PostgreSQL default)
 $conf['servers'][0]['port'] = 5432;

 // Database SSL mode
 // Possible options: disable, allow, prefer, require
 // To require SSL on older servers use option: legacy
 // To ignore the SSL mode, use option: unspecified
 $conf['servers'][0]['sslmode'] = 'allow';

 // Change the default database only if you cannot connect to template1.
 // For a PostgreSQL 8.1+ server, you can set this to 'postgres'.
 $conf['servers'][0]['defaultdb'] = 'template1';

 // Specify the path to the database dump utilities for this server.
 // You can set these to '' if no dumper is available.
 $conf['servers'][0]['pg_dump_path'] = '/usr/bin/pg_dump';
 $conf['servers'][0]['pg_dumpall_path'] = '/usr/bin/pg_dumpall';

 // Example for a second server (PostgreSQL for Windows)
 //$conf['servers'][1]['desc'] = 'Test Server';
 //$conf['servers'][1]['host'] = '127.0.0.1';
 //$conf['servers'][1]['port'] = 5432;
 //$conf['servers'][1]['sslmode'] = 'allow';
 //$conf['servers'][1]['defaultdb'] = 'template1';
 //$conf['servers'][1]['pg_dump_path'] = 'C:\\Program Files\\PostgreSQL\\8.0\\bin\\pg_dump.exe';
 //$conf['servers'][1]['pg_dumpall_path'] = 'C:\\Program Files\\PostgreSQL\\8.0\\bin\\pg_dumpall.exe';
 
 
 /* Groups definition */
 /* Groups allow administrators to logicaly group servers together under
 * group nodes in the left browser tree
 *
 * The group '0' description
 */
 //$conf['srv_groups'][0]['desc'] = 'group one';

 /* Add here servers indexes belonging to the group '0' seperated by comma */
 //$conf['srv_groups'][0]['servers'] = '0,1,2'; 

 /* A server can belong to multi groups. Here server 1 is referenced in both
 * 'group one' and 'group two'*/
 //$conf['srv_groups'][1]['desc'] = 'group two';
 //$conf['srv_groups'][1]['servers'] = '3,1';

 /* A group can be nested in one or more existing groups using the 'parents'
 * parameter. Here the group 'group three' contains only one server and will
 * appear as a subgroup in both 'group one' and 'group two':
 */
 //$conf['srv_groups'][2]['desc'] = 'group three';
 //$conf['srv_groups'][2]['servers'] = '4';
 //$conf['srv_groups'][2]['parents'] = '0,1';

 /* Warning: Only groups with no parents appears at the root of the tree. */
 

 // Default language. E.g.: 'english', 'polish', etc. See lang/ directory
 // for all possibilities. If you specify 'auto' (the default) it will use 
 // your browser preference.
 $conf['default_lang'] = 'auto';

 // AutoComplete uses AJAX interaction to list foreign key values 
 // on insert fields. It currently only works on single column 
 // foreign keys. You can choose one of the following values:
 // 'default on' enables AutoComplete and turns it on by default.
 // 'default off' enables AutoComplete but turns it off by default.
 // 'disable' disables AutoComplete.
 $conf['autocomplete'] = 'default on';
 
 // If extra login security is true, then logins via phpPgAdmin with no
 // password or certain usernames (pgsql, postgres, root, administrator)
 // will be denied. Only set this false once you have read the FAQ and
 // understand how to change PostgreSQL's pg_hba.conf to enable
 // passworded local connections.
 $conf['extra_login_security'] = false;

 // Only show owned databases?
 // Note: This will simply hide other databases in the list - this does
 // not in any way prevent your users from seeing other database by
 // other means. (e.g. Run 'SELECT * FROM pg_database' in the SQL area.)
 $conf['owned_only'] = false;

 // Display comments on objects? Comments are a good way of documenting
 // a database, but they do take up space in the interface.
 $conf['show_comments'] = true;

 // Display "advanced" objects? Setting this to true will show 
 // aggregates, types, operators, operator classes, conversions, 
 // languages and casts in phpPgAdmin. These objects are rarely 
 // administered and can clutter the interface.
 $conf['show_advanced'] = false;

 // Display "system" objects?
 $conf['show_system'] = false;

 // Minimum length users can set their password to.
 $conf['min_password_length'] = 1;

 // Width of the left frame in pixels (object browser)
 $conf['left_width'] = 200;
 
 // Which look & feel theme to use
 $conf['theme'] = 'default';
 
 // Show OIDs when browsing tables?
 $conf['show_oids'] = false;
 
 // Max rows to show on a page when browsing record sets
 $conf['max_rows'] = 30;

 // Max chars of each field to display by default in browse mode
 $conf['max_chars'] = 50;

 // Send XHTML strict headers?
 $conf['use_xhtml_strict'] = false;

 // Base URL for PostgreSQL documentation.
 // '%s', if present, will be replaced with the PostgreSQL version
 // (e.g. 8.4 )
 $conf['help_base'] = 'http://www.postgresql.org/docs/%s/interactive/';
 
 // Configuration for ajax scripts
 // Time in seconds. If set to 0, refreshing data using ajax
 // will be disabled (locks and activity pages)
 $conf['ajax_refresh'] = 3;

 /** Plugins management
 * Add plugin names to the following array to activate them
 * Example:
 * $conf['plugins'] = array(
 * 'Example',
 * 'Slony'
 * );
 */
 $conf['plugins'] = array();

 /*****************************************
 * Don't modify anything below this line *
 *****************************************/

 $conf['version'] = 19;

?>

config.inc.php-dist

This is a backup copy of the main configuration file config.inc.php.

pg_hba.conf

Client authentication in PostgreSQL is controlled by a configuration file, which traditionally is named pg_hba.conf (HBA stands for host-based authentication). I named this configuration file PostgreSQL in the Synology Config File Editor. The original content is shown below :


# TYPE DATABASE USER ADDRESS        METHOD
local  all      all                 trust
host   all      all  127.0.0.1/32   trust
host   all      all  ::1/128        trust

A backup copy with these default values is stored in the /etc.defaults/ folder.
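
To review such a file quickly, the records can be parsed into fields. The following Python sketch is a simplified reader, not PostgreSQL's own parser: it assumes the column layout shown above (no address column for local records, address in CIDR form, no quoted fields or trailing options), and the function names are hypothetical.

```python
def parse_pg_hba(text):
    """Parse pg_hba.conf text into a list of record dicts."""
    records = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if fields[0] == "local":
            # Unix-domain socket records have no address column.
            if len(fields) < 4:
                raise ValueError("incomplete record: " + raw)
            conn_type, database, user, method = fields[:4]
            address = None
        else:
            if len(fields) < 5:
                raise ValueError("incomplete record: " + raw)
            conn_type, database, user, address, method = fields[:5]
        records.append({"type": conn_type, "database": database,
                        "user": user, "address": address, "method": method})
    return records


def trust_records(records):
    """Records that let clients connect without any password."""
    return [r for r in records if r["method"] == "trust"]


SAMPLE = """\
# TYPE DATABASE USER ADDRESS        METHOD
local  all      all                 trust
host   all      all  127.0.0.1/32   trust
host   all      all  ::1/128        trust
"""

if __name__ == "__main__":
    for record in trust_records(parse_pg_hba(SAMPLE)):
        print(record)
```

All three default records use the trust method, which is why a login without password works from the machine itself.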

pg_ident.conf

The configuration file pg_ident.conf is used to map the operating system user name to a database user name if an external authentication system is involved. In the Synology setup this file and the backup copy stored in the /etc.defaults/ folder are empty.

postgresql.conf

The original content is shown below :


hba_file = '/etc/postgresql/pg_hba.conf'
ident_file = '/etc/postgresql/pg_ident.conf'

external_pid_file = '/run/postgresql/postmaster.pid'

listen_addresses = '127.0.0.1'
max_connections =64

shared_buffers = 24MB

log_destination = 'syslog'
syslog_ident = 'postgres'
client_min_messages = notice
log_min_messages = warning
log_min_error_statement = error
log_min_duration_statement = -1

track_activities = off
track_counts = off

autovacuum = off

datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

escape_string_warning = off
synchronize_seqscans = off

standard_conforming_strings = off

A backup copy with these default values is stored in the /etc.defaults/ folder.
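
To compare the active file with the backup copy, both can be read into dictionaries. The following Python sketch is a simplified reader under assumptions: one setting per line, no # character inside values, and no include directives; the function name is hypothetical.

```python
import re

# name, optional "=", then the value (the "=" may be written without spaces)
_SETTING = re.compile(r"^([A-Za-z_][A-Za-z0-9_.]*)\s*=?\s*(.*)$")


def parse_postgresql_conf(text):
    """Return the settings of a postgresql.conf text as a dict of strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        match = _SETTING.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if len(value) >= 2 and value[0] == value[-1] == "'":
            value = value[1:-1]  # remove the surrounding single quotes
        settings[key] = value
    return settings


def changed_settings(active, backup):
    """Keys whose values differ between two parsed files."""
    keys = set(active) | set(backup)
    return sorted(k for k in keys if active.get(k) != backup.get(k))


if __name__ == "__main__":
    sample = "listen_addresses = '127.0.0.1'\nmax_connections =64\n"
    print(parse_postgresql_conf(sample))
```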

Configuration

Out of the box with the default configuration parameters, the login to the PostgreSQL database with the phpPgAdmin app works with the username postgres and an empty password.

After an update or upgrade of the DSM operating system, the phpPgAdmin webpage (http://yourdomain/phpPgAdmin/) is usually no longer accessible. You must reinstall the 3rd party phpPgAdmin installation package with the following steps :

  1. uninstall phpPgAdmin
  2. set the trust level in the Package Center settings so that 3rd party packages can be installed
  3. manually install the latest phpPgAdmin package
  4. check the configuration files with the configuration editor (mainly the extra_login_security parameter, which I set to false in the config file named phpPgAdmin)
  5. start the phpPgAdmin package in the Package Center
  6. go to the phpPgAdmin webpage, enter the default login credentials and verify your databases

PostgreSQL database management with phpPgAdmin on Synology

The last update of the Synology System was done on November 17, 2015, followed by a new installation of phpPgAdmin.

Export

To export the Photo PostgreSQL database for backup purposes, I select the photo database, click the “Export” tab in the menu bar, select “Structure and Data” with the format “SQL”, choose the option “download” and finally click the “Export” button. The file is saved under the name “dump.sql” in the standard local download folder.

Export PostgreSQL Synology photostation database with phpPgAdmin

WordNet and ImageNet

WordNet

WordNet is a large lexical database for the English language, a combination of dictionary and thesaurus. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet is accessible to human users via a web browser, but its primary use is in automatic natural language processing and artificial intelligence applications.
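
The idea of synsets linked by relations can be pictured with a very small data structure. The following Python sketch uses hand-written toy data in the spirit of WordNet; the entries, identifiers and function names are illustrative assumptions and were not extracted from the WordNet database.

```python
# Each toy synset groups words with the same meaning and points to a
# more general synset (its hypernym), or to None at the top.
SYNSETS = {
    "pedestrian": {"words": ["pedestrian", "walker"], "hypernym": "traveler"},
    "traveler": {"words": ["traveler", "traveller"], "hypernym": "person"},
    "person": {"words": ["person", "individual"], "hypernym": None},
}


def synonyms(word):
    """Other words that share a synset with the given word."""
    found = {w for s in SYNSETS.values() if word in s["words"] for w in s["words"]}
    return sorted(found - {word})


def hypernym_chain(synset_id):
    """Follow the hypernym links up to the most general synset."""
    chain = []
    current = SYNSETS[synset_id]["hypernym"]
    while current is not None:
        chain.append(current)
        current = SYNSETS[current]["hypernym"]
    return chain


if __name__ == "__main__":
    print(synonyms("pedestrian"))
    print(hypernym_chain("pedestrian"))
```

The real database stores several relation types per synset; the toy structure keeps only one of them to show the principle.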

The database (lexicographer files) and software tools (a compiler called grind and a reverse morphology program called morphy) have been released under a BSD-style license and are freely available for download from the WordNet website. The database contains about 160,000 words, organized in about 120,000 synsets, for a total of about 200,000 word-sense pairs (see detailed statistics). The current version 3.1 has a size of about 12 MB in compressed form.

WordNet was created in the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George Armitage Miller, starting in 1985, and has been directed in recent years by Christiane Fellbaum.

Christiane Fellbaum, together with Piek Vossen, founded the Global WordNet Association in 2000.

Global WordNet Association

GWA (Global WordNet Association) is a free, public and non-commercial organization that provides a platform for discussing, sharing and connecting wordnets for all languages in the world. A list of wordnets in other languages is published on the GWA website. Wordnets of the neighbouring countries of Luxembourg are listed hereafter :

The first GWA conference (GWC2002) was organized in January 2002 in Mysore, India. The most recent conference (GWC2014) was organized in Tartu, Estonia.

A major project of the GWA is the creation of a completely free worldwide wordnet grid, built around a shared set of concepts, such as the Common Base Concepts, and the Suggested Upper Merged Ontology (SUMO) owned by the IEEE.

SUMO

The Suggested Upper Merged Ontology (SUMO) and its domain ontologies form the largest formal public ontology in existence today. They are being used for research and applications in search, linguistics and reasoning. SUMO is the only formal ontology that has been mapped to all of the WordNet lexicons. The Technical editor of SUMO is Adam Pease.

WordNet Relations

Verena Heinrich from the University of Tübingen created a few images for GermaNet which visualize examples of WordNet relations. These copyrighted pictures are used here with permission.

Antonymy

WordNet Antonymy

Synonymy

WordNet Synonymy

Pertainymy

WordNet Pertainymy

Hypernymy

WordNet Hypernymy

Meronymy

WordNet Meronymy

Holonymy

WordNet Holonymy

Association

WordNet Association

Multiple Relations

WordNet Multiple Relations

WordNet Search Results

The following figures show the results of WordNet searches for the term

pedestrian = piéton = Fussgänger

Online WordNet Search at Princeton University

Online Search at WoNeF – WordNet du Français

ImageNet

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images (an average of over five hundred images per node).

ImageNet does not own the copyright of the images. ImageNet only provides thumbnails and URLs of images, in a way similar to what image search engines do, by compiling an accurate list of web images for each synset of WordNet. The list is freely available.

ImageNet provides the download of SIFT (Scale-Invariant Feature Transform) features, of object bounding boxes for about 1 million pictures and of object attributes, both annotated and verified through Amazon Mechanical Turk.

ImageNet is managed by a research team from the universities of Stanford, Princeton, Michigan and North Carolina. The project is sponsored by the Stanford Vision Lab, Stanford University, Princeton University, Google Research and A9, a subsidiary of Amazon.com based in Palo Alto, California, that develops search and advertising technology.

The following figure shows the results of the search for pedestrian in the ImageNet database.

ImageNet result page for a “pedestrian” query

ImageNet pictures (1,518) for the synset “pedestrian crossing, zebra crossing”

For comparison, the results of a Google Image Search for the same term pedestrian are shown below :

Google Image Search for pedestrian

Since 2010 (ILSVRC2010), the ImageNet team has organized an annual challenge to measure improvements in the state of machine vision technology.

Large Scale Visual Recognition Challenge

The Large Scale Visual Recognition Challenge is based on pattern recognition software that can be trained to recognize objects in digital images and is made possible by the ImageNet database.

In 2012 (ILSVRC2012) the contest was won by Geoffrey E. Hinton, a cognitive scientist at the University of Toronto, and his students Alex Krizhevsky and Ilya Sutskever. All three joined Google in 2013.

In 2014 (ILSVRC2014), the challenge drew 38 entrants from 13 countries. The groups used advanced software, in most cases modeled loosely on biological vision systems, to detect, locate and classify a huge set of images taken from Internet sources. Contestants ran their recognition programs on high-performance computers, in many cases based on specialized processors called GPUs (graphics processing units). All of the entrants used a variant of an approach known as a convolutional neural network, an approach first refined in 1998 by Yann LeCun, a French computer scientist who recently became director of artificial intelligence research at Facebook.

The results of the 2014 challenge have been published at the ImageNet website.

Raspberry Pi

Last update : January 23, 2016

The Raspberry Pi is a single-board nano-computer with an ARM processor, developed by the Raspberry Pi Foundation. This credit-card-sized computer is intended to encourage the learning of computer programming; it runs several variants of the free GNU/Linux operating system and compatible software. Only the bare board is supplied, without case, power supply, memory card, keyboard, mouse or screen, in order to reduce costs and to allow the reuse of salvaged hardware.

Raspberry Pi model B computer

Raspberry Pi

Raspberry Pi computers and related accessories are available from various distributors and resellers, among them Amazon. At the end of 2015, four models were available : Pi 1 A+, Pi 1 B+, Pi 2 B and Pi Zero. Earlier, there were also the models Pi 1 A and Pi 1 B.

The official documentation is available on the foundation's website. I own a set of Raspberry Pi model B rev. 2 (2011.12) computers, with Raspicam camera modules and Bright Pi 1.0 kits. The main characteristics are presented below :

Computer :

  • System on a chip (SoC) processor : Broadcom BCM2835, 700 MHz (ARM, armhf distribution)
  • RAM : 512 MB
  • Memory card : full-size SD
  • Ethernet : 10/100 Mbit/s
  • HDMI ports : 1
  • USB 2.0 ports : 2
  • Composite video output : RCA (cinch)
  • Audio output : 3.5 mm audio jack
  • Camera interface CSI-2 : 1
  • LCD display interface DSI : 1
  • Expansion header pins : 26
  • Number of GPIO (general purpose input/output) pins : 17
  • Power supply : micro-USB, 5 V, 700 mA

Camera :

  • Sensor : OmniVision OV5647
  • Resolution : 2,592 x 1,944 pixels
  • Focus : fixed, >= 1 m
  • Video : 1080p30, 720p60 and 640x480p90

Bright Pi v1.0 :

  • I2C interface
  • 4 bright white LEDs
  • 8 infrared LEDs

Raspbian

As I already use Debian on another computer, I decided to install the Debian version adapted to the Raspberry Pi, called Raspbian, on my nano-computers.

Reformatted 8 GB memory card

The Raspbian operating system, the other software and the data are stored on an 8 GB class 10 SD memory card. As my SD card was not blank, but had been formatted on another system, the usual reformatting procedure did not work.

I proceeded as follows in the Windows command prompt (disk 1 was the SD card on my machine; check the output of list disk before selecting) :

diskpart
list disk
select disk 1
clean
create partition primary
format fs=FAT32 quick
assign

To install the operating system on the memory card, I used the Win32DiskImager tool. The most recent version is 0.9.5. Instructions on how to proceed are available on the website of the Raspberry Pi organization.

Win32DiskImager tool

Raspicam

I connected the camera module (Raspicam) to the Raspberry Pi board following the set-up instructions given on the Raspberry Pi website. The characteristics are listed in the technical details. The camera was mounted on a dedicated holder.

Raspicam with holder

Libraries for controlling the camera are available for bash and for Python. The three basic command-line tools for controlling the camera are :

The details of the commands are described in the API of the Raspicam module.

Bright Pi

Bright-Pi module

Bright Pi is a lighting kit to add to the Raspicam camera, developed by Pi Supply. The module includes 4 powerful white Cree LEDs and 8 Liteon infrared LEDs.

I was a backer of the Bright Pi project on Kickstarter.

The components of the module are supplied separately; you have to assemble and solder them yourself. The assembly and programming instructions are available on the Pi Supply website.

Bright Pi uses the I2C bus to exchange data with the Raspberry Pi by means of the Semtech SC620 chip (see datasheet). To enable the I2C bus in Raspbian, the two lines

i2c-bcm2708
i2c-dev

must be added at the end of the file /etc/modules. The nano editor on the Raspberry Pi can be used for this. To save the modified file, press <Ctrl>+o, then <Ctrl>+x to quit the editor. The new modules are taken into account at the next reboot. To install the I2C tools, enter the commands

pi@raspberrypi ~ $ sudo apt-get install python-smbus
pi@raspberrypi ~ $ sudo apt-get install i2c-tools

To see the connected modules, enter the command

pi@raspberrypi ~ $ sudo i2cdetect -y 1

Detection of the modules connected to the I2C bus

Only one module, with the address 0x70, is connected : the Bright Pi. To control the Bright Pi module, use the command

i2cset [-y] i2cbus chip-address data-address value

The parameters are :

  • -y : option to disable the interactive mode (no confirmation required)
  • i2cbus : 1
  • chip-address : 0x70
  • data-address : 0x00 LEDs on/off; 0x01, 0x03, 0x06, 0x08 dimming of the IR LED pairs; 0x02, 0x04, 0x05, 0x07 dimming of the white LEDs; 0x09 gain register
  • value : dimming values : 6-bit multipliers; gain values : 0000 = 31.25 microamperes; 1111 = 500 microamperes; max = 25 milliamperes per LED

Some examples are shown below :

white LED1 12 mA : sudo i2cset -y 1 0x70 0x02 ...
white LED2 1 mA :
white LED3 
...
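
The register map and current values listed above can be turned into a small helper. The following Python sketch is only an illustration: it assumes that the gain steps are linear between the two quoted values (31.25 microamperes for 0000, 500 microamperes for 1111), which the list does not state explicitly, and the function names are hypothetical. It only builds the i2cset command line and estimates the resulting current; it does not access the I2C bus.

```python
GAIN_STEP_UA = 31.25    # gain code 0000 = 31.25 microamperes (value quoted above)
MAX_CURRENT_MA = 25.0   # per-LED limit quoted above


def led_current_ma(multiplier, gain_code):
    """Estimate the LED current in mA for a 6-bit multiplier and a 4-bit gain code.

    Assumption: the gain grows linearly, so code 1111 gives 16 x 31.25 = 500 microamperes.
    """
    if not 0 <= multiplier <= 63:
        raise ValueError("multiplier must fit in 6 bits")
    if not 0 <= gain_code <= 15:
        raise ValueError("gain code must fit in 4 bits")
    step_ua = GAIN_STEP_UA * (gain_code + 1)
    return min(multiplier * step_ua / 1000.0, MAX_CURRENT_MA)


def i2cset_command(register, value, bus=1, chip=0x70):
    """Build the i2cset command line for one Bright Pi register write."""
    return "sudo i2cset -y {} 0x{:02x} 0x{:02x} 0x{:02x}".format(bus, chip, register, value)


# White LED on register 0x02, multiplier 24 at maximum gain
print(led_current_ma(24, 15))
print(i2cset_command(0x02, 24))
```

Under these assumptions, a multiplier of 24 at the maximum gain corresponds to 24 x 500 microamperes = 12 mA.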

Keyboard

As power supply, I use a tablet charger that delivers 2 amperes at 5 volts. At the first power-up, the Raspberry configures itself automatically. However, it does not recognise the layout of my Luxembourgish QWERTZ keyboard (that is, Swiss French), which has to be changed manually as follows :

pi@raspberrypi ~ $ sudo raspi-config
raspi-config

Menu 4 (International Options) gives access to the selections “Change Locale”, “Change Timezone” and “Change Keyboard Layout”. My Microsoft Wired Keyboard 600 is not among the keyboards in the drop-down list of keyboard models. I chose the Generic 105-key (Intl) PC keyboard. A Luxembourgish layout is not listed; the Swiss French keyboard layout is found among the German layouts. You just have to know it !

Menu 4 also allows changing the time zone (Luxembourg is in the list of European countries) and the display language, for example to switch to French.

Remote Desktop

Before entering commands, you have to log in to the system. The default login parameters are : user name = pi ; password = raspberry. The IP address assigned by DHCP is 192.178.1.60. To be able to control this Raspberry, and later its siblings, from my computers connected to the local network, I installed a service for Microsoft's RDP (Remote Desktop Protocol) on the Raspberry Pi :

pi@raspberrypi ~ $ sudo apt-get install xrdp

After a restart of the Raspberry (command sudo reboot or sudo poweroff), the xrdp sesman server is configured to start automatically at every power-up.

On Windows computers, the RDP client is available by default and can be started from the menu Accessories/Remote Desktop Connection.

Microsoft Remote Desktop Connection

Once the connection is established, the Raspberry returns a warning

RDP warning

and then the following login window (module sesman-Xvnc) :

XRDP Login Window

Here we discover the next bug. The Swiss French keyboard of my Windows PC is not supported by the xrdp service; it is interpreted as an English keyboard. The password must therefore be typed as raspberrz instead of raspberry for the login to succeed. You just have to know it !

Swiss French keyboard support in xrdp

Linux xrdp uses by default the keymap file /etc/xrdp/km-0409.ini (US English) if a non-existent keymap file is requested. It seems that the Windows RDP client requests the unavailable keymap 20409, regardless of the connected keyboard. The only remedy is to replace, in Raspbian, the file km-0409.ini with the content of the keymap file of the keyboard in use. In my case this is the file km-100c.ini (see list of keymaps).

This file can be generated with the command

pi@raspberrypi ~ $ sudo xrdp-genkeymap /etc/xrdp/km-100c.ini

Caution : you must first switch to graphical mode with the command startx and open LXTerminal to run the command; otherwise you get the error message “unable to open display”.

Click on the file km-100c.ini to view its content. The codes of a few keys that are not recognised correctly still have to be modified. These are in particular the <alt-gr> combinations and the cursor keys :

  • @
  • #
  • ~
  • ¢
  • ¬
  • up

The dead keys work correctly.

Finally, simply rename the file km-0409.ini to km-0409-old.ini and copy the file km-100c.ini to km-0409.ini. You just have to know it !
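
The rename-and-copy step can also be scripted. The following Python sketch is a minimal illustration, not part of xrdp: the function name install_keymap is hypothetical, and on the Raspberry Pi it would have to be run with root privileges against /etc/xrdp.

```python
import shutil
from pathlib import Path


def install_keymap(xrdp_dir, source="km-100c.ini",
                   default="km-0409.ini", backup="km-0409-old.ini"):
    """Keep the default keymap as a backup, then copy the wanted keymap over it."""
    xrdp_dir = Path(xrdp_dir)
    default_path = xrdp_dir / default
    if default_path.exists():
        # Rename km-0409.ini to km-0409-old.ini
        default_path.rename(xrdp_dir / backup)
    # Copy km-100c.ini to km-0409.ini (the source file is kept)
    shutil.copyfile(xrdp_dir / source, default_path)
    return default_path


# Example call (requires root): install_keymap("/etc/xrdp")
```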

Applications

After the user's identity has been verified, the Raspberry desktop appears on the screen and responds to keyboard and mouse commands.

Raspberry Pi's Desktop

The programs and applications installed by default on the Raspberry Pi are :

I intend to add the following programs :

To enable the camera, go to menu 5 (Enable Camera) of raspi-config. To test the camera and take a first photo, enter the command

pi@raspberrypi ~ $ raspistill -o camtest1.jpg

If the camera is mounted upside down, the image must be rotated by 180 degrees. This is easily done with the options “vertical flip” and “horizontal flip”.

pi@raspberrypi ~ $ raspistill -vf -hf -o camtest2.jpg

Links

The following links provide additional information about Raspberry projects and the XRDP keyboard :

Spectrograms and speech processing

Last update : July 24, 2022

Spectrograms are visual representations of the spectrum of frequencies in a sound or other signal as they vary with time (or with some other variable). Spectrograms can be used to identify spoken words phonetically. The instrument that generates a spectrogram is called a spectrograph.

A spectrogram can be approximated with a filterbank, that is, a series of bandpass filters, or calculated from the time signal using the Fast Fourier Transform (FFT).
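
To make the transform-based approach concrete, here is a minimal Python sketch using only the standard library. It cuts the signal into overlapping frames, multiplies each frame by a window function (a Hann window is assumed here) and computes the magnitude spectrum of each frame. A naive DFT replaces the FFT for readability; the frame length, the hop size and the test tone are arbitrary choices for illustration.

```python
import cmath
import math


def hann(n):
    """Hann window of length n (zero at both ends)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..n/2 of one frame (naive O(n^2) DFT, not an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_len=64, hop=32):
    """Return one magnitude spectrum per overlapping, windowed frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        frames.append(dft_magnitudes(frame))
    return frames


# Test signal: a 1 kHz sine sampled at 8 kHz
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(256)]
spec = spectrogram(tone)
# Index of the strongest frequency bin in the first frame
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames, peak at", peak_bin * rate / 64, "Hz")
```

Each row of spec is one time slice; plotting the rows side by side with the magnitude mapped to a colour gives the spectrogram image.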

FFT is an algorithm to compute the Discrete Fourier Transform (DFT) and its inverse. A significant parameter of the DFT is the choice of the window function. In signal processing, a window function is a mathematical function that is zero-valued outside of some chosen interval. The following window functions are common for spectrograms :

I recorded a sound file, example.wav, with my name spoken three times, to use as a test file for different spectrogram programs.

Real-Time Spectrogram Software

There are some great software programs to compute a spectrogram for speech analysis, in real time or from recorded sound files :

  • Javascript Spectrogram
  • Wavesurfer
  • Spectrogram16
  • SFS / RTGRAM
  • Audacity
  • RTS
  • STRAIGHT
  • iSound

Javascript Spectrogram

Jan Schnupp, sensory neuroscientist, former Professor at the Department of Physiology, Anatomy and Genetics within the Division of Medical Sciences at the University of Oxford, developed an outstanding JavaScript program to calculate and display a real-time spectrogram in a webpage, from the input of the computer’s microphone. It requires a browser that supports HTML5 and Web Audio, as well as WebRTC, which is supported in recent versions of the Chrome, Firefox and Opera browsers. WebRTC is a free, open project that enables web browsers with Real-Time Communications (RTC) capabilities via simple JavaScript APIs.

JavaScript real-time spectrogram with 3x voice input “Marco Barnig” by microphone

Jan Schnupp is currently Professor of Neuroscience at the City University of Hong Kong. He is also the author of the website howyourbrainworks.net, which offers free, accessible introductory online lecture courses on neuroscience.

WaveSurfer

WaveSurfer is an open source multiplatform tool for sound visualization and manipulation. Typical applications are speech/sound analysis and sound annotation/transcription. WaveSurfer may be extended by plug-ins as well as embedded in other applications. A comprehensive user manual and numerous tutorials for Wavesurfer are available on the net.

WaveSurfer was developed at the Centre for Speech Technology (CTT) at the KTH Royal Institute of Technology in Sweden. The latest stable Windows release (1.8.8p6, May 7, 2020) and the source code of WaveSurfer can be downloaded from Sourceforge. The authors of WaveSurfer are Jonas Beskow and Kåre Sjölander.

WaveSurfer auto-calculated, auto-sized spectrogram

Right-clicking in the WaveSurfer pane opens a pop-up window with menus to add more panes, to customize the configuration and to change the analysis parameters. In the following rendering, the panes Waveform, Pitch Contour, Formant Plot and Transcription have been added to the spectrogram pane and to the Time Axis pane. The spectrogram frequency range was cut at 5 kHz.

WaveSurfer customized

Two other panes can be selected: Power Plot and Data Plot. Additional specific panes can be created with plugins.

Wavesurfer uses the Snack Sound Toolkit created by Kåre Sjölander. There exist other software programs with the name Wavesurfer, for example wavesurfer.js, a customizable waveform audio visualization tool, built on top of Web Audio API and HTML5 Canvas by katspaugh.

Spectrogram16

Spectrogram16 is a calibrated, dual channel audio spectrum analyzer for Windows that can provide either a scrolling time-frequency display or a spectrum analyzer scope display in real time for any sound source connected to the sound card. A detailed user guide (51 pages) is included with the program.

Spectrogram16 customized

The tool was created by Richard Horne, the founder of Visualization Software LLC. The company closed a few years ago. The Wayback Machine shows that Richard Horne announced in 2008 that version 16 of Spectrogram was now freeware (see also local copy). The software is still available from most free software download websites. Richard Horne, MS, who retired as a civilian electrical engineer for the Navy, was a member of the management team of Vocal Innovations.

The Spectrogram program was (and is still) appreciated by amateur radio operators for aligning ham receivers.

SFS / RTGRAM

RTGRAM is a free Windows program for displaying a real-time scrolling spectrographic display of an audio signal. With RTGRAM you can monitor the spectro-temporal characteristics of sounds being played into the computer’s microphone or line input ports. RTGRAM is optimised for speech signals and has options for different sampling rates, analysis bandwidths (wideband = 300 Hz, narrowband = 45 Hz), temporal resolution (time per pixel = 1 – 10 ms), dynamic range (30 – 70 dB) and colour maps.

RTGRAM real-time spectrogram with 3x voice input “Marco Barnig” by microphone

The current version of RTGRAM is 1.3, released in April 2010. It is part of the Speech Filing System (SFS) tools for speech research.

RTGRAM is free but not public domain software; its intellectual property is owned by Mark Huckvale, University College London.

Audacity

Audacity is free, open-source, cross-platform software for recording and editing sounds. Audacity was started in May 2000 by Dominic Mazzoni and Roger Dannenberg at Carnegie Mellon University. The current version is 3.0.3, released on July 26, 2021.

Audacity auto-calculated, auto-sized spectrogram

Extensive documentation for Audacity, with manuals, tutorials, tips, wikis and FAQs, is available in several languages.

RTS™

RTS (Real-Time Spectrogram) is a product of Engineering Design, founded in 1980 to address problems in instrumentation and measurement, physical acoustics, and digital signal analysis. Since 1984, Engineering Design has been the developer of the SIGNAL family of sound analysis software. RTS is highly integrated with SIGNAL.

STRAIGHT

STRAIGHT (Speech Transformation and Representation by Adaptive Interpolation of weiGHTed spectrogram) was originally designed to investigate human speech perception in terms of auditorily meaningful parametric domains. STRAIGHT is a tool for flexibly manipulating voice quality, timbre, pitch, speed and other attributes. The tool was invented by Hideki Kawahara while he was at the Advanced Telecommunications Research Institute International (ATR) in Japan. Hideki Kawahara is now Emeritus Professor at Wakayama University, Japan.

iSound

Irman Abdić created an audio tool (iSound) for displaying spectrograms in real time using Sphinx-4 as part of his thesis at the Faculty of Mathematics, Natural Sciences and Information Technologies (FAMNIT) from Koper, Slovenia.

Non-Real-Time Spectrogram Software

Other great software programs to create non-real-time spectrograms of recorded voice samples are :

  • Praat
  • SoX
  • SFS / WASP
  • Sonogram Visible Speech

Praat

Praat (= talk in Dutch) is a free scientific computer software package for the analysis of speech in phonetics. It was designed, and continues to be developed, by Paul Boersma and David Weenink of the Institute of Phonetic Sciences at the University of Amsterdam. Praat runs on a wide range of operating systems. The program also supports speech synthesis, including articulatory synthesis.

Praat displays two windows : Praat Objects and Praat Picture.

Praat Objects Window

Praat Picture Window

The spectrogram can also be rendered in a customized window.

Praat customized window

The current version 6.1.51 of Praat was released on August 25, 2021. The source code for this release is available on GitHub. Extensive documentation, with FAQs, tutorials, publications and user guides, is available for Praat. The plugins are located in the directory C:/Users/name/Praat/.

An outstanding plugin for Praat is EasyAlign. It is a user-friendly automatic phonetic alignment tool for continuous speech. It is possible to align speech from an orthographic or phonetic transcription. It requires a few minor manual steps and the result is a multi-level annotation within a TextGrid composed of phonetic, syllabic, lexical and utterance tiers. EasyAlign was developed by Jean-Philippe Goldman at the Department of Linguistics, University of Geneva.

SoX

SoX (Sound EXchange) is a free cross-platform command line utility that can convert various formats of computer audio files into other formats. It can also apply various effects to these sound files and play and record audio files on most platforms. SoX is called the Swiss Army knife of sound processing programs.

SoX is written in standard C and was created in July 1991 by Lance Norskog. In May 1996, Chris Bagwell started to maintain and release updated versions of SoX. Throughout its history, SoX has had many contributing authors. Today Chris Bagwell is still the main developer.

The current Windows distribution is 14.4.2, released on February 22, 2015. The source code is available at Sourceforge.

SoX provides a very powerful spectrogram effect. The spectrogram is rendered to a PNG image file and shows time on the x-axis, frequency on the y-axis and audio signal amplitude on the z-axis. The z-axis values are represented by the colour of the pixels in the x-y plane. The command

sox example.wav -n spectrogram

creates the following auto-calculated, auto-sized spectrogram :

SoX auto-calculated, auto-sized spectrogram

The main options to customize a spectrogram created with SoX are :


-x num : change the width of the spectrogram from its default value of 800px
-Y num : sets the total height of the spectrogram; the default value is 550px
-z num : sets the dynamic range from 20 to 180 dB; the default value is 120 dB
-q num : sets the z-axis quantisation (number of different colours)
-w name : select the window function; the default function is Hann
-l : creates a printer-friendly spectrogram with a light background
-a : suppress the display of the axis lines
-t text : set an image title
-c text : set an image comment (below and to the left of the image)
-o text : set the name of the output file; the default name is spectrogram.png
rate num k : analyse a small portion of the frequency domain (up to 1/2 num kHz)

A customized rendering follows :

Customized SoX spectrogram

The customized SoX spectrogram was created with the following command :

sox example.wav -n rate 10k spectrogram -x 480 -y 240 -q 4 -c "www.web3.lu" \
-t "SoX Spectrogram of the triple speech sound Marco Barnig"
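
When many sound files have to be processed, the command line can be assembled by a script. The following Python sketch is only an illustration: the function names are hypothetical, and it covers just a few of the options listed above (-x, -y, -t, -c, -o and the preceding rate effect). It assumes that sox is installed and on the PATH when render is called.

```python
import shutil
import subprocess


def spectrogram_command(infile, outfile="spectrogram.png", width=None,
                        height=None, rate=None, title=None, comment=None):
    """Build the argument list for a SoX spectrogram call."""
    cmd = ["sox", infile, "-n"]
    if rate:
        cmd += ["rate", rate]        # e.g. "10k" to limit the analysed band
    cmd.append("spectrogram")
    if width:
        cmd += ["-x", str(width)]
    if height:
        cmd += ["-y", str(height)]
    if title:
        cmd += ["-t", title]
    if comment:
        cmd += ["-c", comment]
    cmd += ["-o", outfile]
    return cmd


def render(cmd):
    """Run the command; raises if sox is missing or returns an error."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("sox was not found on the PATH")
    subprocess.run(cmd, check=True)


print(spectrogram_command("example.wav", width=480, height=240, rate="10k"))
```

Passing the arguments as a list avoids quoting problems with titles and comments that contain spaces.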

WASP

WASP is a free Windows program for the recording, display and analysis of speech. With WASP you can record and replay speech signals, save them and reload them from disk, edit annotations, and display spectrograms and a fundamental frequency track. WASP is a simple application that is complete in itself, but which is also designed to be compatible with the Speech Filing System (SFS) tools for speech research. The current version 1.80 was released in June 2020.

The following figure shows a customized WASP window with a speech waveform pane, a wideband spectrogram, a pitch track and annotations.

WASP customized spectrogram with pitch and annotation tracks

WASP is free but not public domain software; its intellectual property is owned by Mark Huckvale, University College London.

Sonogram Visible Speech

Sonogram Visible Speech is a very advanced program for sound, music and speech analysis. It provides multiple tools to perform various transformations and spectral studies on audio signals and to display the results in numerous panels : perspectogram, pitch, wavelet, cepstrum, 3D plots, auto-correlation charts etc.

In short, Sonogram is a powerful and complex audio spectrum analyzer with a comprehensive GUI layout.

Sonogram Visible Speech Main Window

Sonogram is programmed in Java and requires the Java Runtime, version 16 or later. It runs on Windows, macOS and Unix/Linux. The current version 5 was released on August 18, 2021. The source code is available on GitHub. The next figure shows the start of the program in a Linux terminal.

Program Start with Sonogram.sh

The following figures show the help, settings and info panels :

Sonogram Online Help

Sonogram Settings

Sonogram Detailed Info

Sonogram includes a 3D-chart to present processed sound signals in three dimensions and a convenient audio recorder.

3D Perspectogram

Sonogram Audio Recorder

Sonogram Visible Speech was programmed from 2000 to 2021 by Christoph Lauer. When he started the project, he worked at the DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz) in Saarbrücken. In December 2007 he joined Saarland University as a scientific assistant and, two years later, the IDMT (Fraunhofer Institute for Digital Media Technology) as an audio DSP researcher. Since 2014 Christoph Lauer has worked as a machine learning researcher for the BMW Group.

Specific Spectrogram Software

Spectrograms can also be used for teaching, artistic or other curious purposes :

  • FaroSon
  • SpectroTyper
  • ImageSpectrogram

FaroSon

FaroSon (The Auditory Lighthouse), is a Windows program for the real-time conversion of sound into a coloured pattern representing loudness, pitch and timbre. The loudness of the sound is reflected in the brightness and saturation of the colours. The timbre of the sound is reflected in the colours themselves: sounds with predominantly bass character have a red colour, while sounds with a predominantly treble character have a blue colour. The pitch of the sound is reflected in the horizontal banding patterns: when the pitch of the sound is low, then the bands are large and far apart, and when it is high, the bands are narrow and close together. If the pitch of the sound is falling you see the bands diverge; when it is rising, you see the bands converge.

FaroSon

FaroSon is free but not public domain software; its intellectual property is owned by Mark Huckvale, University College London.

SpectroTyper

AudioCheck offers the Internet’s largest collection of online sound tests, test tones, and tone generators. AudioCheck provides a unique online tool called SpectroTyper to insert plain text into a .wav sound file. The downloaded file plays as cool-sounding computer-like tones and is secretly readable from a spectrogram view (a linear frequency scale works best). It can be used for fun, to hide easter eggs in a music production or to tag copyrighted audio material with your own identifiers or source information.

As an example, here is the sound file barnig_txt.wav with my name embedded. The result is shown below in the SoX spectrogram, created with the command :

sox barnig_txt.wav -n rate 10k spectrogram -x 480 -y 120

SoX Spectrogram of a sound with inserted text, synthesized with SpectroTyper

SpectroTyper and other audio tools and tone generators have been created by Stéphane Pigeon, a research engineer and sound designer from Belgium. He received his degree in electrical engineering from the Université Catholique de Louvain (UCL) in June 1994, with a specialization in signal processing. He completed a PhD thesis in applied science in 1999. Then, Stéphane Pigeon joined the Royal Military Academy as a part-time researcher. In parallel, he worked as a consultant, exclusively for Roland Corporation, in the area of the musical instrument market. He designed various audio-related websites, like AudioCheck.net, started in 2007. He also released some iOS apps. His most successful project is myNoise.net, started in 2013, which offers a unique collection of online noise generators.

ImageSpectrogram

Richard David James, best known by his stage name Aphex Twin, is a British electronic musician and composer. In 1999, he released Windowlicker as a single on Warp Records. In this record he synthesized his face as a sound, only viewable in a spectrogram.

Gavin Black (alias plurSKI) created a Perl script to do the same : take a digital picture and convert it into a wave file. Creating a spectrogram of that file then reproduces the original picture.
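
The principle can be sketched in a few lines of Python. This is not Gavin Black's Perl script, only a minimal illustration under simple assumptions: each image column becomes a short time slice, each image row is mapped to one sine frequency (top row = highest frequency), and the pixel brightness (0.0 to 1.0) sets the amplitude of that sine. All names, the frequency range and the slice length are arbitrary choices.

```python
import math
import struct
import wave


def image_to_samples(pixels, rate=8000, column_samples=400,
                     f_min=500.0, f_max=3500.0):
    """Convert a brightness matrix (list of rows, top row first) to audio samples."""
    rows = len(pixels)
    cols = len(pixels[0])
    # One sine frequency per image row, top row mapped to the highest frequency
    freqs = [f_max - (f_max - f_min) * r / max(rows - 1, 1) for r in range(rows)]
    samples = []
    for c in range(cols):
        for n in range(column_samples):
            t = (c * column_samples + n) / rate
            value = sum(pixels[r][c] * math.sin(2 * math.pi * freqs[r] * t)
                        for r in range(rows))
            samples.append(value / rows)   # normalise to the range -1..1
    return samples


def write_wav(path, samples, rate=8000):
    """Write the samples as a 16-bit mono wave file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))


# A tiny 2 x 2 "picture": a diagonal from top left to bottom right
picture = [[1.0, 0.0],
           [0.0, 1.0]]
picture_samples = image_to_samples(picture)
```

Calling write_wav("picture.wav", picture_samples) and rendering the file as a spectrogram should show a high tone followed by a low tone, that is, the diagonal of the picture.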

As an example, here is the sound file barnig_portrait.wav with my portrait embedded. The result is shown below in the SoX spectrogram, created with the command :

sox barnig_portrait.wav -n spectrogram -x 480 -y 480

SoX Spectrogram of a sound with inserted picture, synthesized with imageSpectrogram

On July 24, 2022, Scott Duplichan published an Audio SpectrumViewer for Windows on Sourceforge. During the development he used the wav-sample with my embedded portrait to test his realtime spectrum viewer. Scott found a converter to create a better image with a smaller wav-file.

The spectrum viewer app contains a demo folder with an audioFileImage subfolder where you can start batch-files to compare the original with the improved spectrum. The result with the new converter is shown in the following screen-shot:

Links

A list of links to websites providing additional information about spectrograms is presented below :

Mary TTS (Text To Speech)

Last update : January 5, 2017

MaryTTS is an open-source, multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of DFKI’s Language Technology Lab and the Institute of Phonetics at Saarland University. It is now maintained by the Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz GmbH).

Mary stands for Modular Architecture for Research in sYnthesis. The earliest version of MaryTTS was developed around 2000 by Marc Schröder. The current stable version is 5.2, released on September 15, 2016.

I installed Mary TTS on my Windows, Linux and Mac computers. On the Mac (OS X 10.10 Yosemite), version 5.1.2 of Mary TTS was placed on the desktop in the folders marytts-5.1.2 and marytts-builder-5.1.2. The Mary TTS server is started first by opening a terminal window in the folder marytts-5.1.2 with the following command :

marytts-5.1.2 mbarnig$ bin/marytts-server.sh

To start the Mary TTS client with the related GUI, a second terminal window is opened in the same folder with the command :

marytts-5.1.2 mbarnig$ bin/marytts-client.sh

On Windows, the related scripts are marytts-server.bat and marytts-client.bat.

As the development version 5.2 of Mary TTS supports more languages and comes with toolkits for quickly adding support for new languages and for building unit selection and HMM-based synthesis voices, I downloaded a snapshot-zip-file from Github with the most recent source code. After unzipping, the source code was placed in the folder marytts-master on the desktop.

To compile Mary TTS from source on the Mac, the latest Java development kit (jdk-8u31-macosx-x64.dmg) and Apache Maven (apache-maven-3.2.5-bin.tar.gz), a software project management and comprehension tool, are required.

On Mac, Java is installed in

/Library/Java/JavaVirtualMachines/jdk1.8.0_31.jdk/Contents/Home/

and Maven is installed in

/usr/local/apache-maven/apache-maven-3.2.5

It is important to set the environment variables $JAVA_HOME, $M2_HOME and $PATH to the correct values (export statements in /Users/mbarnig/.bash_profile).
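
A quick way to catch a wrong setting before starting a long build is to check the variables from a script. The following Python sketch is only an illustration with a hypothetical function name; it assumes that the bin sub-folders of $JAVA_HOME and $M2_HOME are expected on the $PATH.

```python
import os


def check_build_env(env=None):
    """Return a list of problems found in JAVA_HOME, M2_HOME and PATH."""
    env = os.environ if env is None else env
    path_entries = env.get("PATH", "").split(os.pathsep)
    problems = []
    for var in ("JAVA_HOME", "M2_HOME"):
        home = env.get(var)
        if not home:
            problems.append(var + " is not set")
            continue
        bin_dir = os.path.join(home.rstrip("/"), "bin")
        if bin_dir not in path_entries:
            problems.append(bin_dir + " is not on the PATH")
    return problems


for problem in check_build_env():
    print(problem)
```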

The successful installation of Java and Maven can be verified with the commands :

mbarnig$ java -version
mbarnig$ mvn --version

Mary TTS : Maven and Java versions

This looks good!

In the next step I compiled the Mary TTS source code by running the command

marytts-master mbarnig$ mvn install

in the top-level marytts-master folder. This built the system, ran the unit and integration tests, packaged the code and installed it in the following folders :

marytts-master/target/marytts-5.2-SNAPSHOT
marytts-master/target/marytts-builder-5.2-SNAPSHOT

The build took 2:55 minutes and was successful, without errors or warnings.

Results of building MARYTTS 5.2 SNAPSHOT

The following modules have been compiled :

  • MaryTTS
  • marytts-common
  • marytts-signalproc
  • marytts-runtime
  • marytts-lang-de, en, te, tr, ru, it, fr, sv, lx (lx is a pseudo locale for a test language)
  • marytts-languages
  • marytts-client
  • marytts-builder
  • marytts-redstart
  • marytts-transcription
  • marytts-assembly with the sub-modules assembly-builder and assembly-runtime
  • voice_cmu_slt_hsmm

After checking the whole file structure, I started the Mary TTS 5.2 server

Mary TTS Snapshot 5.2 server

and the Mary TTS 5.2 client,

Mary TTS Snapshot 5.2 client

did some trials with text-to-audio conversion in the GUI window,

Mary TTS Client GUI

launched the Mary TTS 5.2 component installer

Mary TTS Component Installer

and finally installed some of the available French, German and English voices.

Mary TTS Voice Installer GUI

In the next step I will try to create my own voices and develop a voice for the Luxembourgish language.

In January 2017, I updated my systems with the stable MaryTTS version 5.2, which supports the Luxembourgish language.

eSpeak Formant Synthesizer

Last update : November 2, 2014

eSpeak

eSpeak is a compact multi-platform multi-language open source speech synthesizer using a formant synthesis method.

eSpeak is derived from the “Speak” speech synthesizer for British English for Acorn Risc OS computers, developed by Jonathan Duddington in 1995. He is still the author of the current eSpeak version 1.48.12 released on November 1, 2014. The sources are available on Sourceforge.

eSpeak provides two methods of formant synthesis : the original eSpeak synthesizer and a Klatt synthesizer. It can also be used as a front end for MBROLA diphone voices. eSpeak can be used as a command-line program or as a shared library. On Windows, a SAPI5 version is also installed. eSpeak supports SSML (Speech Synthesis Markup Language) and uses an ASCII representation of phoneme names which is loosely based on the Kirshenbaum system.

In formant synthesis, voiced speech (vowels and sonorant consonants) is created by using formants. Unvoiced consonants are created by using pre-recorded sounds. Voiced consonants are created as a mixture of a formant-based voiced sound and a pre-recorded unvoiced sound. The eSpeakEditor can generate formant files for individual vowels and voiced consonants, based on a sequence of keyframes which define how the formant peaks (peaks in the frequency spectrum) vary during the sound. A sequence of formant frames can be created with a modified version of Praat, a free scientific computer software package for the analysis of speech in phonetics. The Praat formant frames, saved in a spectrum.dat file, can be converted to formant keyframes with eSpeakEdit.

To use eSpeak on the command line, type

espeak "Hello world"

There are plenty of command line options available, for instance to load from file, to adjust the volume, the pitch, the speed or the gaps between words, to select a voice or a language, etc.
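
As an illustration of these options, the following Python sketch assembles an espeak command line. The function name is hypothetical; the flags used are -v (voice), -s (speed in words per minute), -p (pitch), -a (amplitude), -g (word gap), -f (text file to speak) and -w (write the output to a wav file). The sketch only builds the argument list and does not run espeak.

```python
def espeak_command(text=None, voice=None, speed=None, pitch=None,
                   amplitude=None, word_gap=None, text_file=None, wav_file=None):
    """Build the argument list for an espeak call from the given settings."""
    cmd = ["espeak"]
    options = (("-v", voice), ("-s", speed), ("-p", pitch), ("-a", amplitude),
               ("-g", word_gap), ("-f", text_file), ("-w", wav_file))
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    if text is not None:
        cmd.append(text)   # the text to speak comes last
    return cmd


print(espeak_command("Bonjour le monde", voice="fr", speed=140, pitch=60))
```

The resulting list can be passed to subprocess.run on a system where espeak is installed.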

To use the MBROLA voices in the Windows SAPI5 GUI or at the command line, they have to be installed during the setup of the program. It’s possible to rerun the setup to add additional voices. To list the available voices type

espeak --voices

eSpeak uses a master phoneme file containing the utility phonemes, the consonants and a schwa. The file is named phonemes (without extension) and is located in the espeak/phsource program folder. The vowels are defined in the language-specific phoneme files in text format. These files can also redefine consonants if you wish. The language-specific phoneme text files are located in the same espeak/phsource folder and must be referenced in the phonemes master file (see the example for Luxembourgish).

....
phonemetable lb base
include ph_luxembourgish

In addition to the specific phoneme file ph_luxembourgish (without extension), the following files are required to add a new language, e.g. Luxembourgish :

lb file (without extension) in the folder espeak/espeak-data/voices : a text file which in its simplest form contains only 2 lines :

name luxembourgish
language lb

lb_rules file (without extension) in the folder espeak/dictsource : a text file which contains the spelling-to-phoneme translation rules.

lb_list file (without extension) in the folder espeak/dictsource : a text file which contains pronunciations for special words (numbers, symbols, names, …).
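
The file layout described above can be checked with a short script before compiling. The following Python sketch is only an illustration with hypothetical function names; it lists the four source files named in the text for a language code and writes the minimal two-line voice file.

```python
from pathlib import Path


def language_files(espeak_root, code, phoneme_file):
    """Source files needed for a new language, as described in the text."""
    root = Path(espeak_root)
    return [
        root / "phsource" / phoneme_file,          # e.g. ph_luxembourgish
        root / "espeak-data" / "voices" / code,    # voice file, e.g. lb
        root / "dictsource" / (code + "_rules"),   # spelling-to-phoneme rules
        root / "dictsource" / (code + "_list"),    # pronunciations of special words
    ]


def missing_language_files(espeak_root, code, phoneme_file):
    """Return the required files that do not exist yet."""
    return [p for p in language_files(espeak_root, code, phoneme_file)
            if not p.is_file()]


def write_minimal_voice(espeak_root, code, name):
    """Write the simplest voice file: a name line and a language line."""
    voice = Path(espeak_root) / "espeak-data" / "voices" / code
    voice.parent.mkdir(parents=True, exist_ok=True)
    voice.write_text("name {}\nlanguage {}\n".format(name, code), encoding="utf-8")
    return voice
```

For Luxembourgish, missing_language_files("espeak", "lb", "ph_luxembourgish") would report which of the four files still have to be created.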

The eSpeakEditor (espeakedit.exe) can compile the lb_ files into an lb_dict file (without extension) in the folder espeak/espeak-data and add the new phonemes to the files phontab, phonindex and phondata in the same folder. These compiled files are used by eSpeak for the speech synthesis. The file phondata-manifest lists the type of data that has been compiled into the phondata file. The files dict_log and dict_phonemes provide information about the phonemes used in the lb_rules and lb_dict files.

eSpeak applies tunes to model intonations depending on punctuation (questions, statements, attitudes, interaction). The tunes (s.. = full-stop, c.. = comma, q.. = question, e.. = exclamation) used for a language can be specified by using a tunes statement in the voice file.

tunes s1  c1  q1a  e1

The named tunes are defined in the text file espeak/phsource/intonation (without extension) and must be compiled for use by eSpeak with the espeakedit.exe program (menu : Compile intonation data).

meSpeak.js

Three years ago, Matthew Temple ported the eSpeak program from C++ to JavaScript using Emscripten : speak.js. Based on this JavaScript project, Norbert Landsteiner from Austria created the meSpeak.js text-to-speech web library. The latest version is 1.9.6, released in February 2014.

meSpeak.js is supported by most browsers. It introduces loadable voice modules. The typical usage of the meSpeak.js library is shown below :

<!DOCTYPE html>
<html lang="en">
<head>
 <title>Bonjour le monde</title>
 <script type="text/javascript" src="mespeak.js"></script>
 <script type="text/javascript">
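 // Load the engine data and the French voice module before speaking.
 meSpeak.loadConfig("mespeak_config.json");
 meSpeak.loadVoice("voices/fr.json");
 // Called by the button in the page body.
 function speakIt() {
   meSpeak.speak("Bonjour le monde");
 }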
 </script>
</head>
<body>
<h1>Try meSpeak.js</h1>
<button onclick="speakIt();">Speak It</button>
</body>
</html>

Click here to test this example.

The mespeak_config.json file contains the data of the phontab, phonindex, phondata and intonations files and the default configuration values (amplitude, pitch, …). This data is encoded as a base64 octet stream. The voice.json file includes the id of the voice, the id of the dictionary used and the corresponding binary data (base64 encoded) of these two files. There are various desktop or online Base64 decoders and encoders available on the net to create the required .json files (base64decode.org, motobit.com, activexdev.com, …).
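Instead of an online encoder, the base64 step can also be scripted. The Python sketch below packs a voice file and a compiled dictionary into one JSON string. The key names (voice_id, dict_id, voice, dict) and the function name are assumptions for illustration; they should be compared with a voice file shipped with meSpeak.js before being relied upon.

```python
import base64
import json


def build_voice_json(voice_id, dict_id, voice_bytes, dict_bytes):
    """Pack a voice file and a compiled dictionary into a JSON string.

    The key names are assumptions; compare them with a voice module
    shipped with meSpeak.js.
    """
    payload = {
        "voice_id": voice_id,
        "dict_id": dict_id,
        "dict": base64.b64encode(dict_bytes).decode("ascii"),
        "voice": base64.b64encode(voice_bytes).decode("ascii"),
    }
    return json.dumps(payload)


# Stand-in bytes; real input would be read in binary mode from
# espeak-data/voices/lb and espeak-data/lb_dict.
voice_text = b"name luxembourgish\nlanguage lb\n"
print(build_voice_json("lb", "lb_dict", voice_text, b"\x00\x01\x02"))
```

Decoding the two base64 fields again with base64.b64decode returns the original bytes, which is an easy way to verify a generated file.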

meSpeak can mix multiple parts (different languages or voices) in a single utterance. meSpeak supports the Web Audio API (AudioContext) with internal wav files; Flash is used as a fallback.

Links

A list with links to websites providing additional information about eSpeak and meSpeak follows :

Language : fr, de, en, lb, eo

Last update : November 7, 2021

Language is the human capacity for acquiring and using complex systems of communication, and a language is any specific example of such a system. The scientific study of language is called linguistics.

In the context of a text-to-speech (TTS) and automatic-speech-recognition (ASR) project, I assembled the following information about the French, German, English, Luxembourgish and Esperanto languages.

French

French is a Romance language spoken worldwide by 340 million people. Written French uses the 26 letters of the Latin script, four diacritics appearing on vowels (circumflex accent, acute accent, grave accent, diaeresis) and the cedilla appearing in ç. There are two ligatures, œ and æ. The French language is regulated by the Académie française. The language codes are fr (ISO 639-1), fre, fra (ISO 639-2) and fra (ISO 639-3).

Spoken French distinguishes 26 vowels, plus 8 for Quebec French. There are 23 consonants. The Grand Robert lists about 100,000 French words.

German

German is a West Germanic language spoken by 120 million people. In addition to the 26 standard Latin letters, German has three vowels with umlauts and the letter ß called Eszett. German is the most widely spoken native language in the European Union. The German language is regulated by the Rat für deutsche Rechtschreibung. The language codes are de (ISO 639-1), ger, deu (ISO 639-2) and 22 variants in ISO 639-3.

Spoken German uses 29 vowels and 27 consonants. The 2013 release of the Duden lists about 140,000 German words.

English

English is a West Germanic language spoken by more than a billion people. It is an official language of almost 60 sovereign states and the third-most-common native language in the world. Written English uses the 26 letters of the Latin script, with rare optional ligatures in words derived from Latin or Greek. There is no regulatory body for the English language. The language codes are en (ISO 639-1) and eng (ISO 639-2 and ISO 639-3).

Spoken English distinguishes 25 vowels and 34 consonants, including the variants used in the United Kingdom and the United States. The Oxford English Dictionary lists more than 250,000 distinct words, not including many technical, scientific, and slang terms.

Luxembourgish

Luxembourgish (Lëtzebuergesch) is a Moselle Franconian variety of West Central German that is spoken mainly in Luxembourg by about 400,000 native speakers. The Luxembourgish alphabet consists of the 26 Latin letters plus three letters with diacritics: é, ä, and ë. In loanwords from French and German, the original diacritics are usually preserved. The Luxembourgish language is regulated by the Conseil Permanent de la Langue Luxembourgeoise (CPLL). The language codes are lb (ISO 639-1) and ltz (ISO 639-2 and ISO 639-3).

Spoken Luxembourgish uses 22 vowels (14 monophthongs, 8 diphthongs) and 26 consonants. The Luxembourgish-French dictionary dico.lu includes about 50,000 words; the Luxembourgish-German dictionary luxdico lists about 26,000 words. The full online Luxembourgish dictionary www.lod.lu is under construction; at present, words beginning with A to S can be accessed via the search engine.

Esperanto

Esperanto is a constructed international auxiliary language. Between 100,000 and 2,000,000 people worldwide fluently or actively speak Esperanto. Esperanto was recognized by UNESCO in 1954 and Google Translate added it in 2012 as its 64th language. The 28-letter Esperanto alphabet is based on the Latin script, using a one-sound-one-letter principle. It includes six letters with diacritics: ĉ, ĝ, ĥ, ĵ, ŝ (with circumflex), and ŭ (with breve). The alphabet does not include the letters q, w, x, or y, which are only used when writing unassimilated foreign terms or proper names. The language is regulated by the Akademio de Esperanto. The language codes are eo (ISO 639-1) and epo (ISO 639-2 and ISO 639-3).

Esperanto has 5 vowels, 23 consonants and 2 semivowels that combine with the vowels to form 6 diphthongs. The core vocabulary of Esperanto contains 900 roots which can be expanded into tens of thousands of words using prefixes, suffixes, and compounding.
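In a TTS or ASR project the language codes quoted above are typically needed as a lookup table. The following Python sketch collects the two-letter and three-letter codes of the five languages; the names LANGUAGE_CODES and find_language are assumptions made for this example.

```python
# ISO 639 codes of the five languages described above.
# Where ISO 639-2 has two forms, both are kept (bibliographic, terminologic).
LANGUAGE_CODES = {
    "fr": {"name": "French", "iso639_2": ("fre", "fra")},
    "de": {"name": "German", "iso639_2": ("ger", "deu")},
    "en": {"name": "English", "iso639_2": ("eng",)},
    "lb": {"name": "Luxembourgish", "iso639_2": ("ltz",)},
    "eo": {"name": "Esperanto", "iso639_2": ("epo",)},
}


def find_language(code):
    """Return the ISO 639-1 key for a two- or three-letter code, or None."""
    code = code.lower()
    for short, entry in LANGUAGE_CODES.items():
        if code == short or code in entry["iso639_2"]:
            return short
    return None


print(find_language("ltz"), find_language("FRA"), find_language("xx"))
```

The two-letter key returned by find_language matches the naming used for voice files such as lb or fr in the eSpeak examples.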

Links

A list with links to websites with additional information about the five languages (mainly Luxembourgish) is shown hereafter :

Phonemes, phones, graphemes and visemes

Phonemes

A phoneme is the smallest structural unit that distinguishes meaning in a language, studied in phonology (a branch of linguistics concerned with the systematic organization of sounds in languages). Linguistics is the scientific study of language. Phonemes are not the physical segments themselves, but are cognitive abstractions or categorizations of them. They are abstract, idealised sounds that are never pronounced and never heard. Phonemes are combined with other phonemes to form meaningful units such as words or morphemes.

A morpheme is the smallest meaningful (grammatical) unit in a language. A morpheme is not identical to a word, and the principal difference between the two is that a morpheme may or may not stand alone, whereas a word, by definition, is freestanding. The field of study dedicated to morphemes is called morphology.

Phones

Concrete speech sounds can be regarded as the realisation of phonemes by individual speakers, and are referred to as phones. A phone is a unit of speech sound in phonetics (another branch of linguistics that comprises the study of the sounds of human speech).  Phones are represented with phonetic symbols. The IPA (International Phonetic Alphabet) is an alphabetic system of phonetic notation based primarily on the Latin alphabet. It was created by the International Phonetic Association as a standardized representation of the sounds of oral language.

In IPA transcription phones are conventionally placed between square brackets and phonemes are placed between slashes.

English Word : make
Phonetics : [meik]
Phonology : /me:k/   /maik/   /mei?/

Each of the possible phones used to pronounce a single phoneme is called an allophone of that phoneme in phonology.
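As a small illustration of the relation between a phoneme and the phones that realise it, the Python sketch below stores a few commonly cited English allophones in a dictionary. The data is a simplified toy sample, not a complete inventory, and the names ALLOPHONES and phoneme_of are assumptions for the example.

```python
# Toy data: English /t/ and /l/ with some commonly cited allophones.
ALLOPHONES = {
    "t": {"tʰ", "t", "ɾ"},  # top, stop, (American English) water
    "l": {"l", "ɫ"},        # leaf, full
}


def phoneme_of(phone):
    """Return the phoneme(s) whose allophone set contains this phone."""
    return sorted(p for p, phones in ALLOPHONES.items() if phone in phones)


print(phoneme_of("ɾ"))
```

The lookup goes from the concrete sound to the abstract unit, which is the direction a speech recognizer has to take.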

Graphemes

Analogous to the phonemes of spoken languages, the smallest semantically distinguishing unit in a written language is called a grapheme. Graphemes include alphabetic letters, typographic ligatures, Chinese characters, numerical digits, punctuation marks, and other individual symbols of any of the world’s writing systems.

Grapheme examples


In transcription graphemes are usually notated within angle brackets.

<a>  <W>  <5>  <i>  <ق>
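The three notation conventions (square brackets for phones, slashes for phonemes, angle brackets for graphemes) are easy to tell apart in software. The following Python sketch classifies a delimited token; the names NOTATION and classify are assumptions for this example.

```python
# Delimiter pairs as described in the text: [phone], /phoneme/, <grapheme>.
NOTATION = {
    ("[", "]"): "phone",
    ("/", "/"): "phoneme",
    ("<", ">"): "grapheme",
}


def classify(token):
    """Return (kind, content) for a delimited token, or None if unrecognised."""
    if len(token) < 3:
        return None  # too short to hold delimiters plus content
    kind = NOTATION.get((token[0], token[-1]))
    return (kind, token[1:-1]) if kind else None


for token in ["[meik]", "/me:k/", "<a>", "make"]:
    print(token, "->", classify(token))
```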

A grapheme is an abstract concept; it is represented by a specific shape in a specific typeface, called a glyph. Different glyphs representing the same grapheme are called allographs.

In an ideal phonemic orthography, there would be a complete one-to-one correspondence between the graphemes and the phonemes of the language. English is highly non-phonemic, whereas Finnish comes much closer to being consistently phonemic.
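The one-to-one requirement can be expressed as a simple check on a list of (grapheme, phoneme) pairs. The Python sketch below uses tiny toy samples, not real inventories; the function name is_one_to_one is an assumption for the example.

```python
def is_one_to_one(spellings):
    """True if each grapheme maps to exactly one phoneme and vice versa.

    spellings: iterable of (grapheme, phoneme) pairs.
    """
    pairs = set(spellings)
    graphemes = {g for g, _ in pairs}
    phonemes = {p for _, p in pairs}
    # Any grapheme or phoneme occurring in two pairs breaks the equality.
    return len(graphemes) == len(pairs) == len(phonemes)


# Toy samples: several English spellings share one sound, and <c> has two.
english = [("f", "f"), ("ph", "f"), ("gh", "f"), ("c", "k"), ("c", "s")]
idealised = [("a", "a"), ("k", "k"), ("s", "s")]
print(is_one_to_one(english), is_one_to_one(idealised))
```

The more pairs of a language violate this check, the more a TTS system depends on spelling-to-phoneme rules and exception lists.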

Visemes

A viseme is a generic facial shape that can be used to describe a particular sound. Visemes are for lipreaders what phonemes are for listeners: the smallest standardized building blocks of words. However, visemes and phonemes do not share a one-to-one correspondence: several phonemes that sound different can look the same on the lips.
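The relation between phonemes and visemes is many-to-one, which can be sketched as a Python dictionary. The viseme labels below are hypothetical and the phoneme groups are only a small illustrative subset (lip sounds such as p, b, m are visually alike, as are f and v).

```python
# Hypothetical viseme labels; the phoneme groups are a small illustrative subset.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
}


def visemes_for(phonemes):
    """Map a phoneme sequence to viseme labels (None if unknown)."""
    return [PHONEME_TO_VISEME.get(p) for p in phonemes]


def phonemes_sharing(viseme):
    """Return all phonemes that are shown with the given viseme."""
    return sorted(p for p, v in PHONEME_TO_VISEME.items() if v == viseme)


print(phonemes_sharing("lips_closed"))
```

Because the mapping cannot be inverted, a lipreader (or a lip-sync animation) sees fewer distinctions than a listener hears.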

Visemes


Links

A list with links to websites with additional information about phonemes, phones, graphemes and visemes is shown hereafter :