Google text to speech (TTS) with processing

Referring to the post about Google STT, this post is related to Google speech synthesis with processing. Amnon Owed presented in November 2011 processing code snippets to make use of Google’s text-to-speech webservice. The idea was born in the processing forum.

The sketch makes use of the Minim library that comes with Processing. Minim is an audio library that uses the JavaSound API, a bit of Tritonus, and Javazoom’s MP3SPI to provide an easy to use audio library for people developing in the Processing environment. The author of Minim is Damien Di Fede (ddf), a creative coder and composer interested in interactive audio art and music games. In November 2009, Damien was joined by Anderson Mills who proposed and co-developed the UGen Framework for the library.

I use the Minim 2.1.0 beta version with this new UGen Framework. I installed the Minim library in the libraries folder in my sketchbook and deleted the integrated 2.0.2 version in the processing (2.0b8) folder modes/java/libraries.

Today I run succesful trials with the english, french and german Google TTS engine. I am impressed by the results.

Google speech to text (STT) with processing

Processing is an open source programming language and environment for people who want to create images, animations, and interactions.

Florian Schulz, Interaction Design Student at FH Potsdam, presented a year ago in the processing forum a speech to text (STT) library, based on the Google API. The source code is available at GitHub, a project page provides additional informations. The library is based on an article of Mike Pultz, named Accessing Google Speech API / Chrome 11, published in March 2011.

I installed the library in my processing environment (version 2.0b8) and run the test examples with success. I did some trials with the french and german Google speech recognition engines. I am impressed by the results.

Additional informations about this topic are provided in the following link list :

 

Voice driven web applications

Last update : July 17, 2013

The new JavaScript Web Speech API specified by W3C makes it easy to add speech recognition to a web page and to create voice driven web applications. It enables developers to use scripting to generate text-to-speech output and to use speech recognition as an input for forms, continuous dictation and control. The JavaScript API allows web pages to control activation and timing and to handle results and alternatives.

The Web Speech specification was published by the Speech API Community Group, chaired by Glen Shires, software engineer at Google. The specification is not a W3C Standard nor is it on the W3C Standards Track.

A demo working in the Chrome browser 25 and later is available at the HTML5 rocks website.

There are two processes : Text-to-Speech (speech synthesis : TTS) and Speech-to-Text (speech recognition : ASR). There are at least three different approaches to synthesize text :

  • integrated :  a TTS module is built into the OS, or a separately installed TTS engine can plug-in to the OS’s TTS module.
  • packaged : instead of requiring a separate install, a synthesizer and voices can be packaged and shipped with the application.
  • in the cloud : a web-service is used to synthesize text. The advantage of this is a more predictable and consistent voice quality, independent from the hardware and operation system used on the mobile client.

Concerning ASR, Wolf Paulus, an internationally experienced technologist and innovator, compared the performance (speed and accuracy) of the speech recognition systems developed by Google, Nuance, iSpeech and AT&T.

A HTML Speech XG Speech API Proposal, introduced by Microsoft to the  HTML Speech Incubator Group, is available as unofficial draft at the W3C website.

A list of speech recognition software is available at Wikipedia. The main hosted speech applications are presented below :

iSpeech

iSpeech provides speech solutions for individuals and business, in different fields as mobiles, connected homes, automotive, publishing (audio books), e-learning and more. The solutions include Text-to-speech (TTS) and speech recognition (ASR).

iSpeech offers API’s and SDK for developers for different devices and programming languages (iPhone, Android, Blackberry, PHP, JAVA, Python, .NET, Flash, Ruby, Perl) and comprehensive documentations, integration guides, web samples and FAQ’s. iSpeech povides development keys to use the three servers :

  • Mobile Development
  • Mobile Production
  • Web/General/Desktop/Other Production

The applications must be configured to use the correct servers.To make the web/general key work, you need to buy credits. The low usage price is $0.02 per word (TTS) or per transaction (ASR).

An free iSpeech app for iOS devices (version 1.3.5 updated May 13, 2013) to convert text to speech with the best sounding voices is available at the iTune store. This app is powered by the iSpeech.org Text to Speech (TTS) software as a service (SaaS) API. Other apps for iOS and Android devices are listed at the iSpeech website. A Text-to-Speech demo is also available.

Nuance

Nuance Communications is a multinational computer software technology corporation, headquartered in Burlington, Massachusetts, that provides speech and imaging applications.

In August 2012, Nuance announced Nina, a collection of personal assistant technologies that will bring Siri-like functionality to customer service mobile apps.

Nuance provides the Dragon Mobile SDK to developers that joined the NDEV Dragon Mobile developer program. This creates a unique opportunity in the mobile developer ecosystem to power any application with Nuance’s proven, best-in-class Dragon Naturally Speaking voice recognition technology.

In joining NDEV Mobile, developers have free access to wrappers and widgets for simple application customization, all through a self-service website. Developers also have access to an on-line community forum for support, a variety of code samples and full documentation. Once an NDEV Mobile developer has integrated the SDK into their application, Nuance provides 90 days of free access to the cloud-based speech services to validate the power of speech recognition on their application. To put an application in production, a licence fee of 3.000 $ has to be prepaid.The low usage price is 0,009 $ per transaction.

The following platforms are supported :

  • Apple  iOS
  • Android
  • Windows Phone
  • HTTP web services interface

A mobile assistant & voice app for iOS and Android is available in the iTunes at GooglePlay stores.

AT&T Watson Speech engine

AT&T offers a free speech development program to access the tools needed to build, test, onboard and certify applications across a range of devices, OSes and platforms.

There are three classes of functionality in the AT&T speech API family :

  • Speech to Text : 9 contexts are optimized to return the text of what the end users say. The text can be returned in multiple formats, including, JSON and XML.
  • Text to Speech : Male and female ‘characters’ are available for both English and Spanish.
  • Speech to Text Custom :  the speech service is customized by sending a list of words or phrases commonly spoken by the end users to improve recognition of those unique words. The Grammar List supports 19 languages, the Generic with Hints supports English and Spanish.

The Call Management (Beta) API that is powered by Tropo™ exposes SMS and Voice Calling RESTful APIs, which enable app developers to create voice-enabled apps that send or receive calls, provide Interactive Voice Response (IVR) logic, Automatic Speech Recognition (ASR), Voice to Text (VTT), Text (SMS) integration, and more. SDK’s are available for HTML5 (Sencha Touch), Android, iOS and Microsoft. Tools are provided for key platforms, including Android, Brew MP, HTML5, RIM BlackBerry and Windows Phone.

The Speech API provides two methods for transcribing audio into text and one method for rendering text into audio. An AT&T Natural Voices Text-to-Speech Demo is availbale at the AT&T research website.

API access to the AT&T sandbox and production environments costs 99$ a year. The sandbox and production environments allow you to develop, test, and deploy applications using AT&T APIs, including 1 million points (one transaction = one point) each month to spend on any APIs they like. A US based credit card is required to charge 20$ for each additional group of 2,000 points exceeding one million. See the AT&T pricelist.

AT&T Application Resource Optimizer (ARO) is a free diagnostic tool for analyzing the performance of your mobile applications. It can help your app run faster and smarter by providing recommendations to help optimize your mobile application’s performance, speed, network impact and battery utilization.

Speech API FAQ’s as well as code samples, documents, tutorials, guides, SDK’s, tools, blogs, forums and more are available at the AT&T speech development website.

Google Speech API

The Google Speech API can be accessed safely through a Chrome browser using x-webkit-speech. Some people have reverse engineered the Google speech API for other uses on the web. The interface is free, but it is not an official public API.

On February 23, 2013, Google announced at the Chrome Blog that the new stable Chrome release includes support for the Web Speech API, which developers can use to integrate speech recognition capabilities into their web apps in more than 30 languages. A web speech API demo is available at the Google website. In the Peanut Gallery, you can add intertitles to old black-and-white movies simply by talking to Chrome.

The following list provides links to more informations about the Google speech API’s :

More speech applications from other suppliers are listed hereafter :

The Eclipse Voice Tools Project (VTP) allows you to build and run speech recognition application using industry standards such as VoiceXML and Speech Recognition Grammar Specification (SRGS).

Sphinx-4 : a Java speech recognizer

Sphinx-4 is a state-of-the-art speech recognition system written in Java. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).

Sphinx-4 contains the following demo programs :

  • Hello World Demo: a command line application that recognizes simple phrases
  • Hello Digits Demo: a command line application that recognizes connected digits
  • Hello N-Gram Demo: a command line application using an N-gram language model for speech recognition
  • ZipCity Demo: a Java Web Start technology application that recognizes spoken zip codes and locates the associated city and state
  • WavFile Demo: a simple demo program to show how to decode audio files (e.g., .wav, .au files)
  • Transcriber Demo: a simple demo program showing how to transcribe a continuous audio file that has multiple utterances separated by silences
  • JSGF Demo: a simple demo program showing how a program can swap between multiple JSGF grammars
  • Dialog Demo: a demo program showing how a program can swap between multiple JSGF and dictation grammars
  • Action Tags Demo: a demo program showing how to use action tags for post-processing of RuleParse objects obtained from JSGF grammars
  • Confidence Demo: a simple demo program showing how to obtain confidence scores for result
  • Lattice Demo: a simple demo program showing how to extract lattices from recognition results

A number of tests and demos rely on having JSAPI installed. Sphinx-4 can be combined wit FreeTTS to set up a complete voice interface or a VoiceXML server.

FreeTTS : a Java speech synthesizer

FreeTTS is a speech synthesis system written entirely in Java. It is based upon Flite, a small run-time speech synthesis engine developed at Carnegie Mellon University. Flite is derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University.
Free TTS was built by the Speech Integration Group of Sun Microsystems Laboratories.

Possible uses of FreeTTS are:

  • JSAPI (Java Speech API) speech synthesizer
  • Remote TTS Server, to act as a back-end text-to-speech engine that works with a speech/telephony system, or does the “heavy lifting” for a wireless PDA
  • Workstation/Desktop TTS engine
  • Downloadable Web Application (FreeTTS can not be used in an applet)

FreeTTS includes the following demos :

  •  JSAPI/HelloWorld: uses the JSAPI 1.0 Synthesis interface to speak “Hello, World”
  • JSAPI/MixedVoices: demonstrates using multiple voices and speech synthesizers in a coordinated fashion using JSAPI 1.0
  • JSAPI/Player: Swing-based GUI (graphical user interface) that allows the user to monitor and manipulate a JSAPI 1.0 Speech Synthesizer
  • JSAPI/JTime: JSAPI program that uses a limited-domain, high quality voice to tell the time
  • JSAPI/Emacspeak: uses JSAPI 1.0 to provide a text-to-speech server for Emacspeak
  • JSAPI/WebStartClock: JSAPI talking clock that can be downloaded from the web using Java Web Start
  • freetts/HelloWorld: low-level (non-JSAPI) program that speaks a greeting to the world
  • freetts/ClientServer: low-level (non-JSAPI) socket-based TTS server with sample clients written in the C programming language and the Java programming language.

To write software with FreeTTS, it is recommended to use the Java Speech API (JSAPI) 1.0 to interface with FreeTTS. The JSAPI interface provides the best method of controlling and using FreeTTS.

Currently, the FreeTTS distribution comes with these 3 voices:

  • a low quality, unlimited domain, 8kHz diphone voice, called kevin
  • a medium quality, unlimited domain, 16kHz diphone voice, called kevin16
  • a high quality, limited domain, 16kHz cluster unit voice, called alan

FreeTTS interfaces with the MBROLA synthesizer and can use MBROLA voices. It’s also possible to import voice data from Festival and FestVox or CMU ARCTIC voices.

A full implementation of Sun’s Java Speech API for Windows platforms, allowing a large range of SAPI4 and SAPI5 compliant Text-To-Speech and Speech-Recognition engines (in many different languages) to be programmed using the standard Java Speech API has been developped by CloudGarden. Packages and additional classes augment the capabilities of the JSAPI by, for example integrating with Sun’s JMF, allowing, amongst other things, MPEG audio files to be created and read, and compressed audio data to be transmitted across a network