Sphinx-4: a Java speech recognizer

Sphinx-4 is a state-of-the-art speech recognition system written in Java. It was created through a collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).

Sphinx-4 includes the following demo programs:

  • Hello World Demo: a command line application that recognizes simple phrases
  • Hello Digits Demo: a command line application that recognizes connected digits
  • Hello N-Gram Demo: a command line application using an N-gram language model for speech recognition
  • ZipCity Demo: a Java Web Start technology application that recognizes spoken zip codes and locates the associated city and state
  • WavFile Demo: a simple demo program to show how to decode audio files (e.g., .wav, .au files)
  • Transcriber Demo: a simple demo program showing how to transcribe a continuous audio file that has multiple utterances separated by silences
  • JSGF Demo: a simple demo program showing how a program can swap between multiple JSGF grammars
  • Dialog Demo: a demo program showing how a program can swap between multiple JSGF and dictation grammars
  • Action Tags Demo: a demo program showing how to use action tags for post-processing of RuleParse objects obtained from JSGF grammars
  • Confidence Demo: a simple demo program showing how to obtain confidence scores for recognition results
  • Lattice Demo: a simple demo program showing how to extract lattices from recognition results

A number of the tests and demos require JSAPI to be installed. Sphinx-4 can be combined with FreeTTS to set up a complete voice interface or a VoiceXML server.

FreeTTS: a Java speech synthesizer

FreeTTS is a speech synthesis system written entirely in Java. It is based upon Flite, a small run-time speech synthesis engine developed at Carnegie Mellon University. Flite is derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University.
FreeTTS was built by the Speech Integration Group of Sun Microsystems Laboratories.

Possible uses of FreeTTS are:

  • JSAPI (Java Speech API) speech synthesizer
  • Remote TTS Server, to act as a back-end text-to-speech engine that works with a speech/telephony system, or does the “heavy lifting” for a wireless PDA
  • Workstation/Desktop TTS engine
  • Downloadable Web Application (FreeTTS cannot be used in an applet)

FreeTTS includes the following demos:

  • JSAPI/HelloWorld: uses the JSAPI 1.0 Synthesis interface to speak “Hello, World”
  • JSAPI/MixedVoices: demonstrates using multiple voices and speech synthesizers in a coordinated fashion using JSAPI 1.0
  • JSAPI/Player: Swing-based GUI (graphical user interface) that allows the user to monitor and manipulate a JSAPI 1.0 Speech Synthesizer
  • JSAPI/JTime: JSAPI program that uses a limited-domain, high quality voice to tell the time
  • JSAPI/Emacspeak: uses JSAPI 1.0 to provide a text-to-speech server for Emacspeak
  • JSAPI/WebStartClock: JSAPI talking clock that can be downloaded from the web using Java Web Start
  • freetts/HelloWorld: low-level (non-JSAPI) program that speaks a greeting to the world
  • freetts/ClientServer: low-level (non-JSAPI) socket-based TTS server with sample clients written in the C programming language and the Java programming language

When writing software with FreeTTS, the recommended approach is to use the Java Speech API (JSAPI) 1.0 as the interface. JSAPI provides the best method of controlling and using FreeTTS.

Currently, the FreeTTS distribution comes with three voices:

  • a low-quality, unlimited-domain, 8 kHz diphone voice, called kevin
  • a medium-quality, unlimited-domain, 16 kHz diphone voice, called kevin16
  • a high-quality, limited-domain, 16 kHz cluster unit voice, called alan
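
The trade-offs between the three bundled voices can be made concrete with a small lookup table. The following is only an illustrative sketch in plain Java: the BundledVoice class and its generalPurpose method are hypothetical names invented for this example and are not part of the FreeTTS or JSAPI APIs. The attribute values are taken directly from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT a FreeTTS class: it only models the three
// bundled voices using the attributes given in the list above.
final class BundledVoice {
    final String name;
    final String quality;        // "low", "medium" or "high"
    final boolean limitedDomain; // true if the voice only covers a restricted vocabulary
    final int sampleRateHz;
    final String technique;      // "diphone" or "cluster unit"

    BundledVoice(String name, String quality, boolean limitedDomain,
                 int sampleRateHz, String technique) {
        this.name = name;
        this.quality = quality;
        this.limitedDomain = limitedDomain;
        this.sampleRateHz = sampleRateHz;
        this.technique = technique;
    }

    // The three voices shipped with the FreeTTS distribution.
    static List<BundledVoice> all() {
        return Arrays.asList(
            new BundledVoice("kevin", "low", false, 8000, "diphone"),
            new BundledVoice("kevin16", "medium", false, 16000, "diphone"),
            new BundledVoice("alan", "high", true, 16000, "cluster unit"));
    }

    // Names of the unlimited-domain voices whose sample rate is at
    // least minSampleRateHz, in the order they are listed above.
    static List<String> generalPurpose(int minSampleRateHz) {
        List<String> names = new ArrayList<>();
        for (BundledVoice v : all()) {
            if (!v.limitedDomain && v.sampleRateHz >= minSampleRateHz) {
                names.add(v.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(generalPurpose(16000)); // prints [kevin16]
        System.out.println(generalPurpose(8000));  // prints [kevin, kevin16]
    }
}
```

The sketch shows why alan, despite being the highest-quality voice, is not a candidate for arbitrary text: it is a limited-domain voice, so a general-purpose application is left choosing between kevin and kevin16.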

FreeTTS interfaces with the MBROLA synthesizer and can use MBROLA voices. It’s also possible to import voice data from Festival and FestVox or CMU ARCTIC voices.

CloudGarden has developed a full implementation of Sun’s Java Speech API for Windows platforms. It allows a large range of SAPI4- and SAPI5-compliant text-to-speech and speech-recognition engines (in many different languages) to be programmed using the standard Java Speech API. Additional packages and classes augment the capabilities of JSAPI, for example by integrating with Sun’s Java Media Framework (JMF), which allows, among other things, MPEG audio files to be created and read, and compressed audio data to be transmitted across a network.