Google speech to text (STT) with Processing

Processing is an open source programming language and environment for people who want to create images, animations, and interactions.

Florian Schulz, an interaction design student at FH Potsdam, presented a speech-to-text (STT) library based on the Google API in the Processing forum a year ago. The source code is available on GitHub, and a project page provides additional information. The library is based on an article by Mike Pultz, Accessing Google Speech API / Chrome 11, published in March 2011.

I installed the library in my Processing environment (version 2.0b8) and ran the test examples successfully. I also ran some trials with the French and German Google speech recognition engines, and I am impressed by the results.

The Telecoms’ Future presented by Ernst & Young

Over the past six months, Ernst & Young developed a scenario study entitled How will consumers communicate in 2020? The results were presented today by Ernst & Young’s Telecommunications, Media & Entertainment and Technology (TMT) experts at the firm’s premises in Luxembourg.

Their thorough analysis resulted in two core uncertainties :

  • security and privacy (two extremes : in control or full chaos)
  • degree of Internet integration into our daily lives (two extremes : fragmented or fully integrated)

Placing these two core uncertainties on the axes of a coordinate system results in the following four divergent and challenging scenarios :

  • Full Speed Ahead scenario (self-regulation and uniform standards)
  • Roller Coaster scenario (high speed innovation, no rules)
  • Speed Limit Control scenario (stringent rules and regulations, more expensive, less user-friendly)
  • Gear Down scenario (loss of trust in the Internet)

Ernst & Young has sketched these four illustrative scenarios in an interactive video.

The Internet Society engaged in a similar scenario study about the future of the Internet a few years ago.

BlackBerry Protect

Last update : September 28, 2013

BlackBerry Protect is designed to help find your lost BlackBerry smartphone and keep the information on it secure. In BlackBerry 10, BlackBerry Protect is a feature built into the OS. For smartphones running on BlackBerry 7 OS and earlier versions of BlackBerry Device Software, BlackBerry Protect is a free application you can download on your smartphone.

The BlackBerry Protect website lets you perform several actions remotely. You can view the current location of your device on a map, make it ring loudly to help you find it even if the sound is turned off, display a customized message on its home screen even if the device is locked, lock it and optionally set a new password, and permanently delete all data from the device.

View location of a BlackBerry device in BlackBerry Protect

My BlackBerry Z10 was located correctly in Cloche d’Or, Luxembourg.

Voice driven web applications

Last update : July 17, 2013

The new JavaScript Web Speech API specified by W3C makes it easy to add speech recognition to a web page and to create voice driven web applications. It enables developers to use scripting to generate text-to-speech output and to use speech recognition as an input for forms, continuous dictation and control. The JavaScript API allows web pages to control activation and timing and to handle results and alternatives.

The Web Speech specification was published by the Speech API Community Group, chaired by Glen Shires, software engineer at Google. The specification is not a W3C Standard nor is it on the W3C Standards Track.

A demo that works in Chrome 25 and later is available on the HTML5 Rocks website.

There are two processes : Text-to-Speech (speech synthesis : TTS) and Speech-to-Text (speech recognition : ASR). There are at least three different approaches to synthesize text :

  • integrated : a TTS module is built into the OS, or a separately installed TTS engine can plug into the OS’s TTS module.
  • packaged : instead of requiring a separate install, a synthesizer and voices can be packaged and shipped with the application.
  • in the cloud : a web service is used to synthesize text. The advantage is a more predictable and consistent voice quality, independent of the hardware and operating system used on the mobile client.

Concerning ASR, Wolf Paulus, an internationally experienced technologist and innovator, compared the performance (speed and accuracy) of the speech recognition systems developed by Google, Nuance, iSpeech and AT&T.

An HTML Speech XG Speech API Proposal, introduced by Microsoft to the HTML Speech Incubator Group, is available as an unofficial draft at the W3C website.

A list of speech recognition software is available at Wikipedia. The main hosted speech applications are presented below :

iSpeech

iSpeech provides speech solutions for individuals and businesses in fields such as mobile, connected homes, automotive, publishing (audio books), e-learning and more. The solutions include text-to-speech (TTS) and speech recognition (ASR).

iSpeech offers APIs and SDKs for developers, covering different devices and programming languages (iPhone, Android, BlackBerry, PHP, Java, Python, .NET, Flash, Ruby, Perl), along with comprehensive documentation, integration guides, web samples and FAQs. iSpeech provides development keys to use the three servers :

  • Mobile Development
  • Mobile Production
  • Web/General/Desktop/Other Production

The applications must be configured to use the correct servers. To make the web/general key work, you need to buy credits. The low-usage price is $0.02 per word (TTS) or per transaction (ASR).
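
As a rough illustration of this per-unit pricing, the following Python sketch estimates a bill at the quoted low-usage rate of $0.02 per TTS word or ASR transaction. The function names and the whitespace-based word count are assumptions made for this example; iSpeech may count billable words differently, and volume discounts are not modelled.

```python
from decimal import Decimal

# Low-usage rate quoted in the post: $0.02 per word (TTS) or per transaction (ASR).
RATE_PER_UNIT = Decimal("0.02")


def count_words(text: str) -> int:
    """Count words by splitting on whitespace (an assumption, not iSpeech's rule)."""
    return len(text.split())


def estimate_cost(tts_text: str = "", asr_transactions: int = 0) -> Decimal:
    """Estimate the cost in dollars of synthesizing tts_text plus some ASR requests."""
    if asr_transactions < 0:
        raise ValueError("asr_transactions must not be negative")
    billable_units = count_words(tts_text) + asr_transactions
    return RATE_PER_UNIT * billable_units


if __name__ == "__main__":
    # Six words of TTS plus ten recognition requests: 16 billable units.
    print(estimate_cost("Hello world, this is a test", asr_transactions=10))
```

Decimal is used instead of float so that small per-unit amounts add up without rounding surprises.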

A free iSpeech app for iOS devices (version 1.3.5, updated May 13, 2013) that converts text to speech is available at the iTunes store. This app is powered by the iSpeech.org Text to Speech (TTS) software as a service (SaaS) API. Other apps for iOS and Android devices are listed at the iSpeech website. A Text-to-Speech demo is also available.

Nuance

Nuance Communications is a multinational computer software technology corporation, headquartered in Burlington, Massachusetts, that provides speech and imaging applications.

In August 2012, Nuance announced Nina, a collection of personal assistant technologies that will bring Siri-like functionality to customer service mobile apps.

Nuance provides the Dragon Mobile SDK to developers who have joined the NDEV Dragon Mobile developer program. This lets mobile developers power their applications with Nuance’s Dragon NaturallySpeaking voice recognition technology.

By joining NDEV Mobile, developers get free access to wrappers and widgets for simple application customization, all through a self-service website. Developers also have access to an online community forum for support, a variety of code samples and full documentation. Once an NDEV Mobile developer has integrated the SDK into an application, Nuance provides 90 days of free access to the cloud-based speech services to validate speech recognition in that application. To put an application into production, a licence fee of $3,000 has to be prepaid. The low-usage price is $0.009 per transaction.
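
To get a feel for these numbers, here is a minimal Python sketch of the production cost. It assumes that the $3,000 licence fee is a fixed charge on top of usage (the post does not say whether it is credited against transactions) and it ignores the 90-day free period; the function name is made up for this example.

```python
from decimal import Decimal

LICENCE_FEE = Decimal("3000")             # prepaid production licence fee, in dollars
RATE_PER_TRANSACTION = Decimal("0.009")   # low-usage price per transaction


def production_cost(transactions: int) -> Decimal:
    """Licence fee plus usage, assuming the fee is not credited against usage."""
    if transactions < 0:
        raise ValueError("transactions must not be negative")
    return LICENCE_FEE + RATE_PER_TRANSACTION * transactions


if __name__ == "__main__":
    for volume in (0, 100_000, 1_000_000):
        print(volume, production_cost(volume))
```

Under these assumptions the fixed fee dominates until usage reaches several hundred thousand transactions.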

The following platforms are supported :

  • Apple iOS
  • Android
  • Windows Phone
  • HTTP web services interface

A mobile assistant & voice app for iOS and Android is available in the iTunes and Google Play stores.

AT&T Watson Speech engine

AT&T offers a free speech development program to access the tools needed to build, test, onboard and certify applications across a range of devices, OSes and platforms.

There are three classes of functionality in the AT&T speech API family :

  • Speech to Text : 9 contexts are optimized to return the text of what the end users say. The text can be returned in multiple formats, including JSON and XML.
  • Text to Speech : Male and female ‘characters’ are available for both English and Spanish.
  • Speech to Text Custom : the speech service is customized by sending a list of words or phrases commonly spoken by the end users to improve recognition of those unique words. The Grammar List supports 19 languages, the Generic with Hints supports English and Spanish.

The Call Management (Beta) API that is powered by Tropo™ exposes SMS and Voice Calling RESTful APIs, which enable app developers to create voice-enabled apps that send or receive calls, provide Interactive Voice Response (IVR) logic, Automatic Speech Recognition (ASR), Voice to Text (VTT), Text (SMS) integration, and more. SDKs are available for HTML5 (Sencha Touch), Android, iOS and Microsoft. Tools are provided for key platforms, including Android, Brew MP, HTML5, RIM BlackBerry and Windows Phone.

The Speech API provides two methods for transcribing audio into text and one method for rendering text into audio. An AT&T Natural Voices Text-to-Speech Demo is available at the AT&T research website.

API access to the AT&T sandbox and production environments costs $99 a year. These environments allow you to develop, test, and deploy applications using AT&T APIs, and include 1 million points (one transaction = one point) each month to spend on any APIs you like. A US-based credit card is required, and $20 is charged for each additional group of 2,000 points beyond the first million. See the AT&T pricelist.
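
The points scheme can be sketched in a few lines of Python. This is only an illustration under two assumptions that the price list would have to confirm: a partially used group of 2,000 points is billed as a full group, and unused points do not roll over to the next month. The function names are hypothetical.

```python
from typing import Iterable

ANNUAL_FEE = 99               # dollars per year for sandbox and production access
INCLUDED_POINTS = 1_000_000   # points included each month (one transaction = one point)
GROUP_SIZE = 2_000            # additional points are sold in groups of this size
GROUP_PRICE = 20              # dollars per additional group


def monthly_overage(points_used: int) -> int:
    """Overage charge in dollars for one month (partial groups billed in full)."""
    extra = max(0, points_used - INCLUDED_POINTS)
    groups = -(-extra // GROUP_SIZE)  # ceiling division on integers
    return groups * GROUP_PRICE


def yearly_cost(monthly_points: Iterable[int]) -> int:
    """Annual fee plus the overage of every month in monthly_points."""
    return ANNUAL_FEE + sum(monthly_overage(points) for points in monthly_points)


if __name__ == "__main__":
    # One busy month with 4,000 points over the allowance, eleven quiet months.
    print(yearly_cost([1_004_000] + [200_000] * 11))
```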

AT&T Application Resource Optimizer (ARO) is a free diagnostic tool for analyzing the performance of your mobile applications. It can help your app run faster and smarter by providing recommendations to help optimize your mobile application’s performance, speed, network impact and battery utilization.

Speech API FAQs as well as code samples, documents, tutorials, guides, SDKs, tools, blogs, forums and more are available at the AT&T speech development website.

Google Speech API

The Google Speech API can be accessed safely through a Chrome browser using x-webkit-speech. Some people have reverse engineered the Google speech API for other uses on the web. The interface is free, but it is not an official public API.

On February 23, 2013, Google announced at the Chrome Blog that the new stable Chrome release includes support for the Web Speech API, which developers can use to integrate speech recognition capabilities into their web apps in more than 30 languages. A web speech API demo is available at the Google website. In the Peanut Gallery, you can add intertitles to old black-and-white movies simply by talking to Chrome.

More speech applications from other suppliers are listed hereafter :

The Eclipse Voice Tools Project (VTP) allows you to build and run speech recognition applications using industry standards such as VoiceXML and the Speech Recognition Grammar Specification (SRGS).

Talking Tom Cat and friends

Talking Tom Cat is a free app for phones and tablets that shows a cartoon cat onscreen that repeats whatever you say into the device in a funny voice. Talking Tom Cat also responds to pokes and strokes of the screen. You can pet the cat and hear him purr. You can pour him a glass of milk and watch him drink it. Talking Tom Cat is hugely appealing to kids.

Talking Tom Cat was created by Outfit7, a startup founded in October 2009 in Slovenia by Samo Login. The company moved its headquarters to Limassol, Cyprus in 2011. Outfit7 is known for its popular Talking Friends collection of about 20 mobile apps, which have been downloaded more than 400 million times. Tom Cat is the most popular Talking Friend, with more than 40 million downloads. Other characters are Gina the Giraffe, Pierre Parrot, Ben the Dog, Rex the Dinosaur, Larry the Bird, Baby Hippo, the cats Ginger and Angela and others.

In 2012 Tom Cat and his friends appeared in TV shows and music videos.

The company generates revenue through a combination of paid app downloads, in-app payments for virtual items and advertising. Tom Cat and the Talking Friends are also available as plush toys. The Talking Friend Superstars bring the apps to life and can even talk with the apps.

Second Life and OpenSim mobile clients

Two mobile grid clients are available for Second Life or OpenSim virtual worlds :

Pocket Metaverse (for iOS ; version 1.8.0 ; $4.99)

The following features are provided :

  • Instant Messaging & Chat
  • Friends List with Who’s Online
  • World Map and Teleporting
  • MiniMap with Who’s Nearby
  • Profiles, Groups, Search
  • Giving, Receiving, and Managing Inventory
  • Read Notecards and view Snapshots and Textures
  • Upload and Download from the Photo Album and Camera (iPhone only)
  • Payments
  • and more

The app does not show a 3D world view.

Mobile Grid Client (for Android ; version 1.19 ; monthly fee L$250)

The following features are provided :

  • messaging client/viewer with local chat
  • IM
  • group chat
  • people search
  • mini map
  • the ability to teleport
  • inventory support
  • and more

The app does not show a 3D world view.

Linden Lab, the company that created Second Life and grew that online community into one of the most colorful, varied online social networks in the world, recently launched some new products. One is Creatorverse, an app for Apple and Android phones and tablets that is designed around a very different type of collaborative, creative play from Second Life.

Aurora-Sim Metaverse

Aurora-Sim is the next generation of the OpenSimulator Project.

Aurora-Sim is a community based project consisting of individuals with a variety of technical and non-technical talents and has been developed with emphasis on security, speed and performance.

The architecture is designed to be as flexible as possible, so that third parties can create modules by plugging into Aurora-Sim’s modular design. By using the .NET framework and Mono, Aurora-Sim can be run on many platforms.

The Aurora-Sim software is still in alpha. There are 3 components :

  • Aurora Simulator (virtual world server ; version 5.0.1)
  • Aurora Web UI
  • Aurora Joomla interface

The source code is available at GitHub.

Hypergrid Business and Grid Press

Hypergrid Business is the magazine for enterprise users of virtual worlds.

Hypergrid Business is published by Trombly International, a Massachusetts-based communications firm with key staff in Boston, Belgium, Shanghai and Mumbai. Subsidiaries include the China Speakers Bureau. Hypergrid Business offers in-depth and up-to-date coverage of the OpenSim technology and community, which is currently the leading contender to be the hyperlinked 3D Web, with news, case studies, opinion, and feature stories. It also covers alternative open source platforms like Open Wonderland, and proprietary enterprise platforms like ProtoSphere.

The Editor in Chief and publisher of the Hypergrid Business Magazine is Maria Korolov. She has been a journalist for more than twenty years and has worked for the Chicago Tribune, Reuters, and Computerworld.

Another source of news and views from around OpenSim is Grid Press, which also covers the new Aurora-Sim metaverse. Grid Press says it is dedicated to bringing you factual news, reliable information and honest opinions. It promotes the free and open exchange of ideas and opinions in a fashion that is respectful of all persons. Grid Press operates on a not-for-profit basis, though it is not a registered non-profit. Grid Press is supported and hosted by Zetamax, an OpenSim hosting company that claims to offer a truly managed OpenSim solution.

Mono

Mono is a software platform designed to allow developers to easily create cross-platform applications. Sponsored by Xamarin, Mono is an open source implementation of Microsoft’s .NET Framework based on the ECMA standards for C# and the Common Language Runtime. A growing family of solutions and an active and enthusiastic contributing community are helping position Mono to become the leading choice for development of Linux applications.

The latest stable version of Mono is 2.10.x.