ROS by Example: Speech Recognition and Text-to-Speech (TTS)

NOTE: This tutorial is several years out of date and the specific commands for installing software and using ROS have changed since then.  An up-to-date version of this tutorial can be found in the book ROS By Example: A Do-It-Yourself Guide to the Robot Operating System, available as a downloadable PDF and in paperback on

Speech recognition and Linux have come a long way in the past few years, thanks mostly to the CMU Sphinx and Festival projects.  There are also ready-made ROS packages for both speech recognition and text-to-speech.  Consequently, it is quite easy to add speech control and voice feedback to your robot as we will now show.

In this tutorial we will:
  • Install and test the pocketsphinx package for speech recognition
  • Learn how to create a custom vocabulary for speech recognition
  • Teleoperate a TurtleBot using voice commands
  • Install and test the Festival text-to-speech system and ROS sound_play package

System Requirements

This tutorial has been tested using ROS Electric and Ubuntu 10.04.  There is also a report that it works using ROS Diamondback and Ubuntu 10.10.

Installing PocketSphinx for Speech Recognition

Thanks to Michael Ferguson from the University of Albany (and now at Willow Garage), we can use the ROS pocketsphinx package for speech recognition.  The pocketsphinx package requires the Ubuntu package gstreamer0.10-pocketsphinx, and we will also need the ROS sound drivers stack (in case you don't already have it), so let's take care of both first.  You will be prompted to install the Festival packages if you don't already have them; answer "Y" of course:

$ sudo apt-get install gstreamer0.10-pocketsphinx
$ sudo apt-get install ros-electric-sound-drivers

The pocketsphinx package is part of the University of Albany's rharmony stack so let's install the stack using the following commands. First move into your personal ROS path (e.g. ~/ros), then run:

$ svn checkout
$ rosmake --rosdep-install pocketsphinx

The key file in the pocketsphinx package is the Python script found in the nodes subdirectory.  This script does all the hard work of connecting to the audio input stream of your computer and matching voice commands to the words or phrases in the current vocabulary.  When the recognizer node matches a word or phrase, it publishes it on the /recognizer/output topic.  Other nodes can subscribe to this topic to find out what the user has just said.

Downloading the Tutorial Files

All the files needed for the tutorial can be downloaded via SVN.  Move into your personal ROS path (e.g. ~/ros) and run:

$ svn checkout
$ rosmake --rosdep-install pi_speech_tutorial

Testing the PocketSphinx Recognizer

You will get the best speech recognition results using a headset microphone, either USB, standard audio or Bluetooth.  Once you have your microphone connected to your computer, make sure it is selected as the input audio device by going to the Ubuntu System menu, then selecting Preferences->Sound.  Once the Sound Preferences window opens, click on the Input tab and select your microphone device from the list (if more than one).  Speak a few words into your microphone and you should see the volume meter respond.  Then click on the Output tab and select your desired output device as well as adjust the volume slider.  Now close the Sound Preferences window.

NOTE:  If you disconnect a USB or Bluetooth microphone and then reconnect it later, you will likely have to select it as the input again using the procedure described above.

Michael Ferguson includes a vocabulary file suitable for RoboCup@Home competitions that you can use to test the recognizer.  Fire it up now by running:

$ roslaunch pocketsphinx robocup.launch

You should see a list of INFO messages indicating that the various parts of the recognition model are being loaded.  The last few messages will look something like this:

INFO: ngram_search_fwdtree.c(195): Creating search tree
INFO: ngram_search_fwdtree.c(203): 0 root, 0 non-root channels, 26 single-phone words
INFO: ngram_search_fwdtree.c(325): max nonroot chan increased to 328
INFO: ngram_search_fwdtree.c(334): 77 root, 200 non-root channels, 6 single-phone words

Now say a few of the RoboCup phrases such as "bring me the glass", "go to the kitchen", or "come with me".  The output should look something like this:

Partial: BRING
Partial: BRING ME
Partial: BRING IS
[INFO] [WallTime: 1318719668.724552] bring me the glass
Partial: THE
Partial: GO
Partial: GOOD
Partial: GO TO
Partial: GOOD IS
Partial: GO TO THE
Partial: GO TO THE TO
Partial: GO TO THE GET
[INFO] [WallTime: 1318719670.184438] go to the kitchen
Partial: GO
Partial: COME
Partial: COME WITH
[INFO] [WallTime: 1318719671.835016] come with me

Congratulations—you can now talk to your robot!   Here we see how the PocketSphinx recognizer builds the recognized phrase over the course of your utterance.  To see just the final result, open another terminal, and echo the /recognizer/output topic:

$ rostopic echo /recognizer/output

Now try the same three phrases as above and you should see:

data: bring me the glass
data: go to the kitchen
data: come with me

For my voice, and using a Bluetooth over-the-ear microphone, the recognizer was surprisingly fast and accurate.
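Any node can listen to the recognizer in the same way that rostopic echo does.  Here is a minimal sketch of such a listener.  The node name speech_listener and the function names are our own inventions for illustration, and the ROS imports are tucked inside main() only so that the callback can be tried without a running ROS system:

```python
def on_speech(msg):
    """Callback for /recognizer/output; msg is a std_msgs/String."""
    # The recognized phrase is carried in the message's data field.
    text = msg.data.strip().lower()
    print("Heard: %s" % text)
    return text


def main():
    # Imported here so on_speech() can be exercised without ROS installed.
    import rospy
    from std_msgs.msg import String

    rospy.init_node("speech_listener")  # hypothetical node name
    rospy.Subscriber("/recognizer/output", String, on_speech)
    rospy.spin()  # keep the node alive until it is shut down
```

To try it, save the code in an executable script that calls main() at the end, then run it while the recognizer launch file is up.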

To see all the phrases you can use with the demo RoboCup vocabulary, run the following commands:

$ roscd pocketsphinx/demo
$ more robocup.corpus

Now try saying a phrase that is not in the vocabulary, such as "the sky is blue".  In my case, the result on the /recognizer/output topic was "this go is room".  As you can see, the recognizer will respond with something no matter what you say.  This means that care must be taken to "mute" the speech recognizer if we don't want random conversation to be interpreted as speech commands.  We will see how to do this below when we learn how to map speech recognition into actions.

Creating A Vocabulary

It is easy to create a new vocabulary, or corpus as it is called in PocketSphinx.  First, create a simple text file with one word or phrase per line.  Here is a corpus that could be used to drive your robot around using voice commands.  It is stored in a file called nav_commands.txt in the config subdirectory of the pi_speech_tutorial package:

$ roscd pi_speech_tutorial/config
$ more nav_commands.txt

You should see the following list of phrases:

pause speech
continue speech
move forward
move backward
move back
move left
move right
go forward
go backward
go back
go left
go right
go straight
come forward
come backward
come left
come right
turn left
turn right
rotate left
rotate right
speed up
slow down
quarter speed
half speed
full speed
stop now
help me
turn off
shut down

Feel free to add, delete or change some of these words or phrases before proceeding to the next step.

When you enter your phrases, try not to mix upper and lower case and do not use punctuation marks.  Also, if you want to include a number such as 54, spell it out as "fifty four".
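If your corpus grows beyond a handful of lines, a small script can check it against these rules before you upload it.  The following is only a sketch under our own assumptions: the function names are made up, we chose to force everything to lower case, and we reject any line containing digits rather than trying to spell the number out:

```python
import re


def clean_corpus_line(line):
    """Lowercase one phrase and strip punctuation; reject digits."""
    phrase = line.strip().lower()
    if re.search(r"\d", phrase):
        raise ValueError("spell out numbers in words: %r" % line)
    # Replace anything other than letters, apostrophes and spaces.
    phrase = re.sub(r"[^a-z' ]+", " ", phrase)
    # Collapse runs of whitespace left behind by the substitution.
    return " ".join(phrase.split())


def clean_corpus(lines):
    """Return cleaned, de-duplicated phrases in their original order."""
    seen = set()
    result = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        phrase = clean_corpus_line(line)
        if phrase and phrase not in seen:
            seen.add(phrase)
            result.append(phrase)
    return result
```

For example, clean_corpus(open("nav_commands.txt")) returns the list of phrases ready to be written back out one per line.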

Before we can use this corpus with PocketSphinx, we need to compile it into special dictionary and pronunciation files.  This can be done using the online CMU language model (lm) tool located at:

Follow the directions to upload your nav_commands.txt file, click the "Compile Knowledge Base" button, then download the resulting compressed tarball that contains all the language model files.  Extract these files into the config subdirectory of the pi_speech_tutorial package.  The files will all begin with the same number, such as 3026.dic, 3026.lm, etc.  These files define your vocabulary as a language model that PocketSphinx can understand.  You can rename all these files to something more memorable using a command like the following (the 4-digit number will likely be different in your case):

$ roscd pi_speech_tutorial/config
$ rename -f 's/3026/nav_commands/' *

Next, take a look at the voice_nav_commands.launch file found in the launch subdirectory of the pi_speech_tutorial package.  It looks like this:

<launch>
  <node name="recognizer" pkg="pocketsphinx" type="" output="screen">
    <param name="lm" value="$(find pi_speech_tutorial)/config/nav_commands.lm"/>
    <param name="dict" value="$(find pi_speech_tutorial)/config/nav_commands.dic"/>
  </node>
</launch>

As you can see, we launch the recognizer node from the pocketsphinx package and we point the lm and dict parameters to the files nav_commands.lm and nav_commands.dic created in the steps above.  Note also that the attribute output="screen" is what allows us to see the real-time recognition results in the launch window.

Launch this file and test speech recognition by monitoring the /recognizer/output topic:

$ roslaunch pi_speech_tutorial voice_nav_commands.launch

And in a separate terminal:

$ rostopic echo /recognizer/output

Try saying a few navigation phrases such as "move forward", "slow down" and "stop now".  You should see your commands echoed on the /recognizer/output topic.

Voice Controlling Your Robot

The recognizer node in the pocketsphinx package publishes recognized speech commands to the /recognizer/output topic.  To map these commands to robot actions, we need a second node that subscribes to this topic, looks for appropriate messages, then causes the robot to execute different behaviors depending on the message received.  To get us started, Michael Ferguson includes a Python script in the pocketsphinx package that maps voice commands into Twist messages that can be used to control a mobile robot.  We will use a slightly modified version of this script, found in the nodes subdirectory of the pi_speech_tutorial package.

The only key difference between the two scripts is the following block in the modified version:

# A mapping from keywords to commands.
self.keywords_to_command = {'stop': ['stop', 'halt', 'abort', 'kill', 'panic', 'off', 'freeze', 'shut down', 'help'],
                            'slower': ['slow down', 'slower'],
                            'faster': ['speed up', 'faster'],
                            'forward': ['forward', 'ahead', 'straight'],
                            'backward': ['back', 'backward', 'back up'],
                            'rotate left': ['rotate left'],
                            'rotate right': ['rotate right'],
                            'turn left': ['turn left'],
                            'turn right': ['turn right'],
                            'quarter': ['quarter speed'],
                            'half': ['half speed'],
                            'full': ['full speed'],
                            'pause': ['pause speech'],
                            'continue': ['continue speech']}

The keywords_to_command dictionary allows us to map different verbal commands into the same action.  For example, it is really important to be able to stop the robot once it is moving.  However, the word "stop" is not always recognized by the PocketSphinx recognizer.  So we provide a number of alternative ways of telling the robot to stop like "halt", "abort", "help", etc.  Of course, these alternatives must be included in our original PocketSphinx vocabulary (corpus).

This node subscribes to the /recognizer/output topic and looks for recognized keywords as specified in the nav_commands.txt corpus.  If a match is found, the keywords_to_command dictionary maps the matched phrase to an appropriate command word.  Our callback function then maps the command word to the appropriate Twist action sent to the robot.  You can look at the script for details.
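To make the lookup step concrete, here is one way the phrase-to-command mapping could be written.  This is a sketch and not necessarily how the tutorial script does it: the function name get_command and the simple substring test are our assumptions, while the dictionary is copied from the block shown earlier:

```python
KEYWORDS_TO_COMMAND = {
    'stop': ['stop', 'halt', 'abort', 'kill', 'panic', 'off', 'freeze', 'shut down', 'help'],
    'slower': ['slow down', 'slower'],
    'faster': ['speed up', 'faster'],
    'forward': ['forward', 'ahead', 'straight'],
    'backward': ['back', 'backward', 'back up'],
    'rotate left': ['rotate left'],
    'rotate right': ['rotate right'],
    'turn left': ['turn left'],
    'turn right': ['turn right'],
    'quarter': ['quarter speed'],
    'half': ['half speed'],
    'full': ['full speed'],
    'pause': ['pause speech'],
    'continue': ['continue speech'],
}


def get_command(phrase, keywords_to_command=KEYWORDS_TO_COMMAND):
    """Return the first command whose keyword occurs in phrase, else None."""
    for command, keywords in keywords_to_command.items():
        for keyword in keywords:
            if keyword in phrase:  # plain substring match
                return command
    return None
```

With this mapping, "move forward", "go forward" and "come forward" all yield the command word forward, and both "turn off" and "help me" yield stop.  Because the test is a plain substring match, a short keyword such as "off" would also match inside a longer word, so keep that in mind if you extend the vocabulary.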

Another feature of the script is that it will respond to the two special commands "pause speech" and "continue speech".  If you are voice controlling your robot, but you would like to say something to another person without the robot interpreting your words as movement commands, just say "pause speech".  When you want to continue controlling the robot, say "continue speech".
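The pause and continue behavior amounts to a single flag that is checked before any other command is acted upon.  The sketch below shows the idea in plain Python.  The class name, the default speeds and the subset of commands handled are all our own assumptions chosen for illustration; they are not taken from the tutorial script:

```python
class VoiceNavState(object):
    """Tracks the pause flag and the current (linear, angular) velocity."""

    def __init__(self, max_speed=0.5, start_speed=0.25,
                 increment=0.125, angular_speed=0.5):
        # All default values are arbitrary illustration values.
        self.max_speed = max_speed
        self.speed = start_speed
        self.increment = increment
        self.angular_speed = angular_speed
        self.paused = False
        self.linear = 0.0
        self.angular = 0.0

    def handle(self, command):
        """Apply one command word and return (linear, angular)."""
        if command == "pause":
            self.paused = True
        elif command == "continue":
            self.paused = False
        elif self.paused or command is None:
            pass  # muted, or nothing recognized: leave velocities alone
        elif command == "stop":
            self.linear, self.angular = 0.0, 0.0
        elif command == "forward":
            self.linear, self.angular = self.speed, 0.0
        elif command == "backward":
            self.linear, self.angular = -self.speed, 0.0
        elif command == "rotate left":
            self.linear, self.angular = 0.0, self.angular_speed
        elif command == "rotate right":
            self.linear, self.angular = 0.0, -self.angular_speed
        elif command in ("faster", "slower"):
            step = self.increment if command == "faster" else -self.increment
            # Clamp the speed between zero and the configured maximum.
            self.speed = min(self.max_speed, max(0.0, self.speed + step))
            if self.linear != 0.0:
                self.linear = self.speed if self.linear > 0 else -self.speed
        return (self.linear, self.angular)
```

In a real node, the two returned numbers would be copied into the linear.x and angular.z fields of a geometry_msgs/Twist message and published to the robot.  Note that in this sketch every command, including "stop", is ignored while speech is paused, so the robot keeps its last velocity; on a real robot you may prefer to let stop commands through at all times.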

To voice control a TurtleBot, move the robot into an open space free of obstacles, then bring up at least the minimal.launch file on the TurtleBot.  On your workstation computer, run the voice_nav_commands.launch and turtlebot_voice_nav.launch files:

$ roslaunch pi_speech_tutorial voice_nav_commands.launch

and in another terminal:

$ roslaunch pi_speech_tutorial turtlebot_voice_nav.launch

Try a relatively safe voice command first such as "rotate right".  Refer to the list of commands above for different ways you can move the robot.  The turtlebot_voice_nav.launch file includes parameters you can set that determine the maximum speed of the TurtleBot as well as the increments used when you say "speed up" or "slow down".

Installing and Testing Festival Text-to-Speech

Now that we can talk to our robot, it would be nice if it could talk back to us.  Text-to-speech (TTS) is accomplished using the CMU Festival system together with the ROS sound_play package.  If you have followed this tutorial from the beginning, you have already done the following step.  Otherwise, run it now. You will be prompted to install the Festival packages if you don't already have them--answer "Y" of course:

$ sudo apt-get install ros-electric-sound-drivers

The sound_play package uses the CMU Festival TTS library to generate synthetic speech.  Let's test it out with the default voice as follows.  First fire up the primary sound_play node:

$ rosrun sound_play

In another terminal, enter some text to be converted to voice:

$ rosrun sound_play "Greetings Humans. Take me to your leader."

The default voice is called kal_diphone.   To see all the English voices currently installed on your system:

$ ls /usr/share/festival/voices/english

To get a list of all basic Festival voices available, run the following command:

$ sudo apt-cache search --names-only festvox-*

To install the festvox-don voice (for example), run the command:

$ sudo apt-get install festvox-don

And to test out your new voice, add the voice name to the end of the command line like this:

$ rosrun sound_play "Welcome to the future" voice_don_diphone

There aren't a huge number of voices to choose from, but a few additional voices can be installed.  Here are the steps to get and use two of those voices, one male and one female:

$ sudo apt-get install festlex-cmu
$ cd /usr/share/festival/voices/english/
$ sudo wget -c
$ sudo wget -c
$ sudo tar jxf cmu_us_clb_arctic-0.95-release.tar.bz2
$ sudo tar jxf cmu_us_bdl_arctic-0.95-release.tar.bz2
$ sudo rm cmu_us_clb_arctic-0.95-release.tar.bz2
$ sudo rm cmu_us_bdl_arctic-0.95-release.tar.bz2
$ sudo ln -s cmu_us_clb_arctic cmu_us_clb_arctic_clunits
$ sudo ln -s cmu_us_bdl_arctic cmu_us_bdl_arctic_clunits

You can test these two voices like this:

$ rosrun sound_play "I am speaking with a female C M U voice" voice_cmu_us_clb_arctic_clunits
$ rosrun sound_play "I am speaking with a male C M U voice" voice_cmu_us_bdl_arctic_clunits

NOTE: If you don't hear the phrase on the first try, try repeating the command.  Also, remember that a sound_play node must already be running in another terminal.

You can also use sound_play to play wave files or a number of built-in sounds.  To play the R2D2 wave file in the pi_speech_tutorial sounds directory, use the command:

$ rosrun sound_play `rospack find pi_speech_tutorial`/sounds/R2D2a.wav

Note that the script requires the absolute path to the wave file which is why we used 'rospack find'.  You could also just type out the full path name.

To hear one of the built-in sounds, use the script together with a number from 1 to 5; for example:

$ rosrun sound_play 4

Using Text-to-Speech within a ROS Node

So far we have only used the Festival voices from the command line.  To see how to use text-to-speech from within a ROS node, the following script can be found in the nodes directory in pi_speech_tutorial.  Note that to use such a script, the primary sound_play node must already be running:

#!/usr/bin/env python

import roslib; roslib.load_manifest('pi_speech_tutorial')
import rospy
from std_msgs.msg import String

from sound_play.libsoundplay import SoundClient

class TalkBack:
    def __init__(self):
        # Run cleanup() when the node is shut down
        rospy.on_shutdown(self.cleanup)

        self.voice = rospy.get_param("~voice", "voice_don_diphone")
        self.wavepath = rospy.get_param("~wavepath", "")

        # Create the sound client object
        self.soundhandle = SoundClient()

        # Give the client a moment to connect to the sound_play node
        rospy.sleep(1)

        # Announce that we are ready for input
        self.soundhandle.playWave(self.wavepath + "/R2D2a.wav")
        self.soundhandle.say("Ready", self.voice)
        rospy.loginfo("Say one of the navigation commands...")

        # Subscribe to the recognizer output
        rospy.Subscriber('/recognizer/output', String, self.talkback)

    def talkback(self, msg):
        # Print the recognized words on the screen
        rospy.loginfo(msg.data)

        # Speak the recognized words in the selected voice
        self.soundhandle.say(msg.data, self.voice)

        # Uncomment to play one of the built-in sounds
        #self.soundhandle.play(5)

        # Uncomment to play a wave file
        #self.soundhandle.playWave(self.wavepath + "/R2D2a.wav")

    def cleanup(self):
        rospy.loginfo("Shutting down talkback node...")

if __name__=="__main__":
    # The node must be initialized before the private parameters are read
    rospy.init_node('talkback')
    try:
        TalkBack()
        rospy.spin()
    except rospy.ROSInterruptException:
        pass

First we import the SoundClient class from the sound_play library.  Then we assign a SoundClient object to self.soundhandle that we can use throughout the script.  The three sound_play functions we use are playWave() to play a wave file, say() to voice some text and play() to play one of the built-in sounds.  For the complete API, take a look at the sound_play documentation on the ROS wiki.

You can test the script using the talkback.launch file.  Note how the launch file first brings up a sound_play node before launching the script:

$ roslaunch pi_speech_tutorial talkback.launch

You should now be able to write your own script that combines speech recognition and text-to-speech.  For example, see if you can figure out how to ask your robot the date and time and get back the answer from the system clock. :-)
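If you would like a starting point for the date and time exercise, here is one possible sketch.  It is only an illustration: the function names are ours, the trigger words "time" and "date" are assumptions (whatever phrase you choose must first be added to your corpus and recompiled), and the client argument is expected to be a SoundClient like self.soundhandle above.  As with "C M U" earlier, we spell out "A M" and "P M" as separate letters:

```python
import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def time_sentence(now):
    """Build a sentence giving the time and date for a datetime object."""
    hour = now.hour % 12 or 12  # convert 0..23 to a 12-hour clock
    period = "A M" if now.hour < 12 else "P M"
    if now.minute == 0:
        clock = "%d o'clock %s" % (hour, period)
    else:
        clock = "%d %02d %s" % (hour, now.minute, period)
    return "It is %s on %s %d" % (clock, MONTHS[now.month - 1], now.day)


def answer(client, phrase, voice, now=None):
    """Speak the time if the phrase asks for it; return the text or None."""
    if "time" not in phrase and "date" not in phrase:
        return None
    # Read the system clock unless a datetime was supplied by the caller.
    text = time_sentence(now or datetime.datetime.now())
    client.say(text, voice)
    return text
```

Inside a node like the talkback script, you could call answer(self.soundhandle, msg.data, self.voice) from the subscriber callback.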