ROS by Example: Speech Recognition and Text-to-Speech (TTS)
NOTE: This tutorial is several
years out of date and the specific commands for installing software and
using ROS have changed since then. An up-to-date version of this
tutorial can be found in the book ROS By Example: A Do-It-Yourself Guide to the Robot Operating System, available as a downloadable PDF and in paperback on Lulu.com.
Speech recognition and Linux have come a long way in the past few years, thanks mostly to the CMU Sphinx and Festival
projects. There are also ready-made ROS packages for both speech
recognition and text-to-speech. Consequently, it is quite easy to
add speech control and voice feedback to your robot as we will now show.
In this tutorial we will:
Install and test the pocketsphinx package for speech recognition
Learn how to create a custom vocabulary for speech recognition
Teleoperate a TurtleBot using voice commands
Install and test the Festival text-to-speech system and ROS sound_play package
This tutorial has been tested using ROS Electric and Ubuntu 10.04. There is also a report that it works using ROS Diamondback and Ubuntu 10.10.
Installing PocketSphinx for Speech Recognition
Thanks to Michael Ferguson from the University at Albany (and now at Willow Garage), we can use the ROS pocketsphinx package for speech recognition. The pocketsphinx package requires the Ubuntu package gstreamer0.10-pocketsphinx, and we will also need the ROS sound drivers stack (in case you don't already have it), so let's take care of both first. You will be prompted to install the Festival packages if you don't already have them--answer "Y" of course:
$ sudo apt-get install gstreamer0.10-pocketsphinx
$ sudo apt-get install ros-electric-sound-drivers
The key file in the pocketsphinx package is the Python script recognizer.py found in the nodes
subdirectory. This script does all the hard work of connecting to
the audio input stream of your computer and matching voice commands to
the words or phrases in the current vocabulary. When the
recognizer node matches a word or phrase, it publishes it on the /recognizer/output topic. Other nodes can subscribe to this topic to find out what the user has just said.
Downloading the Tutorial Files
All the files needed for the tutorial can be downloaded via SVN.
Move into your personal ROS path (e.g. ~/ros) and run:
You will get the best speech recognition results using a headset
microphone, either USB, standard audio or Bluetooth. Once you
have your microphone connected to your computer, make sure it is
selected as the input audio device by going to the Ubuntu System menu, then selecting Preferences->Sound. Once the Sound Preferences window opens, click on the Input
tab and select your microphone device from the list (if more than
one). Speak a few words into your microphone and you should see
the volume meter respond. Then click on the Output tab and select your desired output device as well as adjust the volume slider. Now close the Sound Preferences window.
NOTE: If you disconnect a USB or Bluetooth microphone and then reconnect it later, you will likely have to select it as the input device again using the procedure above.
Michael Ferguson includes a vocabulary file suitable for RoboCup@Home
competitions that you can use to test the recognizer. Fire it
up now by running:
$ roslaunch pocketsphinx robocup.launch
You should see a list of INFO messages indicating that the various
parts of the recognition model are being loaded. The last few
messages will look something like this:
INFO: ngram_search_fwdtree.c(195): Creating search tree
INFO: ngram_search_fwdtree.c(203): 0 root, 0 non-root channels, 26 single-phone words
INFO: ngram_search_fwdtree.c(325): max nonroot chan increased to 328
INFO: ngram_search_fwdtree.c(334): 77 root, 200 non-root channels, 6 single-phone words
Now say a few of the RoboCup phrases such as "bring me the glass", "go
to the kitchen", or "come with me". The output should look
something like this:
Partial: BRING ME
Partial: BRING IS
Partial: BRING ME THE
Partial: BRING ME THE GO
Partial: BRING ME THE THE
Partial: BRING ME THE GLASS
[INFO] [WallTime: 1318719668.724552] bring me the glass
Partial: GO TO
Partial: GOOD IS
Partial: GO TO THE
Partial: GO TO THE TO
Partial: GO TO THE GET
Partial: GO TO THE KITCHEN
[INFO] [WallTime: 1318719670.184438] go to the kitchen
Partial: COME WITH
Partial: COME WITH THE
Partial: COME WITH ME
[INFO] [WallTime: 1318719671.835016] come with me
Congratulations—you can now talk to your robot! Here we see
how the PocketSphinx recognizer builds the recognized phrase over the
course of your utterance. To see just the final result, open another terminal, and echo the /recognizer/output topic:
$ rostopic echo /recognizer/output
Now try the same three phrases as above and you should see:
data: bring me the glass
data: go to the kitchen
data: come with me
For my voice, and using a Bluetooth over-the-ear microphone, the recognizer was surprisingly fast and accurate.
To see all the phrases you can use with the demo RoboCup vocabulary, run the following commands:
$ roscd pocketsphinx/demo
$ more robocup.corpus
Now try saying a phrase that is not in the vocabulary, such as "the sky is blue". In my case, the result on the /recognizer/output
topic was "this go is room". As you can see, the recognizer will
respond with something no matter what you say. This means that
care must be taken to "mute" the speech recognizer if we don't want
random conversation to be interpreted as speech commands. We will
see how to do this below when we learn how to map speech recognition output to robot commands.
Creating A Vocabulary
It is easy to create a new vocabulary or corpus
as it is referred to in PocketSphinx. First, create a
simple text file with one word or phrase per line. Here is a
corpus that could be used to drive your robot around using voice
commands. We will store it in a file called nav_commands.txt in the config subdirectory of the pi_speech_tutorial package:
$ roscd pi_speech_tutorial/config
$ more nav_commands.txt
You should see a list of phrases, one per line.
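The exact contents depend on your copy of the tutorial files; the list below is an illustrative corpus reconstructed from the commands used elsewhere in this tutorial:

```
pause speech
continue speech
move forward
move backward
go faster
slow down
rotate left
rotate right
stop
halt
abort
help
```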
Feel free to add, delete or change some of these words or phrases before proceeding to the next step.
When you enter your phrases, try not to mix upper and lower case, and do
not use punctuation marks. Also, if you want to include a number such
as 54, spell it out as "fifty four".
Before we can use this corpus with PocketSphinx, we need to compile
it into special dictionary and pronunciation files. This
can be done using the online CMU language model (lm) tool located at:
Follow the directions to upload your nav_commands.txt
file, click the "Compile Knowledge Base" button, then download the
resulting compressed tarball that contains all the language model
files. Extract these
files into the config subdirectory of the pi_speech_tutorial package. The files will all begin with the same number, such as 3026.dic, 3026.lm,
etc. These files define your vocabulary as a language model that
PocketSphinx can understand. You can rename all these files to
something more memorable using a command like the following (the
4-digit number will likely be different in your case):
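For example, assuming the files were numbered 3026, the rename can be done with a short shell loop (the original tutorial's exact command may have differed):

```shell
# Rename 3026.dic, 3026.lm, etc. to nav_commands.dic, nav_commands.lm, etc.
# (3026 is an example; substitute the number the lmtool gave you)
for f in 3026.*; do
    if [ -e "$f" ]; then mv "$f" "nav_commands.${f#3026.}"; fi
done
```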
In the voice_nav_commands.launch file, we launch the recognizer.py node from the pocketsphinx package and point the lm and dict parameters to the files nav_commands.lm and nav_commands.dic created in the steps above. Note also that the attribute output="screen" is what allows us to see the real-time recognition results in the launch window.
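A minimal version of such a launch file might look like this (a sketch based on the description above; check the tutorial files for the actual version):

```xml
<launch>
  <node name="recognizer" pkg="pocketsphinx" type="recognizer.py" output="screen">
    <!-- Point PocketSphinx at the language model and dictionary built above -->
    <param name="lm" value="$(find pi_speech_tutorial)/config/nav_commands.lm"/>
    <param name="dict" value="$(find pi_speech_tutorial)/config/nav_commands.dic"/>
  </node>
</launch>
```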
Launch this file and test speech recognition by monitoring the /recognizer/output topic:
$ roslaunch pi_speech_tutorial voice_nav_commands.launch
$ rostopic echo /recognizer/output
Try saying a few navigation phrases such as "move forward", "slow down"
and "stop". You should see your commands echoed on the /recognizer/output topic.
Voice Controlling Your Robot
The recognizer.py node in the pocketsphinx package publishes recognized speech commands to the /recognizer/output
topic. To map these commands to robot actions, we need a second
node that subscribes to this topic, looks for appropriate messages,
then causes the robot to execute different behaviors depending on the
message received. To get us started, Michael Ferguson includes a
Python script called voice_cmd_vel.py in the pocketsphinx package that maps voice commands
into Twist messages that can be used to control a mobile robot. We will use a
slightly modified version of this script called voice_nav.py found in the nodes subdirectory of the pi_speech_tutorial package.
The only key difference between the two scripts is the following block in voice_nav.py:
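Here is a sketch of that block; the exact phrase lists are illustrative assumptions (see voice_nav.py in the tutorial files for the real entries):

```python
# Map several alternative spoken phrases onto a single command word.
# The phrase lists below are examples, not the actual tutorial entries.
keywords_to_command = {
    'stop': ['stop', 'halt', 'abort', 'kill', 'panic', 'help'],
    'slower': ['slow down', 'slower'],
    'faster': ['speed up', 'faster'],
    'forward': ['forward', 'ahead', 'straight'],
    'backward': ['back', 'backward', 'back up'],
    'rotate left': ['rotate left'],
    'rotate right': ['rotate right'],
}

def get_command(phrase):
    # Return the command word whose keyword list matches the phrase,
    # or None if the phrase matches nothing in the vocabulary
    for command, keywords in keywords_to_command.items():
        for word in keywords:
            if word in phrase:
                return command
    return None
```

With a table like this, saying "halt", "abort" or "help" all map to the single command word "stop".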
The keywords_to_command dictionary
allows us to map different verbal commands into the same action.
For example, it is really important to be able to stop the robot once
it is moving. However, the word "stop" is not always recognized
by the PocketSphinx recognizer. So we provide a number of
alternative ways of telling the robot to stop like "halt", "abort",
"help", etc. Of course, these alternatives must be included in our
original PocketSphinx vocabulary (corpus).
The voice_nav.py node subscribes to the /recognizer/output topic and
looks for recognized keywords as specified in the nav_commands.txt corpus.
If a match is found, the keywords_to_command
dictionary maps the matched phrase to an appropriate command
word. Our callback function then maps the command word to the
appropriate Twist action sent to the robot. You can look at the voice_nav.py script for details.
Another feature of the voice_nav.py script is that it will respond to the two special commands "pause speech" and "continue speech".
If you are voice controlling your robot, but you would like to say
something to another person without the robot interpreting your words
as movement commands, just say "pause speech". When you want to
continue controlling the robot, say "continue speech".
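The pause/continue logic can be sketched in plain Python (the class and method names here are illustrative, not the actual ones in voice_nav.py):

```python
class SpeechGate:
    """Tracks whether recognized phrases should currently be acted on."""

    def __init__(self):
        self.paused = False

    def filter(self, phrase):
        # The two special commands toggle recognition on and off;
        # everything else passes through only while not paused
        if phrase == 'pause speech':
            self.paused = True
            return None
        if phrase == 'continue speech':
            self.paused = False
            return None
        return None if self.paused else phrase
```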
To voice control a TurtleBot, move the robot into an open space free of obstacles, then bring up at least the minimal.launch file on the TurtleBot. On your workstation computer, run the voice_nav_commands.launch and turtlebot_voice_nav.launch files:
$ roslaunch pi_speech_tutorial voice_nav_commands.launch
$ roslaunch pi_speech_tutorial turtlebot_voice_nav.launch
Try a relatively safe voice command first such as "rotate
right". Refer to the list of commands above for different
ways you can move the robot. The turtlebot_voice_nav.launch
file includes parameters you can set that determine the maximum speed of
the TurtleBot as well as the increments used when you say "go faster"
or "slow down".
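Those settings might look something like the following sketch; the parameter names and values here are assumptions, so consult turtlebot_voice_nav.launch for the actual ones:

```xml
<launch>
  <node name="voice_nav" pkg="pi_speech_tutorial" type="voice_nav.py" output="screen">
    <!-- Illustrative parameter names; see the real launch file -->
    <param name="max_speed" value="0.4"/>
    <param name="speed_increment" value="0.05"/>
  </node>
</launch>
```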
Installing and Testing Festival Text-to-Speech
Now that we can talk to our robot, it would be nice if it could talk
back to us.
Text-to-speech (TTS) is accomplished using the Festival system
(developed at the University of Edinburgh's Centre for Speech Technology Research) together with the ROS sound_play
package. If you have followed this tutorial from the beginning,
you have already done the following step. Otherwise, run it now.
You will be prompted to install the Festival packages if
you don't already have them--answer "Y" of course:
$ sudo apt-get install ros-electric-sound-drivers
The sound_play package uses the Festival TTS library to generate
synthetic speech. Let's test it out with the default voice as follows. First fire up the primary sound_play node:
$ rosrun sound_play soundplay_node.py
In another terminal, enter some text to be converted to voice:
$ rosrun sound_play say.py "Greetings Humans. Take me to your leader."
The default voice is called kal_diphone. To see all the English voices currently installed on your system:
$ ls /usr/share/festival/voices/english
To get a list of all basic Festival voices available, run
the following command:
$ apt-cache search --names-only "festvox-*"
To install the festvox-don voice (for example), run the command:
$ sudo apt-get install festvox-don
And to test out your new voice, add the voice name to the end of the command line like this:
$ rosrun sound_play say.py "Welcome to the future" voice_don_diphone
There aren't a huge number of voices to choose from, but a few additional voices can be installed as described here and demonstrated here. Here are the steps to get and use two of those voices, one male and one female:
Note that the play.py script requires
the absolute path to the wave file which is why we used 'rospack
find'. You could also just type out the full path name.
To hear one of the built-in sounds, use the playbuiltin.py script together with a number from 1 to 5; for example:
$ rosrun sound_play playbuiltin.py 4
Using Text-to-Speech within a ROS Node
So far we have only used the Festival voices from the command
line. To see how to use text-to-speech from within a ROS node,
take a look at the talkback.py script found in the nodes directory of the pi_speech_tutorial package. Note that to use such a script, the primary sound_play node must already be running:
import roslib; roslib.load_manifest('pi_speech_tutorial')
import rospy
from std_msgs.msg import String
from sound_play.libsoundplay import SoundClient

class TalkBack:
    def __init__(self):
        rospy.init_node('talkback')
        rospy.on_shutdown(self.cleanup)
        self.voice = rospy.get_param("~voice", "voice_don_diphone")
        self.wavepath = rospy.get_param("~wavepath", "")
        # Create the sound client object
        self.soundhandle = SoundClient()
        rospy.sleep(1)
        # Announce that we are ready for input
        self.soundhandle.playWave(self.wavepath + "/R2D2a.wav")
        rospy.sleep(1)
        self.soundhandle.say("Ready", self.voice)
        rospy.loginfo("Say one of the navigation commands...")
        # Subscribe to the recognizer output
        rospy.Subscriber('/recognizer/output', String, self.talkback)

    def talkback(self, msg):
        # Print the recognized words on the screen
        rospy.loginfo(msg.data)
        # Speak the recognized words in the selected voice
        self.soundhandle.say(msg.data, self.voice)
        # Uncomment to play one of the built-in sounds
        #self.soundhandle.play(3)
        # Uncomment to play a wave file
        #rospy.sleep(2)
        #self.soundhandle.playWave(self.wavepath + "/R2D2a.wav")

    def cleanup(self):
        rospy.loginfo("Shutting down talkback node...")

if __name__ == "__main__":
    TalkBack()
    rospy.spin()
The key lines are those involving the SoundClient object. First we import the SoundClient class from the sound_play library. Then we assign a SoundClient object to self.soundhandle that we can use throughout the script. The three sound_play functions we use are playWave() to play a wave file, say() to voice some text, and play() to play one of the built-in sounds. For the complete API, take a look at the sound_play page on the ROS wiki.
You can test the script using the talkback.launch file. Note how the launch file first brings up a sound_play node before launching the talkback.py script:
$ roslaunch pi_speech_tutorial talkback.launch
You should now be able to write your own script that combines speech
recognition and text-to-speech. For example, see if you can
figure out how to ask your robot the date and time and get back the
answer from the system clock. :-)
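As a hint for that exercise, the spoken reply can be built with Python's standard time module before handing it to say(); a minimal sketch:

```python
import time

def speakable_time(t=None):
    """Format the system clock as a phrase a TTS voice can speak."""
    if t is None:
        t = time.localtime()
    return time.strftime("The time is %I:%M %p on %A, %B %d", t)
```

Passing the resulting string to self.soundhandle.say() would have the robot announce the current time.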