Where Am I?
Place Recognition Using Omni-directional Images and Color Histograms
 Photo
1: Our omni-directional vision test setup. The black object
pointing up at the spherical mirror is a wireless webcam which
beams back an omnidirectional image to a desktop PC for
processing. The mirror itself is a Christmas ornament. The
three images on the right represent from top to bottom: a raw
omnidirectional image of the hallway; the unwrapped version of
the image using RoboRealm; the color histogram of the image
using EmguCV.
Have you ever wondered how you know where you are? For example, I
am sitting at my computer in our dining room inside our apartment
which is located in Stanford, California, USA, Earth, Milky Way, The
Universe. But how do I know this?
It is easy to see that location is a hierarchical concept
beginning at a small scale and working up to larger sizes and
distances. At the smaller end (e.g. dining room) we probably rely
mostly on visual cues and short term memory, while at the bigger end
(e.g. Earth), we depend on conceptual or semantic knowledge. The
small end is actually the hardest. Am I in the dining room or the
living room? Is this my apartment or your house? How do I tell these
locations apart? And how do I get from one to the other?
What Would a Robot Do?
In robotics, this general problem is called localization and
navigation and it involves both knowing where you are and how to
get somewhere else. Studies with people and animals have revealed at
least two clues on how we visually distinguish one location from
another. The first involves landmarks—distinct features
of the surroundings that serve as sign posts. For example, I may have
a picture on my living room wall that distinguishes it from the
dining room. And if you do not have the same picture anywhere in your
house, it can help tell my place apart from yours. Of course, we use
landmarks all the time when giving someone directions: "Turn
right at the fire station, then left at the bike shop, etc."
However, unless you already have a fairly sophisticated object
recognition system in place, localization and navigation by landmarks
is probably not the best strategy to start with.
The second approach involves image statistics. The most
popular image statistic is the histogram. A histogram counts
up how many times a given feature takes on different values. For
example, the color histogram of an image counts the number of pixels
in each color category: if we divide up the color space into
different hues or colors, then each time a pixel in an image matches
one of the colors, we increment that bin's counter by one. When we do
this across the entire image, we get a kind of frequency chart
representing the number of pixels in each of the color categories.
This frequency chart is what we mean by a histogram and it is often
characteristic of a given scene or room. For example, if your kitchen
has lots of white in it while your living room has lots of brown and
green, then the color histograms of pictures taken of these two
locations will allow us to distinguish one from the other. As it
turns out, understanding the nature of image statistics is also key
to developing higher level object recognition algorithms, which in
turn can help with landmark identification. So starting with image
statistics rather than object recognition is a good idea for this
reason too.
Below is a simple color image and its corresponding histogram.
The histogram's horizontal axis runs from red at the left to blue
on the right. The vertical axis represents the number of pixels
having that particular hue or color. Note the three peaks
corresponding to the orange, green and blue patches. Note also that
the orange peak is the highest because there are more orange pixels
in the image than green or blue. (If you're wondering why I chose
orange instead of red, it was to better show off the orange peak in
the histogram which would otherwise would have been obscured by the
axis on the left for a red peak...)
Getting Around
So let's start with a very simple task. How do we give a robot the
ability to know which room it is in while roaming around the house?
The easiest approach would be to create a visually unique landmark
for each room. For example, we could place brightly colored pieces of
paper at an appropriate height on the walls of each room so that all
the robot would need to do is look around until he sees one of the
landmarks, then read off the color to know what room he is in. Note
that we wouldn't be able to use different shapes for this purpose.
The reason is that a shape will look very different depending on the
angle from which it is being viewed. So color is a better choice. The
downside of this approach is that you have to mark up your house for
it to work. And any time the robot wanted to confirm which room he is
in, he'd have to look around to find the closest colored piece of
paper. Furthermore, this technique wouldn't work very well outdoors,
and someday the robot might want to get out and see the world for
himself. So how could we accomplish the same thing without using
artificial landmarks?
For this we turn to image statistics. The simplest statistic is
the color histogram as described above. We are going to take
advantage of our robot's omni-directional vision system to capture
the images. Such images work very well with histograms since any
given image covers an entire panorama of the room. This means that it
doesn't matter what orientation the robot might have when a picture
is taken since the number of different colored pixels in the image is
independent of the rotation of the image. Of course, the images and
their histograms will vary depending on the position in the
room from which they are taken. And this will mean that the
histograms for a given room will not be exactly the same from one
location to an other. However, we are hoping that these differences
will be smaller than the differences between histograms computed in
different rooms.
To test this approach, we take the following steps:
Take a number of omni-directional
pictures of each room, say 5-10 images per room. We call this the
training run or learning phase.
Compute the color histograms of
each picture and store them in a database.
Compute or train a classifier that
can assign histograms to room names.
Take the robot back to the various rooms, take some new
pictures, and ask the robot to tell us what room he thinks he is in
based on the current picture. We call this the testing phase.
Let's now take each of these steps in turn.
Building the Image Database
Below are a few omni-directional images taken from six
different rooms. Alongside each image is one of its corresponding
color histograms. I say "one" of its color histograms
because a given image can be analyzed along a number of different
color channels, with one histogram per channel. In the images below,
the histogram corresponding to hue is shown. Hue is roughly
associated with what we think of an object's basic color, with red
colors toward the left in the histogram and blue colors toward the
right. As you can see in the first histogram below, the balcony image
contains a preponderance of red pixels.
Balcony
Dining Room
Living Room
Kitchen
Hallway
Foyer
As you can see, the histograms for different rooms have different
shapes reflecting the different distributions of color in each room.
However, we can also see that some rooms will be hard to tell apart
using this method alone. For example, the histograms for the dining
room and living room appear very similar. Fortunately, the histograms
computed on some of the other channels are better at telling these
two rooms apart. For example, here are the two saturation
histograms for the same images above for the dining room and
living room.
|

|

|
The saturation of a color is a measure of how dark or light the
color appears in the image. As we can see from the two saturation
histograms above, the dining room on the left has a noticeably
different saturation profile than the living room on the right. So
while the robot might confuse these two room using hue histograms
alone, it can make a better distinction if it also uses the
saturation histograms. This result highlights a general principle
when working with both robots and brains: never put all your eggs in
one basket. The more kinds of information you can use to perceptually
distinguish one object or scene from another, the less likely you are
to make mistakes.
Putting It All Together
As the robot moves from one room to another during the
learning phase, it stores six different histograms for each
omni-directional image. The six histograms correspond to the color
channels Red, Green, Blue, Hue, Saturation and Lightness. This might
sound like a lot of data to have to store in memory for each image
but in fact it is tiny compared to storing the raw images themselves.
Each histogram can be represented by as few as 64 bins each with one
number per bin (the count for that bin). So each image requires only
6x64 = 384 numbers per image. By comparison, to store all the pixel
values for a 320x240 color image would require 320x240x3 = 153,600
numbers or 400 times what it takes to store all six histograms. A
640x480 image would require 921,600 numbers or 2,400 times more data
than storing just the histograms.
It is fun to speculate how this strategy for storing image
statistics rather than the images themselves might relate to
human and animal perception and memory. For example, imagine
in your mind's eye what it's like to be in a forest. Do you see a
photographically detailed image of each tree? Or is the image
something more fuzzy like a certain mixture of green and brown, a
pattern of vertical and horizontal shapes (trunks and branches), a
patchwork of hard-to-describe texture (leaves, needles, bushes) and
so on. We'll have more to say along these lines in a later article
that includes shape and texture statistics in addition to color.
Getting back to our color histograms and the robot's meanderings
around the house. Let's have the robot take 5-10 pictures per room
(depending on the size of the room) and store the histograms of each
image in a database along with the name of the room in which the
picture was taken. We'll start with six rooms: dining room, living
room, kitchen, hallway, foyer and balcony. The balcony is not really
a room but it allows us to see how the robot performs in an outdoor
setting as well as indoors.
After storing the histograms of all the reference images, we'll
take the robot for a second tour of the various rooms (the testing
phase), place him in random positions in those rooms, and ask him to
tell us the name of the room he's in. How does he decide?
Classified Information
All creatures great and small must make continual discriminations
between objects, scenes and situations. How else could we know that
we are in the kitchen and not in the living room? Or that this slice
of green bread might make me sick. In addition to receiving raw
sensory data, the brain must decide which patterns of data belong in
the same category. In the field of robotics, this process is called
pattern classification. Furthermore, sensory information is
rarely classified in its raw form: instead, the original data is
preprocessed to extract certain features that characterize the
information more succinctly. Good features also tend to be more
stable than the raw data under a variety of conditions, such as
changes in lighting or viewing angle. Typical features used in visual
image processing include edges, line segments, colored blobs, basic
shapes, texture patches and so on.
Color histograms can also be thought of as a kind of feature of a
visual image. Not only do the histograms condense the information so
that it is easier to store in memory, but histograms vary less
dramatically under different conditions, such as different viewing
positions, than do the raw pixels values. For example, if we rotate
an omni-directional image, the raw pixels move in the opposite
direction, but the count of those pixels remains the same, so the
histogram is left unchanged. In other words, color histograms are
insensitive to the spatial arrangement of the image pixels. The
downside of this invariance is that many different images can result
in the same or similar histogram. We saw this earlier where the hue
histograms of images of the dining room and living room were very
similar. The hue histograms of these two rooms are similar because
the histograms do not encode the spatial distribution of the colors,
just their overall count. But this also matches our perceptual
experience that the two rooms really do look similar in terms
of overall color.
Once we have a set of features that can summarize a given image,
we need to build a classifier that can categorize these
features into different groups or classes. In the case of our color
histograms, a classifier will take a given picture, compute the six
histograms we have been using, then use those histograms to tell us
what room the picture was taken in. You can imagine doing this
yourself, though it might not be easy. First you'd have to look at
the six different histograms from a number of pictures taken in each
of the rooms. Then you'd have to find similarities between the
histograms from the same room as well as key differences between
histograms computed for different rooms. With enough practice, you
might be able to tell which room you were considering just by looking
at six of its histograms.
So how do we build a classifier that can match histograms with
room names? Fortunately, many people have worked decades on this
problem and there are many solutions depending on the type of data
you are dealing with and your goals. The general problem can be
framed as mapping a set of input values into a set of output values.
In our case, the set of input values are the bin counts in our six
histograms, while the desired output values can be thought of as 1's
and 0's where we assign a 1 to the correct room and a 0 to all the
incorrect rooms. Mathematically, these input values and output values
can be represented as vectors and the mappings between them
can be represented as matrices or other operators. In our
case, the input vectors have 256 elements (histogram bin counts)
while the output vectors have 6 elements, one for each room. The
challenge is to find a mapping between histogram values as inputs
into the correct room values as outputs.
In the next few sections, we will illustrate three different types
of classifiers based on three different mapping strategies from
inputs to outputs: prototypes, nearest neighbors and neural networks.
Classification by Prototypes
Perhaps the easiest classifier to build and understand is the
prototype classifier. The idea is quite simple: during the training
phase, take all the histograms for a given room and average
them. In other words, if we have 10 pictures of the living room, take
the 10 different hue histograms and average them together to form one
representative hue histogram for the whole collection. Do the same
thing for the other 5 histogram channels (saturation, lightness, red,
green blue) resulting in six average histograms Averaging input
vectors is often referred to as forming prototypes so
we will call this method the prototype classifier.
The idea behind averaging is that the
common features across the histograms for the same room will be
enhanced while the parts that do not overlap will be diluted. In
theory, this should result in a histogram that better matches other
images taken from the same room. So how well does it work?
Before describing the results, we'll
give a brief description of the methodology. The steps in the
experiment go like this:
Take an initial set of 5-10
pictures in each of the six rooms. (In the results below, there were
38 pictures taken during this phase.)
Compute the six different
histograms for each of these pictures..
Compute the average or prototype
histograms for each room, one for each color channel.
Take 3-5 new test pictures in
each of the rooms. (In the results below, there were 22 test
pictures.)
Compute the six different
histograms for each of these new pictures.
Compare the histograms to the
stored prototype histograms for each room. Each histogram channel
(hue, saturation, etc.) gets a vote as to which prototype (and thus
which room) best matches the test histogram. Whichever prototype
gets the most votes gets to label the image by its room name. A
confidence level is also computed and is defined as the number of
votes for the winning room divided by the total number of votes
possible (six in this case). For those of you wondering how we
compare two histograms, the method used here is called the Jeffrey
Divergence which essentially treats the two histograms as
probability distributions. But one could also simply use Euclidean
distance.
So how well does the prototype
classifier do? The table below shows the classification results for
the 22 test pictures. The first room name in each row is the correct
classification of the test picture while the second room name after
the arrow is what our classifier thinks it is. The number at the end
of each row is the confidence of the decision: a value of 1.0 would
be 100% confident (6 out of 6 votes) while 0.5 would 50% confident (3
out of 6 votes), and so on.
balcony ==> balcony: 0.97
balcony ==> balcony: 0.98
balcony ==> balcony: 0.59
balcony ==> balcony: 0.64
dining room ==> dining room: 0.8
dining room ==> dining room: 0.97
dining room ==> dining room: 0.95
dining room ==> dining room: 0.32
foyer ==> foyer: 0.97
foyer ==> foyer: 0.8
hallway ==> hallway: 0.63
hallway ==> hallway: 0.96
hallway ==> hallway: 0.64
hallway ==> hallway: 0.65
kitchen ==> kitchen: 0.79
kitchen ==> kitchen: 0.97
kitchen ==> kitchen: 0.8
living room ==> living room: 0.94
living room ==> living room: 0.97
living room ==> living room: 0.49
living room ==> living room: 0.81
living room ==> dining room: 0.48
TOTALS: 21 Correct; 1 Error = 95% correct
As you can see, this method made only
one error (the last row highlighted in red) where it confused the
living room with the dining room for that picture. There were also
two other questionable classifications (highlighted in yellow) where
the room name is correct but the confidence is low, meaning that
other rooms also got a significant portion of the votes.
Classification by Nearest Neighbor
The next classifier uses a
technique known as the nearest neighbor algorithm.
This time, instead of averaging the histograms for each image, we
store all the histograms individually in our database. When we are
presented with a test image, we compute its six histograms and
compare them to all the stored histograms to find out which earlier
picture best matches the current one. In other words, rather than
comparing the test histograms to prototype histograms, we compare
histograms to normal histograms. This requires a much larger number
of comparisons but it has the advantage that once we choose the best
match, we also know which
picture it best matches. For
example, if we take 5 pictures in different locations in the dining
room during the training phase, then take a test picture also in the
dining room, the nearest neighbor algorithm can tell us which of the
5 stored pictures is the best match. In theory this means that might
also know roughly where
in the dining room we are. (We'll
explore such a possibility in a later article.)
Here now are the results of using the
nearest neighbor classifier on our 22 test pictures.
balcony ==> balcony: 0.98
balcony ==> balcony: 0.99
balcony ==> balcony: 0.64
balcony ==> balcony: 0.98
dining room ==> living room: 0.48
dining room ==> dining room: 0.98
dining room ==> dining room: 0.81
dining room ==> dining room: 0.8
foyer ==> living room: 0.64
foyer ==> foyer: 0.63
hallway ==> hallway: 0.96
hallway ==> hallway: 0.81
hallway ==> hallway: 0.64
hallway ==> hallway: 0.49
kitchen ==> kitchen: 0.64
kitchen ==> kitchen: 0.65
kitchen ==> kitchen: 0.64
living room ==> living room: 0.64
living room ==> living room: 0.97
living room ==> living room: 0.65
living room ==> living room: 0.96
living room ==> living room: 0.97
TOTALS: 20 Correct; 2 Errors = 91% correct
The nearest neighbor classifier made one more error than the
prototype classifier but still correctly classified 20 out of 22
pictures. There was also a third classification (highlighted in
yellow above) that was a little low in confidence.
Classification using an Artificial Neural Network
The third type of classifier to be tested is called an artificial
neural network or ANN.
Neurons in the cortex tend to be arranged in layers, with one layer
connected to another layer by many thousands of synapses. Artificial
neural networks have been developed into a powerful method for
mapping a set of input values into a set of output values. The input
values are likened to the activity levels of neurons in one layer and
the output values are taken to be the activity levels of another
layer. By modifying the connections between the two layers, we can
build a network that takes a set of values over the input neurons and
produces a desired set of values over the output neurons. Since most
of the magic in such networks takes place in the connections between
layers, this kind of classifier is also known as a connectionist
network.
The kind of artificial neural network we will work with here uses
a process called supervised learning. The process works in a
series of "training" sessions. First we randomly set the
connections between the input neurons and the output neurons. Then,
for each histogram corresponding to the set of training pictures, we
set the activation levels of the input neurons to the bin values of the
histogram. Passing these values through the network's connections
results in a pattern of activity across the six output neurons (one
for each room). Since we know what room the histogram corresponds to,
we know which output neuron should have a value of 1 while the others
should have a value of 0. Early on in the training, this won't be the
case—the output neurons will have values in between 0 and 1. To
teach the network to do better in the future, we alter the connections
in such a way that the output activities move closer and closer to
the patterns of 1's and 0's that we know is correct. The particular
method used to tweak the connections is called the learning
algorithm for the network and a number of such algorithms have
been developed for different types of networks and problems to be solved.
The simplest kind of connectionist network is called a
perceptron. A perceptron has a single layer of connections
mapping the input nodes to the output nodes. The learning algorithm
most often used with perceptrons is called the delta rule,
because it tweaks the connections during training in a way that is
proportional to the difference between the desired output values and
the actually output values. Since these changes in the connections
must be made a little bit at a time, the whole training set of input
vectors must be run through the learning process many times—usually
hundreds or even thousands of cycles before the output values
adequately match the correct values. Nonetheless, with today's
computers, even thousands of training cycles can be done in seconds
if not less.
The figure below illustrates the kind of perceptron used in the current setup.
So now let's try our room recognition test using a perceptron
classifier. First we train the network using the first 38 pictures.
Then we test the network by feeding it each of the 22 test pictures
and looking at its output neurons to determine the room name. The
results are as follows:
balcony ==> balcony: 0.83
balcony ==> balcony: 1
balcony ==> balcony: 0.5
balcony ==> balcony: 0.83
dining room ==> dining room: 0.5
dining room ==> dining room: 0.5
dining room ==> dining room: 0.5
dining room ==> living room: 0.83
foyer ==> foyer: 0.83
foyer ==> foyer: 0.67
hallway ==> hallway: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.67
kitchen ==> kitchen: 0.83
kitchen ==> kitchen: 0.67
kitchen ==> kitchen: 0.83
living room ==> living room: 1
living room ==> living room: 1
living room ==> living room: 0.83
living room ==> living room: 0.83
living room ==> living room: 0.83
TOTALS: 21 Correct; 1 Error = 95% correct
The perceptron classifier performs as well as the prototype
classifier making only one mistake. There are however, a few more
cases where the confidence level is borderline (a number of them at
0.5). Let's see if we can do better by training the network for a
little longer before testing. For the results above, the network was
trained for 2500 cycles. Let's bump that up to 5000 cycles and try
again. Here are the new results:
balcony ==> balcony: 0.83
balcony ==> balcony: 1
balcony ==> balcony: 0.5
balcony ==> balcony: 0.83
dining room ==> dining room: 0.67
dining room ==> dining room: 0.5
dining room ==> dining room: 0.5
dining room ==> dining room: 0.67
foyer ==> foyer: 0.83
foyer ==> foyer: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.67
kitchen ==> kitchen: 0.67
kitchen ==> kitchen: 0.67
kitchen ==> kitchen: 0.83
living room ==> living room: 1
living room ==> living room: 1
living room ==> living room: 0.83
living room ==> living room: 0.83
living room ==> living room: 0.83
TOTALS: 22 Correct; 0 Errors = 100% correct
This time the classifier gets a perfect score whereas the confidence
levels remain roughly the same.
Summing Up
In this article we have seen how a fairly simple robot equipped
with a homemade omni-directional vision system can tell what room it
is in. If we imagine how we might go about such a task
ourselves, we might think in terms of high level object recognition
such as, “Oh, there is the stove so I must be in the
kitchen.” Since object recognition is actually one of the harder
things to get right when developing computer vision systems, we
adopted a simpler strategy. In this approach, we treated the image as
a whole and extracted the color histograms of the image along a number
of different color dimensions such as hue, saturation and
lightness. Since the images are omni-directional, such histograms are
fairly resilient to rotations and small translations of the robot and
therefore make good candidates for characterizing a particular room.
When trying to figure out what room it is in, our robot might say
something like, “There is a lot of white in this image so I must
be in the kitchen.” We then tested three different
classification algorithms for learning and discriminating between
histograms: prototypes, nearest neighbor comparisons, and an
artificial neural network called a perceptron. Although the neural
network classifier was able to score 100% on the room recognition
test, the prototype classifier came close at 95% correct and is much
simpler to implement.
In a future article, we will take a look at additional feature
histograms to further refine place recognition. For example, one can
count up all the edges in the image that are oriented at a certain
angle. Or one can count up the number of times color blobs change
from one color to another. It will probably turn out that we can go a
long way toward recognizing our current environment before we have to
actually start recognizing and naming particular objects.
References
This experiment was mainly inspired by the following article:
Ulrich, I., and Nourbakhsh, I., “Appearance-Based Place
Recognition for Topological Localization”, IEEE International
Conference on Robotics and Automation, San Francisco, CA, April 2000,
pp. 1023-1029. Best Vision Paper Award.
|