Where Am I?

Place Recognition Using Omni-directional Images and Color Histograms

Have you ever wondered how you know where you are? For example, I am sitting at my computer in our dining room inside our apartment which is located in Stanford, California, USA, Earth, Milky Way, The Universe. But how do I know this?

It is easy to see that location is a hierarchical concept beginning at a small scale and working up to larger sizes and distances. At the smaller end (e.g. dining room) we probably rely mostly on visual cues and short term memory, while at the bigger end (e.g. Earth), we depend on conceptual or semantic knowledge. The small end is actually the hardest. Am I in the dining room or the living room? Is this my apartment or your house? How do I tell these locations apart? And how do I get from one to the other?

What Would a Robot Do?

Photo 1: Our omni-directional vision test setup. The black object pointing up at the spherical mirror is a wireless webcam which beams back an omnidirectional image to a desktop PC for processing. The mirror itself is a Christmas ornament. The three images on the right represent from top to bottom: a raw omnidirectional image of the hallway; the unwrapped version of the image using RoboRealm; the color histogram of the image using EmguCV.

In robotics, this general problem is called localization and navigation and it involves both knowing where you are and how to get somewhere else. Studies with people and animals have revealed at least two clues on how we visually distinguish one location from another. The first involves landmarks—distinct features of the surroundings that serve as sign posts. For example, I may have a picture on my living room wall that distinguishes it from the dining room. And if you do not have the same picture anywhere in your house, it can help tell my place apart from yours. Of course, we use landmarks all the time when giving someone directions: "Turn right at the fire station, then left at the bike shop, etc." However, unless you already have a fairly sophisticated object recognition system in place, localization and navigation by landmarks is probably not the best strategy to start with.

The second approach involves image statistics. The most popular image statistic is the histogram. A histogram counts up how many times a given feature takes on different values. For example, the color histogram of an image counts the number of pixels in each color category: if we divide up the color space into different hues or colors, then each time a pixel in an image matches one of the colors, we increment that bin's counter by one. When we do this across the entire image, we get a kind of frequency chart representing the number of pixels in each of the color categories. This frequency chart is what we mean by a histogram and it is often characteristic of a given scene or room. For example, if your kitchen has lots of white in it while your living room has lots of brown and green, then the color histograms of pictures taken of these two locations will allow us to distinguish one from the other. As it turns out, understanding the nature of image statistics is also key to developing higher level object recognition algorithms, which in turn can help with landmark identification. So starting with image statistics rather than object recognition is a good idea for this reason too.
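To make the bin-counting idea concrete, here is a minimal sketch in Python using NumPy. The function name `color_histogram` and the toy values are purely illustrative, not code from the actual setup (which used RoboRealm and EmguCV):

```python
import numpy as np

def color_histogram(channel, bins=64, value_range=(0, 256)):
    """Count how many pixels fall into each of `bins` equal-width
    color categories. `channel` is an array of per-pixel values,
    e.g. the hue channel of an image."""
    counts, _ = np.histogram(channel, bins=bins, range=value_range)
    return counts

# A toy 2x2 "image": three pixels of value 10, one of value 200.
toy = np.array([[10, 10], [10, 200]])
hist = color_histogram(toy)   # hist[2] == 3, hist[50] == 1
```

Each bin here spans 4 values (256/64), so pixel value 10 lands in bin 2 and value 200 in bin 50, producing the kind of frequency chart described above.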

Below is a simple color image and its corresponding histogram. The histogram's horizontal axis runs from red at the left to blue on the right. The vertical axis represents the number of pixels having that particular hue or color. Note the three peaks corresponding to the orange, green and blue patches. Note also that the orange peak is the highest because there are more orange pixels in the image than green or blue. (If you're wondering why I chose orange instead of red, it was to better show off the orange peak in the histogram, which would otherwise have been obscured by the axis on the left for a red peak.)



Getting Around

So let's start with a very simple task. How do we give a robot the ability to know which room it is in while roaming around the house? The easiest approach would be to create a visually unique landmark for each room. For example, we could place brightly colored pieces of paper at an appropriate height on the walls of each room so that all the robot would need to do is look around until he sees one of the landmarks, then read off the color to know what room he is in. Note that we wouldn't be able to use different shapes for this purpose. The reason is that a shape will look very different depending on the angle from which it is being viewed. So color is a better choice. The downside of this approach is that you have to mark up your house for it to work. And any time the robot wanted to confirm which room he is in, he'd have to look around to find the closest colored piece of paper. Furthermore, this technique wouldn't work very well outdoors, and someday the robot might want to get out and see the world for himself. So how could we accomplish the same thing without using artificial landmarks?

For this we turn to image statistics. The simplest statistic is the color histogram as described above. We are going to take advantage of our robot's omni-directional vision system to capture the images. Such images work very well with histograms since any given image covers an entire panorama of the room. This means that it doesn't matter what orientation the robot might have when a picture is taken since the number of different colored pixels in the image is independent of the rotation of the image. Of course, the images and their histograms will vary depending on the position in the room from which they are taken. And this will mean that the histograms for a given room will not be exactly the same from one location to another. However, we are hoping that these differences will be smaller than the differences between histograms computed in different rooms.

To test this approach, we take the following steps:

  • Take a number of omni-directional pictures of each room, say 5-10 images per room. We call this the training run or learning phase.

  • Compute the color histograms of each picture and store them in a database.

  • Compute or train a classifier that can assign histograms to room names.

  • Take the robot back to the various rooms, take some new pictures, and ask the robot to tell us what room he thinks he is in based on the current picture. We call this the testing phase.

Let's now take each of these steps in turn.
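The learning phase in these steps can be sketched as a simple training loop. Everything here is hypothetical scaffolding: the helpers `capture_image` and `compute_histograms` stand in for the real camera and histogram code.

```python
def train(rooms, shots_per_room, capture_image, compute_histograms):
    """Learning phase: take several pictures per room and store each
    picture's histograms in a database along with the room's name."""
    database = []
    for room in rooms:
        for _ in range(shots_per_room):
            image = capture_image(room)
            database.append((compute_histograms(image), room))
    return database
```

During the testing phase, each new picture's histograms would then be compared against this database by a classifier.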

Building the Image Database

Below are a few omni-directional images taken from six different rooms. Alongside each image is one of its corresponding color histograms. I say "one" of its color histograms because a given image can be analyzed along a number of different color channels, with one histogram per channel. In the images below, the histogram corresponding to hue is shown. Hue is roughly associated with what we think of as an object's basic color, with red colors toward the left in the histogram and blue colors toward the right. As you can see in the first histogram below, the balcony image contains a preponderance of red pixels.

Balcony

Dining Room

Living Room

Kitchen

Hallway

Foyer

As you can see, the histograms for different rooms have different shapes reflecting the different distributions of color in each room. However, we can also see that some rooms will be hard to tell apart using this method alone. For example, the histograms for the dining room and living room appear very similar. Fortunately, the histograms computed on some of the other channels are better at telling these two rooms apart. For example, here are the two saturation histograms for the same images above for the dining room and living room.

The saturation of a color is a measure of how vivid or washed-out the color appears: a fully saturated red looks intense while a weakly saturated red fades toward gray. As we can see from the two saturation histograms above, the dining room on the left has a noticeably different saturation profile than the living room on the right. So while the robot might confuse these two rooms using hue histograms alone, it can make a better distinction if it also uses the saturation histograms. This result highlights a general principle when working with both robots and brains: never put all your eggs in one basket. The more kinds of information you can use to perceptually distinguish one object or scene from another, the less likely you are to make mistakes.

Putting It All Together

As the robot moves from one room to another during the learning phase, it stores six different histograms for each omni-directional image. The six histograms correspond to the color channels Red, Green, Blue, Hue, Saturation and Lightness. This might sound like a lot of data to have to store in memory for each image but in fact it is tiny compared to storing the raw images themselves. Each histogram can be represented by as few as 64 bins, each holding one number (the count for that bin). So each image requires only 6x64 = 384 numbers. By comparison, to store all the pixel values for a 320x240 color image would require 320x240x3 = 230,400 numbers, or roughly 600 times what it takes to store all six histograms. A 640x480 image would require 921,600 numbers, or 2,400 times more data than storing just the histograms.
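The storage arithmetic is easy to verify in a few lines (a color image holds three numbers per pixel, one per RGB component):

```python
# Six 64-bin histograms versus the raw pixel values of a color image.
bins_per_histogram = 64
channels = 6
histogram_numbers = channels * bins_per_histogram   # 384 numbers per image

raw_small = 320 * 240 * 3                           # 230,400 numbers
raw_large = 640 * 480 * 3                           # 921,600 numbers

ratio_small = raw_small // histogram_numbers        # ~600x more data
ratio_large = raw_large // histogram_numbers        # 2,400x more data
```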

It is fun to speculate how this strategy for storing image statistics rather than the images themselves might relate to human and animal perception and memory. For example, imagine in your mind's eye what it's like to be in a forest. Do you see a photographically detailed image of each tree? Or is the image something fuzzier: a certain mixture of green and brown, a pattern of vertical and horizontal shapes (trunks and branches), a patchwork of hard-to-describe textures (leaves, needles, bushes), and so on? We'll have more to say along these lines in a later article that includes shape and texture statistics in addition to color.

Getting back to our color histograms and the robot's meanderings around the house. Let's have the robot take 5-10 pictures per room (depending on the size of the room) and store the histograms of each image in a database along with the name of the room in which the picture was taken. We'll start with six rooms: dining room, living room, kitchen, hallway, foyer and balcony. The balcony is not really a room but it allows us to see how the robot performs in an outdoor setting as well as indoors.

After storing the histograms of all the reference images, we'll take the robot for a second tour of the various rooms (the testing phase), place him in random positions in those rooms, and ask him to tell us the name of the room he's in. How does he decide?

Classified Information

All creatures great and small must make continual discriminations between objects, scenes and situations. How else could we know that we are in the kitchen and not in the living room? Or that this slice of green bread might make me sick. In addition to receiving raw sensory data, the brain must decide which patterns of data belong in the same category. In the field of robotics, this process is called pattern classification. Furthermore, sensory information is rarely classified in its raw form: instead, the original data is preprocessed to extract certain features that characterize the information more succinctly. Good features also tend to be more stable than the raw data under a variety of conditions, such as changes in lighting or viewing angle. Typical features used in visual image processing include edges, line segments, colored blobs, basic shapes, texture patches and so on.

Color histograms can also be thought of as a kind of feature of a visual image. Not only do the histograms condense the information so that it is easier to store in memory, but histograms vary less dramatically under different conditions, such as different viewing positions, than do the raw pixel values. For example, if we rotate an omni-directional image, the raw pixels move around the image, but the count of those pixels remains the same, so the histogram is left unchanged. In other words, color histograms are insensitive to the spatial arrangement of the image pixels. The downside of this invariance is that many different images can result in the same or similar histogram. We saw this earlier where the hue histograms of images of the dining room and living room were very similar. The hue histograms of these two rooms are similar because the histograms do not encode the spatial distribution of the colors, just their overall count. But this also matches our perceptual experience that the two rooms really do look similar in terms of overall color.
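We can check this rotation invariance with a toy unwrapped panorama. Rotating the robot corresponds (roughly) to shifting the unwrapped image horizontally, which `np.roll` simulates here; the random values stand in for a real hue channel:

```python
import numpy as np

rng = np.random.default_rng(0)
panorama = rng.integers(0, 256, size=(40, 160))  # toy unwrapped hue channel

hist_before, _ = np.histogram(panorama, bins=64, range=(0, 256))
rotated = np.roll(panorama, shift=53, axis=1)    # the robot "turns"
hist_after, _ = np.histogram(rotated, bins=64, range=(0, 256))
# hist_before and hist_after are identical, bin for bin
```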

Once we have a set of features that can summarize a given image, we need to build a classifier that can categorize these features into different groups or classes. In the case of our color histograms, a classifier will take a given picture, compute the six histograms we have been using, then use those histograms to tell us what room the picture was taken in. You can imagine doing this yourself, though it might not be easy. First you'd have to look at the six different histograms from a number of pictures taken in each of the rooms. Then you'd have to find similarities between the histograms from the same room as well as key differences between histograms computed for different rooms. With enough practice, you might be able to tell which room you were considering just by looking at six of its histograms.

So how do we build a classifier that can match histograms with room names? Fortunately, many people have worked for decades on this problem and there are many solutions depending on the type of data you are dealing with and your goals. The general problem can be framed as mapping a set of input values into a set of output values. In our case, the set of input values are the bin counts in our six histograms, while the desired output values can be thought of as 1's and 0's where we assign a 1 to the correct room and a 0 to all the incorrect rooms. Mathematically, these input values and output values can be represented as vectors and the mappings between them can be represented as matrices or other operators. In our case, the input vectors have 6x64 = 384 elements (the histogram bin counts) while the output vectors have 6 elements, one for each room. The challenge is to find a mapping from histogram values as inputs to the correct room values as outputs.

In the next few sections, we will illustrate three different types of classifiers based on three different mapping strategies from inputs to outputs: prototypes, nearest neighbors and neural networks.

Classification by Prototypes

Perhaps the easiest classifier to build and understand is the prototype classifier. The idea is quite simple: during the training phase, take all the histograms for a given room and average them. In other words, if we have 10 pictures of the living room, take the 10 different hue histograms and average them together to form one representative hue histogram for the whole collection. Do the same thing for the other five histogram channels (saturation, lightness, red, green, blue), resulting in six average histograms. Averaging input vectors is often referred to as forming prototypes, so we will call this method the prototype classifier.

The idea behind averaging is that the common features across the histograms for the same room will be enhanced while the parts that do not overlap will be diluted. In theory, this should result in a histogram that better matches other images taken from the same room. So how well does it work?
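The averaging step itself is tiny. Here is a sketch with toy 4-bin histograms (invented values, not real room data):

```python
import numpy as np

def make_prototype(histograms):
    """Average a room's training histograms into one representative
    (prototype) histogram for that room."""
    return np.mean(histograms, axis=0)

# Two toy 4-bin hue histograms taken in the same room:
h1 = np.array([8.0, 2.0, 0.0, 0.0])
h2 = np.array([6.0, 4.0, 0.0, 0.0])
proto = make_prototype([h1, h2])   # -> [7.0, 3.0, 0.0, 0.0]
```

Notice how the shared mass in the first bin stays strong while the disagreement between bins one and two is smoothed out, which is exactly the enhancement-and-dilution effect described above.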

Before describing the results, we'll give a brief description of the methodology. The steps in the experiment go like this:

  • Take an initial set of 5-10 pictures in each of the six rooms. (In the results below, there were 38 pictures taken during this phase.)

  • Compute the six different histograms for each of these pictures.

  • Compute the average or prototype histograms for each room, one for each color channel.

  • Take 3-5 new test pictures in each of the rooms. (In the results below, there were 22 test pictures.)

  • Compute the six different histograms for each of these new pictures.

  • Compare the histograms to the stored prototype histograms for each room. Each histogram channel (hue, saturation, etc.) gets a vote as to which prototype (and thus which room) best matches the test histogram. Whichever prototype gets the most votes gets to label the image by its room name. A confidence level is also computed and is defined as the number of votes for the winning room divided by the total number of votes possible (six in this case). For those of you wondering how we compare two histograms, the method used here is called the Jeffrey Divergence which essentially treats the two histograms as probability distributions. But one could also simply use Euclidean distance.
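For reference, here is a sketch of the Jeffrey Divergence computation. The small epsilon guarding the logarithm for empty bins is my own hedge for numerical safety, not necessarily what the original code does:

```python
import numpy as np

def jeffrey_divergence(h1, h2, eps=1e-12):
    """Jeffrey Divergence between two histograms. Each histogram is
    normalized to sum to 1 so it can be treated as a probability
    distribution. Identical histograms give 0; larger means less alike."""
    p = h1 / h1.sum()
    q = h2 / h2.sum()
    m = (p + q) / 2.0
    # eps avoids log(0) when a bin is empty
    return float(np.sum(p * np.log((p + eps) / (m + eps)) +
                        q * np.log((q + eps) / (m + eps))))
```

The channel whose prototype yields the smallest divergence casts its vote for that prototype's room.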

So how well does the prototype classifier do? The table below shows the classification results for the 22 test pictures. The first room name in each row is the correct classification of the test picture while the second room name after the arrow is what our classifier thinks it is. The number at the end of each row is the confidence of the decision: a value of 1.0 would be 100% confident (6 out of 6 votes), while 0.5 would be 50% confident (3 out of 6 votes), and so on.

balcony ==> balcony: 0.97
balcony ==> balcony: 0.98
balcony ==> balcony: 0.59
balcony ==> balcony: 0.64
dining room ==> dining room: 0.8
dining room ==> dining room: 0.97
dining room ==> dining room: 0.95
dining room ==> dining room: 0.32
foyer ==> foyer: 0.97
foyer ==> foyer: 0.8
hallway ==> hallway: 0.63
hallway ==> hallway: 0.96
hallway ==> hallway: 0.64
hallway ==> hallway: 0.65
kitchen ==> kitchen: 0.79
kitchen ==> kitchen: 0.97
kitchen ==> kitchen: 0.8
living room ==> living room: 0.94
living room ==> living room: 0.97
living room ==> living room: 0.49
living room ==> living room: 0.81
living room ==> dining room: 0.48

TOTALS: 21 Correct; 1 Error = 95% correct


As you can see, this method made only one error (the last row), where it confused the living room with the dining room for that picture. There were also two other questionable classifications where the room name is correct but the confidence is low, meaning that other rooms also got a significant portion of the votes.

Classification by Nearest Neighbor

The next classifier uses a technique known as the nearest neighbor algorithm. This time, instead of averaging the histograms for each room, we store all the histograms individually in our database. When we are presented with a test image, we compute its six histograms and compare them to all the stored histograms to find out which earlier picture best matches the current one. In other words, rather than comparing the test histograms to prototype histograms, we compare them to each of the individually stored histograms. This requires a much larger number of comparisons, but it has the advantage that once we choose the best match, we also know which picture it best matches. For example, if we take 5 pictures in different locations in the dining room during the training phase, then take a test picture also in the dining room, the nearest neighbor algorithm can tell us which of the 5 stored pictures is the best match. In theory this means we might also know roughly where in the dining room we are. (We'll explore such a possibility in a later article.)
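A sketch of the nearest neighbor lookup, using plain Euclidean distance and two-bin toy histograms for brevity (the actual setup compares full histograms with the Jeffrey Divergence and takes a vote across the six channels):

```python
import numpy as np

def nearest_neighbor(test_hist, database):
    """Return the room label of the stored histogram closest to
    `test_hist`. `database` is a list of (histogram, room) pairs."""
    best_label, best_dist = None, float("inf")
    for stored_hist, label in database:
        dist = np.linalg.norm(test_hist - stored_hist)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

db = [(np.array([9.0, 1.0]), "kitchen"),
      (np.array([2.0, 8.0]), "hallway")]
guess = nearest_neighbor(np.array([8.0, 2.0]), db)  # -> "kitchen"
```

Because the winning entry is a specific stored picture, the same loop could also return that picture's identity, which is what would let us estimate position within a room.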

Here now are the results of using the nearest neighbor classifier on our 22 test pictures.

balcony ==> balcony: 0.98
balcony ==> balcony: 0.99
balcony ==> balcony: 0.64
balcony ==> balcony: 0.98
dining room ==> living room: 0.48
dining room ==> dining room: 0.98
dining room ==> dining room: 0.81
dining room ==> dining room: 0.8
foyer ==> living room: 0.64
foyer ==> foyer: 0.63
hallway ==> hallway: 0.96
hallway ==> hallway: 0.81
hallway ==> hallway: 0.64
hallway ==> hallway: 0.49
kitchen ==> kitchen: 0.64
kitchen ==> kitchen: 0.65
kitchen ==> kitchen: 0.64
living room ==> living room: 0.64
living room ==> living room: 0.97
living room ==> living room: 0.65
living room ==> living room: 0.96
living room ==> living room: 0.97

TOTALS: 20 Correct; 2 Errors = 91% correct


The nearest neighbor classifier made one more error than the prototype classifier but still correctly classified 20 out of 22 pictures. There was also a third questionable classification where the room name was correct but the confidence was a little low.

Classification using an Artificial Neural Network

The third type of classifier to be tested is called an artificial neural network or ANN. Neurons in the cortex tend to be arranged in layers, with one layer connected to another layer by many thousands of synapses. Artificial neural networks have been developed into a powerful method for mapping a set of input values into a set of output values. The input values are likened to the activity levels of neurons in one layer and the output values are taken to be the activity levels of another layer. By modifying the connections between the two layers, we can build a network that takes a set of values over the input neurons and produces a desired set of values over the output neurons. Since most of the magic in such networks takes place in the connections between layers, this kind of classifier is also known as a connectionist network.

The kind of artificial neural network we will work with here uses a process called supervised learning. The process works in a series of "training" sessions. First we randomly set the connections between the input neurons and the output neurons. Then, for each histogram corresponding to the set of training pictures, we set the activation levels of the input neurons to the bin values of the histogram. Passing these values through the network's connections results in a pattern of activity across the six output neurons (one for each room). Since we know what room the histogram corresponds to, we know which output neuron should have a value of 1 while the others should have a value of 0. Early on in the training, this won't be the case—the output neurons will have values in between 0 and 1. To teach the network to do better in the future, we alter the connections in such a way that the output activities move closer and closer to the patterns of 1's and 0's that we know is correct. The particular method used to tweak the connections is called the learning algorithm for the network and a number of such algorithms have been developed for different types of networks and problems to be solved.

The simplest kind of connectionist network is called a perceptron. A perceptron has a single layer of connections mapping the input nodes to the output nodes. The learning algorithm most often used with perceptrons is called the delta rule, because it tweaks the connections during training in a way that is proportional to the difference between the desired output values and the actual output values. Since these changes in the connections must be made a little bit at a time, the whole training set of input vectors must be run through the learning process many times, usually hundreds or even thousands of cycles, before the output values adequately match the correct values. Nonetheless, with today's computers, even thousands of training cycles can be done in seconds if not less.
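Here is a toy single-layer network trained with the delta rule (also known as the Widrow-Hoff or LMS rule). The two-bin "histograms" and two "rooms" are invented for illustration; the real network maps hundreds of histogram bins to six rooms:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[1.0, 0.0],    # "mostly red" pictures  -> room 0
              [0.9, 0.1],
              [0.0, 1.0],    # "mostly blue" pictures -> room 1
              [0.1, 0.9]])
T = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])   # one-hot targets: a 1 for the correct room

W = rng.normal(scale=0.1, size=(2, 2))  # random initial connections
rate = 0.5
for _ in range(2500):          # training cycles
    Y = X @ W                  # forward pass: output neuron activities
    W += rate * X.T @ (T - Y)  # delta rule: change proportional to error

predicted = (X @ W).argmax(axis=1)  # winning output neuron per picture
```

After training, `predicted` matches the target rooms `[0, 0, 1, 1]`: the network has learned to route each color profile to its room's output neuron.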

The figure below illustrates the kind of perceptron used in the current setup.

So now let's try our room recognition test using a perceptron classifier. First we train the network using the first 38 pictures. Then we test the network by feeding it each of the 22 test pictures and looking at its output neurons to determine the room name. The results are as follows:

balcony ==> balcony: 0.83
balcony ==> balcony: 1
balcony ==> balcony: 0.5
balcony ==> balcony: 0.83
dining room ==> dining room: 0.5
dining room ==> dining room: 0.5
dining room ==> dining room: 0.5
dining room ==> living room: 0.83
foyer ==> foyer: 0.83
foyer ==> foyer: 0.67
hallway ==> hallway: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.67
kitchen ==> kitchen: 0.83
kitchen ==> kitchen: 0.67
kitchen ==> kitchen: 0.83
living room ==> living room: 1
living room ==> living room: 1
living room ==> living room: 0.83
living room ==> living room: 0.83
living room ==> living room: 0.83

TOTALS: 21 Correct; 1 Error = 95% correct

The perceptron classifier performs as well as the prototype classifier, making only one mistake. There are, however, a few more cases where the confidence level is borderline (a number of them at 0.5). Let's see if we can do better by training the network for a little longer before testing. For the results above, the network was trained for 2500 cycles. Let's bump that up to 5000 cycles and try again. Here are the new results:

balcony ==> balcony: 0.83
balcony ==> balcony: 1
balcony ==> balcony: 0.5
balcony ==> balcony: 0.83
dining room ==> dining room: 0.67
dining room ==> dining room: 0.5
dining room ==> dining room: 0.5
dining room ==> dining room: 0.67
foyer ==> foyer: 0.83
foyer ==> foyer: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.83
hallway ==> hallway: 0.67
kitchen ==> kitchen: 0.67
kitchen ==> kitchen: 0.67
kitchen ==> kitchen: 0.83
living room ==> living room: 1
living room ==> living room: 1
living room ==> living room: 0.83
living room ==> living room: 0.83
living room ==> living room: 0.83

TOTALS: 22 Correct; 0 Errors = 100% correct

This time the classifier gets a perfect score, while the confidence levels remain roughly the same.

Summing Up

In this article we have seen how a fairly simple robot equipped with a homemade omni-directional vision system can tell what room it is in. If we imagine how we might go about such a task ourselves, we might think in terms of high level object recognition such as, “Oh, there is the stove so I must be in the kitchen.” Since object recognition is actually one of the harder things to get right when developing computer vision systems, we adopted a simpler strategy. In this approach, we treated the image as a whole and extracted the color histograms of the image along a number of different color dimensions such as hue, saturation and lightness. Since the images are omni-directional, such histograms are fairly resilient to rotations and small translations of the robot and therefore make good candidates for characterizing a particular room. When trying to figure out what room it is in, our robot might say something like, “There is a lot of white in this image so I must be in the kitchen.” We then tested three different classification algorithms for learning and discriminating between histograms: prototypes, nearest neighbor comparisons, and an artificial neural network called a perceptron. Although the neural network classifier was able to score 100% on the room recognition test, the prototype classifier came close at 95% correct and is much simpler to implement.

In a future article, we will take a look at additional feature histograms to further refine place recognition. For example, one can count up all the edges in the image that are oriented at a certain angle. Or one can count up the number of times color blobs change from one color to another. It will probably turn out that we can go a long way toward recognizing our current environment before we have to actually start recognizing and naming particular objects.

References

This experiment was mainly inspired by the following article:

Ulrich, I., and Nourbakhsh, I., "Appearance-Based Place Recognition for Topological Localization", IEEE International Conference on Robotics and Automation, San Francisco, CA, April 2000, pp. 1023-1029. Best Vision Paper Award.

Software References

Aforge.Net - Image Processing and Machine Learning

EmguCV - A .NET version of OpenCV that also includes algorithms for machine learning.

RoboRealm - Vision for Machines

Appendix

  • Delta Rule for artificial neural networks: See the good explanation at http://en.wikipedia.org/wiki/Delta_rule

  • Jeffrey Divergence for comparing histograms: More often than not, when we want to compute the distance between two n-dimensional vectors x and y, we use the Euclidean metric:

    d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

    In words, this says that the distance is the square root of the sum of the squares of the pair-wise differences between the individual vector components. As mentioned in the main article, this distance measure is not necessarily the best when comparing two color histograms. The reason is that adjacent bin values in a histogram tend to be correlated since they represent similar shades of the same color. For this reason, a different kind of distance metric is used, called the Jeffrey Divergence or JD.

    First we normalize our histograms by dividing each bin count by the total count so that the bins sum to one. This allows us to treat the new bin values, which are now all between 0 and 1, as a kind of probability distribution for finding various colors in the image. As it turns out, a good way to measure the distance between two probability distributions p and q is the Jeffrey Divergence formula:

    JD(p, q) = \sum_{i} \left[ p_i \ln\frac{2 p_i}{p_i + q_i} + q_i \ln\frac{2 q_i}{p_i + q_i} \right]

    where ln is the natural logarithm. This is the formula used in the main article for computing the distance between two histograms. For a complete explanation, please see http://en.wikipedia.org/wiki/Kullbac...ler_divergence.