Object Recognition Group 2010s1
From Experimental Robotics
Weekly meetings and working sessions were held at the UNSW Robolab on Tuesdays and Fridays. The group members were:
- Bhuman Soni (3318300)
- Marius (3304298)
- Martin Åberg (3322180)
- Michael Gratton (3248510)
Abstract
In this experiment we implement the Speeded-Up Robust Features algorithm (SURF, as provided by OpenCV) and the SiftGPU algorithm for the purpose of real-time object detection by a robot in a RoboCup@Home environment. The system is called ORA (Object Recognition Algorithm), an umbrella term for the implemented system. The robot is trained with a set of reference images of objects such as cups, Coke cans and bottles, and a magazine, after which it is deployed in an environment where it can detect the presence of these objects. Positive identifications are presented as a bounding box drawn around the object as seen through the robot's camera, and a marker is placed at the position of the object on a map of the environment. The results of the experiment show that object detection is accurate provided the robot is moving at a relatively slow pace. After the initial tests the algorithm was extended to incorporate real-time learning of new objects, for which the early results were promising. We also present a comparison between the CPU and GPU versions of the object recognition algorithm.
Background
As a part of the RoboCup initiative, RoboCup@Home aims to develop assistive robots, i.e. domestic robots that provide assistance in daily activities at home. As with other RoboCup leagues, RoboCup@Home sees the competing robots carry out a set of activities in an established environment, i.e. a set of tests upon which the performance of the competing robots is evaluated. The environment in this case is a model of a home with a set of objects relevant to a home environment.
The aforementioned activities often involve the robot interacting with the objects in its environment, and therefore the robot's ability to identify these objects plays an important part in its overall success in the tournament. The problem of object recognition in robotics can be traced back to the origins of robot vision. Despite recent advances, however, object recognition in robotics is still very slow and unsuitable for real-time performance (Russakovsky, 2010).
In this study we aim to devise robot vision software that identifies and memorizes the location of cups in a given RoboCup@Home environment. Object detection in robots is achieved with a wide variety of algorithms, depending on the researcher's preference. For the purpose of this study we employ the Scale-Invariant Feature Transform, SIFT (Lowe, 1999).
The next section contains a background on object recognition in computer vision, followed by implementation details, testing strategies, a discussion of the results and a conclusion.
The group is aiming to have the Pioneer robot autonomously navigate through a maze representing a standard RoboCup@Home environment and identify known objects using its on-board camera.
Previous work
To understand the object recognition algorithm implemented, a thorough understanding of the following algorithms and papers is essential:
- Scale-invariant feature transform, SIFT, is an algorithm developed by David Lowe to extract distinctive invariant features from images for later recognition of objects in different views or scenes. For more detailed information about the algorithm, see Distinctive Image Features from Scale-Invariant Keypoints by David G. Lowe, January 5, 2004.
- The Open Computer Vision library OpenCV implements a similar scale- and rotation-invariant feature point extraction algorithm, called SURF (Speeded-Up Robust Features). For more detailed information about the algorithm, see Speeded-Up Robust Features (SURF) by Herbert Bay, Andreas Ess, Tinne Tuytelaars and Luc Van Gool, 15 December 2007.
Other notable experiments that tackle object recognition are:
- Russakovsky et al (2010) proposed an algorithm that identifies multiple objects within a scene by segmenting the image and carefully choosing a set of segments that contain objects of interest. Once the objects of interest are found they apply the set of object classifiers to determine the object in the segmented area. The novelty of the project lies in the strategy of choosing segments of interest for which they adopt a Steiner tree approach. They use Steiner trees to determine the best parameter settings for object detection, i.e. for a set of possible segmentation windows they determine which one at best contains the object of interest. Their tests conclude faster object detection by the algorithm, however their proposed scheme is only tested on static images.
- Loncomilla & Ruiz-del-Solar (2007) implement an improved version of the SIFT algorithm for the purpose of object detection in a robot. Their proposed approach combines Lowe's approach for object recognition with a series of hypothesis rejection tests. Once the object is identified using SIFT, it is passed through a set of stages whereby each stage determines the validity of the identified image, i.e. whether it has been correctly identified. A similarity distribution of the reference and the target image is calculated using the Hough transform. They test their approach in both RoboCup soccer and RoboCup@Home scenarios with varying degrees of success.
Scale Invariant Feature Transform (SIFT)
David Lowe (1999) introduced an object recognition method that worked by extracting a range of object features from an image while being invariant to the image scale, rotation, orientation, affine distortion and partially invariant to illumination changes. A more technical description of the SIFT algorithm is available in Lowe's original paper; in this section we provide a high-level description.
SIFT employs a supervised learning approach whereby the algorithm must be trained with a set of reference images for the object that it needs to identify. During training SIFT extracts a set of features from the reference images and stores it in a feature database. Each set of features uniquely identifies an object in the image. Hence, in order to train SIFT to recognize a particular object, a set of images of the object taken from a range of angles should be used as a reference set to train the algorithm. The reference images are segmented so they only contain a picture of the object to be recognized. Once trained, the algorithm can identify an object by comparing the features of the target image with the features of the reference image in the database. The term training is used very loosely in this description: unlike a neural network, the algorithm is not trained. Rather, it stores a unique identifier for a particular object in an image, which it can compare against other images to check whether the object is present in them.
Overall a set of features for a particular object can be thought of as the object's signature, hence SIFT recognizes the presence of that object in an image when it successfully matches the target image signature of that object with a reference signature of that object in the SIFT database.
Implementation
Several different software components were used during the project to implement the recognition task. An offline training component allowed objects of interest to be learnt prior to attempting recognition, without requiring use of the Pioneer robotics platform. When performing object recognition, a background task was executed by RobotServ (part of the CASRobot platform) which acquired frames from the robot's on-board camera and analysed them, looking for objects present in the images. This recognition task then transmitted any objects found to RobotGui for presentation to a human operator. An analysis component was also used for evaluating and comparing algorithm performance. These components are illustrated in the diagram below.
The high-level components used a common set of classes written in C++ for feature extraction, feature matching, input/output and to represent recognised objects. Key classes included:
- Feature, Keypoint - Represents an extracted local feature. Features contain both a key point - a vector located in the image and having both a magnitude and direction - and a 128 element array of descriptors used for feature matching.
- Sift - Abstract base class for implementations of local feature extraction algorithms. Notable derived classes include OpenCvSurf and SiftOnGPU.
- FeatureDatabase - Abstract base class for implementations of feature matching algorithms. Notable derived classes include KdTreeDatabase.
- ObjectMatch - Represents matched objects. Includes the name of the object, its bounding box and other useful information.
- RecognitionTask - A CASRobot background task that performed online recognition.
- ObjectRender - A renderer for the RobotPanel view in RobotGui that displayed objects matched by the recognition task.
The ExtractFeatures component used implementations of Sift to extract local features (instances of Feature) from training images. These were saved into the common on-disk format used by the feature database. The RecognitionTask component also used Sift objects to extract features from the scene being observed by the robot. Features from the scene were then matched to features extracted from the training images by FeatureDatabase. This produced ObjectMatch instances, which were then published for use by other CASRobot components. ObjectRender was one such component, however other, future services could also use the matches - a planning component, for example.
This object model made it possible to use different implementations of key algorithms for comparison and testing. The choice was configurable through the RobotServ and RobotGui configuration files, or through command line options in the offline tools.
Feature Extraction
Two local feature extraction algorithms were used in this experiment. OpenCV provides an implementation of SURF that was used as the baseline algorithm since it was straightforward to implement and deploy.
Changchang Wu at the University of North Carolina at Chapel Hill has produced a GPU implementation of SIFT (Wu, 2010). This implementation requires a graphics card with a large amount of on-board memory (approximately 512MB or higher) and support for dynamic branching, which would most likely be available on any nVidia graphics card. In order to integrate Wu's implementation as a part of ORA, the required libraries were installed and the images passed to the extraction method were converted to the format required by the GPU.
Both the number and the robustness of the feature points were expected to be greater when generated with SIFT rather than with SURF. Running SIFT was also expected to be a lot slower than running SURF, which was why it was desirable to run it on the GPU in order to increase the performance.
Training
The recognition task requires a database of features that can be compared to those extracted from a scene by the robot's on-board camera. To ensure this database contained only features from the objects to be recognised, a set of ideal training images was produced and post processed so that only the image region containing the object itself was considered when extracting features.
Training images were obtained from the robot itself, to ensure the training features would likely be similar to those obtained during actual recognition. The objects were placed against a reasonably uniform green background and images of the object from at least four different perspectives were captured. The green background allowed for automated segmentation to produce the background mask (using the same simple process outlined in Online Learning, below), although manual touch-up of areas missed was also performed. The mask was stored as an 8-bit alpha channel in the training images. This allowed the training image and mask to be loaded from the same file, as well as facilitating manual touch-up using standard image processing tools such as the GNU Image Manipulation Program (GIMP). Below is a comparison of a captured image and the post-processed training image.
Several objects were used for training, including cups, mugs, bottles, cans and posters. The captured images can be found in SVN in images/raw, while the post-processed images used for training are in images/training. Examples of the raw images are given below.
Cups and Mugs
Several different mugs and cups were used at different stages of the project. We soon realized that simple mugs (like the one in the left-most image) had too few features for the algorithm to recognize them. We retrained the program with mugs and cups which had pictures, textures and logos, i.e. they had more features, which would be easier for SIFT to recognize. We also found that the mugs had to be large, so they could be recognized from a greater distance.
Coke Can and bottles
We trained the algorithm to recognize a Coke bottle, a Coke can and a Fanta bottle. We ran into trouble with these objects, as the reflective surface of the bottle combined with changing light conditions created inconsistent features.
Poster
One of the objects best identified by ORA was a magazine cover. The magazine cover is large and contains several large features, which allowed ORA to identify it from a greater distance than other objects.
K-Dimensional Database
The database we used for storing and matching images was implemented using k-dimensional tree search and a transformation matching algorithm. The database is made up of several objects, each represented by several images of that object. For example, a Coke bottle is a single object in the database, and the Coke bottle is photographed from several angles. Each image is then run through the SIFT algorithm to extract the feature points. A k-dimensional tree is built from these feature points and added to the database. The construction of a kd-tree is described in An introductory tutorial on k-d Trees.
The database is searched using the features of a test image. The best match is found for each object by comparing the matched features in each image of that object. To match a test image to a database image, each test feature (TF) is matched with a feature, the best feature (BF), in the kd-tree using the nearest neighbour kd-tree search. The TF and the BF are stored in a FeatureMatch along with the distance between them. We only consider FeatureMatches with a low distance, and discard the ones where the distance is too large. The FeatureMatches are then sorted by distance, so the matches with the lowest distance appear first. Each FeatureMatch is then paired with other FeatureMatches of similar distance. By sorting the FeatureMatches, we ensure that a good match (low distance) is paired with another good match. For each Pair we calculate the translation, rotation and scaling that must be applied to the test image to map it onto the database image. The calculation of the transformation is described in this link.
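The tree construction and the nearest neighbour lookup can be sketched as follows. This is a minimal Python illustration using only the standard library, not the project's C++ KdTreeDatabase; the names KdNode, build_kdtree and nearest are hypothetical, and the search shown is exact rather than one of the approximate strategies often used for high-dimensional descriptors.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class KdNode:
    """One node of the tree: a point plus the axis it splits on."""
    point: Sequence[float]
    axis: int
    left: Optional["KdNode"] = None
    right: Optional["KdNode"] = None


def build_kdtree(points, depth=0):
    """Recursively split the points on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KdNode(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


def nearest(node, query, best=None) -> Tuple[float, Sequence[float]]:
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    distance = math.dist(node.point, query)
    if best is None or distance < best[0]:
        best = (distance, node.point)
    # Search the side of the splitting plane that contains the query first.
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Only cross the plane if a closer point could lie on the other side.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


# Toy 2-D "descriptors"; real SIFT descriptors have 128 elements.
descriptors = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(descriptors)
distance, match = nearest(tree, (9.0, 2.0))
```

The returned distance plays the role of the FeatureMatch distance: a caller would keep the match only if it falls below a chosen cut-off.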
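One plausible way to obtain the translation, rotation and scaling from a single Pair is to treat the two matched points as defining a similarity transform. The sketch below is an assumption about that calculation rather than the project's code; similarity_from_pair and apply_similarity are hypothetical names, t1 and t2 are the two test-image points, and d1 and d2 are the database-image points they matched.

```python
import math


def similarity_from_pair(t1, t2, d1, d2):
    """Estimate (scale, rotation, translation) mapping test points onto database points."""
    tvx, tvy = t2[0] - t1[0], t2[1] - t1[1]
    dvx, dvy = d2[0] - d1[0], d2[1] - d1[1]
    test_length = math.hypot(tvx, tvy)
    if test_length == 0:
        raise ValueError("the two test features must be at different positions")
    # Scale is the ratio of the segment lengths, rotation the difference of their angles.
    scale = math.hypot(dvx, dvy) / test_length
    rotation = math.atan2(dvy, dvx) - math.atan2(tvy, tvx)
    rotation = math.atan2(math.sin(rotation), math.cos(rotation))  # wrap into [-pi, pi]
    # Translation is whatever remains after scaling and rotating the first test point.
    c, s = math.cos(rotation), math.sin(rotation)
    tx = d1[0] - scale * (c * t1[0] - s * t1[1])
    ty = d1[1] - scale * (s * t1[0] + c * t1[1])
    return scale, rotation, (tx, ty)


def apply_similarity(transform, point):
    """Map a test-image point into database-image coordinates."""
    scale, rotation, (tx, ty) = transform
    c, s = math.cos(rotation), math.sin(rotation)
    x, y = point
    return (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)


transform = similarity_from_pair((0.0, 0.0), (1.0, 0.0), (10.0, 5.0), (10.0, 7.0))
```

Pairs whose estimated scale, rotation and translation are close to one another are candidates for the same grouping.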
The Pairs are now grouped with other Pairs with similar transformation values. Each PairGroup will have a list of Pairs and, implicitly, a list of TFs. The TFs represent the points in the test image, and the Pairs represent lines between each point. We can use these numbers to calculate the confidence of that PairGroup representing the object at that position in the image by calculating the graph completeness (number of lines divided by number of possible lines). To improve the confidence we discard any TF, and the Pair, that is present in only one Pair. This gets rid of spurious matches that would otherwise decrease the confidence of the match. In addition we discard any PairGroup (after the previous step) with too few Pairs. This is because a graph with very few points will easily get a high confidence (e.g., a graph with only 3 points and 3 lines has 100%). Finally we find the PairGroup with the highest confidence and calculate the bounding box representing that group. The returned ObjectMatch contains the confidence, a list of TestFeatures that matched the object (to be displayed on screen), the bounding box and the transformation values.
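The graph completeness measure can be sketched in a few lines. In this hypothetical Python version each Pair is a tuple of two test feature identifiers, the pruning of features that occur in only one Pair is done in a single pass (the text does not say whether it is repeated), and min_pairs is an assumed name for the "too few Pairs" cut-off, whose real value is not given.

```python
from collections import Counter


def group_confidence(pairs, min_pairs=3):
    """Graph completeness of a PairGroup: lines present / lines possible."""
    unique_pairs = {tuple(sorted(pair)) for pair in pairs}
    # Count how many Pairs each test feature takes part in.
    degree = Counter(feature for pair in unique_pairs for feature in pair)
    # Drop features seen in only one Pair, together with that Pair.
    kept = [p for p in unique_pairs if degree[p[0]] > 1 and degree[p[1]] > 1]
    if not kept or len(kept) < min_pairs:
        return 0.0  # too small a graph to be trusted
    points = {feature for pair in kept for feature in pair}
    possible = len(points) * (len(points) - 1) // 2
    return len(kept) / possible


# A fully connected triangle plus one spurious match hanging off feature 2.
confidence = group_confidence([(0, 1), (1, 2), (0, 2), (2, 3)])
```

With the default cut-off the triangle alone still scores full confidence, which is the weakness the text describes; a larger min_pairs would reject it.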
Bounding-Box
An early approach to generating a bounding-box was to find the maximum and minimum x and y coordinates of all the matched feature points, and use those to create the box. This was found to be ineffective because only the matched section of the object was marked, i.e. the box could enclose the object partially instead of fully.
An improved method was used instead. The bounding-box was calculated for the training image using the first approach. The training image has only features representing the object, so the bounding-box for this image will be accurate. This bounding-box is stored along with the object in the database. When a match is found, the bounding-box is scaled and translated based on the scaling and translation of the match. ORA is able to match multiple objects in the same image, and can draw a bounding-box around each object.
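Both steps are simple enough to show directly. The following Python sketch uses hypothetical function names, represents a box as (min x, min y, max x, max y), and assumes the scale and translation are expressed in the direction that takes training-image coordinates to scene coordinates.

```python
def bounding_box(points):
    """Axis-aligned box around feature points, as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def transform_box(box, scale, translation):
    """Scale and translate a stored training box into the observed scene."""
    min_x, min_y, max_x, max_y = box
    tx, ty = translation
    return (min_x * scale + tx, min_y * scale + ty,
            max_x * scale + tx, max_y * scale + ty)


# Box computed once from the training image's features...
training_box = bounding_box([(10, 20), (30, 25), (15, 60)])
# ...and reused for a match that appears at half size, shifted by (100, 50).
scene_box = transform_box(training_box, scale=0.5, translation=(100, 50))
```

Because the box comes from the training image, it covers the whole object even when only a few of its features are matched in the scene.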
Map Markers
Once the object is identified the robot places a marker in the map identifying the location of the object in the context of the map. In the current state of the CASRobot code, map markers are placed on a map by building an object of the class Snap and passing it to the Robot::addSnap method. However, the current mechanism relies on obtaining data from the Swiss Ranger camera (the camera that returns a point cloud of the target object), which is not present on the Pioneer robot. The position of the target object is therefore calculated separately, from the laser readings.
The distance to the object was calculated by measuring the length of the laser ray that is closest to the location of the object in the world. Once the distance was obtained, the position of the object was determined in the sensor frame of reference. However, there is a bug in the code which results in a marker being placed at the wrong location when the camera is panned away from the direction in which the robot is moving. The probable causes are:
- The value of the camera's pan is not being added to the position of the camera.
- The camera kinematics have not been correctly defined in the server configuration file.
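The geometry involved can be sketched as below. This is a hypothetical Python illustration, not the CASRobot code: it assumes the laser and camera sit at the robot's origin, that laser rays are evenly spaced starting at laser_start radians, and that the object's bearing within the image is already known. It shows where the camera pan would enter the calculation.

```python
import math


def object_position(robot_pose, camera_pan, bearing_in_image,
                    laser_ranges, laser_start, laser_step):
    """Estimate the world (x, y) of an object from the closest laser ray.

    robot_pose is (x, y, heading); all angles are in radians.
    """
    x, y, heading = robot_pose
    # Bearing of the object relative to the robot: the pan must be included.
    bearing = camera_pan + bearing_in_image
    # Pick the laser ray whose direction is closest to that bearing.
    index = round((bearing - laser_start) / laser_step)
    index = max(0, min(len(laser_ranges) - 1, index))
    distance = laser_ranges[index]
    # Convert from the sensor frame to the world frame.
    world_angle = heading + laser_start + index * laser_step
    return (x + distance * math.cos(world_angle),
            y + distance * math.sin(world_angle))


# 181 rays covering -90 to +90 degrees; only the left-most ray reads 3 m.
ranges = [2.0] * 181
ranges[180] = 3.0
panned = object_position((1.0, 2.0, 0.0), math.pi / 2, 0.0,
                         ranges, -math.pi / 2, math.pi / 180)
straight = object_position((1.0, 2.0, 0.0), 0.0, 0.0,
                           ranges, -math.pi / 2, math.pi / 180)
```

With the pan omitted, an object seen by a camera turned 90 degrees to the left would be placed straight ahead of the robot instead, which matches the symptom described.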
Early implementations of ORA resulted in a lot of false positives which affected the marking mechanism as a marker was being placed for each false identification. The following algorithm was implemented to solve the above problem:
- Declare three threshold values: distanceThreshold, noOfPotentialMarkers and noOfMarkersMatched.
- Upon identifying an object, store the marker for that object in a queue of potential markers.
- Limit the number of potential markers held in the queue to noOfPotentialMarkers (10 in our case).
- Whenever a potential marker is added to the queue, calculate the distance between the new marker and all the other markers in the queue. If the number of queued markers closer than distanceThreshold reaches noOfMarkersMatched (4 in our case), the potential marker is added to the map.
- Maintain a list of the markers added to the map in a queue of "addedMarkers" for the duration of the robot's run.
- Whenever a new potential marker is identified, compare its distance to each marker in the "addedMarkers" queue. Only if it is further than distanceThreshold from all of them is it subjected to the above steps and added to the list of potential markers.
The class SnapCollector is responsible for managing map markers.
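The filtering steps listed above can be sketched as a small class. This Python version is an interpretation of the description, not a port of SnapCollector; MarkerFilter and observe are hypothetical names, markers are plain (x, y) tuples, and the defaults of 10 and 4 are the values quoted in the text.

```python
import math
from collections import deque


class MarkerFilter:
    """Only promote a detection to the map after repeated nearby sightings."""

    def __init__(self, distance_threshold, max_potential=10, matches_required=4):
        self.distance_threshold = distance_threshold
        self.matches_required = matches_required
        # Oldest potential markers fall out once the queue is full.
        self.potential = deque(maxlen=max_potential)
        self.added = []

    def observe(self, marker):
        """Record a detection; return True if it was added to the map."""
        # Ignore detections of objects that are already marked on the map.
        if any(math.dist(marker, m) <= self.distance_threshold for m in self.added):
            return False
        # Count earlier potential markers that agree with this one.
        support = sum(1 for m in self.potential
                      if math.dist(marker, m) < self.distance_threshold)
        self.potential.append(marker)
        if support >= self.matches_required:
            self.added.append(marker)
            return True
        return False


marker_filter = MarkerFilter(distance_threshold=0.5)
# Five sightings of the same object, each a centimetre apart.
outcomes = [marker_filter.observe((1.0 + 0.01 * i, 1.0)) for i in range(5)]
```

A single false positive never gathers enough nearby support before it drops out of the queue, so it does not reach the map.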
Online Learning
The project primarily focused on offline training, however since many tasks in RoboCup@Home require the robot to be able to learn objects for future recognition, online learning of objects was also investigated. Many of the components required to implement online learning already existed, but it is a more difficult task – while the training images could be manually post-processed after capture ensuring only features from the object itself were obtained, this was not possible in a real-time environment.
The recognition task, running as part of RobotServ, can be signalled to switch from the default recognition mode to a learning mode. In practice this was triggered by the operator, but it could also be triggered by voice recognition or some other task. Once in learning mode, the task captured ten successive frames, allowing time for different perspectives of the object to be presented. The frames were pre-processed by performing basic segmentation, then features were extracted using the configured feature extraction algorithm. The features from each frame were added as different aspects of the same group, and given a timestamp as a unique object name. After capturing ten frames the recognition task automatically returned to recognition mode. It was assumed that the object would be presented so that it was fully visible in the frame and rotated fast enough to allow different aspects to be learnt.
To avoid learning features from the scene's background, a simple, two-phase segmentation process was used. The first phase used the OpenCV cvPyrMeanShiftFiltering function that implements a mean-shift colour clustering filter (Comaniciu 1999). This algorithm is effective in removing texture from an image while preserving a consistent colour in regions in the image. It is also computationally expensive, increasing the time required to process a frame. The second phase performed a simple flood fill, starting in diagonally opposite corners of the frame. Repeating the fill allowed for some variance in the background. The fill process produced an 8 bit mask describing the region in the image that should be considered for feature extraction. The mask was then passed to the Sift object, along with the original frame for feature extraction. This process assumed both the object would be reasonably centered in the frame and that the background was relatively uniform, both of which would be violated in most real-world applications.
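The second phase can be illustrated without OpenCV. The Python sketch below works on a small greyscale image stored as a list of rows and assumes the mean-shift filtering has already been applied; flood_fill_background, object_mask and the tolerance parameter are hypothetical, and 255 in the mask marks pixels kept for feature extraction.

```python
from collections import deque


def flood_fill_background(image, seed, tolerance, mask):
    """Clear mask pixels connected to seed whose value is close to the seed's."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        if not (0 <= r < rows and 0 <= c < cols):
            continue
        if mask[r][c] == 0:
            continue  # already marked as background
        if abs(image[r][c] - seed_value) > tolerance:
            continue  # too different from the seed: treat as object
        mask[r][c] = 0
        queue.extend(((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)))


def object_mask(image, tolerance=10):
    """Build an 8-bit style mask by filling from two diagonally opposite corners."""
    rows, cols = len(image), len(image[0])
    mask = [[255] * cols for _ in range(rows)]
    for seed in ((0, 0), (rows - 1, cols - 1)):
        flood_fill_background(image, seed, tolerance, mask)
    return mask


# Background is 20 on the left and 45 on the right; the object is 200.
frame = [
    [20, 20, 20, 45, 45, 45],
    [20, 200, 200, 200, 200, 45],
    [20, 200, 200, 200, 200, 45],
    [20, 20, 20, 45, 45, 45],
]
mask = object_mask(frame, tolerance=10)
```

In this toy frame neither fill alone removes the whole background, which is the kind of variance the repeated fill is meant to absorb.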
Test strategy
The testing procedure was split into two separate rounds of testing: static image testing and testing on the Pioneer robot.
Static image testing
The tests were conducted on a set of static images whereby an early version of ORA trained with a set of reference images was run using a simple main method to observe the results of the object recognition. Initially the training images used were captured using a high resolution camera (Nokia N95, 5 mega-pixel camera). However the algorithm trained with these images failed to identify a set of lower resolution images captured using the camera on the Pioneer robot. Hence the algorithm was retrained using a set of reference images captured using the robot. After spending days on the static image testing box, we decided to move on to real-time testing by using the robot.
Pioneer (Robot) testing
Once the static image testing produced satisfactory detection results, we had a beta version of ORA which, although incomplete, was ready for its first run. This presented a few challenges, as not all members of the group were proficient with C++ or with debugging in C++ (using gdb). Hence one member of the group was assigned to full-time testing as well as issue tracking and resolution, while the others focused on finishing and implementing improvements, namely the GPU SIFT implementation, map marking, exploring alternative detection algorithms and incorporating real-time object learning. The initial results were very poor, as the algorithm identified a near unmanageable number of false positives. This was resolved by using a thresholding technique. The Pioneer testing can be divided into two phases: non-maze testing and maze testing.
Non-Maze testing
In the non-maze testing the object was placed on a "pedestal" so that it was directly in front of, and level with, the robot's camera. This phase was introduced because we wanted to ensure the accuracy of the object detection before deploying it for the test run. The maze environment contains a number of variables that can be responsible for the failure of the object detection, the map markers or the online learning. Testing in the simpler environment kept the test independent of these extra variables, so that a failure could be isolated to a particular component.
Maze testing
This was the shortest test phase, as the aforementioned test phases ensured all the components of the algorithm worked as intended (almost) and when deployed in the maze it was only a matter of testing how well they worked with the robot autonomy. However we did encounter some issues in this stage owing to some faults inherent to the CASRobot API that prevented the robot from executing appropriate right-wall following. This was in part resolved by merging our algorithm with the autonomy module of the MAP planning group (the one that made use of the A* algorithm for wall following).
Results
Two videos illustrating the test runs performed at the UNSW Robolab were produced. They show map markers being placed in the robot's internally perceived environment, and object detection as seen from the robot's camera in the GUI.
In order to measure the performance differences between the OpenCV SURF and the SIFT GPU implementations, a simple program to generate statistics was developed (see Appendix A; Generating Statistics). In addition, each algorithm was tested using both the original images and the images scaled to twice the width and height. The generated statistics were imported into OpenOffice Calc for analysis.
Examining the accuracy of the different algorithms and image sizes, we can see below that there was not much difference. The unscaled images appear to produce slightly better accuracy, while SURF was more accurate than SIFT.
The True Positive column indicates the average rate at which an object was detected when present in the scene. False Positive indicates the average rate at which an object was reported as being present when one was not and, similarly, False Negative indicates the average rate at which no object was reported when one was present. Finally, the Correct Matches column indicates the rate of correct classification of an image when an object was reported to be present.
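To make the four measures concrete, the sketch below computes them from a list of per-image outcomes. It is a hypothetical Python illustration with invented sample records, not the statistics program from Appendix A. It assumes a false negative means an object was present but nothing was reported, and that Correct Matches is taken over all images for which some object was reported.

```python
def detection_rates(records):
    """Compute rates from (actual_object, reported_object) pairs; None means no object."""
    present = [r for r in records if r[0] is not None]
    absent = [r for r in records if r[0] is None]
    reported = [r for r in records if r[1] is not None]

    def ratio(count, total):
        return count / total if total else 0.0

    detected = sum(1 for _, rep in present if rep is not None)
    false_alarms = sum(1 for _, rep in absent if rep is not None)
    correct = sum(1 for actual, rep in reported if actual == rep)
    return {
        "true_positive": ratio(detected, len(present)),
        "false_negative": ratio(len(present) - detected, len(present)),
        "false_positive": ratio(false_alarms, len(absent)),
        "correct_matches": ratio(correct, len(reported)),
    }


# Invented outcomes purely for illustration; these are not the project's results.
sample = [
    ("cup", "cup"),    # detected and named correctly
    ("cup", None),     # missed
    ("can", "cup"),    # detected but misclassified
    ("can", "can"),
    (None, None),      # correctly reported as empty
    (None, "cup"),     # false alarm
]
rates = detection_rates(sample)
```

Under these definitions the true positive and false negative rates always sum to one for images that contain an object.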
Performance was also examined, with the average results reported below. Here, the number of features extracted did not vary between scaled and unscaled images for either algorithm, but the SIFT implementation produced significantly more features compared to SURF. SURF took somewhat more time to extract features than SIFT, while SIFT appeared to be significantly faster when working on the scaled images. Database matching in general took longer when using scaled images, and significantly longer for SIFT.
| Type | Average Feature Count | Average Feature Extraction Time (ms) | Average Database Lookup Time (ms) |
|---|---|---|---|
| SIFT Scaled | 504 | 87 | 17213 |
| SIFT Unscaled | 504 | 103 | 12532 |
| SURF Scaled | 184 | 130 | 2566 |
| SURF Unscaled | 184 | 128 | 1779 |
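Adding the extraction and lookup columns gives a rough per-frame processing time for each configuration. The short Python calculation below uses only the figures in the table above; the variable names are illustrative, and it assumes the two stages run one after the other with no other overhead.

```python
# (average feature count, extraction time in ms, database lookup time in ms)
results = {
    "SIFT Scaled": (504, 87, 17213),
    "SIFT Unscaled": (504, 103, 12532),
    "SURF Scaled": (184, 130, 2566),
    "SURF Unscaled": (184, 128, 1779),
}

# Total time per frame and the share of it spent in the database lookup.
totals = {name: extract + lookup for name, (_, extract, lookup) in results.items()}
lookup_share = {name: lookup / totals[name] for name, (_, _, lookup) in results.items()}
# Lookup cost normalised by the number of features being matched.
lookup_per_feature = {name: lookup / count for name, (count, _, lookup) in results.items()}

for name in results:
    print(f"{name}: {totals[name]} ms total, "
          f"{lookup_share[name]:.1%} in lookup, "
          f"{lookup_per_feature[name]:.1f} ms per feature")
```

In every configuration the database lookup accounts for more than 90% of the total, so the faster GPU extraction has little effect on overall throughput.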
Discussion
From the qualitative testing performed using the robot, it appears the recognition task managed to recognize certain objects reasonably well under good conditions and this is backed up somewhat by the quantitative results above. The high rate of correct matches and low rate of false positives is encouraging, meaning that when an object was recognized, it was typically correct. This is important since mis-identifying objects is obviously undesirable.
On the other hand, the true positive and false negative rates point to a need for improvement. Here, objects were not being recognised when present, which is just as clearly problematic. While observing the recognition in action, it was clear that although objects present in the scene were detected in some frames, they were not detected in others. It was also observed that objects were not detected from some perspectives, when further away from the robot, or when the robot was in motion. Some of these problems arose from infrastructure and hardware limitations: the on-board camera could only be used in low resolution due to Player bugs and Pioneer CPU and network bandwidth limitations. Others were perhaps due to the training images, or simply to limitations of the feature database and of SIFT and similar algorithms.
The comparison between SIFT and SURF yielded some interesting results. While it was true that SIFT, executed on a graphics processor, was faster than the less computationally expensive SURF when executed by a general purpose processor, the larger average number of features extracted meant that the corresponding feature database was substantially larger, requiring significantly more time to perform matching. In addition, SIFT did not appear to be significantly more accurate than SURF. This seems to indicate that SURF may well be the better choice, despite producing less robust features. It would be interesting to see whether SIFT showed increased accuracy if some of the problems above were addressed.
While it has been noted that scaling images can lead to improved features being produced, our results seem to contradict this. When working with images scaled to twice the original size, accuracy decreased slightly while feature matching time increased. This might indicate more discriminatory features being produced, increasing the search time and resulting in fewer matches. Further comparison of accuracy using scaled, low-resolution with high-resolution images (both having the same final dimensions) might shed further light on this.
The online learning process worked well under ideal conditions: when the object was presented up close, with a plain background, fully within the frame, and when several reasonable perspectives were captured. The first two conditions could be eased by improving the segmentation method, which typically failed to correctly segment against complex backgrounds and required on the order of seconds to process a frame. Other segmentation cues such as texture and motion could be used. Motion in particular would seem suited to @Home, since people commonly wave objects of interest, such as an empty beer, in a home environment. The performance could perhaps be further addressed by performing processing semi-offline, in a low-priority background thread. The last two issues could be handled through interaction with the operator or person presenting the object, by providing feedback about the quality of the learnt aspects and prompting for “another look” if confidence in the learnt data set was too low.
Lessons learned
The initial plan of this project was to have a robot traverse a corridor of a commercial workspace (a technical term for a bunch of cubicles) and count the number of cups on each desk. However, we encountered a number of issues with that aim and hence evolved our project to achieve the outcome that we did. The issues we encountered and the lessons we learned were as follows:
- Thou shalt only map walls: According to our original plan, we were trying to create a map of the corridor of the robotics lab on level 3, so that the robot could traverse locations defined on a pre-loaded map. However, the robot was unable to create a map of the corridor because the corridor walls do not extend all the way to the floor.
- Thou shalt always show me features: The cups used initially for the tests were essentially plain in color and had no distinctive designs on them, i.e. they lacked features. Upon realizing this we changed the test objects to objects which were feature rich, e.g. the Eeyore mug.
- Do not have any other algorithms before me (SIFT): While exploring alternative object recognition techniques in the last few weeks of the semester, we realized that there was simply not enough time to play around with new techniques. For example, the Haar classifier for object recognition required days' worth of training with a huge number of samples before it could be configured for any form of object recognition.
- Thou shalt not be too far away from me: Another problem we faced with ORA was that the detection distance depended on the object size: the smaller the object, the closer the robot needs to be to it. For example, to identify a small object such as a cup the robot needs to be at a distance of 50 to 60 cm for an image with a 320 x 240 resolution, whereas the distance doubles to approximately 1 metre for an image with a 640 x 480 resolution.
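The resolution dependence in the last point is consistent with a simple pinhole-camera model, in which the width of an object in pixels is proportional to the image width and inversely proportional to its distance. The following is a minimal sketch of that relationship, not ORA code; the function names, the field of view, the object width and the minimum pixel width needed for matching are all hypothetical and chosen only for illustration.

```cpp
#include <cmath>

// Horizontal focal length in pixels for an ideal pinhole camera with the
// given image width and horizontal field of view (in radians).
double focalLengthPixels(double imageWidthPx, double horizontalFovRad) {
    return (imageWidthPx / 2.0) / std::tan(horizontalFovRad / 2.0);
}

// Largest distance (in metres) at which an object of the given physical
// width still spans at least minWidthPx pixels in the image.
// Assumption: matching needs some minimum object width in pixels.
double maxDetectionDistance(double imageWidthPx, double horizontalFovRad,
                            double objectWidthM, double minWidthPx) {
    return focalLengthPixels(imageWidthPx, horizontalFovRad) * objectWidthM / minWidthPx;
}
```

With an assumed 60 degree field of view, an 8 cm wide cup and a 40 pixel minimum width, maxDetectionDistance gives roughly 0.55 m for a 320 pixel wide image and roughly 1.11 m for a 640 pixel wide image. Under this model, doubling the resolution doubles the distance, which agrees with the observation above.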
Further Work
A few pointers on the future directions for this project:
- Exhaustively test the online learning mechanism and improve it where necessary, as it incorporates real-time learning.
- Integrate it with the existing projects (Map planning and Sound localization): In the suggested project the object recognition algorithm would be trained to identify an object which produces a unique sound. Once trained, the robot would navigate through the maze using the Map planning group's A* algorithm and respond to sounds produced by the object. Upon reaching the object the robot would try to identify it and draw a bounding box upon successful identification. For example, the object recognition algorithm could be trained with a set of reference images of a toy dog that produces barking sounds, and then attempt to locate the barking dog in the maze during testing. Despite the promising concept, a potential problem in this project would be locating and successfully tracking the appropriate sound in a noisy environment.
- The implemented algorithm can potentially be modified and trained to accurately identify a human and thus aid in the person following and tracking procedure.
- Combine multiple techniques for identifying the household objects.
- Application of Histogram matching to verify object match after identification
- Use of Canny edge detection in combination with a shape matching technique to detect objects by their shape. Canny edge detection is an algorithm developed by John Canny (1986). It can be used in OpenCV with a simple function call, without knowing the intricate implementation details.
KD-Tree Improvements
There are several ways to improve the database algorithms. First, the BF is found using a basic kd-tree search. This can be improved with the Best Bin First algorithm (Beis, 1997). Second, the FeatureMatches are discarded if the distance is greater than a constant. Other papers, such as Lowe (2004), describe a method where the ratio of the best match distance to the next-best match distance is calculated and used to discard FeatureMatches. The confidence calculated for each PairGroup can be improved by finding the realistic maximum number of Pairs the PairGroup can contain, which is slightly lower than the number the current method finds. In addition, the transformation values are simply the average of all the values in the PairGroup. A better method would be to weight each value by the probability of that value being correct. There are also a few implementation improvements that would speed up the matching, for example using concurrency by running each object or image match in a separate thread. An interesting improvement would also be to take previous matches into consideration when finding a match, e.g., if there is a possible match in one area of the image, multiply that
Finally, the current bounding box does not rotate with the object, something that could be added relatively easily in a future version.
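The distance-ratio test described by Lowe (2004) can be sketched as follows. This is an illustrative sketch rather than the ORA implementation: a brute-force search stands in for the kd-tree, and the Descriptor type and the function names are hypothetical. The default ratio of 0.8 is the value suggested in Lowe (2004).

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Descriptor = std::vector<float>;  // hypothetical descriptor type

// Squared Euclidean distance between two descriptors of equal length.
double squaredDistance(const Descriptor& a, const Descriptor& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return sum;
}

// Returns the index of the best match in the database, or -1 if the match
// is rejected because the best distance is not clearly smaller than the
// next-best distance. Brute force is used here instead of a kd-tree.
int matchWithRatioTest(const Descriptor& query,
                       const std::vector<Descriptor>& database,
                       double maxRatio = 0.8) {
    if (database.size() < 2) return -1;  // the ratio needs two candidates
    double best = std::numeric_limits<double>::max();
    double secondBest = best;
    int bestIndex = -1;
    for (std::size_t i = 0; i < database.size(); ++i) {
        const double d = squaredDistance(query, database[i]);
        if (d < best) {
            secondBest = best;
            best = d;
            bestIndex = static_cast<int>(i);
        } else if (d < secondBest) {
            secondBest = d;
        }
    }
    // Distances are squared, so compare against the squared ratio.
    return best < maxRatio * maxRatio * secondBest ? bestIndex : -1;
}
```

Compared with a fixed distance threshold, the ratio adapts to how crowded the descriptor space is around each query, so an ambiguous feature is rejected even when its best distance is small.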
Conclusion
In this experiment we successfully implemented an object recognition application (ORA) that is capable of identifying common household objects for the RoboCup@Home environment. ORA is a complete system that integrates a number of components such as CPU SURF, GPU SIFT and an online learning component, and successfully identifies cups in a household environment. The system is not without its flaws, however: it is currently restricted to objects with well defined features, and the distance at which identification works depends on the size of the object as well as the resolution of the image. Hence the system could be improved to carry out object detection at a uniform distance irrespective of the object size and the image resolution. The addition of a static shape matching technique such as active shape modelling could enhance ORA's recognition ability and enable it to identify plain, featureless objects. In conclusion we have a robust object recognition system complete with a range of features and the ability to statistically evaluate CPU SURF and GPU SIFT.
Appendix A - Building and Running
This appendix assumes that the development machine is already set up with all the necessary prerequisites. The following background is also useful:
- A basic understanding of C++. For those unfamiliar with the C++ syntax, this link is a good starting point: http://www.cplusplus.com/doc/tutorial/
- Basic knowledge of SVN and commands such as svn up and svn commit. For an extensive introduction, check out this tutorial.
- A basic understanding of gdb for debugging purposes. A tutorial can be found at http://www.cs.cmu.edu/~gilpin/tutorial/
Project Source Code
Check out the project's source from Subversion and build it:
svn co svn+ssh://robolab.cse.unsw.edu.au/data/svn/comp4411/2010-s1/object-recognition/trunk object-recognition
cd object-recognition
make
To generate API documentation, install Doxygen (sudo apt-get install doxygen) and run:
make doc
gnome-open doc/html/index.html
Preparing the source code
- Navigate to your workspace directory in a terminal window and create a new directory called object-recognition.
- Now check out the latest source code from the source code repository, using the following command:
svn co svn+ssh://robolab.cse.unsw.edu.au/data/svn/comp4411/2010-s1/object-recognition/trunk object-recognition
- To compile this library successfully, the SiftGPU library environment variable must be set up first, otherwise the code won't compile. Navigate to the object-recognition directory and set the environment variable using:
export LD_LIBRARY_PATH=SiftGPU/linux/bin:
- In the object-recognition directory, run the make command.
Once the source code is ready you can start modifying the code, recompiling every now and then.
Generating Feature Files
To extract OpenCV SURF or SIFT GPU feature points from a training image, use the command line tool extractKeypoints:
./extractKeypoints [-scale] surf|sift IMAGE FEATURES [FIGURE] [EDGE-FIGURE]
The number of feature points usually increases with the size of the image. The option -scale doubles the size of the input image, and the next argument selects between extracting feature points using OpenCV SURF (surf) or SIFT GPU (sift). IMAGE is the training image, and the features will be stored in the FEATURES file. To save the original image with the feature points drawn on it, specify a FIGURE. To save the image resulting from the Canny edge detection, specify an EDGE-FIGURE.
Generating Statistics
To generate statistics execute:
libsift/generateStats surf|sift scale|unscaled canny|nocanny featureGroupFolder outputFile testFolder
Here featureGroupFolder is the folder containing the feature points for the specified usage (e.g. generateStats surf scaled features/surf/scaled/). The outputFile is the file in which the statistics will be saved, and testFolder is the set of images over which the program should run. In order to know which images in testFolder actually contain an already known object, the images are divided into subfolders in the same way they are organized in featureGroupFolder. For example, images from the test run that actually contain a coke bottle are stored in a folder called testFolder/cokeBottle. Images that do not contain any known object are saved in the root of testFolder, which allows false positives to be detected (see generateStats.cpp and the folder structure in the folder statistics for further details).
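The way the folder layout encodes the ground truth can be illustrated with the sketch below. It is not taken from generateStats.cpp. The function and type names are hypothetical, paths are assumed to be relative to testFolder, and the choice to count a wrongly identified object only as a false positive is a simplifying assumption.

```cpp
#include <string>

// Expected object label for an image path relative to testFolder:
// "cokeBottle/img01.jpg" -> "cokeBottle", "img02.jpg" -> "" (no known object).
std::string expectedLabel(const std::string& relativePath) {
    const std::string::size_type slash = relativePath.find('/');
    return slash == std::string::npos ? std::string()
                                      : relativePath.substr(0, slash);
}

// Hypothetical counters for the outcome of a test run.
struct Tally {
    int truePositives = 0;
    int falsePositives = 0;
    int falseNegatives = 0;
    int trueNegatives = 0;
};

// Record one result. detectedLabel is empty when no object was reported.
void record(Tally& tally, const std::string& relativePath,
            const std::string& detectedLabel) {
    const std::string expected = expectedLabel(relativePath);
    if (detectedLabel.empty()) {
        if (expected.empty()) ++tally.trueNegatives;
        else ++tally.falseNegatives;
    } else if (detectedLabel == expected) {
        ++tally.truePositives;
    } else {
        // Wrong object, or an object reported in an image from the root.
        ++tally.falsePositives;
    }
}
```

Because images without known objects sit directly in the root of testFolder, any detection reported for them is counted as a false positive.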
Points to note
- The folder etc in the object-recognition directory contains server and GUI files that correspond to the robots, e.g. server-bass.xml and gui-bass.xml. These are the robotserv and robotgui config files for the robot bass.
- Before commencing full scale development it is helpful to run a small program that calls the OpenCV method to display an image in a new window, just to ensure everything works as expected. To do so, write a main class and include the name of the class in the makefile for the project.
- To debug, start the server in debug mode, e.g. run gdb robotserv and, once in gdb, call run server-bass.xml.
- The commands svn up and svn commit, followed by file names, update or check in the specified files respectively.
References
Bay, H., Tuytelaars, T., and Gool, L. J. V. (2006). Surf: Speeded up robust features. In ECCV 2006, 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part I, pages 404–417.
Beis, J. S. and Lowe, D. G. (1997). Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In 1997 Conference on Computer Vision and Pattern Recognition (CVPR ’97), June 17-19, 1997, San Juan, Puerto Rico, pages 1000–1006.
Canny, J. (1986). A Computational approach to edge-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698.
Comaniciu, D. and Meer, P. (1999). Mean shift analysis and applications. In IEEE International Conference on Computer Vision, volume 2, page 1197.
Loncomilla, P. and del Solar, J. R. (2007). Robust object recognition using wide baseline matching for robocup applications. In RoboCup 2007: Robot Soccer World Cup XI, July 9-10, 2007, Atlanta, GA, USA, pages 441–448.
Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision, pages 1150–1157.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.
The Robocup Federation (2010). Robocup@home. http://www.ai.rug.nl/robocupathome/.
Russakovsky, O., Le, Q., and Ng, A. (2010). A Steiner tree approach to efficient object detection. In 23rd Conference on Computer Vision and Pattern Recognition (CVPR ’10), to appear.
Wu, C. (2010). http://www.cs.unc.edu/~ccwu/siftgpu/.
