Searching with Xapian

The Xapian search engine, and associated topics

Xappy now supports image similarity searching!

Thanks to some work sponsored by MyDeco, and based on the ImgSeek image similarity matching code, it is now possible to perform searches for similar images using Xappy.

How do I use it?

Firstly, you’ll need the latest Xappy SVN HEAD, and the corresponding Xapian packages (obtainable using the xappy/libs/get_xapian.py script).

This code adds a new field action (“IMGSEEK”) to Xappy, which is used to specify that a field will contain filenames of images to index. These images will be loaded at index time, and certain “features” will be extracted which allow efficient similarity comparisons.

The action takes a number of parameters; the most important of these is the “terms” parameter, which selects between two different methods of indexing the content. The details are in the Xappy documentation, but the general effect is that if you index with terms=True, you’ll get a considerably larger database, but much faster searches. The results with terms=True are a slight approximation of those with terms=False, but in our tests the approximation didn’t seem to harm the results noticeably.

The really nice thing about the integration is that the image similarity search boils down to a standard Xapian search. This means that it’s possible to combine an image similarity search with any other type of Xapian search. For example, you could use this to build a search for images which are similar to a photo you’ve just taken, which are also related to the search text “Church”, and are within 5 miles of you.
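To give a feel for why a term-based scheme fits a text search engine so naturally, here is a toy sketch in plain Python. It is not Xappy's or Xapian's code: the term names (such as "w3+") and both function names are invented for this example, and it assumes each significant image feature has been turned into one term. Each query term only touches the documents that contain it, which is exactly what an inverted index is good at.

```python
from collections import defaultdict

def build_index(docs):
    """docs maps a document id to its set of feature terms.

    Returns an inverted index: term -> set of document ids.
    """
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search_terms(index, query_terms, weight=1.0):
    """Add a (made-up, constant) weight for every term a document shares
    with the query, visiting only the postings of the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += weight
    # Best score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical feature terms for three images.
docs = {
    "a.jpg": {"w3+", "w7-", "w9+"},
    "b.jpg": {"w3+", "w8+"},
    "c.jpg": {"w1-"},
}
index = build_index(docs)
results = search_terms(index, {"w3+", "w7-"})
```

Here "c.jpg" is never visited at all, because it shares no terms with the query.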

Performance

So far, we’ve tested the similarity matching with a collection of 500,000 images. A pure similarity search on this collection returns results in around 1.5 seconds using the “terms” method (terms=True), or around 20 seconds using the “values” method (terms=False). This is fast enough for some applications, but we hope (indeed, expect!) to improve performance further in future.

However, even with the current performance, if you combine the similarity search with a second Xapian search, you will often be able to get much better performance – because the similarity search will only need to be carried out for those documents returned by the second search.
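The effect is easy to see in a toy model. The sketch below is plain Python rather than anything from Xapian (where the matcher handles this internally); the `search` function, the scoring function and the "every 50th document mentions churches" filter are all invented for illustration. It simply counts how many similarity computations are needed with and without a filter.

```python
def search(doc_ids, score, matches_filter=None):
    """Return (ids ranked best first, number of similarity computations)."""
    if matches_filter is not None:
        # Only documents passing the second search are scored at all.
        doc_ids = [d for d in doc_ids if matches_filter(d)]
    scored = [(score(d), d) for d in doc_ids]
    scored.sort(reverse=True)
    return [d for _, d in scored], len(scored)

def toy_score(doc_id):
    # Stand-in for an expensive image comparison: document 500 is "most similar".
    return -abs(doc_id - 500)

all_docs = range(1000)
ranked_all, cost_all = search(all_docs, toy_score)
ranked_some, cost_some = search(all_docs, toy_score,
                                matches_filter=lambda d: d % 50 == 0)
```

In this made-up collection the filtered search does 20 comparisons instead of 1,000, and still finds the same best match.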

Limitations

The image similarity algorithm used is certainly worthwhile for some applications; however, the results aren’t quite state-of-the-art. In particular, the algorithm is not pose-invariant: it’s not good at spotting rotated images.

Also, currently only images in JPEG format are supported. It would be fairly easy to add support for more formats, if needed.

There are also quite a few parameters which could be tuned to improve the quality of results for a particular collection. The best way of doing this would be to gather some sets of human judgments of image similarity, and use them for tuning. A nice little project in future would be to build a little web application to allow such judgments to be made…

How does it work?

The core of the algorithm is some (GPL) code taken from ImgSeek, which computes features based on an algorithm described in the paper Fast Multiresolution Image Querying. These features consist of 41 numbers for each colour channel (the code currently works in the YIQ colourspace).
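If you haven’t met YIQ before, Python’s standard colorsys module can do the conversion, which makes it easy to experiment. This is only a sketch of the colourspace step, not necessarily how the ImgSeek code performs it, and the helper name is made up for this example.

```python
import colorsys

def pixel_to_yiq(r, g, b):
    """Convert one 8-bit RGB pixel to YIQ (hypothetical helper)."""
    # colorsys works on floats in the range 0.0 to 1.0.
    return colorsys.rgb_to_yiq(r / 255.0, g / 255.0, b / 255.0)

# For a pure white pixel, Y (the luminosity) comes out close to 1, and the
# two chrominance channels I and Q come out close to 0.
y, i, q = pixel_to_yiq(255, 255, 255)
```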

Forty of these numbers correspond to the 40 most significant wavelets found in a wavelet decomposition of the image. The final number is based on the average luminosity of the image, and is basically a compensation factor. The similarity between two images is given by the sum of the weights of the most significant wavelet features which the images have in common, minus a component based on the difference in their average luminosity.
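The idea can be sketched in a few lines of plain Python for a single channel. This is a simplified illustration, not the ImgSeek code: it uses an unnormalised Haar decomposition on a small square grid whose side is a power of two, it uses one made-up weight for every feature (the paper uses weights which depend on the position of the coefficient), and all the function names are invented here.

```python
def haar_1d(values):
    """Unnormalised 1D Haar decomposition of a power-of-two length list."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avgs = [(out[2 * i] + out[2 * i + 1]) / 2.0 for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / 2.0 for i in range(half)]
        out[:n] = avgs + diffs
        n = half
    return out

def haar_2d(grid):
    """Decompose every row, then every column of the result."""
    rows = [haar_1d(row) for row in grid]
    cols = [haar_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def features(grid, keep=40):
    """Return (average, set of (row, col, is_positive) for the largest coefficients)."""
    coeffs = haar_2d(grid)
    average = coeffs[0][0]  # with this scaling, the overall mean of the grid
    ranked = sorted(
        ((abs(c), y, x, c > 0)
         for y, row in enumerate(coeffs)
         for x, c in enumerate(row)
         if (y, x) != (0, 0) and c != 0),
        reverse=True)
    return average, {(y, x, positive) for _, y, x, positive in ranked[:keep]}

def similarity(a, b, wavelet_weight=1.0, average_weight=1.0):
    """Weight of the shared features, minus a penalty for differing averages."""
    avg_a, sig_a = a
    avg_b, sig_b = b
    return wavelet_weight * len(sig_a & sig_b) - average_weight * abs(avg_a - avg_b)

flat = features([[0.5] * 4 for _ in range(4)])            # uniform grey
edge = features([[0.0, 0.0, 1.0, 1.0] for _ in range(4)])  # dark left, bright right
```

Only the position and sign of each significant coefficient are kept, not its magnitude. That is what makes the comparison cheap: two images either share a feature or they don’t.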

Future developments

There are quite a few tunable parameters in the algorithm used, governing how much importance is ascribed to different components of the image, and to different features. Currently, the only one of these which is exposed in the interface is the number of “buckets” used for the average luminosity part of the search. This controls how precise the calculation is: more buckets give a more precise calculation, but a slower one. I plan to expose more of these parameters, so that a system can be built to tune them optimally.
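The precision side of that trade-off is easy to demonstrate. The sketch below is not Xappy’s bucketing scheme, just a generic illustration of quantising a value into equal-width buckets; the function names and the 0.0 to 1.0 range are assumptions made for this example.

```python
def bucket_of(value, buckets, lo=0.0, hi=1.0):
    """Map a value in [lo, hi] to a bucket number in range(buckets)."""
    width = (hi - lo) / buckets
    index = int((value - lo) / width)
    # Clamp, so that value == hi lands in the last bucket.
    return min(max(index, 0), buckets - 1)

def bucket_distance(a, b, buckets):
    """Approximate |a - b| using only the bucket numbers."""
    return abs(bucket_of(a, buckets) - bucket_of(b, buckets)) / float(buckets)

# Two average luminosities which really differ by 0.15.
coarse = bucket_distance(0.32, 0.47, buckets=4)
fine = bucket_distance(0.32, 0.47, buckets=20)
```

With 4 buckets the two values fall into the same bucket, so the difference is lost entirely; with 20 buckets the approximation recovers it.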

The image retrieval works best if background noise can be removed from images before they’re indexed. I hope to be adding some code for doing this to Xappy shortly (also thanks to MyDeco!).

Finally, I hope to work on getting improved performance from the system. There are some general speed improvements in Xapian which will help with this, but I’ve also got a few ideas for ways to tweak the image retrieval algorithm too.

Written by richardboulton

March 11, 2009 at 4:49 pm