Xappy now supports image similarity searching!
How do I use it?
Firstly, you’ll need the latest Xappy SVN HEAD, and the corresponding Xapian packages (obtainable using the
This code adds a new field action (“IMGSEEK”) to Xappy, which is used to specify that a field will contain filenames of images to index. These images will be loaded at index time, and certain “features” will be extracted which allow efficient similarity comparisons.
The action takes a number of parameters; the most crucial of these is the “terms” parameter, which selects between two different methods of indexing the content. The details are in the documentation in Xappy, but the general effect is that if you index with “terms” set, you’ll get a considerably larger database, but much faster searches. The results with terms=True are a slightly approximated version of those with terms=False, but in our tests the approximation didn’t seem to harm the results noticeably.
The really nice thing about the integration is that the image similarity search boils down to a standard Xapian search. This means that it’s possible to combine an image similarity search with any other type of Xapian search. For example, you could use this to build a search for images which are similar to a photo you’ve just taken, which are also related to the search text “Church”, and are within 5 miles of you.
So far, we’ve tested the similarity matching with a collection of 500,000 images. A pure similarity search on this collection returns results in around 1.5 seconds using the “terms” method (terms=True), or around 20 seconds using the “values” method (terms=False). This is fast enough for some applications, but we hope (indeed, expect!) to improve performance further in future.
However, even with the current performance, if you combine the similarity search with a second Xapian search, you will often be able to get much better performance – because the similarity search will only need to be carried out for those documents returned by the second search.
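To make the point above concrete, here is a minimal sketch in plain Python (it does not use Xappy or Xapian). The names `combined_search`, `keep` and `score` are hypothetical, invented for this illustration: `keep` stands in for the second search, and `score` for the expensive similarity calculation, which is only run on the documents that `keep` accepts.

```python
def combined_search(documents, keep, score, limit=10):
    """Rank only the documents accepted by `keep`.

    Returns (top hits, number of documents actually scored).
    `keep` and `score` are hypothetical stand-ins for the second
    search and the image similarity calculation respectively.
    """
    # Cheap step first: discard everything the second search rejects.
    candidates = [doc for doc in documents if keep(doc)]
    # Expensive step second: sorted() calls `score` once per candidate.
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:limit], len(candidates)
```

If the second search accepts only a small fraction of the collection, the similarity calculation is carried out for only that fraction of the documents.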
The image similarity algorithm used is certainly worthwhile for some applications; however, the results aren’t quite state-of-the-art. In particular, the algorithm is not pose-invariant: it’s not good at spotting rotated images.
Also, currently only images in JPEG format are supported. It would be fairly easy to add support for more formats, if needed.
There are also quite a few parameters which could be tuned to improve the quality of results for a particular collection. The best way of doing this would be to gather some sets of human judgments of image similarity, and use them for tuning. A nice project for the future would be to build a little web application to allow such judgments to be made…
How does it work?
The core of the algorithm is some (GPL) code taken from ImgSeek, which computes features based on an algorithm described in the paper YIQ colourspace).
40 of these numbers will correspond to the 40 most significant wavelets found in a wavelet decomposition of the image. The final number is based on the average luminosity of the image, and is basically a compensation factor. The image similarity is given by the sum of the weights for the most significant wavelet features, minus a component based on the average luminosity.
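The scoring described above can be sketched in a few lines of plain Python. This is a toy illustration, not the ImgSeek or Xappy code: it assumes a single-channel square image whose side is a power of two, uses a simple averaging Haar decomposition, and gives every matching wavelet the same weight (`feature_weight`), whereas the real algorithm weights features differently. All function and parameter names here are made up for the example.

```python
def haar_1d(values):
    """Full 1D Haar decomposition using pairwise averages and differences."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avgs = [(out[2 * i] + out[2 * i + 1]) / 2.0 for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / 2.0 for i in range(half)]
        out[:n] = avgs + diffs
        n = half
    return out


def haar_2d(image):
    """Decompose every row, then every column, of a square image."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def extract_features(image, k=40):
    """Return (average, set of the k largest-magnitude coefficients).

    Each significant coefficient is kept only as (row, column, is_positive);
    its magnitude is discarded.
    """
    coeffs = haar_2d(image)
    average = coeffs[0][0]  # with averaging Haar, this is the image mean
    ranked = sorted(
        ((abs(c), y, x, c)
         for y, row in enumerate(coeffs)
         for x, c in enumerate(row)
         if (y, x) != (0, 0) and c != 0),
        reverse=True,
    )
    significant = frozenset((y, x, c > 0) for _, y, x, c in ranked[:k])
    return average, significant


def similarity(query, target, feature_weight=1.0, average_weight=1.0):
    """Weight of shared significant wavelets, minus an average-based penalty."""
    q_avg, q_sig = query
    t_avg, t_sig = target
    shared = len(q_sig & t_sig)
    return feature_weight * shared - average_weight * abs(q_avg - t_avg)
```

Two images score highly when they share many significant wavelets (same position and sign) and have similar average luminosity; a difference in the averages lowers the score.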
There are quite a few tunable parameters in the algorithm, governing how much importance is ascribed to different components of the image and to different features. Currently, the only one exposed in the interface is the number of “buckets” used for the average luminosity part of the search, which controls how precise the calculation is: more buckets give a more precise calculation, but a slower one. I plan to expose more of these parameters, so that a system can be built to tune them optimally.
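As a rough illustration of why the number of buckets affects precision, the sketch below quantises an average luminosity in the range 0 to 1 into a fixed number of buckets and compares two images by bucket. This is an assumption about the general idea only; the function names are hypothetical and the actual bucketing scheme in Xappy may differ.

```python
def luminosity_bucket(average, buckets):
    """Map an average luminosity in [0, 1] to a bucket index."""
    if not 0.0 <= average <= 1.0:
        raise ValueError("average luminosity must lie in [0, 1]")
    # Clamp so that an average of exactly 1.0 falls in the last bucket.
    return min(int(average * buckets), buckets - 1)


def bucket_distance(a, b, buckets):
    """Approximate |a - b| using only the bucket each value falls in."""
    gap = abs(luminosity_bucket(a, buckets) - luminosity_bucket(b, buckets))
    return gap / float(buckets)
```

With only 4 buckets, averages of 0.30 and 0.45 land in the same bucket and look identical; with 16 buckets they land in different buckets, so the difference is detected. The approximation error shrinks as the bucket count grows.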
The image retrieval works best if background noise can be removed from images before they’re indexed. I hope to be adding some code for doing this to Xappy shortly (also thanks to MyDeco!).
Finally, I hope to work on improving the performance of the system. There are some general speed improvements in Xapian which will help with this, but I’ve also got a few ideas for ways to tweak the image retrieval algorithm.