Searching with Xapian

The Xapian search engine, and associated topics

Archive for the ‘Performance’ Category

Xapian terms, values and data explained


My colleague Tom Mortimer has made a useful post on the Flax blog to clear up a common set of confusions regarding Xapian; covering the difference between, and the proper usage of, terms, values and document data. It even has a pretty diagram to help explain it!

Written by richardboulton

April 2, 2009 at 4:45 pm

Posted in Performance

Xappy now supports image similarity searching!


Thanks to some work sponsored by MyDeco, and based on the ImgSeek image similarity matching code, it is now possible to perform searches for similar images using Xappy.

How do I use it?

Firstly, you’ll need the latest Xappy SVN HEAD, and the corresponding Xapian packages (obtainable using the xappy/libs/get_xapian.py script).

This code adds a new field action (“IMGSEEK”) to Xappy, which is used to specify that a field will contain filenames of images to index. These images will be loaded at index time, and certain “features” will be extracted which allow efficient similarity comparisons.

The action takes a number of parameters; the most crucial of these is the “terms” parameter, which selects between two different methods of indexing the content. The details are in the documentation in Xappy, but the general effect is that if you index with “terms” set, you’ll get a considerably larger database, but much faster searches. The results with terms=True are a slightly approximated version of those with terms=False, but in our tests the approximation didn’t seem to harm the results noticeably.
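
To make this concrete, here is a minimal indexing sketch in Python. The IMGSEEK action and its “terms” parameter are as described above, but the exact spelling of the constant (I’ve assumed xappy.FieldActions.IMGSEEK) should be checked against the Xappy documentation:

    # Minimal indexing sketch; FieldActions.IMGSEEK and its terms parameter are
    # assumed from the description above -- check the Xappy docs for exact names.
    import xappy

    iconn = xappy.IndexerConnection('imgdb')
    iconn.add_field_action('image', xappy.FieldActions.IMGSEEK, terms=True)
    iconn.add_field_action('caption', xappy.FieldActions.INDEX_FREETEXT, language='en')

    doc = xappy.UnprocessedDocument()
    doc.fields.append(xappy.Field('image', '/path/to/church.jpg'))  # filename of a JPEG to index
    doc.fields.append(xappy.Field('caption', 'Church in the village square'))
    iconn.add(doc)
    iconn.flush()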

The really nice thing about the integration is that the image similarity search boils down to a standard Xapian search. This means that it’s possible to combine an image similarity search with any other type of Xapian search. For example, you could use this to build a search for images which are similar to a photo you’ve just taken, which are also related to the search text “Church”, and are within 5 miles of you.
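
As a sketch of what that might look like (the image similarity query method shown, query_image_similarity, is a hypothetical name used purely for illustration; the text and filter calls are standard Xappy API):

    # Combining an image similarity query with a free-text query.
    # NOTE: query_image_similarity() is a hypothetical method name used for
    # illustration; check the Xappy documentation for the real call.
    import xappy

    sconn = xappy.SearchConnection('imgdb')
    img_query = sconn.query_image_similarity('image', image='/path/to/photo.jpg')  # hypothetical
    text_query = sconn.query_parse('church')
    query = sconn.query_filter(img_query, text_query)  # similarity hits restricted to text matches
    results = sconn.search(query, 0, 10)
    for hit in results:
        print("%s: %r" % (hit.id, hit.data))

Because the similarity query is just another Xapian query, any further restriction (such as a location filter) can be combined in the same way.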

Performance

So far, we’ve tested the similarity matching with a collection of 500,000 images. A pure similarity search on this collection returns results in around 1.5 seconds using the “terms” method (or around 20 seconds using the “values” method). This is fast enough for some applications, but we hope (indeed, expect!) to improve performance further in future.

However, even with the current performance, if you combine the similarity search with a second Xapian search, you will often be able to get much better performance – because the similarity search will only need to be carried out for those documents returned by the second search.

Limitations

The image similarity algorithm used is certainly worthwhile for some applications; however, the results aren’t quite state-of-the-art. In particular, the algorithm is not pose-invariant: it’s not good at spotting rotated images.

Also, currently only images in JPEG format are supported. It would be fairly easy to add support for more formats, if needed.

There are also quite a few parameters which could be tuned to improve the quality of results for a particular collection. The best way of doing this would be to gather some sets of human judgments of image similarity, and use these for tuning. A nice future project would be to build a small web application to allow such judgments to be made…

How does it work?

The core of the algorithm is some (GPL) code taken from ImgSeek, which computes features based on an algorithm described in the paper Fast Multiresolution Image Querying. These features consist of 41 numbers for each colour channel (the code currently works in the YIQ colourspace).

40 of these numbers correspond to the 40 most significant wavelets found in a wavelet decomposition of the image. The final number is based on the average luminosity of the image, and is essentially a compensation factor. The image similarity is given by the sum of the weights for the significant wavelet features which match between the two images, minus a component based on the difference in average luminosity.
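
As a rough illustration of that scoring (not the actual ImgSeek/Xappy code; the weight table and bucketing below are placeholders), the per-channel calculation looks something like this:

    # Rough sketch of the per-channel score described above: sum the weights of
    # the significant wavelet coefficients shared by the two images, then
    # subtract a term based on the difference in average luminosity.
    # The weight values and bucketing are placeholders, not the real tables.

    def bucket(position):
        # Placeholder: map a coefficient position to a weight bin.
        return min(position // 128, 5)

    def channel_similarity(query_avg, target_avg, query_wavelets, target_wavelets,
                           weights, w0=1.0):
        # query_wavelets / target_wavelets: sets of (position, sign) pairs for the
        # 40 most significant wavelet coefficients of one colour channel.
        score = 0.0
        for position, _sign in query_wavelets & target_wavelets:
            score += weights.get(bucket(position), 0.1)  # matching significant wavelet
        score -= w0 * abs(query_avg - target_avg)        # luminosity compensation
        return score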

Future developments

There are quite a few tunable parameters in the algorithm, governing how much importance is ascribed to different components of the image and to different features. Currently, the only one of these exposed in the interface is the number of “buckets” used for the average luminosity part of the search, which controls how precise that calculation is: more buckets give a more precise calculation, but a slower one. I plan to expose more of these parameters, so that a system can be built to tune them optimally.

The image retrieval works best if background noise can be removed from images before they’re indexed.  I hope to be adding some code for doing this to Xappy shortly (also thanks to MyDeco!).

Finally, I hope to work on getting improved performance from the system. There are some general speed improvements in Xapian which will help with this, but I’ve also got a few ideas for ways to tweak the image retrieval algorithm itself.

Written by richardboulton

March 11, 2009 at 4:49 pm

Performance profiling


I’ve spent a fair while over the last few days, when not laid low with a cold, hacking on various bits of Xapian to measure and improve the performance.

My colleague Tom Mortimer put together a nice little performance test which involves first indexing about 100,000 articles from wikipedia, and then performing a set of searches over them. We’re working on packaging these up so that more people can try them on different hardware. The results were quite interesting, but the bit I’ve focussed on is that, for this particular setup, the search speed of the unreleased in-development “chert” backend is considerably worse than that of the stable “flint” backend.

I should note that this is quite a small database by Xapian standards, which means that it’s fully cached in memory for my tests. The chert database is considerably smaller than the flint database (about 70% of the size, in fact), which means that it would be able to remain fully cached in memory for considerably longer. However, the flint backend is able to perform 10,000 searches (returning over 8 million results) in about 1.8 seconds, whereas the chert backend takes about 12.6 seconds. We’re clearly going to have to fix this before chert is ready for release.

For the gritty details, see http://trac.xapian.org/ticket/326, but the executive summary is that the slowdown is due to the way chert accesses document length information during the search (which is needed for the default weighting model). Indeed, a quick hack I performed to turn off the document length lookup resulted in chert performing 10,000 searches in about 0.75 seconds, so chert is clearly not ready to be binned yet!

In flint, document length information is stored in every posting list, adjacent to the wdf information. This means that a separate lookup for the document length isn’t needed, but also means that the document length is duplicated in every posting list, increasing the size of the database. In chert, the document length information is pulled out into a separate posting list, so a separate lookup is required, but the overall database size is considerably smaller. In addition, the document length list is used for nearly every search, so it ought to cache well, meaning the extra lookups should take little time.

Currently, the document length posting list uses the same format as normal term posting lists: posting lists are composed of a set of chunks, each of around 2000 bytes, and representing a set of items, one for each document, in ascending docid order. The chunks are indexed by the document ID of the first item in the chunk, and stored in a btree – this means that the appropriate chunk for a particular document can be found in O(log(number_of_documents)) time, which in practice turns out to be nice and fast.

The problem with chert is the time taken to find the right place inside a chunk. Inside a chunk, the format is essentially a list of pairs of integers (coded in a variable length representation): the first integer represents the document ID increase since the previous entry, and the second integer is the wdf for a term posting list (or the document length for the document length posting list). This format is good for efficient appends (ie, fast batch updating), and for walking through the list, but to find the appropriate place in the chunk requires O(length_of_chunk) reads – each of which involves at least one unpacking of a variable-length integer. My profiling indicates that on average, finding the document length with chert requires scanning through about 30 items in each chunk which is used.
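
To make the cost difference concrete, here is a small Python model of the lookup (the real implementation is in C++, and the chunk header details are simplified away): finding the chunk is a binary search, but finding the entry inside it is a linear scan over varint-encoded pairs.

    # Illustrative model only: binary search to find the chunk, then a linear
    # scan of (docid delta, doclen) varint pairs inside it.
    import bisect

    def decode_varint(data, pos):
        # Decode one base-128 varint starting at pos; return (value, new_pos).
        value, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return value, pos
            shift += 7

    def find_doclen(chunk_first_docids, chunks, target_docid):
        # O(log n): locate the chunk which could contain target_docid.
        i = bisect.bisect_right(chunk_first_docids, target_docid) - 1
        # Start just before the chunk's first docid; this model stores plain
        # deltas (>= 1), whereas chert actually stores (delta - 1).
        docid, data, pos = chunk_first_docids[i] - 1, chunks[i], 0
        # O(chunk length): walk the encoded pairs until we reach the target.
        while pos < len(data):
            delta, pos = decode_varint(data, pos)
            doclen, pos = decode_varint(data, pos)
            docid += delta
            if docid == target_docid:
                return doclen
            if docid > target_docid:
                break
        return None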

I’ve performed various optimisations on the code which unpacks the integers, and managed to increase the search speed with chert to about 7 seconds, but I don’t think that approach can go much further. A better data structure is needed for holding the document length postlists. This can be a different structure from that used for normal postlists, of course, though there might be some benefit in improving the seek speed in term postlist chunks, too.

So, what do we need for our data structure?

  1. Store a list of document ID -> doclen mappings, for a set of document IDs which are close together
  2. Reasonably efficient update desirable
  3. Efficient seek to specified document ID
  4. Efficient walk through list
  5. Reasonably compact representation on disk

The current encoding satisfies items 1, 2, 4 and 5, but fails badly on 3. It usually uses 1 byte for each document ID, and (in my test) an average of 2 bytes for each document length. In fact, the byte for the document ID is usually 0, because the value stored is “(difference between document IDs) - 1”, and the difference is usually 1.

One possibility is just to store the list of document IDs and lengths in fixed-size slots, so that a binary search can be used to find them. This obviously isn’t very compact, but we could restrict a chunk to holding only document IDs with a range of 65535 (and storing the offset from the start, rather than the absolute ID), so we’d need 2 bytes per document ID. We’d probably need to allow at least 3 bytes per document length – documents could well be over 65535 words long, but are unlikely to be over 16.7 million words.

Xapian already has code to encode a list of integers efficiently; this is used for position lists. It requires the list to be unpacked to read it, but has the nice property that if the list is fully compact (ie, all the integers in a range are present with no gaps) it has a very small encoding. In fact, this particular case could be special-cased so that the encoding doesn’t need to be unpacked explicitly. So, we could first store the document IDs in the chunk as tightly encoded offsets from the chunk’s initial document ID, followed by an array of document lengths at 3 bytes each. For fully compact document IDs, the document ID list would encode to an empty list, so we’d use roughly the same amount of space as the current scheme, but we’d be able to binary search through the list of document lengths (and, if the document IDs were compact, this wouldn’t even need the document ID list to be unpacked first). Appending would also be efficient, and so would walking through the list.
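
As an illustration of that layout (just a sketch, not actual Xapian code; the real thing would use the position-list encoding for the offsets rather than a Python list), the chunk could be built and searched like this:

    # Sketch of the proposed chunk layout: docid offsets (tightly encoded in the
    # real scheme, kept as a plain list here) followed by a fixed-width array of
    # 3-byte document lengths, so no linear scan is needed to find a length.
    import bisect
    import struct

    LENGTH_WIDTH = 3  # 3 bytes per length: documents up to ~16.7 million words

    def encode_chunk(first_docid, entries):
        # entries: list of (docid, doclen) pairs in ascending docid order.
        offsets = [docid - first_docid for docid, _ in entries]
        compact = offsets == list(range(len(entries)))  # no gaps in the docid range
        lengths = b"".join(struct.pack(">I", doclen)[1:] for _, doclen in entries)
        return compact, offsets, lengths

    def lookup(first_docid, compact, offsets, lengths, target_docid):
        off = target_docid - first_docid
        if compact:
            i = off                               # direct index; no unpacking needed
            if not 0 <= i < len(lengths) // LENGTH_WIDTH:
                return None
        else:
            i = bisect.bisect_left(offsets, off)  # binary search, O(log chunk size)
            if i == len(offsets) or offsets[i] != off:
                return None
        raw = lengths[i * LENGTH_WIDTH:(i + 1) * LENGTH_WIDTH]
        return struct.unpack(">I", b"\x00" + raw)[0]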

I’m sure we can do better than storing the document lengths in fixed size representation, though. Perhaps we could store a list of bit positions which the document lengths change at, or something. Further thought is required…

Written by richardboulton

February 6, 2009 at 2:56 pm

Posted in Performance