Searching with Xapian

The Xapian search engine, and associated topics

Performance profiling


I’ve spent a fair while over the last few days, when not laid low with a cold, hacking on various bits of Xapian to measure and improve the performance.

My colleague Tom Mortimer put together a nice little performance test which involves first indexing about 100,000 articles from wikipedia, and then performing a set of searches over them. We’re working on packaging these up so that more people can try them on different hardware. The results were quite interesting, but the bit I’ve focussed on is that, for this particular setup, the search speed of the unreleased in-development “chert” backend is considerably worse than that of the stable “flint” backend.

I should note that this is quite a small database by Xapian standards, which means that it’s fully cached in memory for my tests. The chert database is considerably smaller than the flint one (about 70% of the size, in fact), which means that it would be able to remain fully cached in memory for considerably larger collections. However, the flint backend is able to perform 10,000 searches (returning over 8 million results) in about 1.8 seconds, whereas the chert backend takes 12.6 seconds. We’re clearly going to have to fix this before chert is ready for release.

For the gritty details, see http://trac.xapian.org/ticket/326, but the executive summary is that the slowdown is due to the way chert accesses document length information during the search (which is needed for the default weighting model). Indeed, a quick hack I performed to turn off the document length lookup resulted in chert performing 10,000 searches in about 0.75 seconds, so chert is clearly not ready to be binned yet!

In flint, document length information is stored in every posting list, adjacent to the wdf information. This means that a separate lookup for the document length isn’t needed, but also means that the document length is duplicated in every posting list, increasing the size of the database. In chert, the document length information is pulled out into a separate posting list, so a separate lookup is required, but the overall database size is considerably smaller. In addition, the document length list is used for nearly every search, so it ought to cache well, and the side lookups should therefore take little time.
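
To make the layout difference concrete, here’s a rough sketch of the logical per-entry contents (these aren’t the real Xapian structures or the on-disk encoding, just an illustration):

    // Sketch only: the logical contents of a posting entry in each backend.

    // Flint: every term's posting list carries the document length inline,
    // so no side lookup is needed, but the length is duplicated once per
    // term which indexes the document.
    struct FlintPostingEntry {
        unsigned docid_delta;   // gap from the previous document ID
        unsigned doclen;        // document length, repeated in every posting list
        unsigned wdf;           // within-document frequency of the term
    };

    // Chert: term posting lists carry only the wdf; document lengths live in
    // one shared posting list, looked up separately by document ID.
    struct ChertPostingEntry {
        unsigned docid_delta;
        unsigned wdf;
    };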

Currently, the document length posting list uses the same format as normal term posting lists: posting lists are composed of a set of chunks, each of around 2000 bytes, representing a set of items, one for each document, in ascending docid order. The chunks are indexed by the document ID of the first item in the chunk, and stored in a btree – this means that the appropriate chunk for a particular document can be found in O(log(number_of_documents)) time, which in practice turns out to be nice and fast.
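
To illustrate the chunk lookup, here’s a minimal sketch which uses std::map as a stand-in for the on-disk btree (the type and function names are invented for the example):

    #include <cstddef>
    #include <map>
    #include <string>

    typedef unsigned int docid;

    // Chunks keyed by the document ID of their first item; std::map stands in
    // for the btree here.
    typedef std::map<docid, std::string> ChunkIndex;

    // Find the chunk which could contain `did`: the last chunk whose first
    // document ID is <= did.  On disk this is the O(log n) btree descent.
    const std::string * find_chunk(const ChunkIndex & chunks, docid did) {
        ChunkIndex::const_iterator it = chunks.upper_bound(did);
        if (it == chunks.begin()) return NULL;   // did precedes the first chunk
        --it;
        return &it->second;
    }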

The problem with chert is the time taken to find the right place inside a chunk. Inside a chunk, the format is essentially a list of pairs of integers (coded in a variable length representation): the first integer represents the document ID increase since the previous entry, and the second integer is the wdf for a term posting list (or the document length for the document length posting list). This format is good for efficient appends (ie, fast batch updating), and for walking through the list, but to find the appropriate place in the chunk requires O(length_of_chunk) reads – each of which involves at least one unpacking of a variable-length integer. My profiling indicates that on average, finding the document length with chert requires scanning through about 30 items in each chunk which is used.
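
To show why the in-chunk seek is linear, here’s a simplified sketch. The unpack_uint() below is a stand-in using a plain 7-bits-per-byte encoding rather than Xapian’s exact format, and the chunk layout is simplified, but the cost pattern is the same:

    #include <cstddef>

    typedef unsigned int docid;

    // Stand-in for Xapian's unpack_uint(): decode one variable-length unsigned
    // integer, 7 bits per byte, high bit set on every byte except the last.
    unsigned unpack_uint(const unsigned char *& p, const unsigned char * end) {
        unsigned value = 0;
        int shift = 0;
        while (p != end) {
            unsigned char b = *p++;
            value |= static_cast<unsigned>(b & 0x7f) << shift;
            if (!(b & 0x80)) break;
            shift += 7;
        }
        return value;
    }

    // Scan one document length chunk.  Each entry is a pair of variable-length
    // integers: the gap from the previous document ID, then the document length.
    // (Chert actually stores the gap minus 1, which is why the docid byte is
    // usually a single zero, but that detail doesn't change the cost.)  For this
    // sketch, `did` starts at the document ID notionally preceding the chunk.
    // Reaching `target` requires decoding every entry before it.
    bool find_doclen(const unsigned char * p, const unsigned char * end,
                     docid did, docid target, unsigned & doclen_out) {
        while (p != end) {
            did += unpack_uint(p, end);
            unsigned doclen = unpack_uint(p, end);
            if (did == target) { doclen_out = doclen; return true; }
            if (did > target) return false;   // target isn't in this chunk
        }
        return false;
    }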

I’ve performed various optimisations on the code which unpacks the integers, and managed to bring the search time with chert down to about 7 seconds, but I don’t think that approach can go much further. A better data structure is needed for holding the document length postlists. This can be a different structure from that used for normal postlists, of course, though there might be some benefit in improving the seek speed in term postlist chunks, too.

So, what do we need from our data structure?

  1. Store a list of document ID -> doclen mappings, for a set of document IDs which are close together
  2. Reasonably efficient update desirable
  3. Efficient seek to specified document ID
  4. Efficient walk through list
  5. Reasonably compact representation on disk

The current encoding satisfies items 1, 2, 4 and 5, but fails badly on 3. It usually uses 1 byte for each document ID, and (in my test) an average of 2 bytes for each document length. In fact, the byte for the document ID is usually 0, because what’s stored is the difference between document IDs minus 1.

One possibility is just to store the list of document IDs and lengths in fixed-size slots, so that a binary search can be used to find them. This obviously isn’t very compact, but we could restrict a chunk to holding only document IDs within a range of 65535 (and store the offset from the start of the range, rather than the absolute ID), so we’d need 2 bytes per document ID. We’d probably need to allow at least 3 bytes per document length – documents could well be over 65535 words long, but are unlikely to be over 16.7 million words.
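
As a sketch of that idea, assuming hypothetical 5-byte slots (a 2-byte docid offset followed by a 3-byte document length), the seek becomes an ordinary binary search:

    #include <cstddef>

    // Hypothetical fixed-slot layout: 2 bytes of offset from the chunk's first
    // document ID, then 3 bytes of document length, both little-endian.
    const size_t SLOT_SIZE = 5;

    static inline unsigned slot_offset(const unsigned char * slot) {
        return slot[0] | (slot[1] << 8);
    }

    static inline unsigned slot_doclen(const unsigned char * slot) {
        return slot[2] | (slot[3] << 8) | (slot[4] << 16);
    }

    // Binary search the slots for the entry whose docid offset equals
    // target_offset (the target document ID minus the chunk's first ID).
    bool find_doclen_fixed(const unsigned char * chunk, size_t num_slots,
                           unsigned target_offset, unsigned & doclen_out) {
        size_t lo = 0, hi = num_slots;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            const unsigned char * slot = chunk + mid * SLOT_SIZE;
            unsigned off = slot_offset(slot);
            if (off == target_offset) {
                doclen_out = slot_doclen(slot);
                return true;
            }
            if (off < target_offset) lo = mid + 1; else hi = mid;
        }
        return false;   // document not present in this chunk
    }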

Xapian already has code to encode a list of integers efficiently; this is used for position lists. It requires the list to be unpacked to read it, but has the nice property that if the list is fully compact (ie, all the integers in a range are present with no gaps) it has a very small encoding. In fact, this particular case could be special-cased so that the encoding doesn’t need to be unpacked explicitly. So, we could first store the document IDs in the chunk, as offsets from the initial document ID in the chunk encoded tightly, followed by an array of document lengths, 3 bytes per document length. For fully compact document IDs, the document ID list would encode to an empty list, so we’d use roughly the same amount of space as the current scheme, but we’d be able to binary search through the list of document lengths (and, if the document IDs were compact, this wouldn’t even need the document ID list to be unpacked first). Appending would also be efficient, and so would walking through the list.
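
Here’s a sketch of just the fast path for that layout, with an invented chunk header; the general case, where the encoded docid offset list has to be consulted first, is only indicated by a comment:

    #include <cstddef>

    typedef unsigned int docid;

    // Invented header for the proposed chunk layout.
    struct DoclenChunk {
        docid first_did;                // first document ID in the chunk
        size_t num_entries;             // number of documents in the chunk
        bool docids_compact;            // IDs are first_did .. first_did + num_entries - 1
        const unsigned char * doclens;  // num_entries * 3 bytes, one length per document
    };

    // When the document IDs in the chunk are fully compact, the docid list
    // encodes to nothing and the length can be read by direct indexing,
    // without unpacking anything.
    bool lookup_doclen(const DoclenChunk & chunk, docid target, unsigned & doclen_out) {
        if (target < chunk.first_did) return false;
        size_t idx = target - chunk.first_did;
        if (!chunk.docids_compact) {
            // General case: unpack the docid offset list (or binary search it,
            // if the encoding allows) to find the slot for `target` -- omitted.
            return false;
        }
        if (idx >= chunk.num_entries) return false;
        const unsigned char * p = chunk.doclens + idx * 3;
        doclen_out = p[0] | (p[1] << 8) | (p[2] << 16);
        return true;
    }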

I’m sure we can do better than storing the document lengths in a fixed-size representation, though. Perhaps we could store a list of the bit positions at which the document lengths change, or something. Further thought is required…

Written by richardboulton

February 6, 2009 at 2:56 pm

Posted in Performance

10 Responses


  1. One question – how badly does ranking suffer if doc length is ignored (particularly if the corpus has a fairly limited range of doc lengths)?

    banoffi

    February 6, 2009 at 3:27 pm

  2. Good question. I’m not entirely sure. If the document lengths are all identical, ignoring them will have no effect, since the length is just used as a normalisation factor. However, document length information is the main statistic used by BM25 but not by the simpler “TradWeight” scheme, which is based on the earlier probabilistic weighting formulae, and we know that BM25 scores much better on effectiveness tests than those earlier formulae.

    I’m sure that there are some collections where we could get away without using the document length quite happily, though.

    richardboulton

    February 6, 2009 at 4:16 pm

  3. I’m just wondering whether it could be switchable, in that case.

    banoffi

    February 6, 2009 at 4:47 pm

  4. You know, as a nasty quick kludge!

    banoffi

    February 6, 2009 at 4:47 pm

  5. It certainly could. The neatest way would be to have a weight class which doesn’t use document lengths. Weight classes already have a get_sumpart_needs_doclength() method, which is called at the start of the match, and the doclengths are only calculated if it returns true. You could possibly even subclass BM25Weight in Python with that method overridden to return false – that will result in a doclen of 0 being supplied to the weight function. If you combine that with a min_normlen parameter of 1 being supplied to the weighting function constructor, all the weights will be calculated using a normalised document length of 1.

    richardboulton

    February 6, 2009 at 5:18 pm

  6. 12s ouch!

    That’s a staggering difference in performance.

    I’ve been using chert during our initial indexing runs (to build our ranking DB for later sorting), not realizing the painful diff in search performance.

    Now I’m wondering: stick with chert (and hope the file format doesn’t change – which would require a reindex of a few TBs of data), or hang in there praying that the search times will approach that of flint… or just revert to flint now.

    For our application, search performance is paramount; indexing and update performance are a distant 2nd.

    Thoughts?

    Thanks
    H

    Henry

    February 10, 2009 at 6:22 am

  7. My current feeling is that it’s very likely that the chert format will change before the 1.1 release in order to fix this problem. Of course, if you want to stick with the current format, you don’t need to update your code when we change it… Perhaps you could wait for a time when you need to do a reindex run for other reasons, that way. If you’re doing that, you could always apply avoid_string_operations.patch and optimise_unpack_uint.patch (from http://trac.xapian.org/ticket/326) to your codebase, which give a big speed up for me. These patches don’t change the database format – they’re just code optimisations. Be aware that these patches aren’t well tested, of course; though they pass all the tests in the testsuite for me, I’ve not written any new tests to ensure they’re working right.

    I’d be very interested in the difference in search times that those patches make on such a large database, actually, where IO matters a lot more and keeping the database small is much more important. If you are able to measure such a thing, attaching your results to ticket #326 would be great info.

    Chert does make considerably smaller databases than Flint, and this test was on a very small dataset (compared to TBs of data, anyway), where the IO performance was negligible. In your situation, the search speed difference between chert and flint may be much smaller – it’s even possible that chert will still be faster for you.

    richardboulton

    February 10, 2009 at 8:05 am

  8. Oh – I should add that at least one of the patches might well not work correctly on a non-32-bit architecture – it might not even compile. If it does, be sure to run “make check” and check that none of the tests fail before trying to use the patch.

    richardboulton

    February 10, 2009 at 8:07 am

  9. Thanks for the comments, Richard.

    I think at some point I’ll need to measure the search performance of a subset of the current index (chert) [plus, apply patches mentioned, and measure perf again], then re-index same with flint and compare.

    It will take a while (currently have dozens of nodes grinding their molars on raw text:), but I’ll post feedback when I have it.

    Henry

    February 10, 2009 at 8:26 am

  10. Nice post.

    It would be interesting to see your approach on performance testing, so I’m wondering if it’s possible to have a look at those test files?

    johs

    November 28, 2009 at 11:52 am

