I’ve spent a fair while over the last few days, when not laid low with a cold, hacking on various bits of Xapian to measure and improve the performance.
My colleague Tom Mortimer put together a nice little performance test which involves first indexing about 100,000 articles from Wikipedia, and then performing a set of searches over them. We’re working on packaging these up so that more people can try them on different hardware. The results were quite interesting, but the bit I’ve focussed on is that, for this particular setup, the search speed of the unreleased in-development “chert” backend is considerably worse than that of the stable “flint” backend.
I should note that this is quite a small database by Xapian standards, which means that it’s fully cached in memory for my tests. The chert database is considerably smaller than the flint database (about 70% of the size, in fact), which means that it would be able to remain fully cached in memory for considerably longer as the collection grows. However, the flint backend is able to perform 10,000 searches (returning over 8 million results) in about 1.8 seconds, while the chert backend takes 12.6 seconds, which is roughly seven times as long. We’re clearly going to have to fix this before chert is ready for release.
For the gritty details, see http://trac.xapian.org/ticket/326, but the executive summary is that the slowdown is due to the way chert accesses document length information during the search (which is needed for the default weighting model). Indeed, a quick hack I performed to turn off the document length lookup resulted in chert performing 10,000 searches in about 0.75 seconds, so chert is clearly not ready to be binned yet!
In flint, document length information is stored in every posting list, adjacent to the wdf information. This means that a separate lookup for the document length isn’t needed, but also means that the document length is duplicated in every posting list, increasing the size of the database. In chert, the document length information is pulled out into a separate posting list, so a separate lookup is required, but the overall database size is considerably smaller. In addition, the document length list is used for nearly every search, so it ought to be easy to get it cached well, so the side lookups should take little time.
Currently, the document length posting list uses the same format as normal term posting lists: posting lists are composed of a set of chunks, each of around 2000 bytes, and representing a set of items, one for each document, in ascending docid order. The chunks are indexed by the document ID of the first item in the chunk, and stored in a btree – this means that the appropriate chunk for a particular document can be found in O(ln(number_of_documents)) time, which in practice turns out to be nice and fast.
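As a rough illustration of that chunk lookup, here is a minimal Python sketch. It is not Xapian's code: a sorted list of first document IDs stands in for the btree keys, and `find_chunk` and `first_docids` are hypothetical names made up for this example.

```python
from bisect import bisect_right

def find_chunk(first_docids, docid):
    """Return the index of the chunk that could hold docid, or None.

    first_docids is a sorted list of the first document ID in each chunk;
    it stands in for the btree keys (an assumption for this sketch).
    """
    # The right chunk is the last one whose first docid is <= docid.
    i = bisect_right(first_docids, docid) - 1
    return i if i >= 0 else None

# Four chunks, starting at these (made-up) document IDs.
first_docids = [1, 700, 1400, 2100]
```

The search over the keys is logarithmic, which matches the observation that this step is not where the time goes.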
The problem with chert is the time taken to find the right place inside a chunk. Inside a chunk, the format is essentially a list of pairs of integers (coded in a variable length representation): the first integer represents the document ID increase since the previous entry, and the second integer is the wdf for a term posting list (or the document length for the document length posting list). This format is good for efficient appends (i.e., fast batch updating), and for walking through the list, but finding the appropriate place in the chunk requires O(length_of_chunk) reads, each of which involves at least one unpacking of a variable-length integer. My profiling indicates that on average, finding the document length with chert requires scanning through about 30 items in each chunk it consults.
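To make the cost of that scan concrete, here is a small Python sketch of a chunk of this general shape. The details are assumptions, not chert's actual on-disk format: it uses a simple 7-bits-per-byte variable-length integer, stores each docid gap minus one, and keeps the first document ID outside the chunk (as the btree key would). All function names are hypothetical.

```python
def pack_uint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)

def unpack_uint(buf, pos):
    """Decode one integer starting at pos; return (value, new_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7

def encode_chunk(entries):
    """entries: (docid, value) pairs in ascending docid order."""
    out = bytearray(pack_uint(entries[0][1]))  # first docid lives in the key
    prev = entries[0][0]
    for docid, value in entries[1:]:
        out += pack_uint(docid - prev - 1)  # gap minus one
        out += pack_uint(value)
        prev = docid
    return bytes(out)

def seek(chunk, first_docid, target):
    """Linear scan for target; return (value or None, items_scanned)."""
    value, pos = unpack_uint(chunk, 0)
    docid = first_docid
    scanned = 1
    while docid < target and pos < len(chunk):
        gap, pos = unpack_uint(chunk, pos)
        value, pos = unpack_uint(chunk, pos)
        docid += gap + 1
        scanned += 1
    return (value if docid == target else None), scanned

entries = [(10, 120), (11, 300), (12, 45), (15, 70000)]
chunk = encode_chunk(entries)
```

Every entry before the target has to be decoded, because the position of an entry in the byte string is only known once all earlier variable-length integers have been read.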
I’ve performed various optimisations on the code which unpacks the integers, and managed to bring the time for the 10,000 searches with chert down to about 7 seconds, but I don’t think that approach can go much further. A better data structure is needed for holding the document length postlists. This can be a different structure from that used for normal postlists, of course, though there might be some benefit in improving the seek speed in term postlist chunks, too.
So, what do we need for our data structure?
1. Store a list of document ID -> doclen mappings, for a set of document IDs which are close together
2. Reasonably efficient update desirable
3. Efficient seek to specified document ID
4. Efficient walk through list
5. Reasonably compact representation on disk
The current encoding satisfies items 1, 2, 4 and 5, but fails badly on 3 (efficient seek). It usually uses 1 byte for each document ID, and (in my test) an average of 2 bytes for each document length. In fact, the byte for the document ID is usually 0, because it stores “difference between document IDs, minus 1”.
One possibility is just to store the list of document IDs and lengths in fixed-size slots, so that a binary search can be used to find them. This obviously isn’t very compact, but we could restrict a chunk to holding only document IDs with a range of 65535 (and storing the offset from the start, rather than the absolute ID), so we’d need 2 bytes per document ID. We’d probably need to allow at least 3 bytes per document length – documents could well be over 65535 words long, but are unlikely to be over 16.7 million words.
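A minimal Python sketch of the fixed-size-slot idea follows. The layout (a 2-byte offset followed by a 3-byte length, big-endian) and the names `SLOT`, `encode_fixed` and `lookup_fixed` are assumptions chosen for illustration, not an existing Xapian format.

```python
SLOT = 5  # 2-byte docid offset + 3-byte document length

def encode_fixed(first_docid, entries):
    """entries: (docid, doclen) pairs in ascending docid order."""
    out = bytearray()
    for docid, doclen in entries:
        # to_bytes raises OverflowError if a value does not fit its slot.
        out += (docid - first_docid).to_bytes(2, "big")
        out += doclen.to_bytes(3, "big")
    return bytes(out)

def lookup_fixed(chunk, first_docid, target):
    """Binary search over the slots; return doclen or None."""
    offset = target - first_docid
    lo, hi = 0, len(chunk) // SLOT
    while lo < hi:
        mid = (lo + hi) // 2
        base = mid * SLOT
        here = int.from_bytes(chunk[base:base + 2], "big")
        if here < offset:
            lo = mid + 1
        elif here > offset:
            hi = mid
        else:
            return int.from_bytes(chunk[base + 2:base + SLOT], "big")
    return None

fixed = encode_fixed(10, [(10, 120), (11, 300), (12, 45), (15, 70000)])
```

At 5 bytes per entry this is noticeably bigger than the roughly 3 bytes per entry of the current encoding, but a seek only touches about log2(entries) slots and needs no variable-length decoding.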
Xapian already has code to encode a list of integers efficiently; this is used for position lists. It requires the list to be unpacked to read it, but has the nice property that if the list is fully compact (i.e., all the integers in a range are present with no gaps) it has a very small encoding. In fact, this particular case could be special-cased so that the encoding doesn’t need to be unpacked explicitly. So, we could first store the document IDs in the chunk, as tightly encoded offsets from the initial document ID in the chunk, followed by an array of document lengths, 3 bytes per document length. For fully compact document IDs, the document ID list would encode to an empty list, so we’d use roughly the same amount of space as the current scheme, but we’d be able to binary search through the list of document lengths (and, if the document IDs were compact, this wouldn’t even need the document ID list to be unpacked first). Appending would also be efficient, and so would walking through the list.
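Here is a hedged Python sketch of how a lookup might work under that scheme. It does not use Xapian's position list encoding: the unpacked document ID offsets are represented as a plain sorted list, with `None` standing in for the fully compact case where the encoded list is empty. The names `WIDTH`, `encode_lengths` and `lookup` are hypothetical.

```python
from bisect import bisect_left

WIDTH = 3  # bytes per document length

def encode_lengths(doclens):
    """Pack document lengths into fixed 3-byte big-endian slots."""
    return b"".join(n.to_bytes(WIDTH, "big") for n in doclens)

def lookup(first_docid, offsets, lengths, target):
    """Return the document length for target, or None if absent.

    offsets is None when the document IDs are contiguous; otherwise it is
    the sorted list of (docid - first_docid), already unpacked.
    """
    offset = target - first_docid
    count = len(lengths) // WIDTH
    if offsets is None:
        # Compact case: the offset is the slot number, no search needed.
        slot = offset if 0 <= offset < count else None
    else:
        i = bisect_left(offsets, offset)
        slot = i if i < count and offsets[i] == offset else None
    if slot is None:
        return None
    return int.from_bytes(lengths[slot * WIDTH:(slot + 1) * WIDTH], "big")

lengths = encode_lengths([120, 300, 45])
```

In the compact case the seek becomes a single array index, which is the property that makes this layout attractive for the document length list.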
I’m sure we can do better than storing the document lengths in fixed size representation, though. Perhaps we could store a list of bit positions which the document lengths change at, or something. Further thought is required…