I spent the weekend away from my computer, visiting a sunny and pleasant Sheffield. This left the back of my mind free to churn, and now I’m back at a computer I need to write down some of the conclusions I came to, so that I can focus on what I’m meant to be working on (… about which, more soon). I should warn you that I’ve not thought through all of the following as carefully as I usually might – I’d welcome discussion based on it, though!
New backend performance measurements
Performance tests I’ve been running lately with Xapian have been a mix of encouraging and dispiriting. I’ve got performance figures which vary, but go up to around 8000 searches per second for a 100,000 document collection, which is pretty reasonable (and comparable with / slightly better than Lucene, depending on the kind of searches being done). We’ve also got database sizes down to about the same size as Lucene. So some things are “okay”. With this size of database, we’re largely CPU bound, and I’ve tried various code optimisations which have quite an effect (none of these are applied yet, since I was just feeling my way around to see what made a difference, but there are various patches in ticket #326 in Xapian’s tracker).
However, the new “chert” backend isn’t performing as well as earlier hoped; it’s size is considerably smaller than the old “flint” backend’s size, but search speeds are much worse. This is because it stores the document length information separately from the postings; reducing duplication, but adding an overhead for seeking to the right place in the document-length list, and looking up each document. The current code in SVN is rather pathological; one of the patches improves the seek time to be half the current time (on average); but even if we managed to remove the seek time entirely, the additional call overhead for unpacking the document length list means that chert would still be slower than flint for searches with a single term (it should start beating flint for searches with 2 or more terms, according to my calculations). Of course, this is with small databases (around 600Mb for my main current test), which rapidly get cached fully in memory, so CPU overhead dominates. Chert should do much better than flint when IO dominates, due to its smaller database size.
Indexing figures are less encouraging; we’re really quite slow compared to (Java) Lucene for indexing. My current understanding (based on watching block accesses, and knowledge of the code) is that, for small collections, each time we flush we’re updating pretty much every block in the posting table, and therefore we’re incurring a cost proportional to the current size of the database: we currently flush every D documents (where D defaults to 10,000); so this means indexing time is given by the number of flushes multiplied by the size of the database at each flush; which works out as O(N*N). This is bad – we should be able to achieve O(N*log(N)) at least. The problem is that our strategy is optimised for very large databases; once the databases start being nice and large we do much better, because we start skipping large parts of the database as we pass through it during each update; the cost of a flush becomes proportional to the size of the changes, rather than the size of the database, which in the long run means that we should get the total time being something like O(N). What we really want is either a separate strategy for small databases (and by small I mean up to a few Gb of data) , or a new strategy which works well for both small and large databases. A hierarchical merge process for the posting lists might fit the bill (similar to what Lucene does). My plan is to prototype such a technique in Xappy by building multiple sub-databases, and calling xapian-compact to merge them; this should give some idea if this is a worthwhile approach.
Indexing for small databases also appears to be quite CPU heavy, so the same kind of code optimisations which work well for searching may well help here.
So, the picture is of a new backend which improves some things considerably (50% smaller databases, less IO, etc), but has some major drawbacks which need to be addressed before it’s production ready. When these are addressed it will hopefully be a large improvement over flint, perhaps boosting speeds by up to 50%, but there’s still a fair way to go to get there.
Radical new approaches
A long time ago, search engines used to have a list of documents for each term in a query, and they’d perform a search by reading each of these lists in turn, making a big array in memory holding the score-so-far of each document, and adding values to this array from each list. This approach doesn’t work well for “web-scale” databases – you need to be able to keep an entry for all the potential matches in memory at once, which can be impractical, and you also can’t avoid doing any work – you need to read the whole of each list.
You normally only want to get the top few results for each search, so “modern” search engines (like Xapian and, as far as I know, Lucene) work the other way around; they open each list in parallel, and move through them in lock-step, keeping track of the best results seen so far, and immediately discarding any results which aren’t good enough to get into the top N. This requires much less memory, but also allows various optimisations to be performed; for example, for an AND operator for which one term is frequent and the other is rare, it is only necessary to read a few entries from the frequent term’s list to check if they’re there, rather than having to read the whole list. Xapian implements an efficient skipping mechanism to take advantage of this kind of optimisation. Also, when a list doesn’t have any entries left, the maximum possible weight of any remaining documents goes down, and it can be possible to shortcut the search in this way; and even if it isn’t possible to shortcut the search, it’s often possible to convert an OR operator into an AND operator, if it’s necessary for both terms to be present in order for a document to get enough weight to get into the top 10. Xapian’s matcher has been pretty heavily optimised over the years – there are still plenty of cases where it could probably do a better job, but it’s pretty decent at taking advantage of whatever information is available.
However, the Xapian matcher is restricted by it’s need to read all the lists in step. We’ve discussed changing this in certain cases in the past – for example, a phrase search can be split into two parts; a (fairly cheap) AND search to require that all the terms are present, followed by a check in the position lists to see if the terms occur in close proximity, and in the right order. It would be nice to skip the position list lookup for documents which wouldn’t have enough weight to get into the result set anyway, but Xapian’s matcher can’t currently accommodate this.
What would be nice would be to relax the processing of the term lists, so that they’re only “roughly” processed in lockstep. We’d keep a list of “candidate” documents – ones which contain all the terms, but still need the phrase check to be performed. When the candidate list gets full, we’d check the positions for the highest weighted item in this list, and either discard it or add it to the result set. This way, we’d avoid checking the positions for documents which are currently seen early enough in the matching process that they have a chance of getting into the “top-documents-so-far” list, which could be a big win.
To get radically better search performance, though, I think we’d need to start storing some specialised information in the database. Various things spring to mind:
- Storing the entries in each list with the best combination of wdf and doclength separately, so they can be accessed without iterating through the whole list.
- Storing information about the term frequency ranges in each chunk of the posting lists, which might allow chunks to be skipped more often.
- Storing extra “index” terms in the database – for example, for phrases, storing terms which represent each pair of words.
The first of these seems the most likely to be useful, to me. It would allow single-term searches to look only at the top documents, which would make these return more frequently. It could also allow the matcher to skip more entries in multi-term searches, since the list of “non-top” documents for the term would have a lower “maximum possible weight”, and could perhaps be terminated early more often than the full list of documents for the term.
To take best advantage of the opportunity presented by having a list of the best entries for each term, though, I think we’d want a redesigned matcher (which is obviously not a quick-hack project)!
My basic idea is that at indexing time we’d split the posting lists for each term into several pieces; the highest weighted documents in one, lower weighted documents in the next, and possibly have several levels in which each level contains entries with a higher weight than all lower levels. (Note – this would only be possible for fixed values of the BM25 weight parameters, so we’d need to fix these at index time. We could even store the resulting weight in each list, rather than the wdf/doclength combination – though it might be more space efficient to store them separately!)
The matcher would have a working set of documents (perhaps around 10000 – experimentation would be required to find the optimal size), and would start by reading the top level lists for each term, to populate this set. The key thing is that the matcher wouldn’t always know the exact weight for the documents in the working set, because only some of the terms would have been checked; instead, it would know an upper and lower bound for the weight, and have a list of terms which haven’t been checked for that document yet.
After processing the top-level lists for each term, it’s possible that the working set would contain enough documents for which all terms have been seen that the search could finish. Alternatively, it might be possible to derive a high “minimum weight” based on the lower bounds for the documents in the working set, which would allow some of the other posting lists to be discarded. It might be tricky to combine this with the boolean operators which Xapian supports, but it doesn’t seem implausible to me that this approach could work well.
This might be an area in which Whoosh could be useful for prototyping – being written in Python, it might be easier to experiment with these ideas there.
Another thing I’ve been pondering is the best way to calculate the list of tags attached to the results of a search. For this problem, a tag could be a user-entered tag, used to generate a tag cloud, or a category or facet used for faceted browsing. The fundamental search problem is the same.
Currently Xapian has some support for this (on a branch, and in the Xappy tarballs), which works simply by looking up the tags in each potential match, and combining them together. It’s fairly efficient, but unfortunately to get an accurate result it requires that most of the optimisations discussed so far are disabled.
What would be nicer would be if we could store the list of tags relevant to the documents containing a particular term, and simply combine these sets appropriately. Nice in theory, but I don’t immediately see a way to make this work with anything other than a straightforward OR query; if the terms are ANDed together, it’s not possible to know whether the tags occur in the same document or not. It gets even worse if you have a match-decider, range search, or weight-cutoff in play, too.
I’ve not really come up with any bright ideas for this one yet – give me a suggestion and I’ll explain why it doesn’t work!