Over the last year, the Redis database, or “data structure server”, has become wildly popular: it’s very lightweight and easy to use, but still very powerful. For introductory reading, I recommend Simon Willison’s tutorial.
I’ve been asked several times whether it would be possible to make a Xapian backend which used Redis for storage. If this could be done nicely, it would be a big boon for people who are already storing other parts of their data in Redis, as it would allow them to maintain their data in a single system, taking advantage of Redis replication (and, when it comes along, Redis clustering). Depending how it was done, it might also be possible to combine Xapian searches with the results of other filtering operations in Redis, and it could be easier to keep the search index up-to-date with the other data stored in Redis. Redis usually keeps all its data in memory (though this is changing with the Redis VM work), so it can access data very quickly, and perform certain operations (like set intersection) quickly.
I think it would be quite possible to do this integration, but it’s not going to be easy to get good performance. The on-disk Xapian backends can be looked at as having three levels: firstly, a “block level”, which consists typically of 8kb blocks holding nodes of the B+tree structure. Secondly, there is a “key-value” level, which is consists of an ordered set of key-value pairs, with efficient ways to append entries, and to iterate through parts of the contents. Finally, there’s a “logical” level, which contains various constructs like “posting lists” – lists of the documents in the database containing each word, and provides methods for accessing these efficiently in the ways required by Xapian’s matcher.
There are, correspondingly, several ways in which you could integrate Redis. Integration at the block level should be very simple to implement, but wouldn’t be able to take advantage of any of the best features of Redis – it would just be a “dumb” key-value server. Xapian doesn’t perform any caching, and relies on the operating system cache to avoid hitting the disk for most of the commonly used blocks, so unless a local cache of values returned from Redis was added, there would be a large amount of data transfer and lots of network round-trips for each search. It also wouldn’t be easy to integrate searches with other (non-Xapian) data stored in Redis (though this could possibly be done using a Xapian::PostingSource subclass).
Integration at the “key-value” level might be a better approach, but shares many of the problems of integration at the “block” level – less data would need to be passed, but the keys and values stored by Xapian are quite highly encoded, so Redis would still be being treated as a “dumb” store.
In contrast, integrating a backend at the “logical” level would allow Xapian to retrieve entire posting lists in a single network round-trip, which could allow reasonably fast searches. For a large database, these lists could get large enough that the time to transfer them over the network is too long, but if an appropriate format was used, you could use some of Redis’s features to sort and weight the stored lists, and to retrieve only certain ranges of the lists. This could also allow modifications to be done efficiently, using Redis operations to append values to lists stored in the database.
Really, to get the best performance, I think you’d want to perform at least some of the matching operations in Redis; which would mean integrating the backend at a high level so that the data stored could be manipulated by Redis, and also performing some integration at the level of Xapian’s matcher. In fact, you’d want to be able to write a simple matcher in Redis, and then perhaps use Xapian to fine-tune the results from this matcher, and support more advanced features. I’ve been taking a look at Redis to see how this could best be done, and will write a follow up post shortly to summarise my findings. First, I have a few patches to submit to Redis which should help this idea, though!
It’s been a long time since I posted here, and there are a lot of recent developments in Xapian to talk about – not least, the release of Xapian 1.2.0, the first in the new stable release series, which brings a lot of new features over the last stable release series (the 1.0 series). Of particular interest in the 1.2 series are things like the ability to get faceted classifications for search results, and to use custom “PostingSource” subclasses to perform custom weighting operations, or to bring in external data to modify the set of matching documents. It also includes a replication framework to allow scaling a search installation horizontally, and a new default backend database format which is considerably smaller on disk, while still storing some additional statistics which are used to optimise the searches further. We’re still working on writing up the full set of changes since the last release in the 1.0 series, which will ultimately appear at http://trac.xapian.org/wiki/ReleaseOverview/1.2.0.
Each of these features deserves a blog post to themselves, and I’ll try and write some of these over the coming weeks, but in the meantime I encourage you to download 1.2.0 and play with the new features.
Last night, I gave a brief 13 minute talk to the “Cambridge Geek Night” – a group of around 30 very switched on and interesting technology people, meeting above a pub in Cambridge. The atmosphere was wonderful, and I had a great time. There were also talks from Taylor Vinters solicitors about intellectual property, and Michael Brunton-Spall about many of the Guardian’s electronics services and experiments.
I’ve put the slides up on slideshare here – I’ll try and find some time to record an audio track to go with them soon.
My colleague Tom Mortimer has made a useful post on the Flax blog to clear up a common set of confusions regarding Xapian; covering the difference between, and the proper usage of, terms, values and document data. It even has a pretty diagram to help explain it!
I spent the weekend away from my computer, visiting a sunny and pleasant Sheffield. This left the back of my mind free to churn, and now I’m back at a computer I need to write down some of the conclusions I came to, so that I can focus on what I’m meant to be working on (… about which, more soon). I should warn you that I’ve not thought through all of the following as carefully as I usually might – I’d welcome discussion based on it, though!
New backend performance measurements
Performance tests I’ve been running lately with Xapian have been a mix of encouraging and dispiriting. I’ve got performance figures which vary, but go up to around 8000 searches per second for a 100,000 document collection, which is pretty reasonable (and comparable with / slightly better than Lucene, depending on the kind of searches being done). We’ve also got database sizes down to about the same size as Lucene. So some things are “okay”. With this size of database, we’re largely CPU bound, and I’ve tried various code optimisations which have quite an effect (none of these are applied yet, since I was just feeling my way around to see what made a difference, but there are various patches in ticket #326 in Xapian’s tracker).
However, the new “chert” backend isn’t performing as well as earlier hoped; it’s size is considerably smaller than the old “flint” backend’s size, but search speeds are much worse. This is because it stores the document length information separately from the postings; reducing duplication, but adding an overhead for seeking to the right place in the document-length list, and looking up each document. The current code in SVN is rather pathological; one of the patches improves the seek time to be half the current time (on average); but even if we managed to remove the seek time entirely, the additional call overhead for unpacking the document length list means that chert would still be slower than flint for searches with a single term (it should start beating flint for searches with 2 or more terms, according to my calculations). Of course, this is with small databases (around 600Mb for my main current test), which rapidly get cached fully in memory, so CPU overhead dominates. Chert should do much better than flint when IO dominates, due to its smaller database size.
Indexing figures are less encouraging; we’re really quite slow compared to (Java) Lucene for indexing. My current understanding (based on watching block accesses, and knowledge of the code) is that, for small collections, each time we flush we’re updating pretty much every block in the posting table, and therefore we’re incurring a cost proportional to the current size of the database: we currently flush every D documents (where D defaults to 10,000); so this means indexing time is given by the number of flushes multiplied by the size of the database at each flush; which works out as O(N*N). This is bad – we should be able to achieve O(N*log(N)) at least. The problem is that our strategy is optimised for very large databases; once the databases start being nice and large we do much better, because we start skipping large parts of the database as we pass through it during each update; the cost of a flush becomes proportional to the size of the changes, rather than the size of the database, which in the long run means that we should get the total time being something like O(N). What we really want is either a separate strategy for small databases (and by small I mean up to a few Gb of data) , or a new strategy which works well for both small and large databases. A hierarchical merge process for the posting lists might fit the bill (similar to what Lucene does). My plan is to prototype such a technique in Xappy by building multiple sub-databases, and calling xapian-compact to merge them; this should give some idea if this is a worthwhile approach.
Indexing for small databases also appears to be quite CPU heavy, so the same kind of code optimisations which work well for searching may well help here.
So, the picture is of a new backend which improves some things considerably (50% smaller databases, less IO, etc), but has some major drawbacks which need to be addressed before it’s production ready. When these are addressed it will hopefully be a large improvement over flint, perhaps boosting speeds by up to 50%, but there’s still a fair way to go to get there.
Radical new approaches
A long time ago, search engines used to have a list of documents for each term in a query, and they’d perform a search by reading each of these lists in turn, making a big array in memory holding the score-so-far of each document, and adding values to this array from each list. This approach doesn’t work well for “web-scale” databases – you need to be able to keep an entry for all the potential matches in memory at once, which can be impractical, and you also can’t avoid doing any work – you need to read the whole of each list.
You normally only want to get the top few results for each search, so “modern” search engines (like Xapian and, as far as I know, Lucene) work the other way around; they open each list in parallel, and move through them in lock-step, keeping track of the best results seen so far, and immediately discarding any results which aren’t good enough to get into the top N. This requires much less memory, but also allows various optimisations to be performed; for example, for an AND operator for which one term is frequent and the other is rare, it is only necessary to read a few entries from the frequent term’s list to check if they’re there, rather than having to read the whole list. Xapian implements an efficient skipping mechanism to take advantage of this kind of optimisation. Also, when a list doesn’t have any entries left, the maximum possible weight of any remaining documents goes down, and it can be possible to shortcut the search in this way; and even if it isn’t possible to shortcut the search, it’s often possible to convert an OR operator into an AND operator, if it’s necessary for both terms to be present in order for a document to get enough weight to get into the top 10. Xapian’s matcher has been pretty heavily optimised over the years – there are still plenty of cases where it could probably do a better job, but it’s pretty decent at taking advantage of whatever information is available.
However, the Xapian matcher is restricted by it’s need to read all the lists in step. We’ve discussed changing this in certain cases in the past – for example, a phrase search can be split into two parts; a (fairly cheap) AND search to require that all the terms are present, followed by a check in the position lists to see if the terms occur in close proximity, and in the right order. It would be nice to skip the position list lookup for documents which wouldn’t have enough weight to get into the result set anyway, but Xapian’s matcher can’t currently accommodate this.
What would be nice would be to relax the processing of the term lists, so that they’re only “roughly” processed in lockstep. We’d keep a list of “candidate” documents – ones which contain all the terms, but still need the phrase check to be performed. When the candidate list gets full, we’d check the positions for the highest weighted item in this list, and either discard it or add it to the result set. This way, we’d avoid checking the positions for documents which are currently seen early enough in the matching process that they have a chance of getting into the “top-documents-so-far” list, which could be a big win.
To get radically better search performance, though, I think we’d need to start storing some specialised information in the database. Various things spring to mind:
- Storing the entries in each list with the best combination of wdf and doclength separately, so they can be accessed without iterating through the whole list.
- Storing information about the term frequency ranges in each chunk of the posting lists, which might allow chunks to be skipped more often.
- Storing extra “index” terms in the database – for example, for phrases, storing terms which represent each pair of words.
The first of these seems the most likely to be useful, to me. It would allow single-term searches to look only at the top documents, which would make these return more frequently. It could also allow the matcher to skip more entries in multi-term searches, since the list of “non-top” documents for the term would have a lower “maximum possible weight”, and could perhaps be terminated early more often than the full list of documents for the term.
To take best advantage of the opportunity presented by having a list of the best entries for each term, though, I think we’d want a redesigned matcher (which is obviously not a quick-hack project)!
My basic idea is that at indexing time we’d split the posting lists for each term into several pieces; the highest weighted documents in one, lower weighted documents in the next, and possibly have several levels in which each level contains entries with a higher weight than all lower levels. (Note – this would only be possible for fixed values of the BM25 weight parameters, so we’d need to fix these at index time. We could even store the resulting weight in each list, rather than the wdf/doclength combination – though it might be more space efficient to store them separately!)
The matcher would have a working set of documents (perhaps around 10000 – experimentation would be required to find the optimal size), and would start by reading the top level lists for each term, to populate this set. The key thing is that the matcher wouldn’t always know the exact weight for the documents in the working set, because only some of the terms would have been checked; instead, it would know an upper and lower bound for the weight, and have a list of terms which haven’t been checked for that document yet.
After processing the top-level lists for each term, it’s possible that the working set would contain enough documents for which all terms have been seen that the search could finish. Alternatively, it might be possible to derive a high “minimum weight” based on the lower bounds for the documents in the working set, which would allow some of the other posting lists to be discarded. It might be tricky to combine this with the boolean operators which Xapian supports, but it doesn’t seem implausible to me that this approach could work well.
This might be an area in which Whoosh could be useful for prototyping – being written in Python, it might be easier to experiment with these ideas there.
Another thing I’ve been pondering is the best way to calculate the list of tags attached to the results of a search. For this problem, a tag could be a user-entered tag, used to generate a tag cloud, or a category or facet used for faceted browsing. The fundamental search problem is the same.
Currently Xapian has some support for this (on a branch, and in the Xappy tarballs), which works simply by looking up the tags in each potential match, and combining them together. It’s fairly efficient, but unfortunately to get an accurate result it requires that most of the optimisations discussed so far are disabled.
What would be nicer would be if we could store the list of tags relevant to the documents containing a particular term, and simply combine these sets appropriately. Nice in theory, but I don’t immediately see a way to make this work with anything other than a straightforward OR query; if the terms are ANDed together, it’s not possible to know whether the tags occur in the same document or not. It gets even worse if you have a match-decider, range search, or weight-cutoff in play, too.
I’ve not really come up with any bright ideas for this one yet – give me a suggestion and I’ll explain why it doesn’t work!
How do I use it?
Firstly, you’ll need the latest Xappy SVN HEAD, and the corresponding Xapian packages (obtainable using the
This code adds a new field action ("IMGSEEK") to Xappy, which is used to specify that a field will contain filenames of images to index. These images will be loaded at index time, and certain "features" will be extracted which allow efficient similarity comparisons.
The action takes a number of parameters; the most crucial of these is the "terms" parameter which selects between two different methods of indexing the content. The details are in the documentation in Xappy, but the general effect is that if you index with "terms" set, you'll get a considerably larger database, but much faster searches. The results with
terms=True are a slightly approximated version of those with
terms=False, but in our tests the approximation didn't seem to harm the results noticeably.
The really nice thing about the integration is that the image similarity search boils down to a standard Xapian search. This means that it's possible to combine an image similarity search with any other type of Xapian search. For example, you could use this to build a search for images which are similar to a photo you've just taken, which are also related to the search text "Church", and are within 5 miles of you.
So far, we've tested the similarity matching with a collection of 500,000 images. A pure similarity search on this collection returns results in around 1.5 seconds using the "terms" method (or around 20 seconds using the "values" method). This is fast enough for some applications, but we hope (indeed, expect!) to improve performance further in future.
However, even with the current performance, if you combine the similarity search with a second Xapian search, you will often be able to get much better performance - because the similarity search will only need to be carried out for those documents returned by the second search.
The image similarity algorithm used is certainly worthwhile for some applications; however, the results aren't quite state-of-the-art. In particular, the algorithm is not pose-invariant: it's not good at spotting rotated images.
Also, currently only images in JPEG format are supported. It would be fairly easy to add support for more formats, if needed.
There are also quite a few parameters which could be tuned to improve the quality of results for a particular collection. The best way of doing this would be to gather some sets of human judgments of image similarity, and use this for tuning. A nice little project in future would be to build a little web application to allow such judgments to be made...
How does it work?
The core of the algorithm is some (GPL) code taken from ImgSeek, which computes features based on an algorithm described in the paper YIQ colourspace).
40 of these numbers will correspond to the 40 most significant wavelets found in a wavelet decomposition of the image. The final number is based on the average luminosity of the image, and is basically a compensation factor. The image similarity is given by the sum of the weights for the most significant wavelet features, minus a component based on the average luminosity.
There are quite a few tunable parameters in the algorithm used, governing how much importance is ascribed to different components of the image, and different features - currently, the only one of these which is exposed in the interface is the number of "buckets" used for the average luminosity part of the search (which controls how precise the calculation is: more buckets gives a more precise calculation but is slower). I plan to expose more of these parameters, so that a system can be built to tune them optimally.
The image retrieval works best if background noise can be removed from images before they're indexed. I hope to be adding some code for doing this to Xappy shortly (also thanks to MyDeco!).
Finally, I hope to work on getting improved performance from the system. There are some general speed improvements in Xapian which will help with this, but I've also got a few ideas for ways to tweak the image retrieval algorithm too.
There’s been a bit of buzz today about “Whoosh” – a search engine written in Python. I did some performance measurements, and posted them to the Whoosh mailing list, but it looks like there’s wider interest, so I thought I’d give a brief summary here.
First, I took a corpus of (slightly over) 100,000 documents containing text from the english wikipedia, and indexed them with whoosh, and with xapian.
– Whoosh took 81 minutes, and produced a database of 977Mb.
– Xapian took 20 minutes, and produced a database of 1.2Gb. (Note: the unreleased “chert” backend of xapian produced a 932Mb database, and compacting that with “xapian-compact” produced a 541Mb database, but it’s only fair to compare against released versions.)
Next, I performed a search speed test – running 10000 1 word searches (randomly picked from /usr/share/dict/british-english) against the database, and measuring the number of searches performed per second. I did tests with the cache cleared (with “echo 3 > /proc/sys/vm/drop_caches”), and then with the cache full (by running the same test again):
– With an empty cache, whoosh achieved 19.8 searches per second.
– With a full cache, whoosh achieved 26.3 searches per second.
– With an empty cache, xapian achieved 83 searches per second.
– With a full cache, xapian achieved 5408 searches per second.
In summary, whoosh is damn good for a pure-python search engine, but Xapian is capable of much better performance.