Searching with Xapian

The Xapian search engine, and associated topics

Archive for the ‘Redis’ Category

Using Redis as a backend

with 2 comments

Over the last year, the Redis database, or “data structure server”, has become wildly popular: it’s very lightweight and easy to use, but still very powerful. For introductory reading, I recommend Simon Willison’s tutorial.

I’ve been asked several times whether it would be possible to make a Xapian backend which used Redis for storage. If this could be done nicely, it would be a big boon for people who are already storing other parts of their data in Redis, as it would allow them to maintain their data in a single system, taking advantage of Redis replication (and, when it comes along, Redis clustering). Depending how it was done, it might also be possible to combine Xapian searches with the results of other filtering operations in Redis, and it could be easier to keep the search index up-to-date with the other data stored in Redis. Redis usually keeps all its data in memory (though this is changing with the Redis VM work), so it can access data very quickly, and perform certain operations (like set intersection) quickly.

I think it would be quite possible to do this integration, but it’s not going to be easy to get good performance. The on-disk Xapian backends can be looked at as having three levels: firstly, a “block level”, which consists typically of 8kb blocks holding nodes of the B+tree structure. Secondly, there is a “key-value” level, which is consists of an ordered set of key-value pairs, with efficient ways to append entries, and to iterate through parts of the contents. Finally, there’s a “logical” level, which contains various constructs like “posting lists” – lists of the documents in the database containing each word, and provides methods for accessing these efficiently in the ways required by Xapian’s matcher.

There are, correspondingly, several ways in which you could integrate Redis. Integration at the block level should be very simple to implement, but wouldn’t be able to take advantage of any of the best features of Redis – it would just be a “dumb” key-value server. Xapian doesn’t perform any caching, and relies on the operating system cache to avoid hitting the disk for most of the commonly used blocks, so unless a local cache of values returned from Redis was added, there would be a large amount of data transfer and lots of network round-trips for each search. It also wouldn’t be easy to integrate searches with other (non-Xapian) data stored in Redis (though this could possibly be done using a Xapian::PostingSource subclass).

Integration at the “key-value” level might be a better approach, but shares many of the problems of integration at the “block” level – less data would need to be passed, but the keys and values stored by Xapian are quite highly encoded, so Redis would still be being treated as a “dumb” store.

In contrast, integrating a backend at the “logical” level would allow Xapian to retrieve entire posting lists in a single network round-trip, which could allow reasonably fast searches. For a large database, these lists could get large enough that the time to transfer them over the network is too long, but if an appropriate format was used, you could use some of Redis’s features to sort and weight the stored lists, and to retrieve only certain ranges of the lists. This could also allow modifications to be done efficiently, using Redis operations to append values to lists stored in the database.

Really, to get the best performance, I think you’d want to perform at least some of the matching operations in Redis; which would mean integrating the backend at a high level so that the data stored could be manipulated by Redis, and also performing some integration at the level of Xapian’s matcher. In fact, you’d want to be able to write a simple matcher in Redis, and then perhaps use Xapian to fine-tune the results from this matcher, and support more advanced features. I’ve been taking a look at Redis to see how this could best be done, and will write a follow up post shortly to summarise my findings. First, I have a few patches to submit to Redis which should help this idea, though!

Written by richardboulton

May 24, 2010 at 7:22 pm

Posted in Redis