Searching with Xapian

The Xapian search engine, and associated topics

Xapian performance comparison with Whoosh

There’s been a bit of buzz today about “Whoosh” – a search engine written in Python. I did some performance measurements, and posted them to the Whoosh mailing list, but it looks like there’s wider interest, so I thought I’d give a brief summary here.

First, I took a corpus of (slightly over) 100,000 documents containing text from the English Wikipedia, and indexed them with Whoosh and with Xapian.

– Whoosh took 81 minutes, and produced a database of 977MB.
– Xapian took 20 minutes, and produced a database of 1.2GB. (Note: the unreleased “chert” backend of Xapian produced a 932MB database, and compacting that with “xapian-compact” produced a 541MB database, but it’s only fair to compare against released versions.)
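
To put the indexing times on a common scale, here is a small sketch that converts them into approximate documents-per-minute rates. It assumes a round 100,000 documents (the corpus was slightly larger), so the rates are rough; the names `DOCS` and `docs_per_minute` are just illustrative.

```python
# Indexing times come from the list above. The document count is an
# approximation: the corpus was "slightly over" 100,000 documents.
DOCS = 100_000


def docs_per_minute(minutes, docs=DOCS):
    """Return the average number of documents indexed per minute."""
    return docs / minutes


whoosh_rate = docs_per_minute(81)  # Whoosh: 81 minutes
xapian_rate = docs_per_minute(20)  # Xapian: 20 minutes
time_ratio = 81 / 20               # how many times longer Whoosh took

print(f"Whoosh: ~{whoosh_rate:.0f} docs/min")
print(f"Xapian: ~{xapian_rate:.0f} docs/min")
print(f"Whoosh took ~{time_ratio:.2f}x as long to index")
```
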

Next, I performed a search speed test – running 10,000 single-word searches (randomly picked from /usr/share/dict/british-english) against each database, and measuring the number of searches performed per second. I ran the tests with the cache cleared (with “echo 3 > /proc/sys/vm/drop_caches”), and then with the cache full (by running the same test again):

– With an empty cache, Whoosh achieved 19.8 searches per second.
– With a full cache, Whoosh achieved 26.3 searches per second.
– With an empty cache, Xapian achieved 83 searches per second.
– With a full cache, Xapian achieved 5408 searches per second.
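
For anyone wanting to repeat this kind of measurement, the test loop can be sketched roughly as follows. This is not the script used for the numbers above, just a minimal illustration of the method described: `search` is a placeholder for any callable that runs one query against whichever engine is being measured, and `load_words` and `searches_per_second` are names invented for this sketch.

```python
import random
import time


def load_words(path="/usr/share/dict/british-english"):
    """Read one candidate query term per line, skipping blank lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.strip() for line in handle if line.strip()]


def searches_per_second(search, words, n_queries=10000, seed=0):
    """Run n_queries single-word searches and return the rate achieved.

    `search` is any callable taking one query string; it stands in for
    the engine under test and is an assumption of this sketch.
    """
    rng = random.Random(seed)
    # Pick the queries up front so that random selection is not timed.
    queries = [rng.choice(words) for _ in range(n_queries)]
    start = time.perf_counter()
    for query in queries:
        search(query)
    elapsed = time.perf_counter() - start
    return n_queries / elapsed
```

Calling `searches_per_second(my_search, load_words())` once after clearing the cache, and then again straight away, would give the empty-cache and full-cache figures respectively. Using a fixed `seed` means both runs issue the same queries, which is what makes the second run a full-cache run.
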

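In summary, Whoosh is damn good for a pure-Python search engine, but Xapian is capable of much better performance: in these tests it indexed about 4 times faster, and searched about 4 times faster with an empty cache and about 200 times faster with a full cache.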

Written by richardboulton

February 12, 2009 at 5:49 pm

4 Responses

  1. I’ve always thought of Whoosh in comparison to (in terms of speed) the few pure-Python search libraries that came before (like Lupy, the old itools, Pyndexter’s native implementation, etc.) and (in terms of convenience) to PyLucene and the various wrappers for Xapian and HyperEstraier.

    The goal was definitely to be useful to Python programmers and wherever compiled libraries are not an option, not to be competitive with Xapian, Lucene, or HE. That would be absurd. If I got carried away on the Whoosh home page and implied otherwise, I apologize. :)

    Richard, thanks again for doing the performance test and sharing the results.

    Matt Chaput

    February 12, 2009 at 11:05 pm

  2. You missed a few critical metrics! See comments here.

    David.

    David W

    February 14, 2009 at 8:29 pm

  3. Richard, thanks for the benchmarks, very useful.

    Matt, as a longtime Python programmer and a Xapian fanboi, I think what you’ve done is fantastic.

    Van Gale

    February 14, 2009 at 10:36 pm

  4. David – interesting comments, if slightly aggressive in your final paragraph!

    I quite agree that a pure Python library wins for convenience, and have never argued against that. It’s rather hard to put figures on convenience, though, and what I was interested in doing here was presenting some quantitative findings (which people had asked for in various blog entries and comments, and of which a rough approximation had been hinted at on the Whoosh front page).

    There are plenty of other metrics I missed too, like quality of results, and feature comparisons. Such comparisons could be done quantitatively, and I would expect Whoosh to come out well from what I’ve seen of its code and design. I’m not likely to have time to put together tests of such things in the near future, but if I do, I’d probably put them up here too, for anyone interested in them.

    My post was merely made because I thought it interesting to share the results of my tests: if I’d been making it to promote Xapian, I’d have pushed links to it everywhere! I mean, Xapian’s (one of) my pet projects, and I’m happy to promote it, but my main interest is in making good search technologies, and helping others to do likewise.

    richardboulton

    February 14, 2009 at 10:41 pm

