Searching with Xapian

The Xapian search engine, and associated topics

Xapian performance comparison with Whoosh

with 5 comments

There’s been a bit of buzz today about “Whoosh” – a search engine written in Python. I did some performance measurements, and posted them to the Whoosh mailing list, but it looks like there’s wider interest, so I thought I’d give a brief summary here.

First, I took a corpus of (slightly over) 100,000 documents containing text from the English Wikipedia, and indexed them with Whoosh and with Xapian.

– Whoosh took 81 minutes, and produced a database of 977 MB.
– Xapian took 20 minutes, and produced a database of 1.2 GB. (Note: the unreleased “chert” backend of Xapian produced a 932 MB database, and compacting that with “xapian-compact” produced a 541 MB database, but it’s only fair to compare against released versions.)
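
For anyone wanting to repeat this kind of measurement, here is a minimal sketch of timing an indexing run and measuring the resulting database size on disk. It is not the script used for the numbers above: `build_index` is a hypothetical callable that you would write yourself to index the corpus into `index_dir` with whichever engine is being tested.

```python
import os
import time


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def time_indexing(build_index, index_dir):
    """Run build_index(index_dir) once.

    build_index is a hypothetical, user-supplied function (an assumption,
    not part of Whoosh or Xapian). Returns (elapsed seconds, size in bytes).
    """
    start = time.perf_counter()
    build_index(index_dir)
    elapsed = time.perf_counter() - start
    return elapsed, directory_size_bytes(index_dir)
```

Measuring wall-clock time and the total size of the files in the database directory keeps the comparison independent of either engine's internals.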

Next, I performed a search speed test – running 10,000 single-word searches (randomly picked from /usr/share/dict/british-english) against the database, and measuring the number of searches performed per second. I did tests with the cache cleared (with “echo 3 > /proc/sys/vm/drop_caches”), and then with the cache full (by running the same test again):

– With an empty cache, Whoosh achieved 19.8 searches per second.
– With a full cache, Whoosh achieved 26.3 searches per second.
– With an empty cache, Xapian achieved 83 searches per second.
– With a full cache, Xapian achieved 5408 searches per second.
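
The search test described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the script that produced these figures: `search` is a hypothetical callable taking one query word and running it against whichever engine is under test, and the helper names are made up for this example.

```python
import random
import time


def load_words(path="/usr/share/dict/british-english"):
    """Read one word per line from a dictionary file, skipping blank lines."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.strip() for line in f if line.strip()]


def searches_per_second(search, words, n_queries=10000, seed=0):
    """Run n_queries randomly chosen single-word searches and return the rate.

    search is a hypothetical, user-supplied function (an assumption) that
    performs one query against the engine being measured.
    """
    rng = random.Random(seed)
    # Pick the queries before starting the clock so that only the
    # searches themselves are timed.
    queries = [rng.choice(words) for _ in range(n_queries)]
    start = time.perf_counter()
    for query in queries:
        search(query)
    elapsed = time.perf_counter() - start
    return n_queries / elapsed
```

To reproduce the two conditions, clear the cache (as root) before the first call for the empty-cache figure, then call the function again with the same seed for the full-cache figure.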

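In summary, Whoosh is damn good for a pure-Python search engine, but Xapian is capable of much better performance: in these tests it indexed about 4 times faster, searched about 4 times faster with an empty cache, and searched about 200 times faster with a full cache.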

Written by richardboulton

February 12, 2009 at 5:49 pm

Posted in Uncategorized

5 Responses

  1. I’ve always thought of Whoosh in comparison to (in terms of speed) the few pure-Python search libraries that came before (like Lupy, the old itools, Pyndexter’s native implementation, etc.) and (in terms of convenience) to PyLucene and the various wrappers for Xapian and HyperEstraier.

    The goal was definitely to be useful to Python programmers and wherever compiled libraries are not an option, not to be competitive with Xapian, Lucene, or HE. That would be absurd. If I got carried away on the Whoosh home page and implied otherwise, I apologize. 🙂

    Richard, thanks again for doing the performance test and sharing the results.

    Matt Chaput

    February 12, 2009 at 11:05 pm

  2. You missed a few critical metrics! See comments here.

    David.

    David W

    February 14, 2009 at 8:29 pm

  3. Richard, thanks for the benchmarks, very useful.

    Matt, as a longtime Python programmer and a Xapian fanboi, I think what you’ve done is fantastic.

    Van Gale

    February 14, 2009 at 10:36 pm

  4. David – interesting comments, if slightly aggressive in your final paragraph!

    I quite agree that a pure-Python library wins for convenience, and have never argued against that. It’s rather hard to put figures on convenience, though, and what I was interested in doing here was presenting some quantitative findings (which people had asked for in various blog entries and comments, and of which a rough approximation had been hinted at on the Whoosh front page).

    There are plenty of other metrics I missed too, like quality of results, and feature comparisons. Such comparisons could be done quantitatively, and I would expect Whoosh to come out well from what I’ve seen of its code and design. I’m not likely to have time to put together tests of such things in the near future, but if I do, I’d probably put them up here too, for anyone interested in them.

    My post was merely made because I thought it interesting to share the results of my tests: if I’d been making it to promote Xapian, I’d have pushed links to it everywhere! I mean, Xapian’s (one of) my pet projects, and I’m happy to promote it, but my main interest is in making good search technologies, and helping others to do likewise.

    richardboulton

    February 14, 2009 at 10:41 pm

  5. Just found this again recently, hope you don’t mind me adding a comment, but it’s funny to think how bad Whoosh was back then! (Well, not funny ha ha ;).

    After two years the performance numbers are closer. My recent benchmarks show 54 minutes for Whoosh to index the Enron corpus, and 1700+ searches/s on a warm index, with index size 1.12 GB, compared to 25 minutes, 2700+ searches/s, and 2.71 GB for the Xapian Python bindings (which are not fielded, of course). Xappy is actually worse, at 80 minutes, 1700+ searches/s, and 5.82 GB.

    Whoosh compresses its posting lists by default so it’s possible I could eke out a little more performance in exchange for index size.

    (I’m getting about 80 searches/s on a Solr server, so who knows what’s going on there.)

    Matt Chaput

    December 17, 2010 at 11:06 pm

