Searching with Xapian

The Xapian search engine, and associated topics

Linux disk cache not being ideal

with 2 comments

It's unfair to expect Linux's disk caching or IO to be perfect, but these traces look to me like it failing in a fairly easy situation (though there could be all sorts of other things going on; these traces were taken on a moderately busy machine):

Firstly, here's a trace of a (5 term) Xapian search, repeated 100 times. All accesses in this trace are reads: the horizontal axis is time, the vertical axis is offset in the file, green marks a fast read, and red marks a read taking over 0.001 seconds. Most of the 20-odd disk reads performed by the search are very fast, but there's a single read which takes about 0.01 seconds. Looking at the trace by hand, I can see that exactly the same 8k stretch of disk was read almost immediately before this slow read (there was 1 read intervening). So, either the cache dropped that block just before it was needed, or something else locked up the IO system for a moment:

[Figure: 100_searches_top_10]

This isn't an isolated event, either. The next trace I ran was the same search, with checkatleast set to dbsize. For those unfamiliar with Xapian, this basically means that a bit more IO is done during the search. A similar, though less severe, event happened here: this time, there were 8 intervening reads between a fast read of the block and the slow read (visible again as a red line).

[Figure: 100_searches_check_all]
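Incidentally, for anyone wanting to reproduce the two kinds of run: checkatleast is just the third argument to Enquire::get_mset. A minimal sketch using the Python bindings (the database path and query text are placeholders):

    import xapian

    db = xapian.Database("/path/to/db")      # placeholder path
    qp = xapian.QueryParser()
    qp.set_database(db)

    enquire = xapian.Enquire(db)
    enquire.set_query(qp.parse_query("some five term query here"))

    # Normal run: Xapian is free to stop checking documents once the top 10
    # are determined.
    mset = enquire.get_mset(0, 10)

    # checkatleast = dbsize: every matching document is checked, so a bit
    # more IO is done during the search.
    mset_all = enquire.get_mset(0, 10, db.get_doccount())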

It may well be worth implementing a small disk cache inside xapian, if these events are widespread…
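To illustrate what I mean by a small disk cache, here's a toy LRU cache over 8k block reads. It's in Python for brevity (this is not Xapian code; a real implementation would live in the C++ backend), but the shape would be similar:

    import os
    from collections import OrderedDict

    class BlockCache:
        """Toy LRU cache over fixed-size block reads (illustration only)."""

        def __init__(self, fd, block_size=8192, max_blocks=128):
            self.fd = fd
            self.block_size = block_size
            self.max_blocks = max_blocks
            self.blocks = OrderedDict()   # block number -> block data

        def read_block(self, block_no):
            if block_no in self.blocks:
                # Cache hit: mark the block as most recently used.
                self.blocks.move_to_end(block_no)
                return self.blocks[block_no]
            # Cache miss: read from disk, evict the oldest block if full.
            data = os.pread(self.fd, self.block_size,
                            block_no * self.block_size)
            self.blocks[block_no] = data
            if len(self.blocks) > self.max_blocks:
                self.blocks.popitem(last=False)
            return data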

Written by richardboulton

February 10, 2009 at 4:22 pm

Posted in Uncategorized

Pretty performance pictures

leave a comment »

I’ve been analysing block read patterns.

Firstly, here are the block reads for reading about 6700 documents in a fairly random order (actually sorted by a key). Pretty random, not so good:

[Figure: 53954-seek-read-calls-from-sortdatabase]

But here's the read pattern for reading about 10000 documents in order of docid:

[Figure: 3455-seek-read-calls-from-strace]

Finally, here are the block reads for running xapian-compact:

[Figure: 106400-seek-read-calls-from-xapian-compact]
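For the curious, there's nothing clever behind these pictures: run the process under strace and turn each lseek/read pair into a (time, offset) point. Something along these lines (a rough sketch, not the exact script used here, assuming output from strace -tt -T -e trace=lseek,read):

    import re
    import sys

    # Matches lines such as:
    #   14:05:23.123456 lseek(11, 1048576, SEEK_SET) = 1048576 <0.000008>
    #   14:05:23.123501 read(11, "..."..., 8192) = 8192 <0.000031>
    LSEEK = re.compile(r'^(\d+:\d+:\d+\.\d+) lseek\((\d+), (\d+), SEEK_SET\)')
    READ = re.compile(r'^(\d+:\d+:\d+\.\d+) read\((\d+),.*<(\d+\.\d+)>\s*$')

    offsets = {}   # current file offset for each fd, from the last lseek

    for line in sys.stdin:
        m = LSEEK.match(line)
        if m:
            offsets[int(m.group(2))] = int(m.group(3))
            continue
        m = READ.match(line)
        if m:
            ts, fd, duration = m.group(1), int(m.group(2)), float(m.group(3))
            # One point per read: timestamp, fd, file offset, read duration.
            # Reads slower than 0.001s are the ones drawn in red above.
            print(ts, fd, offsets.get(fd, 0), duration)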

Written by richardboulton

February 9, 2009 at 8:43 am

Posted in Uncategorized

Performance profiling

with 10 comments

I've spent a fair while over the last few days, when not laid low with a cold, hacking on various bits of Xapian to measure and improve its performance.

My colleague Tom Mortimer put together a nice little performance test which involves first indexing about 100,000 articles from wikipedia, and then performing a set of searches over them. We’re working on packaging these up so that more people can try them on different hardware. The results were quite interesting, but the bit I’ve focussed on is that, for this particular setup, the search speed of the unreleased in-development “chert” backend is considerably worse than that of the stable “flint” backend.

I should note that this is quite a small database by Xapian standards, which means that it's fully cached in memory for my tests. The chert database is considerably smaller than the flint one (about 70% of the size, in fact), which means that it would be able to remain fully cached in memory for considerably longer as the data grows. However, the flint backend is able to perform 10,000 searches (returning over 8 million results) in about 1.8 seconds, whereas the chert backend takes 12.6 seconds. We're clearly going to have to fix this before chert is ready for release.

For the gritty details, see http://trac.xapian.org/ticket/326, but the executive summary is that the slowdown is due to the way chert accesses document length information during the search (which is needed for the default weighting model). Indeed, a quick hack I performed to turn off the document length lookup resulted in chert performing 10,000 searches in about 0.75 seconds, so chert is clearly not ready to be binned yet!

In flint, document length information is stored in every posting list, adjacent to the wdf information. This means that a separate lookup for the document length isn't needed, but also means that the document length is duplicated in every posting list, increasing the size of the database. In chert, the document length information is pulled out into a separate posting list, so a separate lookup is required, but the overall database size is considerably smaller. In addition, the document length list is used for nearly every search, so it ought to stay well cached, meaning the side lookups should take very little time.

Currently, the document length posting list uses the same format as normal term posting lists: posting lists are composed of a set of chunks, each of around 2000 bytes, representing a set of items, one per document, in ascending docid order. The chunks are indexed by the document ID of the first item in the chunk, and stored in a btree. This means that the appropriate chunk for a particular document can be found in O(log(number_of_documents)) time, which in practice turns out to be nice and fast.

The problem with chert is the time taken to find the right place inside a chunk. Inside a chunk, the format is essentially a list of pairs of integers (coded in a variable-length representation): the first integer is the document ID increase since the previous entry, and the second is the wdf for a term posting list (or the document length, for the document length posting list). This format is good for efficient appends (ie, fast batch updating) and for walking through the list, but finding the appropriate place in a chunk requires O(length_of_chunk) reads, each of which involves at least one unpacking of a variable-length integer. My profiling indicates that, on average, finding a document length with chert requires scanning through about 30 items in each chunk that is used.
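To make the shape of the problem concrete, here's roughly what the lookup looks like in simplified Python (this glosses over chert's real chunk headers and the "delta minus one" trick, but the costs are the same):

    import bisect

    def unpack_uint(data, pos):
        """Decode one variable-length unsigned integer (7 bits per byte,
        high bit set on continuation bytes).  Returns (value, next_pos)."""
        value, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return value, pos
            shift += 7

    def find_doclen(chunk_first_dids, chunks, did):
        """chunk_first_dids is the sorted list of each chunk's first docid,
        standing in for the btree: finding the right chunk is O(log n)."""
        i = bisect.bisect_right(chunk_first_dids, did) - 1
        if i < 0:
            return None
        # ...but finding did inside the chunk is a linear walk over the
        # (docid delta, doclen) varint pairs: O(items in chunk) per lookup.
        chunk, cur_did, pos = chunks[i], chunk_first_dids[i], 0
        while pos < len(chunk):
            delta, pos = unpack_uint(chunk, pos)
            cur_did += delta
            doclen, pos = unpack_uint(chunk, pos)
            if cur_did == did:
                return doclen
            if cur_did > did:
                break
        return None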

I've performed various optimisations on the code which unpacks the integers, and managed to bring chert's time for the 10,000 searches down to about 7 seconds, but I don't think that approach can go much further. A better datastructure is needed for holding the document length postlists. This can be a different structure from that used for normal postlists, of course, though there might be some benefit in improving the seek speed in term postlist chunks too.

So, what do we need for our datastructure?

  1. Store a list of document ID -> doclen mappings, for a set of document IDs which are close together
  2. Reasonably efficient update desirable
  3. Efficient seek to specified document ID
  4. Efficient walk through list
  5. Reasonably compact representation on disk

The current encoding satisfies items 1, 2, 4 and 5, but fails badly on 3. It usually uses 1 byte for each document ID, and (in my test) an average of 2 bytes for each document length. In fact, the byte for the document ID is usually 0, because what's stored is the difference between consecutive document IDs minus 1, which is zero when the IDs are consecutive.

One possibility is just to store the list of document IDs and lengths in fixed-size slots, so that a binary search can be used to find them. This obviously isn't very compact, but we could restrict a chunk to holding only document IDs within a range of 65535 (storing the offset from the chunk's first ID rather than the absolute ID), so we'd need only 2 bytes per document ID. We'd probably need to allow at least 3 bytes per document length: documents could well be over 65535 words long, but are unlikely to be over 16.7 million.
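A sketch of what the lookup would then look like (the 2-byte-offset-plus-3-byte-length slot layout here is just the idea above, not an actual on-disk format):

    import struct

    SLOT_SIZE = 5   # 2-byte docid offset + 3-byte document length per entry

    def lookup_doclen(chunk, first_did, did):
        """Binary search a chunk of fixed-size slots for the length of did.
        Entries are sorted by docid offset relative to first_did; returns
        None if the document isn't in the chunk."""
        target = did - first_did
        lo, hi = 0, len(chunk) // SLOT_SIZE
        while lo < hi:
            mid = (lo + hi) // 2
            base = mid * SLOT_SIZE
            offset = struct.unpack_from('<H', chunk, base)[0]
            if offset < target:
                lo = mid + 1
            elif offset > target:
                hi = mid
            else:
                low16, high8 = struct.unpack_from('<HB', chunk, base + 2)
                return low16 | (high8 << 16)
        return None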

Xapian already has code to encode a list of integers efficiently; this is used for position lists. It requires the list to be unpacked to read it, but has the nice property that if the list is fully compact (ie, all the integers in a range are present with no gaps) it has a very small encoding. In fact, this particular case could be special-cased so that the encoding doesn't need to be unpacked explicitly. So, we could store the document IDs first in each chunk, as tightly encoded offsets from the chunk's initial document ID, followed by an array of document lengths at 3 bytes each. For fully compact document IDs, the document ID list would encode to an empty list, so we'd use roughly the same amount of space as the current scheme, but we'd be able to binary search for the right entry (and, if the document IDs were compact, we wouldn't even need to unpack the document ID list first). Appending would also be efficient, and so would walking through the list.
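Lookup in that layout would be something like this (assuming the tightly encoded docid-offset list has already been decoded when it isn't empty; the decoder itself would be the existing position-list code):

    import bisect
    import struct

    def lookup_doclen(doclens, docid_offsets, first_did, did):
        """doclens is the flat array of 3-byte document lengths.
        docid_offsets is the decoded docid-offset list, or None for the
        fully compact case, where the list encodes to nothing and no
        decoding is needed at all."""
        if docid_offsets is None:
            index = did - first_did          # contiguous docids: direct index
            if not 0 <= index < len(doclens) // 3:
                return None
        else:
            target = did - first_did
            index = bisect.bisect_left(docid_offsets, target)
            if index == len(docid_offsets) or docid_offsets[index] != target:
                return None
        low16, high8 = struct.unpack_from('<HB', doclens, 3 * index)
        return low16 | (high8 << 16)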

I'm sure we can do better than storing the document lengths in a fixed-size representation, though. Perhaps we could store a list of the bit positions at which the document lengths change, or something. Further thought is required…

Written by richardboulton

February 6, 2009 at 2:56 pm

Posted in Performance

WSGIWebAPI – web services in Python

leave a comment »

This entry isn't directly about Xapian or search technology. Instead, I want to talk about a new Python library, or framework, depending on your point of view, for writing web APIs.

As part of the work I do with Xapian, I've been writing a lot of web services recently. There weren't any Python frameworks which fulfilled my requirements, so I'd been implementing the bits I needed as I went along. Last month I spent 4 weeks commuting from Cambridge to London, so I had lots of 40 minute stretches on a train with my laptop, and took the opportunity to knock the bits of code I'd developed into shape and turn them into a coherent library.

The result is WSGIWebAPI.

There are lots of similar frameworks already in Python – two of the most similar to WSGIWebAPI are web.py and WAPI. There’s a good comparison of some of the candidates at Viktor Nagy’s blog.

None of the existing frameworks satisfied my requirements. Partly this is because my biggest requirement was to know that they existed, and when I started implementing my framework I was on a Napoleonic fort, surrounded by sea, two miles from the nearest internet connection, with only 11 other intrepid hackers for company.

More seriously, most of the frameworks I've found require, or are part of, Django, which is a bit heavyweight for the kind of services I write. web.py seems very nice, but isn't targeted at building web APIs. My main requirement is to have lots of support code available for writing my services, so I don't have to keep writing the same bits of glue code, while keeping everything explicit and having nothing get in my way. I'm quite happy with what I've got so far in WSGIWebAPI, though I'm sure there's plenty left to do. Hopefully, people will try using it for their projects, and let me know which areas aren't ideal for them.

One particularly nice feature is that APIs written with WSGIWebAPI can be automatically documented, using information from documentation comments to explain the purpose of each method, and also hooking into the validation framework to display lists of possible parameters. The documentation output could do with being clearer, but the basic functionality has already been useful in one client job, where I was using a pre-release version of WSGIWebAPI to build APIs very quickly, and was able to share the up-to-date documentation with other team members as soon as the API was working.

My next plan is to use WSGIWebAPI to improve a Xapian search server (also written at the fort), so hopefully my next post will be to announce progress with that!

Written by richardboulton

February 2, 2009 at 8:11 pm

Posted in WSGIWAPI


Geocoding with freely available data

leave a comment »

I’ve been doing some work recently to finish off the “geospatial” branch of Xapian, which will add various features allowing Xapian to be used to do some geospatial searching: for example “find my nearest”, “only show me results within N miles” and “weight by a combination of closeness and relevance”.  This work is nearly ready to be merged into trunk, so it should make the 1.1.0 release, but it relies on users having latitude-longitude coordinates available for their documents and searches.  Thus, it isn’t a lot of use for users who’ve only got addresses or postcodes!

So, I've been researching ways of converting addresses and postcodes to latitude-longitude coordinates, with a bias towards systems which will work in the UK.  Most countries seem to have fairly freely available postcode databases (or ZIP code databases, or whatever they want to call them), but sadly the UK doesn't.  The Royal Mail sells the postcode database, and to get a licence to use it for a website you tend to need to pay a few thousand pounds.  There are also several resellers who sell it more cheaply, or combine it with software to do various lookups in the database, but these aren't much use to small or non-commercial websites.

Fortunately, there are two main options available: there are some freely available geocoding web services (notably from Yahoo and Google), and there are some efforts to produce free replacements for the postcode database.  The web services are rather useful, but have some limitations: notably, limits on the number of requests per IP address per day (5,000 requests for Google, 15,000 for Yahoo).  If you need to do a lot of lookups, these limits could be a big problem; also, if you design your website so that your users do the geocoding lookup directly against Google or Yahoo, users on shared IP addresses can run into these limits very quickly.  Google, in particular, appears to focus on use of their geocoder from browser JavaScript rather than from a server, since most of their documentation for the web service actually describes the JavaScript library used to perform the requests from the browser.

So, it looks like freely available postcode data is the way to go for small or non-commercial websites, or websites on a budget! Besides, rolling your own is much more fun…

A year ago, there didn’t seem to be much useful data around (or at least, I couldn’t find it), but there are now several useful sources of data:

  • http://www.geonames.org/ – this site (as well as providing some web services) provides downloads of freely usable (Creative Commons Attribution 3.0 License) lists of location names, together with the all-important latitude-longitude data for each.  This is immensely useful data!  It also provides a list of 27,000 or so “outcodes” – ie, the first half of postcodes.  This allows the rough location of each postcode to be established, to “town” accuracy, which may be good enough for many applications.
  • http://www.npemap.org.uk/ – this site uses scanned copies of out-of-copyright OS maps (the “New Popular Edition”, to be precise), and the freely available outcodes, to allow users to enter their postcode and point to it on an (old) map.  This has been used to collect nearly 40,000 postcodes (though not all of them are full postcodes), with moderate accuracy.
  • http://www.dracos.co.uk/play/locating-postboxes/ – this site started with a Freedom of Information request to get the list of all postboxes in the UK, and their associated postcodes.  Unfortunately, the list didn't have latitude-longitude coordinates for each postbox, so Matthew Somerville's site allows users to mark on a map the locations of postboxes they know about, and associate them with entries on the list.  This results in a list of postcodes associated with coordinates, which can be added to the list from npemap to get even more freely available postcodes.

So far, I've used Xappy to build a quick toy index of the geonames name data, and it performs remarkably well.  I plan to add a little further parsing, so that I can handle postcodes and partial postcodes as well as possible.
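The toy index really is only a few lines of Xappy. A sketch from memory (the field names are mine, and it assumes the standard tab-separated geonames dump layout, with the name in column 2 and latitude/longitude in columns 5 and 6):

    import xappy

    conn = xappy.IndexerConnection('geonames.db')
    conn.add_field_action('name', xappy.FieldActions.INDEX_FREETEXT)
    conn.add_field_action('name', xappy.FieldActions.STORE_CONTENT)
    conn.add_field_action('latlong', xappy.FieldActions.STORE_CONTENT)

    for line in open('allCountries.txt', encoding='utf-8'):
        fields = line.split('\t')
        doc = xappy.UnprocessedDocument()
        doc.fields.append(xappy.Field('name', fields[1]))
        doc.fields.append(xappy.Field('latlong', fields[4] + ' ' + fields[5]))
        conn.add(doc)
    conn.flush()

    # Searching the toy index:
    sconn = xappy.SearchConnection('geonames.db')
    for result in sconn.search(sconn.query_parse('cambridge'), 0, 10):
        print(result.data)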

One problem is that, given a partially entered postcode, it’s not always possible to tell what the outcode is.  For example, if a user enters “CB22”, it’s not possible to tell if that’s the outcode “CB22”, or the outcode “CB2” followed by the first digit of the second half of the postcode.
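The practical way round this is to generate every interpretation that's consistent with the outcode list, and search for all of them. A sketch (with outcodes being the geonames outcode list loaded into a set):

    def possible_outcodes(partial, outcodes):
        """Return the outcodes a partially entered postcode could refer to.

        A full postcode is "<outcode> <incode>", where the incode is a digit
        followed by two letters.  If the user hasn't typed a space, any split
        of the input into a known outcode plus a plausible start of an incode
        is a candidate: "CB22" gives {"CB2", "CB22"} if both exist.
        """
        partial = partial.replace(' ', '').upper()
        candidates = set()
        for i in range(2, len(partial) + 1):
            head, tail = partial[:i], partial[i:]
            if head in outcodes and could_start_incode(tail):
                candidates.add(head)
        return candidates

    def could_start_incode(tail):
        """True if tail could be the start of an incode (digit, letter, letter)."""
        if len(tail) > 3:
            return False
        if tail and not tail[0].isdigit():
            return False
        return all(c.isalpha() for c in tail[1:])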

Written by richardboulton

December 12, 2008 at 4:47 pm

Posted in Geospatial
