Searching with Xapian

The Xapian search engine, and associated topics

Geocoding with freely available data

I’ve been doing some work recently to finish off the “geospatial” branch of Xapian, which will add various features allowing Xapian to be used to do some geospatial searching: for example “find my nearest”, “only show me results within N miles” and “weight by a combination of closeness and relevance”.  This work is nearly ready to be merged into trunk, so it should make the 1.1.0 release, but it relies on users having latitude-longitude coordinates available for their documents and searches.  Thus, it isn’t a lot of use for users who’ve only got addresses or postcodes!
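To make “find my nearest” and “only show me results within N miles” concrete, here is a minimal sketch in plain Python.  It is not the API of the geospatial branch: the function names, the tuple layout of the documents and the Earth radius constant are all assumptions made for illustration.  It simply applies the haversine great-circle formula to latitude-longitude pairs.

```python
import math

# Mean Earth radius in miles; an approximation chosen for this sketch.
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_miles(origin, documents, max_miles):
    """Return (distance, name) pairs no further than max_miles, nearest first.

    origin is a (lat, lon) pair; documents is an iterable of
    (name, lat, lon) tuples (a hypothetical layout, not Xapian's).
    """
    hits = []
    for name, lat, lon in documents:
        distance = haversine_miles(origin[0], origin[1], lat, lon)
        if distance <= max_miles:
            hits.append((distance, name))
    return sorted(hits)
```

The first element of the sorted result answers “find my nearest”, and the `max_miles` cut-off gives the range filter.  Weighting by a combination of closeness and relevance would need a relevance score as well, which this sketch leaves out.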

So, I’ve been researching ways of converting addresses and postcodes to latitude-longitude coordinates, with a bias towards systems which will work in the UK.  Most countries seem to have fairly freely available postcode databases (or ZIP codes, or whatever they’re called locally), but sadly the UK doesn’t.  The Royal Mail sells the postcode database, and a license to use it for a website tends to cost a few thousand pounds.  There are also several resellers who sell it more cheaply, or combine it with software to do various lookups in the database, but these aren’t much use to small or non-commercial websites.

Fortunately, there are two main options available: freely available geocoding web services (notably from Yahoo and Google), and efforts to produce free replacements for the postcode database.  The web services are rather useful, but have some limitations, notably limits on the number of requests per IP address per day (5,000 requests for Google, 15,000 requests for Yahoo).  If you need to do a lot of lookups, these limits could be a big problem.  Also, if you design your website such that your users do the geocoding lookup directly against Google or Yahoo, users who are on shared IP addresses can run into these limits very quickly.  Google, in particular, appears to focus on use of their geocoder from browser JavaScript rather than from a server, since most of their web service documentation covers the JavaScript library used to perform the requests from the browser.
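If you do call one of these services from a server, one way to stay under a per-day limit is to cache results and count the requests actually sent.  The sketch below is only an illustration of that idea: `CachedGeocoder`, `QuotaExceeded` and the `lookup` callable are hypothetical names, and `lookup` stands in for whatever function really talks to the web service (no real Google or Yahoo API is used here).

```python
import datetime


class QuotaExceeded(Exception):
    """Raised when today's request budget has been used up."""


class CachedGeocoder:
    """Wrap a geocoding function with a cache and a daily request budget."""

    def __init__(self, lookup, daily_limit, today=datetime.date.today):
        self._lookup = lookup            # hypothetical: address -> (lat, lon)
        self._daily_limit = daily_limit
        self._today = today              # injectable, to make testing easy
        self._cache = {}
        self._day = None
        self._used = 0

    def geocode(self, address):
        # Normalise case and whitespace so trivial variants share a cache slot.
        key = " ".join(address.upper().split())
        if key in self._cache:
            return self._cache[key]
        day = self._today()
        if day != self._day:
            # First request of a new day: reset the counter.
            self._day, self._used = day, 0
        if self._used >= self._daily_limit:
            raise QuotaExceeded("daily limit of %d reached" % self._daily_limit)
        self._used += 1
        result = self._lookup(address)
        self._cache[key] = result
        return result
```

With the figures quoted above you would pass `daily_limit=5000` for Google or `daily_limit=15000` for Yahoo.  This only helps for server-side lookups; it does nothing for users sharing an IP address who query the service directly from their browsers.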

So, it looks like freely available postcode data is the way to go for small or non-commercial websites, or websites on a budget! Besides, rolling your own is much more fun…

A year ago, there didn’t seem to be much useful data around (or at least, I couldn’t find it), but there are now several useful sources of data:

  • geonames – this site (as well as providing some web services) provides downloads of freely usable (Creative Commons Attribution 3.0 License) lists of location names, together with the all-important latitude-longitude data for each.  This is immensely useful data!  It also provides a list of 27,000 or so “outcodes” – i.e., the first half of postcodes.  This allows the rough location of each postcode to be established, to “town” accuracy, which may be good enough for many applications.
  • npemap – this site uses scanned copies of out-of-copyright OS maps (the “New Popular Edition”, to be precise), and the freely available outcodes, to allow users to enter their postcode and point to it on an (old) map.  This has been used to collect nearly 40,000 postcodes (though not all of them are full postcodes), with moderate accuracy.
  • Matthew Somerville’s postbox site – this started with a Freedom of Information request to get the list of all postboxes in the UK, and their associated postcodes.  Unfortunately, the list didn’t have latitude-longitude coordinates for each postbox, so the site allows users to mark on a map the location of postboxes that they know about, and associate them with entries on the list.  This results in a list of postcodes associated with coordinates, which can be added to the list from npemap to get even more freely available postcodes.
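The way these sources fit together can be sketched roughly as follows: merge the lists of full postcodes, and fall back to the outcode (for “town” accuracy) when a full postcode isn’t known.  The function names and the dictionary layout are assumptions for illustration, not the formats the sites actually publish.  The one fact the code relies on is that the second half of a full UK postcode is always three characters.

```python
def normalise(postcode):
    """Upper-case a postcode and drop all whitespace."""
    return "".join(postcode.upper().split())


def outcode_of(full_postcode):
    """Return the outcode (first half) of a FULL postcode.

    The second half of a UK postcode is always three characters,
    so everything before it is the outcode.
    """
    return normalise(full_postcode)[:-3]


def merge_sources(*sources):
    """Merge dicts of postcode -> (lat, lon); earlier sources win on clashes."""
    merged = {}
    for source in sources:
        for postcode, coords in source.items():
            merged.setdefault(normalise(postcode), coords)
    return merged


def locate(postcode, full_postcodes, outcodes):
    """Return ((lat, lon), accuracy) or None if nothing matches.

    accuracy is "postcode" for an exact match, "outcode" for the fallback.
    """
    pc = normalise(postcode)
    if pc in full_postcodes:
        return full_postcodes[pc], "postcode"
    # Without the space, full postcodes are 5 to 7 characters long;
    # anything shorter is treated here as a bare outcode.
    out = outcode_of(pc) if len(pc) >= 5 else pc
    if out in outcodes:
        return outcodes[out], "outcode"
    return None
```

Here `full_postcodes` would be something like `merge_sources(npemap_data, postbox_data)`, and `outcodes` the list from geonames, each loaded into a dictionary beforehand.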

So far, I’ve built a quick toy index of the geonames name data, using Xappy, which performs remarkably well.  I plan to add a little further parsing, so that I can handle postcodes and partial postcodes as well as possible.

One problem is that, given a partially entered postcode, it’s not always possible to tell what the outcode is.  For example, if a user enters “CB22”, it’s not possible to tell if that’s the outcode “CB22”, or the outcode “CB2” followed by the first digit of the second half of the postcode.
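One way to cope with this is to return every plausible reading rather than guessing.  The sketch below is an assumption about how that might look, not code from my index: given a set of known outcodes, it tries each split of the input and keeps those where the left part is a known outcode and the right part could be the start of a second half (a digit followed by up to two letters).

```python
def _is_inward_prefix(text):
    """True if text could begin the second half of a postcode.

    The second half is a digit followed by two letters, so a valid
    prefix is empty, a digit, or a digit plus one or two letters.
    """
    if len(text) > 3:
        return False
    if text == "":
        return True
    return text[0].isdigit() and all(c.isalpha() for c in text[1:])


def interpretations(partial, known_outcodes):
    """List the (outcode, rest) readings of a partially entered postcode.

    Longer outcodes are listed first.  Whitespace is ignored, so a space
    typed by the user is not used as a hint in this sketch.
    """
    text = "".join(partial.upper().split())
    readings = []
    for split in range(len(text), 0, -1):
        outcode, rest = text[:split], text[split:]
        if outcode in known_outcodes and _is_inward_prefix(rest):
            readings.append((outcode, rest))
    return readings
```

With `known_outcodes = {"CB1", "CB2", "CB22"}`, `interpretations("CB22", known_outcodes)` gives `[("CB22", ""), ("CB2", "2")]`, so a search could offer both locations, whereas `interpretations("CB2 1AB", known_outcodes)` gives the single reading `[("CB2", "1AB")]`.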

Written by richardboulton

December 12, 2008 at 4:47 pm

Posted in Geospatial
