A Tale of Two Cities

How else could one title this post? Here’s the story: Birmingham City Council in the UK recently sent out 720,000 leaflets advertising their services; the picture on the leaflet, however, depicts a different city with the same name: Birmingham, Alabama (US). Classic local authority snafu.

It’s clear that the individual charged with illustrating the pamphlet searched for ‘Birmingham’, found what looked like a nice skyline, and failed to fact-check. It is also possible that this individual never realized that there are multiple towns called ‘Birmingham’ (eighteen, actually). Whatever the cause, the story highlights some fundamental-but-oft-overlooked challenges in geoparsing that we embrace at Yahoo! Geo Technologies.

Geoparsing is, of course, the process of identifying places referenced in free or unstructured text, and is the essential ingredient of any system where we want to geolocate content by machine analysis. The two steps of successful geoparsing are (1) token identification, and (2) geographic disambiguation. Let’s take a look at each briefly:

The first step in geoparsing is token identification: identifying place-names, such as ‘Wayne’ or ‘The Bay Area’, in unstructured content like newspaper articles or web pages, while ensuring at the same time that one does not falsely identify terms like ‘New England Clam Chowder’ as a place (a post on our fun with these potential false-positives will follow).
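To make the idea concrete, here is a minimal sketch of token identification in Python. It is not how Yahoo! does it: the tiny gazetteer, the list of place-sounding phrases, and the function name find_place_tokens are all hypothetical, chosen only to illustrate matching place-names while rejecting false positives such as ‘New England Clam Chowder’.

```python
import re

# Hypothetical toy gazetteer and false-positive list, for illustration only.
PLACE_NAMES = ["Wayne", "The Bay Area", "New England", "Yorkshire", "Birmingham", "Rome"]
NON_PLACE_PHRASES = ["New England Clam Chowder", "Yorkshire Pudding"]


def _spans(phrase, text):
    """Yield (start, end) for each case-insensitive whole-word match of phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        yield match.start(), match.end()


def find_place_tokens(text):
    """Return sorted (start, end, matched_text) tuples for place-name tokens.

    A candidate is dropped when it lies entirely inside a known
    place-sounding phrase that is not actually a place.
    """
    blocked = [span for phrase in NON_PLACE_PHRASES for span in _spans(phrase, text)]
    hits = []
    for name in PLACE_NAMES:
        for start, end in _spans(name, text):
            inside_blocked = any(b0 <= start and end <= b1 for b0, b1 in blocked)
            if not inside_blocked:
                hits.append((start, end, text[start:end]))
    return sorted(hits)


if __name__ == "__main__":
    sample = "We ate New England Clam Chowder in Wayne, then drove to the Bay Area."
    for start, end, token in find_place_tokens(sample):
        print(start, end, token)
```

A real system would need a far larger gazetteer and something smarter than a fixed exclusion list, but the shape of the problem is the same: find candidate names, then throw out the ones that only sound like places.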

But token identification is the easy half of the battle. Many entity-recognition applications, like the otherwise excellent OpenCalais, are not capable of geotagging the above BBC article on ‘Birmingham’, for example: OpenCalais correctly identifies seven ‘Birminghams’, but does not tell us whether those referred to within are the UK city, one of its seventeen US namesakes, or a mix of both. (You can try this yourself with any text using the Calais Viewer.) Human cognition can certainly determine this with a quick read-through, but we’re looking at machine parsing specifically here.

To do this properly, we first require the means to refer to a place in a permanent, unambiguous, and machine-friendly manner. Usually this is attempted by expanding the geographic context, so that the token ‘Wayne’, when found in text, is indexed as ‘Wayne, PA, USA’; this works sometimes but is hardly machine-friendly. (Furthermore, there are ten towns called ‘Wayne’ in Pennsylvania, so the above string gets us no closer to our goal.) In truth, string-based indexing will always have its exceptions, so we have opened GeoPlanet, our gazetteer of places and their unique Where-on-Earth Identifiers (WOEIDs), to provide the vocabulary to describe the world’s places without ambiguity.

So, now that we’ve found the correct tokens (‘Wayne’) in our hypothetical text, and dismissed misleading, place-sounding terms (‘Yorkshire Pudding’), we must determine which place, of all the places with that name, is specifically being referenced. This is geographic disambiguation (or geodisambiguation for the portmanteau-inclined). Let’s take for example ‘Rome’, of which there are over thirty: there is of course ‘the’ Rome, in Italy (WOEID: 721943), and for many of us, this is the only Rome we know. However, residents of Rome, Georgia (WOEID: 2484261) would argue otherwise. This highlights the problem: how can we be certain which place is being referenced when we have only ‘Rome’ in the text?

Obviously the language helps in some instances, as does context (is ‘Georgia’ or ‘Italy’ mentioned elsewhere in the document?). But when geodisambiguating at Yahoo! (and this is the fun bit), we take into account the location of the user (or publisher) to capture the ‘locality’ of the term, and really put geography in the first-person. For example, although ‘Rome’ by itself will usually refer to ‘Rome, Italy’, the probability of its referring to ‘Rome, Georgia’ increases as we move geographically towards the latter.

This approach ensures that Yahoo! returns the ‘correct’ city when a search for ‘Birmingham’ is performed in the UK, compared to the same search in the US. It likewise ensures that content originating from Rome, Georgia will be geoparsed and disambiguated to the local ‘Rome’.
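For the curious, the locality idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the model Yahoo! actually runs: the two WOEIDs come from this post, but the coordinates are approximate, and the prior weights, the locality_boost and scale_km parameters, and the scoring formula itself are invented for the example. Each candidate starts with a baseline prominence and gains a bonus that decays with its distance from the user.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    woeid: int
    lat: float
    lon: float
    prior: float  # hypothetical baseline prominence, higher means better known


# WOEIDs are taken from the post; coordinates are approximate; priors are invented.
ROMES = [
    Candidate("Rome, Italy", 721943, 41.90, 12.50, 0.9),
    Candidate("Rome, Georgia", 2484261, 34.26, -85.16, 0.1),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def disambiguate(candidates, user_lat=None, user_lon=None,
                 locality_boost=2.0, scale_km=200.0):
    """Pick the candidate with the highest score.

    score = prior + locality_boost * exp(-distance_km / scale_km)
    With no user location, only the prior is used.
    """
    def score(candidate):
        if user_lat is None or user_lon is None:
            return candidate.prior
        distance = haversine_km(user_lat, user_lon, candidate.lat, candidate.lon)
        return candidate.prior + locality_boost * math.exp(-distance / scale_km)

    return max(candidates, key=score)


if __name__ == "__main__":
    print(disambiguate(ROMES).name)                   # no location given
    print(disambiguate(ROMES, 51.50, -0.13).name)     # user near London
    print(disambiguate(ROMES, 33.75, -84.39).name)    # user near Atlanta
```

With these made-up weights, a reader with no known location, or one in London, gets Rome, Italy, while a reader near Atlanta gets Rome, Georgia. The additive bonus is a deliberate choice in this sketch: it lets a well-known place win by default and only lets a lesser-known namesake take over when the user is genuinely close to it.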

Acknowledging that geography is in the eye of the beholder is just one way that Yahoo! Geo Technologies provides our users with the most personally georelevant results. Shame Birmingham Council did not come to us first.

Tyler Bell, Advanced Products Manager, Yahoo! Geo Technologies

4 Responses to “A Tale of Two Cities”

  1. GEO Jr. Says:

    How do you make sure every new Rome gets a WOEID ?

  2. tyler Says:

    @Geo Jr — I think what you’re asking is: how do we keep WhereHaus current? We have a number of automated and editorial processes here at Yahoo! that ensure that our placenames are topical. A subject for another post.

  3. GEO Jr. Says:

    I will look forward to that post.. Thanks for the reply.

  4. tyler Says:

    These things are not uncommon:

    “Tourist books for Australia, ends up in Canada”:
    http://www.metro.co.uk/weird/article.html?Tourist_books_for_Australia,_ends_up_in_Canada&in_article_id=326259&in_page_id=2
