Mapping Words: Lessons Learned From a Decade of Exploring the Geography of Text | The Signal

It is hard to imagine our world today without maps. Though Google Maps was not the first online mapping platform, its debut a decade ago profoundly reshaped the role of maps in everyday life, popularizing the concept of organizing information in space. When Flickr unveiled image geotagging in 2006, more than 1.2 million photos were geotagged in the first 24 hours. In August 2009, with the launch of geotagged tweets, Twitter announced that organizing posts according to location would usher in a new era of spatial serendipity, allowing users to “switch from reading the tweets of accounts you follow to reading tweets from anyone in your neighborhood or city–whether you follow them or not.”

As more and more of the world’s citizen-generated information becomes natively geotagged, we increasingly think of information as being created in space and referring to space, using geography to map conversation, target information, and even understand global communicative patterns. Yet, despite the immense power of geotagging, the vast majority of the world’s information does not have native geographic metadata, especially the vast historical archives of text held by libraries. It is not that libraries lack spatial information; it is that their rich descriptions of location are expressed in words rather than as precise, mappable latitude/longitude coordinates. A geotagged tweet can be placed directly on a map, while a textual mention of “a park in Champaign, USA” in a digitized nineteenth-century book requires highly specialized “fulltext geocoding” algorithms to identify the mention, disambiguate it (determine whether it refers to Champaign, Illinois or Champaign, Ohio, and which park is meant), and convert the textual description of location into mappable geographic coordinates.
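
To make the three steps concrete (identify, disambiguate, convert), here is a deliberately tiny sketch in Python. It is not the algorithm used in the work described here: the two-entry gazetteer, the function names, and the rule of disambiguating by looking for a state name elsewhere in the text are all assumptions made for illustration, and the coordinates are rounded approximations.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Place(NamedTuple):
    name: str
    region: str
    lat: float
    lon: float


# Hypothetical toy gazetteer; coordinates are approximate, for illustration only.
GAZETTEER = {
    "champaign": [
        Place("Champaign", "Illinois", 40.12, -88.24),
        Place("Champaign", "Ohio", 40.14, -83.77),
    ],
}


def find_mentions(text: str) -> List[str]:
    """Step 1 (identify): return words that match a gazetteer entry."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() in GAZETTEER]


def disambiguate(name: str, text: str) -> Optional[Place]:
    """Step 2 (disambiguate): choose the candidate whose region is named in the text.

    Returns None when the surrounding text does not settle the question.
    This is a naive substring check, so it can be fooled by longer words
    that happen to contain a region name.
    """
    candidates = GAZETTEER[name.lower()]
    lowered = text.lower()
    matches = [p for p in candidates if p.region.lower() in lowered]
    if len(matches) == 1:
        return matches[0]
    if len(candidates) == 1:
        return candidates[0]
    return None


def geocode(text: str) -> List[Tuple[str, str, float, float]]:
    """Step 3 (convert): emit (name, region, lat, lon) for each resolved mention."""
    results = []
    for name in find_mentions(text):
        place = disambiguate(name, text)
        if place is not None:
            results.append((place.name, place.region, place.lat, place.lon))
    return results


if __name__ == "__main__":
    print(geocode("He moved to Champaign, Illinois in 1871."))
    print(geocode("a park in Champaign, USA"))
```

With this sketch, the first sentence resolves to the Illinois entry, while “a park in Champaign, USA” returns nothing because the text alone does not say which Champaign is meant. That unresolved case is exactly where a real system needs far richer context than a lookup table can offer.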

Building robust algorithms capable of recognizing mentions of an obscure hilltop or a small rural village anywhere on Earth requires a mixture of state-of-the-art software and artful handling of the enormous complexities and nuances of how humans express space in writing. The task is made even more difficult by the assumptions of shared locality that content like news media makes, the mixture of textual and visual locative cues in television, and the transcription errors inherent in sources like OCR and closed captioning.
http://blogs.loc.gov/digitalpreservation/2015/04/mapping-words-lessons-l...