Chapter 2. Geospatial Data
The most challenging aspect of geospatial analysis is the data. Geospatial data already comes in dozens of file formats and database structures, and it continues to evolve to include new types of data and standards. Additionally, almost any file format can technically contain geospatial information simply by adding a location; an ordinary spreadsheet, for instance, becomes geospatial data as soon as it gains latitude and longitude columns (a short example follows the list below). As a geospatial analyst, you will frequently encounter the following general data types:
- Spreadsheets and comma- or tab-delimited files (CSV files)
- Geo-tagged photos
- Lightweight binary points, lines, and polygons
- Multigigabyte satellite or aerial images
- Elevation data such as grids, point clouds, or integer-based images
- XML files
- JSON files
- Databases (both servers and file databases)
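For example, a plain delimited text file becomes geospatial data as soon as it includes coordinate columns. The following minimal sketch assumes a hypothetical cities.csv file with NAME, LAT, and LON columns and reads point locations using only Python's built-in csv module:

```python
import csv

# Hypothetical cities.csv with NAME, LAT, and LON columns;
# any spreadsheet export with coordinate columns works the same way.
with open("cities.csv", newline="") as f:
    reader = csv.DictReader(f)
    points = [(row["NAME"], float(row["LAT"]), float(row["LON"]))
              for row in reader]

for name, lat, lon in points:
    print(f"{name}: {lat}, {lon}")
```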
Each format presents its own challenges for access and processing. Before you can analyze data, you usually have to do some form of preprocessing. You might clip a satellite image of a large area down to just your area of interest, or you might reduce the number of points in a collection to only those meeting certain criteria in your data model. A good example of this type of preprocessing is the SimpleGIS example at the end of Chapter 1, Learning Geospatial Analysis with Python. The state dataset included just the state of Colorado rather than all 50 states, and the city dataset included only three sample cities, demonstrating three levels of population along with different relative locations.
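A minimal sketch of that second kind of preprocessing, filtering a point collection by an attribute, might look like the following. The city records and population threshold are made up for illustration, loosely in the spirit of the SimpleGIS data from Chapter 1:

```python
# Illustrative city records: (name, population, longitude, latitude).
cities = [
    ("DENVER", 716000, -104.98, 39.74),
    ("BOULDER", 105000, -105.27, 40.02),
    ("DURANGO", 19000, -107.88, 37.28),
]

# Preprocessing step: keep only the points that meet a data-model
# criterion, in this case a minimum population.
MIN_POP = 50000
selected = [city for city in cities if city[1] >= MIN_POP]

for name, pop, lon, lat in selected:
    print(name, pop)
```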
The common geospatial operations covered in Chapter 1, Learning Geospatial Analysis with Python, are the building blocks for this type of preprocessing. However, it is important to note that there has been a gradual shift in the field of geospatial analysis. Until around 2004, geospatial data was difficult to acquire and desktop computing power was far less than it is today, so preprocessing data was an absolute first step in any geospatial project. However, in 2005, Google released Google Maps, followed not long after by Google Earth. Microsoft had also been developing its TerraServer aerial imagery site, which it relaunched around that time. In 2004, the Open Geospatial Consortium updated its Web Map Service (WMS) specification to version 1.3.0, and that same year Esri released version 9 of its ArcGIS server system. These competing technologies were soon shaped by Google's web map tiling model. People used map servers on the Internet before Google Maps, most famously the MapQuest driving-directions website, but those servers offered only small amounts of data at a time and usually over limited areas. Google's tiling system pre-rendered global maps into pyramids of small image tiles at multiple zoom levels, for both satellite imagery and street map data, and served them dynamically using JavaScript and the browser-based XMLHttpRequest API. Google's system scaled to millions of users with ordinary web browsers.

More importantly, it allowed programmers to use the Google Maps JavaScript API to create mash-ups that added their own data to Google's maps. The mash-up concept is essentially a system of distributed geospatial layers: users can combine and recombine data from different sources onto a single map as long as the data is web accessible. Other commercial and open source systems quickly mimicked the idea of distributed layers. A notable example is OpenLayers, which provides an open source, Google-like API that has since grown beyond Google's offering to include additional features. Complementary to OpenLayers is OpenStreetMap, the open source answer to the tiled-map services consumed by systems such as OpenLayers. OpenStreetMap provides global, street-level vector data and other spatial features collected from available government data sources and the contributions of thousands of editors worldwide. Its data maintenance model is similar to the way the Wikipedia online encyclopedia crowdsources the creation and updating of articles.
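To make the tiling model concrete, the following sketch computes which tile contains a given longitude and latitude at a given zoom level, using the standard spherical Mercator "slippy map" tile scheme popularized by Google and adopted by OpenStreetMap. It is a generic illustration of the math, not code from any particular mapping API:

```python
import math

def deg2tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) indices of the web map tile containing a point.

    The world is divided into a 2**zoom by 2**zoom grid of tiles in the
    spherical Mercator projection, with tile (0, 0) at the top left.
    """
    n = 2 ** zoom
    lat_rad = math.radians(lat_deg)
    xtile = int((lon_deg + 180.0) / 360.0 * n)
    ytile = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return xtile, ytile

# Tile containing downtown Denver at zoom level 12
print(deg2tile(39.74, -104.98, 12))
```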
The mash-up revolution had interesting and beneficial side effects on data. Geospatial data has traditionally been difficult to obtain. The cost of collecting, processing, and distributing it kept geospatial analysis in the hands of those who could afford the steep overhead of producing or purchasing data; for decades, geospatial analysis was the tool of governments, very large organizations, and universities. Once the web mapping trend shifted to large-scale, globally tiled maps, organizations began providing base map layers essentially for free in order to draw developers to their platforms. A massively scalable global map system required massively scalable, high-resolution data to be useful, and geospatial software producers and data providers kept up with the technology trend in order to maintain their market share.
Geospatial analysts benefited greatly from this market shift in several ways. First of all, data providers began distributing data in a common projection: Mercator. The Mercator projection is a nautical navigation projection introduced over 400 years ago. As mentioned in Chapter 1, Learning Geospatial Analysis with Python, every projection offers practical benefits as well as distortions. The distortion in the Mercator projection is size: in a global view, Greenland appears bigger than the continent of South America. But, like every projection, it also has a benefit: Mercator preserves angles. Predictable angles allowed early navigators to draw straight bearing lines (rhumb lines) when plotting a course across oceans. Google Maps didn't launch with Mercator; however, it quickly became clear that roads at higher latitudes met at odd angles on the map instead of the right angles they form in reality. Because the primary purpose of Google Maps was street-level driving directions, Google sacrificed accuracy in the global view for far better relative accuracy among streets when viewing a single city. Competing mapping systems followed suit. Google also standardized on the WGS 84 datum, which defines a reference ellipsoid, a mathematical model of the Earth's shape, along with an associated geoid that approximates mean sea level. What is significant about this choice is that the Global Positioning System (GPS) also uses this datum, so most GPS units default to it as well, making Google Maps easily compatible with raw GPS data. It should be noted that Google tweaked the standard Mercator projection slightly for its own use; however, the variation is almost imperceptible.
The Google variation of the Mercator projection is often called Google Mercator. The European Petroleum Survey Group (EPSG) assigns short numeric codes to projections as an easy way to reference them. Rather than wait for the EPSG to approve or assign a code for a projection that was at first only relevant to Google, the developer community began calling it EPSG:900913, which is "Google" spelled with numbers. The EPSG has since assigned the official code EPSG:3857, but the unofficial 900913 code still appears in many tools.
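As a sketch of what the projection actually does, the following function converts WGS 84 longitude and latitude to spherical (Google) Mercator x, y coordinates in meters using the standard spherical Mercator formulas; no mapping library is required:

```python
import math

# WGS 84 semi-major axis, used as the sphere radius in Google Mercator
EARTH_RADIUS = 6378137.0

def lonlat_to_mercator(lon_deg, lat_deg):
    """Convert WGS 84 longitude/latitude in degrees to spherical
    Mercator (EPSG:900913 / EPSG:3857) x, y coordinates in meters."""
    x = EARTH_RADIUS * math.radians(lon_deg)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4.0 +
                                         math.radians(lat_deg) / 2.0))
    return x, y

# Denver, Colorado as an example point
print(lonlat_to_mercator(-104.98, 39.74))
```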
The following image, taken from Wikipedia (https://en.wikipedia.org/wiki/File:Tissot_mercator.png), shows the distortion caused by the Mercator projection using Tissot's indicatrix, which places circles of equal size at regular intervals on the globe. The way the circles grow dramatically toward the poles clearly shows how the projection affects size and distance.
Web mapping services have reduced the chore of hunting for data and eliminated much of the preprocessing analysts once had to do just to create base maps. But to create anything of value, you must still understand geospatial data and how to work with it. This chapter provides an overview of the common data types and issues you will encounter in geospatial analysis. Throughout this chapter, two terms are used repeatedly: vector data and raster data. These are the two primary categories into which most geospatial datasets can be grouped. Vector data is any format that represents geolocation data, at a minimum, using points, lines, or polygons. Raster data is any format that stores values in a grid of rows and columns, which includes all image formats.
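As a rough illustration of the difference, the following sketch represents data both ways using nothing but plain Python structures: a vector polygon as an ordered list of coordinate pairs, and a raster as a grid of rows and columns of cell values. Real formats add metadata such as the projection and georeferencing, which this toy example omits:

```python
# Vector representation: a polygon as an ordered list of (x, y)
# vertices; points and lines are just shorter variations of the idea.
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]

# Raster representation: a grid of rows and columns of cell values,
# for example elevation samples or pixel brightness.
raster = [
    [10, 10, 12, 13],
    [10, 11, 13, 14],
    [11, 12, 14, 15],
]

print("Polygon vertices:", len(polygon))
print("Raster size:", len(raster), "rows x", len(raster[0]), "columns")
```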