Visualizing Geospatial Data in Python

Open source tools and techniques for visualizing data on custom maps

Paco Xander Nathan
Towards Data Science
8 min read · Jun 2, 2020


Throughout the global pandemic, many people have spent a lot of time viewing maps that visualize data, and important data at that. People who work in data science are likely seeing increased demand for working with geospatial data, especially for visualizations: understanding metrics about geographic regions, analyzing supply chains, making plans that account for local conditions and rules, and so on.

This article shows how to use two popular geospatial libraries in Python:

  • geopandas: extends Pandas to allow spatial operations on geometric types
  • geoplot: a high-level geospatial plotting library

The second library is especially helpful, since it builds on top of several other popular geospatial libraries to simplify the coding that's typically required. Those include cartopy, which in turn leverages Cython, NumPy, GEOS, Shapely, pyshp, PROJ, Six, and perhaps a few others such as mapclassify, depending on which features you need.

Note: all of this code is available in a Jupyter notebook at github.com/DerwenAI/ibm_dsc_articles/blob/master/2020_05/tutorial.ipynb

Installation

Installation should be quick. Just use the following three command lines with conda:
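A sketch of those three commands, assuming the conda-forge channel (which packages both libraries, plus imageio, used later for the animation):

```shell
conda install -c conda-forge geopandas
conda install -c conda-forge geoplot
conda install -c conda-forge imageio
```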

Note: if you run into problems with these installations, there are alternative approaches available. These additional notes describe how to build the dependencies on Linux:

github.com/DerwenAI/ibm_dsc_articles/blob/master/2020_05/INSTALL.md

Terminology

Part of the learning curve for working with geospatial data is that there’s lots of special terminology used. Here’s a handy cheat-sheet for terms that you’re likely to encounter:

  • shapefile: data file format used to represent items on a map
  • geometry: a vector (generally a column in a dataframe) used to represent points, polygons, and other geometric shapes or locations, usually represented as well-known text (WKT)
  • polygon: an area
  • point: a specific location
  • basemap: the background setting for a map, such as county borders in California
  • projection: a method for flattening the 3D spheroid Earth into a 2D map, using some coordinate reference system (CRS)
  • colormap: choice of a color palette for rendering data, selected with the cmap parameter
  • overplotting: stacking several different plots on top of one another
  • choropleth: using different hues to color polygons, as a way to represent data levels
  • kernel density estimation: a data smoothing technique (KDE) that creates contours of shading to represent data levels
  • cartogram: warping the relative area of polygons to represent data levels
  • quantiles: binning data values into a specified number of equal-sized groups
  • voronoi diagram: dividing an area into polygons such that each polygon contains exactly one generating point and every point in a given polygon is closer to its generating point than to any other; also called a Dirichlet tessellation

Okay, with those terms defined here for reference … let’s go!

Get Some Data

We need to get some data to use for these examples. While geoplot includes plenty of sample datasets in the geojson format, it helps to know how to load your own data.

First, let’s get a shapefile from the US Census Bureau TIGER database to visualize state boundaries, which we’ll place into a `maps` subdirectory:
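For example, the 1:5,000,000-scale state boundary file can be fetched like this; the exact file name and path under the Census Bureau's cartographic boundary directory are assumptions, so adjust them if the layout has changed:

```shell
mkdir -p maps
wget -P maps https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_5m.zip
unzip -o maps/cb_2018_us_state_5m.zip -d maps
```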

Of course, there are many open data sources for shapefiles like this, such as Natural Earth, OpenStreetMap, and other collections from the US Census Bureau's TIGER database.

Next let’s get some data to plot, in this case the 2018 population estimates for the US, which we’ll place into a `data` subdirectory:
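One way to fetch the estimates, assuming the Census Bureau's population-estimates datasets directory; the exact file name here is an assumption, so browse that directory if the link has moved:

```shell
mkdir -p data
wget -P data https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/national/totals/nst-est2018-alldata.csv
```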

Start Plotting

To import the required packages in Python:

If you’re working in a Jupyter notebook be sure to run the following “magic” command to render plots properly:

%matplotlib inline

Then load a shapefile and view parts of it:

Notice the `geometry` column, which specifies the polygon shapes.

Now we’ll load the US Census data as a pandas dataframe and view a portion of it:

Next we merge the shapefile with population data, joining on the state names:
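A self-contained sketch of that join, using tiny stand-in frames; the column names here (`NAME` for the shapefile, `state` for the census table) are illustrative assumptions, not the exact headers:

```python
import pandas as pd

# Stand-ins for the shapefile attribute table and the census estimates
shapes = pd.DataFrame({"NAME": ["California", "Texas"],
                       "geometry": ["POLYGON(...)", "POLYGON(...)"]})
pop = pd.DataFrame({"state": ["California", "Texas"],
                    "population": [39_557_045, 28_701_845]})

# An inner join on the state name links each polygon to its population
merged = shapes.merge(pop, left_on="NAME", right_on="state")
print(merged[["NAME", "population"]])
```

With a real GeoDataFrame on the left-hand side, the result keeps its `geometry` column and stays plottable.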

Great, now that data is ready to plot a shape. We’ll specify California by name:

Alternatively, we can create a GeoDataFrame (a dataframe with geospatial data) by loading one of the sample datasets from geoplot, in this case the polygons for state boundaries:

Then plot a map of the US states:

Let’s load another sample dataset, in this case for US cities:

Then plot the locations of each city in the continental US as points:

Composing these two, we’ll use overplotting to show the cities and states in the continental US. Note how the `ax` variable for the state polygons provides an axis onto which we plot the cities:

That looks a bit stretched, so let’s adjust the projection to use an Albers equal-area conic projection:

Okay, that’s better! Again, since the Earth is a 3D globe, a projection is a method for flattening an area into a 2D map, using some coordinate reference system (CRS). The geoplot library makes it easy to use any number of projections; the Albers equal-area projection is a choice in line with the libraries' documentation. You could also play with projections you may remember from grade school, like the Mercator (`gcrs.Mercator()`), or modern variations of it like the Web Mercator (`gcrs.WebMercator()`). Play around with them and let us know which ones you prefer and why!

Representing Data

Now let’s compare several different ways to visualize geospatial data. First, we’ll change the hue of a city’s plotted point based on that city’s elevation, and also add a legend for people to decode the meaning of the different hues. The parameter lists start to get long-ish, so we’ll specify parameters on different lines:

We can also use the scale of each plotted point to represent another dimension. In this case, the scale of the city points is based on their elevation:

With a choropleth we use different hues to shade polygons, to represent a dimension of data:

A data smoothing technique known as kernel density estimation (KDE) creates contours to represent a dimension of data. In this case, we’ll zoom in to view the traffic collisions in the NYC boroughs:

Let’s zoom out to try KDE on major population centers throughout the US:

This next section shows how to work with data associated with areas (polygons). We’ll load a dataset about obesity rates by US state:

Convert that into a GeoDataFrame using a join. Note how this adds the required `geometry` column:

Now we can use this data to plot a cartogram, which grows or shrinks polygons to represent a dimension of data — in this case, the obesity rates per state:

One good approach to simplifying data visualization is binning the data into quantiles. These are equal-sized groups, in this case 10 quantiles for elevation:
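The binning itself can be sketched with plain pandas; the synthetic elevations below stand in for the `ELEV_IN_FT` column (the real dataset has roughly 3,750 rows):

```python
import numpy as np
import pandas as pd

# Synthetic elevations standing in for the usa_cities ELEV_IN_FT column
rng = np.random.default_rng(42)
elevations = pd.Series(rng.uniform(0, 10_000, size=3750))

# Bin into 10 equal-sized groups (deciles)
deciles = pd.qcut(elevations, q=10)
print(deciles.value_counts().sort_index())
```

In geoplot itself, passing `scheme="quantiles"` (backed by mapclassify) to a plot function performs this binning for you.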

Here we’ve divided the elevations into 10 quantiles with approximately 375 values each. Now let’s assign a different hue to each quantile, plus a legend to explain them:

Note how the colormap was changed to the “inferno_r” setting.

Next, let’s add a filter for typical warnings that can be ignored:

The next example uses a voronoi diagram, to calculate polygon areas based on a dimension of the data. Each polygon is centered on a generating point, such that every location in the polygon is closer to its generating point than to any other. This is helpful when you want to examine a column of data, to see if it may have any geospatial correlations.

In the following example, we’ll plot the locations of primary schools in Melbourne, Australia, and use a voronoi diagram to show where they are concentrated:

Let’s construct a voronoi diagram for the elevations of US cities. This is a data smoothing technique since the elevations are for points, but we’ll “smooth” those values across areas:

Visualizing COVID-19 Data

Next, let’s download some of the COVID-19 data from the University of Washington’s IHME center. Change the name of the UNZIP’ed directory (download date) as needed:
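The download might look like the following; the "latest" bundle URL is an assumption based on how IHME published these archives, so check their site if it has moved:

```shell
wget https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip
unzip ihme-covid19.zip
```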

Then load the dataset:

We’ll filter rows, limiting this visualization to Earth Day (April 22) 2020:
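A self-contained sketch of the filter with stand-in rows; the column names (`date`, `location_name`, and IHME's cumulative-death column `totdea_mean`) are assumptions about the CSV layout:

```python
import pandas as pd

# Stand-in for the IHME projection table
df = pd.DataFrame({
    "location_name": ["California", "California", "New York"],
    "date": ["2020-04-21", "2020-04-22", "2020-04-22"],
    "totdea_mean": [1200.0, 1300.0, 15000.0],
})

# Keep only the rows for Earth Day 2020
earth_day = df[df["date"] == "2020-04-22"]
print(earth_day)
```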

Now merge on state name with the continental US dataset from earlier:

Add a calculated column for “deaths per million”:
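A sketch of the calculation with stand-in numbers; in the tutorial the frame comes from the merge above, and the column names here are illustrative assumptions:

```python
import pandas as pd

merged = pd.DataFrame({
    "state": ["California", "New York"],
    "population": [39_557_045, 19_542_209],
    "totdea_mean": [1300.0, 15000.0],
})

# Normalize the death counts by state population, in millions
merged["deaths_per_million"] = merged["totdea_mean"] / (merged["population"] / 1_000_000)
print(merged[["state", "deaths_per_million"]].round(1))
```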

Then to visualize this data, let’s plot a choropleth of “deaths per million” per state, and overplot with population per city:

Note how the choropleth is overplotted with a point plot by reusing the same `ax` variable. The point plot specifies a `zorder` parameter (which “layer”) and an `alpha` parameter (how “translucent”) to make the overplotting more readable. The `figsize` parameter in the first plot modifies the overall size of the plot.

This visualization shows where population centers in the US were most at risk: the NYC area, Louisiana, Illinois, Michigan, Georgia, etc., which is where US news headlines about COVID-19 were focused on Earth Day 2020.

Animated Maps

Next, let’s prepare to animate this visualization. We’ll define a function to visualize the data for a given day:

Then define a range of dates:
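One way to sketch that with pandas; the actual start and end dates depend on the IHME download:

```python
import pandas as pd

# One entry per day across part of the IHME projection window
date_range = pd.date_range(start="2020-03-01", end="2020-06-01", freq="D")
print(len(date_range), date_range[0].date(), date_range[-1].date())
```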

Iterate through those dates, creating a visualization for each date, which gets saved to a file:
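A minimal stand-in for that loop: one figure saved per date under a `viz/` directory (the directory name is an assumption, and the figure body is a placeholder for the per-day map function):

```python
import os

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

os.makedirs("viz", exist_ok=True)
date_range = pd.date_range(start="2020-04-20", end="2020-04-22", freq="D")

for date in date_range:
    # In the tutorial, this is where the per-day map gets drawn
    fig, ax = plt.subplots()
    ax.set_title(f"COVID-19 deaths per million, {date.date()}")
    fig.savefig(f"viz/{date.date()}.png")
    plt.close(fig)

print(sorted(os.listdir("viz")))
```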

Use `imageio` to stitch these individual plots together into an animated GIF, at a rate of 2 frames/second:
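A sketch of the stitching step; synthetic frames stand in for the saved PNGs, which in the tutorial would be read back with `imageio.imread()` in date order:

```python
import numpy as np
import imageio

# Stand-in frames: six small solid-color images
frames = [np.full((64, 64, 3), 40 * i, dtype=np.uint8) for i in range(6)]

# duration is seconds per frame in the imageio v2 API, so 0.5 ≈ 2 frames/second
imageio.mimsave("movie.gif", frames, duration=0.5)
```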

Here’s a video of the resulting animation:

Note: there is also the matplotlib.animation API, which we could have used instead of imageio. The former has had some bugs in recent releases, while the latter is more general-purpose.

Again, all of this code is available in a Jupyter notebook at github.com/DerwenAI/ibm_dsc_articles/blob/master/2020_05/tutorial.ipynb

If you’d like to experiment with extending this visualization example, lots of related datasets are available at https://covidtracking.com/data

Co-author: William Roberts

Originally published on the IBM Data Science Community blog at https://ibm.biz/paco-geospatial
