87.4 GeoPandas and Shapely: Geospatial Data
Right, let’s talk about making maps that don’t lie. You’ve probably got some data with a latitude and longitude column. You could scatter-plot it on a map and call it a day, but that’s like using a sledgehammer to crack a nut—messy, imprecise, and you’ll annoy the neighbors (in this case, cartographers). For real geospatial work, where where actually matters, you need to understand the geometry of your data. That’s where the dynamic duo, GeoPandas and Shapely, comes in.
Think of it this way: Shapely is the brilliant but slightly obsessive engineer who only cares about the pure, mathematical shapes—points, lines, polygons. GeoPandas is the pragmatic project manager who takes those shapes, slaps them into a Pandas DataFrame with a proper coordinate system, and makes sure everyone (i.e., your other Python code) can work with them. It’s a beautiful partnership.
Your First Foray: It’s Just a DataFrame, But Spicy
First, the basics. A GeoDataFrame is just a Pandas DataFrame with a special geometry column where all the magic lives. This column holds Shapely objects. Let’s create a couple of points. Imagine these are the most important coffee shops in your city (because, let’s be honest, they are).
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, Polygon
# Create a list of Shapely Point objects.
# Remember: Point(longitude, latitude). It's (x, y), not the other way around.
# Getting this backwards is the most common rookie mistake, and it will silently ruin your day.
geometry = [Point(-73.9857, 40.7484), # Empire State Building
Point(-74.0445, 40.6892)] # Statue of Liberty
# Some associated data, because what's a point without data?
names = ['Empire State', 'Lady Liberty']
# Stick it all in a DataFrame. The key is the 'geometry' keyword.
gdf = gpd.GeoDataFrame({'name': names}, geometry=geometry)
print(gdf)
print(f"\nAnd the type of that geometry column? {type(gdf.geometry.iloc[0])}")
See? It looks like a normal DataFrame. But that geometry column is full of Shapely objects, which means we can do spatial operations on them. Which brings us to…
The Real Power: Spatial Relationships
Plotting is nice, but the real reason you’re here is to ask questions of your data. “Which coffee shops are within 500 meters of my office?” “Does this proposed pipeline cross through any protected wetlands?” Shapely’s syntax for this is brilliantly intuitive.
from shapely.geometry import Polygon
# Let's define a *really* important area: my apartment.
my_apartment_bounds = Polygon([(-74.005, 40.715),
(-74.005, 40.725),
(-73.995, 40.725),
(-73.995, 40.715)])
# Now, let's see which of our landmarks are within my apartment.
# (Spoiler: none, my apartment isn't that big.)
is_within = gdf.within(my_apartment_bounds)
print(f"Is it within my apartment?\n{is_within}")
# But wait, that's not terribly useful. Let's buffer the points to create a area of influence.
# Let's say each landmark has a 1km radius of tourism influence.
buffered_landmarks = gdf.geometry.buffer(0.01) # ~1km approx in degrees (this is a hack for example purposes!)
# Now, do these buffered areas intersect my apartment?
# This is a more interesting question: "Is my apartment within a tourist zone?"
intersects = buffered_landmarks.intersects(my_apartment_bounds)
print(f"Does its influence intersect my apartment?\n{intersects}")
This is the core of it. .within(), .contains(), .intersects(), .distance()—these methods become your new best friends. They let you move from “data on a map” to “answering spatial questions.”
The Crucial Detail Everyone Ignores (Until It Breaks)
You just used .buffer(0.01). I said “approx 1km.” That’s a filthy, disgusting lie I told for the sake of a simple example. And this is the single most important concept in geospatial analysis: Coordinate Reference Systems (CRS).
The Earth is round(ish), but your screen is flat. A CRS is a mathematical model that flattens it for you. The most common one you’ll see is WGS 84 (EPSG:4326), which uses latitude and longitude in degrees. The problem? Degrees are not meters! A degree of longitude at the equator is about 111km, but at the poles, it’s 0km. Buffering a point in degrees is nonsense for real-world distance calculations.
This is why you must reproject your data. You find a CRS that’s accurate for your part of the world and uses meters. For New York, that’s often EPSG:2263.
# First, set the original CRS. Our points were in lat/lon, so it's WGS84.
gdf.crs = 'EPSG:4326'
print(f"Original CRS: {gdf.crs}")
# Now, let's reproject to a CRS that uses feet (because 'Murica?).
gdf_ny = gdf.to_crs('EPSG:2263') # NAD83 / New York Long Island (ftUS)
print(f"Reprojected CRS: {gdf_ny.crs}")
# Now we can buffer with a real distance! 1000 feet.
gdf_ny['buffer_geom'] = gdf_ny.geometry.buffer(1000)
# And if we want to plot it on a leaflet map later, we can project it back.
gdf_back_to_wgs = gdf_ny.to_crs('EPSG:4326')
If you take nothing else from this, remember this: Always know what CRS your GeoDataFrame is in. An operation without a known CRS is a time bomb waiting to ruin your analysis with silent, plausible-looking, but completely wrong results.
Reading, Writing, and the Magic of Geocoding
You’re not always creating points by hand. Usually, you’re loading data. GeoPandas can read almost any geospatial format: Shapefiles, GeoJSON, you name it.
# Read a GeoJSON file of NYC neighborhoods
neighborhoods = gpd.read_file('https://raw.githubusercontent.com/nycebo/Open-Data-Repo/main/Neighborhoods/GeoJSON/Neighborhoods.geojson')
# Quick and dirty plot
neighborhoods.plot(figsize=(10, 10), edgecolor='black', cmap='viridis')
And if all you have is a list of addresses? You geocode. geopandas.tools.geocode uses Nominatim by default, which is free but… slow and has usage limits. For anything serious, use a paid service or set up your own.
# Let's find the sacred geometry of pizza places.
from geopandas.tools import geocode
addresses = ["Joe's Pizza, NYC", "Lombardi's Pizza, NYC"]
pizza_gdf = geocode(addresses, provider='nominatim', user_agent='my-geo-app') # You MUST set a user_agent
print(pizza_gdf)
The final word of advice? Spatial joins are your superpower. You can join your point data to your neighborhood polygon data to ask “which neighborhood is each point in?” in one line of code (gpd.sjoin). It feels like magic every time. Use it wisely. Now go make some maps that actually mean something.