Geospatial operations are relational operations¶
A working hypothesis, and a slightly radical one: the core operations of
geospatial and climate analysis — the ones we reach for an array library to
perform — are, underneath, relational operations. Climatologies, anomalies,
zonal means, spectral indices, forecast skill, even regridding: each maps onto
ordinary SQL — GROUP BY, JOIN, window functions, CASE, and the occasional
scalar UDF.
The array paradigm (NumPy, Xarray, Dask) is a wonderful interface for these
operations. But it is not the only one, and for a large and growing audience —
the people fluent in SQL rather than in apply_ufunc and rechunking — it is not
the most accessible one. xarray-sql lets you pose these
questions in SQL and answers them with a real query engine (DataFusion). The
datasets are opened lazily, so a query against the whole archive reads only the
variable and the slice it actually needs. And because a gridded result is still
gridded data, every query here round-trips its answer straight back to an
xarray.Dataset — SQL in, an array out, ready to plot or save.
This page makes the argument case by case. Every claim below is backed by a
runnable script in benchmarks/geospatial/ that
poses the operation in SQL and asserts the answer matches an xarray/array
reference to floating-point tolerance. The point is not that "SQL is faster";
the point is that the SQL reads like the definition of the operation and
computes the same numbers — at ERA5's real 0.25° global resolution.
Where this list comes from¶
The operations here aren't a set we hand-picked to suit SQL. They're taken from Large Scale Geospatial Benchmarks (coiled/benchmarks #1545), a discussion James Bourbeau opened in 2024 asking the geospatial and climate community a pointed question: what are the end-to-end workflows the Xarray/Dask ecosystem needs to handle smoothly at the 100-terabyte scale? The replies are a representative survey of what geoscience actually runs — and this suite works through nearly all of it:
| #1545 workflow | Covered by |
|---|---|
| Remote-sensing indices (NDVI/NDWI/NDSI over Sentinel-2 or Landsat) | case 01 |
Vectorized functions (apply_ufunc-style per-cell math) |
case 01 |
| Climatology (average weather for a time of year/day, per location) | case 02 |
| Transformed Eulerian Mean (circulation diagnostics — zonal means and anomalies) | cases 03, 04 |
| Forecast evaluation (scoring forecasts against ground truth) | case 05 |
| Regridding and reprojection (resolution and CRS changes) | cases 07, 08, 09 |
| Spatial joins (large polygon-to-polygon joins) | not covered — a vector-data problem; the closest analogue here is the raster × vector join in case 06 |
So the claim isn't that a few cherry-picked operations happen to be relational. It's that an independent survey of the operations geoscience runs at scale, run through SQL one by one, turns out to be — almost entirely — queries.
The mapping¶
| Operation | The "array" framing | The relational reality | Script |
|---|---|---|---|
| Spectral index (NDVI) | apply_ufunc over a raster |
column arithmetic | 01_ndvi.py |
| Climatology | rechunk → grouped reduction | GROUP BY lat, lon, hour-of-day |
02_climatology.py |
| Zonal mean | reduce over lon/time axes | GROUP BY lat |
03_zonal_mean.py |
| Anomaly | grouped broadcast-subtract | climatology CTE self-JOIN |
04_anomaly.py |
| Forecast skill (RMSE) | align valid/init/lead, reduce | forecast↔truth JOIN on valid_time |
05_forecast_skill.py |
| Zonal stats over regions | rasterize polygons + mask | raster × vector range JOIN |
06_zonal_vector.py |
| Reprojection | per-pixel CRS transform | scalar UDF (ST_Transform-style) |
07_reproject_udf.py |
| Regridding | interpolation to a new grid | sparse-weight table JOIN |
08_regrid_weights.py |
| Warp (reproject + resample) | CRS transform and interpolation | reproject UDF → weight-table JOIN |
09_warp.py |
1. A pixel-wise formula is a column expression¶
NDVI is (NIR − Red) / (NIR + Red), per pixel. The array idiom broadcasts a
ufunc over the raster. But "one output per pixel, computed from that pixel's
bands" is the definition of a SQL projection:
Invalid pixels need no special handling: xarray decodes the band's _FillValue
to NaN on open, and NaN propagates through the arithmetic on both sides, so
the masking is free.
01_ndvi.py runs this against a real
Sentinel-2 L2A scene in Zarr — discovered with pystac-client and opened the
canonical way with xr.open_datatree (ESA's EOPF sample service) — and matches
xarray's apply_ufunc-style result over a million pixels.
2. A climatology is a GROUP BY over the cycle¶
A climatology is the average value for each time-of-cycle at each location. In the array world this is the canonical painful workload — load native chunks, rechunk so all of time lands in one chunk, reduce, rechunk back. The rechunking serves the array layout, not the question. The question is:
SELECT latitude, longitude, date_part('hour', time) AS hour,
AVG("2m_temperature")
FROM era5 GROUP BY latitude, longitude, date_part('hour', time)
The grouping keys are the dimensions you keep; everything else is reduced. No
layout to reason about. 02_climatology.py
computes the diurnal cycle of ERA5 2m-temperature over a region — averaging
each cell by hour of day — and matches da.groupby("time.hour").mean() across
~500k cells.
A zonal mean (03_zonal_mean.py)
is the same idea with fewer keys: the axes you "reduce over" are simply the
columns you don't GROUP BY.
3. Broadcasting a normal back onto observations is a JOIN¶
An anomaly subtracts each cell's climatological normal from every matching
observation. Xarray expresses the realignment with grouped broadcasting
(ds.groupby("time.hour") - climatology). That realignment — attach each
cell's normal to every timestep that shares its key — is a JOIN on the
grouping key:
WITH clim AS (
SELECT latitude, longitude, date_part('hour', time) AS hour,
AVG("2m_temperature") AS clim_t
FROM era5 GROUP BY latitude, longitude, date_part('hour', time)
)
SELECT a.time, a.latitude, a.longitude,
a."2m_temperature" - c.clim_t AS anomaly
FROM era5 a JOIN clim c
ON a.latitude = c.latitude AND a.longitude = c.longitude
AND date_part('hour', a.time) = c.hour
04_anomaly.py computes the
climatology once (the CTE) and joins it back to every observation.
4. Forecast evaluation is a JOIN on valid time + aggregate¶
This is the real workload of WeatherBench 2:
scoring machine-learning weather models — Pangu-Weather and GraphCast —
against ERA5 ground truth. A forecast is indexed by initialization time and
lead time (prediction_timedelta); the truth is indexed by valid time.
Evaluation aligns them by valid_time = init + lead and reduces the error to
RMSE as a function of lead.
That alignment is a relational JOIN, and valid_time = init + lead is just
timestamp + duration arithmetic the engine does natively:
SELECT f.model, f.prediction_timedelta AS lead,
SQRT(AVG(POWER(f."2m_temperature" - e."2m_temperature", 2))) AS rmse
FROM forecasts f
JOIN era5 e
ON e.time = f.time + f.prediction_timedelta -- valid_time = init + lead
AND e.latitude = f.latitude
AND e.longitude = f.longitude
GROUP BY f.model, f.prediction_timedelta
Both models are stacked along a model dimension into one forecast table, so a
single query scores them together, grouped by the model column. The entire
evaluation — temporal alignment across three time axes, spatial matching, and the
score — is one JOIN and one aggregate.
05_forecast_skill.py runs it
for both models, matches an xarray reference, and reproduces the published result
that GraphCast edges out Pangu at every lead — the classic "error grows with
horizon" curve (≈0.3 K at 6 h rising to ≈2.5 K at 9 days):
The result round-trips to a pandas table directly (got.to_pandas()), RMSE in
kelvin by lead time:
model graphcast pangu
lead (days)
0.25 0.296 0.336
1.25 0.464 0.554
2.25 0.608 0.734
3.25 0.780 0.936
4.25 0.988 1.191
5.25 1.228 1.469
6.25 1.470 1.747
7.25 1.763 2.096
8.25 2.092 2.489
9.25 2.380 2.814
5. Raster × vector zonal statistics is a range JOIN¶
"Average the raster inside each region" is the canonical raster-meets-vector task. The array idiom rasterizes each polygon to a mask and reduces under it. But a region is a row in a table of bounds, and "pixel inside region" is a range predicate — so zonal statistics is a JOIN:
SELECT r.region, AVG(a."2m_temperature") - 273.15 AS avg_c
FROM era5.surface a JOIN regions r
ON a.latitude BETWEEN r.lat_min AND r.lat_max
AND a.longitude BETWEEN r.lon_min AND r.lon_max
WHERE a.time BETWEEN TIMESTAMP '2020-06-01' AND TIMESTAMP '2020-06-01 23:00:00'
GROUP BY r.region
This is the README's promise — joining tabular data with raster data — made
literal: the raster is the full ERA5 archive (the WHERE prunes it to a day),
the regions are a second SQL table, and the spatial relationship is an ordinary
BETWEEN. See 06_zonal_vector.py
— it reports e.g. Sahara 33 °C vs Greenland −8 °C for a June day. (Rectangular
regions keep this simple; arbitrary polygons would follow the same shape, with a
point-in-polygon test in the join.)
6. The hard cases: where a UDF fits, and where it doesn't¶
Reprojection and regridding are the operations most wedded to the array paradigm. They split cleanly along one line: is the operation row-independent?
Reprojection is. Moving a coordinate from one CRS to another depends only on
that coordinate, so it is a scalar function — exactly what PostGIS and
DuckDB-spatial already ship as ST_Transform. We register a PROJ-backed scalar
UDF (mirroring the cftime() UDF already in xarray_sql/cftime.py) and
reproject in SQL:
07_reproject_udf.py validates
this against Earth Engine itself: it opens a UTM grid through
Xee carrying ee.Image.pixelLonLat(), so EE's
own geodesy engine reports the true lon/lat of every pixel — an independent
reprojection reference, not PROJ-vs-PROJ. The SQL UDF and EE agree to sub-metre
precision. The script flags one practical gotcha (PROJ is not thread-safe, so the
UDF runs serially), but the caveat that matters here is conceptual: reprojection
moves the coordinates without resampling the data onto a new grid — and that is
the next operation.
Regridding is not row-independent: each output cell is a weighted blend of
several input cells. That is a many-to-many relationship — and a many-to-many
weighted blend is a sparse matrix–vector product, which is a JOIN against a
weight table plus a weighted GROUP BY:
SELECT w.dst_id, SUM(s.value * w.weight) AS regridded
FROM weights w JOIN src s ON s.cell_id = w.src_id
GROUP BY w.dst_id
08_regrid_weights.py regrids
real SRTM elevation (Sierra Nevada terrain, opened from the Earth Engine
catalog through Xee) coarse → fine and matches
xarray's bilinear .interp() exactly. So regridding does not weaken the thesis —
it is the most relational operation of all.
A warp is just the two composed. The full operation a GIS calls warp (GDAL
and rasterio's reproject) does both at once: change the CRS and resample onto
the new grid. 09_warp.py writes it as the
two cases above run back to back — the 07 reproject UDF carries the target
lon/lat grid back into the source UTM space, arrays turn those reprojected points
into bilinear weights, and the 08 JOIN applies them:
-- 1. reproject the target grid into source coordinates (the 07 UDF)
SELECT dst_lat, dst_lon, reproject(dst_lon, dst_lat)['x'] AS sx,
reproject(dst_lon, dst_lat)['y'] AS sy
FROM target
-- 2. apply the bilinear weights built from those points (the 08 JOIN)
SELECT w.dst_lat AS lat, w.dst_lon AS lon, SUM(s.value * w.weight) AS warped
FROM weights w JOIN src s ON s.x = w.src_x AND s.y = w.src_y
GROUP BY w.dst_lat, w.dst_lon
It warps SRTM from a UTM grid onto a lon/lat grid and matches xarray's .interp()
at the reprojected points exactly, with Earth Engine's own lon/lat SRTM as a
second, cross-CRS sanity check (a loose match — EE resamples its native 30 m data,
we resample the 2 km source — so it is a corroboration, not the assertion). The
warp lands exactly where the split predicts: the row-independent half is a UDF,
the many-to-many half is a JOIN, and the only genuinely geometric step — turning
the reprojected points into weights — is the array work the next section is about.
Where the array paradigm still earns its keep¶
The boundary is weight generation. Applying a regridding is a join; computing the weights — cell overlaps for conservative remapping, stencils and spherical geometry for bilinear, the whole machinery of xESMF/ESMF — is genuinely geometric work that arrays (and specialized libraries) do well. The relational view does not replace that; it consumes its output. The division of labor is clean and, we think, the right one:
Arrays compute the geometry (the weights). SQL applies it (the join).
Likewise, the array libraries remain the right tool for building the inputs in
the first place — opening Zarr, decoding CF metadata, the numerics of generating
a weight matrix. xarray-sql sits downstream of all that as a query front-end:
once the data is openable as an xarray.Dataset, these everyday operations are
expressible — and accessible — as SQL.
That is the qualitative boundary; the rest of this page puts numbers to it. The Results below report what each operation costs in SQL versus the array reference, Analysis explains why the relational form is slower and where the time goes, and the Conclusion turns the whole thing into a when-to-use-which.
Running the suite¶
python benchmarks/geospatial/02_climatology.py # inside the repo
uv run benchmarks/geospatial/02_climatology.py # standalone (PEP 723 deps)
Each script prints its SQL, runs the array reference, and asserts the two agree.
See benchmarks/geospatial/README.md for
the full list and dataset notes.
Results¶
Correctness is the headline, but every case is also profiled. The numbers below
come from run_perf.sh on a single Google
Compute Engine e2-standard-8 (8 vCPU, 32 GB) in us-central1 — in-region with the
ARCO-ERA5 and WeatherBench 2 buckets, so the cloud read is fast — with Earth Engine
reached from the same VM, so all nine cases share one machine and one release build.
Each case runs once per fresh process, with no warmup, repeated five times: the
SQL operation and the xarray reference each pay a cold read on every
measurement.
Fairness here took some care, because the obvious trap is caching. A reference
that calls .load() caches its data in place on the very object the SQL table
also reads from, so a later read — even just running the reference after the SQL
query in the same process — could be served warm. We close that two ways. The one
case that loads shared objects (05, forecast skill) uses .compute() instead,
which returns a fresh array and leaves the inputs lazy, caching nothing; the other
references either reopen their data or recompute their reduction eagerly on every
read (chunks=None is NumPy, not Dask, so there is no graph to keep warm). And
run_perf.sh runs each case in a fresh process per repetition, ruling out any
carryover between reps. We verified the result directly: reading a window
repeatedly in one process stays flat, and running either side after the other
speeds up neither — the SQL query and the reference do not warm each other.
| Case | Step | median (s) | stdev (s) | min (s) | max (s) | peak (MB) |
|---|---|---|---|---|---|---|
| 01 · NDVI (per-pixel arithmetic) | SQL | 3.528 | 0.803 | 2.861 | 5.024 | 114.0 |
| xarray reference | 0.304 | 0.104 | 0.282 | 0.496 | 42.0 | |
02 · Climatology (GROUP BY lat, lon, hour) |
SQL | 4.443 | 0.383 | 4.216 | 5.198 | 490.2 |
| xarray reference | 1.867 | 0.106 | 1.844 | 2.053 | 43.7 | |
03 · Zonal mean (GROUP BY latitude) |
SQL | 2.406 | 0.122 | 2.333 | 2.631 | 236.9 |
| xarray reference | 0.385 | 0.006 | 0.381 | 0.395 | 249.5 | |
04 · Anomaly (climatology self-JOIN) |
SQL | 7.027 | 0.123 | 6.950 | 7.239 | 511.5 |
| xarray reference | 2.549 | 0.219 | 2.126 | 2.657 | 72.1 | |
05 · Forecast skill (forecast↔truth JOIN) |
SQL | 10.714 | 0.093 | 10.663 | 10.891 | 6.6 |
| xarray reference | 0.248 | 0.013 | 0.220 | 0.254 | 2.2 | |
06 · Zonal stats (raster × vector JOIN) |
SQL | 4.308 | 0.053 | 4.299 | 4.401 | 509.1 |
| xarray reference | 1.557 | 0.029 | 1.499 | 1.567 | 1262.1 | |
| 07 · Reprojection (PROJ scalar UDF) | SQL | 0.029 | 0.003 | 0.024 | 0.031 | 0.3 |
08 · Regridding (weight-table JOIN) |
SQL | 0.875 | 0.037 | 0.845 | 0.933 | 11.9 |
| xarray reference | 0.850 | 0.658 | 0.809 | 2.310 | 13.3 | |
09 · Warp (reproject UDF → regrid JOIN) |
SQL | 0.281 | 0.038 | 0.250 | 0.353 | 0.8 |
| xarray reference | 0.817 | 0.030 | 0.764 | 0.828 | 11.2 |
Two patterns are visible before any analysis. SQL is slower on wall-clock wherever
a cloud read or a large relational expansion dominates — by ~2.5–6× on the
GROUP BY and JOIN cases against ARCO-ERA5, and ~43× on case 05, the smallest
grid but the biggest blow-up into rows — and its peak memory is highest on those
join/group-by cases (≈0.5 GB on 02, 04, 06). But the pattern is not universal.
On cases 08 and 09, where the interpolation weights are precomputed and SQL just
applies them, SQL is at parity with the array reference (08: 0.875 vs 0.850 s) or
faster (09: 0.281 vs 0.817 s — the reference pays for pyproj + .interp,
while SQL streams the prebuilt weight JOIN). The slow and the fast cases follow
from the same cause, which the next section pins down. (Case 01 reads Sentinel-2
from Europe, the only non-US source, so its SQL time includes a cross-region read.
Cases 07–09 run against Earth Engine from the same VM: 07 times only the SQL
reproject transform, checked against Earth Engine's own pixelLonLat; 08 and 09
read SRTM lazily on both the SQL and reference sides, so that comparison is
symmetric.)
Case 05 is the suite's most hardware-sensitive number: its SQL time is CPU-bound on
the join and the (GIL-held) row production that feeds it, so it swings with the
machine — across three e2-standard-8 runs it has measured ≈10.7 s, ≈12 s, and
≈23 s, while the read-bound reference stays near 0.25 s. So read the 05 ratio as
"the relational form costs real CPU here," not as a fixed multiplier.
Analysis: how a relational operation spends its time¶
Why is SQL slower, and where does the time actually go? Profiling case 05 — the
forecast-skill JOIN, the widest gap — with cProfile, run cold then warm so that
cold − warm isolates the cloud read and the warm floor is ≈pure compute,
decomposes it cleanly. (These are single-process numbers from a laptop with a slow
cross-region read — a different machine from the in-region table above, on
purpose: it puts both sides' reads on equal, slow footing so the compute gap shows
through. The absolute seconds therefore differ from the table; the decomposition,
not the totals, is the point.)
| read (I/O) | compute | total (cold) | |
|---|---|---|---|
| SQL | ~0.95 s | ~0.71 s | ~1.66 s |
| xarray reference | ~0.79 s | ~0.024 s | ~0.81 s |
The read is comparable on both sides — both open the same Zarr store cold. The
gap is compute, and it is about 30×. The SQL path explodes the 64×32×20×2 grid
into Arrow rows, runs a hash JOIN to align each forecast row with its truth row
on (valid_time, latitude, longitude), aggregates, and streams the result batches
back. The array reference does the identical math as a handful of vectorized NumPy
reductions over contiguous buffers. Row materialization + hashing + the join probe
is simply heavier than dense arithmetic on a regular grid — and it is the same
work that inflates SQL's peak memory in the Results table: the join and group-by
cases hold the grid as rows.
cProfile is unambiguous about where the SQL time sits. Essentially all of it is
in pulling record batches from the DataFusion execution stream; the SQL→xarray
round-trip that turns the query result back into a gridded Dataset
(to_dataset) is sub-millisecond — under 1% of the query. So the cost is the
relational engine doing row-oriented work, not the array reconstruction. The
paradigm itself is the price, paid where the relational algebra runs.
This explains the shape of the whole table. Case 05 stands alone at ~43× not
because its join is exotic but because its reference is nearly free — a 64×32×20×2
grid reduces in-memory in a quarter-second — while SQL still has to explode that
grid into rows and hash-join them; a huge ratio over a tiny denominator. The
ARCO-ERA5 cases (02, 03, 04, 06) instead cluster at ~2.5–6×, because there a large
cloud read is a cost both sides pay, compressing the ratio. And cases 08 and 09
invert it entirely: once the geometry — the interpolation weights — is precomputed,
applying it is a JOIN that streams about as fast as (or faster than) the array
reference's pyproj/.interp. The relational overhead is constant; the ratio
you observe depends on how much non-relational work (the cloud read, the weight
generation) sits on the other side of the comparison. And it shifts with hardware
too: SQL is CPU-bound on the join while the array reference is read-bound, so the
two are gated by different resources. On a fast laptop with a slow cross-region
read the gap nearly closes; on an in-region VM with modest cores it widens. The
underlying cause is
constant — materialize rows, hash-join, aggregate — but which resource you are
waiting on is not.
Conclusion¶
None of this is an argument that SQL is faster. On a single node, for the reduction-shaped operations, it is not — it pays a real per-operation overhead to express an array reduction as relational algebra. (The exceptions, cases 08 and 09, prove the rule: once the array work — generating the weights — is already done, the relational half that remains is competitive, because there is no dense reduction left for arrays to win.) The honest tradeoff is about which property you are optimizing for.
Reach for the array paradigm when the work is dense and grid-aligned. Per-pixel formulas, stencils, convolutions, FFTs, linear algebra — anything that stays in contiguous typed buffers and treats the chunk grid as its unit of parallelism. The array model has the lowest overhead here, and the lead is structural, not incidental: there are no rows to materialize and nothing to shuffle. NDVI (case 01) is the tell — column arithmetic expresses cleanly in SQL, but the array side is ~10× faster (part of which is case 01's cross-region read; the rest is that per-pixel math is exactly what arrays are for).
Reach for SQL when the work is relationally shaped, or the audience is. Joins,
group-bys, alignment across data with different indexes (case 05's three time
axes), raster-meets-vector predicates (case 06) — these are awkward to express and
to reason about as array operations, and they are the native vocabulary of a query
engine. The overhead buys you an operation that reads like its own definition, that
prunes its own reads (a query against the whole ERA5 archive touches only the
variable and window it asks for), and that is accessible to the large audience
fluent in SQL rather than in apply_ufunc and rechunking.
There is also a payoff this single-node benchmark cannot show. The same overhead —
row materialization and a hash join — is what makes the operation a first-class
citizen of a distributed query engine. Cost-based query optimization (join
reordering, choosing broadcast vs. shuffle joins, predicate pushdown), mature
partitioned shuffle and spill-to-disk, partitioning driven by the query rather than
locked to a physical chunk grid — these are exactly the capabilities the
array/Dask ecosystem struggles to provide for join- and group-by-heavy workloads
at scale, and exactly what the relational framing puts within reach. Whether the
constant-factor overhead is worth paying flips as the data grows and the bottleneck
moves from per-element compute to data movement. xarray-sql is single-node today,
so that is a direction rather than a result — but it is the latent reason the
thesis matters beyond expressibility.
So the division of labor from the section above generalizes past regridding. Arrays own the dense numerics and the geometry; SQL owns the relational shape — the joins, the alignment, the aggregation — and, increasingly, the path to running them at scale. The point of this suite is not to crown a winner but to show that the line between the two is exactly where the operation is dense versus where it is relational, and that for a surprising share of geoscience, the operation is relational.