Optimising Part I: Sampling rasters

This is Part I, which looks at different methods of sampling array data at points. Part II is here and focuses on the next step: SQLing it all into a database.

This week I needed to do some geospatial processing, and it needed to go fast. One million locations, with 15,000 days of data each, all for three different climate variables. The raw data is several TB of raster images in the cloud somewhere, and we want our sampled data to end up in a well-organised database somewhere.

In the end we’ll have 15 billion rows, with three variable columns, and some extra columns with the date, location etc. Almost 100 billion individual values to be written. Something like this:

| Date | Point | VarA | VarB | VarC |
| --- | --- | --- | --- | --- |
| 2010-11-14 | 1 | 0.45 | 1.23 | 3.82 |
| 2010-11-15 | 1 | 0.50 | 1.43 | 4.01 |
| 2010-11-14 | 2 | 0.98 | 1.67 | 2.11 |

Then, when we want to know the values for a group of locations, instead of redoing all the sampling from scratch, we just pull them from the database!

While working on this problem, I decided to have some fun benchmarking it to keep track of my progress. I got pretty far down the rabbit-hole in the process, and managed a good few orders-of-magnitude improvement!

If you prefer, head over to the benchmark suite on GitHub. You should be able to clone the repo and run the entire suite in a few minutes.

# Some background on raster sampling (zonal statistics)

This is a common GIS problem: we have a bunch of locations (generally points or polygons) and raw raster data (what’s a raster), and we want to know the values (mean, max, etc) at each location. If the locations are polygons, this involves a bit more work. For this exercise we’re just using points, with a latitude/longitude for each location.

A normal two-dimensional raster is basically a big grid of values, with an x- and a y-coordinate for each cell in the grid. In the example below, it’s a 4x4 grid with a single integer in each cell. In Python this would generally be stored in a NumPy array. In this case, I’ve used a red colour gradient to colour cells by increasing value. To get an RGB picture, you instead need three values for each pixel.

So all raster data is just a much bigger version of this. If I want a particular value, I just use the row and column indices, e.g. arr[2, 1] => 9. (Remember that in most coding, numbering starts at 0.) You can also pull out larger chunks of data, e.g. arr[2, :] => [3, 9, 3, 2] pulls out the whole row!
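For example, here’s a small made-up 4x4 array whose row 2 matches the values above (the other rows are just filler):

import numpy as np

arr = np.array([
    [1, 2, 4, 6],
    [2, 5, 7, 8],
    [3, 9, 3, 2],  # row index 2 from the example
    [1, 4, 6, 9],
])

arr[2, 1]  # => 9, a single cell
arr[2, :]  # => array([3, 9, 3, 2]), the whole row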

Instead of a flat, two-dimensional raster, we have a four-dimensional “hypercube” of data: dates along the first axis, then the three variables, and then the x- and y-coordinates that we’re already familiar with. Now if we want a certain value, we pass four indices. So if we wanted, for example, the first date, all three variables, and again that bottom-leftish cell, we might do arr[0, :, 2, 1] => ....
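And as a quick sketch of the 4D case, with a tiny made-up cube (two dates, three variables, the same 4x4 grid):

import numpy as np

cube = np.random.rand(2, 3, 4, 4)  # (dates, variables, rows, cols)

cube[0, :, 2, 1]  # first date, all three variables at row 2 / col 1 => 3 values
cube[:, 0, 2, 1]  # every date for the first variable at that cell => 2 values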

# Round 1: Rasterstats

The obvious tool for zonal statistics is rasterstats, so that’s where I started. It has a clear downside for this job: it only operates on one 2D raster at a time, so we have to manually loop through each date and each variable in our 4D cube.

So here’s a little function that does that. Note: in this and subsequent code blocks, I’m leaving out some code to keep things to the bare bones of what’s being benchmarked. That includes adding extra columns for the date and location, setting data types to control memory usage, and so on. All the functions will have the same signature as the one below – def name(locs, cube, aff) – and should return the exact same table.

def rasterstats(locs, cube, aff):
    dfs = []
    for d in range(cube.shape[0]):
        res = {}
        for v in range(cube.shape[1]):
            # rasterstats can do much more than just mean, it's great stuff
            s = zonal_stats(locs, cube[d, v, :, :], stats="mean", affine=aff)
            res[v] = [x["mean"] for x in s]
        dfs.append(pd.DataFrame(res))
    return pd.concat(dfs)
This was too slow to run for even 365 days, so I ran it for just one day. Even so, it took 6 minutes for 10,000 points. That would be something like 6,000 days (over 16 years) for the whole dataset!

# Round 2: Manual sampling

Clearly rasterstats was not designed for this application. Instead, let’s pull the values out manually. The key thing is that we have lat/lon coordinates (e.g. 28°S 127°E is somewhere in Australia), and we need to convert those to a row/column coordinate to pull the right value out of the array. But once we have that, the magic of NumPy will let us pull out multiple days (thousands of days!) and variables in one go.

Let’s see. Here we’re using rasterio to create a virtual ‘raster’ matching the cube’s shape, then using its index() function to do the lat/lon => row/col conversion I just described.

def manual(locs, cube, aff):
    with MemoryFile() as m:
        h, w, d = cube.shape[-2], cube.shape[-1], cube.dtype
        ds = m.open(driver="GTiff", count=1, height=h, width=w, dtype=d, transform=aff,)
    dfs = []
    for idx, loc in locs.iterrows():
        # Convert lat/lon to row/col and use it in our NumPy array
        row, col = ds.index(loc.geometry.x, loc.geometry.y)
        res = cube[:, :, row, col]
        res = pd.DataFrame(res)
        dfs.append(res)
    return pd.concat(dfs)
This got it down to 3.2 seconds for a full year (365 days) and 10,000 points -- already about 5,000 times faster!

# Round 3: No GeoPandas

In the function above, locs is a GeoDataFrame: basically a Pandas DataFrame + geometry. Look these up if that’s gibberish. It doesn’t seem necessary to pass all that through, so let’s just send the x- and y-coordinates directly. We can pre-calculate them at the beginning and then re-use them again and again. So in the function below, locs is now a simple array of x- and y-coordinates.
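For reference, that one-off pre-calculation is a couple of lines (gdf here stands in for the original GeoDataFrame of points):

import numpy as np

# Pull plain x/y arrays out of the geometry column and stack them into an (N, 2) array
locs = np.column_stack([gdf.geometry.x.to_numpy(), gdf.geometry.y.to_numpy()])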

def latlon(locs, cube, aff):
    with MemoryFile() as m:
        h, w, d = cube.shape[-2], cube.shape[-1], cube.dtype
        ds = m.open(driver="GTiff", count=1, height=h, width=w, dtype=d, transform=aff,)
    dfs = []
    for i in range(len(locs)):
        # These simple array lookups are always faster than stuff with names
        row, col = ds.index(locs[i][0], locs[i][1])
        res = cube[:, :, row, col]
        res = pd.DataFrame(res)
        dfs.append(res)
    df = pd.concat(dfs)
    return df
Now down to 1.7 seconds.

# Round 4: No Pandas

We got a speed boost by dropping one use of Pandas, so let’s go further. The function above was creating a new Pandas DataFrame on every loop iteration. As is often the case, these higher-level abstractions bring a bit of overhead with them, and that can start to matter in cases like this.

So I took the DataFrame stuff out of the loop, and just stuffed everything into a single DataFrame right at the end instead. I also changed locs again: now it’s two separate arrays of x-values and y-values.

def nopandas(locs, cube, aff):
    xs, ys = locs
    with MemoryFile() as m:
        h, w, d = cube.shape[-2], cube.shape[-1], cube.dtype
        ds = m.open(driver="GTiff", count=1, height=h, width=w, dtype=d, transform=aff,)
    res = []
    for i in range(len(xs)):
        row, col = ds.index(xs[i], ys[i])
        res.append(cube[:, :, row, col])
    # why make a million DataFrames when one will do!
    df = pd.DataFrame(np.concatenate(res))
    return df
An order of magnitude drop to 0.19 seconds!

# Round 5: Drop Rasterio

All that MemoryFile stuff with the rasterio dataset seems unnecessary, when all we want from it is the index() function. I had a look at the source, and lo, it’s basically just one line. So I ditched it and just wrote out that line myself.

def norasterio(locs, cube, aff):
    xs, ys = locs
    res = []
    for i in range(len(xs)):
        # One line of matrix-multiplication magic
        col, row = ~aff * (xs[i], ys[i])
        res.append(cube[:, :, floor(row), floor(col)])
    df = pd.DataFrame(np.concatenate(res))
    return df
Shaved some more off, to 0.13 seconds.

# Round 6: No Affine

The whole time, aff has been an Affine: basically a matrix with some special helpers that does the job of translating between lat/lon and row/col. I looked into how it works, and just did the matrix multiplication myself. Have a look at perrygeo if you want to see how this stuff works.
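To make that concrete, here’s a toy check (on a made-up 0.25 degree global grid, not the real dataset) that the inverse Affine and the hand-rolled multiplication land on the same cell:

from math import floor
from affine import Affine

aff = Affine(0.25, 0.0, -180.0, 0.0, -0.25, 90.0)  # 0.25 degree grid, top-left corner at 180W 90N
x, y = 127.0, -28.0  # roughly the Australian point from earlier

col, row = ~aff * (x, y)  # the Affine way
sa, sb, sc, sd, se, sf, _, _, _ = tuple(~aff)
col2, row2 = x * sa + y * sb + sc, x * sd + y * se + sf  # by hand

assert (floor(row), floor(col)) == (floor(row2), floor(col2))  # => (472, 1228)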

def noaffine(locs, cube, aff):
    xs, ys = locs
    invaff = tuple(~aff)
    sa, sb, sc, sd, se, sf, _, _, _ = invaff
    res = []
    for i in range(len(xs)):
        # Each of these letters stands for something...
        col, row = (xs[i] * sa + ys[i] * sb + sc, xs[i] * sd + ys[i] * se + sf)
        res.append(cube[:, :, floor(row), floor(col)])
    df = pd.DataFrame(np.concatenate(res))
    return df
A small drop to 0.12 seconds.

# Round 7: Compiling Just-in-Time

Normally Python ‘interprets’ code: as it goes, it figures out what each bit means in machine language and then runs it. But in this case I’m running the same snippet of code thousands of times - why not compile it once and skip all that interpreting? A library called Numba makes this extremely easy: just decorate the function with @jit! The code inside the function here is exactly the same.

@jit
def jitted(locs, cube, aff):
    # Same exact function code as above
    ...
No improvement. Numba couldn't do much with that.
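As a quick aside (nothing to do with the benchmark itself), this is all it takes to JIT a toy function. Note that the first call pays the compilation cost, which is worth keeping in mind when timing jitted code:

from numba import jit

@jit
def total(n):
    # A simple numeric loop that Numba can compile down to machine code
    s = 0.0
    for i in range(n):
        s += i * 0.5
    return s

total(10)      # the first call triggers compilation
total(10_000)  # later calls reuse the compiled version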

# Round 8: No more fancy objects

One of the things that makes Python easy is that anything can be just about anything. But to really squeeze out every drop of performance, we want a bit of predictability in the underlying representation. Numba offers this through @njit, where everything is only allowed to be integers, floats, etc, and arrays of these.

Now I had to add a wrapper function, as Affines, DataFrames, list appending and things like that don’t play nicely inside the njitted function. The outer function is called directly, and the inner _njitted() does the number crunching. Notice how the output array is now pre-allocated at a fixed size, and results are slotted neatly in. The code starts to get significantly less readable around here…

@njit
def _njitted(xs, ys, cube, invaff):
    # This is the inner function
    sa, sb, sc, sd, se, sf, _, _, _ = invaff
    num_points = len(xs)
    num_scenes = cube.shape[0]
    num_bands = cube.shape[1]
    avals = np.empty((num_points * num_scenes, num_bands), dtype=np.float32)
    for i in range(num_points):
        col, row = (xs[i] * sa + ys[i] * sb + sc, xs[i] * sd + ys[i] * se + sf)
        res = cube[:, :, floor(row), floor(col)]
        avals[i * num_scenes : (i + 1) * num_scenes, :] = res
    return avals

def njitted(locs, cube, aff):
    # And this is the un-jitted wrapper
    xs, ys = locs
    invaff = tuple(~aff)
    avals = _njitted(xs, ys, cube, invaff)
    df = pd.DataFrame(data=avals)
    return df
Now we're talking! 0.11 seconds.

# Round 9: In parallel this time

The calculation is basically doing the same thing thousands of times, and each calculation does not depend on any other. The perfect opportunity to run them all in parallel, i.e. on different threads and cores in the processor. Once again, Numba makes this extremely easy, just add parallel=True and replace range with their special prange.

@njit(parallel=True)
def _parallel(xs, ys, cube, invaff):
    # All the same in here
    ...
    for i in prange(num_points):
        ...

def parallel(locs, cube, aff):
    # Here too
    ...
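Spelled out in full, my reconstruction of the parallel version (just the Round 8 code with prange swapped in, so treat it as a sketch rather than the exact benchmarked function) looks like this:

from math import floor
import numpy as np
import pandas as pd
from numba import njit, prange

@njit(parallel=True)
def _parallel(xs, ys, cube, invaff):
    sa, sb, sc, sd, se, sf, _, _, _ = invaff
    num_points = len(xs)
    num_scenes = cube.shape[0]
    num_bands = cube.shape[1]
    avals = np.empty((num_points * num_scenes, num_bands), dtype=np.float32)
    # Each iteration is independent, so Numba can hand them out to separate threads
    for i in prange(num_points):
        col, row = (xs[i] * sa + ys[i] * sb + sc, xs[i] * sd + ys[i] * se + sf)
        avals[i * num_scenes : (i + 1) * num_scenes, :] = cube[:, :, floor(row), floor(col)]
    return avals

def parallel(locs, cube, aff):
    xs, ys = locs
    invaff = tuple(~aff)
    return pd.DataFrame(data=_parallel(xs, ys, cube, invaff))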
This got it down to 0.08 seconds! And it should make even more of a difference in the cloud on a 32-core processor!

# Round 10: Off the deep end

These functions have all been creating DataFrames. If that isn’t a hard requirement, we can probably eke out a bit more speed (and lower memory use from less copying) using NumPy Structured Arrays, which provide labelled “columns” like Pandas, with less overhead.

Once again, we have an inner jitted function within a wrapper. Note also that the empty array is created in the wrapper and passed into the inner function, because Numba doesn’t play nice with all data types.

@njit(parallel=True)
def _recarrays(xs, ys, cube, invaff, rec):
    sa, sb, sc, sd, se, sf, _, _, _ = invaff
    num_scenes = cube.shape[1]
    for i in prange(len(xs)):
        col, row = (xs[i] * sa + ys[i] * sb + sc, xs[i] * sd + ys[i] * se + sf)
        res = cube[:, :, floor(row), floor(col)]
        fr = i * num_scenes
        to = (i + 1) * num_scenes
        # We can assign values to the array by name (A, B, C) which is quite nice
        rec.A[fr:to] = res[0]
        rec.B[fr:to] = res[1]
        rec.C[fr:to] = res[2]
    return rec

def recarrays(locs, cube, aff):
    xs, ys = locs
    invaff = tuple(~aff)
    cube = np.moveaxis(cube, 1, 0)
    rows = len(xs) * cube.shape[1]
    # A recarray/structured array is basically an array with multiple data types
    rec = np.empty(rows, dtype=[("A", "f4"), ("B", "f4"), ("C", "f4")])
    rec = _recarrays(xs, ys, cube, invaff, rec)
    return rec
And another boost down to 0.07 seconds!
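Downstream code can work with the structured array directly, or convert it at the last moment; a quick usage sketch (xs and ys being the coordinate arrays from earlier):

import pandas as pd

rec = recarrays((xs, ys), cube, aff)
rec["A"][:5]            # values for the first variable, no copying
df = pd.DataFrame(rec)  # converts straight to a DataFrame if you do end up wanting one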

# Winner winner

All in all a 5,000x speed-up from dropping rasterstats, and another 50x speed-up from the remaining optimisations. A total of 250,000 times faster!

Quick note: I’m not trying to knock rasterstats, Pandas, GeoPandas, rasterio or anything. These are all great libraries that I use every day. But there’s a time and place for everything.

I pushed the two fastest methods up to 1 million points, and the recarray approach stayed slightly ahead. When I put the final method into production, it was able to read a year’s worth of data for a million points, with six variables and a huge 4D cube of data, in less than 3 seconds.