NumPy Is Magic: NES CHR Conversion

Jan 23, 2023, Programming

NumPy is magic and I mean that in both good and bad ways.

The good part: Code that uses NumPy can be very fast and simple. NumPy is easy to install on different platforms. There are some amazing examples of fast NumPy code out there, like Rafael de la Fuente’s Python Raytracer.

The bad part: It can be hard to write fast NumPy code, even if the code is simple once you write it. Simple functions or operators in NumPy can do different things depending on the shapes of the arguments.

If you write code in C or Fortran, you can be reasonably sure that your code will be fast if you write it in a straightforward way. With NumPy, your code can still be fast, but you may need to be clever.

Here’s an example of how to speed up some Python code using NumPy.

NES CHR Format

Graphics for the NES are stored as 8×8 pixel tiles with two bits of data per pixel (or 8×16 tiles for sprites, but we can ignore those). The data for each tile is arranged as a series of 8-pixel rows. Each row is two bytes, and each byte contains one bit for each pixel in the row.

This may be easier to understand with a picture. Here’s a 16×16-pixel image with a 4-color palette.

Loading…

JavaScript not available.

The sprite is first divided into 8×8 pixel tiles.

Loading…

JavaScript not available.

Each row is encoded as two bytes. The low bits for all rows in a tile are encoded first, followed by the high bits for all rows.

Loading…

JavaScript not available.

Simple and Correct Conversion

If you have a 2-bit indexed image, you can extract an 8×8 tile in NES CHR format with the following code, encode1.py:

def encode_tile(pixels: bytes,
                stride: int, x0: int, y0: int) -> bytes:
    """Encode a tile as NES CHR data.

    Arguments:
    pixels -- raw pixel data, one byte per pixel
    stride -- bytes per row in source image
    x0 -- coordinate of top left pixel in tile
    y0 -- coordinate of top left pixel in tile
    """
    lobytes = []
    hibytes = []
    for y in range(y0, y0 + 8):
        lobits = 0
        hibits = 0
        for x in range(x0, x0 + 8):
            value = pixels[y * stride + x]
            lobits = (lobits << 1) | (value & 1)
            hibits = (hibits << 1) | ((value >> 1) & 1)
        lobytes.append(lobits)
        hibytes.append(hibits)
    return bytes(lobytes) + bytes(hibytes)

def encode_tiles(pixels: bytes, width: int, height: int) -> bytes:
    """Encode an image as NES CHR data.

    Arguments:
    pixels -- raw pixel data, one byte per pixel
    width -- width of image, in pixels
    height -- height of image, in pixels"""
    tiles = []
    for y in range(height // 8):
        for x in range(width // 8):
            tiles.append(encode_tile(pixels, width, x * 8, y * 8))
    return b''.join(tiles)

File benchmark.py:

import importlib
import io
import sys
import timeit

def gendata(n: int) -> bytes:
    chunk = bytes(range(256))
    f = io.BytesIO()
    while n >= len(chunk):
        f.write(chunk)
        n -= len(chunk)
    f.write(chunk[:n])
    return f.getvalue()

def main():
    if len(sys.argv) != 4:
        print('Usage: benchmark.py <module> <size> <count>',
              file=sys.stderr)
        sys.exit(2)
    module = importlib.import_module(sys.argv[1])
    size = int(sys.argv[2])
    count = int(sys.argv[3])
    print(module.encode_tiles(gendata(64), 8, 8))
    tm = timeit.Timer(
        'encode_tiles(pixels, n, n)',
        'pixels=gendata(n*n)',
        globals={'gendata': gendata,
                 'encode_tiles':module.encode_tiles,
                 'n':size})
    time = min(tm.repeat(number=count))
    print(time * 1e6 / (count * (size//8)**2), 'μs/tile')

if __name__ == '__main__':
    main()

On my 3.50GHz Intel i5-6600K, the test takes 18.85 μs per tile.

$ python benchmark.py encode1 256 100
  18.879488671892375 μs/tile

Functional Python

Python also supports functional programming, although functional programming is not idiomatic Python and can be cumbersome to write. Let’s see if a functional version of the program is any faster. We’ll copy the encoder to encode2.py and modify the encode_tile function:

from functools import reduce
from operator import or_ as bitor

def encode_tile(pixels: bytes,
                stride: int, x0: int, y0: int) -> bytes:
    rows = [pixels[p:p + 8]
            for p in range(y0 * stride + x0,
                           (y0 + 8) * stride + x0, stride)]
    return bytes(
        reduce(bitor, (((row[x] >> p) & 1) << (7 - x)
                       for x in range(8)))
        for p in (0, 1)
        for row in rows)

The functional version is slower than the iterative version:

$ python benchmark.py encode2 256 100
28.659352558619844 μs/tile

It’s a significant difference. The functional version takes 52% more time to convert a tile.

Is functional code slow in Python? Often it is. The main Python implementation, called CPython, only offers a limited set of optimizations. Functions are not inlined and the overhead of function calls adds up.

Enter NumPy

The NumPy version of the code is harder to understand. Here is encode3.py:

import numpy as np
import PIL.Image as Image

def encode_tiles(im: Image.Image) -> bytes:
    """Convert an indexed image to NES tiles."""
    # Convert to an array of 8x8 tiles of pixel data.
    pixels = np.array(im)
    height, width = pixels.shape
    tiles = (pixels.reshape(height//8, 8, width//8, 8)
             .swapaxes(1, 2)
             .reshape(-1, 8, 8))

    # Pack into NES format.
    bits = (tiles[:,None] >>
            np.array([0, 1], np.uint8)[None,:,None,None])
    bits &= 1
    return np.packbits(bits, axis=3).tobytes()

File benchmark2.py:

import io
import sys
import timeit

import PIL.Image as Image

import encode3

def gendata(n: int) -> bytes:
    chunk = bytes(range(256))
    f = io.BytesIO()
    while n >= len(chunk):
        f.write(chunk)
        n -= len(chunk)
    f.write(chunk[:n])
    return f.getvalue()

def genimage(size: int) -> Image.Image:
    return Image.frombytes('P', (size, size),
                           gendata(size * size))

def main():
    if len(sys.argv) != 3:
        print('Usage: benchmark.py <size> <count>',
              file=sys.stderr)
        sys.exit(2)
    size = int(sys.argv[1])
    count = int(sys.argv[2])
    tm = timeit.Timer(
        'encode_tiles(image)',
        globals={'encode_tiles': encode3.encode_tiles,
                 'image': genimage(size)})
    time = min(tm.repeat(number=count))
    print(time * 1e6 / (count * (size//8)**2), 'μs/tile')

if __name__ == '__main__':
    main()

There are no explicit loops and no functional programming. The code is much faster:

$ python benchmark2.py 256 100
0.6263812402451663 μs/tile

This takes 3.3% as much time, or can process 30× as many tiles per second.

How It Works: Reshape

Conceptually, a normal NumPy array is like a pointer to raw data, a data type for array elements, and information about each array axis. Array axes can be freely reordered. The first call to reshape() splits the Y axis into 8-pixel chunks, and splits the X axis into 8-pixel chunks. NumPy can do this by creating an array view—it creates different metadata for the same underlying data, without copying any data.

.reshape(height//8, 8, width//8, 8)

The two 8-pixel axes are the tiles. In order to get the 8×8 pixel tiles, we reorder the axes to put the rows and columns of tiles at the front, and the rows and columns of pixels at the back. This swaps axis 1 (the Y axis of pixels within each tile) with axis 2 (the X axis of tiles across the image). NumPy can also do this without copying.

.swapaxes(1, 2)

Finally, we combine the first two axes into one axis, so we have a 1D array of tiles, rather than a 2D array of tiles. NumPy has to copy the data this time. The shape of –1 means that axis size should be calculated automatically by NumPy. Yes, this involves a loop, but the loop is done deep inside NumPy, and NumPy implements it in C.

.reshape(-1, 8, 8)

We now have an array of 8×8 tiles.

How It Works: Broadcasting

The next NumPy technique is broadcasting. Broadcasting lets you apply an operation to two arrays that have different shapes. If one of the arrays has a size of 1 in a particular axis, then the value in that array is copied across the axis.

It’s easiest to see how it works with small arrays. Here, the second array has an axis of size 1, so the value in that array (100) is copied across the axis.

>>> np.array([1, 2]) + np.array([100])
array([101, 102])

The code in the converter is more complicated. It introduces a new axis between the first and second axis. This new axis distinguishes between the low bits and the high bits of the pixel data. The correct bit is shifted into place, and then masked off. Both operations here are broadcasted.

bits = (tiles[:,None] >>
        np.array([0, 1], np.uint8)[None,:,None,None])
b &= 1

Simple C

If you write this in C, you end up with some simple nested loops. No explanation is necessary.

Yes, it’s much faster than the NumPy code, clocking in at about 11× as many operations per second. The C version is both fast and simple—that’s the advantage of C code—fast C code is often still simple C code.

C version: encode.c

#define _POSIX_C_SOURCE 199309

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void EncodeTiles(uint8_t *restrict tiles,
                 const uint8_t *restrict pixels, int width,
                 int height) {
  for (int ty = 0; ty < height; ty += 8) {
    for (int tx = 0; tx < width; tx += 8) {
      for (int y = 0; y < 8; y++) {
        unsigned lobits = 0, hibits = 0;
        for (int x = 0; x < 8; x++) {
          int value = pixels[(ty + y) * width + tx + x];
          lobits = (lobits << 1) | (value & 1);
          hibits = (hibits << 1) | ((value >> 1) & 1);
        }
        tiles[y] = lobits;
        tiles[8 + y] = hibits;
      }
      tiles += 16;
    }
  }
}

#define SIZE 2048
#define COUNT 10

int main(int argc, char **argv) {
  uint8_t *pixels = malloc(SIZE * SIZE);
  if (pixels == NULL)
    abort();
  uint8_t *tiles = malloc(SIZE * SIZE / 4);
  if (tiles == NULL)
    abort();
  for (int i = 0; i < SIZE * SIZE; i++) {
    pixels[i] = i;
  }
  double best = 0.0;
  EncodeTiles(tiles, pixels, SIZE, SIZE);
  int result = 0;
  for (int i = 0; i < COUNT; i++) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    EncodeTiles(tiles, pixels, SIZE, SIZE);
    clock_gettime(CLOCK_MONOTONIC, &end);
    result += tiles[0];
    double time = (end.tv_sec - start.tv_sec) +
                  1e-9 * (end.tv_nsec - start.tv_nsec);
    if (i == 0 || time < best) {
      best = time;
    }
  }
  printf("Time: %.4f μs/tile\n", best * 1e6 / (SIZE * SIZE / 64));
  return result;
}

$ ./a.out
Time: 0.0608 μs/tile

NumPy: Too Clever?

If you’re already using Python to process images, audio, or similar big chunks of numeric data, NumPy should be in your toolbox. NumPy is fast and expressive. NumPy also comes with a massive library of functions, and if that isn’t enough, you can use SciPy, Pandas, Pillow, or another library that interoperates seamlessly with NumPy.

The catch is that some times, you’ll do something clever and your NumPy code will be harder to understand.