Optimizing NumToString
The problem is simple: how do you write a fast function that converts a number to a decimal string? You’re not allowed to use an excessive amount of space—lookup tables are fine, but they can’t be massive. The function should take an unsigned 32-bit integer as input and produce the decimal string representation as output, with no extra leading zeroes.
To make the implementation simpler, the function will write the string backwards, starting with the last character, and then return a pointer to the last character written, which is the start of the string. The reasoning will become apparent when you write out the function.
You’ll note that the buffer is defined as a union; this is for a later optimization.
#include <stdint.h>
// Buffer for storing the result of NumToString.
union NumToStringBuffer {
char chars[10];
uint16_t u16[5];
};
// Convert an unsigned number to a decimal string. Returns a
// pointer to the start of the string; the string fills the
// remainder of the buffer.
char *NumToString(union NumToStringBuffer *buf, uint32_t num);
Unit Test
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "num2str.h"
void Test(uint32_t value, const char *expect) {
union NumToStringBuffer buf;
char *ptr = NumToString(&buf, value);
int out_len = buf.chars + sizeof(buf.chars) - ptr;
int expect_len = strlen(expect);
if (out_len != expect_len || memcmp(ptr, expect, out_len) != 0) {
printf("Got '%.*s', expect '%s'\n", out_len, ptr, expect);
exit(1);
}
}
int main(int argc, char **argv) {
Test(0, "0");
Test(1, "1");
Test(9, "9");
Test(10, "10");
Test(99, "99");
Test(123456789, "123456789");
Test(4294967295, "4294967295");
puts("Tests passed");
}
Naïve Version First
Start with something simple and correct.
#include "num2str.h"
char *NumToString(union NumToStringBuffer *buf, uint32_t num) {
char *ptr = buf->chars + sizeof(buf->chars);
do {
--ptr;
*ptr = '0' + (num % 10);
num /= 10;
} while (num > 0);
return ptr;
}
How Fast Is This?
We could test by converting every 32-bit number, but that is a lot of numbers and the wait would be annoying. Testing one in every 9 numbers is much faster; I chose 9 because it shares no factors with 10, so the sampled values still cover every final digit.
We’re just interested in the average number of nanoseconds per call.
Benchmarking Code
// Needed for clock_gettime.
#define _POSIX_C_SOURCE 199309L
#include <limits.h>
#include <math.h>
#include <stdio.h>
#include <time.h>
#include "num2str.h"
enum {
kStepSize = 9,
kCallCount = UINT_MAX / kStepSize,
};
// Call NumToString many times and return the number of nanoseconds
// per call.
double Benchmark(void) {
struct timespec start_time, end_time;
clock_gettime(CLOCK_MONOTONIC, &start_time);
union NumToStringBuffer buf;
for (uint32_t i = 0; i < kCallCount; i++) {
NumToString(&buf, i * kStepSize);
}
clock_gettime(CLOCK_MONOTONIC, &end_time);
double nsec = 1e9 * (end_time.tv_sec - start_time.tv_sec) +
(end_time.tv_nsec - start_time.tv_nsec);
return nsec / kCallCount;
}
enum {
kTestCount = 10,
};
int main(int argc, char **argv) {
// Warm up caches.
Benchmark();
// Run benchmark.
double times[kTestCount];
for (int i = 0; i < kTestCount; i++) {
times[i] = Benchmark();
}
// Print out stats.
double mean = 0.0;
for (int i = 0; i < kTestCount; i++) {
mean += times[i];
}
mean /= kTestCount;
printf("Time = %.2f ns/call\n", mean);
double variance = 0.0;
for (int i = 0; i < kTestCount; i++) {
double delta = times[i] - mean;
variance += delta * delta;
}
variance /= kTestCount;
printf("Deviation = %.2f ns/call\n", sqrt(variance));
}
Here are the results:
$ ./a.out
Time = 9.15 ns/call
Deviation = 0.02 ns/call
The benchmark numbers are from an Intel i5-6600K at 3.50 GHz, running Linux, compiled with GCC 12.2.0 at -O2.
Two Digits At Once
Our first idea for speeding it up is to process two digits at a time. This is the reason why the buffer is defined as a union—this way, the program can write two bytes at a time without undefined behavior.
An easy and fast way to convert two digits at a time is to use a lookup table. The lookup table just contains 100 uint16_t values with the ASCII representation of the numbers 0–99.
Two-Digit Lookup Table
#include <stdint.h>
// clang-format off
static const union {
char ascii[200];
uint16_t u16[100];
} kDigits = {{
'0','0','0','1','0','2','0','3','0','4',
'0','5','0','6','0','7','0','8','0','9',
'1','0','1','1','1','2','1','3','1','4',
'1','5','1','6','1','7','1','8','1','9',
'2','0','2','1','2','2','2','3','2','4',
'2','5','2','6','2','7','2','8','2','9',
'3','0','3','1','3','2','3','3','3','4',
'3','5','3','6','3','7','3','8','3','9',
'4','0','4','1','4','2','4','3','4','4',
'4','5','4','6','4','7','4','8','4','9',
'5','0','5','1','5','2','5','3','5','4',
'5','5','5','6','5','7','5','8','5','9',
'6','0','6','1','6','2','6','3','6','4',
'6','5','6','6','6','7','6','8','6','9',
'7','0','7','1','7','2','7','3','7','4',
'7','5','7','6','7','7','7','8','7','9',
'8','0','8','1','8','2','8','3','8','4',
'8','5','8','6','8','7','8','8','8','9',
'9','0','9','1','9','2','9','3','9','4',
'9','5','9','6','9','7','9','8','9','9',
}};
// clang-format on
The conversion code is now a little more complicated, and hopefully faster.
#include "digits.h"
#include "num2str.h"
char *NumToString(union NumToStringBuffer *buf, uint32_t num) {
uint16_t *ptr = buf->u16 + sizeof(buf->u16) / sizeof(*buf->u16);
for (;;) {
if (num < 10) {
// One digit left.
char *cptr = (char *)ptr;
cptr--;
*cptr = num + '0';
return cptr;
}
ptr--;
if (num < 100) {
// Two digits left.
*ptr = kDigits.u16[num];
return (char *)ptr;
}
// More than two digits left.
unsigned x = num % 100;
num /= 100;
*ptr = kDigits.u16[x];
}
}
The speed is pretty good, taking 41% less time per call, on average.
$ ./a.out
Time = 5.36 ns/call
Deviation = 0.09 ns/call
As a historical note, back in the 1990s, everyone made their loops fast by casting pointers, instead of being careful about aliasing like the code above. That was back when compilers did less aggressive optimization and undefined behavior was less of a practical concern:
// This is OFTEN an aliasing violation, which is
// undefined behavior, but the compiler will accept it.
*(unsigned short *)ptr = value;
Some compilers (like Metrowerks, or maybe MPW) supported a way of doing this that was even more bizarre:
// Most compilers can’t (or won’t) compile this.
(unsigned short)*ptr = value;
Yes, these compilers let you cast your l-values. The 1990s were truly a magical time.
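For completeness: the portable way to get that same single two-byte store today, without a union and without the cast, is memcpy, which compilers recognize and turn into one store. A small sketch (the helper name is mine, not from the code above):

#include <stdint.h>
#include <string.h>

// Store two characters at once without an aliasing violation. GCC and
// Clang compile the fixed-size memcpy down to a single 16-bit store.
static inline void StoreTwoDigits(char *ptr, uint16_t digits) {
  memcpy(ptr, &digits, sizeof(digits));
}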
Divide And Conquer
The loop above still processes two digits at a time, but we can do better. The number has at most 10 digits. We first divide the number into three parts: one two-digit section and two four-digit sections, then we divide each four-digit section into two two-digit sections. For example, 4294967295 splits into 42, 9496, and 7295.
#include "digits.h"
#include "num2str.h"
char *NumToString(union NumToStringBuffer *buf, uint32_t num) {
unsigned tmp = (num * 3518437209ull) >> 45;
// g1 - low 4 digits
// g2 - next 4 digits
// g3 - top 2 digits
unsigned g1 = num - tmp * 10000;
unsigned g3 = (tmp * 429497ull) >> 32;
unsigned g2 = tmp - g3 * 10000;
unsigned h1, h2;
h2 = (g1 * 5243) >> 19;
h1 = g1 - h2 * 100;
buf->u16[4] = kDigits.u16[h1];
buf->u16[3] = kDigits.u16[h2];
h2 = (g2 * 5243) >> 19;
h1 = g2 - h2 * 100;
buf->u16[2] = kDigits.u16[h1];
buf->u16[1] = kDigits.u16[h2];
buf->u16[0] = kDigits.u16[g3];
char *ptr = buf->chars;
char *end = buf->chars + 10;
while (ptr < end - 1 && *ptr == '0') {
ptr++;
}
return ptr;
}
Magic constants all over the place! But first, the results.
$ ./a.out
Time = 3.75 ns/call
Deviation = 0.09 ns/call
We’re converting 32-bit numbers to decimal strings at an average of only (3.75 ns) × (3.50 GHz) = 13.1 cycles per operation, or 1.31 cycles per byte of output. Maybe you can write faster code, but I’m stopping here. This takes 59% less runtime than the original version.
Magic Constants
Wait, where did those magic constants come from? 3518437209? 429497? 5243?
It turns out that these constants are nothing special, and it’s easy for you to come up with these constants yourself. You see, these numbers are just reciprocals, scaled up by a power of two. The basic idea is simple: instead of dividing a ÷ b, you multiply a × (2^n/b) and then shift down by n bits. The value 2^n/b is rounded up to an integer; this is the magic constant.
When you use this technique on the computer, there are some constraints (a short sketch after this list shows how to derive and check a constant):
- The scaling factor 2^n must be large enough to get an accurate result for the entire range of inputs.
- The scaling factor 2^n must be small enough that the multiplication does not overflow for the entire range of inputs.
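To make that concrete, here is a small sketch (separate from the programs above) that derives the constant for dividing by 10,000 with n = 45, the pair used later in this article, and spot-checks it against real division:

#include <inttypes.h>
#include <stdio.h>

int main(void) {
  // Round 2^45 / 10,000 up to the nearest integer.
  const uint64_t divisor = 10000;
  const unsigned shift = 45;
  uint64_t magic = ((UINT64_C(1) << shift) + divisor - 1) / divisor;
  printf("magic = %" PRIu64 "\n", magic);  // prints 3518437209
  // Spot-check the multiply-and-shift against real division.
  const uint32_t samples[] = {0, 9999, 10000, 4294959999u, 4294967295u};
  for (int i = 0; i < 5; i++) {
    uint32_t q = (uint32_t)(((uint64_t)samples[i] * magic) >> shift);
    printf("%u / 10000 = %u (expect %u)\n", samples[i], q, samples[i] / 10000);
  }
  return 0;
}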
First Division
The value 3518437209 is used to divide by 10,000. It’s a 32-bit number, so we are safe from overflow when we do a 64-bit multiply with this constant and a 32-bit input. Since the reciprocal constant is rounded up, if the scaling factor is too small, we can catch the error by testing the largest in-range input that is 1 below a multiple of 10,000. (Figuring out why is an exercise for the reader. It’s nice that we only have to check one value!)
>>> (4294959999 * 3518437209) >> 45
429495
>>> (4294960000 * 3518437209) >> 45
429496
If the scaling factor were only 2^44, it wouldn’t work.
>>> (4294959999 * 1759218605) >> 44
429496
If the scaling factor were 2^46, it would overflow: the constant would be 33 bits, and its product with a 32-bit input no longer fits in 64 bits. So it is a nice coincidence that we can divide by 10,000 using only a single 64-bit multiplication.
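If you would rather not take the single-value argument on faith, a brute-force check over every 32-bit input is cheap enough to run once. This sketch is separate from the benchmark and only verifies the constant:

#include <stdint.h>
#include <stdio.h>

// Check the multiply-and-shift against real division for every 32-bit
// input. Takes a little while, but only needs to be run once.
int main(void) {
  uint32_t n = 0;
  do {
    if ((uint32_t)(((uint64_t)n * 3518437209ull) >> 45) != n / 10000) {
      printf("Mismatch at %u\n", n);
      return 1;
    }
  } while (n++ != UINT32_MAX);
  puts("All inputs OK");
  return 0;
}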
Second Division
The second division operation is also a division by 10,000, but it uses the reciprocal constant 429497 and a shift of 32 bits. Why is this different, if it’s calculating the same result?
The answer is simple: this code was written for ARMv4, not x86. On ARMv4, the result of a 64-bit multiplication is split across two 32-bit registers. A scaling factor of 2^32 is especially convenient because it means that no shift instruction is needed; the result we want is already sitting in the high register. This is possible because of the reduced input range of the second division: its input is the output of the first division, which is at most 429,496.
ARM is not unique here. There are other architectures where it is easier to get the high 32 bits of a multiplication rather than using the full 64-bit result.
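In C, asking for only the high half of the product is spelled as a 64-bit multiply followed by a shift of exactly 32; on machines that return the product in a register pair, the compiler lowers that shift to simply reading the high register. A sketch of the pattern (the helper name is mine, the constant is the one from the code above):

#include <stdint.h>

// Divide by 10,000 by keeping only the high 32 bits of the product.
// Valid only for inputs up to 429,496, the output range of the first
// division; the full 32-bit input range needs the 45-bit shift instead.
static inline uint32_t DivBy10000Small(uint32_t tmp) {
  return (uint32_t)(((uint64_t)tmp * 429497u) >> 32);
}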
Third Division
Finally, there is the scaling constant 5243, which is used twice. This is used to calculate a division by 100, with inputs in the range 0–9999. Because of the reduced range, only a 32-bit multiplication is needed here.
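As a sketch (the helper name is mine), the whole trick fits in two lines: split a value in the range 0–9999 into its two digit pairs with a single 32-bit multiply.

#include <stdint.h>

// Split a value in [0, 9999] into its high and low two-digit pairs.
// 5243 is 2^19 / 100 rounded up; the approximation is exact over this
// reduced input range, so no 64-bit multiply is needed.
static inline void SplitHundreds(uint32_t g, uint32_t *hi, uint32_t *lo) {
  *hi = (g * 5243u) >> 19;  // g / 100
  *lo = g - *hi * 100u;     // g % 100
}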
The Compiler Does This For You
The point here is to understand how this optimization works. Most of the time you don’t need to do it by hand, because the compiler will replace division with multiplication wherever it can.
This kind of optimization is called strength reduction: replacing a heavy operation (like division) with something faster (like multiplication). Compilers are generally very good at it.
For example, you might see someone take some code that looks like this:
int x = y / 16;
And replace it with this:
int x = y >> 4;
You should be careful when you do this, for two reasons:
- The compiler already does this for you, even when optimizations are disabled. (It depends on the compiler, but the optimization is so easy that it is always turned on in common compilers like GCC.)
- The optimization is incorrect for some negative values of y, but the compiler will produce a slightly different, correct version that works for all inputs (a sketch of what it generates appears after this list).
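For the curious, here is a sketch of the kind of code the compiler emits instead of a bare shift. It is an illustration, not a decompilation of any particular compiler’s output, and it assumes a 32-bit int with arithmetic right shifts for signed values, which is what common compilers give you.

// Signed division by 16 that rounds toward zero, like C's y / 16.
// A bare y >> 4 rounds toward negative infinity for negative y, so a
// bias of 15 is added first when y is negative.
int DivBy16(int y) {
  int bias = (y >> 31) & 15;  // 15 if y is negative, 0 otherwise
  return (y + bias) >> 4;
}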
In other words, doing this kind of optimization by hand is usually a waste of time and sometimes introduces bugs into your code.
However, there are some situations where you can beat the compiler:
- The original code was compiled for ARMv4 THUMB, which has no 64-bit multiply operation.
- If you know that the input has a reduced range, then you can sometimes make optimizations that the compiler can’t make.
Real World Applications
As a reward for having the fastest implementation of NumToString() on the block, I was gifted a drawing of a turtle made in MS Paint, which I will share with you.