Even faster UTF-8 character counting

I recently came across two articles, "Counting characters in UTF-8 strings is fast" by Kragen Sitaker, and "Counting characters in UTF-8 strings is fast(er)" by George Pollard, which provide a series of successively faster ways of (as the article names suggest) counting the number of UTF-8 characters in a NUL-terminated string. We can do better.

Kragen takes the approach of examining each byte in sequence and asking if (a) is it the terminating NUL, and (b) is it the first byte of a UTF-8 character. This last test is quite easy: Bytes 0x01 through 0x7F in UTF-8 represent the corresponding ASCII characters, bytes 0xC0 through 0xFF are the first byte of a multibyte character, and bytes 0x80 through 0xBF are continuation bytes. This results in the following inner loop (modulo some style changes to make it easier to compare this against later versions):

	while (s[i]) {
		if ((s[i] & 0xC0) != 0x80)
			j++;
		i++;
	}
	return (j);
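For reference, the inner loop above can be wrapped into a complete, compilable function; the wrapper below is mine, not Kragen's exact code:

```c
#include <stddef.h>

/*
 * Byte-by-byte UTF-8 character count: every byte which is not a
 * continuation byte (10xxxxxx) begins a character.  The wrapper is
 * mine; the loop is Kragen's algorithm.
 */
static size_t
count_utf8_bytewise(const char * s)
{
	size_t i = 0;
	size_t j = 0;

	while (s[i]) {
		/* Count every byte which is not a continuation byte. */
		if ((s[i] & 0xC0) != 0x80)
			j++;
		i++;
	}

	return (j);
}
```

Note that the `(s[i] & 0xC0) != 0x80` test works whether `char` is signed or unsigned, since the masking happens after integer promotion.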

Kragen continues by comparing this to an optimized version written in x86 assembly language by Aristotle Pagaltzis; Aristotle's version cleverly takes advantage of the shl instruction setting the sign, carry, and zero flags, but otherwise applies exactly the same algorithm:

loopa:	dec %ecx
loopb:	lodsb
	shl $1, %al
	js loopa
	jc loopb
	jnz loopa

However, this assembly language version, like Kragen's C version, inspects each of the bytes one by one, which inherently limits the performance.

George Pollard makes the assumption that the input string is valid UTF-8, and notices that by looking at the first byte of a multibyte character, we can determine the length of the character: If the first byte is between 0xC0 and 0xDF, the UTF-8 character has two bytes; if it is between 0xE0 and 0xEF, the UTF-8 character has three bytes; and if it is between 0xF0 and 0xFF, the UTF-8 character has four bytes. After reading the first byte of a multibyte character, George skips over the trailing bytes. He also fast-paths the handling of ASCII characters, treating characters as signed bytes in order to distinguish between ASCII and non-ASCII characters, while giving a wonderful example of using a goto to jump from the middle of one loop into the middle of another:

	while (s[i] > 0) {
ascii:
		i++;
	}

	count += i - iBefore;

	while (s[i]) {
		if (s[i] > 0) {
			iBefore = i;
			goto ascii;
		} else {
			switch (0xF0 & s[i]) {
			case 0xE0:
				i += 3;
				break;
			case 0xF0:
				i += 4;
				break;
			default:
				i += 2;
				break;
			}
		}
	}
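The first-byte length rule George relies on can be checked in isolation; this little helper is my own illustration, not part of his code, and like his code it assumes a valid UTF-8 lead byte:

```c
/*
 * Length in bytes of a UTF-8 character, given its first byte.
 * Assumes valid UTF-8 input, as George's code does.
 */
static int
utf8_charlen(unsigned char b)
{

	if (b < 0x80)
		return (1);	/* 0x00-0x7F: ASCII */
	if (b < 0xE0)
		return (2);	/* 0xC0-0xDF: two bytes */
	if (b < 0xF0)
		return (3);	/* 0xE0-0xEF: three bytes */
	return (4);		/* 0xF0-0xFF: four bytes */
}
```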

While this code is considerably faster than both Kragen's C code and Aristotle's assembly code, it suffers from two performance-limiting factors: First, it uses conditional branches which will only be consistently predicted correctly if all of the characters encountered have the same length; and second, it still inspects characters one by one.

This can be improved in three ways:

1. Instead of counting the first bytes of UTF-8 characters directly, count the continuation bytes and subtract that count from the string length at the end; the unpredictable conditional branch then becomes branch-free arithmetic.
2. Instead of inspecting bytes one by one, read an entire machine word (4 or 8 bytes) at a time and test all of its bytes at once with bit-twiddling.
3. Prefetch data well ahead of where it is being read, so that memory latency is hidden behind the computation.

Making these improvements gave me the following code:
#include <stddef.h>	/* size_t */
#include <stdint.h>	/* uintptr_t */

#define ONEMASK ((size_t)(-1) / 0xFF)

static size_t
cp_strlen_utf8(const char * _s)
{
	const char * s;
	size_t count = 0;
	size_t u;
	unsigned char b;

	/* Handle any initial misaligned bytes. */
	for (s = _s; (uintptr_t)(s) & (sizeof(size_t) - 1); s++) {
		b = *s;

		/* Exit if we hit a zero byte. */
		if (b == '\0')
			goto done;

		/* Is this byte NOT the first byte of a character? */
		count += (b >> 7) & ((~b) >> 6);
	}

	/* Handle complete blocks. */
	for (; ; s += sizeof(size_t)) {
		/* Prefetch 256 bytes ahead. */
		__builtin_prefetch(&s[256], 0, 0);

		/* Grab 4 or 8 bytes of UTF-8 data. */
		u = *(size_t *)(s);

		/* Exit the loop if there are any zero bytes. */
		if ((u - ONEMASK) & (~u) & (ONEMASK * 0x80))
			break;

		/* Count bytes which are NOT the first byte of a character. */
		u = ((u & (ONEMASK * 0x80)) >> 7) & ((~u) >> 6);
		count += (u * ONEMASK) >> ((sizeof(size_t) - 1) * 8);
	}

	/* Take care of any left-over bytes. */
	for (; ; s++) {
		b = *s;

		/* Exit if we hit a zero byte. */
		if (b == '\0')
			break;

		/* Is this byte NOT the first byte of a character? */
		count += (b >> 7) & ((~b) >> 6);
	}

done:

	return ((s - _s) - count);
}
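The two word-sized bit tricks used in the main loop can be pulled out and sanity-checked on their own; the helper names below are mine:

```c
#include <stddef.h>

#define ONEMASK ((size_t)(-1) / 0xFF)

/*
 * Nonzero iff the word x contains at least one zero byte: subtracting
 * 0x01 from each byte borrows out of the top bit only for bytes which
 * were zero (or had a borrow propagated into them, which ~x masks off).
 */
static size_t
word_has_zero_byte(size_t x)
{

	return ((x - ONEMASK) & (~x) & (ONEMASK * 0x80));
}

/* Number of UTF-8 continuation bytes (10xxxxxx) in the word x. */
static size_t
word_count_continuations(size_t x)
{
	size_t u;

	/* Bit 0 of each byte of u is set iff that byte is 10xxxxxx. */
	u = ((x & (ONEMASK * 0x80)) >> 7) & ((~x) >> 6);

	/* Sum the per-byte counts into the top byte and shift it down. */
	return ((u * ONEMASK) >> ((sizeof(size_t) - 1) * 8));
}
```

The multiply-and-shift at the end works because each byte of `u` holds 0 or 1, so the per-byte sum (at most 8) cannot overflow into an adjacent byte.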

How much faster is this? I put together a slightly improved version of Kragen's benchmark code, using a buffer filled with valid UTF-8 text instead of his more artificial test cases, and ran it on an Opteron 848 @ 2.2 GHz running FreeBSD 7.0-RELEASE-p1 after compiling with gcc 4.2.1 with the -O3 flag set. Some notes to help decipher the output:

* The prefixes identify the code being benchmarked: "gcc" is gcc's built-in strlen, "kjs" is Kragen Sitaker's code, "gp" is George Pollard's code, and "cp" is mine.
* The number following the equals sign is the value returned: a length in bytes for the strlen functions, and a count of UTF-8 characters for the strlen_utf8 functions.
* The timings are the mean and standard deviation of the measured time, in seconds.

The improvement is striking:

testing 33554424 bytes of repeated "hello, world":
                      gcc_strlen =   33554424: 0.034169 +/- 0.000090
                      kjs_strlen =   33554424: 0.049529 +/- 0.000280
                       cp_strlen =   33554424: 0.011357 +/- 0.000030
                 kjs_strlen_utf8 =   33554424: 0.060930 +/- 0.000031
                  gp_strlen_utf8 =   33554424: 0.049675 +/- 0.000294
                  cp_strlen_utf8 =   33554424: 0.014049 +/- 0.000047
testing 33554430 bytes of repeated "naïve":
                      gcc_strlen =   33554430: 0.034168 +/- 0.000069
                      kjs_strlen =   33554430: 0.049544 +/- 0.000287
                       cp_strlen =   33554430: 0.011348 +/- 0.000021
                 kjs_strlen_utf8 =   27962025: 0.061020 +/- 0.000291
                  gp_strlen_utf8 =   27962025: 0.059726 +/- 0.000029
                  cp_strlen_utf8 =   27962025: 0.014041 +/- 0.000043
testing 33554430 bytes of repeated "こんにちは":
                      gcc_strlen =   33554430: 0.034157 +/- 0.000088
                      kjs_strlen =   33554430: 0.049437 +/- 0.000018
                       cp_strlen =   33554430: 0.011438 +/- 0.000286
                 kjs_strlen_utf8 =   11184810: 0.060919 +/- 0.000032
                  gp_strlen_utf8 =   11184810: 0.027454 +/- 0.000031
                  cp_strlen_utf8 =   11184810: 0.014133 +/- 0.000287
Not only is vectorized character counting faster than the "look at a byte, skip a few" approach, it isn't even close: Even when the characters are 3 bytes each (as in the case of "こんにちは"), the vectorized approach wins by a factor of 2; and its lead is larger when the skipping approach can't skip as many bytes. Moreover, vectorized character counting is only 30% slower than a vectorized strlen and more than twice as fast as a non-vectorized strlen -- although given that character counting runs at slightly faster than one byte per clock cycle, it's not surprising that non-vectorized code can't keep up!

Can we do better? I think so. My code uses 64-bit integer registers to manipulate 8 bytes at once; this is the same size as MMX registers, so those probably won't be very useful, but with SSE2, 16 bytes can be manipulated at once, which could provide another doubling of the performance.
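As a sketch of what that might look like (untested against the benchmark above; the function name, and the simplifying requirement that the length be known and a multiple of 16, are my assumptions, not part of the code presented earlier):

```c
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>

/*
 * Count UTF-8 characters in a buffer whose length is known and a
 * multiple of 16 bytes.  Continuation bytes (0x80-0xBF) are exactly
 * the bytes which, interpreted as signed values, are less than -64.
 */
static size_t
sse2_count_utf8(const char * s, size_t len)
{
	const __m128i lim = _mm_set1_epi8(-64);	/* 0xC0 as signed */
	size_t count = 0;
	size_t i;

	for (i = 0; i < len; i += 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)(s + i));

		/* One mask bit per continuation byte in this block. */
		int mask = _mm_movemask_epi8(_mm_cmplt_epi8(v, lim));

		count += __builtin_popcount(mask);
	}

	return (len - count);
}
```

A real replacement would still need the alignment prologue, the zero-byte termination test, and the scalar epilogue from cp_strlen_utf8 above.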

Beyond a doubling? Well, the first rule of optimization is to start by finding a good algorithm -- and any algorithm in which the critical path involves counting UTF-8 characters in a 32 megabyte NUL-terminated string is doing something wrong. This is very much a toy problem; but the lesson it teaches is worth remembering: Vectorization is good!

Posted at 2008-06-05 09:20 | Permanent link | Comments