wchar_t Is a Historical Accident
At first glance, it looks like portable C and C++ programs should use wchar_t for text. It’s portable, and it’s Unicode; what more could you want? It turns out that wchar_t is a bit of a historical accident, and it’s only useful for calling Windows API functions. Avoid wchar_t everywhere else, especially if you are writing portable code.
“But wchar_t is portable!” you say. Unfortunately, the portable things you can do with wchar_t are not very useful, and the useful things you can do with it are not portable.
Note: To be clear, this article is going to gloss over a lot of the deeper problems with text processing. Questions like, “What is a character?” and, “What encoding should we use?” are topics in their own right.
A Bit of History
Unicode 1.0 is published in 1991 after about four years of work. It defines 7,161 characters, which grow to 34,233 by the time 1.1 is released in 1993. Most of these characters are CJK Unified Ideographs.
These early versions are 16-bit, using an encoding called UCS-2, giving 2¹⁶ (65,536) possible characters, minus a few special code points like U+FFFE. During this era the Unicode version of the Win32 API appears, as well as Sun’s new programming language, Java. 16-bit character types are a no-brainer, since Unicode is obviously the way of the future. Everyone is happy because they can work with text written in nearly any language, and they don’t have to recompile their programs to do it.
The new Windows API looks like this:
// “ANSI” version uses a Windows code page for the filename.
HANDLE CreateFileA(const char *lpFileName, DWORD dwDesiredAccess,
DWORD dwShareMode,
LPSECURITY_ATTRIBUTES lpSecurityAttributes,
DWORD dwCreationDisposition,
DWORD dwFlagsAndAttributes,
HANDLE hTemplateFile);
// Unicode version uses UCS-2 (later UTF-16) for the filename.
HANDLE CreateFileW(const wchar_t *lpFileName, DWORD dwDesiredAccess,
DWORD dwShareMode,
LPSECURITY_ATTRIBUTES lpSecurityAttributes,
DWORD dwCreationDisposition,
DWORD dwFlagsAndAttributes,
HANDLE hTemplateFile);
// CreateFile will be an alias for the Unicode or Windows code page
// version of the function, depending on the project build settings.
// New projects should define UNICODE globally and only use Unicode
// versions of functions.
#ifdef UNICODE
# define CreateFile CreateFileW
#else
# define CreateFile CreateFileA
#endif
In 1996, Unicode is expanded by a factor of 17 to make room for future characters. UCS-2 is no longer viable, superseded by UTF-8, UTF-16, and UTF-32.
UTF-8 is brilliant. It’s backwards-compatible with ASCII, it’s compact for text that is mostly Latin characters, you can resynchronize in the middle of a text stream, and you can reuse char for all of your strings as long as you’re careful. Programs can continue to use existing functions like printf with few changes.
UTF-32 is useful in the right situations. With a fixed 32 bits per character (code point), it’s a bit wasteful, but it’s convenient for writing certain algorithms, like Unicode normalization or text segmentation, without having to embed a variable-width decoder in the same code.
UTF-16 is in the awkward middle. Unlike UTF-8, it’s not backwards-compatible with anything but UCS-2. Unlike UTF-32, it’s not fixed-width. But UTF-16 is still a full-fledged Unicode encoding, and it’s well supported on all platforms. There’s nothing wrong with UTF-16; it’s just that people were hoping they could use a 16-bit fixed-width encoding, but ended up with a variable-width one instead!
As a refresher, this is what the encodings look like:
Character | UCS-2 | UTF-8 | UTF-16 | UTF-32
---|---|---|---|---
Latin Capital Letter A (U+0041) A | 0041 | 41 | 0041 | 00000041
Greek Capital Letter Delta (U+0394) Δ | 0394 | CE 94 | 0394 | 00000394
CJK Unified Ideograph (U+904E) 過 | 904E | E9 81 8E | 904E | 0000904E
Musical Symbol G Clef (U+1D11E) 𝄞 | — | F0 9D 84 9E | D834 DD1E | 0001D11E
UTF-8 takes 1-4 bytes, UTF-16 takes 2 or 4, and UTF-32 always takes 4 bytes.
Note the gap in the table. Characters beyond U+FFFF, like 𝄞, 🌍, and 😭 simply cannot be represented in UCS-2. Switching from UCS-2 to UTF-32 would bloat program memory usage and create major API incompatibility problems, so Java and Windows switch to UTF-16.
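To make the D834 DD1E entry in the table concrete, here is a minimal sketch (the function name is made up; it is not part of any API mentioned here) of how a code point above U+FFFF is split into a UTF-16 surrogate pair:
#include <vector>

// Encode one code point as UTF-16 code units.
std::vector<char16_t> encode_utf16(char32_t cp) {
    std::vector<char16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain after removing the offset
        out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF))); // low surrogate
    }
    return out;
}
// encode_utf16(U'𝄞') yields { 0xD834, 0xDD1E }, matching the table above.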
What Happened to Linux and macOS?
Mac and Linux systems don’t have major APIs that use wchar_t.
Mac OS X 10.0.4, the first consumer version of macOS, is released in 2001.
It provides new APIs that will eventually replace older Macintosh APIs.
The Cocoa GUI framework uses Unicode everywhere, storing strings in the NSString class, which hides the details of its encoding (which can vary from string to string!) and forces the programmer to explicitly specify string encodings when converting from C strings.
Here’s a sample from the Foundation framework, the lower-level part
of Cocoa:
@interface NSData
// Almost all Cocoa APIs take NSString instances instead of C
// strings, so no char or wchar_t.
+ (instancetype)dataWithContentsOfFile:(NSString *)path;
@end
@interface NSString
// When you construct an NSString instance, it will be obvious which
// encoding you’re using.
- (instancetype)initWithUTF8String:(const char *)cString;
- (instancetype)initWithCString:(const char *)cString
encoding:(NSStringEncoding)encoding;
// The ‘unichar’ type is UTF-16.
- (instancetype)initWithCharacters:(const unichar *)characters
length:(NSUInteger)length;
@end
// This is always UTF-16 regardless of what wchar_t is.
typedef unsigned short unichar;
System calls on macOS, like open and chdir, consume char *, but since these calls weren’t available prior to macOS 10 they don’t need to be backwards-compatible with existing Mac programs. These functions consume UTF-8 strings.
The operating system translates them to the encoding that the filesystem
uses—for HFS+, this means normalizing the string with a variant of
Unicode normalization format D, and encoding the result in UTF-16.
Meanwhile, Linux slowly moves to using UTF-8 everywhere, but this is a popular convention rather than a decision enforced by the operating system. Linux system calls treat filenames as opaque sequences of bytes, only treating “/” and NUL specially. The interpretation of filenames as characters is left to userspace, and can be configured, but UTF-8 is the default almost everywhere. Linux filesystems, like ext2, faithfully reproduce whatever byte string users choose for filenames, regardless of whether that string is valid Unicode.
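As a small illustration (a sketch, assuming a writable current directory; the file name is made up), the same UTF-8 byte string works as a path on both systems, because these interfaces take plain char * paths:
#include <fcntl.h>
#include <unistd.h>

int create_example(void) {
    // “Büro.txt” spelled out as UTF-8 bytes. Linux stores these bytes
    // verbatim; HFS+ on macOS normalizes them before storing.
    int fd = open("B\xC3\xBCro.txt", O_WRONLY | O_CREAT, 0644);
    if (fd >= 0)
        close(fd);
    return fd;
}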
On both Linux and macOS, none of the important APIs use wchar_t, so that decision is left up to the C standard library. The operating system simply doesn’t care what wchar_t is. On both platforms, wchar_t ends up being UTF-32.
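You can check this for yourself with a trivial program (a sketch, just a size check):
#include <cstdio>

int main(void) {
    // Typically prints 2 on Windows (UTF-16 code units) and 4 on
    // Linux and macOS (UTF-32 code units).
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}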
Evolution of C and C++
The C and C++ committees recognize that wchar_t
has become
somewhat less useful.
Developers need a portable way to write strings with specific Unicode
encodings.
Three new ways to write string literals appear, for UTF-8, UTF-16, and
UTF-32.
// This will use a different encoding depending on what platform you
// compile this for. Maybe your platform is set up to use UTF-8,
// maybe not. Maybe this won’t even compile!
const char *str = "γειά σου κόσμος";
// This will generally use some kind of Unicode encoding, but the
// exact encoding will be different on different platforms. On
// Windows, UTF-16. On Linux and Mac, UTF-32.
const wchar_t *wstr = L"γειά σου κόσμος";
// Always UTF-8.
const char *u8str = u8"γειά σου κόσμος";
// Always UTF-16.
const char16_t *u16str = u"γειά σου κόσμος";
// Always UTF-32.
const char32_t *u32str = U"γειά σου κόσμος";
The wchar_t
type sticks around for compatibility,
but it’s clear that there’s no other reason to use it.
Its existence is a historical accident.
Everything Is Painful
In short, wchar_t is useful on Windows for calling UTF-16 APIs, but on Linux and macOS, it’s not only a completely different encoding but it’s not even useful! No sane developer would choose to use UTF-16 on Windows and then turn around and try to get the same program running in UTF-32 on Linux and macOS, but that’s exactly what you get with wchar_t.
Let’s suppose you’ve ignored this advice and started using wchar_t in your program. Here’s a snippet for parsing an escape sequence in JSON. Remember that JSON escapes code points above U+FFFF as UTF-16 surrogate pairs, but we need to convert that escape sequence differently depending on whether we are on Windows.
// Parse a JSON string (fragment; x, ptr, readHex, and error come from
// the surrounding parser)
std::wstring out;
switch (x) {
case '\\': {
    // Parse escape sequence
    unsigned char c = *ptr++;
    switch (c) {
    case 'u': {
        // Unicode escape sequence \uXXXX
        unsigned codepoint = readHex(4);
        if (codepoint >= 0xd800 && codepoint < 0xdc00) {
            // Combine surrogate pair
            if (*ptr++ != '\\' || *ptr++ != 'u')
                error("expected low surrogate");
            unsigned hi = codepoint, lo = readHex(4);
            codepoint = 0x10000 + ((hi & 0x3ff) << 10) + (lo & 0x3ff);
        } else if (codepoint >= 0xdc00 && codepoint < 0xe000) {
            error("unexpected low surrogate");
        }
#ifdef _WIN32
        // Windows uses UTF-16
        if (codepoint >= 0x10000) {
            out.push_back(0xd800 | ((codepoint - 0x10000) >> 10));
            out.push_back(0xdc00 | (codepoint & 0x3ff));
        } else {
            out.push_back(codepoint);
        }
#else
        // Everyone else uses UTF-32
        out.push_back(codepoint);
#endif
        break;
    }
    // ... other escape sequences ...
    }
    break;
}
// ... other characters ...
}
Here’s a snippet for writing a std::wstring
to an HTML document,
using entities to produce ASCII-only output.
We need to parse the std::wstring
differently depending
on whether it is UTF-16 or UTF-32.
// Escape HTML entities, ASCII-only output
std::wstring text = ...;
std::string out;
for (auto p = text.begin(), e = text.end(); p != e; ++p) {
    switch (*p) {
    case '<': out.append("&lt;"); break;
    case '>': out.append("&gt;"); break;
    case '&': out.append("&amp;"); break;
    case '\'': out.append("&#39;"); break;
    case '"': out.append("&quot;"); break;
    default:
        if (*p > 0x7f) {
            unsigned codepoint;
#ifdef _WIN32
            // UTF-16: combine surrogate pairs before emitting the entity.
            if (*p >= 0xd800 && *p < 0xdc00) {
                wchar_t c1 = p[0], c2 = p[1];
                ++p;
                codepoint = 0x10000 + ((c1 & 0x3ff) << 10) + (c2 & 0x3ff);
            } else {
                codepoint = *p;
            }
#else
            // UTF-32: the code unit is the code point.
            codepoint = *p;
#endif
            out.append("&#");
            out.append(std::to_string(codepoint));
            out.append(";");
        } else {
            out.push_back(static_cast<char>(*p));
        }
        break;
    }
}
Other problems are just as bad.
If you need to find grapheme cluster boundaries or do collation with ICU, then you’ll need to convert your wchar_t strings to UTF-16, except on Windows, and you’ll need to convert the results back to wchar_t afterwards.
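Here is a rough sketch of the kind of shim this forces on you (an assumption-laden sketch: it uses icu::UnicodeString::fromUTF32() and the UChar pointer constructor, which current ICU versions provide, and it omits error handling):
#include <string>
#include <unicode/unistr.h>

icu::UnicodeString to_icu(const std::wstring &ws) {
#ifdef _WIN32
    // wchar_t is already UTF-16; reinterpret the buffer as UChar.
    return icu::UnicodeString(reinterpret_cast<const UChar *>(ws.data()),
                              static_cast<int32_t>(ws.size()));
#else
    // wchar_t is UTF-32; convert code points to UTF-16 code units.
    return icu::UnicodeString::fromUTF32(
        reinterpret_cast<const UChar32 *>(ws.data()),
        static_cast<int32_t>(ws.size()));
#endif
}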
Is this a problem with Windows? No! The problem is that you’ve decided that you want to write a program that uses UTF-16 internally on Windows, and UTF-32 on other platforms. This is going to bite you every time you need to parse text, every time you need to put text on the screen, and every time you encode data in a text format like JSON, HTML, or XML.
Making Everything Way Worse
As an aside, there is an alternative to wchar_t that is actually worse: you can use both wchar_t and char on Windows, depending on your build settings. Microsoft’s passion for backwards compatibility led to the invention of the _T() macro, and a bunch of other macros for selecting Unicode or non-Unicode versions of functions and data structures in your program. That is, they let you compile the same program to use either Unicode or a Windows code page, depending on how you build it.
static const TCHAR WindowClass[] = _T("myWClass"),
Title[] = _T("My Win32 App");
HWND hWnd = CreateWindow(WindowClass, Title, WS_OVERLAPPEDWINDOW,
CW_USEDEFAULT, CW_USEDEFAULT, 640, 480,
NULL, NULL, hInstance, NULL);
The _T() macro turns a string literal into either wchar_t or char text, depending on the build settings. Likewise, CreateWindow is a macro that expands to either CreateWindowA or CreateWindowW depending on the build configuration, and the two functions have different type signatures.
If you’re stuck with legacy code, then this may be the only way to get your code running with Unicode. If you write new code this way, you are insane. Unfortunately, an enormous amount of documentation and sample code is floating around that recommends this path, and new developers get tricked into thinking that this is some kind of “best practice”. It is not.
Fortunately, there is a better way.
Just Choose Your Encodings
Nobody said Unicode is easy, but it’s much easier if you choose your encodings instead of shoehorning your code into the mess that is wchar_t. UTF-8 is a popular choice for data processing and web applications, and UTF-16 is used by ICU, Cocoa, and Win32.
Just make sure that whatever encoding you use, you translate it into UTF-8 when you call open on macOS, and into UTF-16 when you call CreateFileW on Windows.
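For the Windows side of that boundary, a sketch might look like this (error handling omitted; the wrapper name openForReading is made up):
#include <string>
#include <windows.h>

HANDLE openForReading(const std::string &utf8Path) {
    // Ask how many UTF-16 code units the converted path needs,
    // including the terminating NUL, then convert.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, nullptr, 0);
    std::wstring widePath(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, &widePath[0], len);

    return CreateFileW(widePath.c_str(), GENERIC_READ, FILE_SHARE_READ,
                       nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
}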