Let's have a quick show of hands. How many of you have ignored Unicode? If you're like me, you haven't taken the time to really understand it (at least not until recently). After all, Unicode is completely supported only on Windows NT®, meaning a Unicode-based program almost certainly won't run on Windows® 95. In addition, all those funky Unicode macros are a pain to work with and make your code harder to read. Well, if you've been holding out on using Unicode, I've got some pretty interesting test results that might make you reconsider.
If you're up on the architecture of Windows NT, you've probably heard that, under the skin, Windows NT uses Unicode through-and-through. If you use the ANSI (single character) version of an API function, Windows NT converts ANSI to Unicode strings and uses the Unicode version to do the real work. When passing a string to an ANSI-based API (for example, SetComputerNameA), Windows NT converts the ANSI input string to Unicode before calling the Unicode version of the API (for instance, SetComputerNameW). When calling an API where an ANSI string buffer is filled in (for example, GetCurrentDirectoryA), Windows NT uses a Unicode string internally and converts it to ANSI before the API returns.
This month I'll show you how Windows NT supports ANSI-based functions. Then I'll run through a small test program I whipped up to give you an idea of the overhead involved in using ANSI functions with Windows NT. Afterward, it should be pretty clear why all of the programs that come with Windows NT are compiled to use the Unicode APIs. Surprised? Even the omnipresent CALC.EXE is a Unicode program.
To say that Windows NT supports Unicode is a misnomer. A more accurate description is that Windows NT uses Unicode strings natively, and that it also supports ANSI (8-bit) strings. But using ANSI APIs will cost you performance. As you'll see, the ANSI APIs spend a great deal of time in code unrelated to their primary purpose. I'll spend the bulk of this column describing the Windows NT support functions for APIs that use ANSI strings.
Where does Windows NT keep the strings that it's translated between ANSI and Unicode? A quick answer might be, "On the stack!" But you'd be wrong if you guessed this. For one thing, it would be a real pain to handle APIs that pass in null-terminated strings. Without knowing in advance how long a string will be, you can't declare a local variable buffer that you know will be big enough to handle all input strings.
Windows NT manages the ANSI/Unicode translation strings in two ways. For starters, every thread has a buffer reserved for the purpose of holding Unicode strings translated from ANSI strings. In situations where the maximum string length is known in advance, the ANSI APIs use this buffer. When you step through a system DLL and see code that looks like this:
you've encountered the per-thread Unicode string buffer in action. In my May 1996 Under the Hood column I described how FS: contains a pointer to a per-thread data structure. In the code snippet above, the Unicode string buffer is at offset 0xBF8 in this structure.
The other way Windows NT manages ANSI/Unicode string translation is by allocating memory for the buffer. This is most likely to occur in situations where an API takes multiple string parameters or works with strings that may be longer than the per-thread Unicode string buffer can accommodate. I'll show you some examples of this later.
Before digging into the ANSI support runtime library functions, the concept of the RTL_STRING needs to be explained. An RTL_STRING structure is used to represent both ANSI and Unicode strings. It looks something like this:
typedef struct _RTL_STRING
{
    WORD    len;
    WORD    maxLen;
    PVOID   pBuffer;
} RTL_STRING, * PRTL_STRING;
Although the names I use here are probably not exactly what's in the Windows NT sources, they're good enough to understand what's going on. The len field describes how long a string currently is, in bytes. The maxLen field tells the size of the biggest possible string (in bytes) that could fit into this RTL_STRING. The last field, pBuffer, is a pointer to a buffer containing the string. The buffer is at least maxLen bytes in length. Now let's move to the APIs and functions that translate strings using RTL_STRING.
The Windows NT ANSI String Support Library
Many of the internal Windows NT functions that work with strings expect those strings to be in the RTL_STRING format. On the other hand, the Win32® APIs typically work with null-terminated strings. As you'd expect, there's an API to create an RTL_STRING from a null-terminated ANSI or Unicode string. It's called RtlInitAnsiString and, like all of the APIs I'll describe here, the API is exported from NTDLL.DLL.
Figure 1 shows the code for RtlInitAnsiString. It takes a pointer to where the RTL_STRING should be created, as well as a pointer to the null-terminated ANSI string. The API points the pBuffer field directly at the null-terminated string. It then uses an optimized inline version of the strlen function to set the len and maxLen fields accordingly. The maxLen field is always set to one more than the len field to account for the null terminator.
On the Unicode side of things, the equivalent API is RtlInitUnicodeString, shown in Figure 2. It looks exactly like the ANSI version except that when calculating the length of the string it uses an inlined version of wcslen rather than strlen. (wcslen is the wide character version of strlen; the wcs prefix is short for wide character string.) If you dig through STRING.H, you'll see that most of the wide character equivalents to the str functions simply replace the str in their name with wcs.
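To make the two initializers concrete, here's a minimal portable C sketch of the idea. It is not the NTDLL code: CountedString, init_ansi_string, and init_wide_string are hypothetical names, the 16-bit field types are an assumption, and the wide version assumes the terminator occupies two bytes because lengths are counted in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for RTL_STRING; the field types are an assumption. */
typedef struct {
    uint16_t len;      /* current length in bytes, terminator excluded */
    uint16_t maxLen;   /* capacity in bytes, terminator included */
    void    *pBuffer;  /* points at the caller's string; nothing is copied */
} CountedString;

/* ANSI flavor: one byte per character, so maxLen is len + 1. */
void init_ansi_string(CountedString *dst, const char *src)
{
    assert(dst != NULL);
    dst->pBuffer = (void *)src;
    if (src == NULL) {
        dst->len = dst->maxLen = 0;
        return;
    }
    size_t n = strlen(src);
    assert(n + 1 <= UINT16_MAX);           /* must fit the 16-bit fields */
    dst->len = (uint16_t)n;
    dst->maxLen = (uint16_t)(n + 1);
}

/* Wide flavor: 16-bit code units, but the lengths are still in bytes. */
void init_wide_string(CountedString *dst, const uint16_t *src)
{
    assert(dst != NULL);
    dst->pBuffer = (void *)src;
    if (src == NULL) {
        dst->len = dst->maxLen = 0;
        return;
    }
    size_t n = 0;
    while (src[n] != 0)                    /* hand-rolled wcslen for 16-bit units */
        n++;
    size_t bytes = n * sizeof(uint16_t);
    assert(bytes + sizeof(uint16_t) <= UINT16_MAX);
    dst->len = (uint16_t)bytes;
    dst->maxLen = (uint16_t)(bytes + sizeof(uint16_t));
}
```

Calling init_ansi_string on "abc" leaves len at 3 and maxLen at 4, with pBuffer aimed at the original literal. Neither function allocates or copies anything, which is what makes this step cheap.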
Converting between Unicode and ANSI strings is handled via a pair of NTDLL APIs: RtlUnicodeStringToAnsiString and RtlAnsiStringToUnicodeString. Both APIs take two RTL_STRINGs as their first two parameters. One of the RTL_STRINGs is the source and the other is the destination. The third parameter tells the APIs if they should allocate memory for the destination RTL_STRING's buffer. Passing FALSE (don't allocate) means that the destination RTL_STRING has a valid pBuffer field and that the len and maxLen fields are initialized. Passing TRUE indicates that the API should determine how big the destination buffer should be, allocate enough memory for it, and set the len and maxLen fields accordingly.
Figure 3 provides pseudocode for RtlUnicodeStringToAnsiString. The code is pretty simple, so I won't give a blow-by-blow account. A few things are worth highlighting, though. Near the beginning of the function, it checks how long the destination string would be. If this size exceeds 64KB, the function bails out. The implication is that at least some parts of Windows NT won't deal with strings greater than 64KB (I know, a real killer limitation for most of you).
If the caller has requested that the API allocate memory for the destination string, the code calls an internal NTDLL function called NtdllpAllocateStringRoutine. This function is just a wrapper around a call to RtlAllocateHeap. You may know of RtlAllocateHeap through its documented name, HeapAlloc. The NtdllpAllocateStringRoutine function uses the default process heap for the allocation. (See the GetProcessHeap documentation for details on the default heap.) After preparing everything, RtlUnicodeStringToAnsiString finally makes the critical call to RtlUnicodeToMultiByteN.
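The overall flow (size check, optional allocation, then the conversion itself) can be sketched in portable C. This is only an illustration under stated assumptions: wide_to_ansi and free_counted_string are hypothetical names, malloc and free stand in for the process-heap calls, and the conversion handles just the single-byte case, replacing anything above 0xFF with '?'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for RTL_STRING; the field types are an assumption. */
typedef struct { uint16_t len; uint16_t maxLen; void *pBuffer; } CountedString;

/* Returns 0 on success, -1 on failure. src holds 16-bit code units. */
int wide_to_ansi(CountedString *dst, const CountedString *src, bool allocate)
{
    assert(dst != NULL && src != NULL);
    size_t chars  = src->len / sizeof(uint16_t);
    size_t needed = chars + 1;             /* one byte per character plus terminator */

    /* The result must be describable by 16-bit lengths. With a multibyte
       code page the needed size could be up to twice as large. */
    if (needed > UINT16_MAX)
        return -1;

    if (allocate) {
        dst->pBuffer = malloc(needed);     /* stand-in for the heap allocation */
        if (dst->pBuffer == NULL)
            return -1;
        dst->maxLen = (uint16_t)needed;
    } else if (dst->pBuffer == NULL || (size_t)dst->maxLen < needed) {
        return -1;                         /* caller-supplied buffer is too small */
    }

    const uint16_t *in = src->pBuffer;
    char *out = dst->pBuffer;
    for (size_t i = 0; i < chars; i++)
        out[i] = (in[i] <= 0xFF) ? (char)in[i] : '?';
    out[chars] = '\0';
    dst->len = (uint16_t)chars;
    return 0;
}

/* Releases a buffer obtained with allocate == true. */
void free_counted_string(CountedString *s)
{
    assert(s != NULL);
    if (s->pBuffer != NULL) {
        free(s->pBuffer);
        s->pBuffer = NULL;
        s->len = s->maxLen = 0;
    }
}
```

With allocate set to true, the caller owns the result and must release it afterward. With false, the caller supplies the buffer and the function only verifies that it is big enough.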
The RtlUnicodeToMultiByteN function is ground zero for converting Unicode to ANSI. Figure 4 shows pseudocode for this function. Since this function is so heavily used, it's optimized to the hilt. When you're not using a National Language Support (NLS) code page (as is the case on my system), the function converts the string in chunks, 16 characters at a time.
Why go to the hassle of converting chunks 16 characters at a time? The RtlUnicodeToMultiByteN function is written this way to cut down on the number of jumps that would occur if every character was converted individually in a loop. At the CPU level, every jump or call instruction is expensive because the processor's prefetch queue is flushed. On newer processors such as the Pentium, branch prediction and multiple pipelines help alleviate this problem. However, this function was most likely written before the Pentium arrived on the scene.
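The chunking trick is ordinary loop unrolling. Here's a minimal C sketch of the technique, assuming the simple case where each Unicode character is narrowed by keeping its low byte; narrow_unrolled and the NARROW macro are hypothetical names, not the NTDLL code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow one character: keep the low byte (assumption: no NLS table). */
#define NARROW(i) dst[i] = (unsigned char)src[i]

/* Converts count 16-bit characters to bytes and returns count. */
size_t narrow_unrolled(unsigned char *dst, const uint16_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    size_t remaining = count;

    /* Main loop: 16 conversions per pass, so one branch per 16 characters. */
    while (remaining >= 16) {
        NARROW(0);  NARROW(1);  NARROW(2);  NARROW(3);
        NARROW(4);  NARROW(5);  NARROW(6);  NARROW(7);
        NARROW(8);  NARROW(9);  NARROW(10); NARROW(11);
        NARROW(12); NARROW(13); NARROW(14); NARROW(15);
        dst += 16;
        src += 16;
        remaining -= 16;
    }

    /* Tail: whatever is left over, one character at a time. */
    while (remaining > 0) {
        *dst++ = (unsigned char)*src++;
        remaining--;
    }
    return count;
}
```

A modern optimizing compiler will often unroll or vectorize a plain loop by itself, so writing it out by hand like this is mostly of historical interest.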
When an NLS table is used, RtlUnicodeToMultiByteN translates the characters one at a time. For each iteration of the loop, the Unicode character is used to look up a WORD in the NLS table. If the high BYTE is nonzero, that character requires two bytes in the destination string (that is, it's a multibyte character). If the high BYTE is zero, only the low BYTE is copied to the destination string.
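The table-driven path might look roughly like the following C sketch. The table layout, the table_size bound, the '?' fallback, and the choice to store the high byte first are all assumptions made for illustration; wide_to_multibyte is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts count characters through a lookup table of 16-bit entries.
   Returns the number of bytes written to dst. */
size_t wide_to_multibyte(unsigned char *dst, size_t dst_size,
                         const uint16_t *src, size_t count,
                         const uint16_t *table, size_t table_size)
{
    assert(dst != NULL && src != NULL && table != NULL);
    size_t written = 0;

    for (size_t i = 0; i < count; i++) {
        /* Characters outside the (hypothetical) table become '?'. */
        uint16_t entry = (src[i] < table_size) ? table[src[i]] : (uint16_t)'?';
        unsigned char hi = (unsigned char)(entry >> 8);
        unsigned char lo = (unsigned char)(entry & 0xFF);

        if (hi != 0) {                     /* multibyte: needs two bytes */
            if (dst_size - written < 2)
                break;
            dst[written++] = hi;           /* assumption: high byte goes first */
            dst[written++] = lo;
        } else {                           /* single byte: low byte only */
            if (dst_size - written < 1)
                break;
            dst[written++] = lo;
        }
    }
    return written;
}
```

Unlike the unrolled path, this loop has a data-dependent branch on every character, which goes some way toward explaining why the non-NLS case gets its own fast path.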
I'm going to skip a detailed description of the inverse functions of RtlUnicodeStringToAnsiString and RtlUnicodeToMultiByteN since they're not terribly different. In summary, there is an RtlAnsiStringToUnicodeString function that uses the RtlMultiByteToUnicodeN function. They look like mirror images of the functions I've just examined.
Sometimes RtlAnsiStringToUnicodeString and RtlUnicodeStringToAnsiString are directly invoked from the ANSI API wrapper code. Other times, another layer is interposed and sits above them. This layer is used by the file system APIs, which may be using an OEM character set. APIs that work with file names use the Basep8BitStringToUnicodeString and BasepUnicodeStringTo8BitString functions rather than the RtlXXXStringToYYYString APIs.
Figure 5 shows pseudocode for Basep8BitStringToUnicodeString and BasepUnicodeStringTo8BitString. As you can see, they're really just wrappers around the RtlXXXStringToYYYString APIs described earlier, even down to the parameters they accept. The main logic in the code simply determines whether the file APIs are using OEM strings, and then calls the appropriate RtlXXXStringToYYYString API. The SetFileApisToOEM and SetFileApisToANSI APIs tweak an NTDLL global variable called BasepFileApisAreOem in the pseudocode. If this variable is nonzero, the BasepXXX functions use RtlOemStringToUnicodeString and RtlUnicodeStringToOemString. Otherwise, they use the ANSI/Unicode APIs.
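The dispatch logic boils down to one global flag choosing between two converters. The C sketch below shows that shape with toy single-character converters. All names are hypothetical, and the only non-ASCII character handled is U+00E9, which is 0xE9 in the Windows-1252 ANSI code page and 0x82 in OEM code page 437.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag mirroring the role the text gives BasepFileApisAreOem. */
static bool file_apis_are_oem = false;

void set_file_apis_to_oem(void)  { file_apis_are_oem = true; }
void set_file_apis_to_ansi(void) { file_apis_are_oem = false; }

/* Toy ANSI converter: ASCII passes through, U+00E9 maps to 0xE9. */
unsigned char unicode_to_ansi_char(uint16_t ch)
{
    if (ch < 0x80)
        return (unsigned char)ch;
    return (ch == 0x00E9) ? 0xE9 : '?';
}

/* Toy OEM converter: ASCII passes through, U+00E9 maps to 0x82. */
unsigned char unicode_to_oem_char(uint16_t ch)
{
    if (ch < 0x80)
        return (unsigned char)ch;
    return (ch == 0x00E9) ? 0x82 : '?';
}

/* The wrapper: pick a converter based on the current file API mode. */
unsigned char unicode_to_8bit_char(uint16_t ch)
{
    return file_apis_are_oem ? unicode_to_oem_char(ch)
                             : unicode_to_ansi_char(ch);
}
```

The same file name can therefore come back as different bytes depending on which mode the process has selected, which is why the file system APIs need this extra layer and the other APIs don't.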
Earlier I mentioned the NtdllpAllocateStringRoutine, which the string translation code uses to dynamically allocate memory for a temporary string. There's a corresponding set of functions for releasing a string's memory when the system is finished with it. Figure 6 shows pseudocode for RtlFreeAnsiString and RtlFreeUnicodeString.
The functions are identical in implementation. They both take an RTL_STRING as input, and if the pBuffer field is nonzero, they pass that pointer to an internal NTDLL function called NtdllpFreeStringRoutine. This function (also shown in Figure 6) is just a wrapper around the RtlFreeHeap API, better known as HeapFree.
This wraps up my tour of the Windows NT ANSI/Unicode string translation code. While I haven't shown every last detail, I've covered enough that I can now demonstrate these functions in their natural habitat. Although they appear to be pretty efficient in their implementation, they add overhead to every API that uses ANSI strings. With that in mind, let's check out how some well-known APIs use these functions.
Some APIs That Use String Translation
The first API I'll look at is GetModuleHandleA (see Figure 7). I chose this API because it uses the string conversion functions in a rudimentary manner. The first block of code handles the special case where the hModule parameter is zero; it isn't of interest here. The good part begins in the else clause, where the API creates an RTL_STRING from the null-terminated input string. Next, the code calls Basep8BitStringToUnicodeString to translate the ANSI RTL_STRING into a Unicode RTL_STRING. The key thing to note here is that the thread's static Unicode buffer (pTeb->staticUnicodeRTL_STRING) is where the Unicode string winds up. As a result, no memory is allocated. After the string is in Unicode form, GetModuleHandleA simply passes it to GetModuleHandleW and returns whatever that API returns.
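The pattern here (widen into a per-thread scratch buffer, then hand off to the wide version) can be sketched in portable C11. Everything named below is hypothetical: name_length_w stands in for the "real" Unicode API, the 261-character capacity is an arbitrary assumption, and the widening simply zero-extends each byte rather than consulting a code page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATIC_WIDE_CHARS 261   /* assumption: capacity of the scratch buffer */

/* One scratch buffer per thread, so no allocation and no locking. */
static _Thread_local uint16_t static_wide_buffer[STATIC_WIDE_CHARS];

/* Stand-in for the wide API that does the real work. */
size_t name_length_w(const uint16_t *name)
{
    size_t n = 0;
    while (name[n] != 0)
        n++;
    return n;
}

/* ANSI wrapper: translate, then forward. Returns -1 if the name won't fit. */
long name_length_a(const char *name)
{
    assert(name != NULL);
    size_t i = 0;
    for (; name[i] != '\0'; i++) {
        if (i + 1 >= STATIC_WIDE_CHARS)
            return -1;                     /* leave room for the terminator */
        static_wide_buffer[i] = (unsigned char)name[i];
    }
    static_wide_buffer[i] = 0;
    return (long)name_length_w(static_wide_buffer);
}
```

The wrapper is only safe because the wide function finishes with the buffer before returning. Any second conversion on the same thread overwrites the first, which is one reason an API taking several strings has to allocate instead.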
The next ANSI-based API to examine is SetComputerNameA (see Figure 8). It has a few more twists than GetModuleHandleA. Although it also starts with a call to RtlInitAnsiString, it doesn't use Basep8BitStringToUnicodeString. After all, the computer name has nothing to do with file system names and that whole ANSI/OEM thing. Instead, the code uses the lower-level RtlAnsiStringToUnicodeString API I described earlier. Another twist is that SetComputerNameA allocates memory for the Unicode string, rather than using the thread's static Unicode buffer area. When the Unicode RTL_STRING is ready, SetComputerNameA calls its Unicode-equivalent API, SetComputerNameW. Before returning, the API calls RtlFreeUnicodeString to release the Unicode string it allocated earlier.
So much for dealing with ANSI strings on the input side. What about the case where an ANSI string needs to be returned to the caller? Let's start by looking at GetCurrentDirectoryA (see Figure 9). The code begins by calling the private RtlGetCurrentDirectory_U API, passing it the address of the per-thread static Unicode buffer. Next, GetCurrentDirectoryA checks to see if the output ANSI string buffer is big enough to hold the complete directory string. If so, the code calls BasepUnicodeStringTo8BitString, which translates the Unicode string into an ANSI or OEM string. Remember, GetCurrentDirectoryA is a file system API, so ANSI versus OEM matters here.
If the output buffer isn't big enough, the API returns the number of characters needed to hold the string. Likewise, if the buffer is big enough, the API returns the number of characters that were copied, not counting the null terminator. Now, here's a conundrum: how do you tell if the API worked or didn't based on just the return value? You can't. Instead, you have to do something cheesy like call GetLastError or verify that the output buffer was written to. Another alternative is to pass in a buffer that (you hope) will always be big enough.
The final API I'll look at this month is GetModuleFileNameA (see Figure 10). Like GetCurrentDirectoryA, it fills in an output buffer with an ANSI string. The API begins by using RtlAllocateHeap (HeapAlloc) to create a buffer that will be used for a Unicode string. It then calls its Unicode equivalent, GetModuleFileNameW, passing it the buffer it just allocated. Next, the code calls BasepUnicodeStringTo8BitString, which translates the Unicode result from GetModuleFileNameW into a temporary 8-bit string. Note that the call to BasepUnicodeStringTo8BitString specifies that the output string buffer should be allocated.
If the Unicode string successfully converts to an 8-bit (ANSI or OEM) string, GetModuleFileNameA uses the memcpy function to copy the temporary 8-bit string into the output buffer that was passed in as a parameter. Before GetModuleFileNameA can return, it's important that it clean up. After all, it allocated two string buffers, one for a Unicode string and one for the 8-bit temporary string. The function releases them by calling RtlFreeAnsiString and RtlFreeUnicodeString. Remember from my earlier description that these two private APIs are just wrappers around RtlFreeHeap (HeapFree).
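Counting the work involved makes the cost easier to see: two allocations, two frees, a conversion, and a copy, all wrapped around one call to the wide function. The C sketch below mimics that sequence with hypothetical names; get_name_w stands in for the Unicode API and returns a fixed path, and malloc and free stand in for the heap calls.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in wide API: fills buf with a fixed name. Returns the number of
   characters written (terminator excluded), or 0 if buf is too small. */
size_t get_name_w(uint16_t *buf, size_t cch)
{
    static const char name[] = "C:\\TEST\\APP.EXE";
    size_t n = sizeof name - 1;
    if (cch <= n)
        return 0;
    for (size_t i = 0; i <= n; i++)        /* copies the terminator too */
        buf[i] = (unsigned char)name[i];
    return n;
}

/* ANSI wrapper following the allocate / call / convert / copy / free shape. */
size_t get_name_a(char *out, size_t cch)
{
    assert(out != NULL);
    if (cch == 0)
        return 0;

    uint16_t *wide = malloc(cch * sizeof *wide);   /* allocation #1 */
    if (wide == NULL)
        return 0;

    size_t copied = 0;
    size_t n = get_name_w(wide, cch);
    if (n > 0) {
        char *narrow = malloc(n + 1);              /* allocation #2 */
        if (narrow != NULL) {
            for (size_t i = 0; i < n; i++)
                narrow[i] = (wide[i] <= 0xFF) ? (char)wide[i] : '?';
            narrow[n] = '\0';
            memcpy(out, narrow, n + 1);            /* n < cch, so this fits */
            copied = n;
            free(narrow);
        }
    }
    free(wide);
    return copied;
}
```

None of the extra steps exist in the wide path, where the caller's buffer is filled in directly.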
You may be wondering how the ANSI string manipulation APIs (for example, lstrlenA and lstrcpyA) are implemented. These APIs are simple enough that they're implemented without calls to lower-level system functions. As a result, the ANSI string APIs don't need to translate their input and output parameters between ANSI and Unicode and, therefore, should be as fast as their Unicode equivalents.
ANSI versus Unicode API Benchmarking
So far I've shown you the APIs and functions that Windows NT uses to translate between ANSI and Unicode strings, as well as how they're used by some Win32 APIs. With this understanding, it's now worthwhile to take a look at the performance hit you can expect from using the ANSI APIs with Windows NT. I think you'll be shocked at the results.
For my test I selected three of the APIs that I examined earlier: GetModuleHandle, GetModuleFileName, and GetCurrentDirectory. I then wrote a test program that times the ANSI and Unicode versions of these APIs. (I didn't include SetComputerName because I didn't want to be responsible for changing your computer's name in case the program crashed.) Because each of these APIs is relatively fast to execute, the resolution of the system timer wouldn't be granular enough. Instead, I relied on the standard trick of making multiple calls in a loop and timing how long it takes for the entire operation to complete.
Before I get to the benchmarking code and, more importantly, the results, let me say a few things about my efforts to make the timings reliable. I used the QueryPerformanceCounter API, which provides timings at the microsecond level. According to the QueryPerformanceFrequency API, the performance counter increments 1.19 million times a second, a number that corresponds to one of the traditional timers found on x86-based systems. To prevent the numbers from being skewed by too few samples, I executed the APIs 50,000 times in a loop.
Since the code I'm timing takes a relatively long time to execute, it's virtually guaranteed that the thread's time slice will end while the loop is still executing, thereby affecting the outcome. I took two steps to minimize this effect. First, I bumped the program's thread priority up to THREAD_PRIORITY_TIME_CRITICAL to minimize the amount of time that other threads would have in the CPU. Second, I called Sleep(0) before both the ANSI and Unicode loops. The idea is to start the loop at the very beginning of a time slice. For all you performance/timing gurus out there, let me remind you that I'm not an actual performance profiling expert; I just play one on TV.
My ANSI/Unicode timing program is called AnsiUniTiming, and the code is shown in Figure 11. Function main consists of two nearly identical parts. The first part times the three ANSI functions in a loop and reports the amount of time they took to execute. The second part repeats the same basic steps, with the only difference being that the Unicode API equivalents are used. For the call to GetModuleHandle I used KERNEL32.DLL since it's guaranteed to be loaded in the process. Passing 0 would have bypassed the string handling code that I showed in the pseudocode for GetModuleHandleA. For the call to GetModuleFileNameA, I used the value returned by the prior call to GetModuleHandle. The GetCurrentDirectory call needs no explanation.
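The time-a-loop technique itself isn't specific to Windows. Here's a minimal portable C11 sketch of it, not the AnsiUniTiming code: time_calls and sample_work are hypothetical names, timespec_get stands in for QueryPerformanceCounter, and the priority boost and Sleep(0) steps are left out because they have no portable equivalent.

```c
#include <assert.h>
#include <time.h>

typedef void (*WorkFn)(void);

/* Counter bumped by the sample workload so the loop has a visible effect. */
volatile unsigned long sample_calls = 0;

/* Placeholder workload; a real test would call the API being measured. */
void sample_work(void)
{
    sample_calls++;
}

/* Calls fn the given number of times and returns the elapsed
   wall-clock time in seconds. */
double time_calls(WorkFn fn, unsigned long iterations)
{
    assert(fn != NULL);
    struct timespec start, end;

    timespec_get(&start, TIME_UTC);
    for (unsigned long i = 0; i < iterations; i++)
        fn();
    timespec_get(&end, TIME_UTC);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

To compare two variants, call time_calls once for each with the same iteration count and divide the results. Timing the whole loop rather than individual calls is what gets around the coarse timer resolution.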
Before I tell you the results, stop and guess the difference between the ANSI and Unicode timings (no peeking!). On my system (a single-processor Pentium Pro 200 MHz running Windows NT 4.0), I obtained results that were remarkably similar no matter how many times I ran the program. Here's the output from a typical run:
ANSI version took 0.5736 seconds
Unicode version took 0.1923 seconds
Wow! Stripping the two least significant digits, tossing the decimal points, and dividing gives 57/19, or 3. The ugly fact is that the ANSI versions of GetModuleHandle, GetModuleFileName, and GetCurrentDirectory take three times as long as the equivalent Unicode versions. It's likely that the relative performance hit of the ANSI APIs isn't divided equally among them. If you feel ambitious, feel free to split them out into their own loops and time them independently. Still, the fact remains that commonly used ANSI APIs incur a large performance penalty over their Unicode equivalents.
The moral of this story is that if you're writing exclusively for Windows NT, and if performance is an issue, you should consider becoming familiar with TCHAR.H and the other mechanisms used to write executables that use Unicode. It's a bit of a pain at first, especially with all the compiler warnings and errors you'll probably need to correct. In time, though, you should be able to write Unicode-ready code without giving it much thought. Even if you're writing for other Win32 platforms, the benefits of Unicode may warrant creating and distributing multiple executables: one that uses ANSI and runs anywhere, and another using the Unicode APIs optimized for Windows NT.
Have a question about programming in Windows? Send it to Matt at firstname.lastname@example.org