Writing Secure Code, Second Edition
Authors Michael Howard and David LeBlanc
Pages 800
Disk N/A
Level Intermediate
Published 12/04/2002
ISBN 9780735617223
Price $49.99
Chapter 10: All Input is Evil!




If someone you didn't know came to your door and offered you something to eat, would you eat it? No, of course you wouldn't. So why do so many applications accept data from strangers without first evaluating it? It's safe to say that most security exploits involve the target application checking incoming data incorrectly or, in some cases, not checking it at all. So let me be clear about this: you should not trust data until it is validated. Failure to do so will render your application vulnerable. Or, put another way: all input is evil until proven otherwise. That's rule number one. Typically, the moment you forget this rule is the moment you are attacked.

Rule number two is: data must be validated as it crosses the boundary between untrusted and trusted environments. By definition, trusted data is data that you, or an entity you explicitly trust, has complete control over; untrusted data refers to everything else. In short, any data submitted by a user is initially untrusted. The reason I bring this up is that many developers balk at checking input because they are positive the data is checked by some other function before it reaches their code, and they don't want to take the performance hit of validating the data more than once. But what happens if the input comes from a source that is not checked, or if the code you depend on changes because it assumes some other code performs the validity check? And here's a related question: what happens if an honest user simply makes an input mistake that causes your application to fail? Keep these questions in mind as I discuss some potential vulnerabilities and exploits.

I once reviewed a security product that had a security flaw because a small chance existed that invalid user input would cause a buffer overrun and stop the product's Web service. The development team claimed that it could not check all the input because of potential performance problems. On closer examination, I found that not only was the application a critical network component—and hence the potential damage from an exploit was immense—but it also performed many time-intensive and CPU-intensive operations, including public-key encryption, heavy disk I/O, and authentication. I doubted very much that a half dozen lines of input-checking code would lead to a performance problem, especially because the code was not called often. As it turned out, the added checks caused no measurable performance problems, and the flaw was fixed. Performance is rarely a problem when checking user input. Even if it were, no system is less reliably responsive than a hacked one.

Hopefully, by now, you understand that all input is suspicious until proven otherwise, and your application should validate direct user input before it uses it. The purpose of this chapter is to serve as an introduction to the next four chapters, which outline canonical representation issues, database and Web-specific input issues, and internationalization issues.

Let's now look at some high-level strategies for handling hostile input.

The Issue

The real issue with trusting input is this: many applications today distribute functionality between client and server machines or between peers, and many developers rely on the client portion of the application to provide specific behavior. However, once deployed, the client software is no longer under the control of the developer or the server administrators, so there is no guarantee that a request made by the client came from a valid client; the request may have been forged. Hence, the server can never trust the client request. The critical issue is trust and, more accurately, attributing too much trust to data provided by an untrusted entity. The same concept applies to the client. Does the client code really trust the data from the server, or is the server a rogue server? A good example of client-side attacks is cross-site scripting, discussed in detail in Chapter 13, "Web-Specific Input Issues."

Misplaced Trust

When you're analyzing designs and code, it's often easy to find areas of vulnerability by asking two simple questions. Do I trust the data at this point? And what are the assumptions about the validity of the data? Let's take a buffer overrun example. Buffer overruns occur for the following reasons:

  • The data came from an untrusted source (an attacker!).
  • Too much trust was placed in the data format—in this case, the buffer length.
  • A potentially hazardous event occurs—in this case, the untrusted buffer is written into memory.

Take a look at this code. What's wrong with it?

void CopyData(char *szData) {
   char cDest[32];
   strcpy(cDest,szData);

   // use cDest
   ...
}
Surprisingly, there may be nothing wrong with this code! It all depends on how CopyData is called and whether szData comes from a trusted source. For example, the following code is safe:

char *szNames[] = {"Michael","Cheryl","Blake"};
CopyData(szNames[1]);

The code is safe because the names are hard-coded and therefore each string does not exceed 32 characters in length; hence, the call to strcpy is always safe. However, if the sole argument to CopyData, szData, comes from an untrusted source—such as a socket or a file with a weak access control list (ACL)—then strcpy will copy the data until it hits a null character. And if the data is greater than 32 characters in length, the cDest buffer is overrun and any data above the buffer in memory is clobbered. Figure 10-1 shows the relationship between the call to strcpy and the three points I made earlier.


Figure 10-1  The conditions for calling strcpy in an unsafe manner.

Scrutinize this example and you'll notice that if you remove any of the conditions, the chance of a buffer overrun is zero. Remove the memory-copying nature of strcpy and you cannot overflow a buffer, but that's not realistic, because a non-copying version is worthless! If the data always comes from a trusted source (for example, a highly trusted user or a well-ACL'd file), you can assume the data is well-formed. Finally, if the code makes no assumptions about the data and validates it prior to copying it, then once again the code is safe from buffer overruns; when you check the data's validity before copying it, it doesn't matter whether the data came from a trusted source. This leads to just one acceptable way to make this code secure: first check that the data is valid, and do not trust it until its legality is verified.

The following code is less trusting and is therefore more secure:

void CopyData(char *szData, DWORD cbData) {
    const DWORD cbDest = 32;
    char cDest[cbDest];

    if (szData != NULL && cbDest > cbData) {
        strncpy(cDest, szData, min(cbDest, cbData));
        cDest[cbData] = '\0';   // strncpy does not guarantee a terminator
    }

    // use cDest
    ...
}

The code still copies the data (strncpy), but because the szData and cbData arguments are untrusted, the code limits the amount of data copied to cDest. You might think this is a lot of extra code just to check the data's validity, but it isn't: a little extra code can protect the application from serious attack. Besides, if the insecure code is attacked, you'd have to make these fixes anyway, so get it right the first time.
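One detail worth calling out is that strncpy does not null-terminate the destination when the source fills it. The sketch below (the function name and layout are mine, not the book's) makes both the bound check and the termination explicit:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical variant of CopyData: the destination size is passed
// explicitly, the untrusted length is checked before any copy happens,
// and the result is always a valid null-terminated string.
bool CopyDataSafe(char *dest, size_t cbDest, const char *szData, size_t cbData) {
    if (dest == nullptr || cbDest == 0)
        return false;
    if (szData == nullptr || cbData >= cbDest) {
        dest[0] = '\0';           // fail closed: leave a valid empty string
        return false;
    }
    memcpy(dest, szData, cbData); // bounded: cbData < cbDest was checked above
    dest[cbData] = '\0';          // explicit terminator
    return true;
}
```

Returning a success flag also forces callers to notice when untrusted input was rejected rather than silently truncated.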

Earlier I mentioned that weak ACLs lead to untrusted data. Imagine a registry key that determines which file to update with log information and that has an ACL of Everyone (Full Control). How much trust can you assign to the data in that key? None, because anyone can change the filename; an attacker could set it to c:\boot.ini, for example. The data in this key can be trusted more if the ACL is Administrator (Full Control) and Everyone (Read); in that case, only an administrator can change the data, and administrators are trusted entities in the system. With proper ACLs, the concept of trust is transitive: because you trust the administrators and because only administrators can change the data, you trust the data.

A Strategy for Defending Against Input Attacks

The simplest and by far the most effective way to defend your application from input attacks is to validate the data before performing any further processing. To achieve this, you should adhere to the following strategies:

  • Define a trust boundary around the application.
  • Create an input chokepoint.

First, all applications have a point in the design beyond which the data is believed to be well-formed and safe because it has been checked. Once the data is inside that trust boundary, there should be no reason to check it again for validity—assuming, that is, the checking code did a good job! That said, the principle of defense in depth dictates that you should employ multiple layers of defense in case a layer is compromised, and that is quite true. I'll leave it to you to find the balance between security and performance for your application; it will depend on the sensitivity of the data and the environment in which the application operates.

Next, you should perform the check at the boundary of the trusted code base. You must define that point in the design; it must be in one place, and no input should be allowed into the trusted code base without going through that chokepoint. Note that you can have multiple chokepoints, one for each data source (Web, registry, file system, configuration files, and so on), but data from one source should not enter the trusted code base through any other chokepoint but the one designated for that source.
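As a sketch of how a per-source chokepoint might look in code (the source names and the validation rules here are entirely mine and purely illustrative), each data source maps to exactly one validator, and unknown sources fail closed:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

enum class Source { Web, Registry, ConfigFile };

// The single entry point into the trusted code base: one validator per
// data source, and no input is accepted from a source without one.
bool ValidateAtChokepoint(Source src, const std::string &data) {
    static const std::map<Source, std::function<bool(const std::string &)>> validators = {
        { Source::Web,        [](const std::string &d) { return !d.empty() && d.size() <= 1024; } },
        { Source::Registry,   [](const std::string &d) { return d.find("..") == std::string::npos; } },
        { Source::ConfigFile, [](const std::string &d) { return d.find('\n') == std::string::npos; } }
    };
    auto it = validators.find(src);
    if (it == validators.end())
        return false;  // unknown source: fail closed
    return it->second(data);
}
```

The point of the map is structural: a developer cannot route registry data through the Web validator without it being obvious in code review.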

As you can see, the concept of the trusted boundary and chokepoint are tightly related. Figure 10-2 graphically shows the concepts of a trust boundary and chokepoints.


Figure 10-2  The concept of a trust boundary and chokepoints.

Note that the service and the service's data store have no chokepoint between them. That's because they're both inside the trust boundary and data never enters the boundary from any data source without being validated first at a chokepoint. Therefore, only valid data can flow from the service to the data store and vice versa.

A common vulnerability on the Web today is the cross-site scripting error, in which malicious input, comprising HTML and script, is echoed from an insecure Web site into an unsuspecting user's browser. I won't give the game away; it's all explained in detail in Chapter 13. Many Web sites have these flaws, and the sites' administrators don't know it. In early 2001 I spent time providing security training for the developers of a very large Web site that has never had such an issue. The reason is that the Web application has two chokepoints: one for data coming from the user (or attacker) and another for data flowing back to the user. All input and output goes through these two chokepoints, which enforce very strict validity checking, and any developer who violates this policy by reading or writing Web traffic through other means is "spoken to!" Now that I've brought up the subject, let's discuss validity checking.

How to Check Validity

When checking input for validity, you should follow this rule: look for valid data and reject everything else. The principle of failing securely, outlined in Chapter 3, "Security Principles to Live By," means you should deny all access until you determine the request is valid. You should look for valid data rather than invalid data for two reasons:

  • There might be more than one valid way to represent the data.
  • You might miss an invalid data pattern.

The first point is explained in more detail in Chapter 11, "Canonical Representation Issues." It's common to escape characters in a way that is valid, but it's also possible to hide invalid data by escaping it; because the data is escaped, your code might not detect that it's invalid.
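To make the first point concrete: the characters %2e%2e look harmless until they are URL-decoded into "..", so a check performed before decoding misses the invalid form. The decoder below is a deliberately minimal sketch of my own, not production code; a real decoder must also handle malformed escapes defensively.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Minimal URL decoder: turns %xx hex escapes into the character they
// encode and leaves everything else alone.
std::string UrlDecode(const std::string &in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size() &&
            isxdigit((unsigned char)in[i + 1]) &&
            isxdigit((unsigned char)in[i + 2])) {
            out += (char)std::stoi(in.substr(i + 1, 2), nullptr, 16);
            i += 2;
        } else {
            out += in[i];
        }
    }
    return out;
}

// The naive "invalid data" check an attacker can sidestep with escaping.
bool ContainsDotDot(const std::string &s) {
    return s.find("..") != std::string::npos;
}
```

Checking ContainsDotDot before decoding passes the escaped form; checking after decoding catches it.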

The second point is very common indeed. Let me explain by way of a simple example. Imagine your code takes requests from users to upload files, and the request includes the filename. Your application will not allow a user to upload executable code because it could compromise the system. Therefore, you have code that looks a little like this:

bool IsBadExtension(char *szFilename) {
    bool fIsBad = false;

    if (szFilename) {
        size_t cFilename = strlen(szFilename);
        if (cFilename >= 4) {       // need at least 4 chars to hold ".exe"
            char *szBadExt[]
                = {".exe", ".com", ".bat", ".cmd"};
            char *szLCase
                = _strlwr(_strdup(szFilename));

            if (szLCase) {
                for (int i = 0;
                    i < sizeof(szBadExt) / sizeof(szBadExt[0]);
                    i++)
                    if (szLCase[cFilename-1] == szBadExt[i][3] &&
                        szLCase[cFilename-2] == szBadExt[i][2] &&
                        szLCase[cFilename-3] == szBadExt[i][1] &&
                        szLCase[cFilename-4] == szBadExt[i][0])
                        fIsBad = true;

                free(szLCase);      // release the _strdup'd copy
            }
        }
    }

    return fIsBad;
}

bool CheckFileExtension(char *szFilename) {
    if (!IsBadExtension(szFilename))
        if (UploadUserFile(szFilename)) {
            NotifyUserUploadOK(szFilename);
            return true;
        }
    return false;
}

What's wrong with the code? IsBadExtension performs a great deal of error checking, and it's reasonably efficient. The problem is the list of "invalid" file extensions. It's nowhere near complete; in fact, it's hopelessly lacking. A user could upload many more executable file types, such as Perl scripts (.pl) or perhaps Windows Script Host files (.wsh, .js, and .vbs), so you decide to update the code to reflect the other file types. However, a week later you realize that Microsoft Office documents can contain macros (.doc, .xls, and so on), which technically makes them executable code. Yet again you update the list of bad extensions, only to find that there are still more executable file types. It's a never-ending battle. The only correct way to achieve the goal is to look for valid, safe extensions and to reject everything else. For example, in the Web file upload scenario, you might decide that users can upload only certain text document types and graphics, so the secure code looks like this:

bool IsOKExtension(char *szFilename) {
    bool fIsOK = false;

    if (szFilename) {
        size_t cFilename = strlen(szFilename);
        if (cFilename >= 4) {       // need at least 4 chars to hold ".txt"
            char *szOKExt[] =
                {".txt", ".rtf", ".gif", ".jpg", ".bmp"};

            char *szLCase =
                _strlwr(_strdup(szFilename));

            if (szLCase) {
                for (int i = 0;
                    i < sizeof(szOKExt) / sizeof(szOKExt[0]);
                    i++)
                    if (szLCase[cFilename-1] == szOKExt[i][3] &&
                        szLCase[cFilename-2] == szOKExt[i][2] &&
                        szLCase[cFilename-3] == szOKExt[i][1] &&
                        szLCase[cFilename-4] == szOKExt[i][0])
                        fIsOK = true;

                free(szLCase);      // release the _strdup'd copy
            }
        }
    }

    return fIsOK;
}

As you can see, this code will not allow a file to be uploaded unless it has a safe extension, which includes text files (.txt), Rich Text Format files (.rtf), and some graphics formats. It's much better to do it this way. In the worst case, you have an annoyed user who thinks you should support another file format, and that is far better than having your servers compromised.
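For comparison, here is the same allowlist restated with std::string (my own restatement, not the book's): there is no index arithmetic to get wrong and no duplicated buffer to leak.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Allowlist check: accept only filenames ending in one of the safe
// extensions, compared case-insensitively.
bool HasOKExtension(const std::string &filename) {
    static const std::vector<std::string> ok = {".txt", ".rtf", ".gif", ".jpg", ".bmp"};

    std::string lower(filename);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return (char)std::tolower(c); });

    for (const auto &ext : ok) {
        // Require at least one character before the extension.
        if (lower.size() > ext.size() &&
            lower.compare(lower.size() - ext.size(), ext.size(), ext) == 0)
            return true;
    }
    return false;
}
```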

Tainted Variables in Perl

Perl includes a useful option to treat all input as unhygienic, or tainted, until it has been processed. An error is raised by the Perl engine if the code attempts to perform potentially dangerous tasks, such as calling the operating system, with the tainted data. Take a look at this code:

use strict;
my $filename = <STDIN>;
open (FILENAME, ">> " . $filename) or die $!;
print FILENAME "Hello!";
close FILENAME;

This code is unsafe because the filename comes directly from a user and the file is created or overwritten by this code; there's nothing stopping the user from entering a filename such as \boot.ini. If you run the Perl interpreter with the taint option (-T), the code results in the following error: Insecure dependency in open while running with -T switch at testtaint.pl line 3, <STDIN> line 1.

Calling open with an untrusted name is dangerous. The way to remedy this is to check the data validity by using a regular expression. (Regular expressions are explained later in this chapter.)

use strict;
my $filename = <STDIN>;
$filename =~ /(\w{1,8}\.log)/;
open (FILENAME, ">> " . $1) or die $!;
print FILENAME "Hello!";
close FILENAME;

In this code, the filename is checked prior to being used as the name in the call to open. The regular expression validates that the name is composed of one to eight word characters followed by a .log extension. Because the expression is wrapped in a capture operation (the "(" and ")" characters), the validated portion of the filename is stored in the $1 variable and then used as the filename for the open call. The Perl engine does not know whether you created a safe regular expression, so tainting is not a panacea; for example, the expression could simply be /(.*)/, which captures all the user's input. Even with this caveat, tainting helps developers catch many input trust errors.
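C and C++ have no taint mode, but you can approximate the discipline with a wrapper type that refuses to release its value until a validator approves it. This is entirely my own sketch, not a feature of any library:

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Untrusted data enters the program wrapped in Tainted; the raw string is
// only reachable through Use(), which requires a passing validity check.
class Tainted {
public:
    explicit Tainted(std::string raw) : raw_(std::move(raw)) {}

    // Returns the value only if the supplied check passes; throws
    // otherwise, so unvalidated data cannot silently flow onward.
    std::string Use(const std::function<bool(const std::string &)> &isValid) const {
        if (!isValid(raw_))
            throw std::runtime_error("untrusted data rejected");
        return raw_;
    }

private:
    std::string raw_;  // never exposed directly
};
```

The compiler now enforces what Perl's -T switch enforces at run time: code that forgets to validate simply cannot get at the data.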

Using Regular Expressions for Checking Input

For simple data validation, you can use code like the code I showed earlier, which used simple string compares. However, for complex data you need to use higher-level constructs, such as regular expressions. The following C# code shows how to use regular expressions to replace the C++ extension-checking code. This code uses the RegularExpressions namespace in the .NET Framework:

using System.Text.RegularExpressions;
...
static bool IsOKExtension(string Filename) {
    Regex r =
        new Regex(@"\.(txt|rtf|gif|jpg|bmp)$",
                  RegexOptions.IgnoreCase);
    return r.Match(Filename).Success;
}

The same code in Perl looks like this:

sub isOkExtension($) {
    $_ = shift;
    return /\.(?:txt|rtf|gif|jpg|bmp)$/i ? -1 : 0;
}

I'll go into language specifics later in this chapter. For now, let me explain how this works. The core of the expression is the alternation of the file extensions followed by $, the end-of-input anchor. The components are described in Table 10-1.

Table 10-1  Some Simple Regular Expression Elements

Element    Comments
xxx|yyy    Matches either xxx or yyy.
$          Matches the end of the input.

If the search string ends with one of the file extensions, the expression returns a match. Also note that the C# code sets the RegexOptions.IgnoreCase option, because filenames in Microsoft Windows are case-insensitive.
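If you are working in standard C++ rather than C# or Perl, std::regex can express the same check; std::regex::icase plays the role of RegexOptions.IgnoreCase, and grouping the alternation ensures the $ anchor applies to every extension, not just the last one. A sketch, with my own function name:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Case-insensitive allowlist of file extensions, anchored to the end of
// the filename. The group around the alternation binds $ to all of them.
bool IsOKExtension(const std::string &filename) {
    static const std::regex r("\\.(txt|rtf|gif|jpg|bmp)$", std::regex::icase);
    return std::regex_search(filename, r);
}
```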

Table 10-2 offers a more complete list of regular expression elements. Note that some of these elements are implemented in some programming languages and not in others.

Table 10-2  Common Regular Expression Elements

Element             Comments
^                   Matches the start of the string.
$                   Matches the end of the string.
*                   Matches the preceding pattern zero or more times. Same as {0,}.
+                   Matches the preceding pattern one or more times. Same as {1,}.
?                   Matches the preceding pattern zero or one time. Same as {0,1}.
{n}                 Matches the preceding pattern exactly n times.
{n,}                Matches the preceding pattern n or more times.
{,m}                Matches the preceding pattern no more than m times.
{n,m}               Matches the preceding pattern between n and m times.
.                   Matches any single character, except \n.
(pattern)           Matches and stores (captures) the resulting data in a variable. The variable used to store the captured data differs between programming languages. Can also be used as a group; for example, (xx)+ finds one or more instances of the pattern inside the parentheses. If you wish to group without capturing, use the noncapturing parenthesis syntax (?:xx) to instruct the regular expression engine not to store the data.
aa|bb               Matches aa or bb.
[abc]               Matches any one of the enclosed characters: a, b, or c.
[^abc]              Matches any character not in the enclosed list.
[a-z]               A range of characters or values. Matches any character from a to z.
\                   The escape character. Some escapes represent special characters (\n and \/), and others represent predefined character sequences (\d). It can also be used as a reference to previously captured data (\1).
\b                  Matches the position between a word and a space.
\B                  Matches a nonword boundary.
\d                  Matches a digit; same as [0-9].
\D                  Matches a nondigit; same as [^0-9].
\n, \r, \f, \t, \v  Special formatting characters: new line, carriage return, form feed, tab, and vertical tab.
\p{category}        Matches a Unicode category; this is covered in detail later in this chapter.
\s                  Matches a white-space character; same as [ \f\n\r\t\v].
\S                  Matches a non-white-space character; same as [^ \f\n\r\t\v].
\w                  Matches a word character; same as [a-zA-Z0-9_].
\W                  Matches a nonword character; same as [^a-zA-Z0-9_].
\xnn or \x{nn}      Matches a character represented by two hexadecimal digits, nn.
\unnnn or \x{nnnn}  Matches a Unicode code point, represented by four hexadecimal digits, nnnn. I use "code point" because of surrogate characters: not every code point is a character, and surrogates use two code points to represent a character. Refer to Chapter 14, "Internationalization Issues," for more information about surrogates.

Let's look at some examples in Table 10-3 to make this a little more concrete.

Table 10-3  Regular Expression Examples

Pattern                    Comments
[a-fA-F0-9]+               Match one or more hexadecimal digits.
<(.*)>.*<\/\1>             Match an HTML tag. Note that the first tag name is captured by (.*) and used to check the closing tag with \1. So if (.*) matches form, \1 is also form.
\d{5}(-\d{4})?             A U.S. ZIP Code.
^\w{1,32}(?:\.\w{0,4})?$   A valid but restrictive filename: 1-32 word characters, followed by an optional period and a 0-4 character extension. The opening and closing parentheses group the period and extension, but the extension is not captured because ?: is used. Note that I have used the ^ and $ characters to define the start and end of the input; there's an explanation of why later in this chapter.
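As a quick sanity check of the ZIP Code pattern from Table 10-3, here it is in standard C++. std::regex_match requires the entire input to match, which is exactly the validation semantic we want:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Validates a U.S. ZIP Code: five digits, optionally followed by a
// hyphen and four more digits. regex_match anchors both ends implicitly.
bool IsUSZip(const std::string &s) {
    static const std::regex r("\\d{5}(-\\d{4})?");
    return std::regex_match(s, r);
}
```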

Be Careful of What You Find—Did You Mean to Validate?

Regular expressions serve two main purposes. The first is to find data; the second, and the one we're mainly interested in, is to validate data. When someone enters a filename, I don't want to find the filename in the request; I want to validate that the request is for a valid filename. Allow me to explain. Look at this pseudocode that determines whether a filename is valid or not:

RegExp r = [a-z]{1,8}\.[a-z]{1,3};
if (r.Match(strFilename).Success) {
    //Cool! Allow access to strFilename; it's valid.
} else {
    //Tut! tut! Trying to access an invalid file.
}

This code will allow a request only for filenames comprised of 1-8 lowercase letters, followed by a period, followed by 1-3 lowercase letters (the file extension). Or will it? Can you spot the flaw in the regular expression? What if a user makes a request for the c:\boot.ini file? Will it pass the regular expression check? Yes, it will. The reason is that the expression looks for any substring of the filename request that matches the expression; in this case, it finds the series of letters boot.ini within c:\boot.ini. However, the request is clearly invalid.

The solution is to create an expression that parses the entire filename to look for a valid request. In that case, we need to change the expression to read as follows:

^[a-z]{1,8}\.[a-z]{1,3}$

The ^ means start of the input, and $ means end of the input. You can best think of the new expression as "from the beginning to the end of the request, allow only 1-8 lowercase letters, followed by a period, followed by 1-3 lowercase letters, and nothing more." Obviously, c:\boot.ini is invalid because the : and \ characters do not comply with the regular expression.
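The difference the anchors make is easy to demonstrate in code. In the sketch below (the function names are mine), the unanchored search accepts c:\boot.ini because it finds boot.ini inside it, while the anchored whole-input match rejects it:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Flawed check: regex_search matches the pattern anywhere in the input.
bool FindsValidName(const std::string &s) {
    static const std::regex r("[a-z]{1,8}\\.[a-z]{1,3}");
    return std::regex_search(s, r);
}

// Correct check: anchors plus regex_match force the entire request to
// conform to the pattern.
bool IsValidName(const std::string &s) {
    static const std::regex r("^[a-z]{1,8}\\.[a-z]{1,3}$");
    return std::regex_match(s, r);
}
```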

Regular Expressions and Unicode

Historically, regular expressions dealt with only 8-bit characters, which is fine for single-byte alphabets but not so great for everyone else! So how should your input-restricting code handle Unicode characters? If you must restrict your application to accept only what is valid, how do you do it if your application has Japanese or German users? The answer is not straightforward, and at best, support is inconsistent across regular expression engines.

Three aspects of Unicode make it complex to build good Unicode regular expressions:

  • As we've already discussed, few engines support Unicode.
  • Unicode is a very large character set. Windows uses little-endian UTF-16 to represent Unicode. In fact, because of surrogate characters, Windows supports over 1,000,000 characters; that's a lot of characters to check!
  • Unicode accommodates many scripts that have different characteristics than English. (The word script is used rather than language because one script can cover many languages.)

Now here's the good news: more engines are adding support for Unicode expressions as vendors realize the world is a very small place. A good example of this change is the introduction of Perl 5.8.0, which had just been released at the time of this writing. Another example is Microsoft's .NET Framework, which has both excellent regular expression support and exemplary globalization support. In addition, all strings in managed code are natively Unicode.

At first, you might think you can use hexadecimal ranges for languages, and you can, but doing so is crude and not recommended because

  • Spoken languages are living entities that evolve with time; a character that might seem invalid today in one language can become valid tomorrow.
  • It is really hard, if not impossible, to tell which ranges are valid for a language, even for English. Are accented characters valid? What about the word café? You get the picture.

The following regular expression will find all Japanese Katakana letters from small letter a to letter vo, but not the conjunction and length marks and some other special characters at \u30FB and above:

Regex r = new Regex(@"^[\u30A1-\u30FA]+$");

The secret to making Unicode regular expressions manageable lies in the \p{category} construct, which matches any character in the named Unicode character category. The .NET Framework and Perl 5.8.0 support Unicode categories, and this makes dealing with international characters easier. The high-level Unicode categories are Letters (L), Marks (M), Numbers (N), Punctuation (P), Symbols (S), Separators (Z), and Others (C), as follows:

  • L (All letters)
    • Lu (Uppercase letters)
    • Ll (Lowercase letters)
    • Lt (Titlecase letters. Some letters, called digraphs, are composed of two characters. For example, among the Croatian digraphs that match Cyrillic characters in Latin Extended-B, U+01C8, Lj, is the titlecase version of the uppercase LJ, U+01C7, and the lowercase lj, U+01C9.)
    • Lm (Modifier, letter-like symbols)
    • Lo (Other letters that have no case, such as Hebrew, Arabic, and Tibetan)

  • M (All marks)
    • Mn (Nonspacing marks, including accents and umlauts)
    • Mc (Spacing combining marks, used as vowel signs in languages such as Tamil)
    • Me (Enclosing marks, shapes enclosing other characters, such as a circle)

  • N (All numbers)
    • Nd (Decimal digits, zero to nine. This does not cover some Asian scripts, such as Chinese, Japanese, and Korean; for example, the Hangzhou-style numerals are treated similarly to Roman numerals and classified as Nl (Number, letter) instead of Nd.)
    • Nl (Numeric letters, Roman numerals from U+2160 to U+2182)
    • No (Other numbers, represented as fractions, superscripts, and subscripts)

  • P (All punctuation)
    • Pc (Connector characters, such as the underscore, that join other characters)
    • Pd (Dash, all dashes and hyphens)
    • Ps (Open, characters like {, (, and [)
    • Pe (Close, characters like }, ), and ])
    • Pi (Initial quote characters, including ', «, and ")
    • Pf (Final quote characters, including ', », and ")
    • Po (Other characters, including ?, !, and so on)

  • S (All symbols)
    • Sm (Math)
    • Sc (Currency)
    • Sk (Modifier symbols, such as a circumflex or grave symbols)
    • So (Other, box-drawing symbols and letter-like symbols such as degrees Celsius and copyright)

  • Z (All separators)
    • Zs (Space separators, including the normal space character)
    • Zl (Line separator, which is only U+2028; note that U+00A6, the broken bar, is treated as a symbol)
    • Zp (Paragraph separator, which is only U+2029)

  • C (Others)
    • Cc (Control characters, including well-known control codes such as carriage return, line feed, and bell)
    • Cf (Format characters, invisible characters such as the Arabic end-of-Ayah)
    • Co (Private-use characters, including proprietary logos and symbols)
    • Cn (Unassigned)
    • Cs (High and low surrogate characters)

Let's put the character classes to good use. Imagine a field in your Web application must include only a currency symbol, such as that for a dollar or a euro. You can verify that the field contains such a character and nothing else with this code:

Regex r = new Regex(@"^\p{Sc}{1}$");
if (r.Match(strInput).Success) {
    // cool!
} else {
    // try again
}

The good news is that this works for all currency symbols defined in Unicode, including the dollar ($), pound sterling (£), yen (¥), franc (₣), euro (€), new sheqel (₪), and others!

The following regular expression will match all letters, nonspacing marks, and spaces:

Regex r = new Regex(@"^[\p{L}\p{Mn}\p{Zs}]+$");

The reason for including \p{Mn} is that many languages use diacritics and vowel marks, which are often classified as nonspacing marks.

The .NET Framework also provides named-block specifiers, such as \p{IsHebrew}, \p{IsArabic}, and \p{IsKatakana}. I have included some sample code, named Ch10\Lang, that demonstrates this.

When you're experimenting with other languages, I recommend you use Windows 2000, Windows XP, or Microsoft Windows .NET Server 2003 with a Unicode font installed (such as Arial Unicode MS) and use the Character Map application, as shown in Figure 10-3, to determine which characters are valid. Note, however, that a font that claims to support Unicode is not required to have glyphs for every valid Unicode code point. You can look at the Unicode code charts at http://www.unicode.org/charts.


Figure 10-3  Using the Character Map application to view non-ASCII fonts.

A Regular Expression Rosetta Stone

Regular expressions are incredibly powerful, and their usefulness extends beyond just restricting input. They constitute a technology worth understanding for solving many complex data manipulation problems. I write many applications, mostly in Perl and C#, that use regular expressions to analyze log files for attack signatures and to analyze source code for security defects. Because subtle variations exist in regular expression syntax between programming languages and execution environments, the rest of this chapter outlines some of these variations. (Note that my intention is only to give you a number of regular expression quick references.)

Regular Expressions in Perl

Perl is recognized as a leader in regular expression support, in part because of its excellent string-handling and file-handling support. A regular expression that extracts the time from a string in Perl looks like this:

$_ = "We leave at 12:15pm for Mount Doom. ";
if (/.*(\d{2}:\d{2}[ap]m)/i) {
    print $1;
}

Note that the match takes no explicit argument; if none is provided, the implicit $_ variable is used. If the data is in a variable other than $_, use the binding operator:

$var =~ /expression/;

Regular Expressions in Managed Code

Most if not all applications written in C#, Managed C++, Microsoft Visual Basic .NET, ASP.NET, and so on have access to the .NET Framework and as such can use the System.Text.RegularExpressions namespace. I've already outlined its syntax earlier in this chapter. However, for completeness, following are C#, Visual Basic .NET, and Managed C++ examples of the date extraction code I showed earlier in Perl.

C# Example

// C# Example
String s = @"We leave at 12:15pm for Mount Doom.";
Regex r = new Regex(@".*(\d{2}:\d{2}[ap]m)",RegexOptions.IgnoreCase);
Match m = r.Match(s);
if (m.Success)
    Console.Write(m.Result("$1"));

Visual Basic .NET Example

' Visual Basic .NET example
Imports System.Text.RegularExpressions
 
.
 
Dim s As String
Dim r As Regex
s = "We leave at 12:15pm for Mount Doom."
r = New Regex(".*(\d{2}:\d{2}[ap]m)", RegexOptions.IgnoreCase)
If r.Match(s).Success Then
    Console.Write(r.Match(s).Result("$1"))
End If

Managed C++ Example

// Managed C++ version
#using <mscorlib.dll>
#include <tchar.h>
#using <system.dll>
 
using namespace System;
using namespace System::Text;
using namespace System::Text::RegularExpressions;
.
String *s = S"We leave at 12:15pm for Mount Doom.";
Regex *r = new Regex(S".*(\\d{2}:\\d{2}[ap]m)", RegexOptions::IgnoreCase);
if (r->Match(s)->Success) 
    Console::WriteLine(r->Match(s)->Result(S"$1"));

Note that the same code applies to ASP.NET because ASP.NET is language-neutral.

Regular Expressions in Script

The base JavaScript 1.2 language supports regular expressions by using syntax similar to Perl. Netscape Navigator 4 and later and Microsoft Internet Explorer 4 and later also support regular expressions.

var r = /.*(\d{2}:\d{2}[ap]m)/;
var s = "We leave at 12:15pm for Mount Doom.";
if (s.match(r))
    alert(RegExp.$1);

Regular expressions are also available to developers in Microsoft Visual Basic Scripting Edition (VBScript) version 5 via the RegExp object:

Set r = new RegExp
r.Pattern = ".*(\d{2}:\d{2}[ap]m)"
r.IgnoreCase = True
 
Set m = r.Execute("We leave at 12:15pm for Mount Doom.")
MsgBox m(0).SubMatches(0)

If you plan to use regular expressions in client code, use them only to catch malformed requests early and save round-trips; client-side validation is not a security technique, because an attacker can bypass the client entirely. Always validate the data again at the server.

Regular Expressions in C++

Now for the difficult language! Not that it is hard to write C++ code; rather, the language has limited class support for regular expressions. If you use the Standard Template Library (STL), an STL-aware regular expression library named Regex++ is available at http://www.boost.org. You can read a good article written by the Regex++ author at http://www.ddj.com/documents/s=1486/ddj0110a/0110a.htm.

Microsoft Visual C++, included with Microsoft Visual Studio .NET, includes a lightweight Active Template Library (ATL) regular expression parser template class, CAtlRegExp. Note that the regular expression syntax used by Regex++ and CAtlRegExp differs from the classic syntax—some of the less-used operators are missing, and some elements are different. The syntax for CAtlRegExp regular expressions is at http://msdn.microsoft.com/library/en-us/vclib/html/vclrfcatlregexp.asp.

The following is an example of using CAtlRegExp:

#include <AtlRX.h>
.
CAtlRegExp<> re;
re.Parse(".*{\\d\\d:\\d\\d[ap]m}",FALSE);
CAtlREMatchContext<> mc;
if (re.Match("We leave at 12:15pm for Mount Doom.", &mc)) {
    const CAtlREMatchContext<>::RECHAR* szStart = 0;
    const CAtlREMatchContext<>::RECHAR* szEnd = 0;
    mc.GetMatch(0,&szStart, &szEnd);
 
    ptrdiff_t nLength = szEnd - szStart;
    printf("%.*s", (int)nLength, szStart);
}
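Since this chapter was written, the C++ standard library has gained regular expression support of its own (std::regex, standardized in C++11), so the time-extraction example no longer requires an external library. A sketch:

```cpp
#include <regex>
#include <string>

// Extracts the first hh:mm[ap]m time from the input, or returns an
// empty string if no such time is found. std::regex_search scans the
// whole string, so no leading .* is needed.
std::string ExtractTime(const std::string &s) {
    static const std::regex re(R"((\d{2}:\d{2}[ap]m))", std::regex::icase);
    std::smatch m;
    if (std::regex_search(s, m, re))
        return m[1].str();
    return "";
}
```

The static regex is compiled once; compiling a pattern on every call is a common and avoidable performance mistake.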

A Best Practice That Does Not Use Regular Expressions

One way to enforce that input is always validated before it is used is to employ a language that supports classes, such as C++, C#, or Visual Basic .NET. Here's an example of a UserInput class written in C++:

#include <string>
using namespace std;
 
class UserInput {
public:
    UserInput(){};
    ~UserInput(){};
    bool Init(const char* str) {
        //add more checking here if you like
        if(!Validate(str)){
            return false;
        } else {
            input = str;
            return true;
        }
    }
 
    const char* GetInput() const {return input.c_str();}
    size_t Length() const {return input.length();}
 
private:
    bool Validate(const char* str);
    string input;
};

Using a class like this has a number of advantages. First, if you see a method or function that takes a pointer or reference to a UserInput class, it's obvious that you're dealing with user input. Second, there's no way to get an instance of this class whose input has not passed through the Validate method. If the Init method is never called or fails, the class contains an empty string. If you wanted to, you could also give such a class a Canonicalize method. This approach can save you time and bug-fixing because you can ensure that input validation always takes place and is done consistently.
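To make the pattern concrete, here is a self-contained, hedged sketch of the class with a filled-in Validate method; the specific policy (1 to 64 printable ASCII characters) is my own illustrative assumption, not the policy from the book's sample code:

```cpp
#include <cctype>
#include <string>

class UserInput {
public:
    // Returns false (and stores nothing) if the data fails validation,
    // so an instance can never hold unvalidated input.
    bool Init(const char *str) {
        if (!Validate(str)) return false;
        input = str;
        return true;
    }
    const char *GetInput() const { return input.c_str(); }
    size_t Length() const { return input.length(); }

private:
    // Illustrative policy only: accept 1-64 printable ASCII characters.
    bool Validate(const char *str) {
        if (!str || !*str) return false;
        size_t n = 0;
        for (const char *p = str; *p; ++p, ++n)
            if (!isprint(static_cast<unsigned char>(*p)))
                return false;
        return n <= 64;
    }
    std::string input;
};
```

A function with a signature such as `void Process(const UserInput &ui)` then advertises, at a glance, that it only ever sees validated data.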

Summary

I've spent a great deal of time outlining how to use regular expressions, but do not lose sight of the most important message of this chapter: trust input at your peril. In fact, do not trust any input until it is validated. Remember, just about any security vulnerability can be traced back to an application placing too much trust in the data.

When analyzing input, have a small number of entry points into the trusted code; all input must come through one of these chokepoints. Do not look for "bad" data in the request. You should look for good, well-formed data and reject the request if the data does not meet your acceptance criteria. Remember: you wrote the code for accessing and manipulating your resources; you know what constitutes a correct request. You cannot know all possible invalid requests, and that's one of the reasons you must look only for valid data. The list of correct requests is finite, and the list of invalid requests is potentially infinite or, at least, very, very large.
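As a concrete illustration of this allow-list approach, the following sketch validates a username at a single chokepoint; the specific policy (1 to 32 characters drawn from letters, digits, and underscore) is an assumption for illustration, as is the function name:

```cpp
#include <regex>
#include <string>

// Allow-list chokepoint: accept the request only if the whole input
// matches the known-good pattern; everything else is rejected.
// std::regex_match requires the pattern to cover the entire string,
// so bad data cannot be smuggled in before or after a valid-looking core.
bool IsValidUserName(const std::string &name) {
    static const std::regex re("[A-Za-z0-9_]{1,32}");
    return std::regex_match(name, re);
}
```

Note that the check names what is allowed, not what is forbidden; there is no blacklist of "dangerous" characters to keep up to date.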



Last Updated: November 14, 2002