Chapter 10: String Handling and Regular Expressions continued
Regular Expressions
The System.Text.RegularExpressions namespace provides a number of classes for regular expression processing. Regular expressions offer a powerful, flexible, and efficient strategy for processing text. The .NET Framework regular expressions have evolved from languages such as Perl and awk and are designed to be compatible with Perl 5 regular expressions. In addition, the .NET regular expressions include unique features such as right-to-left matching and on-the-fly compilation. Using regular expressions, you can quickly parse large amounts of text to find specific character patterns, then extract, edit, replace, or delete text substrings. These features are particularly useful for parsing HTML pages, HTTP headers, XML files, system log files, and so on.
The regular expression language includes two basic character types: literal (normal) text characters and metacharacters. Regular expression metacharacters are an evolved extension of the ? and * metacharacters used with the MS-DOS file system to represent any single character or group of characters. The most commonly used metacharacters in the regular expression pattern syntax are listed in Table 10-5.
Table 10-5 Common Regular Expression Metacharacters
For example, when applied to a body of text, the regular expression \sFoo matches all occurrences of the string "Foo" that are preceded by any whitespace character (such as space, tab, or carriage return/linefeed).
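To make the \sFoo example concrete, here is a minimal sketch. The class name, helper method, and sample text are hypothetical and chosen only for illustration. Note that each match includes the whitespace character itself, so the reported index is the position of that whitespace, not of the "F".

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceFooSketch
{
    // Returns the start index of every match of \sFoo in the input.
    public static int[] FindPositions(string input)
    {
        MatchCollection mc = Regex.Matches(input, @"\sFoo");
        int[] positions = new int[mc.Count];
        for (int i = 0; i < mc.Count; i++)
        {
            positions[i] = mc[i].Index;
        }
        return positions;
    }

    public static void Main()
    {
        // Hypothetical sample: the first "Foo" has no preceding whitespace,
        // so it is not matched; the other two (after a space and a tab) are.
        string text = "Foo bar Foo\tFoo";
        foreach (int p in FindPositions(text))
        {
            Console.WriteLine("Match starts at index {0}", p);
        }
    }
}
```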
As a simple introduction to regular expressions, let's revisit the String.Split method from the previous section, which splits a string into substrings according to specified separators. The Regex class in the System.Text.RegularExpressions namespace can be used in the same way. Instead of setting up an array of characters containing the delimiters, we pass a parenthesized set of delimiter values to the Regex constructor:
class SplitRegExApp
Here's the output:
Once
Let's continue our comparison of the String and Regex features for splitting strings. We can also split strings on the basis of multiple different delimiters:
Console.WriteLine(
Note the way we specify multiple delimiters to Regex by using the alternation metacharacter (|), which looks like the C# bitwise OR operator but in a pattern means "match either alternative." Now that our delimiters include such characters as the backslash and single quote, we must ensure that the string is @-quoted so that the C# compiler doesn't treat the backslash as the start of an escape sequence. Let's see what happens when we try our version of the string with multiple spaces:
Console.WriteLine(
As you can see from the output, this result isn't quite right:
Multiple spaces, using " "
What's going on here? Our regular expression search pattern is a single space, so the engine splits at every space it finds. Furthermore, because the spaces are being discarded, two adjacent spaces leave an empty string between them. For instance, there are three spaces between "Once" and "Upon", so the sequence "Once   Upon" is split at each of the three spaces: the first split separates "Once" from what follows, the second and third each produce an empty string, and the remainder is "Upon". Therefore, we end up with four substrings: "Once", "", "", and "Upon". Can we fix this problem? Yes, we can. The obvious solution is to test for empty strings upon output:
string u = "Once Upon A Time In America";
OK, it's working, but it seems a bit of a bodge, doesn't it? Isn't there a better way? The answer is yes. In the following example, which produces the same output, we specify \s as the pattern to match. Regex will interpret \s as any single whitespace character. We need to place the at sign (@) in front of the string, or the compiler will step in and complain about \s being an unrecognized escape sequence. Finally, we add a plus sign (+) to the end of the pattern to signify that we're happy to match multiple instances of the pattern: in this case, multiple instances of whitespace:
Console.WriteLine(
If we're concerned only with spaces and not other whitespace characters (such as tabs), we can reduce the expression to this:
Regex n = new Regex("[ ]+");
Finally, instead of using the square brackets to surround our search pattern, we can use parentheses, another of the metacharacters recognized by Regex. What difference do they make? Let's take our first (simplest) example:
Console.WriteLine(
Here's the output:
Single spaces, using ()
This time, we don't have empty strings between each substring. We now have 11 substrings: the parentheses cause Regex to keep, or capture, the delimiters instead of discarding them. In a more sophisticated situation where we're not just splitting a string but performing some other modifications to it, we might want to keep the delimiters for other processing.
The foregoing examples use regular expressions in a fairly simple manner, just to compare the Regex and String classes as closely as possible. We'll now see how to use regular expressions in a more powerful fashion.
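The splitting behaviors discussed in this section can be compared side by side with a sketch like the following. This is not the original listing: the class and method names are hypothetical, and the spacing in the multiple-space string is an assumption made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitSketch
{
    // Splits the input on the given regular expression pattern.
    public static string[] SplitOn(string pattern, string input)
    {
        return new Regex(pattern).Split(input);
    }

    private static void Show(string label, string[] parts)
    {
        Console.WriteLine("{0}: {1} substrings", label, parts.Length);
        foreach (string part in parts)
        {
            Console.WriteLine("  '{0}'", part);
        }
    }

    public static void Main()
    {
        string single = "Once Upon A Time In America";
        // Hypothetical spacing: three spaces after "Once", two after "A".
        string multi = "Once   Upon A  Time In America";

        // A single-space pattern leaves empty strings between adjacent spaces.
        Show("Single-space pattern", SplitOn(" ", multi));

        // \s+ treats each run of whitespace as one delimiter.
        Show(@"\s+ pattern", SplitOn(@"\s+", multi));

        // Capturing parentheses keep the delimiters in the result.
        Show("Capturing ( ) pattern", SplitOn("( )", single));
    }
}
```

Testing for empty strings on output also works, but the \s+ pattern avoids producing them in the first place.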
Match and MatchCollection
The System.Text.RegularExpressions namespace also offers a Match class and a MatchCollection class. The Match class represents the results of a regular expression-matching operation. A Match object is immutable, and the Match class has no public constructor. Therefore, you can get a Match only from another class, such as Regex. In the following example, we use the Match method of the Regex class to return an object of type Match in order to find the first match in the input string. We also use the Match.Success property to indicate whether a match was indeed found.
class MatchingApp
The output from this application is:
Found 'in' at position 5
Note that if we'd initialized the Regex object with "capturing" parentheses, the effect would be exactly the same:
Regex r = new Regex("(in)");
OK, but what if there are multiple occurrences of the pattern in the string? For this, we need to use the MatchCollection class. Like Match, this class is immutable and has no public constructor. In the following example, we use the same Regex object previously initialized to search for the pattern "in" and apply it to a longer string with multiple occurrences of the pattern. The results are returned in a MatchCollection object, which we can then iterate. We can also use the indexer to treat the collection as an array.
MatchCollection mc = r.Matches(
The output from this new block of code is:
Found 'in' at position 5
The Match class stores and provides access to all the substrings extracted by the search. Match also remembers the string being searched and the regular expression being used, so it can use them to perform another search that starts where the last one ended. Therefore, we can also perform the previous search operation by using the following code: we find the first match, and as long as this succeeds, we continue searching with a call to Match.NextMatch:
string s2 = "The King Was in His Counting House";
Suppose we want to match the pattern "in" only as a word, that is, only when it is preceded and followed by a space. This situation is almost too trivial to mention; just bear in mind that the regular expression classes can search for any pattern you care to imagine:
Regex q = new Regex(" in ");
The output from this new block of code is:
Found ' in ' at position 12
Finally, suppose we want to match multiple instances of multiple patterns:
Regex p = new Regex("((an)|(in)|(on))");
The output from this new block of code is:
Found 'in' at position 5
Note that we can alternatively write the regular expression just shown like this:
Regex p = new Regex("(a|i|o)n");
This alternative pattern matching can be extended to a technique named backtracking. Backtracking occurs when the regular expression-matching engine needs to back up to re-examine part of the string that it has already passed. For example, suppose we're looking for either spelling of the word "Gray": "Gray" or "Grey". Suppose that in a given string we have the substring "Grey". When the engine examines this string and finds the pattern "Gr", it must choose to compare the next character against the letter "a" or "e". Suppose it chooses to match "a". This comparison fails, so the engine must backtrack to try to match "e".
Regex n = new Regex("Gr(a|e)y");
The output from this new block of code is:
Found 'Grey' at position 7
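The three ways of retrieving matches described in this section (a single Match, a MatchCollection, and a Match.NextMatch loop) can be sketched as follows. The class name, the CountMatches helper, and the "Grey" sample sentence are hypothetical; the "in" and " in " patterns and the first input string come from the text above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchSketch
{
    // Counts matches by walking the chain with Match.NextMatch.
    public static int CountMatches(string pattern, string input)
    {
        int count = 0;
        for (Match m = Regex.Match(input, pattern); m.Success; m = m.NextMatch())
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string s = "The King Was in His Counting House";
        Regex r = new Regex("in");

        // First match only.
        Match first = r.Match(s);
        if (first.Success)
        {
            Console.WriteLine("Found '{0}' at position {1}", first.Value, first.Index);
        }

        // All matches, iterated as a collection.
        foreach (Match m in r.Matches(s))
        {
            Console.WriteLine("Found '{0}' at position {1}", m.Value, m.Index);
        }

        // The same search expressed with NextMatch.
        Console.WriteLine("Total matches: {0}", CountMatches("in", s));

        // Alternation: either spelling is accepted (hypothetical sample text).
        Match g = Regex.Match("A Dark Grey Sky", "Gr(a|e)y");
        Console.WriteLine("Found '{0}' at position {1}", g.Value, g.Index);
    }
}
```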
Groups and Captures
The System.Text.RegularExpressions namespace also offers a Group class and a GroupCollection class. The Group class represents the results from a single regular expression-matching group. In the following example, we define three groups, "ing", "in", and "n", and then search the string "Matching" to find these patterns. As you can see, the Match class offers a Groups property that returns a GroupCollection object, and we can use an integer indexer into the GroupCollection to extract individual Group objects:
class GroupingApp
The output from this application is:
Found 3 Groups
Note that the for loop just shown could've been written to use the Capture and CaptureCollection classes explicitly. The Capture class contains the results from a single subexpression capture, while the CaptureCollection class represents a sequence of substrings captured by a single capturing group:
for (int i = 0; i < gc.Count; i++)
The relationship between matches, groups, and captures is indicated in Figure 10-2.
Figure 10-2 Matches, groups, and captures.
The Group class becomes much more powerful when used with named groups. You can make Regex put the captured substrings into Group objects with arbitrary names and then use these names via the GroupCollection string indexer:
Regex q = new Regex(
The output from this new block of code is:
Salary = 123456
Table 10-6 shows how the regular expression you just saw breaks down.
Table 10-6 Breakdown of a Typical Regular Expression
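As a rough illustration of groups, captures, and named groups, consider the sketch below. The nested pattern "(i(n))g" is only one possible way to obtain the three groups "ing", "in", and "n" from "Matching", and the named-group pattern and its "Joe:123456" input are entirely hypothetical; neither is necessarily the pattern used in the original listings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupSketch
{
    // Returns the text captured by a named group, or null if nothing matched.
    public static string NamedGroup(string pattern, string input, string groupName)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[groupName].Value : null;
    }

    public static void Main()
    {
        // Assumed pattern: group 0 is the whole match "ing",
        // group 1 is "in", and group 2 is "n".
        Match m = new Regex("(i(n))g").Match("Matching");
        GroupCollection gc = m.Groups;
        Console.WriteLine("Found {0} Groups", gc.Count);

        for (int i = 0; i < gc.Count; i++)
        {
            Group g = gc[i];
            Console.WriteLine("Group {0}: '{1}' at position {2}", i, g.Value, g.Index);

            // Each group exposes the individual captures it collected.
            foreach (Capture c in g.Captures)
            {
                Console.WriteLine("  Capture: '{0}'", c.Value);
            }
        }

        // Hypothetical named groups, read through the string indexer.
        string pattern = @"(?<name>\w+):(?<salary>\d+)";
        Console.WriteLine("Salary = {0}", NamedGroup(pattern, "Joe:123456", "salary"));
    }
}
```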
String-Modifying Expressions
In addition to parsing strings to search for patterns by using methods such as Split, Match, and Matches, we can use methods in the Regex class for stripping out substrings, joining substrings, and generating modified strings. You can use Regex.Replace to perform common operations such as stripping leading and/or trailing whitespace, tokenizing or modifying pathnames, and splitting or joining lines of text. For example, to strip leading whitespace, we can initialize a Regex object with a regular expression that matches any number of whitespace characters at the beginning of a line (such as "^\s+") and then use Regex.Replace to replace all these characters with an empty string:
class RXmodifyingApp
The output from this application is:
Strip leading space: leading
Table 10-7 breaks down the regular expression you just saw.
Table 10-7 Breakdown of Another Typical Regular Expression
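A minimal sketch of the leading-whitespace example, using the "^\s+" pattern given above, might look like this. The class and method names and the sample input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripLeadingSketch
{
    // Replaces any run of whitespace at the start of the string with "".
    public static string StripLeading(string s)
    {
        Regex rx = new Regex(@"^\s+");
        return rx.Replace(s, "");
    }

    public static void Main()
    {
        Console.WriteLine("Strip leading space: {0}", StripLeading("   leading"));
    }
}
```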
The Regex class offers instance methods such as Split, Replace, and Match as well as static equivalents; therefore, you don't even have to instantiate a Regex object. This feature is particularly useful if you want to perform a series of different regular expression operations: because a Regex object is immutable, each new pattern would otherwise need its own Regex instance, so the static methods can be more convenient. The previous code can thus be rewritten like this:
//rx = new Regex(e);
By the same token (no pun intended), we can strip trailing spaces, modify pathnames, and convert date formats:
s = "trailing ";
The date-formatting regular expression just shown breaks down into three subpatterns, each with the same basic meaning, as shown in Table 10-8.
Table 10-8 Breakdown of a Date-Formatting Regular Expression
The output from these additional code blocks is:
Strip trailing space: trailing
There's also a static version of Regex.Match, which can be used under similar circumstances. For example, to find the HREF link tags in some simple HTML:
Console.WriteLine();
Table 10-9 shows how the regular expression you just saw breaks down.
Table 10-9 Breakdown of a Typical HTML Regular Expression
The output from this additional code is:
HTML links: <a href="first.htm">
Finally, remember how in the "Strings" section of the chapter we used a custom method to convert a string to proper case (initial caps on each word in the string)? Here's another version that achieves the same result by using regular expressions instead of string processing:
public class RXProperCaseApp
This is the output:
Initial String: the qUEEn wAs in HER parLOr
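The static Regex methods discussed in this section can be sketched together as follows. All patterns here are assumptions written for illustration (in particular the date and HREF patterns, which are not necessarily those broken down in Tables 10-8 and 10-9), and the class and method names are hypothetical. The proper-case conversion uses a MatchEvaluator delegate, which Regex.Replace calls once per match to compute the replacement text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StaticRegexSketch
{
    // Strips trailing whitespace with the static Regex.Replace.
    public static string StripTrailing(string s)
    {
        return Regex.Replace(s, @"\s+$", "");
    }

    // Hypothetical date rewrite: mm/dd/yy becomes dd-mm-yy.
    public static string SwapDate(string s)
    {
        return Regex.Replace(
            s,
            @"\b(?<mm>\d{1,2})/(?<dd>\d{1,2})/(?<yy>\d{2,4})\b",
            "${dd}-${mm}-${yy}");
    }

    // Proper case: upper-case the first letter of each word, lower-case the rest.
    public static string ProperCase(string s)
    {
        return Regex.Replace(s, @"\w+", new MatchEvaluator(Capitalize));
    }

    private static string Capitalize(Match m)
    {
        string word = m.Value;
        return char.ToUpperInvariant(word[0]) + word.Substring(1).ToLowerInvariant();
    }

    public static void Main()
    {
        Console.WriteLine("Strip trailing space: '{0}'", StripTrailing("trailing   "));
        Console.WriteLine("Date: {0}", SwapDate("03/15/2002"));

        // Hypothetical HTML fragment and anchor-tag pattern.
        string html = "<a href=\"first.htm\">first</a> <a href=\"next.htm\">next</a>";
        string linkPattern = "<a\\s+href=\"[^\"]*\">";
        for (Match m = Regex.Match(html, linkPattern, RegexOptions.IgnoreCase);
             m.Success; m = m.NextMatch())
        {
            Console.WriteLine("HTML link: {0}", m.Value);
        }

        Console.WriteLine(ProperCase("the qUEEn wAs in HER parLOr"));
    }
}
```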
Regular Expression Options
Suppose that in a string we want to match some alternative patterns in which only the letter case differs. For instance, suppose we want to find any instance of the word "in" or "In" or "IN" or "iN". We could use this pattern:
class RXOptionsApp
Here's the output:
Found 'IN' at position 5
Alternatively, we could use an overloaded Regex constructor that takes a RegexOptions enumeration value as its second parameter. For example, to get the same results as we just saw, we could use the IgnoreCase option:
r = new Regex("in", RegexOptions.IgnoreCase);
Another potentially useful RegexOptions value is RightToLeft:
r = new Regex("in",
Given the previous behavior, the output from this version should be obvious:
Found 'in' at position 25
Another feature, which is very useful if you're building complex expressions, is the ability to embed comments into a pattern by using the # delimiter. Of course, this wouldn't be much use if the Regex object then included the comments as part of the pattern to be searched. Therefore, you can construct a Regex with the RegexOptions.IgnorePatternWhitespace option, which ignores both embedded comments and any whitespace that isn't explicitly escaped:
r = new Regex(
The output follows:
Found 'as' at position 10
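The options covered in this section can be exercised with a sketch along these lines. The input string (including its casing), the character-class pattern "[iI][nN]", and the commented pattern are assumptions for illustration, not the original listing; the class and helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsSketch
{
    // Returns the index of the first match found, or -1 if there is none.
    public static int FirstIndex(string input, string pattern, RegexOptions options)
    {
        Match m = new Regex(pattern, options).Match(input);
        return m.Success ? m.Index : -1;
    }

    public static void Main()
    {
        // Hypothetical input; the upper-case "KING" is an assumption.
        string s = "The KING Was in His Counting House";

        // Spelling out both cases by hand.
        Console.WriteLine(FirstIndex(s, "[iI][nN]", RegexOptions.None));

        // Letting the engine ignore case.
        Console.WriteLine(FirstIndex(s, "in", RegexOptions.IgnoreCase));

        // Searching from the end of the string toward the start.
        Console.WriteLine(FirstIndex(s, "in", RegexOptions.RightToLeft));

        // Comments and unescaped whitespace are dropped from the pattern.
        string commented = @"a   # match the letter a
                             s   # followed by the letter s";
        Console.WriteLine(FirstIndex(s, commented, RegexOptions.IgnorePatternWhitespace));
    }
}
```

Even with RightToLeft, Match.Index still reports the position where the matched text begins, counted from the start of the string.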
Compiling Regular Expressions
One of the RegexOptions enumeration values is Compiled:
r = new Regex("in", RegexOptions.Compiled);
The default behavior of the Regex engine is to compile a regular expression to a sequence of internal instructions (not MSIL), which are interpreted upon execution. On the other hand, if you construct a Regex object with the RegexOptions.Compiled option, the engine compiles the regular expression to explicit MSIL. This option allows the .NET Framework's just-in-time compiler (JITter) to convert the expression to native machine code for higher performance. For a complex expression that's used heavily, this conversion yields faster execution; of course, it also increases startup time. Also bear in mind that by using the Compiled option, you're effectively converting state data (which would be destroyed when the Regex object is garbage collected) into code (which is removed from memory only when the application terminates). So, choose when to use this option carefully.
A related feature is the ability to explicitly compile a regular expression to an assembly that's then persisted to disk by using the Regex.CompileToAssembly method. For example, suppose we have a lengthy regular expression such as one that parses an Internet Protocol (IP) address:
class RXassemblyApp
This regular expression breaks down into four identical groups. Table 10-10 shows how each of these four groups breaks down.
Table 10-10 Breakdown of an IP Address Regular Expression
Here's the output:
IP Address: 123.45.67.89
We can explicitly compile this to a persistent assembly. First, set up an array of RegexCompilationInfo references. We need only one of these, but we have to have an array to pass to Regex.CompileToAssembly. Set up this one instance with the regular expression pattern, any RegexOptions flags, the name you want for your assembly, and any namespace you want to use for it. The final parameter is a Boolean value that indicates whether the regular expression should be public:
RegexCompilationInfo [] rci =
Then set up an AssemblyName object. The assembly cache manager uses the object for binding and retrieving information about an assembly. We need to set only one property of this object: the filename for the assembly itself. The extension .dll is assumed and will be appended automatically. Finally, pass both the RegexCompilationInfo array and the AssemblyName reference to Regex.CompileToAssembly:
AssemblyName an = new AssemblyName();
When you run this code, you'll find that a new file named MyAss.dll has been created in the same location as the target for this current assembly, which is normally ..\bin\debug. If we examine the metadata for this assembly, shown in Figure 10-3, we'll see that it contains three classes: MyRegexAssembly (via the third parameter to the RegexCompilationInfo constructor) derived from Regex, MyRegexAssemblyFactory derived from RegexRunnerFactory, and MyRegexAssemblyRunner derived from RegexRunner.
Figure 10-3 Metadata for compiled regular expression.
We could then use this customized derived MyRegexAssembly class in another project. In this example, I've added MyAss.dll as a reference to the new project:
using System;
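For the in-memory side of this section, here is a small sketch of RegexOptions.Compiled applied to a simplified IP address pattern. The pattern is an assumption (four identical \d{1,3} groups, with no check that each number is at most 255), and the class, field, and method names are hypothetical. The sketch deliberately does not call Regex.CompileToAssembly, because that method is supported only on the .NET Framework; on .NET Core and later runtimes it throws PlatformNotSupportedException.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledSketch
{
    // A static field means the compilation cost is paid once and the
    // compiled pattern is reused by every call.
    // Assumed pattern: four dot-separated groups of one to three digits.
    private static readonly Regex IpPattern = new Regex(
        @"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b",
        RegexOptions.Compiled);

    // Returns the first IP-like substring, or null if none is found.
    public static string FindIp(string input)
    {
        Match m = IpPattern.Match(input);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Hypothetical input text.
        Console.WriteLine("IP Address: {0}", FindIp("Host 123.45.67.89 responded"));
    }
}
```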
Summary
In this chapter, we examined two primary classes for processing strings, String and Regex, plus a range of ancillary classes that modify and support string operations. We explored the use of the String class methods for searching, sorting, splitting, joining, and otherwise returning modified strings. We also saw how many other classes in the .NET Framework support string processing (including Console, the basic numeric types, and DateTime) and how culture information and character encoding can affect string formatting. Finally, we saw how the system performs sneaky string interning to improve runtime efficiency.
In the second part of this chapter, we looked at Regex and its supporting classes (Match, Group, and Capture) for encapsulating regular expressions. We explored both pattern searching and string modifying through the set of Regex instance and static methods, and we examined the use of RegexOptions to modify the behavior of the operation. Finally, we saw how we can compile regular expressions to assemblies as a code management strategy. Clearly, there's some overlap in functionality between strings and regular expressions. String-based code is probably simpler and easier to maintain, while Regex-based code will generally be much more flexible and powerful. In many situations, you'll find that a judicious mixture of both is the best approach.