PowerShell is Microsoft's next-generation command line and scripting solution. It combines the interactive capabilities of traditional shells such as bash or zsh with the programmability of scripting languages such as Perl or Ruby. Because PowerShell is based on .NET, it's capable of doing things in a shell environment that were previously only possible in languages such as Visual Basic, VBScript, or C#.
As with any scripting language, one of the most important domains for PowerShell is the ability to work with strings and files (both text and binary). This series of articles is based on chapter 10 of Windows PowerShell in Action from Manning Publications. Chapter 10 examines how PowerShell handles text and file processing tasks, illustrating how to process and parse text using string objects and regular expressions. It also shows how to deal with paths and how to manipulate binary files. Another significant area covered in this chapter is how to work with XML. XML has become increasingly important both in the IT field and in software development. We show how to search, manipulate, and create XML documents using PowerShell.
Part 1 of this series looks at how to process, search, and manipulate strings and unstructured text using the .NET string object and regular expressions. In Part 2, author Bruce Payette, a technical lead on the Windows PowerShell team, focuses on file processing, while Part 3 looks at working with XML in PowerShell.
PowerShell is an object-based shell; however, the ability to process text is still very important. In this article, we'll look at techniques for splitting and joining strings using the [string] and [regex] members, as well as using filters to extract statistical information from a body of text.
One common scenario for scripting is processing log files. This requires breaking the log strings into pieces to extract relevant bits of information. Unfortunately, PowerShell version 1 has no split operator (one was added in later versions), so there is no way to split a string into pieces using the language itself. This is where support for .NET is important. If you want to split a string into pieces, you use the Split() method on the [string] class.
PS (1) > "Hello there world".Split()
Hello
there
world
The Split method with no arguments splits on spaces. In this example, it produces an array of three elements.
PS (2) > "Hello there world".Split().length
3
We can verify this with the length property. In fact, the Split() method splits on any of the characters that fall into the WhiteSpace character class. This includes tabs, so it works properly on a string containing both tabs and spaces (the sequence "`t" injects a tab character into a string):
PS (3) > "Hello`tthere world".Split()
Hello
there
world
In the revised example, we still get three fields even though a space character is used in one place and a tab in another.
While the default is to split on whitespace, you can specify a string of characters to use as separators when splitting fields.
PS (4) > "First,Second;Third".Split(',;')
First
Second
Third
Here we specified the comma and the semicolon as the characters on which to split the string.
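One caveat about the example above: passing ',;' as a plain string relies on PowerShell converting it to a character array. On newer .NET runtimes the same call may bind to an overload that treats the argument as a single separator string, so here is a minimal sketch that makes the intent explicit with a [char[]] cast. The variable name $fields is ours, for illustration only.

```powershell
# Cast the separator string to a character array so that each
# character (comma, semicolon) is treated as its own separator.
$fields = "First,Second;Third".Split([char[]]',;')

$fields          # First, Second, Third (one per line)
$fields.Length   # 3
```

With the cast in place, the call no longer depends on which Split() overloads the installed runtime happens to offer.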
There is an issue, however: the default behavior for Split() isn't necessarily what we want. Here's why: it splits on each separator character. This means that if we have multiple spaces between words in a string, we'll get multiple empty elements in the result array. For example:
PS (5) > "Hello there    world".Split().length
6
In this example, we end up with six elements in the array because the four spaces between "there" and "world" produce three empty elements. Let's find out if there's a better way to do this.
The string method we've been using has worked well so far, but we've gotten to the point where we need to add some cmdlets to help us out. In this case, we'll use the get-member cmdlet to look at the signature of the Split() method:
PS (6) > ("hello" | gm split).definition
System.String[] Split(Params Char[] separator), System.String[] Split(Char[] separator, Int32 count),
System.String[] Split(Char[] separator, StringSplitOptions options), System.String[] Split(Char[] separator,
Int32 count, StringSplitOptions options), System.String[] Split(String[] separator,
StringSplitOptions options), System.String[] Split(String[] separator, Int32 count, StringSplitOptions options)
The default display of the definition is a little hard to read. Fortunately, we know how to split a string.
PS (7) > ("hello" | gm split).definition.split(',')
System.String[] Split(Params Char[] separator)
System.String[] Split(Char[] separator
Int32 count)
System.String[] Split(Char[] separator
StringSplitOptions options)
System.String[] Split(Char[] separator
Int32 count
StringSplitOptions options)
System.String[] Split(String[] separator
StringSplitOptions options)
System.String[] Split(String[] separator
Int32 count
StringSplitOptions options)
OK, it's not perfect (it split on the method argument commas as well), but we can still read it. The methods that take the options argument look promising. Let's see what the StringSplitOptions are. We'll do this by trying to cast a string into these options.
PS (8) > [StringSplitOptions] "abc"
Cannot convert value "abc" to type "System.StringSplitOptions" due to invalid enumeration values. Specify one of the following enumeration values and try again. The possible enumeration values are "None, RemoveEmptyEntries".
At line:1 char:21
+ [StringSplitOptions] <<<< "abc"
The error message tells us the legitimate values for the enumeration. If we look this class up in the online documentation on MSDN, we'll see that this option tells the split method to discard empty array elements. This sounds just like what we need so let's try it:
PS (9) > "Hello there    world".split(" ",
>> [StringSplitOptions]::RemoveEmptyEntries)
>>
Hello
there
world
It works as desired. Now let's apply this to a larger problem.
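As a side note, the two behaviors can be compared in a few lines. The following is a minimal sketch, not part of the original session; the variable names are ours, and the sample string is assumed to contain four spaces between "there" and "world". It also lists the enumeration values directly with [Enum]::GetNames() rather than provoking a conversion error (newer .NET versions may report additional values).

```powershell
# List the legal values of the enumeration.
[Enum]::GetNames([StringSplitOptions])

# Four spaces separate "there" and "world" in this sample string.
$text = "Hello there    world"

# Default behavior: every separator character starts a new element,
# so a run of spaces produces empty strings.
$raw = $text.Split([char[]]' ')

# With RemoveEmptyEntries the empty strings are discarded.
$clean = $text.Split([char[]]' ', [StringSplitOptions]::RemoveEmptyEntries)

$raw.Length     # 6
$clean.Length   # 3
```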
Given a body of text, we want to find the number of words in the text as well as the number of unique words, and then display the 10 most common words in the text. For our purposes, we'll use one of the PowerShell help text files: about_Assignment_operators.help.txt. This is not a particularly large file (it's around 17 kilobytes), so we can just load it into memory using the get-content (gc) cmdlet.
PS (10) > $s = gc $PSHOME/about_Assignment_operators.help.txt
PS (11) > $s.length
434
The variable $s now contains the text of the file as a collection of lines (434 lines, to be exact). But we want to process this file as a single string. To do this, we'll use the String.Join() method and join all of the lines, adding a space between each pair of lines.
PS (12) > $s = [string]::join(" ", $s)
PS (13) > $s.length
17308
Now $s contains a single string containing the whole text of the file. We verified this by checking the length rather than displaying it. Next we'll split it into an array of words.
PS (14) > $words = $s.split(" `t",
>> [stringsplitoptions]::RemoveEmptyEntries)
>>
PS (15) > $words.length
2696
So the text of the file has 2,696 words in it. Now let's find out how many unique words there are. There are a couple of ways of doing this. The easiest way is to use the sort-object cmdlet with the -unique parameter. This will sort the list of words and then remove all of the duplicates.
PS (16) > $uniq = $words | sort -uniq
PS (17) > $uniq.count
533
So this help topic contains 533 unique words. Using the sort cmdlet is fast and simple, but it doesn't cover all of the things we said we wanted to do because it doesn't give the frequency of use. Let's look at another approach, using the foreach-object cmdlet and a hashtable.
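Before moving on, the steps so far (join the lines, split into words, remove duplicates) can be collected into one self-contained sketch. The two sample lines below are made up and stand in for the output of get-content, so the counts differ from the help file's; the variable names are ours.

```powershell
# Stand-in for Get-Content output: an array of lines (sample text is made up).
[string[]] $lines = @(
    'The value of the variable',
    'is assigned to  the variable'
)

# Join the lines into one string, separated by single spaces.
$text = [string]::Join(' ', $lines)

# Split on spaces and tabs, discarding the empty entries that
# consecutive separators would otherwise produce.
$words = $text.Split([char[]]" `t", [StringSplitOptions]::RemoveEmptyEntries)

# Sort-Object -Unique compares case-insensitively by default,
# so "The" and "the" count as one word.
$unique = $words | Sort-Object -Unique

$words.Length       # 10
@($unique).Count    # 7
```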
In the previous example, we used the -uniq parameter to sort-object to generate a list of unique words. Now we'll take advantage of the set-like behavior of hashtables to do the same thing, but also allow us to count the number of occurrences of each word.
In mathematics, a set is simply a collection of unique things. This is how the keys work in a hashtable. Each key in a hashtable occurs exactly once. Attempting to add a key more than once with the Add() method will result in an error. In PowerShell, assigning to the same key more than once replaces the old value associated with the key. So a hashtable is a set and a chess set isn't. Uh, OK.
Once again, we split the document into a stream of words. Each word in the stream will be used as the hashtable key, and we'll keep the count of the words in the value. Here's the script:
PS (18) > $words | % {$h=@{}} {$h[$_] += 1}
It's not much longer than the previous example. We're using the % alias for foreach-object to keep it short. In the begin clause in the foreach-object, we initialize the variable $h to hold the resulting hashtable. Then, in the process scriptblock, we increment the hashtable entry, indexed by the word. We're taking advantage of the way arithmetic works in PowerShell. If the key doesn't exist yet, the hashtable returns $null. When $null is added to a number, it is treated as zero. This allows the expression
$h[$_] += 1
to work. Initially the hashtable member for a given key doesn't exist. The += operator retrieves $null from the table, converts it to 0, adds one, and then assigns the value back to the hashtable entry.
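To see this counting idiom in isolation, here is a minimal sketch using a short, made-up word list instead of the help file. It spells out the begin and process scriptblocks with their parameter names; everything else follows the script above.

```powershell
# A tiny made-up word stream; note the mixed case of "the".
$words = 'the', 'cat', 'The', 'hat', 'the'

# -Begin runs once and creates the hashtable; -Process runs per word.
# A missing key yields $null, and $null + 1 evaluates to 1.
$words | ForEach-Object -Begin { $h = @{} } -Process { $h[$_] += 1 }

$h['the']   # 3, because hashtable literals compare keys case-insensitively
$h['cat']   # 1
$h.Count    # 3 distinct keys
```

The case-insensitive keys are why the hashtable approach agrees with sort -uniq, which also ignores case by default.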
Let's verify that the script produces the same answer for the number of words as we got with the sort -uniq solution.
PS (19) > $h.psbase.keys.count
533
We get 533, the same as before.
Note: We used $h.psbase.keys.count because there is a member in the hashtable that hides the keys property. To access the base keys member, we need to use the PSBase property to get the base member on the hashtable.
Now we have a hashtable containing all unique words and the number of times each word is used. But hashtables aren't stored in any particular order, so we need to sort it. We'll use a scriptblock parameter to specify the sort criteria, telling it to sort the list of keys based on the frequency stored in the hashtable entry for that key.
PS (20) > $frequency = $h.psbase.keys | sort {$h[$_]}
The words in the sorted list are ordered from least frequent to most frequent. This means that $frequency[0] contains the least frequently used word.
PS (21) > $frequency[0]
avoid
And the last entry in $frequency contains the most commonly used word. We can use negative indexing to get the last element of the list.
PS (22) > $frequency[-1]
the
It comes as no surprise that the most frequent word is "The" and it's used 300 times.
PS (23) > $h["The"]
300
The next most frequent word is "to", which is used 126 times.
PS (24) > $h[$frequency[-2]]
126
PS (25) > $frequency[-2]
to
Here are the top 10 most frequently used words in the about_Assignment_operators help text:
PS (26) > -1..-10 | %{ $frequency[$_]+" "+$h[$frequency[$_]]}
the 300
to 126
value 88
a 86
you 68
variable 64
of 55
$varA 41
For 41
following 37
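The same kind of report can also be produced by sorting the hashtable entries themselves instead of indexing into a sorted key list. This is a sketch of an alternative, not the approach used above; the sample counts are made up, and the names $top and $report are ours.

```powershell
# Made-up word counts standing in for the real hashtable.
$h = @{ the = 5; to = 3; a = 2; avoid = 1 }

# GetEnumerator() emits one key/value entry per pipeline object,
# which can then be sorted on Value and truncated.
$top = $h.GetEnumerator() |
    Sort-Object -Property Value -Descending |
    Select-Object -First 2

# Format each entry as "word count".
$report = $top | ForEach-Object { '{0} {1}' -f $_.Key, $_.Value }
$report   # "the 5" then "to 3"
```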
PowerShell includes a cmdlet that is useful for this kind of task: the Group-Object cmdlet. This cmdlet groups its input objects into collections sorted by the specified property. This means we can get the same type of ordering by doing the following:
PS (27) > $grouped = $words | group | sort count
And once again, we see that the most frequently used word is "the":
PS (28) > $grouped[-1]
Count Name Group
----- ---- -----
300 the {the, the, the, the...}
We can display the ten most frequent words by doing the following:
PS (29) > $grouped[-1..-10]
Count Name Group
----- ---- -----
300 the {the, the, the, the...}
126 to {to, to, to, to...}
88 value {value, value, value, value...}
86 a {a, a, a, a...}
68 you {you, You, you, you...}
64 variable {variable, variable, variable...
55 of {of, of, of, of...}
41 $varA {$varA, $varA, $varA, $varA...}
41 For {For, for, For, For...}
37 following {following, following, follow...
We get a nicely formatted display courtesy of the formatting and output subsystem built into PowerShell.
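For a version that can be tried without the help file, here is a minimal sketch on a made-up word list. It adds two options the session above did not use: -NoElement, which omits the Group column, and -Descending with Select-Object -First, which is an alternative to negative indexing.

```powershell
# Made-up word list; "The" and "the" fall into the same group
# because Group-Object compares case-insensitively by default.
$words = 'the', 'to', 'The', 'value', 'to', 'the'

$top = $words |
    Group-Object -NoElement |
    Sort-Object -Property Count -Descending |
    Select-Object -First 2

$top[0].Count   # 3
$top[1].Name    # to
```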
In this section, we saw how to split strings using the methods on the string class. We even saw how to split strings on a sequence of characters. But in the world of unstructured text, you'll quickly run into examples where the methods on [string] are not enough. As is so often the case, regular expressions come to the rescue. In the next couple of sections, we'll see how we can do more sophisticated string processing using the [regex] class.
In the previous section, we looked at doing basic string processing using members on the [string] class. While there's a lot you can do with this class, there are times when you need more powerful tools. This is where regular expressions come in. Regular expressions are a mini-language for matching and manipulating text.
There is a shortcut [regex] for the regular expression type. The [regex] type also has a Split() method, but it's more powerful because it uses a regular expression to decide where to split things instead of a single character.
PS (1) > $s = "Hello-1-there-22-World!"
PS (2) > [regex]::split($s,'-[0-9]+-')
Hello
there
World!
PS (3) > [regex]::split($s,'-[0-9]+-').count
3
In this example, the fields are separated by a sequence of digits bounded on either side by a dash. This is a pattern that couldn't be specified with String.Split().
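A regular expression separator also gives another answer to the repeated-whitespace problem from earlier. The following sketch repeats the split above and then splits on runs of whitespace using the pattern '\s+'; the second sample string (a tab plus four spaces) and the variable names are ours.

```powershell
# Separator is a dash, one or more digits, and another dash.
$fields = [regex]::Split("Hello-1-there-22-World!", '-[0-9]+-')

# '\s+' matches a whole run of spaces or tabs as one separator,
# so no empty elements are produced between the words.
$wordsOnly = [regex]::Split("Hello`tthere    world", '\s+')

$fields.Count      # 3
$wordsOnly.Count   # 3
```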
When working with the .NET regular expression library, the [regex] class isn't the only class you'll run into. We'll see this in the next example, when we take a look at using regular expressions to tokenize a string.
Tokenization, the process of breaking a body of text into a stream of individual symbols, is a common activity in text processing. The PowerShell interpreter has to tokenize a script before it can be executed. In the next example, we're going to look at how we might write a simple tokenizer for basic arithmetic expressions in a programming language. First we need to define the valid tokens in these expressions. We want to allow numbers made up of one or more digits; any of the operators "+", "-", "*", "/"; and also sequences of spaces. Here's what the regular expression to match these elements looks like:
PS (4) > $pat = [regex] "[0-9]+|\+|\-|\*|/| +"
This is a simple pattern using only the alternation operator "|" and the quantifier "+", which matches one or more instances. The backslashes escape characters such as "+" and "*" that would otherwise be treated as quantifiers. Since we used the [regex] cast in the assignment, $pat contains a regular expression object. We can use this object directly against an input string by calling its Match() method.
PS (5) > $m = $pat.match("11+2 * 35 -4")
The Match() method returns a Match object (full name System.Text.RegularExpressions.Match). We can use the Get-Member cmdlet to explore the full set of members on this object at our leisure, but for now we're interested in only three members. The first member is the Success property. This will be true if the pattern matched. The second interesting member is the Value member, which will contain the matched value. The final member we're interested in is the NextMatch() method. Calling this method will step the regular expression engine to the next match in the string, and is the key to tokenizing an entire expression. We can use this method in a while loop to extract the tokens from the source string one at a time. In the example, we keep looping as long as the Match object's Success property is true. Then we display the Value property, and call NextMatch() to step to the next token:
PS (6) > while ($m.Success)
>> {
>> $m.value
>> $m = $m.NextMatch()
>> }
>>
11
+
2
*
35
-
4
In the output, we see each token, one per line in the order they appeared in the original string.
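The loop can be wrapped in a small reusable function. The sketch below is one possible packaging, not code from the book: the function name Get-ArithmeticToken is hypothetical, and dropping whitespace-only tokens is our own design choice so that only numbers and operators are returned.

```powershell
function Get-ArithmeticToken {
    param([string] $Text)

    # Numbers, the four operators, or a run of spaces.
    $pattern = [regex] '[0-9]+|\+|-|\*|/| +'

    $m = $pattern.Match($Text)
    while ($m.Success) {
        # Emit the token unless it consists only of spaces.
        if ($m.Value.Trim().Length -gt 0) { $m.Value }
        $m = $m.NextMatch()
    }
}

$tokens = Get-ArithmeticToken '11+2 * 35 -4'
$tokens -join ','   # 11,+,2,*,35,-,4
```

Note that characters matching none of the alternatives are silently skipped by Match() and NextMatch(), so a stricter tokenizer would need to detect them explicitly.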
We now have a powerful collection of techniques for processing strings. Of course, these techniques are most interesting when applied to files, so file processing is the topic of the next installment in this series. In part two of the series, we'll look at finding, reading, writing, and copying files. We'll also review the basic file abstractions and namespaces in PowerShell. Please stay tuned!
This material was excerpted from the book Windows PowerShell in Action from Manning Publications. In the next installment of this series, we will continue by looking at file processing: reading and writing files, including binary files, with PowerShell.
Excerpt from Windows PowerShell in Action. ISBN 1-932394-90-7. Copyright 2007 Manning Publications. All rights reserved.