
About the Author Jan Dubois is the tech lead for all Perl technologies at ActiveState. He is a member of the core Perl5-Porters group that maintains the Perl language implementation. Jan wrote the Win32::OLE Automation support for Perl while he was still working in the financial industry. He has a Masters degree in Physics from the University of Hamburg, Germany. Jan now lives in Vancouver in Beautiful British Columbia, home of the 2010 Winter Olympics. |
Hello and welcome to the first event in the Advanced Perl division of the Winter Scripting Games 2008. My name is Jan Dubois, and I’m the Perl janitor at ActiveState (meaning the tech lead for ActivePerl, the Perl Dev Kit and ActivePerl Enterprise). I’ve been asked by The Scripting Guys to provide sample solutions and a running commentary on the advanced Perl events. I’m happy to do that, especially since I’m no longer allowed to participate in the competition because ActiveState has donated some prizes for the games. At least this way I still get to show off – share my thoughts on various Perl techniques.
The challenge
The first event requires the script to request a 7-digit phone number from the user and then select a corresponding phoneword from the “wordlist.txt” file. The letters in the phone number need to be mapped the same way they are represented on a North American phone dial.
How we are going to solve it
The problem seems pretty straight forward: We read the phone number from the user, and then read the wordlist file line by line, replacing all characters with the corresponding digits. We stop as soon as the resulting number matches the phone number, or when we reach the end of the file:
use strict;
use warnings;
print "Enter 7 digit phone number: ";
chomp(my $number = <>);
die "Invalid number: '$number'\n" unless $number =~ /^\d{7}$/;
my @groups = qw(ABC DEF GHI JKL MNO PRS TUV WXY);
my($letters, $digits);
foreach my $digit (2..9) {
my $group = $groups[$digit-2];
$letters .= $group;
$digits .= $digit x length($group);
}
open(my $fh, "<", "C:/Scripts/wordlist.txt") or die;
while (<$fh>) {
local $_ = $_;
eval "tr/$letters\L$letters\E/$digits$digits/";
last if /^$number$/o;
}
print uc if defined;
All scripts with more than a few lines should use the “strict” and “warnings” pragmata. Many IDEs, like Komodo, will add this boilerplate for you automatically when you create a new file with a “.pl” extension. Otherwise you may want to investigate the template system for your editor to see if you can set it up for yourself. In the long run, “strict” and “warnings” will save you plenty of time by alerting you to errors or problems in your scripts early on.
The script prompts for a phone number and stores the user input from STDIN without the trailing newline character in $number. A common mistake is to write the code like this:
my $number = chomp(<>);
This doesn't work because chomp() modifies its argument in-place and returns the character cut off from the end, not the modified argument.
die "Invalid number: '$number'\n" unless $number =~ /^\d{7}$/;
This line quickly verifies that $number really contains just 7 digits and nothing else. It is not required by the contest rules, but still good practice. The regexp \d is just a shortcut for [0-9], matching any digit, and {7} specifies that this pattern must be repeated exactly 7 times. The pattern is then anchored at the beginning and end of the string to make sure no additional characters can sneak through: /^\d{7}$/.The translation of words to numbers seems like an ideal task for Perl's tr/// operator. We set up the translation strings beforehand:
my @groups = qw(ABC DEF GHI JKL MNO PRS TUV WXY);
my($letters, $digits);
foreach my $digit (2..9) {
my $group = $groups[$digit-2];
$letters .= $group;
$digits .= $digit x length($group);
}
This results in 2 strings: $letters and $digits. $letters contains the uppercase characters that are being mapped and $digits contains the corresponding digits at the same offset in the string.
You can see that I used the expression “$digit x length($group)” while setting up the $digits string. Since all groups contain exactly 3 characters, I could have written “$digit x 3”, or even “$digit.$digit.$digit”. But I was originally confused by the event description, expecting mappings from a phone pad, not dial, and on a pad the groups sometimes contain 4 characters: “PQRS” and “WXYZ”. Even after I realized my mistake I left the code as is, because it is more flexible this way. And now I was also prepared if the Games organizers had decided at the last minute that discriminating against the letters “Q” and “Z” was simply not fair.
Now we can open the file:
open(my $fh, "<", "C:/Scripts/wordlist.txt") or die;
Always using the 3 argument version of open() is a good habit. The 2 argument version can be quite dangerous if you accept a filename from the user and don't check it for shell meta-characters:
chomp(my $filename = <>); open(FILE, $filename) or die;
This seems innocent enough, but if the user enters a string starting with the “>” character, then we are going to overwrite the file instead of reading it. And, even more scary, if the filename entered ends with the pipe character “|”, then open() would execute it via the shell and just pipe the output of that command back to FILE. Just think about what would happen if the user entered “rd/s/q C:\|” as the filename. So just get into the habit of always using 3 argument open() in case you ever forget to validate a filename provided by external sources.
Now we are ready to read the wordlist line by line, translating the words to numbers and checking to see if we found a match:
while (<$fh>) {
local $_ = $_;
eval "tr/$letters\L$letters\E/$digits$digits/";
last if /^$number$/o;
}
print uc if defined;
Reading from a filehandle as the sole expression in a while() statement automatically assigns the input data to $_. Inside the loop we are translating the word into a number string, but we still need to be able to print the original word if we get a match. That's what the local() function is doing here: creating a lexically scoped copy of the variable that will automatically be restored to the original value when we exit the scope.
The naïve translation might just have been “tr/$letters/$digits/”, but that doesn't work because tr/// doesn't support variable interpolation. We need to enclose the expression in a string eval. This simple expression also doesn't deal with lowercase letters, but there is an easy way to fix this now: we interpolate $letters twice, but the second time we surround it with the \L and \E meta-characters; that will turn any interpolated string into lowercase. Since the lowercase letters map to the same numbers as their uppercase equivalents, we can just double the $digits string in the replacement section:
eval "tr/$letters\L$letters\E/$digits$digits/";
We exit the loop when we have found a match. Since we have never called chomp() to strip the newline character off of $_ we need to match “$_ eq "$number\n"”. Alternatively I chose to use a regular expression, which matches against $_ implicitly: last if /^$number$/o;
This also makes use of the fact that the $ at the end of the regexp matches either at the end of the string or in front of the newline character at the end (the meta-symbol that matches only at the end of the string is \z).
Since $number is essentially a constant, there is no need to perform the interpolation and translation into a regular expression each time we iterate through the loop. The /o qualifier on the regexp tells Perl that it can compile just once and then reuse the code over and over because the interpolated string will never change.
We drop out of the loop when either we find a matching word or we reach the end of the wordlist file. If $_ is still defined, we found a match (and local() has now restored that word in $_), so we just print it after turning it into uppercase letters. Otherwise we exit silently:
print uc if defined;
Both uc() and defined() operate on $_ when invoked without an argument, so the line above is just a compact form of this:
print uc($_) if defined($_);
Note that we never chopped the trailing newline off of $_, so the print() function conveniently doesn't need to add one back.
However, there is still one potential problem with the solution above; did you spot it?
Tim Toady
“There is more than one way to do it” (TIMTOWTDI) is the Perl motto, so I thought I should provide an alternate solution, just for the fun of it.
The idea is to not rely on tr///, but instead use a hash lookup table to translate letters to numbers. And since the competition allows Perl 5.10 constructs, let's try to make use of a new feature as well:
use 5.010;
use strict;
use warnings;
use ActiveState::Prompt qw(prompt);
my $number = prompt(
"Enter phone number",
trim_space => 1,
must_match => qr/^\d{7}$/,
default => "7323464",
);
my(%digit) = ("\n" => "");
my @groups = qw(ABC DEF GHI JKL MNO PRS TUV WXY);
foreach my $digit (2..9) {
$digit{$_} = $digit for split //, $groups[$digit-2];
}
open(my $fh, "<", "C:/Scripts/wordlist.txt") or die;
while (<$fh>) {
last if $number eq join("", map(($digit{$_} // "X"), split(//, uc)));
}
print uc if defined;
I happen to know that ActivePerl already contains a handy little prompt() function that does simple parameter validation as well as allow for default values and some other things:
use ActiveState::Prompt qw(prompt);
my $number = prompt(
"Enter phone number",
trim_space => 1,
must_match => qr/^\d{7}$/,
default => "7323464",
);
It displays the prompt, reads the user input and validates it against the provided regexp after trimming leading and trailing whitespace. If the input doesn't match, the user is asked again. I also added “7323464” as the default input when the user just presses the Enter key, as a shortcut while testing the script (it has to print “READING” as the output in that case).
Instead of setting up strings for the tr/// operator we now compute the %digit lookup table:
my(%digit) = ("\n" => "");
my @groups = qw(ABC DEF GHI JKL MNO PRS TUV WXY);
foreach my $digit (2..9) {
$digit{$_} = $digit for split //, $groups[$digit-2];
}
The split() function with an empty regexp argument // will split the string into individual characters. We then assign the same $digit to the hash element for each character from the same group using the “for” statement qualifier. This is just a compact form of the following code:
my @characters = split(//, $groups[$digit-2]);
foreach my $char (@characters) {
$digit{$char} = $digit;
}
When we loop through the wordlist file we can now map each word into digits and compare it against $number:
while (<$fh>) {
last if $number eq join("", map(($digit{$_} // "X"), split(//, uc)));
}
The statement in the middle should best be read from right to left: We convert $_ to uppercase and then split it into a list of characters. Then we map each character to its corresponding hash value in %digit, or to the character “X”, if there is no entry for it in %digit. Finally we join all these translated characters back together, with no intervening delimiter, and compare the result against the phone number in $number.
The interesting part here is the use of the “defined or” operator // to map to either the hash value, or to “X”. This is necessary because the wordlist also contains strings containing numbers or special characters, like “A-1”. Without this special treatment we would map all non-letter characters to “undef” which turns into the empty string in string context. So “A-1” would be mapped the same as just “A”.
Remember that $_ still contains a trailing newline character. This too would have been translated to an “X” if we hadn't seeded the hash table like this:
my(%digit) = ("\n" => "");
This also shows why in this case we really needed the “defined or” operator, new in Perl 5.10, and couldn't use the standard “logical or” operator instead: $digit{"\n"} is the empty string, which is still false in Boolean context, so it would still be replaced with an “X”.
For some reason (maybe I'm just weird) I really like this line:
last if $number eq join("", map(($digit{$_} // "X"), split(//, uc)));
It shows 2 occurrences of // with completely different meanings, depending on the context in which they appear. Just imagine what the Perl parser internals must look like ….
I also wanted to test if there is any noticeable performance difference between these 2 solutions, but I see that the action in the figure skating competition in event 2 is heating up. I'll head over there now, but if you have a little extra time, why don't you try to compare the execution speeds using the Benchmark module and let me know what you find at jand@activestate.com.
PS: The potential problem I alluded to earlier in the first solution is that it isn't invalidating any digits already found in the input data. It will treat “AA” and “A2” exactly the same, potentially returning a word containing digits as the result. The second solution doesn't have this problem because all digits will be mapped to “X”.