An example using a hash and a list

comment No Comments Written by Anders on April 26, 2008 – 6:17 am

This is a Perl classic: a program that illustrates how Perl is far better suited to text processing tasks than are languages like C, C++ or Java. The program has to count the number of occurrences of all distinct words in a document and then print a sorted list of these counts. For this program, a ‘word’ is any sequence of alphabetic characters; all non-alphabetic characters are ignored. For counting purposes, words are all converted to lower case (so ‘The’ and ‘the’ would be counted as two occurrences of ‘the’).

The program has to loop, reading lines from STDIN. Each line can be split into words. Each word (after conversion to lower case) serves as a key into a hash; the associated value is the count of occurrences of that word. Once all the input data have been processed, a list of the words (keys of the hash) can be obtained and sorted, and the sorted list used in a foreach loop that prints each word and the associated count.

#!/share/bin/perl -w
while($line = <STDIN>) {
@words = split /[^A-Za-z]/ , $line;
foreach $word (@words) {
if($word eq “”) next;
$index = lc $word;
$counts{$index}++;
}

}
@sortedkeys = sort keys %counts;
foreach $key (@sortedkeys) {
print “$key\t$counts{$key}\n”;
}

The split function uses a regular expression that breaks the string held in $line at any non-alphabetic character (the set of all alphabetic characters is specified via the expression A-Za-z; here the ^ symbol implies the complement of that set.) A sequence of letters gets returned as a single element of the resulting list; each non-alphabetic character results in the return of an empty string (so if the input line was “test 123 end” the list would be equivalent to “test”, “”, “”, “”, “”, “”, “end”). Empty words get discarded. The lc function is used to fold each word string to all lower-case characters.

The line $counts{$index}++ is again playing Perl tricks. It uses the value of $index to index into the hash %counts. The first time a word is encountered in the input, there will be no value associated with that entry in the hash – or, rather, the value is Perl’s ‘undef’ value. In a numeric context, such as that implied by the ++ increment operator, the value of ‘undef’ is zero. So, the first time a word is encountered it gets an entry in the hash %counts with a value 1; this value is incremented on each subsequent occurrence of the same word. If you wanted the results sorted by frequency, rather than alphabetically, you would simply provide an inline helper sort function:

foreach $key (sort { $counts{$a} <=> $counts{$b} } keys %counts) {
print “$key\t$counts{$key}\n”;
}

The sort function’s first argument is the inline code for element comparison, and its second argument is the list of words as obtained by keys %counts. The inline function uses the sort’s globals $a and $b as indices into the hash to obtain the count values for the comparison test.

The Perl solution for this problem is of the order of ten lines of simple code. Imagine a Java solution; you would need a class WordCounter, which would employ a java.util.StringTokenizer to cut up an input string obtained via a java.io. BufferedReader. Words would have to be stored in some map structure from the java.util library. The code would be considerably longer and more complex. A C++ programmer would probably be thinking in terms of the STL and map classes. A C programmer would likely start from scratch with int main(int argc, char** argv). Each language has its own strengths. One of Perl’s strengths is text processing. All small text processing tasks, like the word counter, are best done in Perl.

Bookmark or Share:
  • E-mail this story to a friend!
  • Technorati
  • StumbleUpon
  • Facebook
  • Google
  • del.icio.us
  • Digg
  • Slashdot
If you enjoyed the article, why not subscribe?
Tags: , , ,

Browse Timeline

Post a Comment

About The Author: Anders

Anders is a freelance graphic designer. He specializes in CSS/XHTML web design and design of print materials including business cards, brochures and flyer’s. You can view his portfolio at andershaig.com.

Want to subscribe?

SEO blog and web design related issues. Subscribe in a reader Or, subscribe via email:
Enter your email address: