import-bot (20211) [Avatar] Offline
#1
[Originally posted by graham patterson]

Hi Andrew,

I was just experimenting with the word count program on p119 (with the online
errata correction and a minimum word size of 1) and decided to use the first
two pars of section 6.5 on p116 to test it (from "Let's consider ..." to "...
as an argument."). To my surprise it gave a count of only 61 words instead of 72
(assuming "Let's" is one word, not two). Curious, I added a debugging
statement in the foreach loop like so:

print "$word\n" if $word !~ m/^[a-zA-Z]+([-'][a-zA-Z]+)*$/;

and -- obvious once you've seen it -- it printed out all the words with leading
or trailing punctuation marks. So as it stands, the program doesn't appear to
meet the spec on p116, which says it should "count *all* the words" in a file.
After playing around a while, I changed the regex to:

$count{$word}++ if $word =~ m/[a-zA-Z]+([-'][a-zA-Z]+)*/;

and thought for a moment I'd fixed it as this gives a count of 72 words. But
my joy was short-lived, as another test on the following simple text:

blah? blah! blah, blah, blah!!!

shows that the program now counts the punctuation as part of the word and
therefore messes up the stats for the unique words and average word length
(e.g., it gives a count of four unique words and an average length of 5.4 for
the above, even though a human [okay, *this* human] would say it's just the
same four-letter word five times). I tried to modify the program so that it
would chop off the leading or trailing punctuation before calculating the
stats, but wasn't able to. Searching the docs and clpm turned up various other
regexen and caveats about word counting being a non-trivial task. Hmm ...
guess I just found that out the hard way :-)

So in order to fix the program, we need to modify it some way so that it does
count all the words, but doesn't take into account the leading and trailing
punctuation when it calculates the stats. Any suggestions?

Thanks once again for your help,
Graham
#2
Re: regexen and word count program
[Originally posted by jandrew]

Graham,

You are correct --- word counting is *hard*, and the program in
the book should have a few more caveats listed ... although the
reader is invited to work out a more precise regex :-)

Looking at just the problem of leading and trailing punctuation,
a simple fix is to try to strip out the kinds of leading or trailing
punctuation we might see:

foreach my $word (@words) {
    $word =~ s/^[("']+//;
    $word =~ s/[.?!():,"']+$//;
    next if length($word) < $size;
    $count{$word}++ if $word =~ m/^[A-Za-z]+(?:[-'][A-Za-z]+)*$/;
}

That gives a count of 72 for the text in question --- but it still
isn't perfect by any means. If the word "following" is hyphenated
as in the book, the count is still OK, but the word actually counted
is just "lowing". If we read in the textual data in whole paragraphs
we could fix such hyphenated breaks. But we'd still miss words
emphasized with asterisks like *this* one (or <this> or _this_ ...).

We also haven't considered mixed case ("blah" vs "Blah") for tracking
unique words, or even what constitutes a "word" --- a document on
language may make mention of some word prefixes: non, un, pre, ... or
suffixes: ed, ing, ... do we count those as two- and three-letter
words?

All in all, counting *words* can be a rather complex task ---
fortunately, for most purposes, a simple approximation is usually
good enough.

andrew
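
For anyone who wants to try the fix above on Graham's "blah" test text, here is a minimal, self-contained sketch. The count_words helper, the whitespace split, and the use of lc to fold case are assumptions made for illustration; they are not taken from the book's program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the book): strip leading and trailing
# punctuation from each whitespace-separated chunk, then tally the words
# that are at least $size characters long.
sub count_words {
    my ($text, $size) = @_;
    my %count;
    foreach my $word (split ' ', $text) {
        $word =~ s/^[("']+//;          # strip leading punctuation
        $word =~ s/[.?!():,"']+$//;    # strip trailing punctuation
        next if length($word) < $size;
        # lc() is an added assumption so "Blah" and "blah" count together
        $count{lc $word}++ if $word =~ m/^[A-Za-z]+(?:[-'][A-Za-z]+)*$/;
    }
    return %count;
}

my %count = count_words('blah? blah! blah, blah, blah!!!', 1);

my ($total, $letters) = (0, 0);
foreach my $word (keys %count) {
    $total   += $count{$word};
    $letters += length($word) * $count{$word};
}

# prints: 5 words, 1 unique, average length 4.0
printf "%d words, %d unique, average length %.1f\n",
    $total, scalar(keys %count), $letters / $total;
```

With the punctuation stripped before counting, the five "blah" variants collapse into one unique four-letter word, which is the result Graham expected.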
#3
Re: regexen and word count program
[Originally posted by graham patterson]

A-ha! <sound of forehead being slapped> I was trying (and failing) to strip
off the leading and trailing punctuation inside the foreach loop using only a
*single* regex. Suddenly it looks so simple when you have one regex for
leading and one regex for trailing punctuation ...

Thanks again, Graham
#4
Re: regexen and word count program
[Originally posted by jandrew]

Graham,

you wrote:
> A-ha! <sound of forehead being slapped> I was trying (and failing) to strip
> off the leading and trailing punctuation inside the foreach loop using only a
> *single* regex. Suddenly it looks so simple when you have one regex for
> leading and one regex for trailing punctuation ...

It can be done in one regex using an alternation:

$word =~ s/^[("']+|[.?!():,"']+$//g;

but that is actually less efficient (alternation is slow) and less
readable in my opinion.

andrew
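
As a quick check that the single alternation regex and the two separate substitutions strip the same punctuation, here is a small sketch. The sub names strip_two and strip_one and the sample words are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two separate substitutions: one for leading, one for trailing punctuation.
sub strip_two {
    my ($word) = @_;
    $word =~ s/^[("']+//;
    $word =~ s/[.?!():,"']+$//;
    return $word;
}

# One substitution using alternation; /g is needed so that both the
# leading and the trailing match are replaced in the same pass.
sub strip_one {
    my ($word) = @_;
    $word =~ s/^[("']+|[.?!():,"']+$//g;
    return $word;
}

foreach my $raw (q{"Hello,"}, q{(blah!!!)}, q{it's}) {
    printf "%-12s two: %-6s one: %s\n",
        $raw, strip_two($raw), strip_one($raw);
}
```

Both subs should return the same stripped word for each input. To test the efficiency claim on your own machine, the cmpthese function in Perl's core Benchmark module is one way to time the two versions against each other.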
#5
Re: regexen and word count program
[Originally posted by graham patterson]

Hi Andrew,

I agree, that one regex looks a lot less readable than the two simpler ones.
Interesting to know that it's also less efficient ...

The approach I was using to try and strip the leading and trailing punctuation
in one regex was to use backreferences to try and capture the "word" chunk
while stripping off the punctuation, as in the following (this is grossly
oversimplified, but I hope it gives the idea):

#!/usr/bin/perl -w
use strict;
my $word = "'hello'";
print "$word\n";
$word =~ s/["'](\w+)["']/$1/;
print "$word\n";

This mini-program appears to work as I expected, capturing the "word" chunk in
$1, but my attempts to build a more elaborate version that would work in the
foreach loop of the word count program were unsuccessful, although I did
become a lot more familiar with perldiag than I was planning to at this stage
of the learning process %-)

Assuming I can get the syntax right eventually, would using backreferences in
this way also be a good method of stripping the punctuation, and would it be
more or less efficient than the alternatives? Or is there some terribly
obvious flaw in this approach that I haven't spotted?

Thanks again,
Graham

PS: Still really enjoying the book, BTW!
#6
Re: regexen and word count program
[Originally posted by jandrew]

Graham,

Your backref approach was a good attempt and it can work, but you
will need a non-greedy quantifier (which I don't mention until
chapter 10's more extended discussion of regular expressions). In
short, a quantifier matches greedily, that is, it matches as much as
it can while still allowing the remainder of the pattern to match ---
a non-greedy version (created by appending a ? to a normal
quantifier) matches as little as possible while still allowing the
remainder of the pattern to match.

So we could use the following to do the same as our previous
two-regex version:

$word =~ s/^[("']*(.*?)[.?!():,"']*$/$1/;

reading this as separate components we have:

^              # match start of string, followed by
[("']*         # zero-or-more of these, followed by
(.*?)          # zero-or-more anything (non-greedy), followed by
[.?!():,"']*   # zero-or-more of these, followed by
$              # end of string
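
To see why the non-greedy quantifier matters here, the following sketch runs the same substitution with a greedy (.*) and a non-greedy (.*?) capture. The sub names strip_greedy and strip_lazy and the sample word are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Greedy capture: (.*) grabs everything up to the end of the string,
# leaving nothing for the trailing punctuation class to remove.
sub strip_greedy {
    my ($word) = @_;
    $word =~ s/^[("']*(.*)[.?!():,"']*$/$1/;
    return $word;
}

# Non-greedy capture: (.*?) takes as little as possible, so the trailing
# class gets to match the punctuation at the end of the string.
sub strip_lazy {
    my ($word) = @_;
    $word =~ s/^[("']*(.*?)[.?!():,"']*$/$1/;
    return $word;
}

my $raw = q{"hello!"};
print "greedy: ", strip_greedy($raw), "\n";   # prints: greedy: hello!"
print "lazy:   ", strip_lazy($raw), "\n";     # prints: lazy:   hello
```

In the greedy case the trailing class is still allowed to match zero characters, so the pattern succeeds without ever giving back the punctuation that (.*) swallowed.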

Note, this will also be less efficient than the twin regex version
--- chapter 10 should help in understanding why (it can be read any
time after chapter 6), in particular the notion of backtracking plays
a role. Getting your head inside regular expressions isn't easy, but
it is fun (for some definitions of fun) :-)

regards,
andrew
#7
Re: regexen and word count program
[Originally posted by graham patterson]

Thanks Andrew,

I had been wrestling with something pretty close to your example, except of
course you correctly divined that I was indeed missing the non-greedy
quantifier -- now it works fine. Looks like I will be adding perlretut and
perlre to my weekend reading list. Should be, um, fun!

Cheers, Graham