import-bot (20211)
#1
[Originally posted by cjmackie]

Hi Dave; Thanks for the book--it's going to be useful for teaching as well as
a handy resource for my own work. I wish it had been around a few years back,
when I first started doing this! I have some suggestions for version 2, and
then a question about your Ch11 source code.

In the next version, I'd like to lobby for more coverage of munging
unstructured text data (my own daily grind). I think you shrugged off a
little too quickly the amount of structured info you can retrieve from text.
Equally important, with the rise of HTML and especially XML, I think the
automation of unstructured text parsing is a growth industry. Here are some
of the problems I encounter daily, and for which there are well-documented
perl idioms and/or aids:

--parsing names and proper nouns
--parsing sentences from paragraphs
--reformatting text-with-newlines to newlines-at-para-only
--dealing with typos
--identifying semantic structures from syntactical structures (a little
digression into format-based searching in MS-Word would complement the perl
code here)
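
As an illustration of the third item, here is a minimal sketch of rewrapping hard-wrapped text so that newlines remain only between paragraphs. It assumes paragraphs are separated by one or more blank lines; the function name reflow_paragraphs and the sample text are made up for the example.

```perl
use strict;
use warnings;

# Join hard-wrapped lines so that newlines survive only between paragraphs.
# Assumption: paragraphs are separated by one or more blank lines.
sub reflow_paragraphs {
    my ($text) = @_;
    my @paras = split /\n\s*\n/, $text;    # split on blank lines
    for my $para (@paras) {
        $para =~ s/^\s+|\s+$//g;           # trim both ends
        $para =~ s/\s*\n\s*/ /g;           # inner newlines become one space
    }
    return join "\n", grep { length } @paras;
}

# Made-up sample input: two paragraphs, each wrapped over two lines.
my $wrapped = "First line of para one,\ncontinued here.\n\n"
            . "Para two starts\nand ends here.\n";
print reflow_paragraphs($wrapped), "\n";
```

Real data usually needs extra rules on top of this, for example for indented first lines or words hyphenated across a line break.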

An example problem might be parsing a bibliography from a Word or WordPerfect
document into a tagged-field format suitable for import into EndNote, ProCite,
or BibTeX (i.e., one of the citation-manager database formats). I can provide
all the example data you need, and my own munging code, as a starting point...
:-) Another might be munging TeX or other (non-X/HTML) tagged text data.

Also, I suspect that no future discussion of munging will be complete without
at least an intro treatment of XSLT....
------------
Now, the question:
I've been playing with Parse::RecDescent, working through Chap11 on my own
data. The parse is working great, but getting the data out is proving very
tricky. I can't seem to get the actions to work on my own stuff, so just to
be thorough I downloaded your source code, and it doesn't seem to be working
the way it should either. When I run cd_long.pl, I get the following on
STDOUT/STDERR:

$VAR1 = undef;

There's no sign of any usable output. As with my own data, the text is
properly inserted into the RecDescent object, and you can see results from the
parse by 'print'ing at the rule-level: I checked. I'm using ActivePerl 5.6.1,
RecDescent v1.80, and Data::Dumper v2.11. I've tried it on WinNT4 and on
Win2k--same results.

Any thoughts about what's happening? Have I missed something in the text, has
syntax changed since you wrote, is the ActiveState port non-standard, did 5.6
break something...?

--Chris

BTW, here's an excerpt from my own work. I need to munge 100k downloaded
newspaper and wire service stories into an XML-like format. Comments or
suggestions on how to build a parse tree for each file are welcome:

file: filehead story(s)
story: head body foot
head: headtag(s) # b/c not all headers have all tags, nor in same order
headtag: headline
headtag: byline
headtag: wordcount
...
headline: /../ # regexp to parse headline
byline: /../ # ditto byline
dateline: /../ # ditto dateline
body: bodytag(s)
bodytag: lead_para
bodytag: para
bodytag: table
bodytag: separator
bodytag: section_head
... # same basic approach as header
foot: foottag(s)
... # same basic approach as header

As I say, it parses great according to print statements on the individual
rules, but when I try to generalize the parse_tree code from your chapter to
produce an overall tree for each file, or for complex productions, it keeps
telling me that it can't coerce an array to a hash.
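
One likely source of that error: in Parse::RecDescent a repeated subrule such as headtag(s) yields a reference to an array with one element per match, so code that dereferences the result as a hash will fail. Below is a minimal sketch in plain Perl that simulates such a result and builds a hash from it. The [name, value] pair returned by each headtag action, the sample values, and the helper name pairs_to_hash are all assumptions made for the illustration, not code from the chapter.

```perl
use strict;
use warnings;

# Simulated result of a repeated subrule such as headtag(s): a reference
# to an array with one element per match. The [name, value] pairs and the
# values themselves are made up for this sketch.
my $headtags = [
    [ headline  => 'Example headline' ],
    [ byline    => 'A. Reporter'      ],
    [ wordcount => 512                ],
];

# Using $headtags->{headline} directly would treat an array as a hash.
# Build the hash explicitly from the pairs instead.
sub pairs_to_hash {
    my ($pairs) = @_;
    my %record;
    for my $pair (@$pairs) {
        my ($name, $value) = @$pair;
        $record{$name} = $value;
    }
    return \%record;
}

my $head = pairs_to_hash($headtags);
print "$_: $head->{$_}\n" for sort keys %$head;
```

Inside a grammar, the same conversion would sit in the action for the head rule and be applied to $item[1], the result of headtag(s).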
import-bot (20211)
#2
Re: Source code erratum or...?
[Originally posted by dave]

Chris,

Thanks for the post. You're right that there's an error in that example. In
fact there are two errors, one in the program and one in the data.

Where the program reads the text, it reads it from the DATA filehandle where
it should be reading it from STDIN.
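
A minimal sketch of that first fix, assuming the script slurps its whole input into one scalar before handing it to the parser. The helper name slurp and the sample text are hypothetical, and this is not the actual code from cd_long.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle into one string.
sub slurp {
    my ($fh) = @_;
    local $/;                 # no record separator: read up to end of file
    my $text = <$fh>;
    return defined $text ? $text : '';
}

# Demonstrated on an in-memory filehandle so the sketch runs on its own.
# In the script itself the call would be slurp(\*STDIN), not slurp(\*DATA).
my $sample = "line one\nline two\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $text = slurp($fh);
close $fh;
print length($text), " characters read\n";
```

With the filehandle switched to STDIN, the data file is supplied by redirecting it on the command line.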

The CD definition for "Earthling" is slightly out. The "E" in "Earthling" is
one character too far to the right. If you correct that and realign the rest
of the line then I hope you'll find it works. If not, set $::RD_TRACE to 1 to
see where the parsing breaks down.

And thanks for the suggestions for the book. They sound very interesting, but
perhaps a bit too much work for the 2nd edition - they sound like a whole new
book to me :-)

Cheers,

Dave...
import-bot (20211)
#3
Re: Source code erratum or...?
[Originally posted by cjmackie]

Thanks, Dave; I had already fixed the DATA issue, but you're right, it was
the spacing that was throwing things off. It's working fine now.

As for adjustments to the 2nd ed., I hope you'll at least think about the
XSLT addition--it's a nice complement to what you're doing already, and there
are few good published tutorials. Based on your 'intro to perl' section and
the rest of the chapters, I think you'd do a good job.

--Chris