import-bot (20211)
#1
[Originally posted by gzcao]

Hi Iain,

In this class you are appending a couple of spaces after each ">". I know
from experience that this hack is necessary. I tried to replace this with a
native reader class (e.g. BufferedInputStreamReader) and ran into problems
with the SAX parser not firing endElement events before the next element is
on the stream.

Is this a Xerces bug? If not, can you generalize the rationale for this trick
(e.g. you have to do this whenever you want to parse an XML output stream from
a socket)?

Thanks,

George
import-bot (20211)
#2
Re: XercesReader
[Originally posted by iainshigeoka]

Hi George!

Sorry for the delay in replying. I messed up my email forwarding and ended up
with a mailbox that has been outside my normal email checking since the 15th.
*sigh*

> In this class you are appending a couple of spaces after each ">". I know
> from experience that this hack is necessary. I tried to replace this with a
> native reader class (e.g. BufferedInputStreamReader) and ran into problems
> with the SAX parser not firing endElement events before the next element is
> on the stream.
>
> Is this a Xerces bug? If not, can you generalize the rationale for this trick
> (e.g. you have to do this whenever you want to parse an XML output stream from
> a socket)?

This is a Xerces "feature". :) Xerces does some caching to speed up parsing.
Unfortunately this bites us when we want to use it to read streaming
information, as the cache doesn't trigger parsing until it is full. As far as
I can tell from some very brief experimentation, this cache can be made as
small as 3 characters but not eliminated (at least I haven't seen an easy way
to do this). This means that once we finish an element, we need to flush the
cache by sending 2 extra spaces so that it fills up. I do this after
every ">" and it seems to work fairly well.
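To make the padding trick concrete, here is a minimal sketch of a Reader wrapper that hands the parser two extra spaces after every ">". This is not the book's XercesReader; the class name PaddingReader and the helper pad() are made up for illustration, and it assumes that blindly padding after every ">" is acceptable for the stream (a ">" inside a comment, CDATA section or attribute value gets padded too).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch, not the book's XercesReader.
class PaddingReader extends FilterReader {
    private int pending = 0; // padding spaces still owed to the caller

    PaddingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pending > 0) {
            pending--;
            return ' ';
        }
        int c = in.read();
        if (c == '>') {
            pending = 2; // owe two spaces after the end of a tag
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        if (pending == 0) {
            // Read a single character so we never wait for more input
            // than is needed to hand something back to the parser.
            int c = in.read();
            if (c == -1) {
                return -1;
            }
            buf[off + n++] = (char) c;
            if (c == '>') {
                pending = 2;
            }
        }
        // Hand out as much of the owed padding as fits in the buffer.
        while (n < len && pending > 0) {
            buf[off + n++] = ' ';
            pending--;
        }
        return n;
    }

    // Illustration only: runs a string through the reader and returns the result.
    static String pad(String xml) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4];
        try (Reader r = new PaddingReader(new StringReader(xml))) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

For example, PaddingReader.pad("<a><b/></a>") returns "<a>  <b/>  </a>  ". In real use the wrapped Reader would sit on top of the socket's input stream and be handed to the parser through an InputSource. The bulk read deliberately reads one character at a time so that it does not block waiting for data that has not been sent yet.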

This is a hack, though. I'm sure this workaround is going to produce some
bugs. Ideally, I think it would be best to dig into the Xerces source code and
modify the parser to better fit parsing streaming data. I haven't done this,
though, so I can't say how much work it would be.

Alternatively, there are other parsers out there that could work out better.
I think that for Jabber's XML stream format, the best tool will probably be a
pull parser. These XML parsers don't push events to you, but rather allow you
to pull parsed information (what is passed in SAX as events). This has
implications for scaling as you don't need to devote a thread to each parse
stream with a pull parser.
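To show what "pulling" looks like in code, here is a small sketch using StAX (javax.xml.stream), the pull API bundled with current JDKs. That is a stand-in I picked so the example runs without extra jars; it is not kXML, whose API differs, and the class name PullDemo is made up. The sketch only illustrates the programming model. Whether a given pull parser hands back events promptly on a live socket depends on its own buffering, so that would need testing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; StAX is used here in place of kXML.
class PullDemo {
    // Pulls events one at a time and records element starts and ends.
    static List<String> events(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // The caller drives the loop: nothing is parsed until next() is called.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    out.add("start:" + reader.getLocalName());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    out.add("end:" + reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }
}
```

Calling PullDemo.events("<stream><message>hi</message></stream>") returns start:stream, start:message, end:message, end:stream. Because the application asks for each event, it can decide when to read from which connection, rather than having a parser thread call back into it as SAX does.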

If you are interested, there are several pull parsers out there to try. There
is one from the Enhydra project called kXML (http://kxml.org/). It also has
the benefit of being small and J2ME compatible. A Google search should turn
up several other Java pull parsers.

Does that help?

-iain