saitology9 <[email protected]> writes:
> It seems like tdom html parsing doesn't work well with partial html
> strings that don't necessarily include the full doctype/head/body/etc.
> tags. tdom seems to return nodes only for the first tag and not the
> rest; meaning that if there are two "<p>" tags in sequence for
> example, it processes only the first one.
>
> That is fine if this is the expected behavior but if not, what is the
> correct way to do this?

You'd better update to the current tdom 0.9.3 (which provides a solution
to your question).
While the -html5 parser (if it is built in; that requires the gumbo
HTML5 parser lib to be present at build time and the configure switch
--enable-html5) is very robust (it digests nearly any tag soup), it may
not be the right thing for this problem, because it always creates a
single document root and inserts missing elements implied by the
context (such as <head>, <tbody>, etc.).
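To see what that means in practice, here is a small sketch. It assumes your build answers [dom featureinfo html5] truthfully, and the two-paragraph input string is just an illustration. I'm not stating the exact output here, only that you should see one root with the implied elements wrapped around your <p> tags.

```tcl
package require tdom 0.9.3

# -html5 is only usable if tDOM was configured with --enable-html5.
if {[dom featureinfo html5]} {
    set doc [dom parse -html5 "<p>hello</p> <p>there</p>"]
    # Serialize the whole document to inspect what the parser added.
    puts [$doc asHTML]
    $doc delete
} else {
    puts "this tDOM build has no -html5 support"
}
```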
You want to parse an HTML fragment like
"<p>hello</p> <p>there</p>"
But what DOM tree do you expect to get from that? That document or
fragment doesn't have a single root, as HTML and XML documents must.
So if you are fine with getting a DOM _forest_ instead of a DOM tree,
just do:
package require tdom 0.9.3
dom parse -html -forest "<p>hello</p> <p>there</p>" doc
$doc asXML
This script returns this to me:
<p>hello</p>
<p>there</p>
tDOM's DOM methods (and the XPath engine) work fine with such a
"forest", and in a natural way. It is just that you no longer have the
pattern
set root [$doc documentElement]
with all of your data as descendants of that one root (remember, you
have a forest, not a tree).
The "other" root nodes besides the one you still get from [$doc
documentElement] are (next) siblings of that one. Or you can get all
the roots of your forest with [$doc childNodes]. Hope these hints get
you started.
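
A minimal sketch of both ways to walk the roots, plus an XPath query over the forest. The variable names (roots, viaSiblings) are mine, and the nodeType check is only a precaution in case non-element nodes show up at the top level.

```tcl
package require tdom 0.9.3

set doc [dom parse -html -forest "<p>hello</p> <p>there</p>"]

# Variant 1: ask the document for all of its top-level nodes.
set roots {}
foreach node [$doc childNodes] {
    # Keep elements only, in case text or comment nodes sit between them.
    if {[$node nodeType] eq "ELEMENT_NODE"} {
        lappend roots $node
    }
}

# Variant 2: start at the first root and follow the sibling chain.
set viaSiblings {}
set node [$doc documentElement]
while {$node ne ""} {
    if {[$node nodeType] eq "ELEMENT_NODE"} {
        lappend viaSiblings $node
    }
    set node [$node nextSibling]
}

# Print the tag name and text content of every root element.
foreach node $roots {
    puts "[$node nodeName]: [$node text]"
}

# XPath runs against the whole forest, not just the first root.
puts [llength [$doc selectNodes //p]]
```

Both variants should give you the same list of nodes; pick whichever fits your code better.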
rolf