Forum: >>> Magnum BBS <<<

tdom html mode

From saitology9@21:1/5 to All on Tue Apr 25 12:31:08 2023

It seems like tdom html parsing doesn't work well with partial html
strings that don't necessarily include the full doctype/head/body/etc.
tags. tdom seems to return nodes only for the first tag and not the
rest; meaning that if there are two "" tags in sequence for example,
it processes only the first one.

That is fine if this is the expected behavior but if not, what is the
correct way to do this?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ted Nolan @21:1/5 to [email protected] on Tue Apr 25 17:34:39 2023

In article <u28v8d$ueo3$[email protected]>,
saitology9 <[email protected]> wrote:

> It seems like tdom html parsing doesn't work well with partial html
> strings that don't necessarily include the full doctype/head/body/etc.
> tags. tdom seems to return nodes only for the first tag and not the
> rest; meaning that if there are two "" tags in sequence for example,
> it processes only the first one.
>
> That is fine if this is the expected behavior but if not, what is the
> correct way to do this?

I find that I always have better results with tdom parsing if I use the "-html5" option. Are you using that?
--
columbiaclosings.com
What's not in Columbia anymore..


From saitology9@21:1/5 to All on Tue Apr 25 14:40:35 2023

On 4/25/2023 1:34 PM, Ted Nolan <tednolan> wrote:

> I find that I always have better results with tdom parsing if I use the
> "-html5" option. Are you using that?

No, I am not. However, my tdom doesn't recognize this option. I just
reviewed the tdom docs and there wasn't any mention of it.

For reference, this is what I have:

% package req tdom
0.9.1

% dom parse -html "hello there"
domDoc010BC518


From Ted Nolan @21:1/5 to [email protected] on Tue Apr 25 19:25:10 2023

In article <u296r4$vqbb$[email protected]>,
saitology9 <[email protected]> wrote:

> On 4/25/2023 1:34 PM, Ted Nolan <tednolan> wrote:
>
>> I find that I always have better results with tdom parsing if I use the
>> "-html5" option. Are you using that?
>
> no I am not. However, it doesnt recognize this option. I just reviewed
> the tdom docs and there wasn't any mention of this option.
>
> For reference, this is what I have:
>
> % package req tdom
> 0.9.1
>
> % dom parse -html "hello there"
> domDoc010BC518

It's a compile option:

http://www.tdom.org/index.html/doc/trunk/doc/dom.html

-html5
This option is only available if tDOM was built with
--enable-html5. Try the featureinfo method if you need
to know if this feature is built in.
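
(Editorial sketch, not part of the original post.) The featureinfo check mentioned in the docs excerpt could look roughly like this. The sample markup and the variable name parseOpt are made up for illustration, and the sketch assumes "dom featureinfo html5" returns a boolean, as the tDOM docs describe:

```tcl
package require tdom

# Ask this tDOM build whether the HTML5 (gumbo-based) parser was compiled in.
if {[dom featureinfo html5]} {
    set parseOpt -html5
} else {
    # Fall back to the built-in, more lenient -html parser.
    set parseOpt -html
}

# Hypothetical sample input; the markup in the thread itself was lost.
set doc [dom parse $parseOpt "<p>hello</p><p>there</p>"]
puts "parsed with $parseOpt, document element: [[$doc documentElement] nodeName]"
$doc delete
```

This only selects the parser; as the rest of the thread explains, neither choice by itself yields more than one top-level node.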

Mine (FreeBSD) has it:

===
ted@hotrod:~ % tclsh8.6
% package require tdom
0.9.1
% dom parse -html5 "hello there"
domDoc0x80097d140
===

That's not to say it would solve your problem, but as I say
I've had better luck with it.
--
columbiaclosings.com
What's not in Columbia anymore..


From saitology9@21:1/5 to All on Tue Apr 25 15:54:07 2023

On 4/25/2023 3:25 PM, Ted Nolan <tednolan> wrote:

> It's a compile option:

Thank you very much for your help. My version is not built with this
option. At the moment, it is not worth the trouble pursuing this any
further but it is good to know the option exists.


From Rolf Ade@21:1/5 to [email protected] on Tue Apr 25 23:17:12 2023

saitology9 <[email protected]> writes:

> It seems like tdom html parsing doesn't work well with partial html
> strings that don't necessarily include the full doctype/head/body/etc.
> tags. tdom seems to return nodes only for the first tag and not the
> rest; meaning that if there are two "" tags in sequence for
> example, it processes only the first one.
>
> That is fine if this is the expected behavior but if not, what is the
> correct way to do this?

You'd better update to the current tdom 0.9.3 (which provides a solution
to your question).

While the -html5 parser (if it is built in; that requires the gumbo
HTML5 parser lib to be present at build time and the configure switch
--enable-html5) is very robust (it digests nearly any tag soup), it may
not be the right thing for this problem, because it always inserts a
single document root and adds missing elements implied by the context
(such as <head>, <tbody>, etc.).

You want to parse an HTML fragment like

"hello there"

But what DOM tree do you expect to get from that? That document or
fragment doesn't have a single root, as HTML or XML documents must. So
if you are fine with getting a DOM _forest_ instead of a DOM tree, just do:

package require tdom 0.9.3
dom parse -html -forest "hello there" doc
$doc asXML

This script returns this to me:

hello
there

tDOM's DOM methods (and the XPath engine) work fine with such a
"forest", and in a natural way. It is just that you no longer have the
pattern

set root [$doc documentElement]

with all of your data as descendants of that one root (remember,
you have a forest, not a tree).

The "other" root nodes besides the one you still get from [$doc
documentElement] are (next) siblings of that one. Or you can get all
the roots of your forest with [$doc childNodes]. I hope these hints get
you started.
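
(Editorial sketch, not part of the original post.) The two ways of reaching every root described above might look like this. It assumes tDOM 0.9.3 or later for -forest, and the sample markup is invented, since the original fragment did not survive in this archive:

```tcl
package require tdom 0.9.3

# Hypothetical fragment with two top-level elements and no single root.
set html "<p>hello</p><p>there</p>"
set doc [dom parse -html -forest $html]

# Variant 1: ask the document for all roots of the forest at once.
set roots [$doc childNodes]
foreach root $roots {
    puts "[$root nodeType]: [$root asXML -indent none]"
}

# Variant 2: start at documentElement and follow the next siblings.
# nextSibling returns an empty string after the last root.
set node [$doc documentElement]
while {$node ne ""} {
    puts [$node nodeName]
    set node [$node nextSibling]
}

$doc delete
```

Both loops visit the same top-level nodes; the first is shorter, while the second mirrors the familiar documentElement pattern.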

rolf


From saitology9@21:1/5 to Rolf Ade on Tue Apr 25 18:37:17 2023

On 4/25/2023 5:17 PM, Rolf Ade wrote:

> But what DOM tree do you expect to get from that? That document or
> fragment doesn't have a single root as HTML or XML have to. So if you
> are fine with getting a DOM _forest_ instead of a DOM tree just do:
>
> package require tdom 0.9.3
> dom parse -html -forest "hello there" doc
> $doc asXML

Dear Rolf, thank you. Yes, I wanted the "forest" option. I am aware of
the difference between a tree and a forest. I have two versions of tdom
(0.9.1 and 0.9.2) and they both return a single node for the plain parse
command. So despite my writing a recursive function to navigate the
node's children as well as its siblings, I was not getting the full data
out. In any case, this was more of a curiosity on my part and not based
on any need.

I will look to upgrade tdom soon. Thanks for the heads up.

