On 9/4/2021 3:19 PM, Helmut Giese wrote:
Hello out there,
in answer to my post from yesterday, 'ISO conversion tool for text
widgets', Dave posted code which contains an intriguing RE. After some
head scratching I finally understand it - almost. The remaining puzzle
is: What is the difference between the quantifiers '{1,1}?' and '{1}?'?
The following code demonstrates what I mean:
---
set txt "This is normal text while this is <i>italic</i>
and this is <i>too</i>."
set re1 {<{1,1}?([ib])\s*>(.*?)</\1\s*>}
set re2 {<{1}?([ib])\s*>(.*?)</\1\s*>}
set ranges [regexp -all -indices -inline $re1 $txt]
puts "Version re1"
puts $ranges
puts ""
set ranges [regexp -all -indices -inline $re2 $txt]
puts "Version re2"
puts $ranges
---
The two versions give different results. Why is this? As per the man page: Isn't
'a sequence of exactly 1 match of the atom'
the same as
'a sequence of 1 to 1 (inclusive) matches of the atom'?
Any enlightenment will be greatly appreciated.
Helmut
From the answer to my own posting ca. 2020:
Subject: Re: Why does adding \s* to my RE change non-greedy to greedy?
On 4/13/2020 5:35 PM, heinrichmartin wrote:
On Monday, April 13, 2020 at 11:17:10 PM UTC+2, Dave wrote:
Tcl 8.6.8, win7x64
Adding \s* to my RE changes .*? from non-greedy to greedy
Test script:
proc Test {re text} {
    puts "\nRe: \"$re\""
    puts "Matching against \"$text\""
    set n 0
    foreach {match sub1 sub2} [regexp -all -inline -indices $re $text] {
        lassign $match s e
        puts "Match [incr n]: [string range $text $s $e]"
        lassign $sub1 s e
        puts " $n.1: [string range $text $s $e]"
        lassign $sub2 s e
        puts " $n.2: [string range $text $s $e]"
    }
}
set string "...<i>111</i>..<i>22</i>.."
Test {<([ib])>(.*?)</\1>} $string
Test {<([ib])\s*>(.*?)</\1\s*>} $string
Output:
Re: "<([ib])>(.*?)</\1>"
Matching against "...<i>111</i>..<i>22</i>.."
Match 1: <i>111</i>
1.1: i
1.2: 111
Match 2: <i>22</i>
2.1: i
2.2: 22
Re: "<([ib])\s*>(.*?)</\1\s*>"
Matching against "...<i>111</i>..<i>22</i>.."
Match 1: <i>111</i>..<i>22</i>
1.1: i
1.2: 111</i>..<i>22
The "(.*?)" is no longer non-greedy. Why?
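First preference wins: a branch takes the preference of the first
quantified atom in it that has one, and here that is the greedy \s*. See
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm#M95 with regard to
greedy vs non-greedy preference.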
Two more remarks:
* \s* is followed by the non-whitespace ">", so you can simply make it non-greedy (\s*?).
* Using regexp to parse XML/HTML is not a good idea. Use e.g. tdom.
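The first remark can be tried directly. This is only a minimal sketch, assuming the same test string as in the script above; the variable names are illustrative. With both \s* written as \s*?, every quantifier in the RE is non-greedy, so the RE as a whole should prefer the shortest match:

```tcl
# Same test string as in the script above.
set text "...<i>111</i>..<i>22</i>.."

# Both \s* quantifiers are made non-greedy, so the first quantified
# atom that has a preference (\s*?) is non-greedy as well.
set re {<([ib])\s*?>(.*?)</\1\s*?>}

# With -all -inline, regexp returns match, sub1, sub2 for every match.
set matches [regexp -all -inline $re $text]
foreach {match tag body} $matches {
    puts "match: $match  tag: $tag  body: $body"
}
```

Because ">" can never be matched by \s, the non-greedy form should match the same whitespace as the greedy one; only the preference of the whole RE changes.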
Thank you. I had skimmed past that part because I thought that the (.*?)
was sufficient. My RE is now {<{1,1}?([ib])\s*>(.*?)</\1\s*>} and it is
working fine.
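
For comparison, here is a small sketch that runs the greedy RE and the {1,1}? variant side by side on the test string from the script above. The helper proc fullMatches is hypothetical and not part of the original script; it only collects the full-match texts.

```tcl
# Hypothetical helper: return only the full-match texts for an RE
# that has exactly two capturing groups.
proc fullMatches {re text} {
    set result {}
    foreach {match tag body} [regexp -all -inline $re $text] {
        lappend result $match
    }
    return $result
}

set text "...<i>111</i>..<i>22</i>.."

# The greedy \s* is the first quantifier with a preference:
# one long match, as in the output above.
set greedy [fullMatches {<([ib])\s*>(.*?)</\1\s*>} $text]

# A leading non-greedy {1,1}? on "<" comes first instead:
# this is the RE reported as working fine.
set fixed [fullMatches {<{1,1}?([ib])\s*>(.*?)</\1\s*>} $text]

puts "greedy: [llength $greedy] match(es): $greedy"
puts "fixed:  [llength $fixed] match(es): $fixed"
```

The {1,1}? on "<" still matches exactly one "<"; it is there only to put a non-greedy quantifier ahead of the \s*.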
--
computerjock AT mail DOT com