Forum: >>> Magnum BBS <<<

Peculiar result with regular expression

From Arjen Markus@21:1/5 to All on Thu Oct 13 04:04:55 2022

I am a bit puzzled about the following:

set a "\\abc"

regexp -inline {(^[a-z]+)|([^\\][a-z]+)} $a

abc {} abc

regexp -indices -inline {(^[a-z]+)|([^\\][a-z]+)} $a
{1 3} {-1 -1} {1 3}

The empty substring surprised me, but the indices for that string are really unexpected. I tried to figure out a way to identify words that are either at the start of a line or are NOT preceded by a backslash - a first attempt to manipulate some Latex source.

Does anybody know what is going wrong? Most likely something very obvious, but I do not see what.

Regards,

Arjen

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Arjen Markus@21:1/5 to Ralf Fassel on Thu Oct 13 04:31:37 2022

On Thursday, October 13, 2022 at 1:22:14 PM UTC+2, Ralf Fassel wrote:

* Arjen Markus
| I am a bit puzzled about the following:

| set a "\\abc"

| regexp -inline {(^[a-z]+)|([^\\][a-z]+)} $a
(^[a-z]+) lower-case chars at beginning of string, which "\abc" is not
OR
([^\\][a-z]+) anything not-backslash followed by lower-case chars, which "\abc" is: the 'a' qualifies as not-backslash, the "bc" as lower-case chars.
| I tried to figure out a way to identify words that are either at the
| start of a line or are NOT preceded by a backslash - a first attempt
| to manipulate some Latex source.

| Does anybody know what is going wrong? Most likely something very
| obvious, but I do not see what.
I think you missed some "at-beginning-of-word"-modifier for the backslash.

HTH
R'

Yes, probably, I tried to stay on the safe side (this being a first experiment) and stick to the greedy operators. But why the {-1 -1} indices? That does not make sense to me.

Regards,

Arjen


From Arjen Markus@21:1/5 to Schelte on Thu Oct 13 05:05:33 2022

On Thursday, October 13, 2022 at 1:50:49 PM UTC+2, Schelte wrote:

On 13/10/2022 13:31, Arjen Markus wrote:

But why the {-1 -1} indices? That does not make sense to me.

From the regexp man page: "if a particular subexpression in exp does
not match the string (e.g. because it was in a portion of the expression
that was not matched), then the corresponding subMatchVar will be set to
"-1 -1" if -indices has been specified"

Schelte.

Ah, thanks, that makes sense, I think.

Regards,

Arjen


From Schelte@21:1/5 to Arjen Markus on Thu Oct 13 13:50:43 2022

On 13/10/2022 13:31, Arjen Markus wrote:

But why the {-1 -1} indices? That does not make sense to me.

From the regexp man page: "if a particular subexpression in exp does
not match the string (e.g. because it was in a portion of the expression
that was not matched), then the corresponding subMatchVar will be set to
"-1 -1" if -indices has been specified"

Schelte.
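
The man page behaviour quoted above can be checked directly against the pattern from the original post. This is only a small sketch; the variable names (whole, atStart, afterOther and their Idx counterparts) are made up for illustration.

```tcl
set a "\\abc"
set re {(^[a-z]+)|([^\\][a-z]+)}

# Without -indices, the group that took no part in the match is the empty string.
regexp $re $a whole atStart afterOther
puts [list $whole $atStart $afterOther]

# With -indices, the same group is reported as "-1 -1".
regexp -indices $re $a wholeIdx atStartIdx afterOtherIdx
puts [list $wholeIdx $atStartIdx $afterOtherIdx]

# So the first index of a group tells whether its alternative was used.
if {[lindex $atStartIdx 0] == -1} {
    puts "the second alternative matched"
}
```

The two lists printed are the same as the results shown in the original post: the first alternative did not match, so its group is empty, or {-1 -1} with -indices.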


From Ralf Fassel@21:1/5 to All on Thu Oct 13 13:22:09 2022

* Arjen Markus
| I am a bit puzzled about the following:

| set a "\\abc"

| regexp -inline {(^[a-z]+)|([^\\][a-z]+)} $a

(^[a-z]+) lower-case chars at beginning of string, which "\abc" is not
OR
([^\\][a-z]+) anything not-backslash followed by lower-case chars, which
"\abc" is: the 'a' qualifies as not-backslash, the "bc" as lower-case
chars.

| I tried to figure out a way to identify words that are either at the
| start of a line or are NOT preceded by a backslash - a first attempt
| to manipulate some Latex source.

| Does anybody know what is going wrong? Most likely something very
| obvious, but I do not see what.

I think you missed some "at-beginning-of-word"-modifier for the backslash.

HTH
R'
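
One possible reading of the "at-beginning-of-word" hint, as a sketch only: Tcl's \m constraint matches only where a word starts, so placing it after [^\\] stops the bracket from consuming the first letter of a word. The test strings and variable names below are assumptions chosen for illustration.

```tcl
set re {(^[a-z]+)|([^\\]\m[a-z]+)}

# "\abc": the only word start is right after the backslash, so nothing matches.
set macroMatched [regexp $re "\\abc"]
puts $macroMatched

# "1 + abc": the space before "abc" satisfies [^\\] and \m holds at the "a".
set bare [regexp -inline $re "1 + abc"]
puts $bare
```

Note that in the second case the reported match still includes the character that precedes the word, so it has to be stripped or captured separately.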


From Luc@21:1/5 to Arjen Markus on Thu Oct 13 13:52:21 2022

On Thu, 13 Oct 2022 04:04:55 -0700 (PDT), Arjen Markus wrote:

words that are either at the start of a line
or are NOT preceded by a backslash

I find your specification confusing.

What if a word is both at the start of a line
and preceded by a backslash? Is it acceptable?
Your specification is not very clear.

Either way, this regex probably does what you want:

(^[a-z]+)|( ([^\\ ]+))

--
Luc
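
To see which tokens this pattern selects, it can be run with -all -inline on the sample formula that appears later in the thread. This is a sketch for exploration; the variable names are made up, and each match yields four list elements (whole match plus three groups).

```tcl
set line {klrear = 1.0 + a \times salinity}
set re {(^[a-z]+)|( ([^\\ ]+))}

set tokens {}
foreach {whole first spaced token} [regexp -all -inline $re $line] {
    # Either the first alternative matched (word at start of string)
    # or the innermost group of the second alternative holds the token.
    if {$first ne ""} {
        lappend tokens $first
    } else {
        lappend tokens $token
    }
}
puts $tokens
```

On this input the macro \times is skipped, but operators and numbers are picked up along with the words, so a further filter would be needed to keep only the names.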


From heinrichmartin@21:1/5 to Luc on Thu Oct 13 13:42:50 2022

On Thursday, October 13, 2022 at 6:52:26 PM UTC+2, Luc wrote:

On Thu, 13 Oct 2022 04:04:55 -0700 (PDT), Arjen Markus wrote:

words that are either at the start of a line
or are NOT preceded by a backslash

I find your specification confusing.

What if a word is both at the start of a line
and preceded by a backslash? Is it acceptable?
Your specification is not very clear.

I'd guess OR is not XOR and "either" indicates XOR ;-)
But I also guess that "either" was unintended and that the "start of a line" just came from Tcl's regexp not supporting negative look-behind (note that words at the beginning of a line with normalized EOL style are always preceded with \n and never with a backslash).

Given the spec, I would have assumed to see [regexp -line {(?:^|[^a-z\\])([a-z]+)}] (untested!).
* The first (non-reporting) group matches at the beginning of a line or anything but a backslash (but that "anything" should also _not match_ anything we need).
* If you can guarantee that the first word never starts at the beginning of the string (e.g. because of "\documentclass"), then you could drop that ^ along with -line, because [^a-z\\] also matches the line break.
* With -all -indices -inline you would use every other list entry.
* Also, I guess [a-z] was really just a start (but \w is probably too much).

HTH
Martin
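
The "every other list entry" remark can be sketched as follows, using the sample formula that appears later in the thread. The variable names are assumptions; {*} needs Tcl 8.5 or newer.

```tcl
set line {klrear = 1.0 + a \times salinity}
set re {(?:^|[^a-z\\])([a-z]+)}

set words {}
# Each match contributes two index pairs: the whole match and the word group.
foreach {whole word} [regexp -all -inline -indices -line $re $line] {
    lappend words [string range $line {*}$word]
}
puts $words
```

For this line the words collected are klrear, a and salinity; "times" is left out because every letter in it is preceded either by the backslash or by another letter.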


From Arjen Markus@21:1/5 to heinrichmartin on Thu Oct 13 23:22:16 2022

On Thursday, October 13, 2022 at 10:42:53 PM UTC+2, heinrichmartin wrote:

On Thursday, October 13, 2022 at 6:52:26 PM UTC+2, Luc wrote:

On Thu, 13 Oct 2022 04:04:55 -0700 (PDT), Arjen Markus wrote:

words that are either at the start of a line
or are NOT preceded by a backslash

I find your specification confusing.

What if a word is both at the start of a line
and preceded by a backslash? Is it acceptable?
Your specification is not very clear.

I'd guess OR is not XOR and "either" indicates XOR ;-)
But I also guess that "either" was unintended and that the "start of a line" just came from Tcl's regexp not supporting negative look-behind (note that words at the beginning of a line with normalized EOL style are always preceded with \n and never with a backslash).

Given the spec, I would have assumed to see [regexp -line {(?:^|[^a-z\\])([a-z]+)}] (untested!).
* The first (non-reporting) group matches at the beginning of a line or anything but a backslash (but that "anything" should also _not match_ anything we need).
* If you can guarantee that the first word never starts at the beginning of the string (e.g. because of "\documentclass"), then you could drop that ^ along with -line, because [^a-z\\] also matches the line break.
* With -all -indices -inline you would use every other list entry.
* Also, I guess [a-z] was really just a start (but \w is probably too much).

HTH
Martin

Thanks for the suggestions and explanations, everyone. The thing I am after is:
- I have formulas in tex files (Latex) and these can contain words like "salinity".
- A simple formula might read: klrear = 1.0 + a \times salinity
- I want to change that to: \textit{klrear} = 1.0 + a \cdot \textit{salinity}, because that looks prettier.
- I will probably need to do the transformation in stages, but I do not want "\times" to change into "\\textit{times}", hence my attempt above.

Regards,

Arjen


From Schelte@21:1/5 to Arjen Markus on Fri Oct 14 10:28:57 2022

On 14/10/2022 08:22, Arjen Markus wrote:

- A simple formula might read: klrear = 1.0 + a \times salinity
- I want to change that to: \textit{klrear} = 1.0 + a \cdot \textit{salinity}, because that looks prettier.

You didn't mention changing "\times" into "\cdot" before. But how about
this for the other requirements?

regsub -all {([^\\]|^)(\m[a-z]+\M)} $str {\1\textit{\2}}

This will also change the "a" into "\textit{a}". If you don't want
single letters to be changed, use this instead:

regsub -all {([^\\]|^)(\m[a-z]{2,}\M)} $str {\1\textit{\2}}

Schelte.
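
Applied to the sample formula, the two commands above behave as sketched here. The variable names str, allWords and longWords are only for illustration.

```tcl
set str {klrear = 1.0 + a \times salinity}

# Every bare word, including the single letter "a".
set allWords [regsub -all {([^\\]|^)(\m[a-z]+\M)} $str {\1\textit{\2}}]
puts $allWords
# \textit{klrear} = 1.0 + \textit{a} \times \textit{salinity}

# Only bare words of two or more letters.
set longWords [regsub -all {([^\\]|^)(\m[a-z]{2,}\M)} $str {\1\textit{\2}}]
puts $longWords
# \textit{klrear} = 1.0 + a \times \textit{salinity}
```

In both cases \times is untouched: the only word start inside it is preceded by the backslash, which [^\\] refuses to match.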


From heinrichmartin@21:1/5 to arjen on Fri Oct 14 01:16:14 2022

On Friday, October 14, 2022 at 8:22:18 AM UTC+2, arjen wrote:

Thanks for the suggestions and explanations, everyone. The thing I am after is:
- I have formulas in tex files (Latex) and these can contain words like "salinity".
- A simple formula might read: klrear = 1.0 + a \times salinity
- I want to change that to: \textit{klrear} = 1.0 + a \cdot \textit{salinity}, because that looks prettier.
- I will probably need to do the transformation in stages, but I do not want "\times" to change into "\\textit{times}", hence my attempt above.

Maybe I am still missing the tricky part of your needs - I guess it is implied in "A *simple* formula might read" ;-)

expect:/tmp$ set in {klrear = 1.0 + a \times salinity
klrear = 1.0 + a \times salinity
klrear = 1.0 + a \times salinity
klrear = 1.0 + a \times salinity
}

klrear = 1.0 + a \times salinity
klrear = 1.0 + a \times salinity
klrear = 1.0 + a \times salinity
klrear = 1.0 + a \times salinity

expect:/tmp$ string map {{\times } {\cdot }} [regsub -all -line {(^|[^a-z\\])([a-z]{2,})} $in {\1\textit{\2}}]
\textit{klrear} = 1.0 + a \cdot \textit{salinity}
\textit{klrear} = 1.0 + a \cdot \textit{salinity}
\textit{klrear} = 1.0 + a \cdot \textit{salinity}
\textit{klrear} = 1.0 + a \cdot \textit{salinity}

This only considers words of two or more letters.


From Arjen Markus@21:1/5 to Schelte on Fri Oct 14 02:26:09 2022

On Friday, October 14, 2022 at 10:29:03 AM UTC+2, Schelte wrote:

On 14/10/2022 08:22, Arjen Markus wrote:

- A simple formula might read: klrear = 1.0 + a \times salinity
- I want to change that to: \textit{klrear} = 1.0 + a \cdot \textit{salinity}, because that looks prettier.

You didn't mention changing "\times" into "\cdot" before. But how about
this for the other requirements?

regsub -all {([^\\]|^)(\m[a-z]+\M)} $str {\1\textit{\2}}

This will also change the "a" into "\textit{a}". If you don't want
single letters to be changed, use this instead:

regsub -all {([^\\]|^)(\m[a-z]{2,}\M)} $str {\1\textit{\2}}

Schelte.

That change is irrelevant to my earlier question, the main thing is to separate "bare" words and Latex macros. I just mention the replacement because that will be part of the final step.

Regards,

Arjen


From saitology9@21:1/5 to Arjen Markus on Fri Oct 14 11:14:59 2022

On 10/14/2022 5:26 AM, Arjen Markus wrote:

That change is irrelevant to my earlier question, the main thing is to separate "bare" words and Latex macros. I just mention the replacement because that will be part of the final step.

It sounds like you will accept a non-regexp solution. Here is one:

set line {klrear = 1.0 + a \times salinity}

proc latex_it {line} {
    foreach w [split $line " "] {
        if {$w eq ""} {
            # skipping empty spaces
            # you can include it if you want to preserve spacing
            continue

        } elseif {[string is double -strict $w]} {
            append out "$w "

        } elseif {$w in {= - + /}} {
            # add more operators
            append out "$w "

        } elseif {[string index $w 0] eq "\\"} {
            # add more "special" operators
            switch -exact -- $w {
                \\times { append out "\\cdot " }
                default { append out "$w " }
            }

        } else {
            append out "\\textit\{$w\} "
        }
    }
    return $out
}


From Robert Heller@21:1/5 to saitology9 on Fri Oct 14 15:46:14 2022

At Fri, 14 Oct 2022 11:14:59 -0400 saitology9 wrote:

On 10/14/2022 5:26 AM, Arjen Markus wrote:

That change is irrelevant to my earlier question, the main thing is to separate "bare" words and Latex macros. I just mention the replacement because that will be part of the final step.

It sounds like you will accept a non-regexp solution. Here is one:

set line {klrear = 1.0 + a \times salinity}

proc latex_it {line} {
    foreach w [split $line " "] {

        # You might want to add \t (tab character) to the split char string.

        if {$w eq ""} {
            # skipping empty spaces
            # you can include it if you want to preserve spacing
            continue

        } elseif {[string is double -strict $w]} {
            append out "$w "

        } elseif {$w in {= - + /}} {
            # add more operators
            append out "$w "

        } elseif {[string index $w 0] eq "\\"} {
            # add more "special" operators
            switch -exact -- $w {
                \\times { append out "\\cdot " }
                default { append out "$w " }
            }

        } else {
            append out "\\textit\{$w\} "
        }
    }
    return $out
}

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
[email protected] -- Webhosting Services


From saitology9@21:1/5 to Robert Heller on Fri Oct 14 12:22:20 2022

On 10/14/2022 11:46 AM, Robert Heller wrote:

proc latex_it {line} {
    foreach w [split $line " "] {

        # You might want to add \t (tab character) to the split char string.

Good catch! You could also handle it separately in an if-statement if
you want to preserve it.


From saitology9@21:1/5 to heinrichmartin on Fri Oct 14 17:34:57 2022

On 10/14/2022 5:27 PM, heinrichmartin wrote:

Also, don't call latex_it with an empty line or with a line that consists of spaces only - or set out "" initially.

Nice!
This must have been a copy-paste error on my part, as I remember the
need to initialize it as you pointed out. Hopefully the OP finds it useful.


From heinrichmartin@21:1/5 to All on Fri Oct 14 14:27:18 2022

On Friday, October 14, 2022 at 6:22:27 PM UTC+2, saitology9 wrote:

On 10/14/2022 11:46 AM, Robert Heller wrote:

proc latex_it {line} {
    foreach w [split $line " "] {

        # You might want to add \t (tab character) to the split char string.

Good catch! You could also handle it separately in an if-statement if
you want to preserve it.

Also, don't call latex_it with an empty line or with a line that consists of spaces only (out is never set in that case, so "return $out" raises an error) - or set out "" initially.
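
Putting the two remarks together (tab in the split characters, out initialised up front), a revised version of the proc might look like the sketch below. Trimming the trailing space from the result is an extra assumption, not something asked for in the thread.

```tcl
proc latex_it {line} {
    # Initialise the result so empty or blank input returns "" instead of failing.
    set out ""
    # Split on both space and tab.
    foreach w [split $line " \t"] {
        if {$w eq ""} {
            # Consecutive separators produce empty elements; skip them.
            continue
        } elseif {[string is double -strict $w]} {
            append out "$w "
        } elseif {$w in {= - + /}} {
            append out "$w "
        } elseif {[string index $w 0] eq "\\"} {
            switch -exact -- $w {
                \\times { append out "\\cdot " }
                default { append out "$w " }
            }
        } else {
            append out "\\textit\{$w\} "
        }
    }
    # Assumption: the trailing separator is not wanted in the result.
    return [string trimright $out]
}

puts [latex_it {klrear = 1.0 + a \times salinity}]
puts "<[latex_it {}]>"
```

As in the posted version, the single letter "a" is also wrapped in \textit{}; an extra branch on [string length $w] would be needed to leave it alone.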

