• Peculiar result with regular expression

    From Arjen Markus@21:1/5 to All on Thu Oct 13 04:04:55 2022
    I am a bit puzzled about the following:

    set a "\\abc"

    regexp -inline {(^[a-z]+)|([^\\][a-z]+)} $a
    abc {} abc

    regexp -indices -inline {(^[a-z]+)|([^\\][a-z]+)} $a
    {1 3} {-1 -1} {1 3}

    The empty substring surprised me, but the indices for that string are really unexpected. I tried to figure out a way to identify words that are either at the start of a line or are NOT preceded by a backslash - a first attempt to manipulate some LaTeX source.

    Does anybody know what is going wrong? Most likely something very obvious, but I do not see what.

    Regards,

    Arjen

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Arjen Markus@21:1/5 to Ralf Fassel on Thu Oct 13 04:31:37 2022
    On Thursday, October 13, 2022 at 1:22:14 PM UTC+2, Ralf Fassel wrote:
    * Arjen Markus
    | I am a bit puzzled about the following:

    | set a "\\abc"

    | regexp -inline {(^[a-z]+)|([^\\][a-z]+)} $a
    (^[a-z]+) lower-case chars at beginning of string, which "\abc" is not
    OR
    ([^\\][a-z]+) anything not-backslash followed by lower-case chars, which "\abc" is: the 'a' qualifies as not-backslash, the "bc" as lower-case chars.
    | I tried to figure out a way to identify words that are either at the
    | start of a line or are NOT preceded by a backslash - a first attempt
    | to manipulate some Latex source.

    | Does anybody know what is going wrong? Most likely something very
    | obvious, but I do not see what.
    I think you missed some "at-beginning-of-word"-modifier for the backslash.

    HTH
    R'

    Yes, probably, I tried to stay on the safe side (this being a first experiment) and stick to the greedy operators. But why the {-1 -1} indices? That does not make sense to me.

    Regards,

    Arjen

  • From Arjen Markus@21:1/5 to Schelte on Thu Oct 13 05:05:33 2022
    On Thursday, October 13, 2022 at 1:50:49 PM UTC+2, Schelte wrote:
    On 13/10/2022 13:31, Arjen Markus wrote:
    But why the {-1 -1} indices? That does not make sense to me.
    From the regexp man page: "if a particular subexpression in exp does
    not match the string (e.g. because it was in a portion of the expression
    that was not matched), then the corresponding subMatchVar will be set to
    "-1 -1" if -indices has been specified"


    Schelte.
    Ah, thanks, that makes sense, I think.

    Regards,

    Arjen

  • From Schelte@21:1/5 to Arjen Markus on Thu Oct 13 13:50:43 2022
    On 13/10/2022 13:31, Arjen Markus wrote:
    But why the {-1 -1} indices? That does not make sense to me.

    From the regexp man page: "if a particular subexpression in exp does
    not match the string (e.g. because it was in a portion of the expression
    that was not matched), then the corresponding subMatchVar will be set to
    "-1 -1" if -indices has been specified"


    Schelte.
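
    The man page behaviour quoted above can be checked directly. This is a minimal sketch using the pattern and string from the original question; the variable names and the helper proc "matched" are only illustrative and not part of the thread.

```tcl
set a "\\abc"                       ;# the four characters \ a b c
set re {(^[a-z]+)|([^\\][a-z]+)}

# Store the index pairs in named variables instead of reading the -inline list.
regexp -indices $re $a whole first second
puts "whole=$whole first=$first second=$second"
# prints: whole=1 3 first=-1 -1 second=1 3

# Hypothetical helper: did this subexpression take part in the match?
proc matched {idx} { expr {[lindex $idx 0] >= 0} }
puts [matched $first]    ;# 0, the ^ branch was not used
puts [matched $second]   ;# 1
```

    Without -indices the unused branch shows up as the empty string instead, which is the {} in the first result of the original post.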

  • From Ralf Fassel@21:1/5 to All on Thu Oct 13 13:22:09 2022
    * Arjen Markus <[email protected]>
    | I am a bit puzzled about the following:

    | set a "\\abc"

    | regexp -inline {(^[a-z]+)|([^\\][a-z]+)} $a

    (^[a-z]+) lower-case chars at beginning of string, which "\abc" is not
    OR
    ([^\\][a-z]+) anything not-backslash followed by lower-case chars, which
    "\abc" is: the 'a' qualifies as not-backslash, the "bc" as lower-case
    chars.

    | I tried to figure out a way to identify words that are either at the
    | start of a line or are NOT preceded by a backslash - a first attempt
    | to manipulate some Latex source.

    | Does anybody know what is going wrong? Most likely something very
    | obvious, but I do not see what.

    I think you missed some "at-beginning-of-word"-modifier for the backslash.

    HTH
    R'
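
    One possible reading of the "at-beginning-of-word" hint is Tcl's \m constraint. The sketch below assumes that reading; the pattern is an illustration, not necessarily the one Ralf had in mind.

```tcl
# Assumption: \m (start of word) is placed in front of the letters, so the
# "not a backslash" character can no longer be the first letter of the word.
set re {(?:^|[^\\])\m([a-z]+)}

puts [regexp $re "\\abc"]           ;# 0: "abc" directly follows a backslash
puts [regexp $re " abc" -> word]    ;# 1
puts $word                          ;# abc
```

    With \m in place, the 'a' of "\abc" cannot serve as the not-backslash character, because no word starts right after it.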

  • From Luc@21:1/5 to Arjen Markus on Thu Oct 13 13:52:21 2022
    On Thu, 13 Oct 2022 04:04:55 -0700 (PDT), Arjen Markus wrote:

    words that are either at the start of a line
    or are NOT preceded by a backslash

    I find your specification confusing.

    What if a word is both at the start of a line
    and preceded by a backslash? Is it acceptable?
    Your specification is not very clear.

    Either way, this regex probably does what you want:

    (^[a-z]+)|( ([^\\ ]+))

    --
    Luc


  • From heinrichmartin@21:1/5 to Luc on Thu Oct 13 13:42:50 2022
    On Thursday, October 13, 2022 at 6:52:26 PM UTC+2, Luc wrote:
    On Thu, 13 Oct 2022 04:04:55 -0700 (PDT), Arjen Markus wrote:

    words that are either at the start of a line
    or are NOT preceded by a backslash
    I find your specification confusing.

    What if a word is both at the start of a line
    and preceded by a backslash? Is it acceptable?
    Your specification is not very clear.

    I'd guess OR is not XOR and "either" indicates XOR ;-)
    But I also guess that "either" was unintended and that the "start of a line" just came from Tcl's regexp not supporting negative look-behind (note that words at the beginning of a line with normalized EOL style are always preceded with \n and never with a backslash).

    Given the spec, I would have assumed to see [regexp -line {(?:^|[^a-z\\])([a-z]+)}] (untested!).
    * The first (non-reporting) group matches at the begin of a line or anything but backslash (but that "anything" should also _not match_ anything we need).
    * If you can guarantee that the first word never starts at the beginning of the string (e.g. because of "\documentclass"), then you could drop that ^ along with -line, because [^a-z\\] also matches the line break.
    * With -all -indices -inline you would use every other list entry.
    * Also, I guess [a-z] was really just a start (but \w is probably too much).

    HTH
    Martin
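
    The "every other list entry" remark can be sketched as follows. The two-line input is made up for illustration, and plain -inline is used instead of -indices so that the words are readable directly.

```tcl
# Hypothetical two-line input; "\\" inside double quotes is one backslash.
set in "klrear = 1.0 + a \\times salinity\nsalinity = b \\times klrear"

set words {}
# With one reporting group, -all -inline returns pairs: whole match, group 1.
foreach {whole word} [regexp -all -inline -line {(?:^|[^a-z\\])([a-z]+)} $in] {
    lappend words $word
}
puts $words   ;# klrear a salinity salinity b klrear
```

    The macro name "times" never shows up, because every letter in it is preceded either by the backslash or by another lower-case letter.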

  • From Arjen Markus@21:1/5 to heinrichmartin on Thu Oct 13 23:22:16 2022
    On Thursday, October 13, 2022 at 10:42:53 PM UTC+2, heinrichmartin wrote:
    On Thursday, October 13, 2022 at 6:52:26 PM UTC+2, Luc wrote:
    On Thu, 13 Oct 2022 04:04:55 -0700 (PDT), Arjen Markus wrote:

    words that are either at the start of a line
    or are NOT preceded by a backslash
    I find your specification confusing.

    What if a word is both at the start of a line
    and preceded by a backslash? Is it acceptable?
    Your specification is not very clear.
    I'd guess OR is not XOR and "either" indicates XOR ;-)
    But I also guess that "either" was unintended and that the "start of a line" just came from Tcl's regexp not supporting negative look-behind (note that words at the beginning of a line with normalized EOL style are always preceded with \n and never with a backslash).

    Given the spec, I would have assumed to see [regexp -line {(?:^|[^a-z\\])([a-z]+)}] (untested!).
    * The first (non-reporting) group matches at the begin of a line or anything but backslash (but that "anything" should also _not match_ anything we need).
    * If you can guarantee that the first word never starts at the beginning of the string (e.g. because of "\documentclass"), then you could drop that ^ along with -line, because [^a-z\\] also matches the line break.
    * With -all -indices -inline you would use every other list entry.
    * Also, I guess [a-z] was really just a start (but \w is probably too much).

    HTH
    Martin
    Thanks for the suggestions and explanations, everyone. The thing I am after is:
    - I have formulas in tex files (Latex) and these can contain words like "salinity".
    - A simple formula might read: klrear = 1.0 + a \times salinity
    - I want to change that to: \textit{klrear} = 1.0 + a \cdot \textit{salinity}, because that looks prettier.
    - I will probably need to do the transformation in stages, but I do not want "\times" to change into "\textit{times}", hence my attempt above.

    Regards,

    Arjen

  • From Schelte@21:1/5 to Arjen Markus on Fri Oct 14 10:28:57 2022
    On 14/10/2022 08:22, Arjen Markus wrote:
    - A simple formula might read: klrear = 1.0 + a \times salinity
    - I want to change that to: \textit{klrear} = 1.0 + a \cdot \textit{salinity}, because that looks prettier.

    You didn't mention changing "\times" into "\cdot" before. But how about
    this for the other requirements?

    regsub -all {([^\\]|^)(\m[a-z]+\M)} $str {\1\textit{\2}}

    This will also change the "a" into "\textit{a}". If you don't want
    single letters to be changed, use this instead:

    regsub -all {([^\\]|^)(\m[a-z]{2,}\M)} $str {\1\textit{\2}}


    Schelte.
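
    Applied to the sample formula, the second command can be tried like this. It is a small sketch; the variable names are arbitrary, and the final string map step is the one used elsewhere in the thread for the \times to \cdot change.

```tcl
set str {klrear = 1.0 + a \times salinity}

# \1 puts back the character in front of the word, \2 is the word itself.
set out [regsub -all {([^\\]|^)(\m[a-z]{2,}\M)} $str {\1\textit{\2}}]
puts $out
# prints: \textit{klrear} = 1.0 + a \times \textit{salinity}

# Separate stage for the operator replacement.
puts [string map {{\times } {\cdot }} $out]
# prints: \textit{klrear} = 1.0 + a \cdot \textit{salinity}
```

    Keeping \1 in the replacement matters: the first group consumes the character before the word, so leaving it out would delete the blanks.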

  • From heinrichmartin@21:1/5 to arjen on Fri Oct 14 01:16:14 2022
    On Friday, October 14, 2022 at 8:22:18 AM UTC+2, arjen wrote:
    Thanks for the suggestions and explanations, everyone. The thing I am after is:
    - I have formulas in tex files (Latex) and these can contain words like "salinity".
    - A simple formula might read: klrear = 1.0 + a \times salinity
    - I want to change that to: \textit{klrear} = 1.0 + a \cdot \textit{salinity}, because that looks prettier.
    - I will probably need to do the transformation in stages, but I do not want "\times" to change into "\textit{times}", hence my attempt above.

    Maybe I am still missing the tricky part of your needs - I guess it is implied in "A *simple* formula might read" ;-)

    expect:/tmp$ set in {klrear = 1.0 + a \times salinity
    klrear = 1.0 + a \times salinity
    klrear = 1.0 + a \times salinity
    klrear = 1.0 + a \times salinity
    }
    klrear = 1.0 + a \times salinity
    klrear = 1.0 + a \times salinity
    klrear = 1.0 + a \times salinity
    klrear = 1.0 + a \times salinity

    expect:/tmp$ string map {{\times } {\cdot }} [regsub -all -line {(^|[^a-z\\])([a-z]{2,})} $in {\1\textit{\2}}]
    \textit{klrear} = 1.0 + a \cdot \textit{salinity}
    \textit{klrear} = 1.0 + a \cdot \textit{salinity}
    \textit{klrear} = 1.0 + a \cdot \textit{salinity}
    \textit{klrear} = 1.0 + a \cdot \textit{salinity}

    This has considered words of length two or greater.

  • From Arjen Markus@21:1/5 to Schelte on Fri Oct 14 02:26:09 2022
    On Friday, October 14, 2022 at 10:29:03 AM UTC+2, Schelte wrote:
    On 14/10/2022 08:22, Arjen Markus wrote:
    - A simple formula might read: klrear = 1.0 + a \times salinity
    - I want to change that to: \textit{klrear} = 1.0 + a \cdot \textit{salinity}, because that looks prettier.
    You didn't mention changing "\times" into "\cdot" before. But how about
    this for the other requirements?

    regsub -all {([^\\]|^)(\m[a-z]+\M)} $str {\1\textit{\2}}

    This will also change the "a" into "\textit{a}". If you don't want
    single letters to be changed, use this instead:

    regsub -all {([^\\]|^)(\m[a-z]{2,}\M)} $str {\1\textit{\2}}


    Schelte.
    That change is irrelevant to my earlier question, the main thing is to separate "bare" words and Latex macros. I just mention the replacement because that will be part of the final step.

    Regards,

    Arjen

  • From saitology9@21:1/5 to Arjen Markus on Fri Oct 14 11:14:59 2022
    On 10/14/2022 5:26 AM, Arjen Markus wrote:
    That change is irrelevant to my earlier question, the main thing is to separate "bare" words and Latex macros. I just mention the replacement because that will be part of the final step.


    It sounds like you will accept a non-regexp solution. Here is one:


    set line {klrear = 1.0 + a \times salinity}

    proc latex_it {line} {
        foreach w [split $line " "] {
            if {$w eq ""} {
                # skipping empty spaces
                # you can include it if you want to preserve spacing
                continue

            } elseif {[string is double -strict $w]} {
                append out "$w "

            } elseif {$w in {= - + /}} {
                # add more operators
                append out "$w "

            } elseif {[string index $w 0] eq "\\"} {
                # add more "special" operators
                switch -exact -- $w {
                    \\times { append out "\\cdot " }
                    default { append out "$w " }
                }

            } else {
                append out "\\textit\{$w\} "
            }
        }
        return $out
    }

  • From Robert Heller@21:1/5 to [email protected] on Fri Oct 14 15:46:14 2022
    At Fri, 14 Oct 2022 11:14:59 -0400 saitology9 <[email protected]> wrote:


    On 10/14/2022 5:26 AM, Arjen Markus wrote:
    That change is irrelevant to my earlier question, the main thing is to separate "bare" words and Latex macros. I just mention the replacement because that will be part of the final step.


    It sounds like you will accept a non-regexp solution. Here is one:


    set line {klrear = 1.0 + a \times salinity}

    proc latex_it {line} {
        foreach w [split $line " "] {

            # You might want to add \t (tab character) to the split char string.

            if {$w eq ""} {
                # skipping empty spaces
                # you can include it if you want to preserve spacing
                continue

            } elseif {[string is double -strict $w]} {
                append out "$w "

            } elseif {$w in {= - + /}} {
                # add more operators
                append out "$w "

            } elseif {[string index $w 0] eq "\\"} {
                # add more "special" operators
                switch -exact -- $w {
                    \\times { append out "\\cdot " }
                    default { append out "$w " }
                }

            } else {
                append out "\\textit\{$w\} "
            }
        }
        return $out
    }






    --
    Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
    Deepwoods Software -- Custom Software Services
    http://www.deepsoft.com/ -- Linux Administration Services
    [email protected] -- Webhosting Services

  • From saitology9@21:1/5 to Robert Heller on Fri Oct 14 12:22:20 2022
    On 10/14/2022 11:46 AM, Robert Heller wrote:

    proc latex_it {line} {
    foreach w [split $line " "] {

    # You might want to add \t (tab character) to the split char string.


    Good catch! You could also handle it separately in an if-statement if
    you want to preserve it.

  • From saitology9@21:1/5 to heinrichmartin on Fri Oct 14 17:34:57 2022
    On 10/14/2022 5:27 PM, heinrichmartin wrote:

    Also, don't call latex_it with empty line or with line that consists of space only - or set out "" initially.

    Nice!
    This must have been a copy-paste error on my part, as I remember the
    need to initialize it as you pointed out. Hopefully the OP finds it useful.

  • From heinrichmartin@21:1/5 to All on Fri Oct 14 14:27:18 2022
    On Friday, October 14, 2022 at 6:22:27 PM UTC+2, saitology9 wrote:
    On 10/14/2022 11:46 AM, Robert Heller wrote:

    proc latex_it {line} {
    foreach w [split $line " "] {

    # You might want to add \t (tab character) to the split char string.

    Good catch! You could also handle it separately in an if-statement if
    you want to preserve it.

    Also, don't call latex_it with empty line or with line that consists of space only - or set out "" initially.
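
    Putting the two remarks together (split on tabs as well, and initialise out), the proc might look like the sketch below. It is a condensed variant of the posted code, not the poster's final version; trimming the trailing blank is an extra assumption.

```tcl
proc latex_it {line} {
    set out ""                          ;# defined even for empty or blank input
    foreach w [split $line " \t"] {     ;# blanks and tabs both separate tokens
        if {$w eq ""} {
            continue                    ;# runs of separators yield empty tokens
        } elseif {[string is double -strict $w] || $w in {= - + /}} {
            append out "$w "            ;# numbers and operators pass through
        } elseif {[string index $w 0] eq "\\"} {
            switch -exact -- $w {
                \\times { append out "\\cdot " }
                default { append out "$w " }
            }
        } else {
            append out "\\textit\{$w\} "
        }
    }
    # Assumption: the trailing blank is not wanted, so it is trimmed here.
    return [string trimright $out]
}

puts [latex_it {klrear = 1.0 + a \times salinity}]
# prints: \textit{klrear} = 1.0 + \textit{a} \cdot \textit{salinity}
```

    As in the original proc, the single letter "a" is wrapped too; a length check in the last branch would leave it alone.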
