Forum: >>> Magnum BBS <<<

writing a good gsub regexp for matching between two specific characters

From Bryan@21:1/5 to All on Sat Mar 11 16:06:09 2023

I'm using gawk 5.1.0, bash 5.1.16, Ubuntu 22.04.2. I will write and provide a lot of material in case it is useful or there is conflict in the script, but I am trying not to ramble.

I prepared a test script below, which should be easy to copy/paste into a shell, e.g. bash. I am focused on the gsub regexps, which are obviously contrived to replace all these different strings which, as they vary in the output from another program, take the general form (attempting a "plain English" version):

[opening double quote][the word "path"][maybe an underscore][various digits][closing double quote]

I want to take all of that ^^^ and delete it - or equivalently replace it with nothing (ideally), to prepare input to gnuplot as "x,y" or "x y" data - two columns.

I tried using this type of command :

gsub("^[a-z]{4}$","TEST") ;

... and more, e.g. trying sub and gensub - but did not get far - I am aware of a curly brace escape that is important or not depending on the awk version, so I also tried with \{ and \}.

I put "TEST" in the present case for testing a few different cases. I wrote this script based on extensive reading of a certain popular online resource and The AWK Programming Language (1988 - maybe time for a newer edition?). This is a useful script because as I find new types of output from the upstream program (a whole other story), I might add new gsub commands to take care of it.

copy/paste example script:

echo "\
{\"path_1234567\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_123456\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_1234\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path1234\"\
:[`seq -s',' -f '%f' 1 20 `]}" | \
gawk -F, '
{
gsub("\{","") ;
gsub("\}","") ;
gsub("\]","") ;
gsub("^[a-z]{4}$","TEST") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSEVEN") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSIX") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9]\":\\\[","TESTFOURB") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9]\":\\\[","TESTFOURA") ;
for (i=1;i<=NF;i++)
{
printf("%s%s",$i,i%2?",":"\n")
}
}'

... the last printf thing is perhaps for another post, but (IIUC) matches every 2nd comma and replaces it with a newline. So that's the "x,y" data idea. I hope that is clear - I imagine the regexps in the [a-z][0-9] parts ought to be able to go all into one gsub if I knew the syntax or what to read about.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to Bryan on Sun Mar 12 03:52:27 2023

First, I cannot really decipher what you actually want to do and
where your problems are. The usual procedure is to post sample data:
at least the input data and the corresponding output data (not shell
code that creates the input data). Anyway, below you will find some
hints and suggestions...

On 12.03.2023 01:06, Bryan wrote:

I'm using gawk 5.1.0, bash 5.1.16, Ubuntu 22.04.2. I will write and
provide a lot of material in case it is useful or there is conflict
in the script, but I am trying not to ramble.

I prepared a test script below - which should be easy to copy/paste
into a shell, e.g. bash. I am focused on the gsub regexps, which are obviously contrived to replace all these different strings which - as
they vary from output from another program - take the general form (attempting a "plain English" version):

[opening double quote][the word "path"][maybe an underscore][various
digits][closing double quote]

I want to take all of that ^^^ and delete it - or equivalently
replace it with nothing (ideally), to prepare input to gnuplot as
"x,y" or "x y" data - two columns.

I tried using this type of command :

gsub("^[a-z]{4}$","TEST") ;

This works for replacing lines that contain _only_ a sequence of
four lowercase letters with "TEST". gsub() _without_ the ^ and $
anchors will substitute every occurrence of that pattern on a line.
You can provide a third argument to gsub() to operate on variables
or specific fields; in that case the anchors ^ and $ will define
the beginning and end of that variable or field respectively.
It is also advantageous to use /.../ syntax for constant patterns
instead of the string form "...".

... and more, e.g. trying sub and gensub - but did not get far - I am
aware of a curly brace escape that is important or not depending on
the awk version, so I also tried with \{ and \}.

There's no need to escape these braces.

I put "TEST" in the present case for testing a few different cases. I
wrote this script based on extensive reading of a certain popular
online resource and The Awk Programming Language (1988 - maybe
time for a newer edition?). This is a useful script because as I find
new types of output from the upstream program (a whole other story),
I might add new gsub commands to take care of it.

copy/paste example script:
echo "\
{\"path_1234567\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_123456\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_1234\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path1234\"\
:[`seq -s',' -f '%f' 1 20 `]}" | \
gawk -F, '
{
gsub("\{","") ;
gsub("\}","") ;
gsub("\]","") ;
gsub("^[a-z]{4}$","TEST") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSEVEN") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSIX") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9]\":\\\[","TESTFOURB") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9]\":\\\[","TESTFOURA") ;
for (i=1;i<=NF;i++)
{
printf("%s%s",$i,i%2?",":"\n")
}
}'

Instead of echo arguments with quotes and newline escapes, I suggest
using here-documents in the shell, with this syntax:

awk '
# ... your awk program ...
...
' <<EOT
your data line 1
your data line 2
...
EOT

and with the more contemporary $(...) a line might be

{"path_1234567":[$(seq -s',' -f '%f' 1 20)], ...

but I wouldn't call seq many times but only once and assign it to a
variable and use that repeatedly

s=$(seq -s',' -f '%f' 1 20)
awk '
...
' <<EOT
{"path_1234567":[${s}], ...
...
EOT

If you pipe in or redirect other input, just omit the code from <<EOT
onward:

data_from_some_process | awk '...'
awk '...' < data_from_some_file

(But for testing the here-documents have advantages.)

... the last printf thing is perhaps for another post, but (IIUC)
matches every 2nd comma and replaces it with a newline.

printf doesn't replace anything. After every second field it prints a
newline instead of a comma.

So that's the
"x,y" data idea. I hope that is clear - I imagine the regexps in the [a-z][0-9] parts ought to be able to go all into one gsub if I knew
the syntax or what to read about.

To match more than one regexp for the _same_ replacement you can
combine them with the | (or) operator. For an example from your
code above use, e.g., gsub(/{|}|]/, "") to remove those three
braces/brackets in one expression.

But with your samples above you can also use other regexp syntaxes,
like ? (for optional parts) and use grouping with parentheses (...)
for longer subexpressions, e.g.
[a-z][4}_?[0-9]{4}([0-9]{2})?
for an optional underscore and two optional digits.

Janis

From Kenny McCormack@21:1/5 to [email protected] on Sun Mar 12 16:49:42 2023

In article <[email protected]>,
Bryan <[email protected]> wrote:

Apologies for the `seq` synthetic data, I'll prepare it the better way
next time.

But with your samples above you can also use other regexp syntaxes,
like ? (for optional parts) and use grouping with parentheses (...)
for longer subexpressions, e.g.
[a-z][4}_?[0-9]{4}([0-9]{2})?
for an optional underscore and two optional digits.

This is exactly what I was looking for and it works (I think a typo is
in there but let's leave it for now).

I tried {1-4} to get a range, but it didn't work - is that the idea? so

[a-z]{4}_?[0-9]{4}([0-9]{1-4})?

to match any number of digits from 1 to 4?

It is: {1,4}

--
"If our country is going broke, let it be from feeding the poor and caring for the elderly. And not from pampering the rich and fighting wars for them."

--Living Blue in a Red State--

From Bryan@21:1/5 to All on Sun Mar 12 09:25:50 2023

Apologies for the `seq` synthetic data, I'll prepare it the better way next time.

But with your samples above you can also use other regexp syntaxes,
like ? (for optional parts) and use grouping with parentheses (...)
for longer subexpressions, e.g.
[a-z][4}_?[0-9]{4}([0-9]{2})?
for an optional underscore and two optional digits.

This is exactly what I was looking for and it works (I think a typo is in there but let's leave it for now).

I tried {1-4} to get a range, but it didn't work - is that the idea? so

[a-z]{4}_?[0-9]{4}([0-9]{1-4})?

to match any number of digits from 1 to 4?

From Bryan@21:1/5 to All on Sun Mar 12 13:11:09 2023

This is great. My old awk book (Aho, Kernighan, and Weinberger) has a table on p.32 saying :

"expression [c1-c2] matches any character in the range beginning with c1 and ending with c2."

... p.30 has more discussion, and I never saw anything about the comma "," to indicate a range - perhaps this is a strong indication I need to get a better book.

And, I apologize, but I must say - this discussion reached a good answer in less than 24 hours - even though discussion doesn't "scale", and I can't cast a vote on it.

IOW Thank you!

From Bryan@21:1/5 to All on Sun Mar 12 13:43:38 2023

addendum : in writing a separate question about the printf statement, I found a better way to print a newline instead of every 2nd comma from a long string of signed floating points, so I simply share the method here :

digits=$(seq -s',' -f '%f' -10 10)
gawk -F, '
{
for (i=1;i<=NF;i++)
{
printf("%3.6f%s",$i,i%2?",":"\n")
}
}' <<EOT
${digits}
EOT

From Janis Papanagnou@21:1/5 to Bryan on Sun Mar 12 22:42:10 2023

On 12.03.2023 21:11, Bryan wrote:

This is great. My old awk book (Aho, Kernighan, and Weinberger) has a
table on p.32 saying :

"expression [c1-c2] matches any character in the range beginning
with c1 and ending with c2."

You are referring to something different here. Slightly simplified:
[a-z] is a regexp matching any single lowercase letter
[0-9] any single digit
[0-9a-fA-F] any hexadecimal digit

The multiplicity syntax {N}, {N,}, {,M}, {N,M} is not supported by the
classic awk ("nawk") described in the book by Aho et al. More recent and
commonly used Awks like GNU awk support it, though. That's why there's
no mention in that book.

... p.30 has more discussion, and I never saw anything about the
comma "," to indicate a range - perhaps this is a strong indication I
need to get a better book.

The old book is excellently written and contains everything that makes
up the power of the awk language. (Don't ignore it or throw it away!)

But I suggest, especially if you use GNU awk, which supports yet more
features, getting a copy of Arnold Robbins' "Effective Awk Programming",
which is based on GNU Awk. (It's also available online in a searchable
digital form.)

Janis

From Janis Papanagnou@21:1/5 to Janis Papanagnou on Mon Mar 13 22:03:26 2023

On 12.03.2023 22:42, Janis Papanagnou wrote:

On 12.03.2023 21:11, Bryan wrote:

This is great. My old awk book (Aho, Kernighan, and Weinberger) [...]

The multiplicity syntax {N}, {N,}, {,M}, {N,M} is not supported by the
classic awk ("nawk") described in the book by Aho et al. More recent and
commonly used Awks like GNU awk support it, though. That's why there's
no mention in that book.

While that is true for classic awk ("nawk"), Arnold Robbins informed me
that more recent versions of "nawk" also support this syntax, and have
done so for years. (Just in case my post was misinterpreted.)

To my knowledge, though, there are no newer/updated releases of the book
you mentioned; it is based on the old version of (n)awk, and thus it
does not describe that (newer) feature. (Which was my point.)

Janis

From Bryan@21:1/5 to All on Tue Mar 14 06:55:57 2023

I noticed in the "Computerphile" video with Brian Kernighan - shared on this user group - that a new version of The Awk Book might be in the works as of August 2022.

Meanwhile, the overnight delivery is in-hand now, and, from page 45:

"[begin quote]
{n}
{n,}
{n,m}
One or two numbers inside braces denote an *interval expression*. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If [p. 46] there is one number followed by a comma, then the preceding regexp is repeated at least n times:[end quote]"

... examples shown are :
wh{3}y matches 'whhhy', but not 'why' or 'whhhhy'.
wh{3,5}y matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
wh{2,}y matches 'whhy', 'whhhy', and so on.

There is more.

Lastly, from the back cover :

"You have the freedom to copy and modify this GNU manual."

Glad to support the FSF in this way!

From Janis Papanagnou@21:1/5 to Bryan on Wed Mar 15 00:14:30 2023

On 14.03.2023 14:55, Bryan wrote:

I noticed in the "Computerphile" video with Brian Kernighan - shared
on this user group - that a new version of The Awk Book might be in
the works as of August 2022.

I cannot find a new version of the original Awk book with Google
(or other commercial providers). Could you provide a link, please?

Or are you speaking about Arnold Robbins' book? (Especially since
below you mention GNU and the FSF.)

I'm certainly confused by your mention of Brian Kernighan, one of
the authors of the original book.

Meanwhile, the overnight delivery is in-hand now, [...] There is
more.

Lastly, from the back cover :
"You have the freedom to copy and modify this GNU manual."

Glad to support the FSF in this way!

Janis

From Ben Bacarisse@21:1/5 to Janis Papanagnou on Tue Mar 14 23:46:24 2023

Janis Papanagnou <[email protected]> writes:

On 14.03.2023 14:55, Bryan wrote:

I noticed in the "Computerphile" video with Brian Kernighan - shared
on this user group - that a new version of The Awk Book might be in
the works as of August 2022.

I cannot find a new version of the original Awk book with Google
(or other commercial providers). Could you provide a link, please?

Or are you speaking about Arnold Robbins' book? (Especially since
below you mention GNU and the FSF.)

I'm certainly confused by your mention of Brian Kernighan, one of
the authors of the original book.

The phrase "might be in the works" means only that there is a possibility
that a new edition might be in preparation. Is that what's confusing?

Bryan is clearly talking about a new version of the original book, but
he is referring to the most vague suggestion that there might, soon, be
a new edition. As far as I can tell there isn't one, but there could be
one "in the works" (i.e. in preparation).

--
Ben.

From Keith Thompson@21:1/5 to Bryan on Tue Mar 14 16:49:00 2023

Bryan <[email protected]> writes:

I noticed in the "Computerphile" video with Brian Kernighan - shared
on this user group - that a new version of The Awk Book might be in
the works as of August 2022.

Meanwhile, the overnight delivery is in-hand now, and, from page 45:

"[begin quote]
{n}
{n,}
{n,m}
One or two numbers inside braces denote an *interval expression*. If
there is one number in the braces, the preceding regexp is repeated n
times. If there are two numbers separated by a comma, the preceding
regexp is repeated n to m times. If [p. 46] there is one number
followed by a comma, then the preceding regexp is repeated at least n
times:[end quote]"

... examples shown are :
wh{3}y matches 'whhhy', but not 'why' or 'whhhhy'.
wh{3,5}y matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
wh{2,}y matches 'whhy', 'whhhy', and so on.

There is more.

Lastly, from the back cover :

"You have the freedom to copy and modify this GNU manual."

Glad to support the FSF in this way!

That's the GNU Awk manual. I don't have a printed version, but it
appears to have the same content as the online manual available by
typing "info gawk" (if you have the right things installed)
or at <https://www.gnu.org/software/gawk/manual/gawk.html>.

"The Awk Book" presumably refers to the original "The AWK Programming
Language" by Aho, Kernighan, and Weinberger, published in 1988.

--
Keith Thompson (The_Other_Keith) [email protected]
Working, but not speaking, for XCOM Labs
void Void(void) { Void(); } /* The recursive call of the void */

From Janis Papanagnou@21:1/5 to Ben Bacarisse on Wed Mar 15 01:22:23 2023

On 15.03.2023 00:46, Ben Bacarisse wrote:

Janis Papanagnou <[email protected]> writes:

On 14.03.2023 14:55, Bryan wrote:

I noticed in the "Computerphile" video with Brian Kernighan - shared
on this user group - that a new version of The Awk Book might be in
the works as of August 2022.

I cannot find a new version of the original Awk book with Google
(or other commercial providers). Could you provide a link, please?

Or are you speaking about Arnold Robbins' book? (Especially since
below you mention GNU and the FSF.)

I'm certainly confused by your mention of Brian Kernighan, one of
the authors of the original book.

The phrase "might be in the works" means only that there is a possibility
that a new edition might be in preparation. Is that what's confusing?

Various things confused me (but not the "in the works" per se):
- "might be in the works" vs. "the overnight delivery is in-hand now"
- "GNU" and "FSF" vs. "The [original][commercial] Awk Book"
- and the date "August 2022" I couldn't assign to both books mentioned

Bryan is clearly talking about a new version of the original book, but
he is referring to the most vague suggestion that there might, soon, be
a new edition. As far as I can tell there isn't one, but there could be
one "in the works" (i.e. in preparation).

I am certainly interested in any new version. I read his post as if he
had already got it. But I didn't find anything online.

Janis

From Bryan@21:1/5 to All on Wed Mar 15 08:31:02 2023

I apologize for the confusion!

I will make a note on the Brian Kernighan video thread - the video I listened to/watched when stuck (not a bad idea, IMHO).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ed Morton@21:1/5 to Bryan on Wed Mar 15 12:12:09 2023

On 3/15/2023 10:31 AM, Bryan wrote:

I apologize for the confusion!

I will make a note on the Brian Kernighan video thread - the video I listened to/watched when stuck (not a bad idea, IMHO).

You're posting on usenet, not a forum, so please make sure every post
has enough context included to make sense stand-alone. Right now you're truncating/removing all context on all of your posts.

Thanks.

From Kpop 2GM@21:1/5 to All on Mon Jul 31 21:11:18 2023

"The Awk Book" presumably refers to the original "The AWK Programming Language" by Aho, Kernighan, and Weinberger, published in 1988.

I've seen the entirety of the original 1988 book scanned and viewable in PDF format online

( I'll refrain from linking it here since I'm uncertain about copyrights of the PDFs, but shouldn't be too hard to locate via google search or somewhere on github )

That said, even the original authors didn't do a particularly good job of selling awk's real strengths. If I had begun my awk journey with that book, I would've jumped ship to perl long, long ago.

Thank goodness I didn't step into that sarlacc pit that is perl5, or worse, raku.

The 4Chan Teller

#####################

From Keith Thompson@21:1/5 to [email protected] on Tue Aug 1 14:14:29 2023

Kpop 2GM <[email protected]> writes:

"The Awk Book" presumably refers to the original "The AWK Programming
Language" by Aho, Kernighan, and Weinberger, published in 1988.

I've seen the entirety of the original 1988 book scanned and viewable in PDF format online

( I'll refrain from linking it here since I'm uncertain about
copyrights of the PDFs, but shouldn't be too hard to locate via google
search or somewhere on github )

I'm far more certain. The 1988 book is still under copyright, and any
PDF copy that's not explicitly authorized by the publisher is in
violation of that copyright.

(The 1988 AWK book doesn't appear to be available in electronic form.
Amazon has it in paperback for $114.71. The second edition is supposed
to be available 2023-09-22, at a much more reasonable price.)

[...]

--
Keith Thompson (The_Other_Keith) [email protected]
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
