• Vanilla regex

    From Tuxedo@21:1/5 to All on Sun Jul 2 15:24:40 2023
    Can anyone assist with a regex using fairly standard and cross compatible methods?

    It's for files containing wiki markup segments as follows:

    [[File:Some File Name 0123.jpg|800px]]

    Or maybe:

    [[File:Some other file.jpg|250px]]

    Or maybe:

    [[File:Another file.jpg |600px|thumb]]

    etc.

    The only certainty for identifying the relevant parts is that they start with "[[File:", followed by characters and/or numbers making up a file name (no UTF-8) and ending in some suffix, such as .jpg, .JPEG, .Jpeg, .PNG, .gif etc., followed by
    a "|" pipe or closing "]]" brackets.

    The regex needs to grab the filename portion, eg. "Another file.jpg", keep
    it in a variable and replace any spaces with underscore(s) so the new
    variable becomes "Another_file.jpg"

    Thereafter, within the existing markup, for example:

    [[File:Another file.jpg |600px|thumb]]

    Add the following markup after the first pipe:

    link=https://example.com/display.pl?Another_file.jpg|

    So the final markup becomes:
    [[File:Another file.jpg | link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]

    The spaces in the original "File: ..." name parts can remain as it's valid
    but the underscores need to exist in link=... strings.

    There may be instances where "|link=" already exists between the opening "[[File:" and its closing "]]" brackets. The regex
    should avoid operating on any such instances so the procedure can be run repeatedly without conflicting with past replacements.

    Many thanks for any example code snippets and ideas.

    Tuxedo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Tuxedo on Sun Jul 2 21:42:51 2023
    Tuxedo <[email protected]> writes:

    Can anyone assist with a regex using fairly standard and cross compatible methods?

    What you want can't be done with a regex. You need a tool that uses
    regexes to drive substitutions like sed, AWK, Perl, Python, PHP, ruby...
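
    A minimal sketch of what "a tool that uses regexes to drive substitutions" can look like, in Python (picked only for illustration; the thread does not settle on a tool). The regex only finds and captures the filename; ordinary code in a callback does the space-to-underscore step. The names FILE_TAG, add_link and add_links are hypothetical, the URL is the one from the question, and the sketch assumes a "|" always follows the filename. It does not yet skip tags that already carry a link.

```python
import re

# Assumed shape: "[[File:" + filename (no "|" or "]") + first "|".
FILE_TAG = re.compile(r"\[\[File:([^|\]]+)\|")

# Base URL taken from the example in the question.
BASE_URL = "https://example.com/display.pl?"


def add_link(match):
    # The regex captures the name; plain code derives the underscore form.
    name = match.group(1).strip()
    link_name = name.replace(" ", "_")
    # Keep the matched text as-is and append the link right after the pipe.
    return f"{match.group(0)}link={BASE_URL}{link_name}|"


def add_links(text):
    # re.sub calls add_link once for every match found in text.
    return FILE_TAG.sub(add_link, text)


print(add_links("[[File:Another file.jpg |600px|thumb]]"))
```

    The original "File:" part keeps its spaces; only the copy used in the link gets underscores.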

    It's for files containing wiki markup segments as follows:

    [[File:Some File Name 0123.jpg|800px]]

    Or maybe:

    [[File:Some other file.jpg|250px]]

    Or maybe:

    [[File:Another file.jpg |600px|thumb]]

    etc.

    The only certainty for identifying the relevant parts is that they start with "[[File:", followed by characters and/or numbers making up a file name (no UTF-8) and ending in some suffix, such as .jpg, .JPEG, .Jpeg, .PNG, .gif etc., followed by a "|" pipe or closing "]]" brackets.

    Is that really the only certainty? If so, it's a hard problem. Can the
    file name contain | or ]] or newlines? I suspect not as "characters
    and/or numbers" is an odd thing to say. I think you mean [a-zA-Z0-9 ].

    The regex needs to grab the filename portion, eg. "Another file.jpg", keep
    it in a variable and replace any spaces with underscore(s) so the new variable becomes "Another_file.jpg"

    Regexes can't do that, but lots of tools that use them can. Do you care
    what tool is used?

    Thereafter, within the existing markup, for example:

    [[File:Another file.jpg |600px|thumb]]

    Add the following markup after the first pipe:

    link=https://example.com/display.pl?Another_file.jpg|

    So the final markup becomes:
    [[File:Another file.jpg | link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]

    The spaces in the original "File: ..." name parts can remain as it's valid but the underscores need to exist in link=... strings.

    There may be instances where "|link=" already exists between the opening "[[File:" and its closing "]]" brackets. The regex
    should avoid operating on any such instances so the procedure can be run repeatedly without conflicting with past replacements.

    FYI: you want the program to be "idempotent".
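
    One way to get that property, sketched in Python under stated assumptions: match the whole "[[File: ... ]]" tag, and return it unchanged when it already contains a link= part. The names FILE_TAG, HAS_LINK, add_link and add_links are hypothetical. The sketch assumes no "]" appears between the filename and the closing "]]", and it handles a tag with no pipe at all by appending "|link=..." before the "]]", which the question does not specify.

```python
import re

# Whole tag: filename (no "|" or "]"), then an optional "|..." tail, then "]]".
FILE_TAG = re.compile(r"\[\[File:([^|\]]+)(\|[^\]]*)?\]\]")

# An existing link parameter, allowing whitespace after the pipe.
HAS_LINK = re.compile(r"\|\s*link=")

# Base URL taken from the example in the question.
BASE_URL = "https://example.com/display.pl?"


def add_link(match):
    name = match.group(1)
    rest = match.group(2) or ""  # "" when the tag has no pipe at all
    if HAS_LINK.search(rest):
        return match.group(0)  # already processed: leave untouched
    link = "link=" + BASE_URL + name.strip().replace(" ", "_")
    # rest is either empty or starts with "|", so the link lands
    # directly after the first pipe.
    return f"[[File:{name}|{link}{rest}]]"


def add_links(text):
    return FILE_TAG.sub(add_link, text)


source = "[[File:Another file.jpg |600px|thumb]]"
once = add_links(source)
twice = add_links(once)
print(once == twice)  # a second run should change nothing
```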

    Many thanks for any example code snippets and ideas.

    It's not hard, but then it's not very much fun either, so you may have
    to pay someone or learn how to do it yourself.

    --
    Ben.

  • From Tuxedo@21:1/5 to Ben Bacarisse on Mon Jul 3 11:40:06 2023
    Ben Bacarisse wrote:

    Tuxedo <[email protected]> writes:

    Can anyone assist with a regex using fairly standard and cross compatible
    methods?

    What you want can't be done with a regex. You need a tool that uses
    regexes to drive substitutions like sed, AWK, Perl, Python, PHP, ruby...

    It's for files containing wiki markup segments as follows:

    [[File:Some File Name 0123.jpg|800px]]

    Or maybe:

    [[File:Some other file.jpg|250px]]

    Or maybe:

    [[File:Another file.jpg |600px|thumb]]

    etc.

    The only certainty for identifying the relevant parts is that they start
    with "[[File:", followed by characters and/or numbers making up a file name
    (no UTF-8) and ending in some suffix, such as .jpg, .JPEG, .Jpeg, .PNG,
    .gif etc., followed by a "|" pipe or closing "]]" brackets.

    Is that really the only certainty? If so, it's a hard problem. Can the
    file name contain | or ]] or newlines? I suspect not as "characters
    and/or numbers" is an odd thing to say. I think you mean [a-zA-Z0-9 ].

    The filename itself never contains | or ]] in this case. The odd newline
    could be part of the complete string, although that's unlikely, and never within the filename part.


    The regex needs to grab the filename portion, eg. "Another file.jpg",
    keep it in a variable and replace any spaces with underscore(s) so the
    new variable becomes "Another_file.jpg"

    Regexes can't do that, but lots of tools that use them can. Do you care
    what tool is used?

    Yes, I care which tool is used in the sense that it works.


    Thereafter, within the existing markup, for example:

    [[File:Another file.jpg |600px|thumb]]

    Add the following markup after the first pipe:

    link=https://example.com/display.pl?Another_file.jpg|

    So the final markup becomes:
    [[File:Another file.jpg |
    link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]

    The spaces in the original "File: ..." name parts can remain as it's
    valid but the underscores need to exist in link=... strings.

    There may be instances where "|link=" already exists between the opening
    "[[File:" and its closing "]]" brackets. The regex should avoid operating
    on any such instances so the procedure can be run repeatedly without
    conflicting with past replacements.

    FYI: you want the program to be "idempotent".

    Thank you for that word :-)


    Many thanks for any example code snippets and ideas.

    It's not hard, but then it's not very much fun either, so you may have
    to pay someone or learn how to do it yourself.


    And for the advice.

    Tuxedo

  • From Ben Bacarisse@21:1/5 to Tuxedo on Mon Jul 3 13:56:04 2023
    Tuxedo <[email protected]> writes:

    Ben Bacarisse wrote:

    Tuxedo <[email protected]> writes:

    Can anyone assist with a regex using fairly standard and cross compatible methods?

    What you want can't be done with a regex. You need a tool that uses
    regexes to drive substitutions like sed, AWK, Perl, Python, PHP, ruby...

    It's for files containing wiki markup segments as follows:

    [[File:Some File Name 0123.jpg|800px]]

    Or maybe:

    [[File:Some other file.jpg|250px]]

    Or maybe:

    [[File:Another file.jpg |600px|thumb]]

    etc.

    The only certainty for identifying the relevant parts is that they start
    with "[[File:", followed by characters and/or numbers making up a file name
    (no UTF-8) and ending in some suffix, such as .jpg, .JPEG, .Jpeg, .PNG, .gif etc., followed by a "|" pipe or closing "]]" brackets.

    Is that really the only certainty? If so, it's a hard problem. Can the
    file name contain | or ]] or newlines? I suspect not as "characters
    and/or numbers" is an odd thing to say. I think you mean [a-zA-Z0-9 ].

    The filename itself never contains | or ]] in this case. The odd newline could be part of the complete string, although that's unlikely, and never within the filename part.

    That's significant as some tools (AWK and sed for example) are oriented
    towards processing lines, though AWK really processes records and it has
    ways to re-define what a record is so as to help in situations like
    this. Even so, using AWK for multi-line data like this can get fiddly.
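
    To illustrate why the line orientation matters, here is a small Python sketch (again only one possible tool) comparing a search over the whole text with a search done one line at a time. TAG, find_tags and the sample text are hypothetical; the pattern assumes no "]" occurs inside a tag before its closing "]]". A negated character class such as [^\]] also matches newlines, so a tag that wraps onto a second line is found only when the text is searched as a whole.

```python
import re

# Whole tag: filename (no "|" or "]"), optional "|..." tail, closing "]]".
TAG = re.compile(r"\[\[File:[^|\]]+(?:\|[^\]]*)?\]\]")


def find_tags(text):
    return TAG.findall(text)


# Made-up sample where the tag is broken across two lines.
sample = "intro\n[[File:Another file.jpg |600px|\nthumb]]\noutro\n"

# Searching the complete text: the wrapped tag matches.
whole = find_tags(sample)

# Searching line by line: neither half of the tag matches on its own.
per_line = [tag for line in sample.splitlines() for tag in find_tags(line)]

print(len(whole), len(per_line))
```

    Reading each file into a single string before substituting sidesteps the problem, at the cost of holding the whole file in memory.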

    --
    Ben.
