• Minimal string rotation

    From =?UTF-8?Q?Bj=C3=B6rn_Johansson?=@21:1/5 to All on Tue Mar 8 02:39:04 2022
    Hi all, I am new to TCL and I wonder if anyone anywhere
    implemented a lyndon word algorithm in TCL?

    https://en.wikipedia.org/wiki/Lyndon_word

    I need this to find the minimal string rotation for a circular string.
    There are plenty of implementations for python like this:

    https://gist.github.com/dvberkel/1950267

    I know it does not seem complicated, but I am not proficient in TCL and I seek to motivate a TCL developer to add a feature to a piece of software I depend on.

    Grateful for answers!
    /bjorn

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andreas Leitgeb@21:1/5 to [email protected] on Tue Mar 8 15:56:01 2022
    Björn Johansson <[email protected]> wrote:
    Hi all, I am new to TCL and I wonder if anyone anywhere
    implemented a lyndon word algorithm in TCL? https://en.wikipedia.org/wiki/Lyndon_word
    I need this to find the minimal string rotation for a circular string.
    There are plenty of implementations for python like this: https://gist.github.com/dvberkel/1950267
    I know it does not seem complicated, but I am not proficient in TCL
    and I seek to motivate a TCL developer to add a feature to a piece
    of software I depend on.

    Some words may of course not have a Lyndon word, if they happen to
    be a juxtaposition of multiple copies of a shorter string.

    Can we assume a usual alphabet? (e.g. just latin chars, iso-latin-* or unicode?) Or does it need to support any "s-sized" alphabet?

    Will the strings be very long? (That would impose the need of some optimisations)

    For very small alphabets (compared to length of string), other
    optimisations spring to mind.

    I'll start with a quite unoptimized version:

    proc findLyndon {s} {
    set bsf $s; set csf 1; # best-so-far, count-so-far
    set len [string length $s]; # length of string
    append s $s; # duplicate string - makes rotations easier
    for {set i 1} {$i < $len} {incr i} {
    # each rotation is a substring of the duplicated string:
    set c [string range $s $i [expr {$i+$len-1}]]
    set cmp [string compare $c $bsf]
    if {$cmp < 0} { set bsf $c; set csf 1 } elseif {$cmp == 0} { incr csf }
    }
    # if result is unique, return best-so-far, else the empty string:
    if {$csf == 1} { return $bsf } else { return "" }
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From =?UTF-8?Q?Bj=C3=B6rn_Johansson?=@21:1/5 to Andreas Leitgeb on Wed Mar 9 09:09:17 2022
    On Tuesday, March 8, 2022 at 3:56:06 PM UTC, Andreas Leitgeb wrote:
    Björn Johansson <[email protected]> wrote:
    Hi all, I am new to TCL and I wonder if anyone anywhere
    implemented a lyndon word algorithm in TCL? https://en.wikipedia.org/wiki/Lyndon_word
    I need this to find the minimal string rotation for a circular string. There are plenty of implementations for python like this: https://gist.github.com/dvberkel/1950267
    I know it does not seem complicated, but I am not proficient in TCL
    and I seek to motivate a TCL developer to add a feature to a piece
    of software I depend on.
    Some words may of course not have a Lyndon word, if they happen to
    be a juxtaposition of multiple copies of a shorter string.

    Can we assume a usual alphabet? (e.g. just latin chars, iso-latin-* or unicode?) Or does it need to support any "s-sized" alphabet?

    Will the strings be very long? (That would impose the need of some optimisations)

    For very small alphabets (compared to length of string), other
    optimisations spring to mind.

    I'll start with a quite unoptimized version:

    proc findLyndon {s} {
    set bsf $s; set csf 1; # best-so-far, count-so-far
    set len [string length $s]; # length of string
    append s $s; # duplicate string - makes rotations easier
    for {set i 1} {$i < $len} {incr i} {
    # each rotation is a substring of the duplicated string:
    set c [string range $s $i [expr {$i+$len-1}]]
    set cmp [string compare $c $bsf]
    if {$cmp < 0} { set bsf $c; set csf 1 } elseif {$cmp == 0} { incr csf }
    }
    # if result is unique, return best-so-far, else the empty string:
    if {$csf == 1} { return $bsf } else { return "" }
    }

    Thanks for your reply and for taking your time with the code.

    Strings will be smaller that 30000 characters, they are DNA so the alphabet is small.
    I need only ASCII as this covers the IUPAC extended DNA code (16 symbols):

    Symbol Mnemonic Translation
    A A (adenine)
    C C (cytosine)
    G G (guanine)
    T T (thymine)
    U U (uracil)
    R puRine A or G (purines)
    Y pYrimidine C or T/U (pyrimidines)
    M aMino group A or C
    K Keto group G or T/U
    S Strong interaction C or G
    W Weak interaction A or T/U
    H not G A, C or T/U
    B not A C, G or T/U
    V not T/U A, C or G
    D not C A, G or T/U
    N aNy A, C, G or T/U

    Ill try it out and get back here with the result.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Harald Oehlmann@21:1/5 to All on Wed Mar 9 18:19:58 2022
    Am 09.03.2022 um 18:09 schrieb Björn Johansson:
    On Tuesday, March 8, 2022 at 3:56:06 PM UTC, Andreas Leitgeb wrote:
    Björn Johansson <[email protected]> wrote:
    Hi all, I am new to TCL and I wonder if anyone anywhere
    implemented a lyndon word algorithm in TCL?
    https://en.wikipedia.org/wiki/Lyndon_word
    I need this to find the minimal string rotation for a circular string.
    There are plenty of implementations for python like this:
    https://gist.github.com/dvberkel/1950267
    I know it does not seem complicated, but I am not proficient in TCL
    and I seek to motivate a TCL developer to add a feature to a piece
    of software I depend on.
    Some words may of course not have a Lyndon word, if they happen to
    be a juxtaposition of multiple copies of a shorter string.

    Can we assume a usual alphabet? (e.g. just latin chars, iso-latin-* or
    unicode?) Or does it need to support any "s-sized" alphabet?

    Will the strings be very long? (That would impose the need of some
    optimisations)

    For very small alphabets (compared to length of string), other
    optimisations spring to mind.

    I'll start with a quite unoptimized version:

    proc findLyndon {s} {
    set bsf $s; set csf 1; # best-so-far, count-so-far
    set len [string length $s]; # length of string
    append s $s; # duplicate string - makes rotations easier
    for {set i 1} {$i < $len} {incr i} {
    # each rotation is a substring of the duplicated string:
    set c [string range $s $i [expr {$i+$len-1}]]
    set cmp [string compare $c $bsf]
    if {$cmp < 0} { set bsf $c; set csf 1 } elseif {$cmp == 0} { incr csf }
    }
    # if result is unique, return best-so-far, else the empty string:
    if {$csf == 1} { return $bsf } else { return "" }
    }

    Thanks for your reply and for taking your time with the code.

    Strings will be smaller that 30000 characters, they are DNA so the alphabet is small.
    I need only ASCII as this covers the IUPAC extended DNA code (16 symbols):

    Symbol Mnemonic Translation
    A A (adenine)
    C C (cytosine)
    G G (guanine)
    T T (thymine)
    U U (uracil)
    R puRine A or G (purines)
    Y pYrimidine C or T/U (pyrimidines)
    M aMino group A or C
    K Keto group G or T/U
    S Strong interaction C or G
    W Weak interaction A or T/U
    H not G A, C or T/U
    B not A C, G or T/U
    V not T/U A, C or G
    D not C A, G or T/U
    N aNy A, C, G or T/U

    Ill try it out and get back here with the result.


    Björn,

    there is a considerable DNA TCL community, located in Strasbourg and Heidelberg. You may ask them for DNA optimized code also at this list.

    Take care,
    Harald

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From =?UTF-8?Q?Bj=C3=B6rn_Johansson?=@21:1/5 to Harald Oehlmann on Wed Mar 9 09:21:52 2022
    On Wednesday, March 9, 2022 at 5:20:01 PM UTC, Harald Oehlmann wrote:
    Am 09.03.2022 um 18:09 schrieb Björn Johansson:
    On Tuesday, March 8, 2022 at 3:56:06 PM UTC, Andreas Leitgeb wrote:
    Björn Johansson <[email protected]> wrote:
    Hi all, I am new to TCL and I wonder if anyone anywhere
    implemented a lyndon word algorithm in TCL?
    https://en.wikipedia.org/wiki/Lyndon_word
    I need this to find the minimal string rotation for a circular string. >>> There are plenty of implementations for python like this:
    https://gist.github.com/dvberkel/1950267
    I know it does not seem complicated, but I am not proficient in TCL
    and I seek to motivate a TCL developer to add a feature to a piece
    of software I depend on.
    Some words may of course not have a Lyndon word, if they happen to
    be a juxtaposition of multiple copies of a shorter string.

    Can we assume a usual alphabet? (e.g. just latin chars, iso-latin-* or
    unicode?) Or does it need to support any "s-sized" alphabet?

    Will the strings be very long? (That would impose the need of some
    optimisations)

    For very small alphabets (compared to length of string), other
    optimisations spring to mind.

    I'll start with a quite unoptimized version:

    proc findLyndon {s} {
    set bsf $s; set csf 1; # best-so-far, count-so-far
    set len [string length $s]; # length of string
    append s $s; # duplicate string - makes rotations easier
    for {set i 1} {$i < $len} {incr i} {
    # each rotation is a substring of the duplicated string:
    set c [string range $s $i [expr {$i+$len-1}]]
    set cmp [string compare $c $bsf]
    if {$cmp < 0} { set bsf $c; set csf 1 } elseif {$cmp == 0} { incr csf } >> }
    # if result is unique, return best-so-far, else the empty string:
    if {$csf == 1} { return $bsf } else { return "" }
    }

    Thanks for your reply and for taking your time with the code.

    Strings will be smaller that 30000 characters, they are DNA so the alphabet is small.
    I need only ASCII as this covers the IUPAC extended DNA code (16 symbols):

    Symbol Mnemonic Translation
    A A (adenine)
    C C (cytosine)
    G G (guanine)
    T T (thymine)
    U U (uracil)
    R puRine A or G (purines)
    Y pYrimidine C or T/U (pyrimidines)
    M aMino group A or C
    K Keto group G or T/U
    S Strong interaction C or G
    W Weak interaction A or T/U
    H not G A, C or T/U
    B not A C, G or T/U
    V not T/U A, C or G
    D not C A, G or T/U
    N aNy A, C, G or T/U

    Ill try it out and get back here with the result.

    Björn,

    there is a considerable DNA TCL community, located in Strasbourg and Heidelberg. You may ask them for DNA optimized code also at this list.

    Take care,
    Harald

    Thanks! Very useful!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ian Braithwaite@21:1/5 to All on Thu Mar 10 13:25:01 2022
    On 3/9/22 18:09, Björn Johansson wrote:
    On Tuesday, March 8, 2022 at 3:56:06 PM UTC, Andreas Leitgeb wrote:
    Björn Johansson <[email protected]> wrote:
    Hi all, I am new to TCL and I wonder if anyone anywhere
    implemented a lyndon word algorithm in TCL?
    https://en.wikipedia.org/wiki/Lyndon_word
    I need this to find the minimal string rotation for a circular string.
    There are plenty of implementations for python like this:
    https://gist.github.com/dvberkel/1950267
    I know it does not seem complicated, but I am not proficient in TCL
    and I seek to motivate a TCL developer to add a feature to a piece
    of software I depend on.
    Some words may of course not have a Lyndon word, if they happen to
    be a juxtaposition of multiple copies of a shorter string.

    Can we assume a usual alphabet? (e.g. just latin chars, iso-latin-* or
    unicode?) Or does it need to support any "s-sized" alphabet?

    Will the strings be very long? (That would impose the need of some
    optimisations)

    For very small alphabets (compared to length of string), other
    optimisations spring to mind.

    I'll start with a quite unoptimized version:

    proc findLyndon {s} {
    set bsf $s; set csf 1; # best-so-far, count-so-far
    set len [string length $s]; # length of string
    append s $s; # duplicate string - makes rotations easier
    for {set i 1} {$i < $len} {incr i} {
    # each rotation is a substring of the duplicated string:
    set c [string range $s $i [expr {$i+$len-1}]]
    set cmp [string compare $c $bsf]
    if {$cmp < 0} { set bsf $c; set csf 1 } elseif {$cmp == 0} { incr csf }
    }
    # if result is unique, return best-so-far, else the empty string:
    if {$csf == 1} { return $bsf } else { return "" }
    }

    Thanks for your reply and for taking your time with the code.

    Strings will be smaller that 30000 characters, they are DNA so the alphabet is small.
    I need only ASCII as this covers the IUPAC extended DNA code (16 symbols):

    Symbol Mnemonic Translation
    A A (adenine)
    C C (cytosine)
    G G (guanine)
    T T (thymine)
    U U (uracil)
    R puRine A or G (purines)
    Y pYrimidine C or T/U (pyrimidines)
    M aMino group A or C
    K Keto group G or T/U
    S Strong interaction C or G
    W Weak interaction A or T/U
    H not G A, C or T/U
    B not A C, G or T/U
    V not T/U A, C or G
    D not C A, G or T/U
    N aNy A, C, G or T/U

    Ill try it out and get back here with the result.


    I had a stab at Booth's algorithm (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation).
    I have no idea how it works...

    I replaced the values for i and f[] with i+1 and f+1, since it seemed to simplify the code a bit.

    Give it a try if Andreas' code is too slow.

    proc lr {s} {
    append S $s $s
    set f [lrepeat [string length $S] 0]
    set k 0
    for {set j 1} {$j < [string length $S]} {incr j} {
    set sj [string index $S $j]
    set i [lindex $f [expr {$j-$k-1}]]
    while {$i && $sj ne [string index $S $k+$i]} {
    if {$sj < [string index $S $k+$i]} {
    set k [expr {$j-$i}]
    }
    set i [lindex $f $i-1]
    }
    if {$sj ne [string index $S $k+$i]} {
    if {$sj < [string index $S $k]} {
    set k $j
    }
    lset f [expr {$j-$k}] 0
    } else {
    lset f [expr {$j-$k}] [expr {$i+1}]
    }
    }
    string range $S $k [expr {$k+[string length $s]-1}]
    }

    -Ian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)